Sr. Staff Production Engineer
Zscaler | Cybersecurity
California, USA | Full-Time | Staff
Salary: 140,000 - 200,000 USD per year
Job Details
- Experience: 8+ years
- Required Skills: AWS, Python, GCP, Azure, Go, Grafana, Prometheus, Linux, Terraform, Ansible
Requirements
- 8+ years of experience managing reliability, scalability, and availability for large-scale production services
- Deep expertise in programming (e.g., Python, Go, or C/C++)
- Strong background in networking protocols
- Strong background in Linux/FreeBSD systems
- Strong background in distributed architecture
- Experience in high-stakes incident management
- Participation in a 24/7 on-call rotation
- Proficiency in leveraging ITIL frameworks
- Proficiency in using incident data to drive service maturity through systematic problem management
- Proficiency in using incident data to drive service maturity through technical operability reviews
- Extensive experience with public cloud (AWS, Azure, GCP) (Preferred)
- Experience with Infrastructure-as-Code (Ansible, Terraform) (Preferred)
- Experience with chaos engineering and disaster recovery planning at scale (Preferred)
- Expertise in global routing (BGP) (Preferred)
- Expertise in traffic tunneling (GRE, IPSec) (Preferred)
- Deep understanding of L7 proxy architectures (HAProxy) (Preferred)
- Deep understanding of DNS at scale (Preferred)
- Deep understanding of OS networking stack internals (Preferred)
Responsibilities
- Provide technical vision and hands-on execution to drive an "automation-first" culture
- Mature observability and architectural standards to reduce Mean Time to Mitigate (MTTM)
- Shape the scalability of globally distributed, multi-cloud infrastructure
- Design and implement highly available, scalable infrastructure across AWS, Azure, GCP, and bare-metal environments
- Write code (Python/Go) to eliminate manual toil and build self-healing systems
- Implement and maintain sophisticated observability (Prometheus, Grafana, OpenTelemetry), define SLIs/SLOs, and establish error budgets
- Act as a lead Incident Commander (TDO on-call), develop response playbooks, and conduct deep-dive post-incident analyses
- Work with Engineering and partner teams to conduct operability reviews