Sr. Staff Production Engineer

Zscaler
Cybersecurity
California, USA
Full-Time
Staff

Salary
140,000 - 200,000 USD per year

Job Details

Experience
8+ years
Required Skills
AWS, Python, GCP, Azure, Go, Grafana, Prometheus, Linux, Terraform, Ansible

Requirements

  • 8+ years of experience managing reliability, scalability, and availability for large-scale production services
  • Deep expertise in programming (e.g., Python, Go, or C/C++)
  • Strong background in networking protocols
  • Strong background in Linux/FreeBSD systems
  • Strong background in distributed architecture
  • Experience in high-stakes incident management
  • Willingness to participate in a 24/7 on-call rotation
  • Proficiency in leveraging ITIL frameworks
  • Proficiency in using incident data to drive service maturity through systematic problem management and technical operability reviews
  • Extensive experience with public cloud (AWS, Azure, GCP) (Preferred)
  • Experience with Infrastructure-as-Code (Ansible, Terraform) (Preferred)
  • Experience with chaos engineering and disaster recovery planning at scale (Preferred)
  • Expertise in global routing (BGP) (Preferred)
  • Expertise in traffic tunneling (GRE, IPsec) (Preferred)
  • Deep understanding of L7 proxy architectures (HAProxy) (Preferred)
  • Deep understanding of DNS at scale (Preferred)
  • Deep understanding of OS networking stack internals (Preferred)

Responsibilities

  • Provide technical vision and hands-on execution to drive an "automation-first" culture
  • Mature observability and architectural standards to reduce Mean Time to Mitigate (MTTM)
  • Shape the scalability of globally distributed, multi-cloud infrastructure
  • Design and implement highly available, scalable infrastructure across AWS, Azure, GCP, and bare-metal environments
  • Write code (Python/Go) to eliminate manual toil and build self-healing systems
  • Implement and maintain sophisticated observability (Prometheus, Grafana, OpenTelemetry), define SLIs/SLOs, and establish error budgets
  • Act as a lead Incident Commander (TDO on-call), develop response playbooks, and conduct deep-dive post-incident analyses
  • Collaborate with Engineering and partner teams to conduct operability reviews