Principal Production Engineer
Zscaler, Cybersecurity
Remote - California, USA; San Jose, California, USA
Full-Time
Principal
Salary: 164,500 - 235,000 USD per year
Job Details
- Experience: 10+ years
- Required Skills: AWS, Python, GCP, Kubernetes, Go, Grafana, Prometheus, Linux, Terraform
Requirements
- 10+ years of experience managing reliability, scalability, and availability for large-scale production services
- Deep expertise in programming (e.g., Python, Go, or C/C++)
- Strong background in networking protocols, Linux/RHEL systems, and distributed architecture
- Experience in high-stakes incident management and participation in a 24/7 on-call rotation
- Proficiency in leveraging ITIL frameworks and incident data
- Extensive experience with public cloud (AWS, Azure, GCP) and Infrastructure-as-Code and automation tooling (Ansible, Terraform, Helm, Temporal)
- Expertise in global routing (BGP), traffic tunneling (GRE, IPsec), L7 proxying (HAProxy), and DNS at scale
Responsibilities
- Design and implement highly available, scalable infrastructure across AWS, GCP, and bare-metal environments
- Drive an automation-first culture by writing code to eliminate manual toil and build self-healing systems
- Implement and maintain sophisticated observability (Prometheus, Grafana, OpenTelemetry)
- Define SLIs/SLOs and establish error budgets
- Act as a lead Incident Commander (TDO on-call), develop response playbooks, and conduct post-incident analyses
- Work with Engineering and partner teams to conduct operability reviews