Senior Site Reliability Engineer
Wikimedia Foundation
Technology/Non-profit
Please note that we are currently able to hire in the following:
US States: Arizona, California, Colorado, Connecticut, District of Columbia*, Florida, Georgia, Idaho, Illinois, Indiana, Iowa, Maryland, Massachusetts, Michigan, Minnesota, Missouri, New Jersey, New Mexico, New York, North Carolina, Ohio, Oklahoma, Oregon, Pennsylvania, Puerto Rico*, Rhode Island, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin and Wyoming (*US Territory or Federal District)
Countries: Brazil, Canada, Colombia, Germany, Ghana, India, Indonesia, Italy, Kenya*, Mexico, Morocco, Netherlands, Poland, Singapore*, South Africa, Spain, Switzerland and the United Kingdom.
Full-Time
Senior
Salary: 116,633 - 181,243 USD per year
Job Details
- Required Skills: AWS, Python, Kubernetes, Go, Prometheus, Terraform, Ansible, GitLab
Requirements
- Experience with Infrastructure as Code (Terraform, Ansible)
- Proficiency in at least one programming language (e.g., Python, Go)
- Experience operating and optimizing cloud-based systems (AWS, Azure, or GCP)
- Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab, ArgoCD)
- Experience with incident response, on-call practices, and postmortems
- Strong understanding of SRE best practices (SLOs, SLIs, error budgets)
- Experience with observability tools (e.g., Prometheus, OpenTelemetry)
- Ability to work effectively in a distributed, cross-functional environment
- Strong documentation and communication skills
Responsibilities
- Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets
- Build and enhance observability systems (metrics, logs, and distributed tracing)
- Drive reliability engineering practices, including capacity planning, load testing, and resilience validation
- Improve developer experience (DevEx) by enabling self-service infrastructure
- Design, implement, and optimize CI/CD and GitOps workflows
- Implement secure-by-default infrastructure and enforce best practices
- Continuously optimize infrastructure cost and efficiency using FinOps principles
- Establish and track operational metrics such as MTTR, MTTD, and incident frequency
- Reduce operational toil through automation-first solutions
- Collaborate with a global team and mentor peers