Senior Site Reliability Engineer

Wikimedia Foundation (Technology/Non-profit)

Please note that we are currently able to hire in the following locations:

US States: Arizona, California, Colorado, Connecticut, District of Columbia*, Florida, Georgia, Idaho, Illinois, Indiana, Iowa, Maryland, Massachusetts, Michigan, Minnesota, Missouri, New Jersey, New Mexico, New York, North Carolina, Ohio, Oklahoma, Oregon, Pennsylvania, Puerto Rico*, Rhode Island, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin and Wyoming (*US Territory or Federal District)

Countries: Brazil, Canada, Colombia, Germany, Ghana, India, Indonesia, Italy, Kenya*, Mexico, Morocco, Netherlands, Poland, Singapore*, South Africa, Spain, Switzerland and the United Kingdom.

Full-Time, Senior

Salary: 116,633 - 181,243 USD per year

Job Details

Required Skills: AWS, Python, Kubernetes, Go, Prometheus, Terraform, Ansible, GitLab

Requirements

Experience with Infrastructure as Code (Terraform, Ansible)
Proficiency in at least one programming language (e.g., Python, Go)
Experience operating and optimizing cloud-based systems (AWS, Azure, or GCP)
Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab, ArgoCD)
Experience with incident response, on-call practices, and postmortems
Strong understanding of SRE best practices (SLOs, SLIs, error budgets)
Experience with observability tools (e.g., Prometheus, OpenTelemetry)
Ability to work effectively in a distributed, cross-functional environment
Strong documentation and communication skills

Responsibilities

Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets
Build and enhance observability systems (metrics, logs, and distributed tracing)
Drive reliability engineering practices, including capacity planning, load testing, and resilience validation
Improve developer experience (DevEx) by enabling self-service infrastructure
Design, implement, and optimize CI/CD and GitOps workflows
Implement secure-by-default infrastructure and enforce best practices
Continuously optimize infrastructure cost and efficiency using FinOps principles
Establish and track operational metrics such as MTTR, MTTD, and incident frequency
Reduce operational toil through automation-first solutions
Collaborate with a global team and mentor peers
