Staff Site Reliability Engineer

Posted 12 days ago

💎 Seniority level: Staff, 7+ years

📍 Location: Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay, UTC -5 to UTC +3

💸 Salary: 129,347 - 200,824 USD per year

🔍 Industry: Software Development

🏢 Company: Wikimedia Foundation 👥 251-500 💰 $2,100,000 Grant about 5 years ago

🗣️ Languages: English

⏳ Experience: 7+ years

🪄 Skills: AWS, Docker, Python, Cloud Computing, ElasticSearch, Kubernetes, Machine Learning, MLFlow, Numpy, PyTorch, Grafana, Prometheus, Tensorflow, Linux, DevOps, Terraform, Ansible

Requirements:

7+ years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles, with substantial exposure to production-grade machine learning systems.
Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker, GPU acceleration, distributed training systems).
Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible, Helm, Argo CD).
Experience implementing observability, monitoring, and logging for ML systems (e.g., Prometheus, Grafana, ELK stack).
Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow, scikit-learn).

Responsibilities:

Designing and implementing robust ML infrastructure used for training, deployment, monitoring, and scaling of machine learning models.
Improving reliability, availability, and scalability of ML infrastructure, ensuring smooth and efficient workflows for internal ML engineers and researchers.
Collaborating closely with ML engineers, product teams, researchers, SREs, and the Wikimedia volunteer community to identify infrastructure requirements, resolve operational issues, and streamline the ML lifecycle.
Proactively monitoring and optimizing system performance, capacity, and security to maintain high service quality.
Providing expert guidance and documentation to teams across Wikimedia to effectively utilize the ML infrastructure and best practices.
Mentoring team members and sharing knowledge on infrastructure management, operational excellence, and reliability engineering.

Related Jobs

🔥 Staff Site Reliability Engineer

Posted about 1 month ago

📍 United States

🧭 Full-Time

💸 200,000 - 250,000 USD per year

🔍 Health Data Exchange

🏢 Company: Datavant

🔧 Requirements

Expertise in managing Kubernetes (EKS), CI/CD tools (e.g., ArgoCD, GitHub Actions), and observability platforms (e.g., Datadog).
Proficiency in automating platform deployment and maintenance tasks (e.g., cluster upgrades, CI/CD workflows).
Familiarity with integrating tools like Terraform, Elasticsearch, Kafka, Cassandra, and Databricks into the broader platform.
Knowledge of scaling, failover, and platform reliability best practices.
Ability to work with Embedded Teams to meet workload-specific needs.

💡 Responsibilities

Increase our cloud efficiency
Ship
Lead

Docker, Python, SQL, AWS EKS, Cloud Computing, ElasticSearch, Jenkins, Kafka, Kubernetes, Cassandra, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Ansible, Scripting

🔥 Staff Site Reliability Engineer

Posted about 1 month ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: Assured | Cloud Data Services, B2B, Cloud Security, Cyber Security

🔧 Requirements

5+ years experience with AWS and Kubernetes
Experience with Terraform
Experience designing scalable database solutions, ideally PostgreSQL
Strong engineering background

💡 Responsibilities

Provision infrastructure and tooling for the Assured platform
Create automated tooling for configuration and maintenance
Build methods for monitoring and scaling platforms
Lead and mentor engineers

AWS, PostgreSQL, Kubernetes, Terraform

🔥 Staff/Senior Staff Site Reliability Engineer

Posted about 2 months ago

📍 United States

🧭 Full-Time

💸 183,000 - 304,000 USD per year

🔍 AI and Quantum Technology

🏢 Company: SandboxAQ 👥 101-250 💰 $25,000,000 Grant 5 months ago | Artificial Intelligence (AI), SaaS, Information Technology, Cyber Security

🔧 Requirements

10+ years in Site Reliability Engineering or similar roles
Strong experience with cloud platforms (AWS, GCP, Azure)
Experience with containerization (Docker, Kubernetes)
Proficiency in scripting languages (Python, Go, Bash)
Experience with microservices architectures and CI/CD pipelines
Strong knowledge of monitoring tools (Prometheus, Grafana, etc.)

💡 Responsibilities

Lead efforts in incident response and root cause analysis
Analyze system performance and create capacity plans
Design and maintain monitoring and alerting solutions
Collaborate with engineering teams on system design
Identify opportunities for infrastructure cost optimization
Build and improve automation tools and deployment pipelines
Mentor junior and mid-level engineers
Participate in on-call rotation for system outages

AWS, Docker, PostgreSQL, Python, Bash, GCP, Kafka, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, CI/CD, Terraform, Microservices, Ansible

🔥 Senior / Staff Site Reliability Engineer

Posted about 2 months ago

📍 United States, Canada, UK, Hong Kong

🧭 Full-Time

🔍 Blockchain/Decentralized Finance

🏢 Company: Scroll.io 👥 51-100 💰 $50,000,000 about 2 years ago | Cryptocurrency, Blockchain, Information Technology

🔧 Requirements

5+ years of experience as a DevOps, Infrastructure, Site Reliability or Cloud Engineer
3+ years of experience as a Backend Developer
Familiarity with hybrid cloud environments (AWS, Azure, GCP)
Proficiency in modern programming languages (Go, Rust, Python)
Linux administration experience
Experience with configuration management tools (Terraform, Ansible)
Experience with containers in production systems

💡 Responsibilities

Design, build, and maintain internal developer tools
Design, provision, and maintain cloud environments
Implement observability solutions for performance insights
Operate and maintain GPU-based zk provers

AWS, Backend Development, Docker, Python, Blockchain, GCP, Kubernetes, Go, Rust, Linux, DevOps, Terraform, Microservices, Ansible

🔥 Senior / Staff Site Reliability Engineer

Posted about 2 months ago

📍 Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: xLabs 👥 1-10 | Education, Digital Marketing, E-Learning, Training

🔧 Requirements

Experience with Infrastructure-as-Code and GitOps.
Experience with workload orchestration solutions, such as (but not limited to) HashiCorp Nomad or Kubernetes.
Can reason about software design and distributed systems.
Proficient in at least one programming language.
Experience running blockchains, particularly validators and remote signers/multisig. (nice to have)

💡 Responsibilities

Manage compute infrastructure and the applications that depend on it.
Document what we learn, including what went wrong and how to prevent future issues.

Software Development, Blockchain, Kubernetes

🔥 Staff Site Reliability Engineer, Infrastructure

Posted about 2 months ago

📍 United States, Canada

🧭 Full-Time

💸 185,000 - 250,000 USD per year

🔍 Software Development

🏢 Company: Kentik 👥 101-250 💰 $40,000,000 Series C over 3 years ago | Cloud Data Services, Information Technology, Network Security, Software

🔧 Requirements

5+ years of relevant experience
Experience running and scaling Postgres or Kafka
Proficiency in Go or Node.js
Experience with internal monitoring and alerts
Clear communication skills

💡 Responsibilities

Own, scale, and maintain core infrastructure
Streamline and simplify the tech stack
Define and refine platform offerings
Participate in on-call rotation
Collaborate with hardware and network teams

Node.js, PostgreSQL, Kafka, Go, Grafana, gRPC, Prometheus, Redis, Linux

🔥 Staff Site Reliability Engineer

Posted 2 months ago

📍 United States, Canada

🧭 Full-Time

💸 145,000 - 175,000 USD per year

🔍 Information Technology

🔧 Requirements

8+ years experience in Information Technology
5+ years in desktop systems engineering and administration
Proficient in Ansible, Bash, Perl, Python, and PowerShell
Experience with Intune, AutoPilot, SCCM, Active Directory
Familiarity with AWS Workspace and AzureAD

💡 Responsibilities

Vet vendor solutions
Design and execute initiatives
Research and document new technologies
Provide training on new products and processes

AWS, Python, Android, Bash, Ansible

🔥 Sr Staff Site Reliability Engineer (SRE), Cloud

Posted 3 months ago

📍 Canada

🧭 Full-Time

🔍 Observability and data management

🏢 Company: Cribl 👥 251-500 💰 $150,000,000 Series D almost 3 years ago | Real Time, Big Data, Information Technology, Software

🔧 Requirements

Extensive experience with enterprise-scale continuous delivery environments.
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible.
Knowledge of cloud platforms (AWS and Azure preferred; GCP is nice to have) and container and orchestration technologies.
Extensive experience designing and implementing observability platforms based on open-source tools like Grafana, Prometheus, and OpenSearch.
Experience mentoring engineers and acting as Subject Matter Expert in areas of Monitoring and Observability.
Experience with native monitoring services in AWS, Azure and other popular Cloud Platforms.
Background in Linux Systems Engineering.
Experience with Incident response tools, e.g., PagerDuty, FireHydrant.
Experience with sustainable incident response in a blameless environment.
Comfortable with a high level of autonomy and working with a distributed team.

💡 Responsibilities

Engage with teams and improve service delivery and reliability across their entire lifecycle.
Measure and monitor all production systems with an eye towards availability, latency, and overall system health.
Design observability systems for different types of applications, using Cribl products and other open-source tools.
Seek out the cause of errors and instability in production cloud services and drive teams towards better operational excellence.
Engage with product and platform teams to evolve systems by lobbying for changes that improve reliability, resilience, and observability.
Lead efforts enabling shift-left monitoring in the organization.
Help identify and drive down toil with creative innovation and automation.
On-call responsibilities.

AWS, Docker, Node.js, GCP, Javascript, TypeScript, Azure, Grafana, Prometheus, Linux, Terraform

🔥 Staff Site Reliability Engineer, PaaS

Posted 5 months ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Algolia 👥 501-1000 💰 $150,000,000 Series D over 3 years ago | Semantic Search, Search Engine, Cloud Computing, Vertical Search

🔧 Requirements

Strong knowledge of Golang and Python
Experience designing API Management and Kubernetes architecture
Experience with distributed systems
Experience with CI/CD setup and architecture
Knowledge of Public Cloud Providers (GCP, AWS, Azure)
Excellent communication and organization skills

💡 Responsibilities

Design and deploy a cloud-native API Management
Spearhead the design of a robust CI/CD toolchain
Lead development of observability standards
Drive the evolution of a Kubernetes-based architecture
Provide guidance and mentorship to SRE team members
Establish and enforce engineering processes
Collaborate with senior leadership on cloud infrastructure

AWS, Python, GCP, Kubernetes, Microsoft Azure, CI/CD
