Staff Site Reliability Engineer

Posted 12 days ago

Seniority level: Staff, 7+ years

Location: Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay, UTC -5 to UTC +3

Salary: 129,347 - 200,824 USD per year

Industry: Software Development

Company: Wikimedia Foundation | 251-500 employees | $2,100,000 Grant about 5 years ago

Languages: English

Experience: 7+ years

Skills: AWS, Docker, Python, Cloud Computing, Elasticsearch, Kubernetes, Machine Learning, MLflow, NumPy, PyTorch, Grafana, Prometheus, TensorFlow, Linux, DevOps, Terraform, Ansible

Requirements:
  • 7+ years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles, with substantial exposure to production-grade machine learning systems.
  • Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker, GPU acceleration, distributed training systems).
  • Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible, Helm, Argo CD).
  • Experience implementing observability, monitoring, and logging for ML systems (e.g., Prometheus, Grafana, ELK stack).
  • Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow, scikit-learn).
Responsibilities:
  • Designing and implementing robust ML infrastructure used for training, deployment, monitoring, and scaling of machine learning models.
  • Improving reliability, availability, and scalability of ML infrastructure, ensuring smooth and efficient workflows for internal ML engineers and researchers.
  • Collaborating closely with ML engineers, product teams, researchers, SREs, and the Wikimedia volunteer community to identify infrastructure requirements, resolve operational issues, and streamline the ML lifecycle.
  • Proactively monitoring and optimizing system performance, capacity, and security to maintain high service quality.
  • Providing expert guidance and documentation so that teams across Wikimedia can use the ML infrastructure effectively and follow best practices.
  • Mentoring team members and sharing knowledge on infrastructure management, operational excellence, and reliability engineering.

Related Jobs

Location: United States

Employment type: Full-Time

Salary: 200,000 - 250,000 USD per year

Industry: Health Data Exchange

Company: Datavant

  • Expertise in managing Kubernetes (EKS), CI/CD tools (e.g., ArgoCD, GitHub Actions), and observability platforms (e.g., Datadog).
  • Proficiency in automating platform deployment and maintenance tasks (e.g., cluster upgrades, CI/CD workflows).
  • Familiarity with integrating tools like Terraform, Elasticsearch, Kafka, Cassandra, and Databricks into the broader platform.
  • Knowledge of scaling, failover, and platform reliability best practices.
  • Ability to work with Embedded Teams to meet workload-specific needs.
  • Increase our cloud efficiency
  • Ship
  • Lead

Skills: Docker, Python, SQL, AWS EKS, Cloud Computing, Elasticsearch, Jenkins, Kafka, Kubernetes, Cassandra, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Ansible, Scripting

Posted about 1 month ago

Location: United States, Canada

Employment type: Full-Time

Salary: 100,000 - 120,000 USD per year

Industry: Software Development

Company: Assured | Cloud Data Services, B2B, Cloud Security, Cyber Security

  • 5+ years experience with AWS and Kubernetes
  • Experience with Terraform
  • Experience designing scalable database solutions, ideally PostgreSQL
  • Strong engineering background
  • Provision infrastructure and tooling for the Assured platform
  • Create automated tooling for configuration and maintenance
  • Build methods for monitoring and scaling platforms
  • Lead and mentor engineers

Skills: AWS, PostgreSQL, Kubernetes, Terraform

Posted about 1 month ago

Location: United States

Employment type: Full-Time

Salary: 183,000 - 304,000 USD per year

Industry: AI and Quantum Technology

Company: SandboxAQ | 101-250 employees | $25,000,000 Grant 5 months ago | Artificial Intelligence (AI), SaaS, Information Technology, Cyber Security

  • 10+ years in Site Reliability Engineering or similar roles
  • Strong experience with cloud platforms (AWS, GCP, Azure)
  • Experience with containerization (Docker, Kubernetes)
  • Proficiency in scripting languages (Python, Go, Bash)
  • Experience with microservices architectures and CI/CD pipelines
  • Strong knowledge of monitoring tools (Prometheus, Grafana, etc.)
  • Lead efforts in incident response and root cause analysis
  • Analyze system performance and create capacity plans
  • Design and maintain monitoring and alerting solutions
  • Collaborate with engineering teams on system design
  • Identify opportunities for infrastructure cost optimization
  • Build and improve automation tools and deployment pipelines
  • Mentor junior and mid-level engineers
  • Participate in on-call rotation for system outages

Skills: AWS, Docker, PostgreSQL, Python, Bash, GCP, Kafka, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, CI/CD, Terraform, Microservices, Ansible

Posted about 2 months ago

Location: United States, Canada, UK, Hong Kong

Employment type: Full-Time

Industry: Blockchain/Decentralized Finance

Company: Scroll.io | 51-100 employees | $50,000,000 about 2 years ago | Cryptocurrency, Blockchain, Information Technology

  • 5+ years of experience as a DevOps, Infrastructure, Site Reliability or Cloud Engineer
  • 3+ years of experience as a Backend Developer
  • Familiarity with hybrid cloud environments (AWS, Azure, GCP)
  • Proficiency in modern programming languages (Go, Rust, Python)
  • Linux administration experience
  • Experience with configuration management tools (Terraform, Ansible)
  • Experience with containers in production systems
  • Design, build, and maintain internal developer tools
  • Design, provision, and maintain cloud environments
  • Implement observability solutions for performance insights
  • Operate and maintain GPU-based zk provers

Skills: AWS, Backend Development, Docker, Python, Blockchain, GCP, Kubernetes, Go, Rust, Linux, DevOps, Terraform, Microservices, Ansible

Posted about 2 months ago

Location: Europe

Employment type: Full-Time

Industry: Software Development

Company: xLabs | 1-10 employees | Education, Digital Marketing, E-Learning, Training

  • Experience with Infrastructure-as-Code and GitOps.
  • Experience with workload orchestration solutions, such as (but not limited to) HashiCorp Nomad or Kubernetes.
  • Can reason about software design and distributed systems.
  • Proficient in at least one programming language.
  • Experience running blockchains, particularly validators and remote signers/multisig. (nice to have)
  • Manage compute infrastructure and the applications that depend on it.
  • Document what we learn, including what went wrong and how to prevent future issues.

Skills: Software Development, Blockchain, Kubernetes

Posted about 2 months ago

Location: United States, Canada

Employment type: Full-Time

Salary: 185,000 - 250,000 USD per year

Industry: Software Development

Company: Kentik | 101-250 employees | $40,000,000 Series C over 3 years ago | Cloud Data Services, Information Technology, Network Security, Software

  • 5+ years of relevant experience
  • Experience running and scaling Postgres or Kafka
  • Proficiency in Go or Node.js
  • Experience with internal monitoring and alerts
  • Clear communication skills
  • Own, scale, and maintain core infrastructure
  • Streamline and simplify the tech stack
  • Define and refine platform offerings
  • Participate in on-call rotation
  • Collaborate with hardware and network teams

Skills: Node.js, PostgreSQL, Kafka, Go, Grafana, gRPC, Prometheus, Redis, Linux

Posted about 2 months ago

Location: United States, Canada

Employment type: Full-Time

Salary: 145,000 - 175,000 USD per year

Industry: Information Technology

  • 8+ years experience in Information Technology
  • 5+ years in desktop systems engineering and administration
  • Proficient in Ansible, Bash, Perl, Python, and PowerShell
  • Experience with Intune, AutoPilot, SCCM, Active Directory
  • Familiarity with AWS WorkSpaces and Azure AD
  • Vet vendor solutions
  • Design and execute initiatives
  • Research and document new technologies
  • Provide training on new products and processes

Skills: AWS, Python, Android, Bash, Ansible

Posted 2 months ago

Location: Canada

Employment type: Full-Time

Industry: Observability and data management

Company: Cribl | 251-500 employees | $150,000,000 Series D almost 3 years ago | Real Time, Big Data, Information Technology, Software

  • Extensive experience with enterprise-scale continuous delivery environments.
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
  • Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible.
  • Knowledge of cloud platforms (AWS and Azure preferred; GCP is nice to have) and container and orchestration technologies.
  • Extensive experience designing and implementing observability platforms based on open-source tools like Grafana, Prometheus, and OpenSearch.
  • Experience mentoring engineers and acting as Subject Matter Expert in areas of Monitoring and Observability.
  • Experience with native monitoring services in AWS, Azure and other popular Cloud Platforms.
  • Background in Linux Systems Engineering.
  • Experience with Incident response tools, e.g., PagerDuty, FireHydrant.
  • Experience with sustainable incident response in a blameless environment.
  • Comfortable with a high level of autonomy and working with a distributed team.
  • Engage with teams and improve service delivery and reliability across their entire lifecycle.
  • Measure and monitor all production systems with an eye towards availability, latency, and overall system health.
  • Design observability systems for different types of applications, using Cribl products and other open-source tools.
  • Seek out the cause of errors and instability in production cloud services and drive teams towards better operational excellence.
  • Engage with product and platform teams to evolve systems by lobbying for changes that improve reliability, resilience, and observability.
  • Lead efforts enabling shift-left monitoring in the organization.
  • Help identify and drive down toil with creative innovation and automation.
  • On-call responsibilities.

Skills: AWS, Docker, Node.js, GCP, JavaScript, TypeScript, Azure, Grafana, Prometheus, Linux, Terraform

Posted 3 months ago

Location: United States, Canada

Employment type: Full-Time

Industry: Software Development

Company: Algolia | 501-1000 employees | $150,000,000 Series D over 3 years ago | Semantic Search, Search Engine, Cloud Computing, Vertical Search

  • Strong knowledge of Golang and Python
  • Experience designing API Management and Kubernetes architecture
  • Experience with distributed systems
  • Experience with CI/CD setup and architecture
  • Knowledge of Public Cloud Providers (GCP, AWS, Azure)
  • Excellent communication and organization skills
  • Design and deploy a cloud-native API Management
  • Spearhead the design of a robust CI/CD toolchain
  • Lead development of observability standards
  • Drive the evolution of a Kubernetes-based architecture
  • Provide guidance and mentorship to SRE team members
  • Establish and enforce engineering processes
  • Collaborate with senior leadership on cloud infrastructure

Skills: AWS, Python, GCP, Kubernetes, Microsoft Azure, CI/CD

Posted 5 months ago