
Staff Site Reliability Engineer

Posted about 1 month ago


💎 Seniority level: Staff

📍 Location: United States

💸 Salary: 200,000 - 250,000 USD per year

🔍 Industry: Software Development

🏢 Company: Datavant

🗣️ Languages: English

🪄 Skills: Docker, Python, SQL, AWS EKS, Cloud Computing, ElasticSearch, Jenkins, Kafka, Kubernetes, Cassandra, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Ansible, Scripting

Requirements:
  • Expertise in managing Kubernetes (EKS), CI/CD tools (e.g., ArgoCD, GitHub Actions), and observability platforms (e.g., Datadog).
  • Proficiency in automating platform deployment and maintenance tasks (e.g., cluster upgrades, CI/CD workflows).
  • Familiarity with integrating tools like Terraform, Elasticsearch, Kafka, Cassandra, and Databricks into the broader platform.
  • Knowledge of scaling, failover, and platform reliability best practices.
  • Ability to work with Embedded Teams to meet workload-specific needs.
Responsibilities:
  • Increase our cloud efficiency
  • Deliver on the Cloud Engineering - Service’s charter
  • Actively collaborate with your peers, keep your pod focused and engaged, and contribute to engineering-wide decisions on technical, product, and organizational strategy
  • Analyze and improve the efficiency, scalability, and reliability of our backend systems
  • Build and mature automation tools for robust continuous integration and deployment pipelines
  • Build scalable, secure, and measurable infrastructure with code
  • Facilitate capacity planning
  • Champion code health, rigorous testing, and maintainability standards
  • Create automation of engineering deployments
  • Create scalable and reliable monitoring and alerting that works
  • Create actionable documentation and playbooks, and when possible automation, to resolve recurring issues and proactively address issues before impact is felt
  • Design, build, and maintain tools, systems, and self-service options to elevate engineering team productivity and reduce toil
  • Maintain a stable, scalable, and secure development environment while keeping abreast of the latest DevOps innovations
  • Support disaster recovery design, implementation, and testing
  • Support engineering teams in implementing system reliability
  • When things go bad, perform advanced troubleshooting of our systems

Related Jobs


📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 129,347 - 200,824 USD per year

🔍 Software Development

🏢 Company: Wikimedia Foundation

👥 251-500

💰 $2,100,000 Grant over 5 years ago

Requirements:
  • 7+ years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles, with substantial exposure to production-grade machine learning systems.
  • Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker, GPU acceleration, distributed training systems).
  • Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible, Helm, Argo CD).
  • Experience implementing observability, monitoring, and logging for ML systems (e.g., Prometheus, Grafana, ELK stack).
  • Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow, scikit-learn).
Responsibilities:
  • Designing and implementing robust ML infrastructure used for training, deployment, monitoring, and scaling of machine learning models.
  • Improving reliability, availability, and scalability of ML infrastructure, ensuring smooth and efficient workflows for internal ML engineers and researchers.
  • Collaborating closely with ML engineers, product teams, researchers, SREs, and the Wikimedia volunteer community to identify infrastructure requirements, resolve operational issues, and streamline the ML lifecycle.
  • Proactively monitoring and optimizing system performance, capacity, and security to maintain high service quality.
  • Providing expert guidance and documentation to teams across Wikimedia to effectively utilize the ML infrastructure and best practices.
  • Mentoring team members and sharing knowledge on infrastructure management, operational excellence, and reliability engineering.

🪄 Skills: AWS, Docker, Python, Cloud Computing, ElasticSearch, Kubernetes, Machine Learning, MLFlow, Numpy, PyTorch, Grafana, Prometheus, Tensorflow, Linux, DevOps, Terraform, Ansible

Posted 15 days ago
🔥 Staff Site Reliability Engineer
Posted about 2 months ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: Assured

Cloud Data Services, B2B, Cloud Security, Cyber Security

Requirements:
  • 5+ years of experience with AWS and Kubernetes
  • Experience with Terraform
  • Experience designing scalable database solutions, ideally PostgreSQL
  • Strong engineering background
Responsibilities:
  • Provision infrastructure and tooling for the Assured platform
  • Create automated tooling for configuration and maintenance
  • Build methods for monitoring and scaling platforms
  • Lead and mentor engineers

🪄 Skills: AWS, PostgreSQL, Kubernetes, Terraform

Posted about 2 months ago

📍 United States

🧭 Full-Time

💸 183,000 - 304,000 USD per year

🔍 AI and Quantum Technology

🏢 Company: SandboxAQ

👥 101-250

💰 $25,000,000 Grant 5 months ago

Artificial Intelligence (AI), SaaS, Information Technology, Cyber Security

Requirements:
  • 10+ years in Site Reliability Engineering or similar roles
  • Strong experience with cloud platforms (AWS, GCP, Azure)
  • Experience with containerization (Docker, Kubernetes)
  • Proficiency in scripting languages (Python, Go, Bash)
  • Experience with microservices architectures and CI/CD pipelines
  • Strong knowledge of monitoring tools (Prometheus, Grafana, etc.)
Responsibilities:
  • Lead efforts in incident response and root cause analysis
  • Analyze system performance and create capacity plans
  • Design and maintain monitoring and alerting solutions
  • Collaborate with engineering teams on system design
  • Identify opportunities for infrastructure cost optimization
  • Build and improve automation tools and deployment pipelines
  • Mentor junior and mid-level engineers
  • Participate in on-call rotation for system outages

🪄 Skills: AWS, Docker, PostgreSQL, Python, Bash, GCP, Kafka, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, CI/CD, Terraform, Microservices, Ansible

Posted about 2 months ago

📍 United States, Canada, UK, Hong Kong

🧭 Full-Time

🔍 Blockchain/Decentralized Finance

🏢 Company: Scroll.io

👥 51-100

💰 $50,000,000 about 2 years ago

Cryptocurrency, Blockchain, Information Technology

Requirements:
  • 5+ years of experience as a DevOps, Infrastructure, Site Reliability, or Cloud Engineer
  • 3+ years of experience as a Backend Developer
  • Familiarity with hybrid cloud environments (AWS, Azure, GCP)
  • Proficiency in modern programming languages (Go, Rust, Python)
  • Linux administration experience
  • Experience with configuration management tools (Terraform, Ansible)
  • Experience with containers in production systems
Responsibilities:
  • Design, build, and maintain internal developer tools
  • Design, provision, and maintain cloud environments
  • Implement observability solutions for performance insights
  • Operate and maintain GPU-based zk provers

🪄 Skills: AWS, Backend Development, Docker, Python, Blockchain, GCP, Kubernetes, Go, Rust, Linux, DevOps, Terraform, Microservices, Ansible

Posted about 2 months ago

📍 United States, Canada

🧭 Full-Time

💸 185,000 - 250,000 USD per year

🔍 Software Development

🏢 Company: Kentik

👥 101-250

💰 $40,000,000 Series C over 3 years ago

Cloud Data Services, Information Technology, Network Security, Software

Requirements:
  • 5+ years of relevant experience
  • Experience running and scaling Postgres or Kafka
  • Proficiency in Go or Node.js
  • Experience with internal monitoring and alerts
  • Clear communication skills
Responsibilities:
  • Own, scale, and maintain core infrastructure
  • Streamline and simplify the tech stack
  • Define and refine platform offerings
  • Participate in on-call rotation
  • Collaborate with hardware and network teams

🪄 Skills: Node.js, PostgreSQL, Kafka, Go, Grafana, gRPC, Prometheus, Redis, Linux

Posted about 2 months ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Algolia

👥 501-1000

💰 $150,000,000 Series D over 3 years ago

Semantic Search, Search Engine, Cloud Computing, Vertical Search

Requirements:
  • Strong knowledge of Golang and Python
  • Experience designing API Management and Kubernetes architecture
  • Experience with distributed systems
  • Experience with CI/CD setup and architecture
  • Knowledge of Public Cloud Providers (GCP, AWS, Azure)
  • Excellent communication and organization skills
Responsibilities:
  • Design and deploy a cloud-native API Management
  • Spearhead the design of a robust CI/CD toolchain
  • Lead development of observability standards
  • Drive the evolution of a Kubernetes-based architecture
  • Provide guidance and mentorship to SRE team members
  • Establish and enforce engineering processes
  • Collaborate with senior leadership on cloud infrastructure

🪄 Skills: AWS, Python, GCP, Kubernetes, Microsoft Azure, CI/CD

Posted 5 months ago