Staff Site Reliability Engineer

Posted about 1 month ago

💎 Seniority level: Staff

📍 Location: United States

💸 Salary: 200,000 - 250,000 USD per year

🔍 Industry: Software Development

🏢 Company: Datavant

🗣️ Languages: English

🪄 Skills: Docker, Python, SQL, AWS EKS, Cloud Computing, Elasticsearch, Jenkins, Kafka, Kubernetes, Cassandra, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Ansible, Scripting

Requirements:

Expertise in managing Kubernetes (EKS), CI/CD tools (e.g., ArgoCD, GitHub Actions), and observability platforms (e.g., Datadog).
Proficiency in automating platform deployment and maintenance tasks (e.g., cluster upgrades, CI/CD workflows).
Familiarity with integrating tools like Terraform, Elasticsearch, Kafka, Cassandra, and Databricks into the broader platform.
Knowledge of scaling, failover, and platform reliability best practices.
Ability to work with Embedded Teams to meet workload-specific needs.

Responsibilities:

Increase our cloud efficiency
Deliver on the Cloud Engineering - Service’s charter
Actively collaborate with the team of your peers, keep your pod focused and engaged, contribute to engineering-wide decisions on technical strategy, product strategy, and organizational strategy
Analyze and improve the efficiency, scalability, and reliability of our backend systems
Build and mature automation tools for robust continuous integration and deployment pipelines
Build scalable, secure, and measurable infrastructure with code
Facilitate capacity planning
Champion code health, rigorous testing, and maintainability standards
Create automation of engineering deployments
Create scalable and reliable monitoring and alerting that works
Create actionable documentation and playbooks, and when possible automation, to resolve recurring issues and proactively address issues before impact is felt
Design, build, and upkeep tools, systems, and self-service options to elevate engineering team productivity and reduce toil
Maintain a stable, scalable, and secure development environment while keeping abreast of the latest DevOps innovations
Support disaster recovery design, implementation, and testing
Support engineering teams in implementing system reliability
When things go bad, perform advanced troubleshooting of our systems

Related Jobs

🔥 Staff Site Reliability Engineer

Posted 15 days ago

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 129,347 - 200,824 USD per year

🔍 Software Development

🏢 Company: Wikimedia Foundation

👥 251-500

💰 $2,100,000 Grant over 5 years ago

🔧 Requirements

7+ years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles, with substantial exposure to production-grade machine learning systems.
Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker, GPU acceleration, distributed training systems).
Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible, Helm, Argo CD).
Experience implementing observability, monitoring, and logging for ML systems (e.g., Prometheus, Grafana, ELK stack).
Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow, scikit-learn).

💡 Responsibilities

Designing and implementing robust ML infrastructure used for training, deployment, monitoring, and scaling of machine learning models.
Improving reliability, availability, and scalability of ML infrastructure, ensuring smooth and efficient workflows for internal ML engineers and researchers.
Collaborating closely with ML engineers, product teams, researchers, SREs, and the Wikimedia volunteer community to identify infrastructure requirements, resolve operational issues, and streamline the ML lifecycle.
Proactively monitoring and optimizing system performance, capacity, and security to maintain high service quality.
Providing expert guidance and documentation to teams across Wikimedia to effectively utilize the ML infrastructure and best practices.
Mentoring team members and sharing knowledge on infrastructure management, operational excellence, and reliability engineering.

🪄 Skills: AWS, Docker, Python, Cloud Computing, Elasticsearch, Kubernetes, Machine Learning, MLflow, NumPy, PyTorch, Grafana, Prometheus, TensorFlow, Linux, DevOps, Terraform, Ansible

🔥 Staff Site Reliability Engineer

Posted about 2 months ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: Assured

Cloud Data Services, B2B, Cloud Security, Cyber Security

🔧 Requirements

5+ years of experience with AWS and Kubernetes
Experience with Terraform
Experience designing scalable database solutions, ideally PostgreSQL
Strong engineering background

💡 Responsibilities

Provision infrastructure and tooling for the Assured platform
Create automated tooling for configuration and maintenance
Build methods for monitoring and scaling platforms
Lead and mentor engineers

🪄 Skills: AWS, PostgreSQL, Kubernetes, Terraform

🔥 Staff/Senior Staff Site Reliability Engineer

Posted about 2 months ago

📍 United States

🧭 Full-Time

💸 183,000 - 304,000 USD per year

🔍 AI and Quantum Technology

🏢 Company: SandboxAQ

👥 101-250

💰 $25,000,000 Grant 5 months ago

Artificial Intelligence (AI), SaaS, Information Technology, Cyber Security

🔧 Requirements

10+ years in Site Reliability Engineering or similar roles
Strong experience with cloud platforms (AWS, GCP, Azure)
Experience with containerization (Docker, Kubernetes)
Proficiency in scripting languages (Python, Go, Bash)
Experience with microservices architectures and CI/CD pipelines
Strong knowledge of monitoring tools (Prometheus, Grafana, etc.)

💡 Responsibilities

Lead efforts in incident response and root cause analysis
Analyze system performance and create capacity plans
Design and maintain monitoring and alerting solutions
Collaborate with engineering teams on system design
Identify opportunities for infrastructure cost optimization
Build and improve automation tools and deployment pipelines
Mentor junior and mid-level engineers
Participate in on-call rotation for system outages

🪄 Skills: AWS, Docker, PostgreSQL, Python, Bash, GCP, Kafka, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, CI/CD, Terraform, Microservices, Ansible

🔥 Senior / Staff Site Reliability Engineer

Posted about 2 months ago

📍 United States, Canada, UK, Hong Kong

🧭 Full-Time

🔍 Blockchain/Decentralized Finance

🏢 Company: Scroll.io

👥 51-100

💰 $50,000,000 about 2 years ago

Cryptocurrency, Blockchain, Information Technology

🔧 Requirements

5+ years of experience as a DevOps, Infrastructure, Site Reliability, or Cloud Engineer
3+ years of experience as a Backend Developer
Familiarity with hybrid cloud environments (AWS, Azure, GCP)
Proficiency in modern programming languages (Go, Rust, Python)
Linux administration experience
Experience with configuration management tools (Terraform, Ansible)
Experience with containers in production systems

💡 Responsibilities