Senior Site Reliability Engineer

Posted 15 days ago

💎 Seniority level: Senior, 3+ years

💸 Salary: 117,725 - 162,900 USD per year

🔍 Industry: Mental Health

🏢 Company: Modern Health 👥 251-500 💰 $74,000,000 Series D about 4 years ago | Mental Health Therapeutics mHealth Wellness Health Care Software

🗣️ Languages: English

⏳ Experience: 3+ years

Requirements:

3+ years of experience in software engineering, with 2 years of experience in DevOps

Cloud Provider (AWS, GCP, Azure) experience managing resources through Infrastructure as Code (Terraform)

Container Orchestration (ECS or K8s) experience to confidently build, test, and release containerized applications for multiple environments and regions

Knowledge of Observability best practices across common cloud resources (EC2, ECS, RDS, DynamoDB, S3, SQS, EventBridge), with experience rolling out enhancements across a distributed platform with scale in mind

Experience with shell scripting for *nix systems

Experience with Networking for web applications

Effective at communicating ideas through writing and diagramming

Comfortable working with a distributed development and ops team

Familiarity with: AWS (ECS and cloud hosting); GitLab (CI/CD); Python (Django, Flask, aiohttp); Bash; Data (PostgreSQL, Redis); Monitoring (Datadog and Sentry); IaC (Terraform, Packer)

Responsibilities:

Manage and orchestrate Cloud Resource (AWS) configuration using Infrastructure As Code (Terraform) to empower engineering staff to embrace a DevOps culture of Self Service Ownership

Develop and govern Observability (Datadog) best practices for tracking platform performance and health trends to meet customer SLAs and lead technical decisions with strong supporting evidence

Create solutions that scale dynamically with demand and stay flexible enough to pivot for fast-changing project requirements, while balancing good against perfect

Provide strong and consistent communication updates on technical progress or blockers to keep stakeholders informed while additionally creating appropriate documentation on technical design to spread knowledge and reduce information silos

Participate in the 24/7 on-call rotation, respond to critical alerts, and follow documented incident investigation procedures to reestablish customer-facing feature availability

Maintain HIPAA, GDPR, SOC-2 compliance and general security through best practice implementation

Related Jobs

🔥 Senior Site Reliability Engineer

Posted about 7 hours ago

📍 USA

🧭 Full-Time

💸 140,000 - 185,000 USD per year

🔍 Software Development

🏢 Company: Juniper Square

🔧 Requirements

5+ years of experience in SRE, DevOps, or Infrastructure Engineering with a proven track record of ownership and initiative.
Strong experience with Kubernetes, Helm, and CNIs, including networking and security.
Proficiency in AWS services such as RDS, Aurora, IAM, VPC, EKS, and EC2.
Experience in PostgreSQL administration, including performance tuning and high availability in RDS/Aurora.
Hands-on experience with GitHub Actions and ArgoCD for secure and scalable CI/CD automation.
Strong background in Infrastructure as Code (IaC) with Crossplane and Terraform.
Deep understanding of observability and monitoring with Datadog.
Experience with Kyverno for Kubernetes policy-based security enforcement.
Proficiency in Python and Bash scripting for automation and system management.
Strong understanding of CI/CD security best practices and ability to implement controls for securing deployments.

💡 Responsibilities

Own reliability and scalability initiatives—identify, prioritize, and implement solutions before issues escalate.
Participate in an on-call rotation, responding to incidents, performing root cause analysis, and driving long-term fixes.
Design, deploy, and manage Kubernetes clusters using Helm charts, Cilium, and Karpenter to optimize performance and cost.
Architect and maintain AWS infrastructure with a focus on RDS/Aurora PostgreSQL, networking, and scaling best practices.
Implement GitHub Actions CI/CD pipelines, integrating security best practices and automation.
Define and enforce policy-based security for Kubernetes using Kyverno.
Automate infrastructure provisioning with Crossplane and Terraform to ensure consistency and scalability.
Enhance observability and monitoring using Datadog to proactively detect and resolve issues.
Improve security and reliability by identifying risks in CI/CD, cloud environments, and Kubernetes, then implementing necessary safeguards.
Lead post-incident reviews, drive lessons learned into long-term improvements, and document best practices in Confluence.

AWS, PostgreSQL, Python, Bash, Kubernetes, CI/CD, DevOps, Terraform

🔥 Senior Site Reliability Engineer

Posted 2 days ago

🔍 Software Development

🏢 Company: ABBYY 👥 1001-5000 💰 almost 4 years ago | Communications Infrastructure Analytics Data Visualization Software

🔧 Requirements

3 - 6+ years of experience in an Infrastructure, SRE, DevOps, CloudOps role
Experience programming in one or more of the following: C#, Java, Python, .NET, NodeJS, Go
Experience with Terraform, Ansible, or a similar infrastructure automation tool
Experience with at least one cloud platform (AWS or Azure, preferably Azure)
Experience with cloud-performant microservices and event-driven architectures
Experience with Kubernetes administration is an added advantage.
Understanding of information security concepts and terminology
Distributed monitoring experience: logging, metrics, tracing, etc.
Strong knowledge of software development methodologies and passion for creating high-standard tool sets for infrastructure-as-code
Ability to analyze problems quickly and find suitable solutions based on available resources
A proactive and open-minded individual with a clear client focus and structured approach
Experience in leading and managing a team

💡 Responsibilities

Co-own critical production service designs to ensure high reliability is achievable and measurable
Drive reliability and observability improvements in the services within the engineering verticals
Using monitoring and telemetry data, help teams make informed decisions on where reliability challenges may exist and help design and build solutions to improve them
Build and improve internal tools and automation software to make maintaining production services easier and safer
Lead reliability-focused practices such as Failure Analysis, Load and Capacity Planning, Service Reviews, Architecture Designs, Incident Postmortems, and others
Develop Infrastructure as Code
Build SRE dashboards from SLIs to measure SLO adherence
Define (from design to implementation details) necessary auto-healing and fault-tolerant systems
Serve as the point of contact for production application issues, working closely with engineering leadership

🔥 Senior Site Reliability Engineer, Performance

Posted 7 days ago

📍 United States, Canada, Mexico

🧭 Full-Time

🔍 Software Development

🏢 Company: Fleetio

🔧 Requirements

5+ years of Ruby/Rails experience
3+ years of AWS experience
Kubernetes experience
Experience with profiling and benchmarking source code
Effective at code review and at identifying potential performance problems before they reach production
Experience with Datadog or other APM tools
Excellent written and verbal communication skills

💡 Responsibilities

Proactively identify, triage, and resolve performance issues
Enhance system observability by monitoring performance metrics across Ruby, Rails, and database systems, including SLOs and SLIs
Guide product engineers on Ruby/Rails performance and database best practices through code reviews and pair programming
Optimize performance through instance configuration and monitoring
Collaborate with other SREs to proactively identify and address performance bottlenecks
Lead database capacity planning and upgrade initiatives
Manage the database-specific components of disaster recovery planning and execution
Oversee backup systems and pre-production databases
Create and maintain infrastructure and operations documentation
Participate in the on-call rotation

AWS, PostgreSQL, SQL, Cloud Computing, Kubernetes, Ruby, Ruby on Rails, CI/CD, Terraform

🔥 Senior Site Reliability Engineer - Identity Platform

Posted 9 days ago

📍 USA

🧭 Full-Time

💸 186,065 - 218,900 USD per year

🔍 Software Development

🏢 Company: Coinbase Careers Page 👥 1000-5000

🔧 Requirements

5+ years of experience building, iterating upon, and maintaining corporate IAM systems
5+ years of experience with operational procedures and application development
Deep domain-knowledge with prominent cloud identity provider(s): Okta, Duo, Google Workspace, Azure AD, Ping, etc.
Demonstrated success developing and implementing tooling that solves problems related to: identity lifecycle and provisioning, SSO, MFA, ABAC, RBAC, directory services, zero trust networking, PAM, PIM, and secrets management
Experience configuring and implementing modern open source tooling such as: Terraform, Ansible, Kubernetes, Docker
Fluency in a modern programming language (Golang, Python, Ruby, Java, C#, etc.)
Strong experience using and managing AWS, GCP, Azure, or other cloud environment with IaC
Strong understanding of CI/CD workflows, automation frameworks, and best practices
Clear communication—demonstrate ability to explain technical concepts simply
Self starter—possess a continuous learning mindset
Demonstrate critical thinking under pressure

💡 Responsibilities

Engage in a dynamic role that combines traditional operations responsibilities and active contributions to the development and deployment of cloud-native applications, fostering a DevOps culture that emphasizes collaboration and automation
Partner across Coinbase to design, implement, and maintain performant, reliable, and secure system architectures
Provide corporate IAM and DevOps tooling subject matter expertise to adjacent IT, Security, and Engineering teams
Implement automation tooling and scripts to eliminate manual, repetitive tasks and reduce inefficiencies in system operations
Create comprehensive documentation and runbooks that detail system configurations, operational procedures, and troubleshooting steps across system lifecycle
Build and maintain CI/CD pipelines for integrating changes and deploying to production in progressively tested environments
Deliver configurations and maintain state using configuration management tools
Facilitate incident response and conduct root cause analysis and blameless retrospectives
Define metrics and bolster monitoring/observability across corporate IAM systems
Participate in regular on-call rotation to ensure 24x7 uptime for critical systems

AWS, Docker, Python, Cloud Computing, Kubernetes, LDAP, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Senior Site Reliability Engineer, EU, UK or Americas

Posted 10 days ago

📍 Americas, EU, UK

🔍 Cryptocurrency

🏢 Company: Auros 👥 11-50 💰 $17,000,000 about 2 years ago | Cryptocurrency

🔧 Requirements

An SRE/DevOps professional with experience managing and optimising Linux systems in a high-performance 24 x 7 environment.
Cloud management using IaC, with experience in AWS, Azure or Google Cloud.
A background in container management, deployment, and orchestration. Kubernetes experience is good to have; strong Docker skills are required.
Knowledge and experience in managing configuration at scale.
Experience with CI/CD pipelines and version control best practices.
Experience with application and infrastructure instrumentation using tools like Prometheus, OpenTelemetry and eBPF.
Strong knowledge of cloud security and IAM policies is required.
SIEM and threat management experience.
Must know how to secure Mac and Linux endpoints.
Python and Bash experience is a must.

💡 Responsibilities

Participate in on-call roster to support our trading operations.
Maintain and improve our global infrastructure with high performance and reliability requirements.
Improve and update the security infrastructure of a widely distributed company that operates in a high-risk environment.
Engage and collaborate with other teams around system layout, rollout procedures and improving DevOps processes.
Develop internal tools and automation to accomplish the team’s goals.
Application tuning and troubleshooting; you will keep abreast of changes to trading system features and deployment, providing guidance to developers looking to improve their application performance or reliability.
Active participation in various trading and infrastructure projects.
Work closely with developers, traders and other staff to accomplish our firm’s goals.

AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Prometheus, CI/CD, Linux, DevOps, Terraform, Ansible

🔥 Senior Site Reliability Engineer

Posted 14 days ago

📍 United States

🧭 Full-Time

💸 120,000 - 150,000 USD per year

🔍 Software Development

🏢 Company: Echo360 Inc

🔧 Requirements

5+ years of experience as a Site Reliability Engineer or similar role.
Strong understanding of AWS cloud services, including DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, EKS, ECS and EC2.
Experience with infrastructure automation tools like Ansible, Terraform, or CloudFormation.
Experience with monitoring and alerting tools like CloudWatch, DataDog, Prometheus, Grafana, Kibana, and PagerDuty.
Experience with GitHub Actions, CI/CD pipelines, and deployment strategies.
Strong problem-solving and analytical skills.
Excellent communication and collaboration skills.
Ability to work independently and take ownership of complex tasks.
Passion for technology and a desire to learn and grow.
Experience with Jenkins, PostgreSQL, and MongoDB.
Experience with cloud cost optimization, security best practices and tools.
Experience working in a fast-paced, agile environment.
Experience with Rancher, Cattleprod, and TeamCity is a plus.

💡 Responsibilities

Ensure service reliability and SLO/SLA adherence in production, preventing incidents by proactively conducting failure testing.
Implement automated monitoring and alerting systems for early detection of potential problems.
Collaborate with development teams to perform deployments and rollbacks with minimal disruption.
Optimize the performance and scalability of our AWS infrastructure, including RDS, DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, SES, EKS, ECS, and EC2.
Automate infrastructure provisioning and deployment processes using Terraform, CI/CD pipelines, and configuration management tools.
Proactively identify and address potential security vulnerabilities to maintain compliance, IAM best practices, and secrets management.
Participate in incident response and post-mortem analysis activities to identify root causes and prevent future occurrences.
Help onboard and mentor junior team members, sharing your knowledge and expertise.
Stay up to date on the latest cloud technologies and best practices for SRE.
Participate in a well-structured on-call rotation with other Site Reliability Engineers.
Explore new technologies and innovative solutions to improve service quality and speed to market.
Participate in technical discussions and deep dives with the other engineering and product teams.

AWS, PostgreSQL, DynamoDB, Jenkins, Kafka, Kibana, MongoDB, MySQL, Grafana, Prometheus, CI/CD, Agile methodologies, Linux, DevOps, Terraform, Microservices, Ansible

🔥 Senior Site Reliability Engineer

Posted 15 days ago

📍 Brazil

🧭 Full-Time

🔍 Data Integration Technology

🏢 Company: Supermetrics 👥 251-500 💰 $47,174,818 Series B over 4 years ago | SaaS Analytics B2B Marketing Enterprise Software Software

🔧 Requirements

7+ years of experience in Site Reliability Engineering, Platform Engineering, or related roles
In-depth understanding of containers and experience operating Kubernetes clusters at scale.
Experience operating databases in production
Proficient in database concepts with practical experience in both relational and NoSQL databases.
In-depth knowledge of Linux systems and Terraform.
In-depth experience and understanding of AWS and GCP
Solid understanding of modern observability practices and tools
Automation mindset with the ability to automate repetitive tasks using scripting languages such as Python or Bash.
Collaborative approach to working with others
Willing to take on-call rotations during non-business hours
Good communication skills, in particular in writing (documentation, but able to write good PRs too)
Skilled problem-solving abilities with a keen interest in the tools, technologies and problems in this space
A developer background and the ability to write CLIs and other tools in Go, Python, or Rust.
Security mindset with experience implementing security best practices in platform and operational contexts.
Experience in creating and managing Helm charts.
Expert knowledge of continuous integration and continuous deployment (CI/CD) systems and processes and experience developing and maintaining GitHub Actions.

💡 Responsibilities

Write Terraform configuration and modules that bootstrap a Kubernetes cluster, or review PRs with contributions from other members, making sure that our modules are truly reusable and well-defined, improving how we test and release them.
Write (using Golang, for example) and maintain or improve our tooling, ensuring it facilitates platform utilization by engineering teams.
Develop and maintain Helm charts for internal deployments and third-party software.
Respond to an incident in our production environment.
Support our pre-sales team and assist them in answering potential customers' questions on our architecture and how we guarantee data security or consistency or ensure uptime.
Review an architecture change involving a new database and take part in the meetings discussing the pros and cons of such an approach.
Rewrite a GitHub Action to improve how we deploy to Kubernetes using GitOps.
Fix technical issues as they arise.
Participate in our on-call rotations to provide support, respond to incidents, or handle internal users' questions.

AWS, Docker, PostgreSQL, Python, SQL, Bash, ElasticSearch, GCP, Kubernetes, MySQL, Clickhouse, Grafana, Prometheus, Redis, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Ansible, Software Engineering

🔥 Senior Site Reliability Engineer

Posted 17 days ago

📍 Australia, New Zealand

🔍 Software Development

🔧 Requirements

Full working rights and residency in Australia or New Zealand.

💡 Responsibilities

Significantly impact the developer experience, enabling developers across our R&D teams to ship scalable applications with high speed, quality, and performance.

Docker, Kubernetes, CI/CD, RESTful APIs, Linux, DevOps, Terraform, NodeJS, Scripting

🔥 Senior Site Reliability Engineer

Posted 21 days ago

📍 United States, Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Dune 👥 101-250

🔧 Requirements

Proven expertise in managing and optimising bare-metal infrastructure and containerised environments.
Experience with infrastructure-as-code and orchestration tools.
Strong understanding of system performance, debugging, and optimization across diverse environments.
Ability to collaborate with interdisciplinary teams and communicate complex technical concepts clearly.
Solid foundation in computer science fundamentals and system design.
Ability to work collaboratively in a remote setting, contributing to a positive and inclusive team culture.
5+ years of experience as a systems or infrastructure engineer in a collaborative, problem-solving environment.
Experience with distributed systems and managing large-scale, high-availability environments.
Hands-on experience with Nomad or Kubernetes for workload orchestration in production environments.
Proficiency in infrastructure-as-code tools like Ansible and Terraform, with a proven ability to automate and manage complex systems.
Experience with bare-metal infrastructure.
Proficiency in scripting or programming languages such as Python, Go, or Bash.
Experience with monitoring and observability tools for infrastructure performance.
Familiarity with cloud cost management and performance improvement strategies.
Strong analytical and troubleshooting skills.
Experience working across multiple time zones.

💡 Responsibilities

Collaborate closely with interdisciplinary teams to ensure the infrastructure meets the demanding performance, reliability, and scalability needs of our products.
Embrace the Platform team's mission to empower product teams with efficient, low-overhead services by developing and maintaining robust infrastructure and scalable services.
Design and maintain highly reliable containerized environments, ensuring seamless operation of our critical systems.
Analyze system performance to identify bottlenecks, proposing and implementing solutions to enhance infrastructure efficiency.
Contribute to maintaining high system reliability and scalability, focusing on unique and challenging technical problems.

Docker, Python, SQL, Bash, Cloud Computing, Git, Kubernetes, Go, REST API, CI/CD, Linux, DevOps, Terraform, Ansible, Scripting, Debugging

🔥 Senior Site Reliability Engineer - (Remote - Europe)

Posted 22 days ago

📍 Germany, Spain, Portugal

🏢 Company: Jobgether 👥 11-50 💰 $1,493,585 Seed about 2 years ago | Internet

🔧 Requirements

5+ years of experience in a Site Reliability Engineer or similar role.
3+ years of experience with AWS services and container orchestration tools.
2+ years of Kubernetes experience.
Strong knowledge of observability tools and principles (monitoring, logging, tracing).
Hands-on experience with Terraform for infrastructure as code.
Proficiency in at least one programming language (e.g., Python, Go, Java).
Experience in incident management, postmortem analysis, and risk mitigation.
Familiarity with messaging systems like SNS, SQS, and experience with CI/CD tools.

💡 Responsibilities

Develop and maintain systems that are reliable, scalable, and efficient.
Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure optimal system performance.
Conduct blameless post-incident reviews, identify root causes, and implement preventive actions.
Automate operational tasks, incident responses, and contribute to system performance optimizations.
Work with engineering teams to ensure systems are designed for reliability, scalability, and maintainability.
Continuously evaluate and improve system performance, capacity, and cost efficiency.
Participate in the on-call rotation, providing troubleshooting and resolution support for critical issues.

AWS, Python, Java, Kubernetes, Go, CI/CD, RESTful APIs, Linux, Terraform, Scripting
