Senior Site Reliability Engineer

Posted 15 days ago

💎 Seniority level: Senior, 3+ years

💸 Salary: 117,725 - 162,900 USD per year

🔍 Industry: Mental Health

🏢 Company: Modern Health | 👥 251-500 | 💰 $74,000,000 Series D about 4 years ago | Mental Health, Therapeutics, mHealth, Wellness, Health Care, Software

🗣️ Languages: English

⏳ Experience: 3+ years

Requirements:
  • 3+ years of experience in software engineering, with 2 years of experience in DevOps
  • Cloud Provider (AWS, GCP, Azure) experience managing resources through Infrastructure as Code (Terraform)
  • Container Orchestration (ECS or K8s) experience to confidently build, test, and release containerized applications for multiple environments and regions
  • Knowledge of Observability best practices across common cloud resources (EC2, ECS, RDS, DynamoDB, S3, SQS, EventBridge), with experience rolling out enhancements across a distributed platform with scale in mind
  • Experience with shell scripting for *nix systems
  • Experience with Networking for web applications
  • Effective at communicating ideas through writing and diagramming
  • Comfortable working with a distributed development and ops team
  • Familiarity with AWS (ECS and cloud hosting); GitLab (CI/CD); Python (Django, Flask, aiohttp); Bash; data (PostgreSQL, Redis); monitoring (Datadog and Sentry); IaC (Terraform, Packer)
Responsibilities:
  • Manage and orchestrate Cloud Resource (AWS) configuration using Infrastructure As Code (Terraform) to empower engineering staff to embrace a DevOps culture of Self Service Ownership
  • Develop and govern Observability (Datadog) best practices for tracking platform performance and health trends to meet customer SLAs and lead technical decisions with strong supporting evidence
  • Create solutions that scale dynamically with demand and are flexible enough to pivot with fast-changing project requirements, while balancing good against perfect
  • Provide strong and consistent communication updates on technical progress or blockers to keep stakeholders informed while additionally creating appropriate documentation on technical design to spread knowledge and reduce information silos
  • Participate in the 24/7 on-call rotation, respond to critical alerts, and follow documented incident investigation procedures to restore customer-facing feature availability
  • Maintain HIPAA, GDPR, and SOC 2 compliance and general security through best practice implementation

Related Jobs

📍 USA

🧭 Full-Time

💸 140,000 - 185,000 USD per year

🔍 Software Development

🏢 Company: Juniper Square

  • 5+ years of experience in SRE, DevOps, or Infrastructure Engineering with a proven track record of ownership and initiative.
  • Strong experience with Kubernetes, Helm, and CNIs, including networking and security.
  • Proficiency in AWS services such as RDS, Aurora, IAM, VPC, EKS, and EC2.
  • Experience in PostgreSQL administration, including performance tuning and high availability in RDS/Aurora.
  • Hands-on experience with GitHub Actions and ArgoCD for secure and scalable CI/CD automation.
  • Strong background in Infrastructure as Code (IaC) with Crossplane and Terraform.
  • Deep understanding of observability and monitoring with Datadog.
  • Experience with Kyverno for Kubernetes policy-based security enforcement.
  • Proficiency in Python and Bash scripting for automation and system management.
  • Strong understanding of CI/CD security best practices and ability to implement controls for securing deployments.
  • Own reliability and scalability initiatives—identify, prioritize, and implement solutions before issues escalate.
  • Participate in an on-call rotation, responding to incidents, performing root cause analysis, and driving long-term fixes.
  • Design, deploy, and manage Kubernetes clusters using Helm charts, Cilium, and Karpenter to optimize performance and cost.
  • Architect and maintain AWS infrastructure with a focus on RDS/Aurora PostgreSQL, networking, and scaling best practices.
  • Implement GitHub Actions CI/CD pipelines, integrating security best practices and automation.
  • Define and enforce policy-based security for Kubernetes using Kyverno.
  • Automate infrastructure provisioning with Crossplane and Terraform to ensure consistency and scalability.
  • Enhance observability and monitoring using Datadog to proactively detect and resolve issues.
  • Improve security and reliability by identifying risks in CI/CD, cloud environments, and Kubernetes, then implementing necessary safeguards.
  • Lead post-incident reviews, drive lessons learned into long-term improvements, and document best practices in Confluence.

AWS, PostgreSQL, Python, Bash, Kubernetes, CI/CD, DevOps, Terraform

Posted about 7 hours ago

🔍 Software Development

🏢 Company: ABBYY | 👥 1001-5000 | 💰 almost 4 years ago | Communications Infrastructure, Analytics, Data Visualization, Software

  • 3 - 6+ years of experience in an Infrastructure, SRE, DevOps, CloudOps role
  • Experience programming in one or more of the following: C#, Java, Python, .NET, Node.js, Go
  • Experience with Terraform, Ansible, or a similar infrastructure-as-code tool
  • Experience with at least one cloud technology - AWS or Azure. Preferably Azure
  • Experience with cloud-performant microservices and event-driven architectures
  • Experience with Kubernetes administration is an added advantage.
  • Understanding of information security concepts and terminology
  • Distributed monitoring experience: logging, metrics, tracing, etc.
  • Strong knowledge of software development methodologies and passion for creating high-standard tool sets for infrastructure-as-code
  • Ability to analyze problems quickly and find suitable solutions based on available resources
  • A proactive and open-minded individual with a clear client focus and structured approach
  • Experience in leading and managing a team
  • Co-own critical production service designs to ensure high reliability is achievable and measurable
  • Drive reliability and observability improvements in the services within the engineering verticals
  • Using monitoring and telemetry data, help teams make informed decisions on where reliability challenges may exist and help design and build solutions to improve them
  • Build and improve internal tools and automation software to make maintaining production services easier and safer
  • Lead reliability-focused practices such as Failure Analysis, Load and Capacity Planning, Service Reviews, Architecture Designs, Incident Postmortems, and others
  • Develop infrastructure as code
  • Build SRE dashboards from SLIs to measure SLO adherence
  • Define (from design to implementation details) necessary auto-healing and fault-tolerant systems
  • Serve as the point of contact for production application issues, working closely with engineering leadership
Posted 2 days ago

📍 United States, Canada, Mexico

🧭 Full-Time

🔍 Software Development

🏢 Company: Fleetio

  • 5+ years of Ruby/Rails experience
  • 3+ years of AWS experience
  • Kubernetes experience
  • Experience with profiling and benchmarking source code
  • Effective at code review and at identifying potential performance problems before they reach production
  • Experience with Datadog or other APM tools
  • Excellent written and verbal communication skills
  • Proactively identify, triage, and resolve performance issues
  • Enhance system observability by monitoring performance metrics across Ruby, Rails, and database systems, including SLOs and SLIs
  • Guide product engineers on Ruby/Rails performance and database best practices through code reviews and pair programming
  • Optimize performance through instance configuration and monitoring
  • Collaborate with other SREs to proactively identify and address performance bottlenecks
  • Lead database capacity planning and upgrade initiatives
  • Manage the database-specific components of disaster recovery planning and execution
  • Oversee backup systems and pre-production databases
  • Create and maintain infrastructure and operations documentation
  • Participate in the on-call rotation

AWS, PostgreSQL, SQL, Cloud Computing, Kubernetes, Ruby, Ruby on Rails, CI/CD, Terraform

Posted 7 days ago

📍 USA

🧭 Full-Time

💸 186,065 - 218,900 USD per year

🔍 Software Development

🏢 Company: Coinbase Careers Page | 👥 1000-5000

  • 5+ years of experience building, iterating upon, and maintaining corporate IAM systems
  • 5+ years of experience with operational procedures and application development
  • Deep domain knowledge of prominent cloud identity providers: Okta, Duo, Google Workspace, Azure AD, Ping, etc.
  • Demonstrated success developing and implementing tooling that solves problems related to identity lifecycle and provisioning, SSO, MFA, ABAC, RBAC, directory services, zero trust networking, PAM, PIM, and secrets management
  • Experience configuring and implementing modern open source tooling such as: Terraform, Ansible, Kubernetes, Docker
  • Fluency in a modern programming language (Golang, Python, Ruby, Java, C#, etc.)
  • Strong experience using and managing AWS, GCP, Azure, or other cloud environment with IaC
  • Strong understanding of CI/CD workflows, automation frameworks, and best practices
  • Clear communication—demonstrate ability to explain technical concepts simply
  • Self starter—possess a continuous learning mindset
  • Demonstrate critical thinking under pressure
  • Engage in a dynamic role that combines traditional operations responsibilities and active contributions to the development and deployment of cloud-native applications, fostering a DevOps culture that emphasizes collaboration and automation
  • Partner across Coinbase to design, implement, and maintain performant, reliable, and secure system architectures
  • Provide corporate IAM and DevOps tooling subject matter expertise to adjacent IT, Security, and Engineering teams
  • Implement automation tooling and scripts to eliminate manual, repetitive tasks and reduce inefficiencies in system operations
  • Create comprehensive documentation and runbooks that detail system configurations, operational procedures, and troubleshooting steps across system lifecycle
  • Build and maintain CI/CD pipelines for integrating changes and deploying to production in progressively tested environments
  • Deliver configurations and maintain state using configuration management tools
  • Facilitate incident response and conduct root cause analysis and blameless retrospectives
  • Define metrics and bolster monitoring/observability across corporate IAM systems
  • Participate in regular on-call rotation to ensure 24x7 uptime for critical systems

AWS, Docker, Python, Cloud Computing, Kubernetes, LDAP, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

Posted 9 days ago

📍 Americas, EU, UK

🔍 Cryptocurrency

🏢 Company: Auros | 👥 11-50 | 💰 $17,000,000 about 2 years ago | Cryptocurrency

  • An SRE/DevOps professional with experience managing and optimising Linux systems in a high-performance 24 x 7 environment.
  • Cloud management using IaC, with experience in AWS, Azure or Google Cloud.
  • A background in container management, deployment, and orchestration. Kubernetes experience is good to have; strong Docker skills are required.
  • Knowledge and experience in managing configuration at scale.
  • Experience with CI/CD pipelines and version control best practices.
  • Experience with application and infrastructure instrumentation using tools like Prometheus, OpenTelemetry and eBPF.
  • Strong knowledge of cloud security and IAM policies is required.
  • SIEM and threat management experience.
  • Must know how to secure Mac and Linux endpoints.
  • Python and Bash experience is a must.
  • Participate in the on-call roster to support our trading operations.
  • Maintain and improve our global infrastructure with high performance and reliability requirements.
  • Improve and update the security infrastructure of a widely distributed company that operates in a high-risk environment.
  • Engage and collaborate with other teams around system layout, rollout procedures and improving DevOps processes.
  • Develop internal tools and automation to accomplish the team's goals.
  • Application tuning and troubleshooting; you will keep abreast of changes to trading system features and deployment, providing guidance to developers looking to improve their application performance or reliability.
  • Active participation in various trading and infrastructure projects.
  • Work closely with developers, traders and other staff to accomplish our firm’s goals.

AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Prometheus, CI/CD, Linux, DevOps, Terraform, Ansible

Posted 10 days ago

📍 United States

🧭 Full-Time

💸 120,000 - 150,000 USD per year

🔍 Software Development

🏢 Company: Echo360 Inc

  • 5+ years of experience as a Site Reliability Engineer or similar role.
  • Strong understanding of AWS cloud services, including DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, EKS, ECS and EC2.
  • Experience with infrastructure automation tools like Ansible, Terraform, or CloudFormation.
  • Experience with monitoring and alerting tools like CloudWatch, Datadog, Prometheus, Grafana, Kibana, and PagerDuty.
  • Experience with GitHub Actions, CI/CD pipelines, and deployment strategies.
  • Strong problem-solving and analytical skills.
  • Excellent communication and collaboration skills.
  • Ability to work independently and take ownership of complex tasks.
  • Passion for technology and a desire to learn and grow.
  • Experience with Jenkins, PostgreSQL, and MongoDB.
  • Experience with cloud cost optimization, security best practices and tools.
  • Experience working in a fast-paced, agile environment.
  • Experience with Rancher, Cattleprod, and TeamCity is a plus.
  • Ensure service reliability and SLO/SLA adherence in production, preventing incidents by proactively conducting failure testing.
  • Implement automated monitoring and alerting systems for early detection of potential problems.
  • Collaborate with development teams to perform deployments and rollbacks with minimal disruption.
  • Optimize the performance and scalability of our AWS infrastructure, including RDS, DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, SES, EKS, ECS, and EC2.
  • Automate infrastructure provisioning and deployment processes using Terraform, CI/CD pipelines, and configuration management tools.
  • Proactively identify and address potential security vulnerabilities to maintain compliance, IAM best practices, and secrets management.
  • Participate in incident response and post-mortem analysis activities to identify root causes and prevent future occurrences.
  • Help onboard and mentor junior team members, sharing your knowledge and expertise.
  • Stay up to date on the latest cloud technologies and best practices for SRE.
  • Participate in a well-structured on-call rotation with other Site Reliability Engineers.
  • Explore new technologies and innovative solutions to improve service quality and speed to market.
  • Participate in technical discussions and deep dives with the other engineering and product teams.

AWS, PostgreSQL, DynamoDB, Jenkins, Kafka, Kibana, MongoDB, MySQL, Grafana, Prometheus, CI/CD, Agile methodologies, Linux, DevOps, Terraform, Microservices, Ansible

Posted 14 days ago

📍 Brazil

🧭 Full-Time

🔍 Data Integration Technology

🏢 Company: Supermetrics | 👥 251-500 | 💰 $47,174,818 Series B over 4 years ago | SaaS, Analytics, B2B, Marketing, Enterprise Software, Software

  • 7+ years of experience in Site Reliability Engineering, Platform Engineering, or related roles
  • In-depth understanding of containers and experience operating Kubernetes clusters at scale.
  • Experience operating databases in production
  • Proficient in database concepts with practical experience in both relational and NoSQL databases.
  • In-depth knowledge of Linux systems and Terraform.
  • In-depth experience and understanding of AWS and GCP
  • Solid understanding of modern observability practices and tools
  • Automation mindset with the ability to automate repetitive tasks using scripting languages such as Python or Bash.
  • Collaborative approach to working with others
  • Willing to take on-call rotations during non-business hours
  • Good communication skills, particularly in writing (documentation, and good PRs too)
  • Strong problem-solving abilities with a keen interest in the tools, technologies and problems in this space
  • A developer background and the ability to write CLIs and other tools in Go, Python, or Rust.
  • Security mindset with experience implementing security best practices in platform and operational contexts.
  • Experience in creating and managing Helm charts.
  • Expert knowledge of continuous integration and continuous deployment (CI/CD) systems and processes and experience developing and maintaining GitHub Actions.
  • Write Terraform configuration and modules that bootstrap a Kubernetes cluster, or review PRs with contributions from other members, making sure that our modules are truly reusable and well-defined, improving how we test and release them.
  • Write (using Golang, for example) and maintain or improve our tooling, ensuring it facilitates platform utilization by engineering teams.
  • Develop and maintain Helm charts for internal deployments and third-party software.
  • Respond to incidents in our production environment.
  • Support our pre-sales team and assist them in answering potential customers' questions on our architecture and how we guarantee data security or consistency or ensure uptime.
  • Review an architecture change involving a new database and take part in the meetings discussing the pros and cons of such an approach.
  • Rewrite a GitHub Action to improve how we deploy to Kubernetes using GitOps.
  • Fix technical issues as they arise.
  • Participate in our on-call rotations to provide support, respond to incidents, or handle internal users' questions.

AWS, Docker, PostgreSQL, Python, SQL, Bash, ElasticSearch, GCP, Kubernetes, MySQL, Clickhouse, Grafana, Prometheus, Redis, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Ansible, Software Engineering

Posted 15 days ago

📍 Australia, New Zealand

🔍 Software Development

  • Full working rights and residency in Australia or New Zealand.
  • Significantly impact the developer experience, enabling developers across our R&D teams to ship scalable applications with high speed, quality, and performance.

Docker, Kubernetes, CI/CD, RESTful APIs, Linux, DevOps, Terraform, NodeJS, Scripting

Posted 17 days ago

📍 United States, Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Dune | 👥 101-250

  • Proven expertise in managing and optimising bare-metal infrastructure and containerised environments.
  • Experience with infrastructure-as-code and orchestration tools.
  • Strong understanding of system performance, debugging, and optimization across diverse environments.
  • Ability to collaborate with interdisciplinary teams and communicate complex technical concepts clearly.
  • Solid foundation in computer science fundamentals and system design.
  • Ability to work collaboratively in a remote setting, contributing to a positive and inclusive team culture.
  • 5+ years of experience as a systems or infrastructure engineer in a collaborative, problem-solving environment.
  • Experience with distributed systems and managing large-scale, high-availability environments.
  • Hands-on experience with Nomad or Kubernetes for workload orchestration in production environments.
  • Proficiency in infrastructure-as-code tools like Ansible and Terraform, with a proven ability to automate and manage complex systems.
  • Experience with bare-metal infrastructure.
  • Proficiency in scripting or programming languages such as Python, Go, or Bash.
  • Experience with monitoring and observability tools for infrastructure performance.
  • Familiarity with cloud cost management and performance improvement strategies.
  • Strong analytical and troubleshooting skills.
  • Experience working across multiple time zones.
  • Collaborate closely with interdisciplinary teams to ensure the infrastructure meets the demanding performance, reliability, and scalability needs of our products.
  • Embrace the Platform team's mission to empower product teams with efficient, low-overhead services by developing and maintaining robust infrastructure and scalable services.
  • Design and maintain highly reliable containerized environments, ensuring seamless operation of our critical systems.
  • Analyze system performance to identify bottlenecks, proposing and implementing solutions to enhance infrastructure efficiency.
  • Contribute to maintaining high system reliability and scalability, focusing on unique and challenging technical problems.

Docker, Python, SQL, Bash, Cloud Computing, Git, Kubernetes, Go, REST API, CI/CD, Linux, DevOps, Terraform, Ansible, Scripting, Debugging

Posted 21 days ago

📍 Germany, Spain, Portugal

🏢 Company: Jobgether | 👥 11-50 | 💰 $1,493,585 Seed about 2 years ago | Internet

  • 5+ years of experience in a Site Reliability Engineer or similar role.
  • 3+ years of experience with AWS services and container orchestration tools.
  • 2+ years of Kubernetes experience.
  • Strong knowledge of observability tools and principles (monitoring, logging, tracing).
  • Hands-on experience with Terraform for infrastructure as code.
  • Proficiency in at least one programming language (e.g., Python, Go, Java).
  • Experience in incident management, postmortem analysis, and risk mitigation.
  • Familiarity with messaging systems like SNS and SQS, and experience with CI/CD tools.
  • Develop and maintain systems that are reliable, scalable, and efficient.
  • Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure optimal system performance.
  • Conduct blameless post-incident reviews, identify root causes, and implement preventive actions.
  • Automate operational tasks, incident responses, and contribute to system performance optimizations.
  • Work with engineering teams to ensure systems are designed for reliability, scalability, and maintainability.
  • Continuously evaluate and improve system performance, capacity, and cost efficiency.
  • Participate in the on-call rotation, providing troubleshooting and resolution support for critical issues.

AWS, Python, Java, Kubernetes, Go, CI/CD, RESTful APIs, Linux, Terraform, Scripting

Posted 22 days ago