Senior Site Reliability Engineer

Posted 3 months ago (inactive)

💎 Seniority level: Senior

📍 Location: US and Canada

💸 Salary: 150000 - 200000 USD per year

🔍 Industry: Healthcare

🏢 Company: Synthesis Health | 👥 51-100 | 💰 Seed about 2 years ago | Medical, Wellness, Health Care

🗣️ Languages: English

⏳ Experience: Five years in software development as a DevOps and/or SRE

🪄 Skills: Leadership, Python, Software Development, GCP, Git, Kubernetes, Software Architecture, Analytical Skills, Collaboration, CI/CD, Customer service, DevOps, Terraform, Organizational skills, Documentation, Compliance

Requirements:

Bachelor's degree or diploma in computer science, engineering, mathematics, or a related field.
At least one year of experience as a Python developer transitioning to an SRE role.
Five years of experience in software development as a DevOps and/or SRE.
Two years of experience in an SRE role with Kubernetes, preferably GKE.
Experience using ArgoCD for rollouts and deployments.
One year of experience with a service mesh such as Istio in a GKE environment.
Proficiency in scripting languages like Python and automation tools like Terraform.
Solid understanding of security best practices for pipelines and cloud environments.
Familiarity with compliance standards like SOC 2, HIPAA.
Strong expertise in CI/CD pipeline management.

Responsibilities:

Design and implement automated application deployment processes.
Establish and measure Service Level Objectives (SLO) and Budgets (SLB).
Manage development, testing, staging, pre-production and production environments.
Automate repetitive deployment tasks to improve productivity.
Select, develop, and monitor CI/CD systems.
Oversee software automation across GCP.
Containerize services to optimize resources and deployment speed.
Manage and optimize cloud infrastructure for cost and performance.
Ensure compliance with security standards and maintain disaster recovery plans.
Collaborate with cross-functional teams to improve software delivery.

Related Jobs

🔥 Senior Site Reliability Engineer

Posted about 9 hours ago

📍 United States, Europe

🧭 Full-Time

🔍 Biotechnology

🏢 Company: Invert | 👥 11-50 | 💰 $20,149,993 Seed 8 months ago | Data Management, SaaS, Application Performance Management

🔧 Requirements

Experience in cloud infrastructure
Strong incident management skills
Technical skills in software reliability

💡 Responsibilities

Design, build, and maintain scalable cloud infrastructure
Develop and enforce SLIs and SLOs
Create CI/CD pipelines
Lead Incident Management process

🪄 Skills: AWS, Docker, CI/CD, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted 5 days ago

📍 United States, Canada

🧭 Full-Time

💸 100000 - 120000 USD per year

🔍 Software Development

🏢 Company: AssuredCloud Data Services B2B Cloud Security Cyber Security

🔧 Requirements

Experience in a start-up environment
Experience designing and maintaining highly available database solutions, ideally PostgreSQL
Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
Strong engineering background
Knowledge of Node.js, Python, Docker, PostgreSQL, GraphQL (not required)

💡 Responsibilities

Provision infrastructure and tooling
Create automated tooling to maintain the platform
Build methods for monitoring and scaling services
Implement security compliance strategies
Lead and mentor engineering team

🪄 Skills: AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

🔥 Senior Site Reliability Engineer

Posted 16 days ago

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

🔧 Requirements

5+ years of AWS experience.
3+ years of Kubernetes experience.
Ruby on Rails experience.
Expert at profiling and benchmarking source code.
Effective at code review and at identifying potential performance problems before they reach production.
Experience with Datadog or other APM tools.
Excellent written and verbal communication skills.

💡 Responsibilities

Manage cloud infrastructure using Infrastructure as Code.
Manage and scale a Ruby on Rails stack.
Implement monitoring tools to improve observability.
Perform code review of new features to ensure they meet performance requirements.
Debug production issues across all levels of the stack.
Plan for the growth of, optimize, and automate Fleetio’s Infrastructure.

🪄 Skills: AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices

🔥 Senior Site Reliability Engineer, Database Operations: ClickHouse

Posted 20 days ago

📍 California, Colorado, Hawaii, New Jersey, New York, Washington, DC, Illinois, Minnesota

💸 117600 - 252000 USD per year

🔍 Software Development

🏢 Company: GitLab | 👥 1001-5000 | 💰 $268,000,000 Series E over 5 years ago | 🫂 Last layoff about 2 years ago | Developer Tools, DevOps, Open Source, SaaS, Cloud Security

🔧 Requirements

Advanced database platform management experience, preferably using PostgreSQL and ClickHouse at scale.
Advanced Cloud Infrastructure automation and management, preferably using Ansible, Chef, Terraform, Helm charts, Operators and Kubernetes.
Solid experience with at least one programming language: Go, Ruby or Python.
Advanced experience with Linux.
Extensive on-call experience as an SRE supporting mission critical systems.
Solid incident management experience across all phases.
Solid experience implementing monitoring at scale, preferably Prometheus and Grafana.

💡 Responsibilities

Design, build, and maintain ClickHouse and PostgreSQL clusters.
Provision cloud infrastructure using configuration management and IaC tools.
Implement high-availability ClickHouse solutions.
Optimize PostgreSQL clusters for core applications.
Build monitoring and alerting tools to ensure resource optimization.
Respond to platform alerts and user emergencies.
Enhance infrastructure security and partner with compliance assessors.
Collaborate with engineering teams for service rollouts and architectural improvements.

🪄 Skills: PostgreSQL, Python, Kubernetes, Ruby, ClickHouse, Go, Grafana, Prometheus, Linux, Terraform, Ansible

🔥 Senior Site Reliability Engineer

Posted 28 days ago

📍 Colombia, USA

🧭 Contract

🔍 Software Development

🏢 Company: Teravision Technologies | 👥 251-500 | 💰 about 13 years ago | Android, iOS, Mobile Apps, Information Technology, Software

🔧 Requirements

Experience managing and maintaining Kubernetes (K8s) infrastructure
Familiarity with CI/CD pipelines, particularly TeamCity, and integrating tools like SonarQube
Hands-on experience with AWS services such as S3, Route 53, and others
Strong understanding of backend systems and infrastructure management
Proficiency in troubleshooting, debugging, and ensuring system reliability in production environments
Prior experience in an on-call role
Knowledge of monitoring and alerting tools

💡 Responsibilities

Manage and maintain Kubernetes (K8s) infrastructure, including updates, patching, and software configuration management.
Troubleshoot, debug, and ensure system reliability in production environments.
Take part in the on-call role, using monitoring and alerting tools to support on-call responsibilities.

🪄 Skills: AWS, Kubernetes, CI/CD, Troubleshooting, Debugging

🔥 Senior Site Reliability Engineer | Remote US

Posted about 1 month ago

📍 United States

🔍 Cybersecurity

🔧 Requirements

Must be a self-starter with a passion for cloud technology.
Strong problem-solving abilities are essential.
Experience in major public clouds and automation is required.

💡 Responsibilities

As a Senior Site Reliability Engineer within the Cloud Services group, you will be responsible for operating cutting-edge offerings from Cloud Service Providers.
You will directly support leading cloud software companies to enhance the reliability and scalability of their SaaS products.
This role entails problem-solving and ensuring seamless service to large enterprises and government agencies.

🪄 Skills: AWS, Docker, Python, Cloud Computing, Kubernetes, DevOps, Terraform

🔥 Senior Site Reliability Engineer

Posted 3 months ago

📍 Canada, Chile

🔍 Technology

🏢 Company: Launchpad Technologies

🔧 Requirements

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.
Minimum of 5 years of experience in Site Reliability Engineering, DevOps, or similar roles.
Familiarity with monitoring tools and systems.
Proficient in scripting languages such as Python, Bash, or Ruby.
Experience with infrastructure automation tools such as Terraform, Ansible, or Chef.
Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes.
Strong knowledge of cloud platforms such as AWS, GCP, or Azure.
Excellent troubleshooting and analytical skills.
Strong communication skills and the ability to work effectively within a team.

💡 Responsibilities

Develop, maintain, and improve automated deployment, certification, and validation pipelines.
Define, implement, and monitor service level objectives (SLOs), service level agreements (SLAs), and service level indicators (SLIs).
Lead efforts to optimize, improve, and maintain the reliability and performance of the SaaS platform.
Manage third-party services and technologies used to support the SRE discipline.
Collaborate with senior management and the engineering team to lead SRE initiatives and provide updates.
Define and implement an observability framework to provide insights into system performance and behavior.
Implement proactive monitors and alerts to ensure system reliability and performance meet customer expectations.
Own operational incident management, providing support to related teams and individuals during incident resolution.
Identify and implement best practices for system reliability, security, scalability, and performance.
Participate in on-call rotations for system support, troubleshooting, and resolution.
Conduct post-mortem reviews of incidents, identify root cause, and implement remediation steps.
Develop and maintain documentation for systems, processes, and procedures.

🪄 Skills: AWS, Docker, Leadership, Python, Bash, GCP, Kubernetes, Ruby, Azure, Communication Skills, Analytical Skills, DevOps, Terraform, Documentation, Compliance, Troubleshooting

🔥 Senior Site Reliability Engineer (SRE)

Posted 3 months ago

📍 United States, Canada

🧭 Contract

🔍 Site Reliability Engineering

🔧 Requirements

5-7 years in Site Reliability Engineering
Experience with DFR, FMEA, MTBF methodologies
Proficiency with monitoring tools like DataDog, PagerDuty
Strong coding skills in languages used in SRE

💡 Responsibilities

Identify and resolve complex bugs
Write and maintain code for system reliability
Investigate complex system issues
Design and build fault-tolerant systems
Develop and maintain reliability standards

🪄 Skills: Python, Debugging

🔥 Senior Site Reliability Engineer - Platform

Posted 4 months ago

📍 USA

🧭 Full-Time

🔍 Cryptocurrency

🏢 Company: Referrals Only Board

🔧 Requirements

At least 5 years of software engineering experience.
Strong understanding of data structures and algorithms related to performance and reliability.
Fluency in at least one programming language such as Golang, Ruby, Python, or JavaScript.
Strong skills around observability, debugging, and performance tuning.
Ability to debug complex systems and willingness to understand and improve any layer of the stack.
Experience with container orchestration systems (Docker, ECS, EKS) and monitoring tools (DataDog, Graphite, Grafana, Prometheus).
Deep knowledge of UNIX/Linux system internals including system calls, TCP/IP, and debugging tools.
Strong communication skills and ability to explain technical concepts clearly.
Demonstrated critical thinking under pressure.

💡 Responsibilities

Build automation and improve systems to eliminate toil and operations work.
Improve observability, reliability, and availability by defining and measuring key metrics.
Collaborate with the core infrastructure team to performance tune and optimize cloud deployments.
Collaborate with product teams to reduce service disruptions and automate incident response.
Proactively find and analyze reliability problems and design software for improvements.
Facilitate incident response and conduct root cause analysis and blameless retrospectives.
Educate and mentor the engineering team to enhance system reliability and promote reliability as a core value.

🪄 Skills: Docker, Python, Blockchain, Ethereum, JavaScript, Kubernetes, Ruby, Algorithms, Data Structures, Go, Communication Skills, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted 4 months ago

📍 United States

💸 130000 - 170000 USD per year

🔍 Data-Powered Marketing Cloud

🏢 Company: Zeta Global | 👥 1001-5000 | 💰 $105,263,174 Post-IPO Equity 6 months ago | Information Services, Advertising, Analytics, Marketing

🔧 Requirements

7+ years of experience as an SRE.
3+ years of software development experience, emphasizing automation.
Hands-on experience with Infrastructure as Code (IaC) tools.
Experience with distributed systems and microservices architecture.
Production experience with distributed tracing.
Proficiency in Python and Bash scripting.
Solid understanding of SLIs, SLOs, and error budgets.
Experience with CI/CD practices and platforms such as GitOps or Jenkins.
Expertise in incident management and root cause analysis.
Knowledge of modern deployment strategies like Canary and Blue-Green.
Familiarity with resiliency patterns such as circuit breakers and load balancing.
Experience with SQL and NoSQL databases in distributed systems.
Proficiency in statistical analysis related to metrics.
Experience with high-performance and low-latency systems.
Experience with cloud cost optimization strategies.
Familiarity with distributed messaging systems like Kafka.
Strong understanding of security and compliance standards in SRE.

💡 Responsibilities

Implement and manage service level objectives (SLOs), service level indicators (SLIs), and error budgets.
Lead and promote postmortems, driving robust root cause analysis for continuous system improvement.
Analyze historical data to identify areas for improvement.
Implement full observability using tools like OpenTelemetry, Honeycomb, New Relic, or Datadog.
Reduce toil through runbook automation and record key MTTx metrics.
Lead design sessions focusing on capacity planning and automation.
Collaborate with product teams to enhance reliability and engage in strategic initiatives.

🪄 Skills: Python, Software Development, SQL, Bash, Jenkins, Kafka, NoSQL, CI/CD, DevOps, Microservices, Compliance
