Senior Site Reliability Engineer

Posted 21 days ago

💎 Seniority level: Senior, 5+ years

📍 Location: Continental US or Canada, EST, CST

🔍 Industry: Financial services

🏢 Company: Reach Financial 👥 51-100 Financial Services, Banking, Payments

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: AWS, Docker, Python, Grafana, Prometheus, CI/CD

Requirements:

5+ years of experience in Site Reliability Engineering, Product Engineering, or a similar role.
Experience with monitoring and observability tools such as Datadog, OpenTelemetry, Prometheus, Grafana, or similar.
Strong coding skills in at least one language (Python, JavaScript, TypeScript, Apex, or similar).
Proficiency with CI/CD tools such as GitHub Actions, or similar.
Experience with containerization (Docker) and orchestration tools like AWS ECS.
Experience working with serverless architectures and event-driven systems.
A collaborative mindset with excellent communication skills.

Responsibilities:

Design and implement monitoring, alerting, and observability systems to ensure high system uptime and fast incident identification and resolution.
Define, implement, and monitor SLIs, SLOs, and error budgets in collaboration with engineering teams to ensure optimal service reliability.
Collaborate with development teams to design and optimize application and system performance, helping improve scalability and fault tolerance.
Lead incident response efforts, perform root cause analyses, and foster blameless postmortems to prevent recurrence.
Reduce toil by automating repetitive tasks to improve team efficiency and reduce manual intervention.
Manage and scale cloud infrastructure (Salesforce or AWS preferred) for critical systems.
Partner with Security teams to ensure compliance and best practices are integrated across all systems and processes.

Related Jobs

🔥 Senior Site Reliability Engineer

Posted about 10 hours ago

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

🔧 Requirements

5+ years of AWS experience.
3+ years of Kubernetes experience.
Ruby on Rails experience.
Expert at profiling and benchmarking source code.
Effective at code review, and identifying potential performance problems before they reach production.
Experience with Datadog or other APM tools.
Excellent written and verbal communication skills.

💡 Responsibilities

Manage cloud infrastructure using Infrastructure as Code.
Manage and scale a Ruby on Rails stack.
Implement monitoring tools to improve observability.
Perform code review of new features to ensure they meet performance requirements.
Debug production issues across all levels of the stack.
Plan for the growth of, optimize, and automate Fleetio’s Infrastructure.

🪄 Skills: AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices

🔥 Senior Site Reliability Engineer, Database Operations: ClickHouse

Posted 5 days ago

📍 California, Colorado, Hawaii, New Jersey, New York, Washington, DC, Illinois, Minnesota

💸 117600 - 252000 USD per year

🔍 Software Development

🏢 Company: GitLab 👥 1001-5000 💰 $268,000,000 Series E over 5 years ago 🫂 Last layoff almost 2 years ago Developer Tools DevOps Open Source SaaS Cloud Security

🔧 Requirements

Advanced database platform management experience, preferably using PostgreSQL and ClickHouse at scale.
Advanced Cloud Infrastructure automation and management, preferably using Ansible, Chef, Terraform, Helm charts, Operators and Kubernetes.
Solid experience with at least one programming language: Go, Ruby or Python.
Advanced experience with Linux.
Extensive on-call experience as an SRE supporting mission critical systems.
Solid incident management experience across all phases.
Solid experience implementing monitoring at scale, preferably Prometheus and Grafana.

💡 Responsibilities

Design, build, and maintain ClickHouse and PostgreSQL clusters.
Provision cloud infrastructure using configuration management and IaC tools.
Implement high-availability ClickHouse solutions.
Optimize PostgreSQL clusters for core applications.
Build monitoring and alerting tools to ensure resource optimization.
Respond to platform alerts and user emergencies.
Enhance infrastructure security and partner with compliance assessors.
Collaborate with engineering teams for service rollouts and architectural improvements.

🪄 Skills: PostgreSQL, Python, Kubernetes, Ruby, ClickHouse, Go, Grafana, Prometheus, Linux, Terraform, Ansible

🔥 Senior Site Reliability Engineer

Posted 13 days ago

📍 Colombia, USA

🧭 Contractor

🔍 Software outsourcing

🏢 Company: Teravision Technologies 👥 251-500 💰 about 13 years ago Android iOS Mobile Apps Information Technology Software

🔧 Requirements

Proven experience managing Kubernetes infrastructure.
Familiarity with CI/CD pipelines, particularly TeamCity and tools like SonarQube.
Hands-on experience with AWS services such as S3, Route 53, etc.
Strong understanding of backend systems and infrastructure management.
Excellent English communication skills and a Bachelor’s Degree in Computer Science or equivalent work experience.

💡 Responsibilities

Proven experience managing and maintaining Kubernetes (K8s) infrastructure, including updates, patching, and software configuration management.
Proficiency in troubleshooting, debugging, and ensuring system reliability in production environments.
Prior experience in an on-call role and knowledge of monitoring and alerting tools to support on-call responsibilities.

🪄 Skills: AWS, Kubernetes, CI/CD, Troubleshooting, Debugging

🔥 Senior Site Reliability Engineer

Posted about 1 month ago

📍 United States, Estonia

🔍 B2B tech

🏢 Company: Pactum 👥 51-100 💰 Grant 7 months ago Artificial Intelligence (AI)

🔧 Requirements

Experienced in managing Cloud Infrastructure (GCP) via Infrastructure as Code (Terraform).
Excellent problem-solving skills and experience debugging complex systems and network issues.
Experienced in using and setting up observability tools like OpenTelemetry and Grafana.
Proficient in programming languages such as Node.js, Bash, Kotlin, and Python; open to learning more and writing production code.
Excellent English communication skills.

💡 Responsibilities

Work on cloud-based infrastructure ensuring high availability.
Maintain infrastructure deployment and CI/CD processes.
Improve developer experience for local product development.
Secure access to infrastructure and services.
Continuously improve observability stack.
Support negotiation infrastructure for ML and AI technologies.
Manage PostgreSQL and respond to escalated database issues.
Implement SRE concepts including SLI/SLO and production readiness.

🪄 Skills: Node.js, PostgreSQL, Python, Bash, GCP, Kotlin, CI/CD, Terraform

🔥 Senior Site Reliability Engineer

Posted 2 months ago

📍 Canada, Chile

🔍 Technology

🏢 Company: Launchpad Technologies

🔧 Requirements

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.
Minimum of 5 years of experience in Site Reliability Engineering, DevOps, or similar roles.
Familiarity with monitoring tools and systems.
Proficient in scripting languages such as Python, Bash, or Ruby.
Experience with infrastructure automation tools such as Terraform, Ansible, or Chef.
Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes.
Strong knowledge of cloud platforms such as AWS, GCP, or Azure.
Excellent troubleshooting and analytical skills.
Strong communication skills and the ability to work effectively within a team.

💡 Responsibilities

Develop, maintain, and improve automated deployment, certification, and validation pipelines.
Define, implement, and monitor service level objectives (SLOs), service level agreements (SLAs), and service level indicators (SLIs).
Lead efforts to optimize, improve, and maintain the reliability and performance of the SaaS platform.
Manage third-party services and technologies used to support the SRE discipline.
Collaborate with senior management and the engineering team to lead SRE initiatives and provide updates.
Define and implement an observability framework to provide insights into system performance and behavior.
Implement proactive monitors and alerts to ensure system reliability and performance meet customer expectations.
Own operational incident management, providing support to related teams and individuals during incident resolution.
Identify and implement best practices for system reliability, security, scalability, and performance.
Participate in on-call rotations for system support, troubleshooting, and resolution.
Conduct post-mortem reviews of incidents, identify root cause, and implement remediation steps.
Develop and maintain documentation for systems, processes, and procedures.

🪄 Skills: AWS, Docker, Leadership, Python, Bash, GCP, Kubernetes, Ruby, Azure, Communication Skills, Analytical Skills, DevOps, Terraform, Documentation, Compliance, Troubleshooting

🔥 Senior Site Reliability Engineer

Posted 2 months ago

📍 US and Canada

🧭 Full-Time

💸 150000 - 200000 USD per year

🔍 Healthcare

🏢 Company: Synthesis Health 👥 51-100 💰 Seed about 2 years ago Medical Wellness Health Care

🔧 Requirements

Bachelor's degree or diploma in computer science, engineering, mathematics, or a related field.
At least one year of experience as a Python developer transitioning to an SRE role.
Five years of experience in software development as a DevOps and/or SRE.
Two years of experience in an SRE role with Kubernetes, preferably GKE.
Experience using ArgoCD for rollouts and deployments.
One year of experience with a service mesh such as Istio in a GKE environment.
Proficiency in scripting languages like Python and automation tools like Terraform.
Solid understanding of security best practices for pipelines and cloud environments.
Familiarity with compliance standards like SOC 2, HIPAA.
Strong expertise in CI/CD pipeline management.

💡 Responsibilities

Design and implement automated application deployment processes.
Establish and measure Service Level Objectives (SLO) and Budgets (SLB).
Manage development, testing, staging, pre-production and production environments.
Automate repetitive deployment tasks to improve productivity.
Select, develop, and monitor CI/CD systems.
Oversee software automation across GCP.
Containerize services to optimize resources and deployment speed.
Manage and optimize cloud infrastructure for cost and performance.
Ensure compliance with security standards and maintain disaster recovery plans.
Collaborate with cross-functional teams to improve software delivery.

🪄 Skills: Leadership, Python, Software Development, GCP, Git, Kubernetes, Software Architecture, Analytical Skills, Collaboration, CI/CD, Customer Service, DevOps, Terraform, Organizational Skills, Documentation, Compliance

🔥 Senior Site Reliability Engineer - Platform

Posted 3 months ago

📍 USA

🧭 Full-Time

🔍 Cryptocurrency

🏢 Company: Referrals Only Board

🔧 Requirements

At least 5 years of software engineering experience.
Strong understanding of data structures and algorithms related to performance and reliability.
Fluency in at least one programming language such as Golang, Ruby, Python, or JavaScript.
Strong skills around observability, debugging, and performance tuning.
Ability to debug complex systems and willingness to understand and improve any layer of the stack.
Experience with container orchestration systems (Docker, ECS, EKS) and monitoring tools (Datadog, Graphite, Grafana, Prometheus).
Deep knowledge of UNIX/Linux system internals including system calls, TCP/IP, and debugging tools.
Strong communication skills and ability to explain technical concepts clearly.
Demonstrated critical thinking under pressure.

💡 Responsibilities

Build automation and improve systems to eliminate toil and operations work.
Improve observability, reliability, and availability by defining and measuring key metrics.
Collaborate with the core infrastructure team to performance tune and optimize cloud deployments.
Collaborate with product teams to reduce service disruptions and automate incident response.
Proactively find and analyze reliability problems and design software for improvements.
Facilitate incident response, conduct root cause analysis, and run blameless retrospectives.
Educate and mentor the engineering team to enhance system reliability and promote reliability as a core value.

🪄 Skills: Docker, Python, Blockchain, Ethereum, JavaScript, Kubernetes, Ruby, Algorithms, Data Structures, Go, Communication Skills, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted 3 months ago

📍 United States

💸 130000 - 170000 USD per year

🔍 Data-Powered Marketing Cloud

🏢 Company: Zeta Global 👥 1001-5000 💰 $105,263,174 Post-IPO Equity 5 months ago Information Services Advertising Analytics Marketing

🔧 Requirements

7+ years of experience as an SRE.
3+ years of software development experience, emphasizing automation.
Hands-on experience with Infrastructure as Code (IaC) tools.
Experience with distributed systems and microservices architecture.
Production experience with distributed tracing.
Proficiency in Python and Bash scripting.
Solid understanding of SLIs, SLOs, and error budgets.
Experience with CI/CD platforms like GitOps or Jenkins.
Expertise in incident management and root cause analysis.
Knowledge of modern deployment strategies like Canary and Blue-Green.
Familiarity with resiliency patterns such as circuit breakers and load balancing.
Experience with SQL and NoSQL databases in distributed systems.
Proficiency in statistical analysis related to metrics.
Experience with high-performance and low-latency systems.
Experience with cloud cost optimization strategies.
Familiarity with distributed messaging systems like Kafka.
Strong understanding of security and compliance standards in SRE.

💡 Responsibilities

Implement and manage service level objectives (SLOs), service level indicators (SLIs), and error budgets.
Lead and promote postmortems, driving robust root cause analysis for continuous system improvement.
Analyze historical data to identify areas for improvement.
Implement full observability using tools like OpenTelemetry, Honeycomb, New Relic, or Datadog.
Reduce toil through runbook automation and record key MTTx metrics.
Lead design sessions focusing on capacity planning and automation.
Collaborate with product teams to enhance reliability and engage in strategic initiatives.

🪄 Skills: Python, Software Development, SQL, Bash, Jenkins, Kafka, NoSQL, CI/CD, DevOps, Microservices, Compliance

🔥 Senior Site Reliability Engineer (SRE)

Posted 3 months ago

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

🔧 Requirements

Proficiency in programming languages such as Python, Go, or JavaScript.
5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
Strong understanding of Linux/Unix systems and networking.
Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
Willingness to collaborate and share knowledge with colleagues.
Ability to take responsibility for work and demonstrate accountability.

💡 Responsibilities

Develop and maintain monitoring and alerting solutions.
Respond to incidents, troubleshoot issues, and perform root cause analysis.
Automate repetitive tasks and improve deployment processes.
Develop and maintain tools to support infrastructure and applications.
Analyze system performance and implement optimizations to improve efficiency and reduce latency.
Ensure systems are secure and compliant with relevant standards and regulations.
Maintain comprehensive documentation of systems and processes.
Share knowledge and best practices with team members.
Ensure the reliability, performance, and scalability of databases.
Perform database optimization, maintenance, and troubleshooting.

🪄 Skills: AWS, Docker, PostgreSQL, Python, Elasticsearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD

🔥 Senior Site Reliability Engineer

Posted 5 months ago

📍 United States, Canada

🧭 Full-Time

💸 $139,000 - $218,000 per year

🔍 Web Development

🔧 Requirements

Either a background as an ops engineer with an enthusiasm for code, or a background as a software engineer with an enthusiasm for systems administration.
5+ years of experience building, maintaining, and debugging distributed systems in a customer-facing environment that allows for little to no downtime.
Experience navigating and scaling multi-tier cloud environments on either AWS or GCP.
Experience with container-centric architectures, built with Docker and tools like Kubernetes (EKS, GKE, AKS, OpenShift, etc.), ECS, Docker Swarm, or Mesos.
Experience with infrastructure-as-code tools like Terraform, Pulumi, Ansible, Puppet, or Chef.
Experience in contributing to full-stack applications built using tools like React, Node, and MongoDB.
Enthusiasm for mentoring and sponsoring less-experienced engineers.

💡 Responsibilities

Empower engineers on other teams to take control of their services by maintaining monitoring tooling and collaborating on internal best practices for observability.
Enhance reliability of applications running in Kubernetes by optimizing resource allocation, streamlining upgrade processes, and ensuring scalability and fault tolerance.
Occasionally dive into the main Webflow application in Node, Python, or Go to better discern (and sometimes fix) behavior in production.
Work with peers on Webflow’s Customer Support, Partnerships, and Sales teams to enable customers using Webflow’s services in production.
Participate in and continuously improve on-call and incident response processes.

🪄 Skills: AWS, Docker, Python, GCP, Kubernetes, MongoDB, Go, React
