Senior Site Reliability Engineer

Posted 4 months ago

💎 Seniority level: Senior, 7+ years as an SRE, 3+ years in software development

📍 Location: United States

💸 Salary: 130,000 - 170,000 USD per year

🔍 Industry: Data-Powered Marketing Cloud

🏢 Company: Zeta Global | 👥 1001-5000 | 💰 $105,263,174 Post-IPO Equity, 6 months ago | Information Services, Advertising, Analytics, Marketing

⏳ Experience: 7+ years as an SRE, 3+ years in software development

🪄 Skills: Python, Software Development, SQL, Bash, Jenkins, Kafka, NoSQL, CI/CD, DevOps, Microservices, Compliance

Requirements:

7+ years of experience as an SRE.
3+ years of software development experience, emphasizing automation.
Hands-on experience with Infrastructure as Code (IaC) tools.
Experience with distributed systems and microservices architecture.
Production experience with distributed tracing.
Proficiency in Python and Bash scripting.
Solid understanding of SLIs, SLOs, and error budgets.
Experience with CI/CD tooling and practices such as Jenkins or GitOps.
Expertise in incident management and root cause analysis.
Knowledge of modern deployment strategies like Canary and Blue-Green.
Familiarity with resiliency patterns such as circuit breakers and load balancing.
Experience with SQL and NoSQL databases in distributed systems.
Proficiency in statistical analysis related to metrics.
Experience with high-performance and low-latency systems.
Experience with cloud cost optimization strategies.
Familiarity with distributed messaging systems like Kafka.
Strong understanding of security and compliance standards in SRE.

Responsibilities:

Implement and manage service level objectives (SLOs), service level indicators (SLIs), and error budgets.
Lead and promote postmortems, driving robust root cause analysis for continuous system improvement.
Analyze historical data to identify areas for improvement.
Implement full observability using tools like OpenTelemetry, Honeycomb, New Relic, or Datadog.
Reduce toil through runbook automation and record key MTTx metrics.
Lead design sessions focusing on capacity planning and automation.
Collaborate with product teams to enhance reliability and engage in strategic initiatives.

Related Jobs

🔥 Senior Site Reliability Engineer

Posted about 9 hours ago

📍 United States, Europe

🧭 Full-Time

🔍 Biotechnology

🏢 Company: Invert | 👥 11-50 | 💰 $20,149,993 Seed, 8 months ago | Data Management, SaaS, Application Performance Management

🔧 Requirements

Experience in cloud infrastructure
Strong incident management skills
Technical skills in software reliability

💡 Responsibilities

Design, build, and maintain scalable cloud infrastructure
Develop and enforce SLIs and SLOs
Create CI/CD pipelines
Lead Incident Management process

AWS, Docker, CI/CD, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted 5 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: Assured | Cloud Data Services, B2B, Cloud Security, Cyber Security

🔧 Requirements

Experience in a start-up environment
Experience designing and maintaining highly available database solutions, ideally PostgreSQL
Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
Strong engineering background
Knowledge of Node.js, Python, Docker, PostgreSQL, and GraphQL (helpful but not required)

💡 Responsibilities

Provision infrastructure and tooling
Create automated tooling to maintain the platform
Build methods for monitoring and scaling services
Implement security compliance strategies
Lead and mentor engineering team

AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

🔥 Senior Site Reliability Engineer

Posted 16 days ago

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

🔧 Requirements

5+ years of AWS Experience.
3+ years Kubernetes Experience.
Ruby on Rails experience.
Expert at profiling and benchmarking source code.
Effective at code review and at identifying potential performance problems before they reach production.
Experience with Datadog or other APM tools.
Excellent written and verbal communication skills.

💡 Responsibilities

Manage cloud infrastructure using Infrastructure as Code.
Manage and scale a Ruby on Rails stack.
Implement monitoring tools to improve observability.
Perform code review of new features to ensure they meet performance requirements.
Debug production issues across all levels of the stack.
Plan for the growth of Fleetio’s infrastructure, and optimize and automate it.

AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices

🔥 Senior Site Reliability Engineer, Database Operations: ClickHouse

Posted 20 days ago

📍 California, Colorado, Hawaii, New Jersey, New York, Washington, DC, Illinois, Minnesota

💸 117,600 - 252,000 USD per year

🔍 Software Development

🏢 Company: GitLab | 👥 1001-5000 | 💰 $268,000,000 Series E, over 5 years ago | 🫂 Last layoff about 2 years ago | Developer Tools, DevOps, Open Source, SaaS, Cloud Security

🔧 Requirements

Advanced database platform management experience, preferably using PostgreSQL and ClickHouse at scale.
Advanced Cloud Infrastructure automation and management, preferably using Ansible, Chef, Terraform, Helm charts, Operators and Kubernetes.
Solid experience with at least one programming language: Go, Ruby or Python.
Advanced experience with Linux.
Extensive on-call experience as an SRE supporting mission critical systems.
Solid incident management experience across all phases.
Solid experience implementing monitoring at scale, preferably Prometheus and Grafana.

💡 Responsibilities

Design, build, and maintain ClickHouse and PostgreSQL clusters.
Provision cloud infrastructure using configuration management and IaC tools.
Implement high-availability ClickHouse solutions.
Optimize PostgreSQL clusters for core applications.
Build monitoring and alerting tools to ensure resource optimization.
Respond to platform alerts and user emergencies.
Enhance infrastructure security and partner with compliance assessors.
Collaborate with engineering teams for service rollouts and architectural improvements.

PostgreSQL, Python, Kubernetes, Ruby, ClickHouse, Go, Grafana, Prometheus, Linux, Terraform, Ansible

🔥 Senior Site Reliability Engineer

Posted 28 days ago

📍 Colombia, USA

🧭 Contract

🔍 Software Development

🏢 Company: Teravision Technologies | 👥 251-500 | 💰 about 13 years ago | Android, iOS, Mobile Apps, Information Technology, Software

🔧 Requirements

Experience managing and maintaining Kubernetes (K8s) infrastructure
Familiarity with CI/CD pipelines, particularly TeamCity, and integrating tools like SonarQube
Hands-on experience with AWS services such as S3, Route 53, and others
Strong understanding of backend systems and infrastructure management
Proficiency in troubleshooting, debugging, and ensuring system reliability in production environments
Prior experience in an on-call role
Knowledge of monitoring and alerting tools

💡 Responsibilities

Manage and maintain Kubernetes (K8s) infrastructure, including updates, patching, and software configuration management.
Troubleshoot, debug, and ensure system reliability in production environments.
Take part in on-call duties, using monitoring and alerting tools to support on-call responsibilities.

AWS, Kubernetes, CI/CD, Troubleshooting, Debugging

🔥 Senior Site Reliability Engineer | Remote US

Posted about 1 month ago

📍 United States

🔍 Cybersecurity

🔧 Requirements

Must be a self-starter with a passion for cloud technology.
Strong problem-solving abilities are essential.
Experience in major public clouds and automation is required.

💡 Responsibilities

As a Senior Site Reliability Engineer within the Cloud Services group, you will be responsible for operating cutting-edge offerings from Cloud Service Providers.
You will directly support leading cloud software companies to enhance the reliability and scalability of their SaaS products.
This role entails problem-solving and ensuring seamless service to large enterprises and government agencies.

AWS, Docker, Python, Cloud Computing, Kubernetes, DevOps, Terraform

🔥 Senior Site Reliability Engineer (SRE)

Posted 3 months ago

📍 United States, Canada

🧭 Contract

🔍 Site Reliability Engineering

🔧 Requirements

5-7 years in Site Reliability Engineering
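Experience with Design for Reliability (DFR), Failure Mode and Effects Analysis (FMEA), and Mean Time Between Failures (MTBF) methodologies
Proficiency with monitoring and alerting tools such as Datadog and PagerDuty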
Strong coding skills in languages used in SRE

💡 Responsibilities

Identify and resolve complex bugs
Write and maintain code for system reliability
Investigate complex system issues
Design and build fault-tolerant systems
Develop and maintain reliability standards

Python, Debugging

🔥 Senior Site Reliability Engineer - Platform

Posted 4 months ago

📍 USA

🧭 Full-Time

🔍 Cryptocurrency

🏢 Company: Referrals Only Board

🔧 Requirements

At least 5 years of software engineering experience.
Strong understanding of data structures and algorithms related to performance and reliability.
Fluency in at least one programming language such as Golang, Ruby, Python, or JavaScript.
Strong skills around observability, debugging, and performance tuning.
Ability to debug complex systems and willingness to understand and improve any layer of the stack.
Experience with container orchestration systems (Docker, ECS, EKS) and monitoring tools (Datadog, Graphite, Grafana, Prometheus).
Deep knowledge of UNIX/Linux system internals including system calls, TCP/IP, and debugging tools.
Strong communication skills and ability to explain technical concepts clearly.
Demonstrated critical thinking under pressure.

💡 Responsibilities

Build automation and improve systems to eliminate toil and operations work.
Improve observability, reliability, and availability by defining and measuring key metrics.
Collaborate with the core infrastructure team to tune the performance of cloud deployments and optimize them.
Collaborate with product teams to reduce service disruptions and automate incident response.
Proactively find and analyze reliability problems and design software for improvements.
Facilitate incident response, conduct root cause analysis, and run blameless retrospectives.
Educate and mentor the engineering team to enhance system reliability and promote reliability as a core value.

Docker, Python, Blockchain, Ethereum, JavaScript, Kubernetes, Ruby, Algorithms, Data Structures, Go, Communication Skills, Linux, Terraform

🔥 Senior Site Reliability Engineer (SRE)

Posted 4 months ago

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

🔧 Requirements

Proficiency in programming languages such as Python, Go, or JavaScript.
5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
Strong understanding of Linux/Unix systems and networking.
Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
Willingness to collaborate and share knowledge with colleagues.
Ability to take responsibility for work and demonstrate accountability.

💡 Responsibilities

Develop and maintain monitoring and alerting solutions.
Respond to incidents, troubleshoot issues, and perform root cause analysis.
Automate repetitive tasks and improve deployment processes.
Develop and maintain tools to support infrastructure and applications.
Analyze system performance and implement optimizations to improve efficiency and reduce latency.
Ensure systems are secure and compliant with relevant standards and regulations.
Maintain comprehensive documentation of systems and processes.
Share knowledge and best practices with team members.
Ensure the reliability, performance, and scalability of databases.
Perform database optimization, maintenance, and troubleshooting.

AWS, Docker, PostgreSQL, Python, Elasticsearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD

🔥 Senior Site Reliability Engineer

Posted 5 months ago

📍 North America

🧭 Full-Time

🔍 Incident Management Platform

🏢 Company: Rootly | 👥 11-50 | 💰 $12,000,000 Series A, over 1 year ago | Developer Tools, Developer Platform, Productivity Tools, SaaS, Information Technology, Software

🔧 Requirements

You have 5+ years of experience in an SRE or Infrastructure Engineering role.
You have 5+ years of experience writing software as a SWE or in a software-heavy SRE role.
You have strong technical knowledge of cloud infrastructure, distributed systems, and reliability practices.
You’ve supported web or RPC services at significant scale.
You have experience solving infrastructure problems by writing software.
You have a big-picture perspective on systems and tools.
You can collaborate with other Engineering teams to understand their systems and help to improve them.

💡 Responsibilities

Participate in an on-call rotation to support critical Rootly services, and in some cases be on call with software teams.
Participate in the definition and management of SLOs and error budgets for the Engineering teams that own services in production.
Build tools to support our processes.
Embed with feature delivery software teams to build and enhance observability, reliability, and availability of those services.
Work with other teams around Engineering to understand their systems and their challenges at the code level and identify improvements in Rootly Infrastructure to improve the services they own (contribute code where possible).

AWS, Backend Development, Software Development, Cloud Computing, Git, Kubernetes, Amazon Web Services, CI/CD
