Senior Site Reliability Engineer

Posted 8 days ago

💎 Seniority level: Senior, 5+ years

📍 Location: USA

🔍 Industry: Software Development

🏢 Company: Dandy | 👥 501-1000 | Food and Beverage, Food Processing

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: GraphQL, Node.js, PostgreSQL, Cloud Computing, GCP, Kubernetes, TypeScript, Nest.js, CI/CD, DevOps, Terraform, Software Engineering

Requirements:

5+ years of software engineering experience, preferably in a high-growth startup environment
Expertise in Google Cloud Platform and Google Kubernetes Engine
Experience with infrastructure as code platforms (Terraform, Pulumi)
Experience creating and maintaining fully automated CI/CD build processes for multiple environments
Experience designing the architecture and automation of infrastructure within a cloud environment

Responsibilities:

Develop and maintain infrastructure, systems, and tooling to support Dandy’s products in a secure, well-tested, and performant way.
Reinvent an analog experience and disrupt a legacy industry through novel and scalable system design.
Collaborate with Product Engineers and other stakeholders within Engineering, Product and Data to maintain a high bar for quality in a fast-paced, iterative environment.
Advocate for improvements to infrastructure quality, security, and performance.
Craft code that meets our internal standards for style, maintainability, and best practices.
Recognize impediments to our efficiency as a team ("technical debt"), and propose and implement solutions.

Related Jobs

🔥 Senior Site Reliability Engineer

Posted 2 days ago

📍 United States

🧭 Full-Time

🔍 Software Development

🏢 Company: Fetch

🔧 Requirements

1+ year(s) of experience in a software development-oriented role (e.g. Software Engineer, DevOps Engineer, Site Reliability Engineer)
Experience with one or more high-level programming languages (e.g. Java, Python, Go, C/C++)
Experience with cloud infrastructure (AWS strongly preferred)
Experience with containerization technologies (Docker, Kubernetes preferred)
Experience building CI/CD pipelines
Experience with Unix/Linux operating system internals and networking
Experience with analyzing and troubleshooting systems
Experience monitoring and supporting microservice architectures
Bachelor's or higher degree in Computer Science, related technical field, or equivalent practical experience

💡 Responsibilities

Engage in and improve the whole lifecycle of services - from inception and design, through deployment, operation, and refinement
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and readiness reviews
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
Practice sustainable incident response and blameless postmortems by participating in the on-call rotation
Build and support AWS multi-account and multi-region infrastructure using a mix of managed services (e.g. S3, Lambda, RDS, etc.) and containerized infrastructure (e.g. EKS, ECS)
Grow the SRE team by mentoring engineers and participating in the hiring process

🪄 Skills: AWS, Docker, Python, Software Development, SQL, Amazon RDS, AWS EKS, Bash, Cloud Computing, ElasticSearch, Git, Java, Kubernetes, API testing, Go, Java Spring, CI/CD, RESTful APIs, Linux, Terraform, Microservices, Troubleshooting, Ansible, Scripting, Debugging

🔥 Senior Site Reliability Engineer

Posted 3 days ago

📍 United States, Europe

🧭 Full-Time

🔍 Biotechnology

🏢 Company: Invert | 👥 11-50 | 💰 $20,149,993 Seed 8 months ago | Data Management, SaaS, Application Performance Management

🔧 Requirements

Experience in cloud infrastructure
Strong incident management skills
Technical skills in software reliability

💡 Responsibilities

Design, build, and maintain scalable cloud infrastructure
Develop and enforce SLIs and SLOs
Create CI/CD pipelines
Lead Incident Management process

🪄 Skills: AWS, Docker, CI/CD, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted 7 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: Assured | Cloud Data Services, B2B, Cloud Security, Cyber Security

🔧 Requirements

Experience in a start-up environment
Experience designing and maintaining highly available database solutions, ideally PostgreSQL
Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
Strong engineering background
Knowledge of Node.js, Python, Docker, PostgreSQL, GraphQL (not required)

💡 Responsibilities

Provision infrastructure and tooling
Create automated tooling to maintain the platform
Build methods for monitoring and scaling services
Implement security compliance strategies
Lead and mentor engineering team

🪄 Skills: AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

🔥 Senior Site Reliability Engineer

Posted 18 days ago

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

🔧 Requirements

5+ years of AWS experience.
3+ years of Kubernetes experience.
Ruby on Rails experience.
Expert at profiling and benchmarking source code.
Effective at code review and at identifying potential performance problems before they reach production.
Experience with Datadog or other APM tools.
Excellent written and verbal communication skills.

💡 Responsibilities

Manage cloud infrastructure using Infrastructure as Code.
Manage and scale a Ruby on Rails stack.
Implement monitoring tools to improve observability.
Perform code review of new features to ensure they meet performance requirements.
Debug production issues across all levels of the stack.
Plan for the growth of Fleetio’s infrastructure, and optimize and automate it.

🪄 Skills: AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices

🔥 Senior Site Reliability Engineer, Database Operations: ClickHouse

Posted 23 days ago

📍 California, Colorado, Hawaii, New Jersey, New York, Washington, DC, Illinois, Minnesota

💸 117,600 - 252,000 USD per year

🔍 Software Development

🏢 Company: GitLab | 👥 1001-5000 | 💰 $268,000,000 Series E over 5 years ago | 🫂 Last layoff about 2 years ago | Developer Tools, DevOps, Open Source, SaaS, Cloud Security

🔧 Requirements

Advanced database platform management experience, preferably using Postgres and ClickHouse at scale.
Advanced Cloud Infrastructure automation and management, preferably using Ansible, Chef, Terraform, Helm charts, Operators and Kubernetes.
Solid experience with at least one programming language: Go, Ruby or Python.
Advanced experience with Linux.
Extensive on-call experience as an SRE supporting mission critical systems.
Solid incident management experience across all phases.
Solid experience implementing monitoring at scale, preferably Prometheus and Grafana.

💡 Responsibilities

Design, build, and maintain ClickHouse and PostgreSQL clusters.
Provision cloud infrastructure using configuration management and IaC tools.
Implement high-availability ClickHouse solutions.
Optimize PostgreSQL clusters for core applications.
Build monitoring and alerting tools to ensure resource optimization.
Respond to platform alerts and user emergencies.
Enhance infrastructure security and partner with compliance assessors.
Collaborate with engineering teams for service rollouts and architectural improvements.

🪄 Skills: PostgreSQL, Python, Kubernetes, Ruby, ClickHouse, Go, Grafana, Prometheus, Linux, Terraform, Ansible

🔥 Senior Site Reliability Engineer

Posted about 1 month ago

📍 Colombia, USA

🧭 Contract

🔍 Software Development

🏢 Company: Teravision Technologies | 👥 251-500 | 💰 about 13 years ago | Android, iOS, Mobile Apps, Information Technology, Software

🔧 Requirements

Experience managing and maintaining Kubernetes (K8s) infrastructure, including updates, patching, and software configuration management.
Familiarity with CI/CD pipelines, particularly TeamCity, and integrating tools like SonarQube.
Hands-on experience with AWS services such as S3, Route 53, and others.
Strong understanding of backend systems and infrastructure management.
Proficiency in troubleshooting, debugging, and ensuring system reliability in production environments.
Prior experience in an on-call role.
Knowledge of monitoring and alerting tools to support on-call responsibilities.
Bachelor’s Degree in Computer Science or equivalent work experience.

💡 Responsibilities

Not stated

🪄 Skills: AWS, Kubernetes, CI/CD, Troubleshooting, Debugging

🔥 Senior Site Reliability Engineer (SRE) - Disaster Recovery Specialist (m/f/x)

Posted 3 months ago

📍 United States, Canada

🧭 Full-Time

🔍 Software Development

🔧 Requirements

Degree in Computer Science or related field
5+ years experience in site reliability engineering
Proficiency in AWS, Azure, or Google Cloud
Experience with IaC tools like Terraform or CloudFormation

💡 Responsibilities

Develop and document disaster recovery plans and procedures
Collaborate with teams to identify and mitigate risks
Monitor system performance and enhance reliability

🪄 Skills: AWS, Azure, Terraform

🔥 Senior Site Reliability Engineer - Platform

Posted 4 months ago

📍 USA

🧭 Full-Time

🔍 Cryptocurrency

🏢 Company: Referrals Only Board

🔧 Requirements

At least 5 years of software engineering experience.
Strong understanding of data structures and algorithms related to performance and reliability.
Fluency in at least one programming language such as Golang, Ruby, Python, or JavaScript.
Strong skills around observability, debugging, and performance tuning.
Ability to debug complex systems and willingness to understand and improve any layer of the stack.
Experience with container orchestration systems (Docker, ECS, EKS) and monitoring tools (DataDog, Graphite, Grafana, Prometheus).
Deep knowledge of UNIX/Linux system internals including system calls, TCP/IP, and debugging tools.
Strong communication skills and ability to explain technical concepts clearly.
Demonstrated critical thinking under pressure.

💡 Responsibilities

Build automation and improve systems to eliminate toil and operations work.
Improve observability, reliability, and availability by defining and measuring key metrics.
Collaborate with the core infrastructure team to performance tune and optimize cloud deployments.
Collaborate with product teams to reduce service disruptions and automate incident response.
Proactively find and analyze reliability problems and design software for improvements.
Facilitate incident response, conduct root cause analysis, and blameless retrospectives.
Educate and mentor the engineering team to enhance system reliability and promote reliability as a core value.

🪄 Skills: Docker, Python, Blockchain, Ethereum, JavaScript, Kubernetes, Ruby, Algorithms, Data Structures, Go, Communication Skills, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted 4 months ago

📍 United States

💸 130,000 - 170,000 USD per year

🔍 Data-Powered Marketing Cloud

🏢 Company: Zeta Global | 👥 1001-5000 | 💰 $105,263,174 Post-IPO Equity 6 months ago | Information Services, Advertising, Analytics, Marketing

🔧 Requirements

7+ years of experience as an SRE.
3+ years of software development experience, emphasizing automation.
Hands-on experience with Infrastructure as Code (IaC) tools.
Experience with distributed systems and microservices architecture.
Production experience with distributed tracing.
Proficiency in Python and Bash scripting.
Solid understanding of SLIs, SLOs, and error budgets.
Experience with CI/CD tooling and practices such as GitOps or Jenkins.
Expertise in incident management and root cause analysis.
Knowledge of modern deployment strategies like Canary and Blue-Green.
Familiarity with resiliency patterns such as circuit breakers and load balancing.
Experience with SQL and NoSQL databases in distributed systems.
Proficiency in statistical analysis related to metrics.
Experience with high-performance and low-latency systems.
Experience with cloud cost optimization strategies.
Familiarity with distributed messaging systems like Kafka.
Strong understanding of security and compliance standards in SRE.

💡 Responsibilities

Implement and manage service level objectives (SLOs), service level indicators (SLIs), and error budgets.
Lead and promote postmortems, driving robust root cause analysis for continuous system improvement.
Analyze historical data to identify areas for improvement.
Implement full observability using tools like OpenTelemetry, Honeycomb, New Relic, or Datadog.
Reduce toil through runbook automation and record key MTTx metrics.
Lead design sessions focusing on capacity planning and automation.
Collaborate with product teams to enhance reliability and engage in strategic initiatives.

🪄 Skills: Python, Software Development, SQL, Bash, Jenkins, Kafka, NoSQL, CI/CD, DevOps, Microservices, Compliance

🔥 Senior Site Reliability Engineer (SRE)

Posted 4 months ago

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

🔧 Requirements

Proficiency in programming languages such as Python, Go, or JavaScript.
5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
Strong understanding of Linux/Unix systems and networking.
Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
Willingness to collaborate and share knowledge with colleagues.
Ability to take responsibility for work and demonstrate accountability.

💡 Responsibilities

Develop and maintain monitoring and alerting solutions.
Respond to incidents, troubleshoot issues, and perform root cause analysis.
Automate repetitive tasks and improve deployment processes.
Develop and maintain tools to support infrastructure and applications.
Analyze system performance and implement optimizations to improve efficiency and reduce latency.
Ensure systems are secure and compliant with relevant standards and regulations.
Maintain comprehensive documentation of systems and processes.
Share knowledge and best practices with team members.
Ensure the reliability, performance, and scalability of databases.
Perform database optimization, maintenance, and troubleshooting.

🪄 Skills: AWS, Docker, PostgreSQL, Python, ElasticSearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD
