Senior Site Reliability Engineer

Posted about 2 hours ago

💎 Seniority level: Senior, 5+ years

📍 Location: United States, Europe, EST, GMT, CEST

🔍 Industry: Software Development

🏢 Company: Dune 👥 101-250

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: Docker, Python, SQL, Bash, Cloud Computing, Git, Kubernetes, Go, REST API, CI/CD, Linux, DevOps, Terraform, Ansible, Scripting, Debugging

Requirements:

Proven expertise in managing and optimising bare-metal infrastructure and containerised environments.
Experience with infrastructure-as-code and orchestration tools.
Strong understanding of system performance, debugging, and optimization across diverse environments.
Ability to collaborate with interdisciplinary teams and communicate complex technical concepts clearly.
Solid foundation in computer science fundamentals and system design.
Ability to work collaboratively in a remote setting, contributing to a positive and inclusive team culture.
5+ years of experience as a systems or infrastructure engineer in a collaborative, problem-solving environment.
Experience with distributed systems and managing large-scale, high-availability environments.
Hands-on experience with Nomad or Kubernetes for workload orchestration in production environments.
Proficiency in infrastructure-as-code tools like Ansible and Terraform, with a proven ability to automate and manage complex systems.
Experience with bare-metal infrastructure.
Proficiency in scripting or programming languages such as Python, Go, or Bash.
Experience with monitoring and observability tools for infrastructure performance.
Familiarity with cloud cost management and performance improvement strategies.
Strong analytical and troubleshooting skills.
Experience working across multiple time zones.

Responsibilities:

Collaborate closely with interdisciplinary teams to ensure the infrastructure meets the demanding performance, reliability, and scalability needs of our products.
Embrace the Platform team's mission to empower product teams with efficient, low-overhead services by developing and maintaining robust infrastructure and scalable services.
Design and maintain highly reliable containerized environments, ensuring seamless operation of our critical systems.
Analyze system performance to identify bottlenecks, proposing and implementing solutions to enhance infrastructure efficiency.
Contribute to maintaining high system reliability and scalability, focusing on unique and challenging technical problems.

Related Jobs

🔥 Senior Site Reliability Engineer

Posted about 3 hours ago

📍 United States

🧭 Full-Time

💸 128,350 - 192,100 USD per year

🔍 Software Development

🏢 Company: ClickHouse 👥 101-250 💰 Series B over 2 years ago · Database, Artificial Intelligence (AI), Big Data, Analytics, Software

🔧 Requirements

At least 8 years of experience in Site Reliability Engineering or a related field.
Previous experience using ClickHouse in production.
Coding experience with Go and/or Python.
Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
Excellent understanding of distributed databases and SQL; experience with ClickHouse in particular is a major plus.
Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm.
Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet.

💡 Responsibilities

Collaborate with engineering teams across ClickHouse to design and implement scalable, secure, and highly available systems.
Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
Ensure all the infrastructure components in ClickHouse Cloud (including Dataplane, Control Plane and ClickHouse Core) have monitoring and alerting in place so that incidents are detected and resolved promptly.
Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud, including working with the support team to communicate with impacted customers.
Continuously improve the reliability and performance of our ClickHouse services.
Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities.
Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime.

AWS, Docker, Python, SQL, Cloud Computing, Kubernetes, Cross-functional Team Leadership, ClickHouse, Go, REST API, Communication Skills, CI/CD, Problem Solving, Linux, DevOps, Terraform, Excellent communication skills, Teamwork, Strong communication skills, Ansible, Debugging

🔥 Senior Site Reliability Engineer - (Remote - Europe)

Posted 1 day ago

📍 Germany, Spain, Portugal

🏢 Company: Jobgether 👥 11-50 💰 $1,493,585 Seed about 2 years ago · Internet

🔧 Requirements

5+ years of experience in a Site Reliability Engineer or similar role.
3+ years of experience with AWS services and container orchestration tools.
2+ years of Kubernetes experience.
Strong knowledge of observability tools and principles (monitoring, logging, tracing).
Hands-on experience with Terraform for infrastructure as code.
Proficiency in at least one programming language (e.g., Python, Go, Java).
Experience in incident management, postmortem analysis, and risk mitigation.
Familiarity with messaging systems like SNS, SQS, and experience with CI/CD tools.

💡 Responsibilities

Develop and maintain systems that are reliable, scalable, and efficient.
Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure optimal system performance.
Conduct blameless post-incident reviews, identify root causes, and implement preventive actions.
Automate operational tasks, incident responses, and contribute to system performance optimizations.
Work with engineering teams to ensure systems are designed for reliability, scalability, and maintainability.
Continuously evaluate and improve system performance, capacity, and cost efficiency.
Participate in the on-call rotation, providing troubleshooting and resolution support for critical issues.

AWS, Python, Java, Kubernetes, Go, CI/CD, RESTful APIs, Linux, Terraform, Scripting

🔥 Senior Site Reliability Engineer - Midnight

Posted 1 day ago

📍 United States

🔍 Blockchain

🏢 Company: IO Global

🔧 Requirements

7+ years of experience in SRE, DevOps, or a related role.
Understanding of SRE best practices, architectures, and methods.
Good knowledge of resiliency patterns and cloud security.
Strong programming proficiency in Python, Golang, or JavaScript.
Demonstrated experience with AWS and modern cloud architectures.
Proficiency in Helm, Terraform, and CI/CD tools like GitHub Actions and ArgoCD.
Hands-on experience with Kubernetes/EKS and GitOps methodologies.
Proven track record with monitoring tools such as Prometheus and OpenTelemetry, as well as familiarity with the LGTM stack or other comparable tools.
Exceptional problem-solving skills with a knack for translating vague requirements into clear, strategic plans.
Ability to engage in technical discussions and be part of the decision-making process.
Strong problem-solving skills and capability to work on complex systems.
Experience working within an Agile environment.
Experience working with a distributed team.
Strong communication and collaboration abilities to work seamlessly across different teams.
A proactive and innovative mindset, with a passion for continuous improvement and operational excellence.

💡 Responsibilities

Design, build, and maintain scalable and highly available systems, primarily on AWS, using best practices.
Manage and optimize Kubernetes clusters for high availability and performance, extending them when it makes sense to expand functionality.
Leverage GitOps principles to automate deployments and manage container orchestration.
Implement and manage CI/CD pipelines that ensure seamless, high-quality deployments: find and remove bottlenecks, improve performance, and work alongside teams to refine feedback loops and automate toil away.
Develop automation tools and scripts to improve operational efficiency.
Implement robust monitoring solutions with Prometheus and related tooling to ensure system health and performance.
Participate in on-call rotations and lead incident response efforts, turning challenges into learning opportunities.
Collaborate with dev teams to define and implement SLOs/SLIs.
Take vague or loosely defined problems, work closely with cross-functional teams, and distill them into clear, actionable plans.
Communicate technical solutions and incident retrospectives effectively across both technical and non-technical stakeholders.
Evaluate and adopt new technologies to keep our systems at the cutting edge; blockchain experience is a particular advantage.
Document processes and best practices, ensuring that knowledge is shared across the team and continuously improved.
Balance effective delivery of goals with a measurably high standard for the results, always applying a layer of polish and due diligence when delivering.

AWS, Docker, Python, Agile, Blockchain, Cloud Computing, JavaScript, Kubernetes, Prometheus, Rust, Communication Skills, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Microservices, Scripting

🔥 Senior Site Reliability Engineer

Posted 3 days ago

📍 United Kingdom

🔍 Software Development

🏢 Company: StarRez 👥 251-500 💰 Private about 3 years ago · Consulting, SaaS, Property Management, Software

🔧 Requirements

1+ years of experience working on a SaaS platform.
Proven experience (2+ years) in a Platform Engineering, Site Reliability Engineering, or Software Engineering role.
Proficiency in at least one object-oriented programming language (C# preferred).
Production experience operating containerization technologies (Kubernetes).
Proficiency with one or more public cloud providers such as Azure, AWS, or GCP.
Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation.
Proficiency in scripting and automation using languages like Bash, PowerShell or Python.
Experience with monitoring, observability and logging tools such as DataDog, Prometheus, Grafana, or similar.
Proven track record of maintaining highly available and performant production environments.
Ability to identify and implement effective mitigation strategies and operational playbooks.

💡 Responsibilities

Provide technical leadership and mentoring within the team through knowledge-sharing sessions, pair programming, code reviews, and solution design.
Identify and implement solutions to improve platform reliability, including the creation of mitigation strategies and operational playbooks.
Implement and maintain monitoring, alerting, and logging systems to identify and respond to incidents.
Conduct or participate in Root Cause Analyses (RCAs) and blameless post-mortems.
Participate in on-call rotations to ensure system reliability and rapid incident response.
Ensure scalability and efficiency of cloud infrastructure and systems to handle traffic and data growth.
Conduct performance tests to identify and remediate bottlenecks.
Develop and maintain platform solutions, and automate infrastructure provisioning, configuration, and management tasks using Infrastructure as Code.
Monitor, review, and tune databases to ensure high availability and performance.
Collaborate with product engineering teams to design and build fit-for-purpose, observable software.
Collaborate across teams to define Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) as required.

AWS, Docker, Python, SQL, Bash, GCP, Kubernetes, C#, Azure, Grafana, Prometheus, CI/CD, DevOps, Terraform, Ansible, Software Engineering, SaaS

🔥 Senior Site Reliability Engineer

Posted 6 days ago

📍 United States, Canada, Latin America

🧭 Full-Time

💸 160,000 - 185,000 USD per year

🔍 Software Development

🏢 Company: Superhuman 👥 51-200 💰 $75,000,000 Series C over 3 years ago 🫂 Last layoff almost 3 years ago · Software Development

🔧 Requirements

6+ years of experience in SRE, DevOps, or systems engineering roles.
Proven experience managing high-availability, mission-critical systems.
Strong proficiency with cloud platforms (GCP, AWS, or Azure).
Hands-on experience with containers and orchestration tools (Docker, Kubernetes).
Expertise in monitoring, logging, and alerting tools (e.g., Metabase, Datadog, Prometheus, Grafana).
Proficiency in scripting/programming languages (Python, Go, Bash, etc.).
Knowledge of database management systems (SQL/NoSQL).
Strong knowledge of networking, security, and distributed systems.
Experience with Infrastructure as Code (Terraform, Ansible, Chef, or Puppet).
Familiarity with version control systems (Git) and CI/CD pipelines (Jenkins, GitLab CI, etc.).
Strong communication skills and ability to work collaboratively across teams.
Problem-solving mindset with a focus on root cause analysis.
Proactive, self-driven, and able to handle high-pressure environments.

💡 Responsibilities

Collaborate with software engineers to design scalable, fault-tolerant systems and services.
Proactively monitor service health, availability, and performance.
Respond to and troubleshoot production issues.
Perform capacity planning and scaling activities.
Automate repetitive tasks to enhance efficiency.
Design and implement disaster recovery plans and high availability strategies.
Collaborate with our security team to ensure infrastructure adheres to best practices and compliance requirements.
Build, maintain, and enhance CI/CD pipelines.
Manage and automate infrastructure provisioning and configuration.
Work closely with development teams to ensure best practices in deployment and release processes.
Champion DevOps culture by mentoring and guiding other engineers in the use of tools and best practices.

AWS, Docker, Python, SQL, Bash, Cloud Computing, GCP, Git, Jenkins, Kubernetes, Azure, Go, Grafana, Prometheus, NoSQL, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Networking, Ansible, Scripting

🔥 Senior Site Reliability Engineer - Linux

Posted 13 days ago

📍 United Kingdom, Canada

🔍 Software Development

🏢 Company: GoDaddy 👥 5001-10000 💰 $800,000,000 Post-IPO Equity about 3 years ago 🫂 Last layoff over 1 year ago · Web Hosting, Domain Registrar, Web Development, Online Portals

🔧 Requirements

Experience with REST APIs
Experience with testing code
Experience with Docker and other container-related technologies
Experience with Python or similar languages
Experience with HashiCorp Vault or other similar tooling

💡 Responsibilities

Engage with engineers and partners to solve problems
Lead by example with high coding standards
Improve the observability of production services
Share expertise by training and guiding other engineers

Backend Development, Docker, Python, Cloud Computing, Kubernetes, Amazon Web Services, REST API, CI/CD, Linux, Ansible

🔥 Senior Site Reliability Engineer

Posted 22 days ago

📍 United States, European timezones

🧭 Full-Time

🔍 Software Development

🏢 Company: Invert 👥 11-50 💰 $20,149,993 Seed 8 months ago · Data Management, SaaS, Application Performance Management

🔧 Requirements

Not stated

💡 Responsibilities

Design, build, and maintain scalable and secure cloud infrastructure as code
Develop and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure software reliability
Enable cost transparency and optimize infrastructure spending
Reduce cognitive load for product engineers by creating streamlined, efficient development workflows
Build and maintain robust CI/CD pipelines that accelerate time from code to customer
Create and maintain intuitive, comprehensive observability solutions for end-to-end system monitoring
Lead and continuously improve our Incident Management process
Participate in the on-call rotation, serving as a First Responder to quickly address and resolve system issues
Develop and maintain incident response playbooks and post-mortem practices

AWS, Docker, CI/CD, Linux, Terraform

🔥 Senior Site Reliability Engineer

Posted 22 days ago

📍 Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Sanity 👥 51-200 💰 Corporate over 2 years ago · Software Development

🔧 Requirements

Proven experience with SRE/DevOps tools, processes, and culture.
Proficient in programming languages like Python, Go, and TypeScript.
5+ years of experience participating in an SRE on-call rotation.
Analytical mindset for designing, diagnosing, and optimizing infrastructure.
Skilled in managing scalable, highly available, cloud-based applications.
Hands-on experience with Kubernetes for orchestrating, scaling, and managing containerized applications in the cloud.
Strong database management skills, particularly with PostgreSQL.
Experience with infrastructure as code, using tools like Terraform.
Proficient in building and maintaining CI/CD pipelines.
Familiarity with observability tools like Prometheus and similar stacks.
Calm and clear-headed in incident and outage situations, with a thoughtful communication style for high-pressure environments.
Open-minded yet discerning when it comes to exploring new technologies.

💡 Responsibilities

Plan and implement a global platform for delivering our software as a service.
Diagnose and troubleshoot complex distributed systems.
Ensure observability and analyze the behavior of our stack.
Handle orchestration, deployment, monitoring, and automation.
Participate in our on-call rotation.

PostgreSQL, Python, Cloud Computing, Elasticsearch, Kubernetes, TypeScript, Go, Prometheus, CI/CD, Linux, DevOps, Terraform, Microservices

🔥 Senior Site Reliability Engineer

Posted 26 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: Assured · Cloud Data Services, B2B, Cloud Security, Cyber Security

🔧 Requirements

Experience in a start-up environment
Experience designing and maintaining highly available database solutions, ideally PostgreSQL
Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
Strong engineering background
Knowledge of Node.js, Python, Docker, PostgreSQL, and GraphQL (a plus, not required)

💡 Responsibilities

Provision infrastructure and tooling
Create automated tooling to maintain the platform
Build methods for monitoring and scaling services
Implement security compliance strategies
Lead and mentor engineering team

AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

🔥 Senior Site Reliability Engineer

Posted about 1 month ago

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

🔧 Requirements

5+ years of AWS experience.
3+ years of Kubernetes experience.
Ruby on Rails experience.
Expert at profiling and benchmarking source code.
Effective at code review and at identifying potential performance problems before they reach production.
Experience with Datadog or other APM tools.
Excellent written and verbal communication skills.

💡 Responsibilities

Manage cloud infrastructure using Infrastructure as Code.
Manage and scale a Ruby on Rails stack.
Implement monitoring tools to improve observability.
Perform code review of new features to ensure they meet performance requirements.
Debug production issues across all levels of the stack.
Plan for the growth of Fleetio’s infrastructure, and optimize and automate it.

AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices
