Site Reliability Engineer

Posted 3 months ago

📍 Location: UK

🔍 Industry: Digital experience platforms

🏢 Company: Ably (UK)

🪄 Skills: AWS, Docker, Node.js, PostgreSQL, Software Development, Cassandra, Go, CI/CD, Linux

Requirements:

A deep technical understanding of systems and a commitment to advancing that knowledge.
Understanding of Site Reliability Engineering and infrastructure-as-code principles.
Strong technical expertise in Linux systems administration and networking.
Experience operating production systems on public cloud platforms, particularly AWS.
Proficiency in software development with a record of working in production systems.
Skills in delivering projects from initiation to completion, managing resources and timelines.

Responsibilities:

Maintain and enhance infrastructure services to ensure reliability, scalability, and performance.
Drive infrastructure-as-code practices by developing and managing infrastructure with automation.
Develop software solutions for deployment, orchestration, instance management, health monitoring, and system administration.
Monitor and improve system observability using tools for actionable insights.
Collaborate with cross-functional teams to align infrastructure initiatives with business goals.

Related Jobs

🔥 Site Reliability Engineer IOE: Cardano

Posted 7 days ago

📍 United Kingdom

🔍 Blockchain

🏢 Company: IO Global

🔧 Requirements

Proficiency in Python, Bash, Terraform, Nix for DevOps services.
Extensive experience with AWS, specifically with services like EKS and RDS.
Familiarity with container orchestration (e.g., Kubernetes) is essential.
Hands-on experience with PostgreSQL and its deployment on RDS.
Knowledge of monitoring tools (e.g., Prometheus, Grafana, Loki).
Solid troubleshooting and performance tuning capabilities.
Exceptional communication skills and team collaboration ethic.
Experience with CI/CD (e.g., GitHub Actions, Hydra, Earthly).

💡 Responsibilities

Design, write, and deliver tools and software primarily using Python, Bash, Terraform or Nix to improve the availability, scalability, and efficiency of our services.
Engage in and refine the whole lifecycle of services, from inception and design, through deployment, operation, and continuous improvement.
Practice sustainable incident response and promote blameless postmortems.
Collaborate with the development teams to ensure that solutions are designed with customer experience, scalability, and performance in mind.
Analyze system performance and reliability, offering recommendations for enhancement.
Develop and uphold service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for our services.
Participate in on-call rotations, responding to and mitigating service interruptions and technical challenges.

🪄 Skills: AWS, PostgreSQL, Python, Amazon RDS, AWS EKS, Bash, Kubernetes, Grafana, Prometheus, CI/CD, DevOps, Terraform

🔥 Site Reliability Engineer

Posted 8 days ago

📍 France, Germany, Spain, United Kingdom, United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Platform.sh | 👥 251-500 | 💰 $140,000,000 Series D almost 3 years ago | Internet, Open Source, PaaS, Cloud Management, Software

🔧 Requirements

DevOps, Cloud Operations, or SRE Expertise: A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
Advanced Linux Internals Expertise: Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
Programming Languages: Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
Scripting Skills: Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
Cloud Infrastructure Knowledge: Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
Containerization and Orchestration: Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
Problem-Solving and Collaboration: Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.

💡 Responsibilities

Refine Monitoring and Observability: Enhance system monitoring with tools like Prometheus, Grafana, and ELK Stack, ensuring visibility and alignment with business objectives.
Automate Deployments and Workflows: Transition manual processes to automated solutions using IaC tools (e.g., Terraform, Ansible) to streamline deployments and improve operational efficiency.
Optimize CI/CD Pipelines: Improve pipeline architecture for fast, reliable releases, ensuring scalability and resilience to handle high volumes of changes.
Cloud Infrastructure Management: Help scale cloud-based systems on platforms like AWS, GCP, and Azure while minimizing technical debt and operational complexity.
Incident Response and Post-Mortem: Support incident management and lead post-mortem analysis, ensuring continuous improvement and knowledge sharing.
Collaborate with Cross-Functional Teams: Work closely with engineering and product teams to integrate reliability practices into the development lifecycle and prioritize reliability efforts.
Drive Technical Innovation: Introduce and champion new tools, technologies, and practices that improve system reliability, performance, and scalability.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Site Reliability Engineer

Posted 8 days ago

📍 France, Germany, Spain, the United Kingdom, West Coast in the United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Remote Woman

🔧 Requirements

A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.

💡 Responsibilities

Refine Monitoring and Observability
Automate Deployments and Workflows
Optimize CI/CD Pipelines
Cloud Infrastructure Management
Incident Response and Post-Mortem
Collaborate with Cross-Functional Teams
Drive Technical Innovation

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, Collaboration, CI/CD, Problem Solving, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Senior Site Reliability Engineer, EU, UK or Americas

Posted 9 days ago

📍 Americas, EU, UK

🔍 Cryptocurrency

🏢 Company: Auros | 👥 11-50 | 💰 $17,000,000 about 2 years ago | Cryptocurrency

🔧 Requirements

An SRE/DevOps professional with experience managing and optimising Linux systems in a high-performance 24 x 7 environment.
Cloud management using IaC, with experience in AWS, Azure or Google Cloud.
A background in container management, deployment, and orchestration. Kubernetes experience is good to have; strong Docker skills are required.
Knowledge and experience in managing configuration at scale.
Experience with CI/CD pipelines and version control best practices.
Experience with application and infrastructure instrumentation using tools like Prometheus, OpenTelemetry and eBPF.
Strong knowledge of cloud security and IAM policies is required.
SIEM and threat management experience.
Must know how to secure Mac and Linux endpoints.
Python and Bash experience is a must.

💡 Responsibilities

Participate in the on-call roster to support our trading operations.
Maintain and improve our global infrastructure with high performance and reliability requirements.
Improve and update the security infrastructure of a widely distributed company that operates in a high-risk environment.
Engage and collaborate with other teams around system layout, rollout procedures and improving DevOps processes.
Develop internal tools and automation to accomplish the team’s goals.
Application tuning and troubleshooting; you will keep abreast of changes to trading system features and deployment, providing guidance to developers looking to improve their application performance or reliability.
Active participation in various trading and infrastructure projects.
Work closely with developers, traders and other staff to accomplish our firm’s goals.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Prometheus, CI/CD, Linux, DevOps, Terraform, Ansible

🔥 Staff Site Reliability Engineer

Posted 15 days ago

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 129,347 - 200,824 USD per year

🔍 Software Development

🏢 Company: Wikimedia Foundation | 👥 251-500 | 💰 $2,100,000 Grant over 5 years ago

🔧 Requirements

7+ years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles, with substantial exposure to production-grade machine learning systems.
Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker, GPU acceleration, distributed training systems).
Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible, Helm, Argo CD).
Experience implementing observability, monitoring, and logging for ML systems (e.g., Prometheus, Grafana, ELK stack).
Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow, scikit-learn).

💡 Responsibilities

Designing and implementing robust ML infrastructure used for training, deployment, monitoring, and scaling of machine learning models.
Improving reliability, availability, and scalability of ML infrastructure, ensuring smooth and efficient workflows for internal ML engineers and researchers.
Collaborating closely with ML engineers, product teams, researchers, SREs, and the Wikimedia volunteer community to identify infrastructure requirements, resolve operational issues, and streamline the ML lifecycle.
Proactively monitoring and optimizing system performance, capacity, and security to maintain high service quality.
Providing expert guidance and documentation to teams across Wikimedia to effectively utilize the ML infrastructure and best practices.
Mentoring team members and sharing knowledge on infrastructure management, operational excellence, and reliability engineering.

🪄 Skills: AWS, Docker, Python, Cloud Computing, ElasticSearch, Kubernetes, Machine Learning, MLFlow, Numpy, PyTorch, Grafana, Prometheus, Tensorflow, Linux, DevOps, Terraform, Ansible

🔥 Site Reliability Engineer: Postgres

Posted 15 days ago

📍 Worldwide

🧭 Full-Time

🔍 Software Development

🏢 Company: Supabase | 👥 101-250 | 💰 $80,000,000 Series C 6 months ago | Database, Developer Tools, Artificial Intelligence (AI), Information Services, Information Technology, Software

🔧 Requirements

Experience in designing multi-tenant database solutions, including designing for failover, fault tolerance, and disaster recovery.
Experience orchestrating stateful workloads at scale, or having used a Postgres operator like the ones from Zalando or Crunchy, is a plus.
Experience with tools in the Postgres ecosystem like pgbackrest, barman, Patroni, Stolon, etc.
5+ years of experience in SRE/DevOps/Cloud Infrastructure.
3+ years of experience in building with Golang.
Experience in managing large deployments on AWS.
Knowledge of networking.
Experience with Infrastructure as Code tools.

💡 Responsibilities

Help build the Supabase Postgres offering.
Focus on improving the reliability of database backups and recovery.
Implement high availability with minimal-downtime failover.
Help operationalize database management for our users by implementing maintenance windows, blue-green deployments as part of database upgrades, etc.
Help users debug their databases on a self-serve basis by improving database observability.
Improve the performance of provisioned Postgres databases and expose knobs for our users to further tune their database performance
Improve our system architecture to reduce costs while balancing security and performance.
Design CI/CD systems to speed up deployments with proper change and release management processes.
Handle escalated storage support tickets and share the on-call responsibility for the storage service.

🪄 Skills: AWS, Backend Development, Docker, PostgreSQL, SQL, Kubernetes, CI/CD, Linux, DevOps, Terraform, Networking

🔥 Senior Site Reliability Engineer

Posted 20 days ago

📍 United States, Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Dune | 👥 101-250

🔧 Requirements

Proven expertise in managing and optimising bare-metal infrastructure and containerised environments.
Experience with infrastructure-as-code and orchestration tools.
Strong understanding of system performance, debugging, and optimization across diverse environments.
Ability to collaborate with interdisciplinary teams and communicate complex technical concepts clearly.
Solid foundation in computer science fundamentals and system design.
Ability to work collaboratively in a remote setting, contributing to a positive and inclusive team culture.
5+ years of experience as a systems or infrastructure engineer in a collaborative, problem-solving environment.
Experience with distributed systems and managing large-scale, high-availability environments.
Hands-on experience with Nomad or Kubernetes for workload orchestration in production environments.
Proficiency in infrastructure-as-code tools like Ansible and Terraform, with a proven ability to automate and manage complex systems.
Experience with bare-metal infrastructure.
Proficiency in scripting or programming languages such as Python, Go, or Bash.
Experience with monitoring and observability tools for infrastructure performance.
Familiarity with cloud cost management and performance improvement strategies.
Strong analytical and troubleshooting skills.
Experience working across multiple time zones.

💡 Responsibilities

Collaborate closely with interdisciplinary teams to ensure the infrastructure meets the demanding performance, reliability, and scalability needs of our products.
Embrace the Platform team's mission to empower product teams with efficient, low-overhead services by developing and maintaining robust infrastructure and scalable services.
Design and maintain highly reliable containerized environments, ensuring seamless operation of our critical systems.
Analyze system performance to identify bottlenecks, proposing and implementing solutions to enhance infrastructure efficiency.
Contribute to maintaining high system reliability and scalability, focusing on unique and challenging technical problems.

🪄 Skills: Docker, Python, SQL, Bash, Cloud Computing, Git, Kubernetes, Go, REST API, CI/CD, Linux, DevOps, Terraform, Ansible, Scripting, Debugging

🔥 Site Reliability Engineer

Posted 21 days ago

📍 AMER, EMEA, APAC

🧭 Full-Time

🔍 Blockchain

🏢 Company: asymmetric.re

🔧 Requirements

Extensive experience managing Linux and network infrastructure.
Experience with load balancers and other high-availability technologies (e.g., HAProxy, ALB/ELB).
Prior experience with configuration management tooling (e.g., Ansible, Chef, Puppet, SaltStack).
Excellent troubleshooting fundamentals on both hardware and software.
Development experience in Golang, Python, or Rust.
Experience with continuous integration pipelines and automated deployments.
Experience with OSS monitoring tools (e.g., Grafana, Loki, Prometheus, Alertmanager).

💡 Responsibilities

Manage a globally distributed fleet of blockchain infrastructure services.
Deploy infrastructure as code to dev, staging, and production environments.
Work in a globally distributed, high-performing team to deliver mission-critical services to the financial sector.
Design, architect, deploy, and manage blockchain infrastructure services.
Adhere to the highest standards of integrity, trust, and professionalism.

🪄 Skills: AWS, Docker, Python, Blockchain, Cloud Computing, Kubernetes, Rust, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Networking, Troubleshooting, Ansible, Scripting

🔥 Senior Site Reliability Engineer

Posted 23 days ago

📍 United Kingdom

🧭 Full-Time

🔍 Software Development

🏢 Company: StarRez | 👥 251-500 | 💰 Private about 3 years ago | Consulting, SaaS, Property Management, Software

🔧 Requirements

1+ years of experience working on a SaaS platform.
Proven experience (2+ years) in a Platform Engineering, Site Reliability Engineering or Software Engineering role.
Proficiency in at least one object-oriented programming language (C# preferred).
Production experience operating containerization technologies (Kubernetes).
Proficiency with one or more public cloud providers such as Azure, AWS or GCP
Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation.
Proficiency in scripting and automation using languages like Bash, PowerShell or Python.
Experience with monitoring, observability and logging tools such as Datadog, Prometheus, Grafana, or similar.
Proven track record of maintaining highly-available and performant production environments.
Ability to identify and implement effective mitigation strategies and operational playbooks.

💡 Responsibilities

Provide technical leadership and mentoring within the team through knowledge sharing sessions, pair programming, code reviews and solution design
Identify and implement solutions to improve platform reliability, including the creation of mitigation strategies and operational playbooks.
Implement and maintain monitoring/alerting/logging systems to identify and respond to incidents
Conduct/participate in Root Cause Analyses (RCAs) and blameless post-mortems
Participate in on-call rotations to ensure system reliability and rapid incident response.
Ensure scalability and efficiency of cloud infrastructure and systems to handle traffic and data growth
Conduct performance tests to identify and remediate bottlenecks
Develop and maintain platform solutions, automate infrastructure provisioning, configuration, and management tasks using Infrastructure as Code.
Monitor, review and tune databases to ensure high availability and performance
Collaborate with product engineering teams to design/build fit-for-purpose and observable software
Contribute and collaborate across teams to define Service Level Indicators (SLIs), Service Level Objectives (SLOs) and Service Level Agreements (SLAs) as required

🪄 Skills: AWS, Docker, Python, SQL, Bash, GCP, Kubernetes, C#, Azure, Grafana, Prometheus, CI/CD, DevOps, Terraform, Ansible, Software Engineering, SaaS

🔥 Site Reliability Engineer

Posted 28 days ago

📍 United States, UK, Philippines, Poland, South Africa

🧭 Permanent

🔍 FinTech

🏢 Company: Zepz | 👥 1001-5000 | 💰 $267,000,000 Series F 6 months ago | 🫂 Last layoff over 1 year ago | Mobile Payments, Financial Services, Payments, FinTech

🔧 Requirements

At least 5 years in an SRE, DevOps or engineering role, with a keen interest in solving problems using automation.
Understand SRE and DevOps methodologies.
Experience with Grafana, Loki and Prometheus.
You have experience supporting or developing applications written in Java, Python or Node.js.
You should have an understanding of how to analyze and troubleshoot large-scale distributed systems.
Our Cloud Native platform is hosted on AWS.
You see a problem, you fix a problem.

💡 Responsibilities

Use code to solve problems.
Apply best practices and standards for observability, monitoring, alerting, capacity planning, availability, performance/latency, change, and troubleshooting across all our tech services.
Work closely with feature teams to ensure that services are correctly monitored, change is delivered in a safe and secure way, resilience is built into our product, and our standards and best practices are adopted.
Lead or be involved in the troubleshooting of complex incidents and problems.
Maintain visibility of the end-to-end service to our customers and ensure their journey is stable and consistent across all the microservices and third-party dependencies, using the observability tool you will have implemented with the Engineering teams.
Help the team meet its strategic goals: maintain the highest level of observability, maximize developer velocity while keeping our product reliable, and ensure that we can deliver the highest quality experience to our customers.
Growing together. You’ll review others' work and happily seek feedback on yours to ensure we build a better codebase and sharpen each other's skills.

🪄 Skills: AWS, Node.js, Python, SQL, Agile, Bash, Cloud Computing, Git, Java, Kafka, Kubernetes, ActiveMQ, Grafana, Prometheus, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Troubleshooting, JSON, Ansible, Scripting
