
Site Reliability Engineer

Posted 2 months ago

📍 Location: UK

🔍 Industry: Digital experience platforms

🏢 Company: Ably (UK)

🪄 Skills: AWS, Docker, Node.js, PostgreSQL, Software Development, Cassandra, Go, CI/CD, Linux

Requirements:
  • A deep technical understanding of systems and a commitment to advancing that knowledge.
  • Understanding of Site Reliability Engineering and infrastructure-as-code principles.
  • Strong technical expertise in Linux systems administration and networking.
  • Experience operating production systems on public cloud platforms, particularly AWS.
  • Proficiency in software development with a record of working in production systems.
  • Skills in delivering projects from initiation to completion, managing resources and timelines.
Responsibilities:
  • Maintain and enhance infrastructure services to ensure reliability, scalability, and performance.
  • Drive infrastructure-as-code practices by developing and managing infrastructure with automation.
  • Develop software solutions for deployment, orchestration, instance management, health monitoring, and system administration.
  • Monitor and improve system observability using tools for actionable insights.
  • Collaborate with cross-functional teams to align infrastructure initiatives with business goals.

Related Jobs

📍 United States, Europe

🧭 Full-Time

🔍 Biotechnology

🏢 Company: Invert | 👥 11-50 | 💰 $20,149,993 Seed, 8 months ago | Data Management, SaaS, Application Performance Management

Requirements:
  • Experience in cloud infrastructure
  • Strong incident management skills
  • Technical skills in software reliability
Responsibilities:
  • Design, build, and maintain scalable cloud infrastructure
  • Develop and enforce SLIs and SLOs
  • Create CI/CD pipelines
  • Lead Incident Management process

🪄 Skills: AWS, Docker, CI/CD, Linux, Terraform

Posted 2 days ago

📍 United Kingdom

🧭 Contract

🔍 SaaS platform accelerating digital transformation in the restaurant industry

Requirements: not stated
Responsibilities: not stated

🪄 Skills: AWS, Docker, Python, SQL, CI/CD, DevOps, Microservices

Posted 28 days ago

📍 UK, Spain, Poland

🧭 Full-Time

💸 96,000 - 130,000 GBP per year

🔍 Web development

🏢 Company: Netlify | 👥 11-50 | Information Technology, Web Design, Graphic Design, Software

Requirements:
  • Significant history in Site Reliability Engineering or similar roles.
  • Deep expertise in cloud architecture with experience in AWS, GCP, or Azure.
  • Proven track record of large-scale technical initiatives across multiple teams.
  • Expertise in designing and managing CI/CD pipelines using appropriate tools.
  • Experience with configuration management using tools like Ansible, Chef, or Puppet.
  • Proficiency with Kafka or other messaging brokers in multi-cloud environments.
  • Strong experience in database management for scalable applications.
  • Proficiency in Python, Go, or Bash for automation solutions.
  • Strong technical leadership skills and exceptional communication skills.
  • Comprehensive understanding of reliability engineering principles.
  • Experience establishing technical standards and best practices.
Responsibilities:
  • Champion the architectural vision and technical strategy for Netlify's reliability systems.
  • Foster cross-organizational reliability initiatives and collaborate with multiple engineering teams.
  • Cultivate technical standards and best practices for reliability.
  • Act as the technical authority during major incidents and provide guidance.
  • Strengthen relationships with stakeholders to integrate reliability considerations.
  • Mentor senior engineers and tech leads in systems thinking and reliability engineering.
  • Design and lead the implementation of organization-wide reliability frameworks and tooling.
  • Lead architecture reviews and provide oversight for critical infrastructure projects.
  • Develop reliability metrics and SLO frameworks aligned with business objectives.

🪄 Skills: AWS, Docker, Python, Bash, GCP, Git, Kafka, Azure, Go, RDBMS, NoSQL, CI/CD, Linux, Ansible

Posted about 1 month ago
🔥 Staff Site Reliability Engineer

📍 Germany, Italy, Netherlands, Portugal, Romania, Spain, UK

🔍 Corporate wellness

Requirements:
  • Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
  • Deep knowledge of Kubernetes and its ecosystem.
  • Solid knowledge of observability systems.
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills, and proven experience in identifying solutions for complex problems.
  • Collaboration and learning-driven mindset.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Excellent communication skills in both English and Portuguese, both verbally and in writing.
Responsibilities:
  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
  • Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
  • Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
  • Improve observability, reliability, and cost awareness.
  • Support engineering teams in using the products and tools.
  • Build and maintain a modern CI/CD set of tools and services.
  • Keep all the Kubernetes clusters highly available and reliable.
  • Contribute to our product documentation (e.g. user guide, configurations, operations, and troubleshooting procedures).
  • Participate in the definition of standards, RFCs (Request for Comments), guidelines and best practices.
  • Live the mission: inspire and empower others by genuinely caring for your own well-being and that of your colleagues.

🪄 Skills: AWS, Python, Kubernetes, Ruby, Grafana, Prometheus, CI/CD

Posted about 1 month ago

📍 United States, United Kingdom, Spain, Italy, Canada

🔍 Interactive entertainment

🏢 Company: Escape Velocity Entertainment Inc

Requirements:
  • 5+ years of experience in a Site Reliability, DevOps, or Platform engineering role.
  • 5+ years of experience with observability, application monitoring, telemetry collection, and data visualization tools.
  • Experience with GitOps workflows and Helix Core / Perforce.
  • Experience in implementing and maintaining CI/CD systems.
  • Expertise in Infrastructure as Code design using Ansible, Terraform, and CloudFormation.
  • Experience with backend game engines like Pragma, GameLift, or Agones.
  • Experience with capacity planning and FinOps.
  • Proficiency in one or more high-level languages such as Python, Kotlin, JavaScript, or C++.
  • Strong Linux skills and understanding of public cloud services.
Responsibilities:
  • Analyze, implement, and improve complex systems responsible for delivering games to millions of fans.
  • Own the delivery, scalability, and reliability of the cloud-hosted game title.
  • Partner with game teams to advise and implement best practices.
  • Take ownership of projects from start to finish while maintaining quality.
  • Engage with product teams to diagnose and resolve operational issues.
  • Maintain relationships with internal and external partners.
  • Optimize reliability, availability, observability, and cost.
  • Participate in on-call rotations for critical incidents.

🪄 Skills: Python, JavaScript, Kotlin, C++, CI/CD, Linux, Terraform, Data visualization, Ansible

Posted about 1 month ago

📍 United Kingdom

🔍 FinTech

🏢 Company: Valstro | 👥 51-100 | 💰 $23,500,000, 8 months ago | Financial Services, FinTech

Requirements:
  • Strong experience in site reliability engineering, systems engineering, or a related field.
  • Proficiency in cloud-based infrastructure (e.g., AWS, Azure, Google Cloud).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana).
  • Expertise in automation and scripting (e.g., Golang, Python, Bash, Terraform).
  • Knowledge of containerization and orchestration (e.g., Docker, Kubernetes).
  • Ability to communicate effectively between stakeholders.
  • Strong troubleshooting and problem-solving skills.
  • Experience enhancing reliability engineering practices.
Responsibilities:
  • Act as key intermediary between engineering, leadership, and vendors.
  • Ensure reliability, availability, and performance of cloud-based trading solutions.
  • Develop and maintain monitoring solutions for system performance.
  • Automate operational tasks to enhance efficiency.
  • Collaborate with development teams for integration and deployment.
  • Respond to incidents and troubleshoot issues.
  • Continuously improve systems and processes.
  • Participate in on-call rotations for 24/7 support.

🪄 Skills: AWS, Docker, Python, Bash, Kubernetes, Grafana, Prometheus, Terraform

Posted 2 months ago

📍 United Kingdom

🔍 Cybersecurity

🏢 Company: KnowBe4 | 👥 1001-5000 | 💰 $300,000,000 Post-IPO Equity, over 1 year ago | Computer, Security, Cyber Security, Network Security, Software

Requirements:
  • Bachelor’s degree in Computer Science, Information Technology, or a related field.
  • 5+ years equivalent work experience in SRE, DevOps, or infrastructure management may substitute for formal education.
  • Expertise in designing and maintaining automated CI/CD workflows.
  • Strong knowledge of AWS and Azure services.
  • Proficiency in Infrastructure-as-Code tools like Terraform or Ansible.
  • Experience with observability platforms like Prometheus, Grafana, or Datadog.
  • Proficiency in scripting languages like Python or Bash.
  • Ability to lead incident response efforts and conduct root cause analysis.
  • Strong interpersonal skills for collaboration across teams.
Responsibilities:
  • Manage and maintain environments to ensure high availability and security.
  • Design and implement CI/CD pipelines to automate software delivery.
  • Monitor and troubleshoot system performance issues using observability tools.
  • Collaborate with development teams to align infrastructure efforts with project needs.
  • Build and maintain infrastructure as code solutions using tools like Terraform.
  • Manage AWS/Azure services, including ECS/Container Apps and storage solutions.
  • Participate in incident response and conduct post-incident reviews.
  • Automate manual tasks to improve operational efficiency.

🪄 Skills: AWS, Python, Bash, Azure, Grafana, Prometheus, CI/CD, Terraform

Posted 2 months ago

📍 United Kingdom

🧭 Full-Time

🔍 Cybersecurity

🏢 Company: KnowBe4 | 👥 1001-5000 | 💰 $300,000,000 Post-IPO Equity, over 1 year ago | Computer, Security, Cyber Security, Network Security, Software

Requirements:
  • Bachelor’s degree in Computer Science, Information Technology, or related field.
  • 3+ years equivalent work experience in SRE, DevOps, or infrastructure management may substitute for formal education.
  • Expertise in designing and maintaining automated pipelines for continuous delivery.
  • Strong knowledge of AWS/Azure services.
  • Proficiency in Terraform, Ansible, or similar tools.
  • Experience with observability platforms.
  • Proficiency in Python, Bash, or other scripting languages.
  • Ability to lead incident response efforts and conduct root cause analysis.
  • Strong interpersonal skills to work effectively across teams and stakeholders.
Responsibilities:
  • Manage and maintain environments to ensure high availability and security.
  • Design and implement CI/CD pipelines to automate software delivery.
  • Monitor and troubleshoot system performance issues using observability tools.
  • Collaborate with development teams to align infrastructure with project needs.
  • Build and maintain infrastructure as code solutions.
  • Manage AWS/Azure services.
  • Participate in incident response and conduct root cause analysis.
  • Automate manual tasks to improve operational efficiency.

🪄 Skills: AWS, Python, Bash, Azure, Grafana, Prometheus, CI/CD, Terraform

Posted 2 months ago

📍 Scotland

🧭 Full-Time

🔍 Technology

🏢 Company: Ivanti | 👥 1001-5000 | 💰 Private, almost 4 years ago | IT Infrastructure, IT Management, Software

Requirements:
  • BS or higher in Computer Science/Software Engineering or equivalent.
  • Enthusiastic self-starter with 2+ years industry experience.
  • Strong experience in designing, building, and managing fault-tolerant Linux and Windows-based web platforms in the Cloud (e.g., AWS, Azure, GCP).
  • Strong Kubernetes experience.
  • Strong Linux and Windows troubleshooting skills.
  • Strong experience with advanced networking, security, and programming.
  • Able to code in Java, Python, Ruby, advanced Shell Script, or other OOP languages.
  • Strong user of CI/CD tools like Azure DevOps, Chef, Ansible, Jenkins, GitHub Actions.
  • Experience with collaborative source control tools like Git or Subversion.
  • Ability to configure HAProxy, Apache, Nginx, IIS.
  • Ability to configure monitoring tools like Datadog, New Relic, Splunk.
  • Familiar with databases like SQL Server, PostgreSQL, Redis, Kafka, MongoDB, Elasticsearch.
  • Strong understanding of DevOps practices and OOP principles.
  • Experience contributing to application/development projects.
Responsibilities:
  • Deploying, managing, and securing Ivanti’s customer-facing Software-as-a-Service (SaaS) web products in AWS and Azure.
  • Troubleshooting all infrastructure and application issues.
  • Working with geographically dispersed teams to solve problems.
  • Automating common and repetitive tasks.
  • Writing documentation and training material.
  • Training other colleagues.
  • Standing up new services in existing data centers.
  • Participating in an on-call rotation for 24 x 7 coverage.

🪄 Skills: AWS, Leadership, Python, GCP, Java, Jenkins, Kubernetes, Ruby, Azure, Collaboration, CI/CD, Linux, Documentation

Posted 2 months ago

📍 Germany, United Kingdom, Netherlands

🔍 Technology/Cloud services

🏢 Company: Vercel | 👥 251-500 | 💰 $150,000,000 Series D, about 3 years ago | Internet, Developer Platform, Apps, Software

Requirements:
  • At least 3 years experience in an SRE role, or 5 years in a related role such as platform engineering.
  • Firm grasp of SRE philosophy with practical experience in system design and improvements.
  • Strong accountability and commitment to problem solving with a desire to identify root causes.
  • Proactive engagement with development teams to influence software design and practices.
  • Ability to manage risk and make sound decisions.
  • Demonstrated ability to plan and deliver long-term projects.
  • Experience with distributed system design.
  • Experience with Containers, Virtual Machines, and Linux.
  • Bonus: Experience with Terraform and/or Golang.
Responsibilities:
  • Ensure that products are built for reliability and scale through involvement in design, development, and deployment.
  • Engage in incident management and conduct blameless postmortems to drive risk mitigation.
  • Enhance reliability, performance, and efficiency through system analysis and engineering improvements.
  • Develop automated systems for software delivery, failover, and capacity management.

🪄 Skills: Docker, Bash, Kubernetes, *Nix, Go, Linux, Terraform

Posted 3 months ago