Site Reliability Engineer

Posted 3 months ago

📍 Location: United Kingdom, EU

🔍 Industry: Consultancy

🏢 Company: The Dot Collective | 👥 11-50 | Cloud Computing, Analytics, Information Technology

🪄 Skills: Python, Agile, Cloud Computing, Java, JavaScript, SCRUM, Communication Skills, Analytical Skills, Collaboration

Requirements:

A solid understanding of the networking stack and its application in cloud environments.
Comfortable with reducing toil through re-architecting or utilizing Python tooling.

Responsibilities:

Engage with delivery teams to enable reliable production services.
Build observability solutions centered around SLAs and SLOs, maintaining a clear error budget.
Support production by actioning root cause analysis and conducting post-mortems.
Review architecture designs to ensure production stability.

Related Jobs

🔥 Senior Site Reliability Engineer [United Kingdom]

Posted 6 days ago

📍 United Kingdom

🧭 Contract

🔍 Restaurant industry

🔧 Requirements

Not stated.

💡 Responsibilities

Partner with Engineering and Product Managers.
Learn and improve system availability.
Sharpen execution skills to provide an amazing experience for customers.

🪄 Skills: AWS, Docker, Python, SQL, CI/CD, DevOps, Microservices

🔥 Staff Infrastructure Site Reliability Engineer (EMEA)

Posted 9 days ago

📍 UK, Spain, Poland

🧭 Full-Time

💸 96,000 - 130,000 GBP per year

🔍 Web development and deployment

🔧 Requirements

Significant history in Site Reliability Engineering or similar roles.
Deep expertise in cloud architecture and experience with AWS, GCP, or Azure.
Track record of driving large-scale technical initiatives.
Expertise in managing CI/CD pipelines using relevant tools.
Deep expertise in configuration management tools like Ansible, Chef, or Puppet.
Proficiency with Kafka or other messaging brokers.
Strong experience in database management.
Proficiency in programming and scripting languages like Python, Go, or Bash.
Strong technical leadership skills.
Exceptional communication skills.
Comprehensive understanding of reliability engineering principles.
Experience establishing technical standards and best practices.
Understanding of security best practices and compliance frameworks.

💡 Responsibilities

Champion the architectural vision and technical strategy for Netlify's reliability systems.
Foster cross-organizational reliability initiatives.
Cultivate and set technical standards for reliability.
Act as the technical authority during major incidents.
Build relationships with stakeholders to integrate reliability into technical strategy.
Mentor senior engineers and tech leads.
Design and implement reliability frameworks and tooling.
Lead architecture reviews and provide technical oversight.
Develop reliability metrics and SLO frameworks.

🪄 Skills: AWS, Python, Bash, GCP, Jenkins, Kafka, Azure, Go, CI/CD, Ansible

🔥 Staff Infrastructure Site Reliability Engineer (EMEA)

Posted 9 days ago

📍 UK, Spain, Poland

🧭 Full-Time

💸 96,000 - 130,000 GBP per year

🔍 Web development

🏢 Company: Netlify | 👥 11-50 | Information Technology, Web Design, Graphic Design, Software

🔧 Requirements

Significant history in Site Reliability Engineering or similar roles.
Deep expertise in cloud architecture with experience in AWS, GCP, or Azure.
Proven track record of large-scale technical initiatives across multiple teams.
Expertise in designing and managing CI/CD pipelines using appropriate tools.
Experience with configuration management using tools like Ansible, Chef, or Puppet.
Proficiency with Kafka or other messaging brokers in multi-cloud environments.
Strong experience in database management for scalable applications.
Proficiency in Python, Go, or Bash for automation solutions.
Strong technical leadership skills and exceptional communication skills.
Comprehensive understanding of reliability engineering principles.
Experience establishing technical standards and best practices.

💡 Responsibilities

Champion the architectural vision and technical strategy for Netlify's reliability systems.
Foster cross-organizational reliability initiatives and collaborate with multiple engineering teams.
Cultivate technical standards and best practices for reliability.
Act as the technical authority during major incidents and provide guidance.
Strengthen relationships with stakeholders to integrate reliability considerations.
Mentor senior engineers and tech leads in systems thinking and reliability engineering.
Design and lead the implementation of organization-wide reliability frameworks and tooling.
Lead architecture reviews and provide oversight for critical infrastructure projects.
Develop reliability metrics and SLO frameworks aligned with business objectives.

🪄 Skills: AWS, Docker, Python, Bash, GCP, Git, Kafka, Azure, Go, RDBMS, NoSQL, CI/CD, Linux, Ansible

🔥 Staff Site Reliability Engineer

Posted 22 days ago

📍 Germany, Italy, Netherlands, Portugal, Romania, Spain, UK

🔍 Corporate wellness

🔧 Requirements

Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
Deep knowledge of Kubernetes and its ecosystem.
Solid knowledge of observability systems.
Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
Ability to write software for production environments.
Excellent analytical and problem-solving skills, and proven experience in identifying solutions for complex problems.
Collaboration and learning-driven mindset.
CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
AWS Certifications.
Excellent communication skills in both English and Portuguese, both verbally and in writing.

💡 Responsibilities

Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
Improve observability, reliability, and cost awareness.
Support engineering teams in using the products and tools.
Build and maintain a modern CI/CD set of tools and services.
Keep all the Kubernetes clusters highly available and reliable.
Contribute to our product documentation (e.g. user guide, configurations, operations, and troubleshooting procedures).
Participate in the definition of standards, RFCs (Request for Comments), guidelines and best practices.
Live the mission: inspire and empower others by genuinely caring for your own well-being and that of your colleagues.

🪄 Skills: AWS, Python, Kubernetes, Ruby, Grafana, Prometheus, CI/CD

🔥 Site Reliability Engineer | North America | Canada | Europe | Fully Remote

Posted 22 days ago

📍 United States, United Kingdom, Spain, Italy, Canada

🔍 Interactive entertainment

🏢 Company: Escape Velocity Entertainment Inc

🔧 Requirements

5+ years of experience in a Site Reliability, DevOps, or Platform engineering role.
5+ years of experience with observability, application monitoring, telemetry collection, and data visualization tools.
Experience with GitOps workflows and Helix Core / Perforce.
Experience in implementing and maintaining CI/CD systems.
Expertise in Infrastructure as Code design using Ansible, Terraform, and CloudFormation.
Experience with backend game engines like Pragma, GameLift, or Agones.
Experience with capacity planning and FinOps.
Proficiency in one or more high-level languages such as Python, Kotlin, JavaScript, or C++.
Strong Linux skills and understanding of public cloud services.

💡 Responsibilities

Analyze, implement, and improve complex systems responsible for delivering games to millions of fans.
Own the delivery, scalability, and reliability of the cloud-hosted game title.
Partner with game teams to advise and implement best practices.
Take ownership of projects from start to finish while maintaining quality.
Engage with product teams to diagnose and resolve operational issues.
Maintain relationships with internal and external partners.
Optimize reliability, availability, observability, and cost.
Participate in on-call rotations for critical incidents.

🪄 Skills: Python, JavaScript, Kotlin, C++, CI/CD, Linux, Terraform, Data visualization, Ansible

🔥 Site Reliability Engineer (SRE)

Posted about 1 month ago

📍 United Kingdom

🔍 FinTech

🏢 Company: Valstro | 👥 51-100 | 💰 $23,500,000 (8 months ago) | Financial Services, FinTech

🔧 Requirements

Strong experience in site reliability engineering, systems engineering, or a related field.
Proficiency in cloud-based infrastructure (e.g., AWS, Azure, Google Cloud).
Experience with monitoring and logging tools (e.g., Prometheus, Grafana).
Expertise in automation and scripting (e.g., Golang, Python, Bash, Terraform).
Knowledge of containerization and orchestration (e.g., Docker, Kubernetes).
Ability to communicate effectively between stakeholders.
Strong troubleshooting and problem-solving skills.
Experience enhancing reliability engineering practices.

💡 Responsibilities

Act as key intermediary between engineering, leadership, and vendors.
Ensure reliability, availability, and performance of cloud-based trading solutions.
Develop and maintain monitoring solutions for system performance.
Automate operational tasks to enhance efficiency.
Collaborate with development teams for integration and deployment.
Respond to incidents and troubleshoot issues.
Continuously improve systems and processes.
Participate in on-call rotations for 24/7 support.

🪄 Skills: AWS, Docker, Python, Bash, Kubernetes, Grafana, Prometheus, Terraform

🔥 Snr. Site Reliability Engineer (Remote) (Position located in Sheffield, United Kingdom)

Posted about 2 months ago

📍 United Kingdom

🔍 Cybersecurity

🏢 Company: KnowBe4 | 👥 1001-5000 | 💰 $300,000,000 Post-IPO Equity (over 1 year ago) | Computer Security, Cyber Security, Network Security, Software

🔧 Requirements

Bachelor’s degree in Computer Science, Information Technology, or a related field.
5+ years equivalent work experience in SRE, DevOps, or infrastructure management may substitute for formal education.
Expertise in designing and maintaining automated CI/CD workflows.
Strong knowledge of AWS and Azure services.
Proficiency in Infrastructure-as-Code tools like Terraform or Ansible.
Experience with observability platforms like Prometheus, Grafana, or Datadog.
Proficiency in scripting languages like Python or Bash.
Ability to lead incident response efforts and conduct root cause analysis.
Strong interpersonal skills for collaboration across teams.

💡 Responsibilities

Manage and maintain environments to ensure high availability and security.
Design and implement CI/CD pipelines to automate software delivery.
Monitor and troubleshoot system performance issues using observability tools.
Collaborate with development teams to align infrastructure efforts with project needs.
Build and maintain infrastructure as code solutions using tools like Terraform.
Manage AWS/Azure services, including ECS/Container Apps and storage solutions.
Participate in incident response and conduct post-incident reviews.
Automate manual tasks to improve operational efficiency.

🪄 Skills: AWS, Python, Bash, Azure, Grafana, Prometheus, CI/CD, Terraform

🔥 Site Reliability Engineer (Remote) (Position located in Sheffield, United Kingdom)

Posted about 2 months ago

📍 United Kingdom

🧭 Full-Time

🔍 Cybersecurity

🏢 Company: KnowBe4 | 👥 1001-5000 | 💰 $300,000,000 Post-IPO Equity (over 1 year ago) | Computer Security, Cyber Security, Network Security, Software

🔧 Requirements

Bachelor’s degree in Computer Science, Information Technology, or related field.
3+ years equivalent work experience in SRE, DevOps, or infrastructure management may substitute for formal education.
Expertise in designing and maintaining automated pipelines for continuous delivery.
Strong knowledge of AWS/Azure services.
Proficiency in Terraform, Ansible, or similar tools.
Experience with observability platforms.
Proficiency in Python, Bash, or other scripting languages.
Ability to lead incident response efforts and conduct root cause analysis.
Strong interpersonal skills to work effectively across teams and stakeholders.

💡 Responsibilities

Manage and maintain environments to ensure high availability and security.
Design and implement CI/CD pipelines to automate software delivery.
Monitor and troubleshoot system performance issues using observability tools.
Collaborate with development teams to align infrastructure with project needs.
Build and maintain infrastructure as code solutions.
Manage AWS/Azure services.
Participate in incident response and conduct root cause analysis.
Automate manual tasks to improve operational efficiency.

🪄 Skills: AWS, Python, Bash, Azure, Grafana, Prometheus, CI/CD, Terraform

🔥 Site Reliability Engineer (Remote, Scotland)

Posted about 2 months ago

📍 Scotland

🧭 Full-Time

🔍 Technology

🏢 Company: Ivanti | 👥 1001-5000 | 💰 Private (almost 4 years ago) | IT Infrastructure, IT Management, Software

🔧 Requirements

BS or higher in Computer Science/Software Engineering or equivalent.
Enthusiastic self-starter with 2+ years industry experience.
Strong experience in designing, building, and managing fault-tolerant Linux and Windows-based web platforms in the Cloud (e.g., AWS, Azure, GCP).
Strong Kubernetes experience.
Strong Linux and Windows troubleshooting skills.
Strong experience with advanced networking, security, and programming.
Able to code in Java, Python, Ruby, advanced Shell Script, or other OOP languages.
Strong user of CI/CD tools like Azure DevOps, Chef, Ansible, Jenkins, and GitHub Actions.
Experience with collaborative source control tools like Git or Subversion.
Ability to configure HAProxy, Apache, Nginx, and IIS.
Ability to configure monitoring tools like Datadog, New Relic, and Splunk.
Familiar with data stores and messaging systems like SQL Server, PostgreSQL, Redis, Kafka, MongoDB, and Elasticsearch.
Strong understanding of DevOps practices and OOP principles.
Experience contributing to application/development projects.

💡 Responsibilities

Deploying, managing, and securing Ivanti’s customer-facing Software-as-a-Service (SaaS) web products in AWS and Azure.
Troubleshooting all infrastructure and application issues.
Working with geographically dispersed teams to solve problems.
Automating common and repetitive tasks.
Writing documentation and training material.
Training other colleagues.
Standing up new services in existing data centers.
Participating in an on-call rotation for 24 x 7 coverage.

🪄 Skills: AWS, Leadership, Python, GCP, Java, Jenkins, Kubernetes, Ruby, Azure, Collaboration, CI/CD, Linux, Documentation

🔥 Intermediate Site Reliability Engineer, Database Operations

Posted 4 months ago

📍 EMEA, APAC, AMER

🔍 DevSecOps

🏢 Company: GitLab | 👥 1001-5000 | 💰 $268,000,000 Series E (over 5 years ago) | 🫂 Last layoff almost 2 years ago | Developer Tools, DevOps, Open Source, SaaS, Cloud Security

🔧 Requirements

Advanced datastore platform management experience, preferably using Postgres at scale.
Advanced Cloud Infrastructure management, preferably using GCP.
Advanced experience with Linux.
Solid experience with infrastructure and database automation using Terraform.
Experience with orchestration tools like Chef and/or Ansible.
Experience implementing monitoring at scale using Prometheus and Grafana.
Ability to promote GitLab's CREDIT values in work.
Superior verbal and written communication skills.
Comfortable working asynchronously across timezones.

💡 Responsibilities

Build: Automate operational tasks like package updates and configuration changes.
Maintain: Develop systems for reliable maintenance tasks like library upgrades.
Plan: Create monitoring systems to predict capacity needs.
Respond: Address user emergencies and support requests.
Enhance: Update security measures for GitLab's infrastructure.
Partner: Collaborate with internal teams on compliance assessments and improvements.
Collaborate: Work with software teams to resolve architectural issues.

🪄 Skills: PostgreSQL, Software Development, GCP, Grafana, Prometheus, Communication Skills, Collaboration
