Site Reliability Engineer

Posted 3 months ago

📍 Location: United Kingdom, EU

🔍 Industry: Consultancy

🏢 Company: The Dot Collective | 👥 11-50 | Cloud Computing, Analytics, Information Technology

🪄 Skills: Python, Agile, Cloud Computing, Java, JavaScript, SCRUM, Communication Skills, Analytical Skills, Collaboration

Requirements:
  • A solid understanding of the networking stack and its application in cloud environments.
  • Comfortable with reducing toil through re-architecting or utilizing Python tooling.
Responsibilities:
  • Engage with delivery teams to enable reliable production services.
  • Build observability solutions centered around SLAs and SLOs, maintaining a clear error budget.
  • Support production by actioning root cause analysis and conducting post-mortems.
  • Review architecture designs to ensure production stability.

Related Jobs

📍 United Kingdom

🧭 Contract

🔍 Restaurant industry

NOT STATED
  • Partner with Engineering and Product Managers.
  • Learn and improve system availability.
  • Sharpen execution skills to provide an amazing experience for customers.

AWS, Docker, Python, SQL, CI/CD, DevOps, Microservices

Posted 6 days ago

📍 UK, Spain, Poland

🧭 Full-Time

💸 96,000 - 130,000 GBP per year

🔍 Web development and deployment

Requirements:
  • Significant history in Site Reliability Engineering or similar roles.
  • Deep expertise in cloud architecture and experience with AWS, GCP, or Azure.
  • Track record of driving large-scale technical initiatives.
  • Expertise in managing CI/CD pipelines using relevant tools.
  • Deep expertise in configuration management tools like Ansible, Chef, or Puppet.
  • Proficiency with Kafka or other messaging brokers.
  • Strong experience in database management.
  • Proficiency in programming and scripting languages like Python, Go, or Bash.
  • Strong technical leadership skills.
  • Exceptional communication skills.
  • Comprehensive understanding of reliability engineering principles.
  • Experience establishing technical standards and best practices.
  • Understanding of security best practices and compliance frameworks.
Responsibilities:
  • Champion the architectural vision and technical strategy for Netlify's reliability systems.
  • Foster cross-organizational reliability initiatives.
  • Cultivate and set technical standards for reliability.
  • Act as the technical authority during major incidents.
  • Build relationships with stakeholders to integrate reliability into technical strategy.
  • Mentor senior engineers and tech leads.
  • Design and implement reliability frameworks and tooling.
  • Lead architecture reviews and provide technical oversight.
  • Develop reliability metrics and SLO frameworks.

AWS, Python, Bash, GCP, Jenkins, Kafka, Azure, Go, CI/CD, Ansible

Posted 9 days ago

📍 UK, Spain, Poland

🧭 Full-Time

💸 96,000 - 130,000 GBP per year

🔍 Web development

🏢 Company: Netlify | 👥 11-50 | Information Technology, Web Design, Graphic Design, Software

Requirements:
  • Significant history in Site Reliability Engineering or similar roles.
  • Deep expertise in cloud architecture with experience in AWS, GCP, or Azure.
  • Proven track record of large-scale technical initiatives across multiple teams.
  • Expertise in designing and managing CI/CD pipelines using appropriate tools.
  • Experience with configuration management using tools like Ansible, Chef, or Puppet.
  • Proficiency with Kafka or other messaging brokers in multi-cloud environments.
  • Strong experience in database management for scalable applications.
  • Proficiency in Python, Go, or Bash for automation solutions.
  • Strong technical leadership skills and exceptional communication skills.
  • Comprehensive understanding of reliability engineering principles.
  • Experience establishing technical standards and best practices.
Responsibilities:
  • Champion the architectural vision and technical strategy for Netlify's reliability systems.
  • Foster cross-organizational reliability initiatives and collaborate with multiple engineering teams.
  • Cultivate technical standards and best practices for reliability.
  • Act as the technical authority during major incidents and provide guidance.
  • Strengthen relationships with stakeholders to integrate reliability considerations.
  • Mentor senior engineers and tech leads in systems thinking and reliability engineering.
  • Design and lead the implementation of organization-wide reliability frameworks and tooling.
  • Lead architecture reviews and provide oversight for critical infrastructure projects.
  • Develop reliability metrics and SLO frameworks aligned with business objectives.

AWS, Docker, Python, Bash, GCP, Git, Kafka, Azure, Go, RDBMS, NoSQL, CI/CD, Linux, Ansible

Posted 9 days ago

📍 Germany, Italy, Netherlands, Portugal, Romania, Spain, UK

🔍 Corporate wellness

Requirements:
  • Proven technical experience with AWS cloud services, Kubernetes, and software engineering.
  • Deep knowledge of Kubernetes and its ecosystem.
  • Solid knowledge of observability systems.
  • Experience with operator-managed Infrastructure as Code, preferably Crossplane or Kubernetes Operators.
  • Ability to write software for production environments.
  • Excellent analytical and problem-solving skills, and proven experience in identifying solutions for complex problems.
  • Collaboration and learning-driven mindset.
  • CNCF Kubernetes Certifications (e.g. CKA, CKS, or CKAD).
  • AWS Certifications.
  • Excellent communication skills in both English and Portuguese, both verbally and in writing.
Responsibilities:
  • Help to build a global, secure, scalable, and cost-effective Cloud platform using Kubernetes in AWS.
  • Develop and evolve Kubernetes operators and other cloud-native automation in Kubernetes.
  • Build products and tools enabling engineering teams to create and maintain their cloud resources autonomously.
  • Help to ensure security and compliance by delivering secure products and implementing DevSecOps integrations.
  • Improve observability, reliability, and cost awareness.
  • Support engineering teams in using the products and tools.
  • Build and maintain a modern CI/CD set of tools and services.
  • Keep all the Kubernetes clusters highly available and reliable.
  • Contribute to our product documentation (e.g. user guide, configurations, operations, and troubleshooting procedures).
  • Participate in the definition of standards, RFCs (Request for Comments), guidelines and best practices.
  • Live the mission: inspire and empower others by genuinely caring for your own well-being and that of your colleagues.

AWS, Python, Kubernetes, Ruby, Grafana, Prometheus, CI/CD

Posted 22 days ago

📍 United States, United Kingdom, Spain, Italy, Canada

🔍 Interactive entertainment

🏢 Company: Escape Velocity Entertainment Inc

Requirements:
  • 5+ years of experience in a Site Reliability, DevOps, or Platform engineering role.
  • 5+ years of experience with observability, application monitoring, telemetry collection, and data visualization tools.
  • Experience with GitOps workflows and Helix Core / Perforce.
  • Experience in implementing and maintaining CI/CD systems.
  • Expertise in Infrastructure as Code design using Ansible, Terraform, and CloudFormation.
  • Experience with backend game engines like Pragma, GameLift, or Agones.
  • Experience with capacity planning and FinOps.
  • Proficiency in one or more high-level languages such as Python, Kotlin, JavaScript, or C++.
  • Strong Linux skills and understanding of public cloud services.
Responsibilities:
  • Analyze, implement, and improve complex systems responsible for delivering games to millions of fans.
  • Own the delivery, scalability, and reliability of the cloud-hosted game title.
  • Partner with game teams to advise and implement best practices.
  • Take ownership of projects from start to finish while maintaining quality.
  • Engage with product teams to diagnose and resolve operational issues.
  • Maintain relationships with internal and external partners.
  • Optimize reliability, availability, observability, and cost.
  • Participate in on-call rotations for critical incidents.

Python, JavaScript, Kotlin, C++, CI/CD, Linux, Terraform, Data visualization, Ansible

Posted 22 days ago

📍 United Kingdom

🔍 FinTech

🏢 Company: Valstro | 👥 51-100 | 💰 $23,500,000 8 months ago | Financial Services, FinTech

Requirements:
  • Strong experience in site reliability engineering, systems engineering, or a related field.
  • Proficiency in cloud-based infrastructure (e.g., AWS, Azure, Google Cloud).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana).
  • Expertise in automation and scripting (e.g., Golang, Python, Bash, Terraform).
  • Knowledge of containerization and orchestration (e.g., Docker, Kubernetes).
  • Ability to communicate effectively between stakeholders.
  • Strong troubleshooting and problem-solving skills.
  • Experience enhancing reliability engineering practices.
Responsibilities:
  • Act as key intermediary between engineering, leadership, and vendors.
  • Ensure reliability, availability, and performance of cloud-based trading solutions.
  • Develop and maintain monitoring solutions for system performance.
  • Automate operational tasks to enhance efficiency.
  • Collaborate with development teams for integration and deployment.
  • Respond to incidents and troubleshoot issues.
  • Continuously improve systems and processes.
  • Participate in on-call rotations for 24/7 support.

AWS, Docker, Python, Bash, Kubernetes, Grafana, Prometheus, Terraform

Posted about 1 month ago

📍 United Kingdom

🔍 Cybersecurity

🏢 Company: KnowBe4 | 👥 1001-5000 | 💰 $300,000,000 Post-IPO Equity over 1 year ago | Computer, Security, Cyber Security, Network Security, Software

Requirements:
  • Bachelor’s degree in Computer Science, Information Technology, or a related field.
  • 5+ years equivalent work experience in SRE, DevOps, or infrastructure management may substitute for formal education.
  • Expertise in designing and maintaining automated CI/CD workflows.
  • Strong knowledge of AWS and Azure services.
  • Proficiency in Infrastructure-as-Code tools like Terraform or Ansible.
  • Experience with observability platforms like Prometheus, Grafana, or Datadog.
  • Proficiency in scripting languages like Python or Bash.
  • Ability to lead incident response efforts and conduct root cause analysis.
  • Strong interpersonal skills for collaboration across teams.
Responsibilities:
  • Manage and maintain environments to ensure high availability and security.
  • Design and implement CI/CD pipelines to automate software delivery.
  • Monitor and troubleshoot system performance issues using observability tools.
  • Collaborate with development teams to align infrastructure efforts with project needs.
  • Build and maintain infrastructure as code solutions using tools like Terraform.
  • Manage AWS/Azure services, including ECS/Container Apps and storage solutions.
  • Participate in incident response and conduct post-incident reviews.
  • Automate manual tasks to improve operational efficiency.

AWS, Python, Bash, Azure, Grafana, Prometheus, CI/CD, Terraform

Posted about 2 months ago

📍 United Kingdom

🧭 Full-Time

🔍 Cybersecurity

🏢 Company: KnowBe4 | 👥 1001-5000 | 💰 $300,000,000 Post-IPO Equity over 1 year ago | Computer, Security, Cyber Security, Network Security, Software

Requirements:
  • Bachelor’s degree in Computer Science, Information Technology, or a related field.
  • 3+ years equivalent work experience in SRE, DevOps, or infrastructure management may substitute for formal education.
  • Expertise in designing and maintaining automated pipelines for continuous delivery.
  • Strong knowledge of AWS/Azure services.
  • Proficiency in Terraform, Ansible, or similar tools.
  • Experience with observability platforms.
  • Proficiency in Python, Bash, or other scripting languages.
  • Ability to lead incident response efforts and conduct root cause analysis.
  • Strong interpersonal skills to work effectively across teams and stakeholders.
Responsibilities:
  • Manage and maintain environments to ensure high availability and security.
  • Design and implement CI/CD pipelines to automate software delivery.
  • Monitor and troubleshoot system performance issues using observability tools.
  • Collaborate with development teams to align infrastructure with project needs.
  • Build and maintain infrastructure as code solutions.
  • Manage AWS/Azure services.
  • Participate in incident response and conduct root cause analysis.
  • Automate manual tasks to improve operational efficiency.

AWS, Python, Bash, Azure, Grafana, Prometheus, CI/CD, Terraform

Posted about 2 months ago

📍 Scotland

🧭 Full-Time

🔍 Technology

🏢 Company: Ivanti | 👥 1001-5000 | 💰 Private almost 4 years ago | IT Infrastructure, IT Management, Software

Requirements:
  • BS or higher in Computer Science/Software Engineering or equivalent.
  • Enthusiastic self-starter with 2+ years industry experience.
  • Strong experience in designing, building, and managing fault-tolerant Linux and Windows-based web platforms in the Cloud (e.g., AWS, Azure, GCP).
  • Strong Kubernetes experience.
  • Strong Linux and Windows troubleshooting skills.
  • Strong experience with advanced networking, security, and programming.
  • Able to code in Java, Python, Ruby, advanced Shell Script, or other OOP languages.
  • Strong user of CI/CD tools like Azure DevOps, Chef, Ansible, Jenkins, GitHub Actions.
  • Experience with collaborative source control tools like Git or Subversion.
  • Ability to configure HAProxy, Apache, Nginx, IIS.
  • Ability to configure monitoring tools like Datadog, New Relic, Splunk.
  • Familiar with databases like SQL Server, PostgreSQL, Redis, Kafka, MongoDB, Elasticsearch.
  • Strong understanding of DevOps practices and OOP principles.
  • Experience contributing to application/development projects.
Responsibilities:
  • Deploying, managing, and securing Ivanti’s customer-facing Software-as-a-Service (SaaS) web products in AWS and Azure.
  • Troubleshooting all infrastructure and application issues.
  • Working with geographically dispersed teams to solve problems.
  • Automating common and repetitive tasks.
  • Writing documentation and training material.
  • Training other colleagues.
  • Standing up new services in existing data centers.
  • Participating in an on-call rotation for 24 x 7 coverage.

AWS, Leadership, Python, GCP, Java, Jenkins, Kubernetes, Ruby, Azure, Collaboration, CI/CD, Linux, Documentation

Posted about 2 months ago

📍 EMEA, APAC, AMER

🔍 DevSecOps

🏢 Company: GitLab | 👥 1001-5000 | 💰 $268,000,000 Series E over 5 years ago | 🫂 Last layoff almost 2 years ago | Developer Tools, DevOps, Open Source, SaaS, Cloud Security

Requirements:
  • Advanced datastore platform management experience, preferably using Postgres at scale.
  • Advanced Cloud Infrastructure management, preferably using GCP.
  • Advanced experience with Linux.
  • Solid experience with infrastructure and database automation using Terraform.
  • Experience with orchestration tools like Chef and/or Ansible.
  • Experience implementing monitoring at scale using Prometheus and Grafana.
  • Ability to promote GitLab's CREDIT values in work.
  • Superior verbal and written communication skills.
  • Comfortable working asynchronously across timezones.
Responsibilities:
  • Build: Automate operational tasks like package updates and configuration changes.
  • Maintain: Develop systems for reliable maintenance tasks like library upgrades.
  • Plan: Create monitoring systems to predict capacity needs.
  • Respond: Address user emergencies and support requests.
  • Enhance: Update security measures for GitLab's infrastructure.
  • Partner: Collaborate with internal teams on compliance assessments and improvements.
  • Collaborate: Work with software teams to resolve architectural issues.

PostgreSQL, Software Development, GCP, Grafana, Postgres, Prometheus, Communication Skills, Collaboration

Posted 4 months ago