Site Reliability Engineer

Posted 2024-11-07

πŸ“ Location: United Kingdom, EU

πŸ” Industry: Consultancy

🏒 Company: The Dot Collective

πŸͺ„ Skills: PythonAgileCloud ComputingJavaJavascriptSCRUMCommunication SkillsAnalytical SkillsCollaborationJavaScript

Requirements:
  • A solid understanding of the networking stack and its application in cloud environments.
  • Comfortable with reducing toil through re-architecting or utilizing Python tooling.
Responsibilities:
  • Engage with delivery teams to enable reliable production services.
  • Build observability solutions centered around SLAs and SLOs, maintaining a clear error budget.
  • Support production by actioning root cause analysis and conducting post-mortems.
  • Review architecture designs to ensure production stability.

Related Jobs

πŸ“ Germany and within Europe

🧭 Full-Time

πŸ” Technology / Employee Communication

🏒 Company: Flip App

  • Experience in operating and scaling cloud infrastructures (Azure, AWS, GCP).
  • Deep knowledge of Kubernetes and container solutions.
  • Interest in observability tools such as Prometheus, VictoriaMetrics, Mimir, Loki, ELK.
  • Familiarity with SLO, error budget, and Apdex.
  • Good knowledge of software development languages like Go, Python, Kotlin.
  • Business fluent in English; German is a plus.
  • Experience with infrastructure as code tools (e.g., Pulumi, OpenTofu) and automation tools (e.g., Ansible, Chef).

  • Ensure the availability, performance, and scalability of the infrastructure.
  • Promote practices like CI/CD, observability, and developer experience.
  • Shape goals for scalable systems and observability.
  • Expand cloud infrastructure and Kubernetes cluster.
  • Ensure resilience and safety through zero-downtime rollouts.
  • Create observability through the further development of the LGTM stack.
  • Design, develop, and optimize infrastructure as code using Pulumi in Go.

AWS, Python, Software Development, GCP, Kotlin, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD

Posted 2024-11-07

πŸ“ Germany, Sweden, United Kingdom, Spain, Poland, Austria

🧭 Full-Time

πŸ” Video Games

  • Experience in online operations support
  • Ability to work closely with production and architecture teams
  • Strong collaboration and communication skills

  • Serve as liaison between various development teams and the network operations team
  • Collaborate closely with the production team and system architect
  • Ensure that projects related to Hunt: Showdown are well planned, documented, and implemented
  • Handle operational and project duties.

Leadership, Project Management, Project Coordination, Cross-functional Team Leadership, Operations Management

Posted 2024-11-07

πŸ“ Germany, Sweden, United Kingdom, Spain, Poland, Austria

🧭 Full-Time

πŸ” Video Games

  • Experience in site reliability engineering.
  • Familiarity with network operations.

  • Support the Network Operations team for Hunt: Showdown.
  • Ensure high reliability and performance of the services.

Leadership

Posted 2024-11-07

πŸ“ US, Europe

🧭 Full-Time

πŸ’Έ 175000 - 210000 USD per year

πŸ” Cloud computing, AI

🏒 Company: CoreWeave

  • You have 5+ years of experience in the software or infrastructure engineering industry.
  • Experience with Python, Go or another scripting language.
  • Experience containerizing applications and/or using Kubernetes to manage deployments.
  • Experience with Git.
  • Experience with Linux shell scripting and/or the ability to navigate a *nix-based operating system.
  • Experience creating and maintaining GitHub Actions to automate workflows.
  • You have experience deploying services in production and are interested in learning reliability-at-scale engineering concepts.
  • You have experience refining SDLC, doing code reviews, and providing technical support.

  • Design and implement services and tools to reduce friction and toil for our engineering and operations teams.
  • Streamline repetitive tasks and eliminate bottlenecks to improve development velocity with automated workflows and processes.
  • Partner with developers to understand their pain points and develop tailored solutions that enhance their productivity.
  • Champion best practices and advocate for new tools and technologies to drive ongoing productivity gains.
  • Tackle complex issues related to build systems, testing frameworks, code analysis, and other developer tooling.
  • Enable and evangelize the practice of reliability engineering across CoreWeave's engineering teams.

Python, Software Development, Git, Kubernetes, *Nix, Go, Collaboration

Posted 2024-11-07

πŸ“ United Kingdom

πŸ’Έ 65000 - 80000 GBP per year

πŸ” Online marketplace

🏒 Company: OnBuy

  • Proven experience as a Senior Site Reliability Engineer or in a similar role.
  • Strong proficiency in programming languages such as Python, Go, or Java.
  • Experience with cloud service providers (AWS, Azure, Google Cloud) and container orchestration tools (Kubernetes, Docker).
  • Solid understanding of networking, distributed systems, and microservices architecture.
  • Familiarity with monitoring and logging tools (New Relic, Prometheus, Grafana, ELK stack, GCP logging).
  • Excellent problem-solving skills and ability to work effectively in a team.
  • Strong communication and interpersonal skills for collaboration with cross-functional teams.

  • Design and implement scalable systems to ensure high availability and performance.
  • Develop automated solutions for monitoring, scaling, and system health management.
  • Collaborate with software development teams to identify and resolve reliability issues.
  • Create and maintain documentation related to system architecture, processes, and configurations.
  • Perform incident response and postmortem analysis to improve site reliability and performance.
  • Monitor system performance and make necessary adjustments to ensure optimal functionality.
  • Implement and manage infrastructure as code using tools like Terraform or Ansible.

AWS, Docker, Python, Software Development, GCP, Java, Kubernetes, Azure, Go, Grafana, Prometheus, DevOps, Terraform, Documentation, Microservices

Posted 2024-11-07

πŸ“ UK

🏒 Company: Landmark Information Group - Internal

  • Extensive, proven experience in a technical support and design role.
  • Knowledge and experience providing 3rd level support in a similar field.
  • Ability to analyse problems and determine appropriate solutions.
  • In-depth experience in technical support encompassing varying IaaS, PaaS and SaaS Solutions.
  • Proven experience with at least one CI tool set.
  • Experience in Web Application support, debugging and management.
  • Experience with scripting (e.g., PowerShell, Bash, Python).
  • Experience running production workloads in Azure and at least one other cloud service provider.
  • Exposure to cloud hygiene tooling (e.g., CloudSploit, Security Monkey, AT&T Cybersecurity).
  • Exposure to Cosmos DB.

  • Defining and agreeing reliability metrics with product stakeholders.
  • Providing input on Technical Architecture and solutions to improve all elements of Landmark Infrastructure.
  • Driving reporting based on agreed reliability metrics.
  • Ensuring platform hygiene is maintained and misconfigurations are remediated.
  • Undertaking investigations into technologies, practices, and methodologies outside of the normal Landmark stack.

Python, Software Development, Cloud Computing, Cybersecurity, Git, Kubernetes, Microsoft Azure, Azure, Analytical Skills, Collaboration, CI/CD

Posted 2024-10-24

πŸ“ United Kingdom, Spain, Italy, Portugal, Greece

πŸ” Esports, gaming tournaments, leagues, events

🏒 Company: ESL FACEIT Group

  • Proven experience as a Site Reliability Engineer, DevXP Engineer or Software Engineer.
  • Excellent knowledge of at least one major cloud provider (GCP/AWS/Azure).
  • Experience with cluster management systems, preferably Kubernetes.
  • Knowledge of incident management and troubleshooting.
  • Proficiency in Go and familiarity with at least one other language (Java, Python, Rust).
  • Knowledge of GitOps practices.
  • Production scale experience with MongoDB, Redis, or MySQL.
  • Experience contributing to open source technologies is a plus.

  • Designing, analyzing, and troubleshooting large-scale distributed systems.
  • Maintaining and improving the monitoring and observability tools (Grafana/Prometheus/Thanos/Jaeger).
  • Optimizing existing systems and building infrastructure.
  • Collaborating with software engineering teams to deploy and operate systems.
  • Leading on incident management processes and adoption.
  • Using troubleshooting skills to identify and fix operational issues.
  • Experimenting with and introducing cutting edge technologies.

Kubernetes, MongoDB, MySQL, Go, Grafana, Prometheus, Redis

Posted 2024-10-20

πŸ“ EMEA, APAC, AMER

πŸ” DevSecOps

🏒 Company: GitLab

  • Advanced datastore platform management experience, preferably using Postgres at scale.
  • Advanced Cloud Infrastructure management, preferably using GCP.
  • Advanced experience with Linux.
  • Solid experience with infrastructure and database automation using Terraform.
  • Experience with orchestration tools like Chef and/or Ansible.
  • Experience implementing monitoring at scale using Prometheus and Grafana.
  • Ability to promote GitLab's CREDIT values in work.
  • Superior verbal and written communication skills.
  • Comfortable working asynchronously across timezones.

  • Build: Automate operational tasks like package updates and configuration changes.
  • Maintain: Develop systems for reliable maintenance tasks like library upgrades.
  • Plan: Create monitoring systems to predict capacity needs.
  • Respond: Address user emergencies and support requests.
  • Enhance: Update security measures for GitLab's infrastructure.
  • Partner: Collaborate with internal teams on compliance assessments and improvements.
  • Collaborate: Work with software teams to resolve architectural issues.

PostgreSQL, Software Development, GCP, Grafana, Postgres, Prometheus, Communication Skills, Collaboration

Posted 2024-10-16

πŸ“ APAC, EMEA, AMER

πŸ” DevSecOps Software

🏒 Company: GitLab

  • Advanced datastore platform management experience, preferably using Postgres at scale.
  • Advanced Cloud Infrastructure management, preferably using GCP.
  • Advanced experience with Linux.
  • Solid experience with automation including developing infrastructure and database automations.
  • Experience with Terraform for automation.
  • Experience with orchestration tools like Chef and/or Ansible.
  • Solid experience implementing monitoring at scale, preferably using Prometheus and Grafana.
  • Willingness and ability to promote GitLab's CREDIT Values.
  • Superior verbal and written communication skills.
  • Ability to work asynchronously across timezones and cultures.

  • Build, run, and own the entire lifecycle of the PostgreSQL database engine for GitLab.com.
  • Automate operational tasks including package updates and configuration changes.
  • Develop warning systems for maintenance tasks like library upgrades.
  • Create monitoring and alerting systems to predict capacity needs.
  • Respond to user emergencies and support requests.
  • Implement and enhance security measures for GitLab infrastructure.
  • Partner with compliance assessors for regulatory certifications.
  • Collaborate with engineering teams to resolve architectural bottlenecks.

PostgreSQL, Software Development, GCP, Grafana, Postgres, Prometheus, Communication Skills, Collaboration, Terraform

Posted 2024-10-16

πŸ“ European Economic Area

🧭 Full-Time

πŸ” Cryptocurrency

🏒 Company: Bitso

  • 2+ years of experience with AWS (and others)
  • Strong skills around observability, debugging and performance tuning
  • Experience with fine-tuning containerized Java applications is an advantage
  • In-depth experience with multiple CI tools (e.g. GitHub Actions)
  • In-depth experience with CD tools (e.g. ArgoCD)
  • 2+ years of experience managing production Kubernetes clusters and deploying applications
  • 2+ years of experience writing scripts in different languages (e.g. Java, TS, Bash, Go)
  • Experience managing Infrastructure as Code (Terraform, Crossplane)
  • Experience with software development (ideally in Java) to better understand the needs and problems software engineers face with their services

  • Improve observability, reliability and availability by defining and measuring key metrics.
  • Closely collaborate with our Core Services fleet to performance tune and optimize our architecture.
  • Proactively find and analyze reliability problems across our business units and stack, then design and implement solutions to create improvements.
  • Educate, mentor, and hold accountable the engineering team to improve the reliability of our systems.

AWS, Software Development, Bash, Java, Kubernetes, Go

Posted 2024-09-29