Site Reliability Engineer

Posted 2024-10-24

💎 Seniority level: Senior, extensive, proven experience

📍 Location: UK

🏢 Company: Landmark Information Group - Internal

⏳ Experience: Extensive, proven experience

🪄 Skills: Python, Software Development, Cloud Computing, Cybersecurity, Git, Kubernetes, Microsoft Azure, Analytical Skills, Collaboration, CI/CD

Requirements:

Extensive, proven experience in a technical support and design role.
Knowledge and experience providing 3rd level support in a similar field.
Ability to analyse problems and determine appropriate solutions.
In-depth experience in technical support encompassing varying IaaS, PaaS and SaaS Solutions.
Proven experience with at least one CI tool set.
Experience in Web Application support, debugging and management.
Experience with scripting (e.g., PowerShell, Bash, Python).
Experience of running production workloads in Azure and at least one other cloud service provider.
Exposure to cloud hygiene tooling (e.g., CloudSploit, Security Monkey, AT&T Cybersecurity).
Exposure to Cosmos DB.

Responsibilities:

Defining and agreeing reliability metrics with product stakeholders.
Providing input on Technical Architecture and solutions to improve all elements of Landmark Infrastructure.
Driving reporting based on agreed reliability metrics.
Ensuring platform hygiene is maintained and misconfigurations are addressed.
Undertaking investigations into technologies, practices, and methodologies outside of the normal Landmark stack.

Related Jobs

🔥 Site Reliability Engineer (SRE) (m/f/d)

Posted 2024-11-07

📍 Germany and within Europe

🧭 Full-Time

🔍 Technology / Employee Communication

🏢 Company: Flip App

🔧 Requirements

Experience in operating and scaling cloud infrastructures (Azure, AWS, GCP).
Deep knowledge of Kubernetes and container solutions.
Interest in observability tools such as Prometheus, VictoriaMetrics, Mimir, Loki, ELK.
Familiarity with SLO, error budget, and Apdex.
Good knowledge of software development languages like Go, Python, Kotlin.
Business fluent in English; German is a plus.
Experience with infrastructure as code tools (e.g., Pulumi, OpenTofu) and automation tools (e.g., Ansible, Chef).

💡 Responsibilities

Ensure the availability, performance, and scalability of the infrastructure.
Promote practices like CI/CD, observability, and developer experience.
Shape goals for scalable systems and observability.
Expand cloud infrastructure and Kubernetes cluster.
Ensure resilience and safety through zero-downtime rollouts.
Create observability through the further development of the LGTM stack.
Design, develop, and optimize infrastructure as code using Pulumi in Go.

AWS, Python, Software Development, GCP, Kotlin, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD

🔥 Senior Site Reliability Engineer

Posted 2024-11-07

📍 Germany, Sweden, United Kingdom, Spain, Poland, Austria

🧭 Full-Time

🔍 Video Games

🔧 Requirements

Experience in online operations support.
Ability to work closely with production and architecture teams.
Strong collaboration and communication skills.

💡 Responsibilities

Serve as liaison between various development teams and the network operations team.
Collaborate closely with the production team and system architect.
Ensure that projects related to Hunt: Showdown are well planned, documented, and implemented.
Handle operational and project duties.

Leadership, Project Management, Project Coordination, Cross-functional Team Leadership, Operations Management

🔥 Lead Site Reliability Engineer

Posted 2024-11-07

📍 Germany, Sweden, United Kingdom, Spain, Poland, Austria

🧭 Full-Time

🔍 Video Games

🔧 Requirements

Experience in site reliability engineering.
Familiarity with network operations.

💡 Responsibilities

Support the Network Operations team for Hunt: Showdown.
Ensure high reliability and performance of the services.

Leadership

🔥 Senior Site Reliability Engineer, Developer Productivity

Posted 2024-11-07

📍 US, Europe

🧭 Full-Time

💸 175000 - 210000 USD per year

🔍 Cloud computing, AI

🏢 Company: CoreWeave

🔧 Requirements

You have 5+ years of experience in the software or infrastructure engineering industry.
Experience with Python, Go or another scripting language.
Experience containerizing applications and/or using Kubernetes to manage deployments.
Experience with Git.
Experience with Linux shell scripting and/or navigating a *nix-based operating system.
Experience creating and maintaining GitHub Actions to automate workflows.
You have experience deploying services in production and are interested in learning reliability-at-scale engineering concepts.
You have experience refining SDLC, doing code reviews, and providing technical support.

💡 Responsibilities

Design and implement services and tools to reduce friction and toil in the lives of our engineering and operations teams.
Streamline repetitive tasks and eliminate bottlenecks to improve development velocity with automated workflows and processes.
Partner with developers to understand their pain points and develop tailored solutions that enhance their productivity.
Champion best practices and advocate for new tools and technologies to drive ongoing productivity gains.
Tackle complex issues related to build systems, testing frameworks, code analysis, and other developer tooling.
Enable and evangelize the practice of reliability engineering across CoreWeave's engineering teams.

Python, Software Development, Git, Kubernetes, *Nix, Go, Collaboration

🔥 Site Reliability Engineer

Posted 2024-11-07

📍 United Kingdom, EU

🔍 Consultancy

🏢 Company: The Dot Collective

🔧 Requirements

A solid understanding of the networking stack and its application in cloud environments.
Comfortable with reducing toil through re-architecting or utilizing Python tooling.

💡 Responsibilities

Engage with delivery teams to enable reliable production services.
Build observability solutions centered around SLAs and SLOs, maintaining a clear error budget.
Support production by actioning root cause analysis and conducting post-mortems.
Review architecture designs to ensure production stability.

Python, Agile, Cloud Computing, Java, JavaScript, Scrum, Communication Skills, Analytical Skills, Collaboration

🔥 Senior Site Reliability Engineer

Posted 2024-11-07

📍 United Kingdom

💸 65000 - 80000 GBP per year

🔍 Online marketplace

🏢 Company: OnBuy

🔧 Requirements

Proven experience as a Senior Site Reliability Engineer or in a similar role.
Strong proficiency in programming languages such as Python, Go, or Java.
Experience with cloud service providers (AWS, Azure, Google Cloud) and container orchestration tools (Kubernetes, Docker).
Solid understanding of networking, distributed systems, and microservices architecture.
Familiarity with monitoring and logging tools (New Relic, Prometheus, Grafana, ELK stack, GCP logging).
Excellent problem-solving skills and ability to work effectively in a team.
Strong communication and interpersonal skills for collaboration with cross-functional teams.

💡 Responsibilities

Design and implement scalable systems to ensure high availability and performance.
Develop automated solutions for monitoring, scaling, and system health management.
Collaborate with software development teams to identify and resolve reliability issues.
Create and maintain documentation related to system architecture, processes, and configurations.
Perform incident response and postmortem analysis to improve site reliability and performance.
Monitor system performance and make necessary adjustments to ensure optimal functionality.
Implement and manage infrastructure as code using tools like Terraform or Ansible.

AWS, Docker, Python, Software Development, GCP, Java, Kubernetes, Azure, Go, Grafana, Prometheus, DevOps, Terraform, Documentation, Microservices

🔥 Site Reliability Engineer - Remote

Posted 2024-10-20

📍 United Kingdom, Spain, Italy, Portugal, Greece

🔍 Esports, gaming tournaments, leagues, events

🏢 Company: ESL FACEIT Group

🔧 Requirements

Proven experience as a Site Reliability Engineer, DevXP Engineer or Software Engineer.
Excellent knowledge of at least one major cloud provider (GCP/AWS/Azure).
Experience with cluster management systems, preferably Kubernetes.
Knowledge of incident management and troubleshooting.
Proficiency in Go and familiarity with at least one other language (Java, Python, Rust).
Knowledge of GitOps practices.
Production scale experience with MongoDB, Redis, or MySQL.
Experience contributing to open source technologies is a plus.

💡 Responsibilities

Designing, analyzing, and troubleshooting large-scale distributed systems.
Maintaining and improving the monitoring and observability tools (Grafana/Prometheus/Thanos/Jaeger).
Optimizing existing systems and building infrastructure.
Collaborating with software engineering teams to deploy and operate systems.
Leading on incident management processes and adoption.
Using troubleshooting skills to identify and fix operational issues.
Experimenting with and introducing cutting edge technologies.

Kubernetes, MongoDB, MySQL, Go, Grafana, Prometheus, Redis

🔥 Intermediate Site Reliability Engineer, Database Operations

Posted 2024-10-16

📍 EMEA, APAC, AMER

🔍 DevSecOps

🏢 Company: GitLab

🔧 Requirements

Advanced datastore platform management experience, preferably using Postgres at scale.
Advanced Cloud Infrastructure management, preferably using GCP.
Advanced experience with Linux.
Solid experience with infrastructure and database automation using Terraform.
Experience with orchestration tools like Chef and/or Ansible.
Experience implementing monitoring at scale using Prometheus and Grafana.
Ability to promote GitLab's CREDIT values in work.
Superior verbal and written communication skills.
Comfortable working asynchronously across timezones.

💡 Responsibilities

Build: Automate operational tasks like package updates and configuration changes.
Maintain: Develop systems for reliable maintenance tasks like library upgrades.
Plan: Create monitoring systems to predict capacity needs.
Respond: Address user emergencies and support requests.
Enhance: Update security measures for GitLab's infrastructure.
Partner: Collaborate with internal teams on compliance assessments and improvements.
Collaborate: Work with software teams to resolve architectural issues.

PostgreSQL, Software Development, GCP, Grafana, Prometheus, Communication Skills, Collaboration

🔥 Intermediate Site Reliability Engineer: Database Operations

Posted 2024-10-16

📍 APAC, EMEA, AMER

🔍 DevSecOps Software

🏢 Company: GitLab

🔧 Requirements

Advanced datastore platform management experience, preferably using Postgres at scale.
Advanced Cloud Infrastructure management, preferably using GCP.
Advanced experience with Linux.
Solid experience with automation including developing infrastructure and database automations.
Experience with Terraform for automation.
Experience with orchestration tools like Chef and/or Ansible.
Solid experience implementing monitoring at scale, preferably using Prometheus and Grafana.
Willingness and ability to promote GitLab's CREDIT Values.
Superior verbal and written communication skills.
Ability to work asynchronously across timezones and cultures.

💡 Responsibilities

Build, run, and own the entire lifecycle of the PostgreSQL database engine for GitLab.com.
Automate operational tasks including package updates and configuration changes.
Develop warning systems for maintenance tasks like library upgrades.
Create monitoring and alerting systems to predict capacity needs.
Respond to user emergencies and support requests.
Implement and enhance security measures for GitLab infrastructure.
Partner with compliance assessors for regulatory certifications.
Collaborate with engineering teams to resolve architectural bottlenecks.

PostgreSQL, Software Development, GCP, Grafana, Prometheus, Communication Skills, Collaboration, Terraform

🔥 Sr Site Reliability Engineer - Europe

Posted 2024-09-29

📍 European Economic Area

🧭 Full-Time

🔍 Cryptocurrency

🏢 Company: Bitso

🔧 Requirements

2+ years of experience with AWS (and others).
Strong skills around observability, debugging and performance tuning.
Experience with fine-tuning containerized Java applications is an advantage.
In-depth experience with multiple CI tools (e.g. GitHub Actions).
In-depth experience with CD tools (e.g. ArgoCD).
2+ years of experience managing production Kubernetes clusters and deploying applications.
2+ years of experience writing scripts in different languages (e.g. Java, TS, Bash, Go).
Experience managing Infrastructure as Code (Terraform, Crossplane).
Experience with software development (ideally in Java) to better understand the needs and problems software engineers face with their services.

💡 Responsibilities

Improve observability, reliability and availability by defining and measuring key metrics.
Closely collaborate with our Core Services fleet to performance tune and optimize our architecture.
Proactively find and analyze reliability problems across our business units and stack, then design and implement solutions to create improvements.
Educate, mentor, and hold accountable the engineering team to improve the reliability of our systems.

AWS, Software Development, Bash, Java, Kubernetes, Go
