Site Reliability Engineer
Arbor Education
Education
Location: United Kingdom (Remote)
Full-Time
Salary: 60,000 - 70,000 GBP per year
Job Details
Required Skills
- Docker
- Agile
- Nginx
- Prometheus
- Terraform
- Scripting
- Datadog
Requirements
- Experience in performance monitoring and analysis
- Capacity planning experience
- Scripting and automation skills, with experience in relevant technologies
- Experience with Infrastructure as Code, in particular, Terraform
- Understanding of relational database technologies and their cloud versions (e.g. AWS Aurora)
- Experience with messaging and distributed asynchronous workloads
- Experience with nginx or similar technologies
- Familiarity with SRE processes
- Awareness of DevOps principles such as the Three Ways and the Five Ideals
- Experience with other database technologies and cloud platforms
- Past experience with Enterprise solutions running at scale
- Familiarity with Kanban and Agile development processes
- Experience with containerisation, for example Docker
- Familiarity with software best practices such as Refactoring, Clean Code, Domain-Driven Design and Test-Driven Development
Responsibilities
- Proactively monitor and analyse platform performance.
- Collaborate with engineering teams to address performance bottlenecks and ensure scalability.
- Assist engineering teams with implementing and reviewing SLOs.
- Continually improve observability through monitoring, alerting and dashboards, using tools such as Datadog or Prometheus.
- Ensure the service is highly available and resilient.
- Champion best practices in design for high availability.
- Devise runbooks and run game sessions to test our disaster recovery (DR) plan, high availability (HA) and backups.
- Conduct assessments of capacity and plan for scaling to meet current and future business needs.
- Work closely with the Head of Platform Engineering and Head of SRE to strategise and implement scalable solutions.
- Act as a key player in incident response and troubleshooting, ensuring rapid resolution and minimising downtime.