Staff Site Reliability Engineer

Replit
Software Development
Remote - Europe; Secondary Locations: Remote - Ireland, Remote - France, Remote - Italy, Remote - Netherlands, Remote - United Kingdom
Full-Time
Staff
Salary not disclosed

Job Details

Experience
8-10 years
Required Skills
Docker, Python, Kubernetes, Go, Terraform, Distributed Systems

Requirements

  • 8-10 years of experience in Site Reliability Engineering, DevOps, Systems Engineering, or Infrastructure Engineering.
  • Strong programming skills in Python or Go.
  • Deep understanding of distributed systems and service-oriented architecture.
  • Deep experience with container orchestration platforms, specifically Kubernetes.
  • Proven track record of designing and maintaining monitoring and observability solutions.
  • Strong incident management skills with experience leading response for complex systems.
  • Experience with infrastructure as code (e.g., Terraform, Pulumi).
  • Excellent written and verbal communication skills.
  • Strong interpersonal skills with experience mentoring engineers.
  • Willingness to dive into any layer of the stack.

Responsibilities

  • Design, build, and lead the implementation of comprehensive monitoring, logging, and tracing solutions.
  • Define, implement, and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
  • Serve as a senior leader for incident management and response during high-impact incidents.
  • Architect, build, and improve automation to eliminate toil, including CI/CD pipelines and infrastructure as code.
  • Optimize performance for large-scale cloud deployments on Kubernetes, Docker, and GCP.
  • Debug and harden distributed systems to improve robustness and operability.
  • Provide staff-level design reviews for reliability, scalability, and security.
  • Mentor and educate the engineering team to improve system reliability.