Staff Site Reliability Engineer
Replit, Software Development
Remote - Europe; Secondary Locations: Remote - Ireland, Remote - France, Remote - Italy, Remote - Netherlands, Remote - United Kingdom
Full-Time
Staff
Salary not disclosed
Job Details
- Experience: 8-10 years
- Required Skills: Docker, Python, Kubernetes, Go, Terraform, Distributed Systems
Requirements
- 8-10 years of experience in Site Reliability Engineering, DevOps, Systems Engineering, or Infrastructure Engineering.
- Strong programming skills in Python or Go.
- Deep understanding of distributed systems and service-oriented architecture.
- Deep experience with container orchestration platforms, specifically Kubernetes.
- Proven track record of designing and maintaining monitoring and observability solutions.
- Strong incident management skills with experience leading response for complex systems.
- Experience with infrastructure as code (e.g., Terraform, Pulumi).
- Excellent written and verbal communication skills.
- Strong interpersonal skills with experience mentoring engineers.
- Willingness to dive into any layer of the stack.
Responsibilities
- Design, build, and lead the implementation of comprehensive monitoring, logging, and tracing solutions.
- Define, implement, and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
- Lead incident management and response, serving as a senior leader during high-impact incidents.
- Architect, build, and improve automation to eliminate toil, including CI/CD pipelines and infrastructure as code.
- Optimize performance for large-scale cloud deployments on Kubernetes, Docker, and GCP.
- Debug and harden distributed systems to improve robustness and operability.
- Provide staff-level design reviews for reliability, scalability, and security.
- Mentor and educate the engineering team to improve system reliability.