Staff Site Reliability Engineer

Replit, Software Development

Remote - Europe; Secondary Locations: Remote - Ireland, Remote - France, Remote - Italy, Remote - Netherlands, Remote - United Kingdom
Full-Time
Staff

Salary not disclosed

Job Details

Requirements

8-10 years of experience in Site Reliability Engineering, DevOps, Systems Engineering, or Infrastructure Engineering.
Strong programming skills in Python or Go.
Deep understanding of distributed systems and service-oriented architecture.
Deep experience with container orchestration platforms, specifically Kubernetes.
Proven track record of designing and maintaining monitoring and observability solutions.
Strong incident management skills with experience leading response for complex systems.
Experience with infrastructure as code (e.g., Terraform, Pulumi).
Excellent written and verbal communication skills.
Strong interpersonal skills with experience mentoring engineers.
Willingness to dive into any layer of the stack.

Responsibilities

Design, build, and lead the implementation of comprehensive monitoring, logging, and tracing solutions.
Define, implement, and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
Lead incident management and response as a senior leader during high-impact incidents.
Architect, build, and improve automation to eliminate toil, including CI/CD pipelines and infrastructure as code.
Optimize performance for large-scale cloud deployments on Kubernetes, Docker, and GCP.
Debug and harden distributed systems to improve robustness and operability.
Provide staff-level design reviews for reliability, scalability, and security.
Mentor and educate the engineering team to improve system reliability.
