Senior Site Reliability Engineer
United States, Full-Time, Senior
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: Kubernetes, Azure, Grafana, Azure DevOps, Datadog
Requirements
- 5+ years of experience in SRE, Platform, or Production Engineering
- Strong hands-on experience with Kubernetes and production environments
- Experience with Azure and Azure DevOps
- Experience with monitoring tools such as Datadog
- Strong understanding of incident management and root cause analysis
- Ability to build practical monitoring and alerting systems
- Experience with AI or LLM pipelines
- Experience building monitoring platforms across multiple systems
- Experience with Grafana
- Experience working in large-scale or distributed environments
Responsibilities
- Build and maintain a central monitoring and alerting layer for AI applications and pipelines
- Define and implement SLIs, alerts, and operational dashboards
- Manage incidents including triage, coordination, root cause analysis, and prevention
- Standardise telemetry across systems including latency, throughput, and failures
- Optimise CI/CD pipelines and introduce quality gates for reliability
- Work closely with engineering teams to reduce recurring issues and improve stability