Senior Site Reliability Engineer

United States | Full-Time | Senior
Salary not disclosed

Job Details

Experience
5+ years
Required Skills
Kubernetes, Azure, Grafana, Azure DevOps, Datadog

Requirements

  • 5+ years of experience in SRE, Platform, or Production Engineering
  • Strong hands-on experience with Kubernetes and production environments
  • Experience with Azure and Azure DevOps
  • Experience with monitoring tools such as Datadog
  • Strong understanding of incident management and root cause analysis
  • Ability to build practical monitoring and alerting systems
  • Experience with AI or LLM pipelines
  • Experience building monitoring platforms across multiple systems
  • Experience with Grafana
  • Experience working in large-scale or distributed environments

Responsibilities

  • Build and maintain a central monitoring and alerting layer for AI applications and pipelines
  • Define and implement SLIs, alerts, and operational dashboards
  • Manage incidents including triage, coordination, root cause analysis, and prevention
  • Standardise telemetry across systems, covering latency, throughput, and failures
  • Optimise CI/CD pipelines and introduce quality gates for reliability
  • Work closely with engineering teams to reduce recurring issues and improve stability