Senior Site Reliability Engineer

CertifyOS, Healthcare Data

Remote US, Full-Time, Senior

Salary not disclosed

Job Details

Requirements

5+ years in SRE, DevOps, Platform Engineering, or Infrastructure Engineering.
Deep hands-on experience with GCP, specifically GKE and Cloud Run.
Experience building and maintaining Infrastructure as Code with Terraform and/or Pulumi.
Fluency in deployment patterns such as rolling, blue/green, and canary.
Strong knowledge of Linux systems administration.
Experience with observability platforms like Google Cloud Monitoring, Datadog, Grafana, or Prometheus.
Experience designing SLIs, SLOs, error budgets, and alerting strategies.
Proficiency in Python, Bash, or Go.
Experience building and maintaining CI/CD pipelines using GitHub Actions or similar.
Experience operating systems handling sensitive data or PII in regulated environments.

Responsibilities

Own the operational lifecycle end-to-end and influence platform architecture and reliability standards.
Manage incident response processes, root cause analysis, escalation workflows, and runbooks.
Maintain uptime, reduce alert fatigue, and build actionable observability across GKE and Cloud Run.
Improve autoscaling behavior, resource utilization, and workload efficiency.
Build and maintain Infrastructure as Code (IaC) and CI/CD pipelines.
Instrument monitoring for data freshness and infrastructure health.
Mentor teams on reliability practices and influence operational standards.