Senior Site Reliability Engineer

Remote / Grand Rapids, MI / Austin, TX
Full-Time
Senior
Salary not disclosed

Job Details

Experience
5+ years
Required Skills
Python, Kubernetes, Azure, Linux, DevOps, Terraform, Datadog

Requirements

  • 5+ years in SRE, DevOps, Platform Engineering, or Infrastructure Engineering.
  • Experience supporting production SaaS systems in Azure (preferred), AWS, or GCP.
  • Strong Linux, networking, and distributed systems troubleshooting skills.
  • Strong experience with containers and orchestration (Kubernetes/EKS/AKS).
  • Expertise with Infrastructure-as-Code (Terraform strongly preferred).
  • Strong scripting/programming skills in Python, Go, Bash, or C#/.NET.
  • Hands-on experience with Datadog, Prometheus/Grafana, or OpenTelemetry.

Responsibilities

  • Define and operationalize SLIs/SLOs and error budgets.
  • Design and implement autonomous and semi-autonomous AI agents for monitoring distributed systems.
  • Participate in and help lead an on-call rotation.
  • Build automated workflows to eliminate manual work.
  • Design and maintain Infrastructure-as-Code with Terraform.
  • Improve metrics, logs, traces, and alerting to reduce noise.
  • Partner with application teams to implement reliability best practices.
  • Mentor junior engineers to foster a culture of knowledge sharing.