Senior Site Reliability Engineer


Remote / Grand Rapids, MI / Austin, TX | Full-Time | Senior

Salary not disclosed


Job Details

Qualifications

5+ years of experience in SRE, DevOps, Platform Engineering, or Infrastructure Engineering.
Experience supporting production SaaS systems in Azure (preferred), AWS, or GCP.
Strong Linux, networking, and distributed systems troubleshooting skills.
Strong experience with containers and orchestration (Kubernetes/EKS/AKS).
Expertise with Infrastructure-as-Code (Terraform strongly preferred).
Strong scripting/programming skills in Python, Go, Bash, or C#/.NET.
Hands-on experience with Datadog, Prometheus/Grafana, or OpenTelemetry.

Responsibilities

Define and operationalize SLIs/SLOs and error budgets.
Design and implement autonomous and semi-autonomous AI agents for monitoring distributed systems.
Participate in and help lead an on-call rotation.
Build automated workflows to eliminate manual work.
Design and maintain Infrastructure-as-Code with Terraform.
Improve metrics, logs, traces, and alerting to reduce noise.
Partner with application teams to implement reliability best practices.
Mentor junior engineers to foster a culture of knowledge sharing.
