Senior Site Reliability Engineer
Remote / Grand Rapids, MI / Austin, TX | Full-Time | Senior
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: Python, Kubernetes, Azure, Linux, DevOps, Terraform, Datadog
Requirements
- 5+ years in SRE, DevOps, Platform Engineering, or Infrastructure Engineering.
- Experience supporting production SaaS systems in Azure (preferred), AWS, or GCP.
- Strong Linux, networking, and distributed systems troubleshooting skills.
- Strong experience with containers and orchestration (Kubernetes/EKS/AKS).
- Expertise with Infrastructure-as-Code (Terraform strongly preferred).
- Strong scripting/programming skills in Python, Go, Bash, or C#/.NET.
- Hands-on experience with Datadog, Prometheus/Grafana, or OpenTelemetry.
Responsibilities
- Define and operationalize SLIs/SLOs and error budgets.
- Design and implement autonomous and semi-autonomous AI agents for monitoring distributed systems.
- Participate in and help lead an on-call rotation.
- Build automated workflows to eliminate manual work.
- Design and maintain Infrastructure-as-Code with Terraform.
- Improve metrics, logs, traces, and alerting to reduce noise.
- Partner with application teams to implement reliability best practices.
- Mentor junior engineers to foster a culture of knowledge sharing.