Senior Site Reliability Engineer
Akuity
Cloud Infrastructure
Live within US time zones (Pacific through Eastern), including Canada and other regions
Full-Time
Senior
Salary: Competitive compensation, commensurate with experience
Job Details
- Experience: 5+ years
- Required Skills: AWS, Python, Bash, Kubernetes, Go, Terraform, SaaS
Requirements
- 5+ years of SRE, platform engineering, or production operations experience in a SaaS environment.
- Deep hands-on Kubernetes expertise, including the scheduler, networking, storage, and autoscaling.
- Strong AWS fundamentals across compute (EC2, EKS), networking (VPC, NLB, Route 53), storage (S3, RDS), and IAM.
- Experience defining and operating against production SLOs and error budgets.
- Proficiency with observability tooling (Prometheus, Grafana, OpenTelemetry, Datadog, or equivalent).
- Solid scripting and automation skills (Go, Python, Bash, or similar).
- Strong written communication skills for runbooks, incident reports, and post-mortems.
- Must reside within US time zones (Pacific through Eastern).
Responsibilities
- Own SLI/SLO/SLA definitions for the Akuity SaaS platform and drive continuous improvement against them.
- Design, instrument, and maintain observability systems (metrics, logs, traces) across multi-region AWS infrastructure.
- Identify reliability gaps, lead blameless post-mortems, and close the loop with permanent fixes.
- Partner with engineering teams to build reliability into new features before they ship to production.
- Participate in an on-call rotation and act as incident commander for high-severity production events.
- Build and maintain runbooks, escalation paths, and incident playbooks that keep mean time to resolution low.
- Drive improvements to alerting fidelity to reduce noise and eliminate toil.
- Lead post-incident reviews with clear timelines and root cause analysis.