Senior Site Reliability Engineer

Akuity
Cloud Infrastructure

Location: Must live within US time zones (Pacific through Eastern); this includes Canada and other regions in those time zones
Full-Time
Senior

Salary: Competitive compensation, commensurate with experience

Job Details

Requirements

5+ years of SRE, platform engineering, or production operations experience in a SaaS environment.
Deep hands-on Kubernetes expertise including scheduler, networking, storage, and autoscaling.
Strong AWS fundamentals across compute (EC2, EKS), networking (VPC, NLB, Route53), storage (S3, RDS), and IAM.
Experience defining and operating against production SLOs and error budgets.
Proficiency with observability tooling (Prometheus, Grafana, OpenTelemetry, Datadog, or equivalent).
Solid scripting and automation skills (Go, Python, Bash, or similar).
Strong written communication skills for runbooks, incident reports, and post-mortems.
Must reside within US time zones (Pacific through Eastern).

Responsibilities

Own SLI/SLO/SLA definitions for the Akuity SaaS platform and drive continuous improvement against them.
Design, instrument, and maintain observability systems (metrics, logs, traces) across multi-region AWS infrastructure.
Identify reliability gaps, lead blameless post-mortems, and close the loop with permanent fixes.
Partner with engineering teams to build reliability into new features before they ship to production.
Participate in an on-call rotation and act as incident commander for high-severity production events.
Build and maintain runbooks, escalation paths, and incident playbooks that keep mean time to resolution low.
Drive improvements to alerting fidelity to reduce noise and eliminate toil.
Lead post-incident reviews with clear timelines and root cause analysis.