Staff Site Reliability Engineer

Remote-friendly work environment within the United States. Full-Time, Staff

Salary not disclosed

Job Details

Required Skills: PostgreSQL, Kubernetes, Grafana, Prometheus, Terraform, Datadog, Distributed Systems

Bachelor’s degree in Computer Science or equivalent practical experience.
Background in Staff- or Principal-level SRE or Platform Engineering roles.
Experience building SRE practices or platform engineering foundations for developer platforms, SaaS, or PaaS environments.
Deep expertise in Kubernetes (multi-tenant cluster architecture, networking, scaling, security hardening).
Strong experience designing and operating large-scale distributed systems.
Hands-on expertise with infrastructure-as-code and GitOps tooling (Terraform, Terragrunt, Helm, ArgoCD).
Experience building and maintaining observability stacks (Prometheus, Grafana, Datadog).
Strong knowledge of cloud infrastructure, networking, and data systems (PostgreSQL, Redis, object storage).
Experience in incident management, postmortems, on-call operations, and reliability governance.
Strong technical leadership, communication, and mentoring skills.

Define and own end-to-end reliability for the platform, including SLOs, SLIs, error budgets, incident response frameworks, and operational best practices across all services.
Architect and implement multi-region Kubernetes-based infrastructure supporting edge deployments, backend services, and platform control planes.
Build and evolve GitOps-driven CI/CD pipelines and deployment systems using tools such as ArgoCD, Helm, Terraform, and Terragrunt.
Design and operate scalable, multi-tenant data systems including PostgreSQL clusters, caching layers, and object storage with a focus on resilience and performance.
Establish observability standards from inception, including monitoring, logging, alerting, dashboards, and runbooks using tools such as Datadog, Prometheus, and Grafana.
Partner with engineering, product, and security teams to integrate reliability, compliance, and operational excellence into platform architecture decisions.
Lead incident management, postmortems, and on-call practices while fostering a blameless, high-learning engineering culture.
Mentor engineers across teams on SRE principles, reliability engineering practices, and operational maturity.
Evaluate and adopt emerging technologies relevant to edge computing, serverless platforms, and modern distributed infrastructure.
