Staff Site Reliability Engineer

Remote-friendly work environment within the United States
Full-Time
Staff
Salary not disclosed

Job Details

Required Skills
PostgreSQL, Kubernetes, Grafana, Prometheus, Terraform, Datadog, Distributed Systems

Requirements

  • Bachelor’s degree in Computer Science or equivalent practical experience.
  • Prior experience in Staff- or Principal-level SRE or Platform Engineering roles.
  • Experience building SRE practices or platform engineering foundations for developer platforms, SaaS, or PaaS environments.
  • Deep expertise in Kubernetes (multi-tenant cluster architecture, networking, scaling, security hardening).
  • Strong experience designing and operating large-scale distributed systems.
  • Hands-on expertise with infrastructure-as-code and GitOps tooling (Terraform, Terragrunt, Helm, ArgoCD).
  • Experience building and maintaining observability stacks (Prometheus, Grafana, Datadog).
  • Strong knowledge of cloud infrastructure, networking, and data systems (PostgreSQL, Redis, object storage).
  • Experience in incident management, postmortems, on-call operations, and reliability governance.
  • Strong technical leadership, communication, and mentoring skills.

Responsibilities

  • Define and own end-to-end reliability for the platform, including SLOs, SLIs, error budgets, incident response frameworks, and operational best practices across all services.
  • Architect and implement multi-region Kubernetes-based infrastructure supporting edge deployments, backend services, and platform control planes.
  • Build and evolve GitOps-driven CI/CD pipelines and deployment systems using tools such as ArgoCD, Helm, Terraform, and Terragrunt.
  • Design and operate scalable, multi-tenant data systems including PostgreSQL clusters, caching layers, and object storage with a focus on resilience and performance.
  • Establish observability standards from inception, including monitoring, logging, alerting, dashboards, and runbooks using tools such as Datadog, Prometheus, and Grafana.
  • Partner with engineering, product, and security teams to integrate reliability, compliance, and operational excellence into platform architecture decisions.
  • Lead incident management, postmortems, and on-call practices while fostering a blameless, high-learning engineering culture.
  • Mentor engineers across teams on SRE principles, reliability engineering practices, and operational maturity.
  • Evaluate and adopt emerging technologies relevant to edge computing, serverless platforms, and modern distributed infrastructure.