Staff Site Reliability Engineer
Remote-friendly work environment within the United States
Full-Time
Staff
Salary not disclosed
Job Details
- Required Skills: PostgreSQL, Kubernetes, Grafana, Prometheus, Terraform, Datadog, Distributed Systems
Requirements
- Bachelor’s degree in Computer Science or equivalent practical experience.
- Background in Staff- or Principal-level SRE or Platform Engineering roles.
- Experience building SRE practices or platform engineering foundations for developer platforms, SaaS, or PaaS environments.
- Deep expertise in Kubernetes (multi-tenant cluster architecture, networking, scaling, security hardening).
- Strong experience designing and operating large-scale distributed systems.
- Hands-on expertise with infrastructure-as-code and GitOps tooling (Terraform, Terragrunt, Helm, ArgoCD).
- Experience building and maintaining observability stacks (Prometheus, Grafana, Datadog).
- Strong knowledge of cloud infrastructure, networking, and data systems (PostgreSQL, Redis, object storage).
- Experience in incident management, postmortems, on-call operations, and reliability governance.
- Strong technical leadership, communication, and mentoring skills.
Responsibilities
- Define and own end-to-end reliability for the platform, including SLOs, SLIs, error budgets, incident response frameworks, and operational best practices across all services.
- Architect and implement multi-region Kubernetes-based infrastructure supporting edge deployments, backend services, and platform control planes.
- Build and evolve GitOps-driven CI/CD pipelines and deployment systems using tools such as ArgoCD, Helm, Terraform, and Terragrunt.
- Design and operate scalable, multi-tenant data systems including PostgreSQL clusters, caching layers, and object storage with a focus on resilience and performance.
- Establish observability standards from inception, including monitoring, logging, alerting, dashboards, and runbooks using tools such as Datadog, Prometheus, and Grafana.
- Partner with engineering, product, and security teams to integrate reliability, compliance, and operational excellence into platform architecture decisions.
- Lead incident management, postmortems, and on-call practices while fostering a blameless, high-learning engineering culture.
- Mentor engineers across teams on SRE principles, reliability engineering practices, and operational maturity.
- Evaluate and adopt emerging technologies relevant to edge computing, serverless platforms, and modern distributed infrastructure.