Staff Site Reliability Engineer

Posted 4 months ago

201,000 - 287,100 USD per year

United States, Full-Time, Data Resilience / SaaS

Company: Veeam Software

Location: United States

Languages: English

Seniority level: Staff, 8+ years

Experience: 8+ years

Skills:

Leadership, Java, JavaScript, Kubernetes, Microsoft Azure, TypeScript, C#, Go, Grafana, Prometheus, CI/CD, DevOps, Terraform, Software Engineering

Requirements:

8+ years of experience in a Software Engineering or SRE role, including technical leadership.
Demonstrated experience mentoring and guiding senior engineers.
Deep expertise in building distributed systems on public cloud (Azure preferred).
Strong programming skills (e.g., JavaScript, Go, TypeScript, Java, or C#).
Hands-on experience with observability tooling (e.g., Prometheus, Grafana, OpenTelemetry).
Mastery of infrastructure automation tools (Terraform, Pulumi) and container orchestration (Kubernetes).
Ability to communicate clearly across geographies and disciplines.
Experience leading SRE initiatives across multiple product teams (preferred).
Background in chaos engineering, incident learning, or performance and load testing (preferred).
Familiarity with global compliance standards (ISO, SOC 2, GDPR, FedRAMP, CMMC) (preferred).

Responsibilities:

Serve as a hands-on technical leader within the SRE team.
Guide senior engineers and influence product development teams.
Ensure systems are reliable, scalable, and observable.
Drive strategic initiatives and mentor others in SRE practices.
Help define architectural best practices.
Align teams, enforce high standards, and scale SRE principles globally.
Act as a technical authority, mentoring senior engineers and guiding design choices.
Lead the definition and enforcement of SLIs, SLOs, and error budgets.
Collaborate with Staff peers to align strategy and champion reliability standards.
Partner with development and product teams to design for failure and build resilient architectures.
Drive adoption of observability best practices and tooling.
Ensure metrics, logs, and traces provide actionable insights.
Lead complex incident responses and systemic reliability improvements.
Promote a blameless culture of learning.
Lead initiatives in infrastructure as code, deployment automation, and resilience testing.
Influence the development and adoption of chaos engineering practices.
Partner with platform and security teams to ensure production readiness.
Provide architectural guidance and advocate for engineering rigor.
Represent the SRE team in technical leadership forums.