Senior Site Reliability Engineer, Infrastructure

United States | Full-Time | Senior
Salary: 125,000 - 135,000 USD per year

Job Details

Experience
5+ years
Required Skills
Grafana, Linux, Terraform

Requirements

  • 5+ years of experience in site reliability engineering, platform engineering, or infrastructure engineering in production environments.
  • Strong hands-on experience building observability systems, including metrics, logs, alerting, and monitoring pipelines.
  • Familiarity with tools such as Grafana, Loki, Mimir, or similar observability platforms.
  • Working knowledge of datacenter hardware telemetry protocols such as Redfish, IPMI, and/or SNMP.
  • Strong Linux systems knowledge and experience operating production-grade infrastructure.
  • Experience with infrastructure-as-code tools such as Terraform, Ansible, Chef, or equivalent technologies.
  • Proven ability to collaborate across technical and operational teams in complex environments.
  • Strong communication skills and the ability to translate operational needs into engineering solutions.

Responsibilities

  • Design and build observability pipelines for datacenter and provisioning infrastructure, including telemetry ingestion from systems such as Redfish, IPMI, SNMP, and OpenTelemetry.
  • Own the full observability stack, from data collection through storage, processing, visualization, and alerting using tools such as Grafana, Loki, and Mimir.
  • Develop dashboards, metrics, and alerting systems that provide actionable insights for datacenter operations, networking, systems, and provisioning teams.
  • Define and enforce standards for telemetry collection, observability design, and infrastructure monitoring across global environments.
  • Partner with cross-functional engineering and operations teams to translate operational needs into measurable signals and reliable monitoring systems.
  • Drive infrastructure-as-code practices for observability systems to ensure scalability, consistency, and maintainability.
  • Continuously improve system reliability, visibility, and operational efficiency across large-scale infrastructure environments.