Senior Site Reliability Engineer, Infrastructure
United States, Full-Time, Senior
Salary: 125,000 - 135,000 USD per year
Job Details
- Experience: 5+ years
- Required Skills: Grafana, Linux, Terraform
Requirements
- 5+ years of experience in site reliability engineering, platform engineering, or infrastructure engineering in production environments.
- Strong hands-on experience building observability systems, including metrics, logs, alerting, and monitoring pipelines.
- Familiarity with tools such as Grafana, Loki, Mimir, or similar observability platforms.
- Working knowledge of datacenter hardware telemetry protocols such as Redfish, IPMI, and/or SNMP.
- Strong Linux systems knowledge and experience operating production-grade infrastructure.
- Experience with infrastructure-as-code tools such as Terraform, Ansible, Chef, or equivalent technologies.
- Proven ability to collaborate across technical and operational teams in complex environments.
- Strong communication skills and ability to translate operational needs into engineering solutions.
Responsibilities
- Design and build observability pipelines for datacenter and provisioning infrastructure, including telemetry ingestion from systems such as Redfish, IPMI, SNMP, and OpenTelemetry.
- Own the full observability stack, from data collection through storage, processing, visualization, and alerting using tools such as Grafana, Loki, and Mimir.
- Develop dashboards, metrics, and alerting systems that provide actionable insights for datacenter operations, networking, systems, and provisioning teams.
- Define and enforce standards for telemetry collection, observability design, and infrastructure monitoring across global environments.
- Partner with cross-functional engineering and operations teams to translate operational needs into measurable signals and reliable monitoring systems.
- Drive infrastructure-as-code practices for observability systems to ensure scalability, consistency, and maintainability.
- Continuously improve system reliability, visibility, and operational efficiency across large-scale infrastructure environments.