Senior Site Reliability Engineer

Based in the United States, Full-Time, Senior
Salary not disclosed

Job Details

Required Skills
AWS, ServiceNow, Datadog, Distributed Systems

Requirements

  • Extensive experience in Site Reliability Engineering, production support, or infrastructure engineering roles.
  • Strong expertise in AWS services and cloud-native architectures.
  • Proven experience with observability tools such as Splunk, Datadog, or similar monitoring platforms.
  • Demonstrated ability to reduce alert noise, improve signal-to-noise ratios, and design effective alerting strategies.
  • Experience building incident response playbooks, severity frameworks, and operational runbooks.
  • Strong troubleshooting skills in complex distributed systems and production environments.
  • Experience working in regulated industries such as financial services, banking, or payments is highly preferred.
  • Excellent communication skills with the ability to coordinate across engineering and operations teams.

Responsibilities

  • Own and improve production reliability across large-scale distributed systems, ensuring high availability and performance in critical financial infrastructure environments.
  • Design, refine, and maintain observability and monitoring systems using tools such as Splunk, Datadog, and ServiceNow, focusing on actionable insights rather than alert noise.
  • Reduce alert fatigue by analyzing existing monitoring signals, eliminating false positives, and improving severity classification frameworks and escalation paths.
  • Develop and maintain incident response playbooks, ensuring clear operational procedures for troubleshooting, mitigation, and post-incident review.
  • Lead efforts to troubleshoot complex production issues in AWS-based environments, ensuring rapid identification and resolution of system failures.
  • Collaborate with engineering, infrastructure, and product teams to improve system reliability, scalability, and operational efficiency.
  • Continuously enhance operational maturity by introducing automation, observability improvements, and best practices for production support.