Senior Site Reliability Engineer
Based in the United States | Full-Time | Senior
Salary not disclosed
Job Details
- Required Skills: AWS, ServiceNow, Datadog, Distributed Systems
Requirements
- Extensive experience in Site Reliability Engineering, production support, or infrastructure engineering roles.
- Strong expertise in AWS services and cloud-native architectures.
- Proven experience with observability tools such as Splunk, Datadog, or similar monitoring platforms.
- Demonstrated ability to reduce alert noise, improve signal-to-noise ratios, and design effective alerting strategies.
- Experience building incident response playbooks, severity frameworks, and operational runbooks.
- Strong troubleshooting skills in complex distributed systems and production environments.
- Experience working in regulated industries such as financial services, banking, or payments is highly preferred.
- Excellent communication skills with the ability to coordinate across engineering and operations teams.
Responsibilities
- Own and improve production reliability across large-scale distributed systems, ensuring high availability and performance in critical financial infrastructure environments.
- Design, refine, and maintain observability and monitoring systems using tools such as Splunk, Datadog, and ServiceNow, focusing on actionable insights rather than alert noise.
- Reduce alert fatigue by analyzing existing monitoring signals, eliminating false positives, and improving severity classification frameworks and escalation paths.
- Develop and maintain incident response playbooks, ensuring clear operational procedures for troubleshooting, mitigation, and post-incident review.
- Lead efforts to troubleshoot complex production issues in AWS-based environments, ensuring rapid identification and resolution of system failures.
- Collaborate with engineering, infrastructure, and product teams to improve system reliability, scalability, and operational efficiency.
- Continuously enhance operational maturity by introducing automation, observability improvements, and best practices for production support.