Senior Site Reliability Engineer - AWS
US, Full-Time, Senior
Salary: 175,000 - 190,000 USD per year
Job Details
- Experience
- 8+ years of experience in software engineering, infrastructure, or operations, including at least 4 years in Site Reliability Engineering roles.
- Required Skills
- AWS, Python, Bash, CI/CD
Requirements
- 8+ years of experience in software engineering, infrastructure, or operations, including at least 4 years in Site Reliability Engineering roles.
- Strong hands-on expertise with AWS services such as EC2, EKS, Lambda, S3, IAM, and CloudWatch.
- Proficiency in scripting and programming languages such as Python, Bash, or PowerShell.
- Proven experience building and maintaining highly automated, large-scale production systems.
- Strong knowledge of CI/CD pipelines, monitoring/alerting systems, incident response, and capacity planning.
- Experience improving system reliability through automation and reducing operational toil in production environments.
- Strong understanding of security best practices in cloud infrastructure.
- Ability to work independently in fast-paced environments while driving continuous improvement initiatives.
- Strong communication skills with the ability to collaborate across technical and non-technical stakeholders.
- Bachelor’s degree in Computer Science or related field, or equivalent hands-on experience and certifications.
Responsibilities
- Design, build, and maintain highly automated and autonomous systems for deployment, testing, monitoring, and operation of production environments.
- Lead reliability engineering efforts across the SDLC, ensuring system stability, performance, and scalability standards are consistently met.
- Develop and enhance CI/CD pipelines, automation scripts, and operational tooling to reduce manual effort and improve delivery speed.
- Implement robust monitoring, alerting, and observability systems to ensure real-time visibility into infrastructure and application health.
- Identify and resolve issues related to system availability, performance bottlenecks, and security vulnerabilities.
- Collaborate with engineering teams to improve architecture, reliability practices, and incident response processes.
- Participate in on-call rotations and provide rapid response support for production incidents.
- Document system architecture, operational procedures, and best practices while mentoring junior engineers.