Senior Software Engineer, Site Reliability Engineering
Flexible remote work options within North America, Full-Time, Senior
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: AWS, Python, JavaScript, Go, Linux, Microservices, Networking, Distributed Systems
Requirements
- 5+ years of experience in Site Reliability Engineering, infrastructure engineering, or distributed systems roles.
- Strong expertise in AWS and Linux-based environments.
- Proficiency in programming languages such as Python, Go, JavaScript, or similar for automation and system development.
- Deep understanding of distributed systems and networking protocols including DNS, HTTP/S, TLS, and TCP/IP.
- Hands-on experience operating, monitoring, and debugging large-scale microservices architectures in production environments.
- Strong problem-solving skills with the ability to break down complex system challenges and evaluate technical trade-offs.
- Excellent communication skills with the ability to collaborate across engineering and non-engineering stakeholders.
- Strong focus on system reliability, scalability, and reducing operational overhead.
Responsibilities
- Design, build, and maintain scalable and highly available infrastructure and systems that support large-scale distributed applications.
- Define and influence architectural direction for platform services, ensuring resilience, performance, and scalability across systems.
- Develop tools and automation for deployment, monitoring, configuration management, and infrastructure operations.
- Troubleshoot and resolve complex production issues across distributed systems, ensuring minimal downtime and rapid recovery.
- Improve observability, monitoring, and alerting systems to enhance system visibility and reliability.
- Participate in capacity planning, performance tuning, and forecasting to proactively address scaling challenges.
- Collaborate with engineering teams to improve developer experience and reduce operational toil through automation and platform improvements.
- Participate in on-call rotations and provide incident response support for critical systems.