Senior Software Engineer, Site Reliability Engineering

Flexible remote work options within North America | Full-Time | Senior

Salary not disclosed

Job Details

Experience: 5+ years
Required Skills: AWS, Python, JavaScript, Go, Linux, Microservices, Networking, Distributed Systems

Requirements

5+ years of experience in Site Reliability Engineering, infrastructure engineering, or distributed systems roles.
Strong expertise in AWS and Linux-based environments.
Proficiency in a programming language such as Python, Go, or JavaScript for automation and system development.
Deep understanding of distributed systems and networking protocols including DNS, HTTP/S, TLS, and TCP/IP.
Hands-on experience operating, monitoring, and debugging large-scale microservices architectures in production environments.
Strong problem-solving skills with the ability to break down complex system challenges and evaluate technical trade-offs.
Excellent communication skills with the ability to collaborate across engineering and non-engineering stakeholders.
Strong focus on system reliability, scalability, and reducing operational overhead.

Responsibilities

Design, build, and maintain scalable and highly available infrastructure and systems that support large-scale distributed applications.
Define and influence architectural direction for platform services, ensuring resilience, performance, and scalability across systems.
Develop tools and automation for deployment, monitoring, configuration management, and infrastructure operations.
Troubleshoot and resolve complex production issues across distributed systems, ensuring minimal downtime and rapid recovery.
Improve observability, monitoring, and alerting systems to enhance system visibility and reliability.
Participate in capacity planning, performance tuning, and forecasting to proactively address scaling challenges.
Collaborate with engineering teams to improve developer experience and reduce operational toil through automation and platform improvements.
Participate in on-call rotations and provide incident response support for critical systems.
