Senior Software Engineer, Site Reliability Engineering

Flexible remote work options within North America
Full-Time
Senior
Salary not disclosed

Job Details

Experience
5+ years
Required Skills
AWS, Python, JavaScript, Go, Linux, Microservices, Networking, Distributed Systems

Requirements

  • 5+ years of experience in Site Reliability Engineering, infrastructure engineering, or distributed systems roles.
  • Strong expertise in AWS and Linux-based environments.
  • Proficiency in programming languages such as Python, Go, JavaScript, or similar for automation and system development.
  • Deep understanding of distributed systems and networking protocols including DNS, HTTP/S, TLS, and TCP/IP.
  • Hands-on experience operating, monitoring, and debugging large-scale microservices architectures in production environments.
  • Strong problem-solving skills with the ability to break down complex system challenges and evaluate technical trade-offs.
  • Excellent communication skills with the ability to collaborate across engineering and non-engineering stakeholders.
  • Strong focus on system reliability, scalability, and reducing operational overhead.

Responsibilities

  • Design, build, and maintain scalable and highly available infrastructure and systems that support large-scale distributed applications.
  • Define and influence architectural direction for platform services, ensuring resilience, performance, and scalability across systems.
  • Develop tools and automation for deployment, monitoring, configuration management, and infrastructure operations.
  • Troubleshoot and resolve complex production issues across distributed systems, ensuring minimal downtime and rapid recovery.
  • Improve observability, monitoring, and alerting systems to enhance system visibility and reliability.
  • Participate in capacity planning, performance tuning, and forecasting to proactively address scaling challenges.
  • Collaborate with engineering teams to improve developer experience and reduce operational toil through automation and platform improvements.
  • Participate in on-call rotations and provide incident response support for critical systems.