Principal Site Reliability Engineer

Symmetrio
Healthcare Technology
United States | Full-Time | Principal
Salary not disclosed

Job Details

Experience
6+ years of hands-on experience supporting and managing AWS-based production environments; 4+ years of experience supporting web applications and backend services
Required Skills
AWS, Python, Django, Kubernetes, CI/CD, Terraform, Datadog

Requirements

  • 6+ years of hands-on experience supporting and managing AWS-based production environments.
  • 4+ years of experience supporting web applications and backend services (Python/Django preferred).
  • Experience with AWS networking technologies including VPCs, Site-to-Site VPNs, Transit Gateways, routing, NAT gateways, and security groups.
  • Strong experience with Terraform and infrastructure-as-code deployment practices.
  • Experience with containerized environments including ECS, Fargate, Kubernetes, or similar technologies.
  • Experience building and supporting CI/CD pipelines and release automation processes.
  • Familiarity with monitoring and observability platforms such as Datadog, CloudWatch, Sentry, Grafana, or similar tools.
  • Experience leading production incident response, outage management, and root cause analysis.
  • Exposure to Windows Server environments, Active Directory, Kerberos, and enterprise infrastructure concepts is preferred.
  • Healthcare technology or regulated industry experience is highly preferred.
  • Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field.

Responsibilities

  • Serve as the primary technical owner for production reliability across U.S. customer environments.
  • Investigate and resolve complex issues spanning web applications, APIs, backend services, data pipelines, cloud infrastructure, and customer integrations.
  • Lead production incident response efforts, coordinating cross-functional teams to restore service and minimize customer impact.
  • Perform root cause analysis and drive corrective actions that improve long-term system stability and resilience.
  • Design, configure, and validate secure customer connectivity solutions including Site-to-Site VPNs and Transit Gateway integrations.
  • Enhance platform observability through improvements in monitoring, logging, alerting, and operational dashboards.
  • Contribute to CI/CD, infrastructure automation, and deployment processes that improve release safety.
  • Develop operational tooling that supports incident response, troubleshooting, and system monitoring.