Principal Site Reliability Engineer

Symmetrio | Healthcare Technology

United States | Full-Time | Principal

Salary not disclosed

Job Details

Experience: 6+ years of hands-on experience supporting and managing AWS-based production environments; 4+ years of experience supporting web applications and backend services
Required Skills: AWS, Python, Django, Kubernetes, CI/CD, Terraform, Datadog

Qualifications

6+ years of hands-on experience supporting and managing AWS-based production environments.
4+ years of experience supporting web applications and backend services (Python/Django preferred).
Experience with AWS networking technologies including VPCs, Site-to-Site VPNs, Transit Gateways, routing, NAT gateways, and security groups.
Strong experience with Terraform and infrastructure-as-code deployment practices.
Experience with containerized environments including ECS, Fargate, Kubernetes, or similar technologies.
Experience building and supporting CI/CD pipelines and release automation processes.
Familiarity with monitoring and observability platforms such as Datadog, CloudWatch, Sentry, Grafana, or similar tools.
Experience leading production incident response, outage management, and root cause analysis.
Exposure to Windows Server environments, Active Directory, Kerberos, and enterprise infrastructure concepts is preferred.
Healthcare technology or regulated industry experience is highly preferred.
Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field.

Responsibilities

Serve as the primary technical owner for production reliability across U.S. customer environments.
Investigate and resolve complex issues spanning web applications, APIs, backend services, data pipelines, cloud infrastructure, and customer integrations.
Lead production incident response efforts, coordinating cross-functional teams to restore service and minimize customer impact.
Perform root cause analysis and drive corrective actions that improve long-term system stability and resilience.
Design, configure, and validate secure customer connectivity solutions including Site-to-Site VPNs and Transit Gateway integrations.
Enhance platform observability through improvements in monitoring, logging, alerting, and operational dashboards.
Contribute to CI/CD, infrastructure automation, and deployment processes that improve release safety.
Develop operational tooling that supports incident response, troubleshooting, and system monitoring.
