Principal Site Reliability Engineer

Posted 3 months ago

100,000 - 120,000 USD per year

United States, Full-Time, Healthcare Workforce Management

Company: QGenda

Location: United States

Languages: English

Seniority level: Principal, 8+ years

Experience: 8+ years

Skills:

AWS, Docker, Leadership, Python, Software Development, Agile, Git, Jenkins, SCRUM, CI/CD, DevOps, Terraform, Problem Solving, Mentoring

Requirements:

B.S. in Computer Science, Computer Information Systems, or Computer Engineering from a major U.S. university, or equivalent industry experience
8+ years of experience as a DevOps, SRE, or Systems Engineer
Advanced proficiency with at least one scripting or programming language
Experience with Docker and container orchestration tools such as AWS ECS
Hands-on experience building infrastructure and supporting applications in AWS using services such as Lambda, EC2, ECS, S3, SNS, SQS, RDS, Redshift, and ElastiCache
Experience with logging and with creating dashboards and alerts using observability tools such as Datadog and Amazon CloudWatch
Strong understanding of networking and DNS
Familiarity with configuration management and infrastructure as code (IaC) tools such as Terraform
Firm understanding of and experience with Agile and Scrum SDLC processes
Experience using a distributed version control system (Git preferred)
Knowledge of CI/CD best practices and tools such as AWS CodeBuild, Jenkins, and/or TeamCity
Experience designing and delivering secure, high-performance, and highly available cloud services

Responsibilities:

Design, implement, and manage scalable systems for high availability and performance.
Continuously monitor and enhance system health and performance.
Embed observability (metrics, logs, traces, alerts) with actionable thresholds and runbooks.
Eliminate toil by building automation and self-service tools.
Own CI/CD pipelines and enable progressive delivery.
Manage infrastructure as code via Terraform.
Participate in on-call rotation for incident management.
Lead incident response and conduct blameless post-incident reviews.
Operate and secure AWS environments with focus on resilience and compliance.
Optimize cost, performance, and reliability of AWS environments.
Serve as a technical advisor to engineering teams on infrastructure and operations best practices.
Mentor peers on SRE practices.
Contribute to roadmaps and capacity planning.