Senior Site Reliability Engineer

United States, Full-Time, Senior
Salary not disclosed

Job Details

Required Skills
AWS, Docker, Python, Kubernetes, Grafana, Prometheus, CI/CD, Terraform, Datadog

Requirements

  • Monitoring and observability best practices, including the use of tools like Datadog, Prometheus, and Grafana
  • Expertise in setting up and managing alerts, dashboards, and logging
  • Understanding of networking concepts, security best practices, and performance optimization in AWS
  • Proficiency in AWS services: EKS, EC2, ECS, S3, RDS, VPC, IAM, Route 53, etc.
  • Experience with containerization and orchestration tools like Docker and Kubernetes
  • Strong knowledge of Infrastructure as Code (IaC) tools such as Terraform, CDK or CloudFormation
  • Knowledge of scripting and automation using languages like Python, Bash, or PowerShell
  • Experience with CI/CD pipelines for deploying and testing applications in AWS

Responsibilities

  • Implementing best practices for monitoring, alerting, and incident response using Datadog and other tools.
  • Designing, building, and maintaining cost-effective, reliable, and scalable AWS infrastructure.
  • Collaborating with cross-functional teams to identify and address performance bottlenecks and reliability issues.
  • Conducting post-incident reviews to analyse root causes and implement preventive measures.
  • Automating routine tasks and processes to improve efficiency and reduce manual intervention.
  • Participating in an on-call rotation to respond to system outages and emergencies.