Senior Site Reliability Engineer
United States | Full-Time | Senior
Salary not disclosed
Job Details
- Required Skills
- AWS, Docker, Python, Kubernetes, Grafana, Prometheus, CI/CD, Terraform, Datadog
Requirements
- Monitoring and observability best practices, including tools such as Datadog, Prometheus, and Grafana
- Expertise in setting up and managing alerts, dashboards, and logging
- Understanding of networking concepts, security best practices, and performance optimization in AWS
- Proficiency in AWS services: EKS, EC2, ECS, S3, RDS, VPC, IAM, Route 53, etc.
- Experience with containerization and orchestration tools like Docker and Kubernetes
- Strong knowledge of Infrastructure as Code (IaC) tools such as Terraform, CDK, or CloudFormation
- Knowledge of scripting and automation using languages like Python, Bash, or PowerShell
- Experience with CI/CD pipelines for deploying and testing applications in AWS
Responsibilities
- Implementing best practices for monitoring, alerting, and incident response using Datadog and other tools.
- Designing, building, and maintaining cost-effective, reliable, and scalable AWS infrastructure.
- Collaborating with cross-functional teams to identify and address performance bottlenecks and reliability issues.
- Conducting post-incident reviews to analyse root causes and implement preventive measures.
- Automating routine tasks and processes to improve efficiency and reduce manual intervention.
- Participating in an on-call rotation to respond to system outages and emergencies.