Senior Site Reliability Engineer
Omilia, IT Engineering
Philippines, Full-Time, Senior
Salary not disclosed
Job Details
- Required Skills: AWS, Python, Bash, Kubernetes, Go, Grafana, Prometheus
Requirements
- Bachelor's or Master's degree in Engineering, or equivalent.
- Experience operating at least one container orchestration platform (e.g. Kubernetes or Docker Swarm).
- Experience developing or maintaining software for production services at scale.
- Experience with the ELK stack (Elasticsearch, Logstash, Kibana).
- Experience with AWS.
- Experience with Grafana/Prometheus stack.
- Strong scripting skills (Bash, Python, or Go).
- Excellent communication skills.
- Agile/lean methodology experience.
Responsibilities
- Ensure platform reliability and availability across production and pre-production environments through proactive monitoring, alerting, and automation.
- Act as first responder for incidents and contribute to problem management and root cause analysis.
- Support the development team's reliability efforts and help build a solid reliability culture within the development lifecycle.
- Develop troubleshooting documentation for production support resources.
- Collaborate with Engineering teams to develop optimised runbooks and operational documentation, and to automate operational tasks.
- Collaborate with development and cloud engineering teams to embed reliability and performance into the software delivery lifecycle.
- Design, implement, and evolve observability solutions (metrics, logs, traces, dashboards) using tools such as Prometheus, Grafana, and ELK.
- Participate in on-call rotations and continuously improve alert quality and response processes.
- Champion a culture of reliability, performance, and continuous improvement across teams.