Senior Site Reliability Engineer

Omilia | IT Engineering

Philippines | Full-Time | Senior

Salary not disclosed

Job Details

Bachelor's Degree or MS in Engineering or equivalent.
Experience operating at least one container orchestration cluster (Kubernetes, Docker Swarm).
Experience developing or maintaining software for production services at scale.
Experience with ELK.
Experience with AWS.
Experience with Grafana/Prometheus stack.
Strong scripting skills (Bash, Python or Go).
Excellent communication skills.
Agile/lean methodology experience.

Ensure platform reliability and availability across production and pre-production environments through proactive monitoring, alerting, and automation.
Provide first response for incidents and contribute to problem management and root cause analysis.
Support the development team's reliability efforts, building a solid reliability culture within the development lifecycle.
Develop troubleshooting documentation for production support resources.
Collaborate with Engineering teams to develop effective runbooks and operational documentation, and to automate operational tasks.
Collaborate with development and cloud engineering teams to embed reliability and performance into the software delivery lifecycle.
Design, implement, and evolve observability solutions (metrics, logs, traces, dashboards) using tools such as Prometheus, Grafana, and ELK.
Participate in on-call rotations and continuously improve alert quality and response processes.
Champion a culture of reliability, performance, and continuous improvement across teams.
