Senior Site Reliability Engineer

Omilia
IT Engineering
Philippines, Full-Time, Senior
Salary not disclosed

Job Details

Required Skills
AWS, Python, Bash, Kubernetes, Go, Grafana, Prometheus

Requirements

  • Bachelor's or Master's degree in Engineering, or equivalent.
  • Experience operating at least one container orchestration platform (e.g. Kubernetes or Docker Swarm).
  • Experience developing or maintaining software for production services at scale.
  • Experience with ELK.
  • Experience with AWS.
  • Experience with Grafana/Prometheus stack.
  • Strong scripting skills (Bash, Python or Go).
  • Excellent communication skills.
  • Agile/lean methodology experience.

Responsibilities

  • Ensure platform reliability and availability across production and pre-production environments through proactive monitoring, alerting, and automation.
  • Act as first responder for incidents and contribute to problem management and root cause analysis.
  • Support the development team's reliability efforts and help build a solid reliability culture within the development lifecycle.
  • Develop troubleshooting documentation for production support resources.
  • Collaborate with engineering teams to develop effective runbooks and operational documentation, and to automate operational tasks.
  • Collaborate with development and cloud engineering teams to embed reliability and performance into the software delivery lifecycle.
  • Design, implement, and evolve observability solutions (metrics, logs, traces, dashboards) using tools such as Prometheus, Grafana, and ELK.
  • Participate in on-call rotations and continuously improve alert quality and response processes.
  • Champion a culture of reliability, performance, and continuous improvement across teams.