Site Reliability Engineer
Australia, Canada, Germany, India, United Kingdom, United States
Full-Time
Senior
Salary: 164,382 - 215,050 USD per year
Job Details
- Experience: 15+ years
- Required Skills: Python, Bash, Kubernetes, Grafana, Prometheus, CI/CD, Linux, Distributed Systems
Requirements
- 15+ years in system reliability, DevSecOps, cloud operations, or infrastructure engineering
- Bachelor's degree or 4 years of work experience in lieu of degree
- Strong scripting and automation skills (Python, Bash, PowerShell, etc.)
- Hands‑on experience with monitoring tools (Prometheus, Grafana, Splunk, ELK, Datadog, etc.)
- Familiarity with Kubernetes, container orchestration, and modern CI/CD pipelines
- Understanding of networking, Linux system internals, and distributed systems
- Ability to troubleshoot complex technical issues across the stack
Responsibilities
- Design, build, and maintain highly available, scalable systems across cloud and on‑prem environments.
- Develop automation solutions that improve observability, speed recovery, and eliminate manual operational work.
- Implement monitoring, alerting, and performance tuning strategies that ensure system health.
- Collaborate with development and infrastructure teams to design reliable architectures and CI/CD pipelines.
- Conduct root cause analysis and drive systemic improvements to prevent future incidents.
- Champion SRE best practices such as SLIs/SLOs, error budgets, and automated incident response.
- Provide input to proposal operations within your area of subject matter expertise.