Site Reliability Engineer
Orkes
Cloud Infrastructure
Location: Remote (US or Canada), PT hours
Full-Time
Senior
Salary: 180,000 - 250,000 USD per year
Job Details
- Experience: 5+ years
- Required Skills: AWS, Python, Bash, GCP, Kubernetes, Azure, Grafana, Prometheus, Terraform, Datadog
Requirements
- 5+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or related infrastructure roles
- Strong experience with cloud platforms such as AWS, GCP, or Azure
- Hands-on experience with Kubernetes and containerized environments
- Strong understanding of distributed systems and microservices architecture
- Experience with observability tools such as Prometheus, Grafana, Datadog, ELK, or OpenTelemetry
- Proficiency with infrastructure automation and scripting (Terraform, Python, Bash, etc.)
- Experience managing CI/CD pipelines and deployment automation
- Strong troubleshooting and incident management skills
- Ability to work cross-functionally and communicate effectively during high-pressure situations
Responsibilities
- Own reliability, availability, and performance of production systems running in cloud environments
- Define and monitor SLIs/SLOs and help manage error budgets across the platform
- Lead incident response efforts including detection, triage, mitigation, and postmortems
- Improve observability through logging, monitoring, alerting, and dashboards
- Automate operational workflows and reduce manual toil wherever possible
- Partner closely with engineering teams to improve system resiliency and scalability
- Assist with capacity planning, infrastructure optimization, and performance tuning
- Build internal tooling, runbooks, and operational best practices
- Support Kubernetes-based infrastructure and distributed systems at scale
- Act as an escalation point for complex production and platform issues