Site Reliability Engineer

Orkes, Cloud Infrastructure

Location: Remote (US or Canada), Pacific Time (PT) hours
Type: Full-Time
Level: Senior

Salary: 180,000 - 250,000 USD per year

Job Details

Experience: 5+ years
Required Skills: AWS, Python, Bash, GCP, Kubernetes, Azure, Grafana, Prometheus, Terraform, Datadog

Requirements

5+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or related infrastructure roles
Strong experience with cloud platforms such as AWS, GCP, or Azure
Hands-on experience with Kubernetes and containerized environments
Strong understanding of distributed systems and microservices architecture
Experience with observability tools such as Prometheus, Grafana, Datadog, ELK, or OpenTelemetry
Proficiency with infrastructure automation and scripting (Terraform, Python, Bash, etc.)
Experience managing CI/CD pipelines and deployment automation
Strong troubleshooting and incident management skills
Ability to work cross-functionally and communicate effectively during high-pressure situations

Responsibilities

Own reliability, availability, and performance of production systems running in cloud environments
Define and monitor SLIs/SLOs and help manage error budgets across the platform
Lead incident response efforts including detection, triage, mitigation, and postmortems
Improve observability through logging, monitoring, alerting, and dashboards
Automate operational workflows and reduce manual toil wherever possible
Partner closely with engineering teams to improve system resiliency and scalability
Assist with capacity planning, infrastructure optimization, and performance tuning
Build internal tooling, runbooks, and operational best practices
Support Kubernetes-based infrastructure and distributed systems at scale
Act as an escalation point for complex production and platform issues

About the Company

Orkes

Company size: 11-50 employees
Industry: Artificial Intelligence (AI)
