Site Reliability Engineer - AI Agents

Based in the United States | Full-Time | Senior

Salary not disclosed

Job Details

Requirements

5+ years of experience in Site Reliability Engineering, Platform Engineering, Infrastructure Engineering, or similar production-focused roles
Hands-on experience supporting ML systems, model serving infrastructure, or MLOps pipelines in production environments
Strong experience building developer platforms, internal tools, APIs, or SDKs used by engineering teams at scale
Deep understanding of platform engineering principles
Strong proficiency with Infrastructure as Code tools, particularly Terraform
Advanced experience with Kubernetes and containerized environments (Docker)
Solid cloud infrastructure experience, preferably within AWS environments
Strong programming and scripting skills (Python preferred, plus Bash/shell proficiency)
Experience designing and operating observability, logging, monitoring, and alerting systems
Proven experience with incident response, on-call rotations, and production reliability ownership

Responsibilities

Design, build, and operate cloud-native infrastructure supporting AI agent execution, orchestration, and model serving at scale
Ensure reliability, observability, and performance of distributed agentic systems across internal and external-facing products
Develop platform services, APIs, SDKs, and self-service tooling to enable teams to efficiently consume AI infrastructure capabilities
Manage and optimize compute, orchestration, and serving layers for AI and ML workloads in production environments
Build and maintain CI/CD pipelines to enable safe, fast, and reliable deployment of AI services and agent workflows
Implement Infrastructure as Code using tools such as Terraform to provision and manage AWS-based infrastructure
Design monitoring, alerting, and observability systems tailored to AI/ML and agent-based workloads
Define and enforce reliability patterns, guardrails, and failure recovery mechanisms for LLM and agentic systems
Manage Kubernetes-based container orchestration environments
Implement security best practices and access controls across infrastructure and platform services
