Site Reliability Engineer - AI Agents

Based in the United States, Full-Time, Senior
Salary not disclosed

Job Details

Experience
5+ years
Required Skills
AWS, Docker, Python, Kubernetes, Terraform, MLOps

Requirements

  • 5+ years of experience in Site Reliability Engineering, Platform Engineering, Infrastructure Engineering, or similar production-focused roles
  • Hands-on experience supporting ML systems, model serving infrastructure, or MLOps pipelines in production environments
  • Strong experience building developer platforms, internal tools, APIs, or SDKs used by engineering teams at scale
  • Deep understanding of platform engineering principles
  • Strong proficiency with Infrastructure as Code tools, particularly Terraform
  • Advanced experience with Kubernetes and containerized environments (Docker)
  • Solid cloud infrastructure experience, preferably within AWS environments
  • Strong programming and scripting skills (Python preferred, plus Bash/shell proficiency)
  • Experience designing and operating observability, logging, monitoring, and alerting systems
  • Proven experience with incident response, on-call rotations, and production reliability ownership

Responsibilities

  • Design, build, and operate cloud-native infrastructure supporting AI agent execution, orchestration, and model serving at scale
  • Ensure reliability, observability, and performance of distributed agentic systems across internal and external-facing products
  • Develop platform services, APIs, SDKs, and self-service tooling to enable teams to efficiently consume AI infrastructure capabilities
  • Manage and optimize compute, orchestration, and serving layers for AI and ML workloads in production environments
  • Build and maintain CI/CD pipelines to enable safe, fast, and reliable deployment of AI services and agent workflows
  • Implement Infrastructure as Code using tools such as Terraform to provision and manage AWS-based infrastructure
  • Design monitoring, alerting, and observability systems tailored to AI/ML and agent-based workloads
  • Define and enforce reliability patterns, guardrails, and failure recovery mechanisms for LLM and agentic systems
  • Manage Kubernetes-based container orchestration environments
  • Implement security best practices and access controls across infrastructure and platform services