Site Reliability Engineer - AI Agents
Based in the United States | Full-Time | Senior
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: AWS, Docker, Python, Kubernetes, Terraform, MLOps
Requirements
- 5+ years of experience in Site Reliability Engineering, Platform Engineering, Infrastructure Engineering, or similar production-focused roles
- Hands-on experience supporting ML systems, model serving infrastructure, or MLOps pipelines in production environments
- Strong experience building developer platforms, internal tools, APIs, or SDKs used by engineering teams at scale
- Deep understanding of platform engineering principles
- Strong proficiency with Infrastructure as Code tools, particularly Terraform
- Advanced experience with Kubernetes and containerized environments (Docker)
- Solid cloud infrastructure experience, preferably within AWS environments
- Strong programming and scripting skills (Python preferred, plus Bash/shell proficiency)
- Experience designing and operating observability, logging, monitoring, and alerting systems
- Proven experience with incident response, on-call rotations, and production reliability ownership
Responsibilities
- Design, build, and operate cloud-native infrastructure supporting AI agent execution, orchestration, and model serving at scale
- Ensure reliability, observability, and performance of distributed agentic systems across internal and external-facing products
- Develop platform services, APIs, SDKs, and self-service tooling to enable teams to efficiently consume AI infrastructure capabilities
- Manage and optimize compute, orchestration, and serving layers for AI and ML workloads in production environments
- Build and maintain CI/CD pipelines to enable safe, fast, and reliable deployment of AI services and agent workflows
- Implement Infrastructure as Code using tools such as Terraform to provision and manage AWS-based infrastructure
- Design monitoring, alerting, and observability systems tailored to AI/ML and agent-based workloads
- Define and enforce reliability patterns, guardrails, and failure recovery mechanisms for LLM and agentic systems
- Manage Kubernetes-based container orchestration environments
- Implement security best practices and access controls across infrastructure and platform services