Senior MLOps Engineer - SRE | DevOps

Based in Brazil, EST/PST | Full-Time | Senior
Salary not disclosed

Job Details

Experience
5+ years
Required Skills
AWS, Kubernetes, CI/CD, DevOps, Terraform, LLM, MLOps

Requirements

  • 5+ years of experience in Platform Engineering, SRE, DevOps, or MLOps roles, operating production systems at scale.
  • Strong hands-on experience deploying and managing ML/AI workloads in production environments.
  • Deep SRE expertise, including SLO definition, incident response, postmortems, and reliability engineering practices.
  • Advanced experience with Infrastructure-as-Code using Terraform in complex, multi-account environments.
  • Strong GitOps experience with declarative infrastructure and deployment workflows.
  • Deep expertise in Kubernetes, including production operations and failure-mode troubleshooting.
  • Strong AWS knowledge, including networking, IAM, compute, storage, and distributed architectures.
  • Experience building CI/CD pipelines using tools such as GitHub Actions, GitLab CI, CircleCI, or similar.
  • Strong automation mindset, with the ability to eliminate manual operational work through engineering solutions.
  • Familiarity with agentic coding tools and the ability to use them effectively in infrastructure and pipeline development.
  • Strong communication skills to articulate technical decisions, trade-offs, and incident analysis clearly.

Responsibilities

  • Design, build, and operate scalable ML and inference infrastructure supporting real-time and batch workloads across multiple tenants.
  • Own the end-to-end ML deployment lifecycle, including model registry, versioning, rollout strategies, and safe rollback mechanisms.
  • Operate and optimize production-grade AI and LLM workloads, managing inference providers, throttling, quotas, and fallback strategies under load.
  • Develop and maintain reproducible ML pipelines for training, evaluation, and deployment with full lineage and automation.
  • Implement Infrastructure-as-Code practices using Terraform, ensuring scalable multi-account cloud architectures.
  • Manage GitOps workflows using tools such as ArgoCD to ensure reliable and consistent deployments across environments.
  • Operate Kubernetes-based infrastructure (AWS EKS), including GPU scheduling, workload isolation, and cost-aware scaling strategies.
  • Define and enforce SRE best practices, including SLOs, observability, incident response, and performance monitoring for ML systems.
  • Drive cost optimization initiatives across ML workloads, including resource right-sizing and efficient infrastructure utilization.
  • Improve automation across the ML lifecycle using modern engineering and agentic coding tools.