Senior MLOps Engineer - SRE | DevOps

Based in Brazil, EST/PST | Full-Time | Senior

Salary not disclosed

Job Details

Requirements

5+ years of experience in Platform Engineering, SRE, DevOps, or MLOps roles, operating production systems at scale.
Strong hands-on experience deploying and managing ML/AI workloads in production environments.
Deep SRE expertise, including SLO definition, incident response, postmortems, and reliability engineering practices.
Advanced experience with Infrastructure-as-Code using Terraform in complex, multi-account environments.
Strong GitOps experience with declarative infrastructure and deployment workflows.
Deep expertise in Kubernetes, including production operations and failure-mode troubleshooting.
Strong AWS knowledge, including networking, IAM, compute, storage, and distributed architectures.
Experience building CI/CD pipelines using tools such as GitHub Actions, GitLab CI, CircleCI, or similar.
Strong automation mindset, with the ability to eliminate manual operational work through engineering solutions.
Familiarity with agentic coding tools and the ability to use them effectively in infrastructure and pipeline development.
Strong communication skills to articulate technical decisions, trade-offs, and incident analysis clearly.

Responsibilities

Design, build, and operate scalable ML and inference infrastructure supporting real-time and batch workloads across multiple tenants.
Own the end-to-end ML deployment lifecycle, including model registry, versioning, rollout strategies, and safe rollback mechanisms.
Operate and optimize production-grade AI and LLM workloads, managing inference providers, throttling, quotas, and fallback strategies under load.
Develop and maintain reproducible ML pipelines for training, evaluation, and deployment with full lineage and automation.
Implement Infrastructure-as-Code practices using Terraform, ensuring scalable multi-account cloud architectures.
Manage GitOps workflows using tools such as ArgoCD to ensure reliable and consistent deployments across environments.
Operate Kubernetes-based infrastructure (AWS EKS), including GPU scheduling, workload isolation, and cost-aware scaling strategies.
Define and enforce SRE best practices, including SLOs, observability, incident response, and performance monitoring for ML systems.
Drive cost optimization initiatives across ML workloads, including resource right-sizing and efficient infrastructure utilization.
Improve automation across the ML lifecycle using modern engineering and agentic coding tools.
