Senior MLOps Engineer - SRE | DevOps
Based in Brazil, EST/PST | Full-Time | Senior
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: AWS, Kubernetes, CI/CD, DevOps, Terraform, LLM, MLOps
Requirements
- 5+ years of experience in Platform Engineering, SRE, DevOps, or MLOps roles, operating production systems at scale.
- Strong hands-on experience deploying and managing ML/AI workloads in production environments.
- Deep SRE expertise, including SLO definition, incident response, postmortems, and reliability engineering practices.
- Advanced experience with Infrastructure-as-Code using Terraform in complex, multi-account environments.
- Strong GitOps experience with declarative infrastructure and deployment workflows.
- Deep expertise in Kubernetes, including production operations and failure-mode troubleshooting.
- Strong AWS knowledge, including networking, IAM, compute, storage, and distributed architectures.
- Experience building CI/CD pipelines using tools such as GitHub Actions, GitLab CI, CircleCI, or similar.
- Strong automation mindset, with the ability to eliminate manual operational work through engineering solutions.
- Familiarity with agentic coding tools and the ability to use them effectively in infrastructure and pipeline development.
- Strong communication skills to articulate technical decisions, trade-offs, and incident analysis clearly.
Responsibilities
- Design, build, and operate scalable ML and inference infrastructure supporting real-time and batch workloads across multiple tenants.
- Own the end-to-end ML deployment lifecycle, including model registry, versioning, rollout strategies, and safe rollback mechanisms.
- Operate and optimize production-grade AI and LLM workloads, managing inference providers, throttling, quotas, and fallback strategies under load.
- Develop and maintain reproducible ML pipelines for training, evaluation, and deployment with full lineage and automation.
- Implement Infrastructure-as-Code practices using Terraform, ensuring scalable multi-account cloud architectures.
- Manage GitOps workflows using tools such as ArgoCD to ensure reliable and consistent deployments across environments.
- Operate Kubernetes-based infrastructure (AWS EKS), including GPU scheduling, workload isolation, and cost-aware scaling strategies.
- Define and enforce SRE best practices, including SLOs, observability, incident response, and performance monitoring for ML systems.
- Drive cost optimization initiatives across ML workloads, including resource right-sizing and efficient infrastructure utilization.
- Improve automation across the ML lifecycle using modern engineering and agentic coding tools.