DevOps / Platform Engineer - AI/ML
Brazil, Contract, Senior
Salary not disclosed
Job Details
- Experience: 6+ years
- Required Skills: AWS, Python, Bash, GCP, Jenkins, Kubernetes, Azure, Grafana, Prometheus, Terraform, GitHub Actions, Azure DevOps, Datadog
Requirements
- 6+ years of experience in DevOps, SRE, or platform engineering roles
- At least 2 years supporting AI/ML production workloads
- Strong expertise in infrastructure-as-code tools such as Terraform
- Knowledge of alternative infrastructure-as-code tools (Pulumi, CloudFormation, Bicep)
- Hands-on experience with Kubernetes (EKS, AKS, or GKE), including cluster operations, scaling, and troubleshooting
- Solid experience designing and maintaining CI/CD pipelines using tools such as GitHub Actions, GitLab CI, Azure DevOps, or Jenkins
- Strong cloud experience across at least two major providers (AWS, Azure, GCP), including networking, compute, storage, and IAM
- Proficiency in scripting and automation using Python, Bash, or PowerShell
- Familiarity with Go or TypeScript is a plus
- Experience with observability stacks (Prometheus, Grafana, Datadog, ELK, OpenTelemetry) and incident management tools
- Strong understanding of security best practices, including secrets management, IAM, container security, and compliance automation
- Experience with GitOps workflows (ArgoCD, Flux) and modern platform engineering practices
- Excellent communication skills and ability to work effectively in global, multicultural teams
Responsibilities
- Design, implement, and maintain end-to-end CI/CD pipelines supporting applications, infrastructure-as-code, data pipelines, and AI/ML model deployments.
- Build and manage cloud-agnostic infrastructure across AWS, Azure, and/or GCP, ensuring scalability, portability, and reliability.
- Develop and operate Kubernetes-based environments for containerized workloads, including microservices and AI model serving systems.
- Implement Infrastructure-as-Code solutions using Terraform or similar tools to create reusable, modular, and secure infrastructure components.
- Lead MLOps initiatives, including model training environments, deployment pipelines, model registries, and production monitoring systems.
- Establish observability, reliability, and security practices, including monitoring, logging, alerting, and incident response frameworks.
- Design and support internal developer platforms to improve self-service infrastructure provisioning and developer experience.
- Drive cloud cost optimization, security automation, and compliance alignment across distributed environments.