DevOps Engineer - Senior Vice President, Platform Infrastructure (MLOps)

Remote - United States | Full-Time | VP

Salary: 180,000 - 230,000 USD per year

Job Details

Experience: 15+ years
Required Skills: AWS, Python, DynamoDB, Kubernetes, Postgres, CI/CD, Linux, Terraform, LLM, MLOps, Generative AI

Qualifications

15+ years of experience in DevOps, SRE, or Platform Engineering, with AWS as a primary cloud.
Experience supporting machine learning systems in production, including deployment and monitoring concerns.
Hands-on experience with machine learning platforms, particularly AWS SageMaker.
Strong hands-on experience with Kubernetes, containerized workloads, and cloud networking.
Proven experience building and operating CI/CD pipelines (e.g., GitLab CI, ArgoCD).
Strong proficiency with Terraform and scripting/programming in Python or similar languages.
Solid Linux, systems, and troubleshooting fundamentals.
Excellent communication skills and ability to work across teams.
Direct experience with MLOps platforms and tooling (model registries, experiment tracking, feature stores).
Exposure to Generative AI / LLM workloads in production environments.
Familiarity with data stores commonly used in ML systems (e.g., Postgres, DynamoDB, object storage).
Experience operating in regulated or fintech environments.
Background in cost optimization for compute-intensive workloads.
Strong written and verbal communication skills.
AWS certifications are a plus.

Responsibilities

Design, build, and operate MLOps pipelines supporting the full ML lifecycle (training, validation, deployment, monitoring).
Enable production workloads for AI/ML and Generative AI systems, including LLM-based services.
Develop and maintain CI/CD pipelines for AI/ML services and supporting infrastructure.
Build and manage cloud-native infrastructure on AWS, with heavy use of Kubernetes and containerized workloads.
Automate infrastructure provisioning and configuration using Infrastructure as Code (Terraform).
Implement model versioning, experiment tracking, and artifact management across environments.
Ensure reliability, scalability, observability, and cost efficiency of AI platforms.
Partner with AI/ML engineers to operationalize models and standardize deployment patterns.
Implement monitoring and alerting for system health, model performance, and drift.
Enforce security, compliance, and governance requirements for AI workloads.
Participate in incident response, root cause analysis, and continuous improvement initiatives.
Document standards, best practices, and reference architectures for MLOps and AI infrastructure.
