DevOps Engineer - Senior Vice President, Platform Infrastructure (MLOps)

iCapital | Fintech
Remote - United States | Full-Time | VP
Salary: 180,000 - 230,000 USD per year

Job Details

Experience
15+ years
Required Skills
AWS, Python, DynamoDB, Kubernetes, Postgres, CI/CD, Linux, Terraform, LLM, MLOps, Generative AI

Requirements

  • 15+ years of experience in DevOps, SRE, or Platform Engineering, with AWS as the primary cloud platform.
  • Experience supporting machine learning systems in production, including deployment and monitoring concerns.
  • Hands-on experience with machine learning platforms, particularly Amazon SageMaker.
  • Strong hands-on experience with Kubernetes, containerized workloads, and cloud networking.
  • Proven experience building and operating CI/CD pipelines (e.g., GitLab CI, ArgoCD).
  • Strong proficiency with Terraform and scripting/programming in Python or similar languages.
  • Solid Linux, systems, and troubleshooting fundamentals.
  • Excellent communication skills and ability to work across teams.
  • Direct experience with MLOps platforms and tooling (model registries, experiment tracking, feature stores).
  • Exposure to Generative AI / LLM workloads in production environments.
  • Familiarity with data stores commonly used in ML systems (e.g., Postgres, DynamoDB, object storage).
  • Experience operating in regulated or fintech environments.
  • Background in cost optimization for compute-intensive workloads.
  • Strong written and verbal communication skills.
  • AWS certifications are a plus.

Responsibilities

  • Design, build, and operate MLOps pipelines supporting the full ML lifecycle (training, validation, deployment, monitoring).
  • Enable production workloads for AI/ML and Generative AI systems, including LLM-based services.
  • Develop and maintain CI/CD pipelines for AI/ML services and supporting infrastructure.
  • Build and manage cloud-native infrastructure on AWS, with heavy use of Kubernetes and containerized workloads.
  • Automate infrastructure provisioning and configuration using Infrastructure as Code (Terraform).
  • Implement model versioning, experiment tracking, and artifact management across environments.
  • Ensure reliability, scalability, observability, and cost efficiency of AI platforms.
  • Partner with AI/ML engineers to operationalize models and standardize deployment patterns.
  • Implement monitoring and alerting for system health, model performance, and drift.
  • Enforce security, compliance, and governance requirements for AI workloads.
  • Participate in incident response, root cause analysis, and continuous improvement initiatives.
  • Document standards, best practices, and reference architectures for MLOps and AI infrastructure.