DevOps Engineer - Senior Vice President, Platform Infrastructure (MLOps)
iCapital (Fintech)
Remote - United States | Full-Time | VP
Salary: 180,000 - 230,000 USD per year
Job Details
- Experience: 15+ years
- Required Skills: AWS, Python, DynamoDB, Kubernetes, Postgres, CI/CD, Linux, Terraform, LLM, MLOps, Generative AI
Requirements
- 15+ years of experience in DevOps, SRE, or Platform Engineering, with AWS as the primary cloud platform.
- Experience supporting machine learning systems in production, including deployment and monitoring concerns.
- Hands-on experience with machine learning platforms, particularly AWS SageMaker.
- Strong hands-on experience with Kubernetes, containerized workloads, and cloud networking.
- Proven experience building and operating CI/CD pipelines (e.g., GitLab CI, ArgoCD).
- Strong proficiency with Terraform and scripting/programming in Python or similar languages.
- Solid Linux, systems, and troubleshooting fundamentals.
- Excellent communication skills and ability to work across teams.
- Direct experience with MLOps platforms and tooling (model registries, experiment tracking, feature stores).
- Exposure to Generative AI / LLM workloads in production environments.
- Familiarity with data stores commonly used in ML systems (e.g., Postgres, DynamoDB, object storage).
- Experience operating in regulated or fintech environments.
- Background in cost optimization for compute-intensive workloads.
- Strong written and verbal communication skills.
- AWS certifications are a plus.
Responsibilities
- Design, build, and operate MLOps pipelines supporting the full ML lifecycle (training, validation, deployment, monitoring).
- Enable production workloads for AI/ML and Generative AI systems, including LLM-based services.
- Develop and maintain CI/CD pipelines for AI/ML services and supporting infrastructure.
- Build and manage cloud-native infrastructure on AWS, with heavy use of Kubernetes and containerized workloads.
- Automate infrastructure provisioning and configuration using Infrastructure as Code (Terraform).
- Implement model versioning, experiment tracking, and artifact management across environments.
- Ensure reliability, scalability, observability, and cost efficiency of AI platforms.
- Partner with AI/ML engineers to operationalize models and standardize deployment patterns.
- Implement monitoring and alerting for system health, model performance, and drift.
- Enforce security, compliance, and governance requirements for AI workloads.
- Participate in incident response, root cause analysis, and continuous improvement initiatives.
- Document standards, best practices, and reference architectures for MLOps and AI infrastructure.