Principal MLOps Engineer

Locations: Remote, US; DMV; McLean, VA; Boston, MA; San Antonio, TX; Colorado Springs, CO; Tampa, FL; Honolulu, HI
Full-Time · Principal
Salary: $150,000–$200,000 USD per year

Job Details

Required Skills
AWS, Azure, Docker, Kubernetes, Helm, Python, Git, CI/CD, DevOps, Agile, Scrum

Requirements

  • 7+ years of relevant hands-on experience in software engineering, platform engineering, DevOps, MLOps, or related technical roles
  • 5+ years of experience with Docker and Kubernetes in production environments
  • 5+ years of experience supporting enterprise cloud infrastructure or applications in AWS, Azure, or similar environments
  • Strong experience provisioning, operating, and troubleshooting Kubernetes clusters in production
  • Experience building and maintaining machine learning platforms, infrastructure, or pipelines used by engineering or data science teams
  • Practical experience deploying machine learning workloads on Kubernetes
  • Experience managing clusters or workloads that use GPUs
  • Strong understanding of Helm and Kubernetes deployment patterns
  • Strong scripting or programming skills, preferably in Python
  • Experience with modern software engineering practices including Git, CI/CD, DevOps, and Agile/Scrum workflows
  • Strong troubleshooting, systems thinking, and communication skills
  • Ability to work independently and collaboratively in a fast-moving environment
  • Ability to obtain Security+ certification within the first 90 days of employment

Responsibilities

  • Design, build, and maintain secure, scalable MLOps infrastructure and deployment pipelines for production ML systems
  • Help mature Raft’s internal ML platform and model lifecycle capabilities, including model packaging, registry/catalog workflows, deployment, monitoring, and operational support
  • Deploy and manage machine learning workloads on Kubernetes, including GPU-enabled clusters
  • Support model serving and inference infrastructure for a range of ML use cases, including traditional ML, computer vision, speech/audio, and LLM-based systems
  • Build and maintain CI/CD workflows for ML services, model artifacts, and platform components
  • Partner closely with ML engineers, software engineers, and product teams to move models from experimentation to reliable operational deployment
  • Improve observability, reliability, security, and maintainability across ML infrastructure and services
  • Help evaluate and standardize runtime patterns, serving frameworks, and deployment architectures for production ML workloads
  • Contribute to infrastructure decisions across edge, on-prem, and cloud-hosted deployment environments
  • Support compliance-driven deployment practices and secure software supply chain requirements in defense environments
  • Get hands-on with customers at the most forward-leaning places in the Department of War