Principal MLOps Engineer
Raft (Defense Tech)
Locations: Remote, US; DMV; McLean, VA; Boston, MA; San Antonio, TX; Colorado Springs, CO; Tampa, FL; Honolulu, HI
Full-Time | Principal
Salary: $150,000 to $200,000 USD per year
Job Details
- Required Skills
- AWS, Docker, Python, Agile, Git, Kubernetes, Scrum, Azure, CI/CD, DevOps, Helm
Requirements
- 7+ years of relevant hands-on experience in software engineering, platform engineering, DevOps, MLOps, or related technical roles
- 5+ years of experience with Docker and Kubernetes in production environments
- 5+ years of experience supporting enterprise cloud infrastructure or applications in AWS, Azure, or similar environments
- Strong experience provisioning, operating, and troubleshooting Kubernetes clusters in production
- Experience building and maintaining machine learning platforms, infrastructure, or pipelines used by engineering or data science teams
- Practical experience deploying machine learning workloads on Kubernetes
- Experience managing clusters or workloads that use GPUs
- Strong understanding of Helm and Kubernetes deployment patterns
- Strong scripting or programming skills, preferably in Python
- Experience with modern software engineering practices including Git, CI/CD, DevOps, and Agile/Scrum workflows
- Strong troubleshooting, systems thinking, and communication skills
- Ability to work independently and collaboratively in a fast-moving environment
- Ability to obtain Security+ certification within the first 90 days of employment
Responsibilities
- Design, build, and maintain secure, scalable MLOps infrastructure and deployment pipelines for production ML systems
- Help mature Raft’s internal ML platform and model lifecycle capabilities, including model packaging, registry/catalog workflows, deployment, monitoring, and operational support
- Deploy and manage machine learning workloads on Kubernetes, including GPU-enabled clusters
- Support model serving and inference infrastructure for a range of ML use cases, including traditional ML, computer vision, speech/audio, and LLM-based systems
- Build and maintain CI/CD workflows for ML services, model artifacts, and platform components
- Partner closely with ML engineers, software engineers, and product teams to move models from experimentation to reliable operational deployment
- Improve observability, reliability, security, and maintainability across ML infrastructure and services
- Help evaluate and standardize runtime patterns, serving frameworks, and deployment architectures for production ML workloads
- Contribute to infrastructure decisions across edge, on-prem, and cloud-hosted deployment environments
- Support compliance-driven deployment practices and secure software supply chain requirements in defense environments
- Get hands-on with customers at the most forward-leaning places in the Department of War