Principal ML Ops Engineer
Pragmatike
Artificial Intelligence
Relocation package available or remote option for out-of-state applicants, Eastern Time
Full-Time
Principal
Salary: Competitive salary and equity options. Sign-on bonus.
Job Details
- Languages: English
- Required Skills: AWS, Docker, Python, GCP, Kubernetes, PyTorch, Azure, MLOps, Distributed Systems
Requirements
- Staff/Principal-level hands-on experience designing and operating production ML systems at scale.
- Strong background in ML Ops, distributed systems, and cloud infrastructure (AWS, GCP, or Azure).
- Proficiency with Python; familiarity with TypeScript or Go.
- Expertise in ML frameworks: PyTorch, Transformers, vLLM, LLaMA-Factory, Megatron-LM.
- Practical understanding of CUDA and GPU acceleration.
- Strong experience with containerization and orchestration (Docker, Kubernetes, Helm, autoscaling).
- Deep understanding of ML lifecycle workflows including training, fine-tuning, evaluation, and model registries.
- Ability to lead technical strategy and collaborate cross-functionally.
Responsibilities
- Architect, build, and scale the end-to-end ML Ops pipeline, including training, fine-tuning, evaluation, rollout, and monitoring.
- Design reliable infrastructure for model deployment, versioning, reproducibility, and orchestration across cloud and on-prem GPU clusters.
- Optimize compute usage across distributed systems, including Kubernetes, autoscaling, caching, and GPU allocation.
- Lead the implementation of observability for ML systems to monitor drift, performance, throughput, and reliability.
- Build automated workflows for dataset curation, labeling, feature pipelines, and CI/CD for ML models.
- Collaborate with researchers to productionize models and accelerate training/inference pipelines.
- Establish ML Ops best practices and cross-team tooling while mentoring engineers.