Principal ML Ops Engineer

Pragmatike
Artificial Intelligence

Relocation package available, or remote option for out-of-state applicants (Eastern Time)
Full-Time
Principal

Salary: Competitive salary and equity options. Sign-on bonus.

Job Details

Languages: English
Required Skills: AWS, Docker, Python, GCP, Kubernetes, PyTorch, Azure, MLOps, Distributed Systems

Requirements

Staff/Principal-level hands-on experience designing and operating production ML systems at scale.
Strong background in ML Ops, distributed systems, and cloud infrastructure (AWS, GCP, or Azure).
Proficiency with Python; familiarity with TypeScript or Go.
Expertise in ML frameworks and tooling: PyTorch, Transformers, vLLM, LLaMA-Factory, Megatron-LM.
Practical understanding of CUDA and GPU acceleration.
Strong experience with containerization and orchestration (Docker, Kubernetes, Helm, autoscaling).
Deep understanding of ML lifecycle workflows including training, fine-tuning, evaluation, and model registries.
Ability to lead technical strategy and collaborate cross-functionally.

Responsibilities

Architect, build, and scale the end-to-end ML Ops pipeline, including training, fine-tuning, evaluation, rollout, and monitoring.
Design reliable infrastructure for model deployment, versioning, reproducibility, and orchestration across cloud and on-prem GPU clusters.
Optimize compute usage across distributed systems, including Kubernetes, autoscaling, caching, and GPU allocation.
Lead the implementation of observability for ML systems to monitor drift, performance, throughput, and reliability.
Build automated workflows for dataset curation, labeling, feature pipelines, and CI/CD for ML models.
Collaborate with researchers to productionize models and accelerate training/inference pipelines.
Establish ML Ops best practices and cross-team tooling while mentoring engineers.
