- Architect, build, and scale the end-to-end ML Ops pipeline, including training, fine-tuning, evaluation, rollout, and monitoring.
- Design reliable infrastructure for model deployment, versioning, reproducibility, and orchestration across cloud and on-prem GPU clusters.
- Optimize compute usage across distributed systems, including Kubernetes, autoscaling, caching, and GPU allocation.
- Lead the implementation of observability for ML systems to monitor drift, performance, throughput, and reliability.
- Build automated workflows for dataset curation, labeling, feature pipelines, and CI/CD for ML models.
- Collaborate with researchers to productionize models and accelerate training/inference pipelines.
- Establish ML Ops best practices and cross-team tooling while mentoring engineers.
AWSDockerPython+6 more