- Design and maintain automated ML training pipelines.
- Build infrastructure for large-scale distributed experimentation.
- Develop CI/CD workflows tailored for machine learning systems.
- Orchestrate data ingestion, preprocessing, validation, and model versioning.
- Implement experiment tracking, hyperparameter tuning automation, and reproducibility systems.
- Optimize GPU/compute utilization across cloud and on-prem environments.
- Deploy, monitor, and maintain production ML models.
- Establish and enforce MLOps best practices, including model registry, artifact management, and observability.
- Improve system reliability, performance, and security.
- Collaborate closely with ML researchers to make new algorithms product-ready.
- Handle more typical DevOps responsibilities for software development as required.
AWS, Docker, Python