Senior Engineer, ML Training Platform
Posted about 1 month ago
Requirements:
- 3+ years of Python experience with strong coding skills.
- Experience designing, developing, and maintaining production systems on AWS with Kubernetes.
- Hands-on experience with popular ML frameworks (PyTorch or TensorFlow).
- Experience with CPU and GPU performance optimization.
- Proven track record of operating highly available systems at scale.
Responsibilities:
- Build and maintain scalable ML data processing and model training solutions on AWS using Kubernetes.
- Implement solutions in the ML codebase (Python, PyTorch) and the platform codebase (Volcano and Ray).
- Optimize training and infrastructure performance across various GPU types to improve training speed and efficiency.
- Communicate with machine learning engineers to identify obstacles and prioritize solutions.