ML Infrastructure Engineer
100% remote (within the United States), Full-Time, Senior
Salary: 100,000 - 150,000 USD per year
Job Details
- Experience: 6+ years
- Required Skills: Python, Kubernetes, PyTorch, C++, Go, Linux
Requirements
- Bachelor’s or Master’s degree in Computer Science or related field.
- 6+ years of experience in infrastructure, platform engineering, or high-performance computing environments.
- Hands-on experience operating GPU clusters or large-scale ML training systems in production.
- Strong proficiency in Python and at least one systems programming language (Go or C++ preferred).
- Deep understanding of distributed systems, accelerator architectures, and ML training workflows.
- Experience with Kubernetes, Slurm, Ray, or similar orchestration/scheduling systems.
- Strong knowledge of Linux internals, networking concepts, and high-performance storage systems.
- Familiarity with at least one major cloud provider’s ML infrastructure stack.
- Solid software engineering practices including testing, CI/CD, and code review workflows.
- Strong communication skills and ability to collaborate across research and engineering teams.
Responsibilities
- Design, build, and operate GPU and accelerator infrastructure for large-scale training and inference workloads across cloud, on-prem, and hybrid environments.
- Develop scheduling, queueing, and resource management systems to maximize utilization of compute clusters.
- Integrate and support ML frameworks such as PyTorch, JAX, DeepSpeed, FSDP, Megatron-LM, and Ray-based training workflows.
- Build and maintain high-performance storage and data pipelines ensuring consistent GPU throughput.
- Design and optimize networking layers including RDMA, InfiniBand, and NCCL-based communication.
- Implement observability, monitoring, and failure analysis tools for distributed ML workloads.
- Drive automation for provisioning, lifecycle management, and infrastructure configuration.
- Partner with ML teams to forecast capacity needs and improve developer workflows and tooling.
- Ensure security, isolation, and multi-tenant access control across AI infrastructure systems.
- Optimize cost efficiency across compute, storage, and networking through intelligent resource management.