ML Infrastructure Engineer

100% remote (within the United States), Full-Time, Senior

Salary: 100,000 - 150,000 USD per year

Job Details

Requirements

Bachelor’s or Master’s degree in Computer Science or a related field.
6+ years of experience in infrastructure, platform engineering, or high-performance computing environments.
Hands-on experience operating GPU clusters or large-scale ML training systems in production.
Strong proficiency in Python and at least one systems programming language (Go or C++ preferred).
Deep understanding of distributed systems, accelerator architectures, and ML training workflows.
Experience with Kubernetes, Slurm, Ray, or similar orchestration/scheduling systems.
Strong knowledge of Linux internals, networking concepts, and high-performance storage systems.
Familiarity with at least one major cloud provider’s ML infrastructure stack.
Solid software engineering practices including testing, CI/CD, and code review workflows.
Strong communication skills and the ability to collaborate across research and engineering teams.

Responsibilities

Design, build, and operate GPU and accelerator infrastructure for large-scale training and inference workloads across cloud, on-prem, and hybrid environments.
Develop scheduling, queueing, and resource management systems to maximize utilization of compute clusters.
Integrate and support ML frameworks and training libraries such as PyTorch, JAX, DeepSpeed, FSDP, and Megatron-LM, as well as Ray-based training workflows.
Build and maintain high-performance storage and data pipelines ensuring consistent GPU throughput.
Design and optimize networking layers including RDMA, InfiniBand, and NCCL-based communication.
Implement observability, monitoring, and failure analysis tools for distributed ML workloads.
Drive automation for provisioning, lifecycle management, and infrastructure configuration.
Partner with ML teams to forecast capacity needs and improve developer workflows and tooling.
Ensure security, isolation, and multi-tenant access control across AI infrastructure systems.
Optimize cost efficiency across compute, storage, and networking through intelligent resource management.
