ML Infrastructure Engineer

100% remote (within the United States) | Full-Time | Senior
Salary: 100,000 - 150,000 USD per year

Job Details

Experience
6+ years
Required Skills
Python, Kubernetes, PyTorch, C++, Go, Linux

Requirements

  • Bachelor’s or Master’s degree in Computer Science or related field.
  • 6+ years of experience in infrastructure, platform engineering, or high-performance computing environments.
  • Hands-on experience operating GPU clusters or large-scale ML training systems in production.
  • Strong proficiency in Python and at least one systems programming language (Go or C++ preferred).
  • Deep understanding of distributed systems, accelerator architectures, and ML training workflows.
  • Experience with Kubernetes, Slurm, Ray, or similar orchestration/scheduling systems.
  • Strong knowledge of Linux internals, networking concepts, and high-performance storage systems.
  • Familiarity with at least one major cloud provider’s ML infrastructure stack.
  • Solid software engineering practices including testing, CI/CD, and code review workflows.
  • Strong communication skills and ability to collaborate across research and engineering teams.

Responsibilities

  • Design, build, and operate GPU and accelerator infrastructure for large-scale training and inference workloads across cloud, on-prem, and hybrid environments.
  • Develop scheduling, queueing, and resource management systems to maximize utilization of compute clusters.
  • Integrate and support ML frameworks and distributed training libraries such as PyTorch, JAX, DeepSpeed, FSDP, and Megatron-LM, as well as Ray-based training workflows.
  • Build and maintain high-performance storage and data pipelines ensuring consistent GPU throughput.
  • Design and optimize networking layers including RDMA, InfiniBand, and NCCL-based communication.
  • Implement observability, monitoring, and failure analysis tools for distributed ML workloads.
  • Drive automation for provisioning, lifecycle management, and infrastructure configuration.
  • Partner with ML teams to forecast capacity needs and improve developer workflows and tooling.
  • Ensure security, isolation, and multi-tenant access control across AI infrastructure systems.
  • Optimize cost efficiency across compute, storage, and networking through intelligent resource management.