AI Infrastructure Engineer

100% remote work opportunity within the Continental United States.
Full-Time
Senior
Salary not disclosed

Job Details

Experience
6 years
Required Skills
Python, Kubernetes, PyTorch, C++, Go, Linux

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
  • Minimum of 6 years of experience in infrastructure engineering, platform engineering, high-performance computing, or related domains.
  • Hands-on experience operating GPU clusters and large-scale machine learning infrastructure in production environments.
  • Strong programming skills in Python and at least one systems programming language such as Go or C++.
  • Deep understanding of distributed training architectures, accelerator technologies, and collective communication frameworks.
  • Experience with Kubernetes, Slurm, Ray, or comparable orchestration and scheduling systems for ML workloads.
  • Strong expertise in Linux internals, networking, storage systems, and distributed systems operations.
  • Experience working with major cloud providers and cloud-native AI infrastructure services.
  • Solid understanding of software engineering best practices including CI/CD, testing, automation, and code review processes.
  • Excellent troubleshooting, analytical, documentation, and cross-functional collaboration skills.

Responsibilities

  • Design, deploy, and operate GPU and accelerator infrastructure supporting large-scale AI training and inference workloads across cloud, on-premise, and hybrid environments.
  • Build and optimize scheduling, queueing, and resource-sharing systems to maximize accelerator utilization.
  • Integrate distributed training frameworks such as PyTorch, JAX, DeepSpeed, FSDP, Megatron-LM, and Ray Train.
  • Manage high-performance storage systems and data pipelines.
  • Design and maintain networking architectures supporting RDMA, InfiniBand, NCCL-based collective communication, and other high-bandwidth distributed communication.
  • Develop observability, monitoring, and analytics solutions for AI workloads.
  • Implement checkpointing, fault tolerance, and resiliency strategies for long-running distributed training jobs.
  • Drive infrastructure cost optimization initiatives across compute, storage, networking, and cloud resource utilization.
  • Create developer tooling, automation workflows, and self-service platform capabilities.
  • Implement security controls, multi-tenant isolation strategies, and access management policies.