AI Infrastructure Engineer

100% remote work opportunity within the continental United States. Full-Time, Senior.

Salary not disclosed

Job Details

Qualifications

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
Minimum of 6 years of experience in infrastructure engineering, platform engineering, high-performance computing, or related domains.
Hands-on experience operating GPU clusters and large-scale machine learning infrastructure in production environments.
Strong programming skills in Python and at least one systems programming language such as Go or C++.
Deep understanding of distributed training architectures, accelerator technologies, and collective communication frameworks.
Experience with Kubernetes, Slurm, Ray, or comparable orchestration and scheduling systems for ML workloads.
Strong expertise in Linux internals, networking, storage systems, and distributed systems operations.
Experience working with major cloud providers and cloud-native AI infrastructure services.
Solid understanding of software engineering best practices including CI/CD, testing, automation, and code review processes.
Excellent troubleshooting, analytical, documentation, and cross-functional collaboration skills.

Responsibilities

Design, deploy, and operate GPU and accelerator infrastructure supporting large-scale AI training and inference workloads across cloud, on-premises, and hybrid environments.
Build and optimize scheduling, queueing, and resource-sharing systems to maximize accelerator utilization.
Integrate distributed training frameworks such as PyTorch, JAX, DeepSpeed, FSDP, Megatron-LM, and Ray Train.
Manage high-performance storage systems and data pipelines.
Design and maintain networking architectures supporting RDMA, InfiniBand, NCCL-based collective communication, and other high-bandwidth distributed communication protocols.
Develop observability, monitoring, and analytics solutions for AI workloads.
Implement checkpointing, fault tolerance, and resiliency strategies for long-running distributed training jobs.
Drive infrastructure cost optimization initiatives across compute, storage, networking, and cloud resource utilization.
Create developer tooling, automation workflows, and self-service platform capabilities.
Implement security controls, multi-tenant isolation strategies, and access management policies.
