AI Infrastructure Engineer
100% remote work opportunity within the continental United States.
Full-Time, Senior
Salary not disclosed
Job Details
- Experience: 6 years
- Required Skills: Python, Kubernetes, PyTorch, C++, Go, Linux
Requirements
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
- Minimum of 6 years of experience in infrastructure engineering, platform engineering, high-performance computing, or related domains.
- Hands-on experience operating GPU clusters and large-scale machine learning infrastructure in production environments.
- Strong programming skills in Python and at least one systems programming language such as Go or C++.
- Deep understanding of distributed training architectures, accelerator technologies, and collective communication frameworks.
- Experience with Kubernetes, Slurm, Ray, or comparable orchestration and scheduling systems for ML workloads.
- Strong expertise in Linux internals, networking, storage systems, and distributed systems operations.
- Experience working with major cloud providers and cloud-native AI infrastructure services.
- Solid understanding of software engineering best practices including CI/CD, testing, automation, and code review processes.
- Excellent troubleshooting, analytical, documentation, and cross-functional collaboration skills.
Responsibilities
- Design, deploy, and operate GPU and accelerator infrastructure supporting large-scale AI training and inference workloads across cloud, on-premises, and hybrid environments.
- Build and optimize scheduling, queueing, and resource-sharing systems to maximize accelerator utilization.
- Integrate distributed training frameworks and libraries such as PyTorch, JAX, DeepSpeed, FSDP, Megatron-LM, and Ray Train.
- Manage high-performance storage systems and data pipelines.
- Design and maintain networking architectures supporting RDMA, InfiniBand, NCCL, and high-bandwidth distributed communication protocols.
- Develop observability, monitoring, and analytics solutions for AI workloads.
- Implement checkpointing, fault tolerance, and resiliency strategies for long-running distributed training jobs.
- Drive infrastructure cost optimization initiatives across compute, storage, networking, and cloud resource utilization.
- Create developer tooling, automation workflows, and self-service platform capabilities.
- Implement security controls, multi-tenant isolation strategies, and access management policies.