Infrastructure Engineer (Storage)
Lightning AI | AI/ML, HPC
Remote within the U.S. | Full-Time | Senior
Salary: 180,000 - 200,000 USD per year
Job Details
- Experience: 5+ years of experience in infrastructure engineering, systems engineering, or related roles
- Required Skills: Python, Linux
Requirements
- 5+ years of experience in infrastructure engineering, systems engineering, or related roles
- Hands-on experience operating distributed storage systems (e.g., VAST, Ceph, or similar)
- Strong Linux systems experience in production environments
- Proficiency in Python or similar scripting/programming languages for automation
- Experience working with bare-metal infrastructure and hardware-oriented systems
- Ability to debug complex issues across system boundaries (storage, OS, hardware, networking)
- Experience with storage networking protocols (e.g., NFS or similar)
- Experience with capacity planning, monitoring, and performance tuning
- Experience with VAST storage systems in production environments
- Experience operating S3-compatible object storage at scale
- Data center operations experience, including working with physical hardware
- Familiarity with AI/ML or HPC workloads and their storage requirements
- Background in high-performance or low-latency distributed systems
- Familiarity with high-performance data transfer technologies (e.g., RDMA, GPU Direct Storage)
- Experience supporting GPU-based workloads or large-scale compute clusters
Responsibilities
- Operate and scale distributed storage systems, including VAST and S3-compatible object storage (e.g., Ceph)
- Improve performance, reliability, and efficiency of storage systems supporting large-scale AI/ML workloads
- Troubleshoot complex storage and data path issues across hardware and software layers
- Optimize storage performance to support high-throughput, low-latency AI training and inference workloads
- Build and maintain automation for provisioning, managing, and monitoring storage infrastructure
- Develop Python-based tools and workflows to reduce manual operational overhead
- Manage and operate Linux-based systems in production, including bare-metal environments
- Support capacity planning, utilization tracking, and forecasting for storage systems
- Leverage monitoring and telemetry to diagnose issues and improve system performance and reliability
- Work closely with Infrastructure Engineering, Network Engineering, and Platform teams to integrate storage into the broader platform