Infrastructure Engineer (Storage)

Lightning AI
AI/ML, HPC
Remote within the U.S. | Full-Time | Senior
Salary: 180,000 - 200,000 USD per year

Job Details

Experience
5+ years of experience in infrastructure engineering, systems engineering, or related roles
Required Skills
Python, Linux

Requirements

  • 5+ years of experience in infrastructure engineering, systems engineering, or related roles
  • Hands-on experience operating distributed storage systems (e.g., VAST, Ceph, or similar)
  • Strong Linux systems experience in production environments
  • Proficiency in Python or similar scripting/programming languages for automation
  • Experience working with bare-metal infrastructure and hardware-oriented systems
  • Ability to debug complex issues across system boundaries (storage, OS, hardware, networking)
  • Experience with storage networking protocols (e.g., NFS or similar)
  • Experience with capacity planning, monitoring, and performance tuning
  • Experience with VAST storage systems in production environments
  • Experience operating S3-compatible object storage at scale
  • Data center operations experience, including working with physical hardware
  • Familiarity with AI/ML or HPC workloads and their storage requirements
  • Background in high-performance or low-latency distributed systems
  • Familiarity with high-performance data transfer technologies (e.g., RDMA, GPU Direct Storage)
  • Experience supporting GPU-based workloads or large-scale compute clusters

Responsibilities

  • Operate and scale distributed storage systems, including VAST and S3-compatible object storage (e.g., Ceph)
  • Improve performance, reliability, and efficiency of storage systems supporting large-scale AI/ML workloads
  • Troubleshoot complex storage and data path issues across hardware and software layers
  • Optimize storage performance to support high-throughput, low-latency AI training and inference workloads
  • Build and maintain automation for provisioning, managing, and monitoring storage infrastructure
  • Develop Python-based tools and workflows to reduce manual operational overhead
  • Manage and operate Linux-based systems in production, including bare-metal environments
  • Support capacity planning, utilization tracking, and forecasting for storage systems
  • Leverage monitoring and telemetry to diagnose issues and improve system performance and reliability
  • Work closely with Infrastructure Engineering, Network Engineering, and Platform teams to integrate storage into the broader platform