- Operate and scale distributed storage systems, including VAST and S3-compatible object storage (e.g., Ceph)
- Improve performance, reliability, and efficiency of storage systems supporting large-scale AI/ML workloads
- Troubleshoot complex storage and data path issues across hardware and software layers
- Optimize storage performance to support high-throughput, low-latency AI training and inference workloads
- Build and maintain automation for provisioning, managing, and monitoring storage infrastructure
- Develop Python-based tools and workflows to reduce manual operational overhead
- Manage and operate Linux-based systems in production, including bare-metal environments
- Support capacity planning, utilization tracking, and forecasting for storage systems
- Leverage monitoring and telemetry to diagnose issues and improve system performance and reliability
- Work closely with Infrastructure Engineering, Network Engineering, and Platform teams to integrate storage into the broader platform
Python, Linux