- Ensuring storage is reliable, predictable, and never a bottleneck for critical workloads across the company
- Owning the performance and stability of storage systems, and continuously improving them as data volumes and workloads grow
- Designing and evolving data placement, resiliency, and lifecycle strategies to balance performance, cost, and reliability
- Ensuring the platform behaves predictably during failures, maintenance, and scaling events
- Improving how storage integrates with compute environments (GPU/HPC, Kubernetes, data pipelines)
- Driving faster, more reliable incident detection and resolution, and preventing recurrence
- Improving capacity planning to avoid emergency scaling and unexpected performance degradation
- Continuously improving tooling, automation, and operational practices to make the platform easier to operate and scale
Linux