Identify, study, and troubleshoot hardware problems during training at scale Minimize the GPU idle time during faults, both operationally and strategically Design and develop tools and add-ons to accelerate the training recovery Improve the performance and reliability of checkpointing Write high-quality Python (PyTorch), Cython, C/C++, CUDA API code