- Proficiency in C++ and Python
- Deep understanding of HPC concepts (MPI, BSP, multi-GPU/multi-node distributed computing)
- CUDA/ROCm programming experience preferred
- Solid understanding of gradient descent and backpropagation algorithms (a minimal sketch follows this list)
- Experience with transformer architectures
- Knowledge of deep learning training and its applications
- Understanding of distributed training techniques (see the data-parallel sketch below)
- 3+ years of experience in machine learning engineering or research
- Experience with large-scale distributed training frameworks (Megatron-LM, DeepSpeed, FairScale, etc.)
- Familiarity with inference optimization frameworks (vLLM, TensorRT, etc.)
- Experience with containerization (Docker, Kubernetes) and cluster management
- Background in systems programming and performance optimization
- Publications in machine learning research preferred
- Ability to read, understand, and implement techniques from recent ML research papers
- Demonstrated commitment to open source development and community collaboration
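For the gradient descent and backpropagation item, a minimal sketch of the underlying update rule, with the gradient derived by hand (the chain-rule step that backpropagation automates across many layers). All names and values here are illustrative and not tied to any framework:

```python
# Minimal sketch: fit y = w * x by gradient descent on squared error.
# The gradient is derived by hand via the chain rule -- the step that
# backpropagation automates at scale. Illustrative values throughout.

def loss(w, x, y):
    return (w * x - y) ** 2

def grad(w, x, y):
    # dL/dw = 2 * (w*x - y) * x, by the chain rule
    return 2 * (w * x - y) * x

w, lr = 0.0, 0.1        # illustrative initial weight and learning rate
x, y = 2.0, 6.0         # one training pair; the true weight is 3.0
for _ in range(50):
    w -= lr * grad(w, x, y)   # gradient descent update: w <- w - lr * dL/dw
print(round(w, 4))      # approaches 3.0
```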
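For the distributed training item, a toy simulation of synchronous data parallelism, the pattern behind the data-parallel dimension of frameworks like DeepSpeed and Megatron-LM: each worker computes a gradient on its own data shard, the gradients are averaged (standing in for an MPI/NCCL allreduce), and every worker applies the same update. The worker loop below replaces real communication and is a sketch only:

```python
# Toy simulation of synchronous data-parallel SGD for y = w * x.
# Each "worker" holds a data shard; the gradient average stands in
# for an allreduce. Real systems use torch.distributed, MPI, etc.

def local_grad(w, shard):
    # Mean-squared-error gradient on this worker's shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # true w = 3.0
shards = [data[0::2], data[1::2]]   # split evenly across 2 "workers"

w, lr = 0.0, 0.05
for _ in range(100):
    grads = [local_grad(w, s) for s in shards]  # computed in parallel in practice
    g = sum(grads) / len(grads)                 # "allreduce": average the gradients
    w -= lr * g                                 # identical update on every worker
print(round(w, 4))  # approaches 3.0
```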