Strong expertise in the Python ecosystem and major ML frameworks (PyTorch, JAX). Experience with lower-level programming (C++ or Rust preferred). Deep understanding of GPU acceleration (CUDA, profiling, kernel-level optimization); TPU experience is a strong plus. Proven ability to accelerate deep learning workloads using compiler frameworks, graph optimizations, and parallelization strategies. Solid understanding of the deep learning lifecycle: model design, large-scale training, data processing pipelines, and inference deployment. Strong debugging, profiling, and optimization skills in large-scale distributed environments. Excellent communication and collaboration skills, with the ability to clearly prioritize and articulate impact-driven technical solutions.