Member of Engineering (Pre-training and inference fault tolerance)

Posted 2 months ago
EMEA, East Coast | Full-Time | Software Development
Company: poolside
Location: EMEA, East Coast, EST, PST
Languages: English
Skills: Python, Software Development, Artificial Intelligence, Machine Learning, NumPy, PyTorch, C++, Debugging
Requirements:
- Strong engineering background
- Good knowledge of Torch
- Good knowledge of NVIDIA GPU architecture
- Good knowledge of reliability concepts
- Good knowledge of distributed systems
- Good knowledge of best coding practices
- Basic understanding of LLM training and inference principles
- Programming experience
- Linux API
- Linux kernel
- Strong algorithmic skills
- Python with NumPy, PyTorch, or JAX
- C/C++
- NCCL
- Use modern tools and are always looking to improve
- Strong critical thinking and ability to question code quality policies when applicable
- K8s stack
Responsibilities:
- Identify, study, and troubleshoot hardware problems during training at scale
- Minimize the GPU idle time during faults, both operationally and strategically
- Design and develop tools and add-ons to accelerate the training recovery
- Improve the performance and reliability of checkpointing
- Write high-quality Python (PyTorch), Cython, C/C++, CUDA API code