Member of Engineering (Pre-training and inference fault tolerance)

Posted about 1 month ago
EMEA, East Coast North America | Full-Time | Software Development, AI
Company: poolside
Location: EMEA, East Coast North America, EST, PST
Languages: English
Skills: Python, Software Development, Artificial Intelligence, Machine Learning, NumPy, PyTorch, C++, Debugging
Requirements:
Strong engineering skills
Good knowledge of Torch
Understanding of NVIDIA GPU architecture
Knowledge of reliability concepts
Experience with distributed systems
Familiarity with best coding practices
Basic understanding of LLM training and inference principles
Ability to debug Linux kernel modules
Experience with Python (PyTorch, NumPy), Cython, C/C++, CUDA API
Knowledge of NCCL
Experience with K8s stack
Responsibilities:
Identify and troubleshoot hardware problems during large-scale training
Minimize GPU idle time during faults
Design and develop tools to accelerate training recovery
Improve performance and reliability of checkpointing
Write high-quality Python (PyTorch), Cython, C/C++, CUDA API code