- Proficiency in C++ and Python
- Deep understanding of HPC concepts (MPI, BSP, multi-GPU/multi-node distributed computing)
- CUDA/ROCm programming experience preferred
- Solid understanding of gradient descent and backpropagation algorithms (a minimal sketch follows this list)
- Experience with transformer architectures
- Knowledge of deep learning training and its applications
- Understanding of distributed training techniques (see the data-parallel sketch below)
- 3+ years of experience in machine learning engineering or research
- Experience with large-scale distributed training frameworks (Megatron-LM, DeepSpeed, FairScale, etc.)
- Familiarity with inference optimization frameworks (vLLM, TensorRT, etc.)
- Experience with containerization (Docker, Kubernetes) and cluster management
- Background in systems programming and performance optimization
- Publications in machine learning research preferred
- Ability to read, understand, and implement techniques from recent ML research papers
- Demonstrated commitment to open source development and community collaboration
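For the gradient descent and backpropagation item, a minimal sketch of the underlying update rule, with the gradient derived by hand (the chain-rule step that backpropagation automates across many layers). All names and values here are illustrative and not tied to any framework:

```python
# Minimal sketch: fit y = w * x by gradient descent on squared error.
# The gradient is derived by hand via the chain rule -- the step that
# backpropagation automates at scale. Illustrative values throughout.

def loss(w, x, y):
    return (w * x - y) ** 2

def grad(w, x, y):
    # dL/dw = 2 * (w*x - y) * x, by the chain rule
    return 2 * (w * x - y) * x

w, lr = 0.0, 0.1        # illustrative initial weight and learning rate
x, y = 2.0, 6.0         # one training pair; the true weight is 3.0
for _ in range(50):
    w -= lr * grad(w, x, y)   # gradient descent update: w <- w - lr * dL/dw
print(round(w, 4))      # approaches 3.0
```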
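For the distributed training item, a toy simulation of synchronous data parallelism, the pattern behind the data-parallel dimension of frameworks like DeepSpeed and Megatron-LM: each worker computes a gradient on its own data shard, the gradients are averaged (standing in for an MPI/NCCL allreduce), and every worker applies the same update. The worker loop below replaces real communication and is a sketch only:

```python
# Toy simulation of synchronous data-parallel SGD for y = w * x.
# Each "worker" holds a data shard; the gradient average stands in
# for an allreduce. Real systems use torch.distributed, MPI, etc.

def local_grad(w, shard):
    # Mean-squared-error gradient on this worker's shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # true w = 3.0
shards = [data[0::2], data[1::2]]   # split evenly across 2 "workers"

w, lr = 0.0, 0.05
for _ in range(100):
    grads = [local_grad(w, s) for s in shards]  # computed in parallel in practice
    g = sum(grads) / len(grads)                 # "allreduce": average the gradients
    w -= lr * g                                 # identical update on every worker
print(round(w, 4))  # approaches 3.0
```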