Member of Engineering (Reinforcement Learning Infrastructure)
Poolside
Artificial General Intelligence
Remote (EMEA/East Coast), Full-Time, Middle
Salary not disclosed
Job Details
- Required Skills: Python, PyTorch, Software Engineering, LLM, Distributed Systems
Requirements
- Experience with LLMs and model post-training workflows
- Understanding of how Reinforcement Learning works and what its main bottlenecks are
- Solid software engineering fundamentals (testing, code review, debugging complex systems)
- Proficiency in Python with knowledge of concurrency, asynchronous programming, multiprocessing and performance optimization
- Familiarity with deep learning frameworks (PyTorch or JAX)
- Familiarity with RL workflows (rollouts, replay buffers, policy updates)
- Experience designing and maintaining distributed RL training systems
- Experience with large-scale LLM training infrastructure
- Experience with profiling tools across the stack (e.g. py-spy)
- Experience with inference stacks (e.g. vLLM)
Responsibilities
- Build and scale the infrastructure that enables reliable, efficient training of Large Language Models with Reinforcement Learning at the frontier
- Keep up with the latest research, and be familiar with the state of the art in LLMs, RL, and code generation
- Develop methods for tuning training and inference end-to-end for high throughput
- Design data control systems in an RL pipeline that govern what the model sees and when
- Debug cases where infrastructure decisions are silently degrading learning dynamics
- Build observability tooling that surfaces when a system-level issue is the root cause of a training regression
- Help build robust, flexible and scalable RL pipelines
- Optimize performance across the stack — networking, memory, compute scheduling, and I/O
- Write high-quality, pragmatic code
- Work as part of the team: plan next steps, discuss ideas, and stay in regular contact