Research Engineer, Search and Knowledge Post-Training
Anthropic
Remote-Friendly (Travel Required) | San Francisco, CA | Seattle, WA | New York City, NY | Full-Time
Salary: 500,000 - 850,000 USD per year
Job Details
- Experience: years of experience required will scale with the internal job level for the position
- Required Skills: Python
Requirements
- Have an unusually rigorous, quantitative mindset
- Are an outstanding software engineer in Python, comfortable across the stack from data pipelines to RL training to evaluation infrastructure
- Have shipped real ML research repeatedly, with taste for which experiments are worth running; you instinctively reach for ablations, controls, and confidence intervals to understand why a result holds
- Operate well with high autonomy and ambiguity and can identify the most impactful problem to work on next without being told
- Want to set research direction, advocate for experimental rigor, and raise the bar for the people around you
- Communicate research clearly in writing and in person; you can defend a design choice and update on evidence
- Hands-on experience with RL on large language models — environments, reward design, training stability, scaling behavior
- Background in search, retrieval, RAG, or agents that reason over external information sources
- Experience building evaluations for open-ended or knowledge-intensive LLM behavior
- Prior work in a research-heavy environment — frontier AI lab, quant research firm, or similarly demanding empirical setting — where rigor is the default
- Published research on LLMs, RL, retrieval, calibration, or related topics
- Experience with distributed training systems and large-scale experimentation infrastructure
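As a concrete illustration of the statistical rigor the list above asks for (ablations, controls, confidence intervals), here is a minimal sketch of a percentile-bootstrap confidence interval over per-example eval scores; the function name and the sample scores are hypothetical, not part of any Anthropic codebase:

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example eval scores.

    Resamples the score list with replacement, records each resample's
    mean, and reads the (alpha/2, 1 - alpha/2) percentiles off the
    sorted means. A deterministic seed keeps the interval reproducible.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-example pass/fail scores from one eval run
scores = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
lo, hi = bootstrap_ci(scores)
```

With small samples like this, the interval is wide — which is exactly the point of reporting one before claiming a result is real.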
Responsibilities
- Own a research direction for a class of search post-training problems end-to-end: form hypotheses about latent capabilities, design experiments that isolate them, run training, and decide what to try next.
- Build the instrumentation that turns environment design into a controlled experiment so we can study how each environment factor contributes to the capabilities we care about, rather than overfitting to any one regime.
- Design frontier-discriminating evaluations that distinguish genuine reasoning over evidence from plausible pattern matching and that hold up as models improve.
- Drive optimization rigor across the stack: efficient experiment design, ablations, training run economics, and the discipline to know when a result is real.
- Collaborate deeply with researchers across post-training, RL infrastructure, and product to translate model behavior in the wild into concrete training signals and back again.
- Set the bar for the team's experimental standards — what we measure, how we measure it, how we know a result is real.