Staff Research Engineer, Post-training & Evaluation

Reddit | Machine Learning

This role is completely remote-friendly within the United States.
Full-Time | Staff

Salary: $230,000 - $322,000 USD

Job Details

6+ years of professional ML experience (or a PhD plus 4+ years) in LLM post-training and evaluation.
PhD or MS in CS, ML, NLP, IR, or related quantitative field.
Expertise in evaluation reliability: judge/sample variance, multi-sample scoring, calibration, and statistical significance.
Experience building custom, domain-specific evaluation harnesses (e.g., with lm-eval-harness or Inspect AI).
Experience evaluating models with both generation and representation/classification metrics.
Deep understanding of Continuous Pre-training (CPT) and Instruction Tuning (SFT).
Fluency in Python.
Experience with data-pipeline and eval-harness engineering.
Working knowledge of PyTorch and distributed training (FSDP2, DeepSpeed ZeRO-3).

Define the 'Reddit Benchmark' evaluation standard for safety, reasoning, and knowledge.
Establish statistical rigor for evaluation, including judge calibration and multi-sample scoring.
Design and own model-as-a-judge methodology and prompt calibration.
Set post-training recipes including SFT data mixtures and curriculum.
Evaluate base and CPT checkpoints to guide training compute allocation.
Drive synthetic data generation and curation strategies.
Partner with Safety Engineering to translate policies into classification metrics and CI/CD tests.
Diagnose post-training instability, loss curves, and alignment tax.
Mentor team members and set technical research direction.
