Staff Research Engineer, Post-training & Evaluation

Reddit
Machine Learning
This role is completely remote friendly within the United States.
Full-Time
Staff
Salary: $230,000 to $322,000 USD

Job Details

Experience
6+ years of professional ML experience (or a PhD plus 4+ years)
Required Skills
Python, Machine Learning, PyTorch, NLP, LLM

Requirements

  • 6+ years of professional ML experience (or a PhD plus 4+ years) in LLM post-training and evaluation.
  • PhD or MS in CS, ML, NLP, IR, or related quantitative field.
  • Expertise in evaluation reliability: judge/sample variance, multi-sample scoring, calibration, and statistical significance.
  • Experience building custom, domain-specific evaluation harnesses (e.g., on top of lm-eval-harness or Inspect AI).
  • Experience evaluating models with both generation and representation/classification metrics.
  • Deep understanding of Continuous Pre-training (CPT) and Instruction Tuning (SFT).
  • Fluency in Python.
  • Experience with data-pipeline and eval-harness engineering.
  • Working knowledge of PyTorch and distributed training (FSDP2, DeepSpeed ZeRO-3).

Responsibilities

  • Define the 'Reddit Benchmark' evaluation standard for safety, reasoning, and knowledge.
  • Establish statistical rigor for evaluation, including judge calibration and multi-sample scoring.
  • Design and own model-as-a-judge methodology and prompt calibration.
  • Set post-training recipes including SFT data mixtures and curriculum.
  • Evaluate base and CPT checkpoints to guide training compute allocation.
  • Drive synthetic data generation and curation strategies.
  • Partner with Safety Engineering to translate policies into classification metrics and CI/CD tests.
  • Diagnose post-training instability, loss curves, and alignment tax.
  • Mentor team members and set technical research direction.