Staff Research Engineer, Post-training & Evaluation

This role is completely remote-friendly within the United States.

Full-Time, Staff

Salary: $230,000 to $322,000 USD


Job Details

6+ years of professional ML experience, or a PhD plus 4 years.
PhD or MS in CS, ML, NLP, IR, or a related field.
Deep expertise in evaluation reliability, calibration, and statistical significance.
Experience building custom domain-specific evaluation harnesses.
Experience evaluating models with both generation and representation/classification metrics.
Deep understanding of CPT and SFT data quality.
Fluency in Python.
Working knowledge of PyTorch and distributed training (FSDP2, DeepSpeed).
Experience with evaluation and inference tooling such as Hugging Face Transformers, vLLM, or lm-eval-harness.

Define the Reddit Benchmark evaluation standard for model quality.
Own evaluation reliability, including statistical significance and judge calibration.
Design model-as-a-judge methodology for automated evaluation.
Set post-training recipes and strategy (SFT data mixtures, curriculum).
Design checkpoint-selection methodology for base and CPT models.
Drive synthetic data generation strategy for instruction and evaluation.
Partner with Safety Engineering on classification metrics and probe sets.
Diagnose post-training instability via loss curves and logs.
Lead research direction and mentor team members.
