Research Engineer – Evals

Firecrawl (Data Extraction AI)
Americas (UTC-3 to UTC-10), Full-Time, Senior
Salary: 160,000 - 240,000 USD per year

Job Details

Experience
3+ years in ML engineering, applied AI, or data quality
Required Skills
Python, CI/CD

Requirements

  • 3+ years in ML engineering, applied AI, or data quality with production systems.
  • Strong experience building eval infrastructure.
  • Deep understanding of LLM evaluation methodology, including LLM-as-judge.
  • Expertise in working with unstructured, messy web data.
  • Strong proficiency in designing rubrics for LLM evaluation.
  • Experience with production systems and real-world traffic tradeoffs.
  • Ability to work in a fast-paced environment with rapid experiment cycles.

Responsibilities

  • Build the eval stack from scratch, including metrics, pipelines, and datasets.
  • Integrate evals into CI/CD for regression testing.
  • Design benchmarks reflecting real-world web data distribution.
  • Own LLM-as-judge pipelines and human review tooling.
  • Collaborate with RL and Search/IR researchers to create feedback loops.
  • Run fast experiments and communicate findings clearly.
  • Build datasets and data collection/labeling systems.