Research Engineer – Evals
Firecrawl
Data Extraction AI
Americas, UTC-3 to UTC-10
Full-Time, Senior
Salary: 160,000 - 240,000 USD per year
Job Details
- Experience: 3+ years in ML engineering, applied AI, or data quality
- Required Skills: Python, CI/CD
Requirements
- 3+ years in ML engineering, applied AI, or data quality with production systems.
- Strong experience building eval infrastructure.
- Deep understanding of LLM evaluation methodology, including LLM-as-judge.
- Expertise in working with unstructured, messy web data.
- Strong proficiency in designing rubrics for LLM evaluation.
- Experience with production systems and real-world traffic tradeoffs.
- Ability to work in a fast-paced environment with rapid experiment cycles.
Responsibilities
- Build the eval stack from scratch including metrics, pipelines, and datasets.
- Integrate evals into CI/CD for regression testing.
- Design benchmarks reflecting real-world web data distribution.
- Own LLM-as-judge pipelines and human review tooling.
- Collaborate with RL and Search/IR researchers to create feedback loops.
- Run fast experiments and communicate findings clearly.
- Build datasets and data collection/labeling systems.