- Build and maintain the evaluation harness and RL environment infrastructure: task runners, sandboxed environments, and scoring logic that can scale to thousands of parallel agents.
- Own the data pipeline that turns freshly collected court filings into benchmark and RL tasks before they reach any model's training set.
- Integrate with partner harnesses and model APIs to run contamination-free evaluations.
- Collaborate with attorneys to translate legal workflows like cite checks, motion drafting, and precedent research into structured, scorable task formats using the Harbor spec.
PostgreSQL, Python, TypeScript, +1 more