- Design and advise on the rubric architecture used to evaluate the quality of AI outputs.
- Create, edit, and certify 'Golden Tasks' that serve as ground truth for model training.
- Analyze model failures such as hallucinations and logical fallacies.
- Provide technical breakdowns explaining why models fail and how an expert would resolve each failure.
- Consult on the difficulty and nuance of prompts to ensure rigorous AI testing.