- Reviewing AI-generated responses to clinical scenarios
- Rating responses for accuracy, clinical appropriateness, safety, and reasoning quality
- Comparing multiple model answers, then selecting the best response and justifying the choice
- Writing improved exemplars, rationales, or structured feedback to help models improve where they fall short