- Design and develop multi-agent benchmark tasks focused on complex data analysis workflows
- Create or curate realistic datasets (CSV, JSON, logs, reports, financial or operational data)
- Build tasks requiring cross-referencing across multiple data sources
- Build tasks requiring anomaly detection and contradiction identification
- Build tasks requiring statistical analysis and interpretation
- Define task decomposition strategies across specialized sub-agents (e.g., financial, technical, operational analysis)
- Develop verification logic to validate precise analytical outputs (not generic summaries)
- Implement evaluation pipelines using Python and SQL
- Create reproducible environments using Docker
- Analyze how agents perform on each task and refine tasks for clarity, difficulty, and scoring accuracy
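
To illustrate what verification logic for "precise analytical outputs" might look like, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function name `verify_output`, the field names, and the sample values are all hypothetical and do not come from any actual benchmark. It compares an agent's structured answer field by field against expected values, using a relative tolerance for floats and order-insensitive matching for lists of identifiers, so a free-text summary alone would not pass.

```python
import math


def verify_output(submitted: dict, expected: dict, rel_tol: float = 1e-6) -> dict:
    """Return a per-field pass/fail map for an agent's structured answer.

    Hypothetical helper: field names and comparison rules are assumptions.
    """
    results = {}
    for key, want in expected.items():
        got = submitted.get(key)  # missing fields come back as None and fail
        if isinstance(want, float):
            # Numeric answers: accept small floating-point differences only.
            ok = (
                isinstance(got, (int, float))
                and not isinstance(got, bool)
                and math.isclose(got, want, rel_tol=rel_tol)
            )
        elif isinstance(want, (list, set)):
            # Identifier lists (e.g. flagged records): order does not matter.
            ok = isinstance(got, (list, set)) and set(got) == set(want)
        else:
            # Everything else (strings, booleans, ints): exact match.
            ok = got == want
        results[key] = ok
    return results


# Hypothetical expected answer for one task.
expected = {
    "total_revenue": 1250.5,
    "anomalous_ids": ["T-17", "T-42"],
    "contradiction_found": True,
}

# A precise answer passes every field; a generic summary passes none.
precise = {
    "total_revenue": 1250.5000001,
    "anomalous_ids": ["T-42", "T-17"],
    "contradiction_found": True,
}
generic = {"summary": "Revenue looks healthy with a few anomalies."}

precise_result = verify_output(precise, expected)
generic_result = verify_output(generic, expected)
```

Returning a per-field result rather than a single boolean is one possible design choice: it lets a scoring step award partial credit and makes it easier to see which part of a task agents tend to miss.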