AI Evaluation Engineer - Data Analysis & Multi-Agent Systems
Pakistan, Egypt, Kenya, Ghana, Nigeria, Brazil, Bangladesh, Colombia, India, Indonesia, Turkey, Vietnam
4 hours overlap with PST
Contract
Middle
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: Docker, Python, SQL, NumPy, pandas
Requirements
- 5+ years of experience in data analysis or analytics-heavy roles
- Strong proficiency in Python (pandas, NumPy)
- Strong proficiency in SQL
- Experience working with real-world, messy datasets (CSV, JSON, logs, reports)
- Ability to design analytical problems with clear, verifiable answers
- Solid understanding of statistics (distributions, correlations, outliers)
- Familiarity with AI benchmarks or evaluation environments (e.g., SWE-bench or similar)
- Hands-on experience with Docker (Dockerfiles, image builds, debugging)
Responsibilities
- Design and develop multi-agent benchmark tasks focused on complex data analysis workflows
- Create or curate realistic datasets (CSV, JSON, logs, reports, financial or operational data)
- Build tasks requiring cross-referencing across multiple data sources
- Build tasks requiring anomaly detection and contradiction identification
- Build tasks requiring statistical analysis and interpretation
- Define task decomposition strategies across specialized sub-agents (e.g., financial, technical, operational analysis)
- Develop verification logic to validate precise analytical outputs (not generic summaries)
- Implement evaluation pipelines using Python and SQL
- Create reproducible environments using Docker
- Analyze task performance and refine for clarity, difficulty, and scoring accuracy