AI Evaluation Engineer - Data Analysis & Multi-Agent Systems

Pakistan, Egypt, Kenya, Ghana, Nigeria, Brazil, Bangladesh, Colombia, India, Indonesia, Turkey, Vietnam
4 hours overlap with PST
Contract
Middle

Salary not disclosed

Job Details

Requirements

5+ years of experience in data analysis or analytics-heavy roles
Strong proficiency in Python (pandas, NumPy)
Strong proficiency in SQL
Experience working with real-world, messy datasets (CSV, JSON, logs, reports)
Ability to design analytical problems with clear, verifiable answers
Solid understanding of statistics (distributions, correlations, outliers)
Familiarity with AI benchmarks or evaluation environments (e.g., SWE-bench or similar)
Hands-on experience with Docker (Dockerfiles, image builds, debugging)

Responsibilities

Design and develop multi-agent benchmark tasks focused on complex data analysis workflows
Create or curate realistic datasets (CSV, JSON, logs, reports, financial or operational data)
Build tasks requiring cross-referencing across multiple data sources
Build tasks requiring anomaly detection and contradiction identification
Build tasks requiring statistical analysis and interpretation
Define task decomposition strategies across specialized sub-agents (e.g., financial, technical, operational analysis)
Develop verification logic to validate precise analytical outputs (not generic summaries)
Implement evaluation pipelines using Python and SQL
Create reproducible environments using Docker
Analyze task performance and refine for clarity, difficulty, and scoring accuracy
