Senior Software Engineer — AI Evaluation & Benchmarks

Locations: Albania, Austria, Belgium, Bosnia and Herzegovina, Brazil, Bulgaria, Canada, Chile, Colombia, Czechia, Dominican Republic, Ecuador, Estonia, France, Germany, Greece, Hungary, Ireland, Italy, Latvia, Lithuania, Malta, Mexico, Montenegro, Netherlands, North Macedonia, Paraguay, Peru, Poland, Portugal, Puerto Rico, Romania, Serbia, Slovakia, Spain, Turkey, United Kingdom, United States, Uruguay, Venezuela
Employment type: Contract
Level: Senior

Salary: 166,400 - 208,000 USD per year

Job Details

Languages: English
Experience: 4+ years
Required Skills: Python, Git, Machine Learning

Requirements

4+ years of professional software engineering experience
Expert Python proficiency
Hands-on experience working in large, complex codebases
Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines
Strong command of Git and modern development workflows
Track record at a high-growth tech company or top-tier software organization
Strong written English communication

Responsibilities

Design coding benchmarks that evaluate frontier models on real-world programming tasks
Build and maintain scalable data pipelines for evaluation workflows
Analyze model-generated code for correctness, reliability, and edge-case failures
Construct structured evaluation scenarios across large repos and multi-language environments
Provide detailed technical feedback on model performance and failure patterns
Contribute to evaluation frameworks that set the bar for how coding ability is measured
