AI Benchmark Engineer | Native Language Specialist - Spanish
LILT (Production) · AI, Language Technology
Spain (Remote) · Contract · Mid-level
Salary not disclosed
Job Details
- Languages: Spanish, English
- Experience: 5+ years
- Required Skills: Python
Requirements
- 5+ years of industry experience in software engineering
- Proven track record at leading technology companies and/or graduation from top-tier engineering universities
- Native or near-native fluency in Spanish, with a deep understanding of its grammar, register, and phrasing rules
- High English proficiency
- Strong proficiency in Python
- Strong proficiency in standard shell scripting
- Strong proficiency in data processing
- Extensive experience with Terminal/CLI-based development workflows
- Working familiarity with coding agents
- Deep technical understanding of multilingual text processing pitfalls, including encoding/decoding robustness and Unicode normalization
- Deep technical understanding of locale-dependent conventions (collation, casing, non-Gregorian dates)
- Deep technical understanding of text I/O, toolchain interoperability, and safe string operations
- For Spanish, deep understanding of diacritic and special-character handling (ñ, accents, inverted punctuation ¿ ¡), font fallbacks, and rendering/typography in UI or artifacts
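The encoding, normalization, and casing pitfalls listed in these requirements can be illustrated with a minimal Python sketch (the example strings are hypothetical, not from the posting):

```python
import unicodedata

# "ñ" has two valid Unicode encodings: precomposed (NFC) and decomposed (NFD).
s_nfc = "año"            # contains U+00F1
s_nfd = "an\u0303o"      # "n" + U+0303 combining tilde
assert s_nfc != s_nfd    # byte-different yet visually identical
assert unicodedata.normalize("NFC", s_nfd) == s_nfc

# Python's str.upper() is locale-independent, which is usually what a
# deterministic benchmark check wants; it handles Spanish letters correctly.
assert "peña".upper() == "PEÑA"

# Robust decoding: never assume bytes are valid UTF-8.
raw = "señal".encode("latin-1")   # b"se\xf1al"
try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    text = raw.decode("latin-1")  # fall back explicitly, never silently
assert text == "señal"
```

Comparisons that skip the normalization step are a common source of flaky multilingual tasks: two files can be semantically identical Spanish text yet differ byte-for-byte.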
Responsibilities
- Design, build, and validate Terminal-Bench tasks to test large language models on multilingual software challenges.
- Create high-signal, high-quality tasks that genuinely test a model's ability to handle multilingual environments without relying on English translation crutches.
- Evaluate coding agents.
- Build realistic task environments using datasets and files in your native language (Spanish).
- Identify failure modes where AI models break down when operating in your native language (Spanish).
- Support the development of robust solutions (reference implementations) and write highly reliable, deterministic verifier scripts.
- Analyze execution logs and calibrate task difficulty (Easy to Very Hard) using standard Terminal-Bench run configurations against various model tiers.
- Participate in a rigorous, 4-layer human quality control process (creation, human review, calibration review, and audit) alongside automated LLM-based checks to ensure fairness, grammatical accuracy, and benchmark integrity.