Senior Data Architect - LLM/ML Data Infrastructure
Poland, Portugal, Spain, Czechia, Greece
Full-Time, Senior
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: Python, SQL, ETL, Snowflake, Airflow, data modeling, dbt
Requirements
- 5+ years in data architecture, data engineering, or LLM/ML data infrastructure, with demonstrated ownership of production data systems serving ML/AI model development.
- Strong understanding of ML training data requirements.
- Deep experience with data modeling, schema design, and data pipeline architecture.
- Strong proficiency with Snowflake, AWS S3, and ETL/ELT orchestration tools (Airflow, dbt, or similar).
- Experience defining annotation requirements and managing data annotation workflows.
- Experience with data cataloging, metadata management, and dataset discovery at scale.
- Strong SQL and Python skills for data pipeline development and data quality analysis.
- Experience with data quality frameworks: deduplication, sampling strategies, diversity optimization.
- Master's degree or PhD in Computer Science, Data Engineering, Information Systems, or a related field.
Responsibilities
- Own the Training Environment data architecture end-to-end: dataset design and schema for all ML training pipelines, including dialog corpora for LLM training, conversational steps for NLU models, annotated evaluation sets, and whole-call recordings for speech-to-speech model development.
- Define and govern data selection and sampling strategy: establish criteria that determine which production conversations have the highest training value, including diversity-optimized sampling, confidence-based filtering, edge-case prioritization, and deduplication strategies.
- Build and maintain the data catalog and dataset discovery infrastructure: enable ML engineers across LLM, NLU, Speech, and Agentic teams to find, understand, and use training data without friction.
- Define annotation pipeline architecture: establish requirements for data labeling — intent annotation, entity tagging, dialog act classification, task completion scoring, and agentic reasoning evaluation — across internal annotators and external vendors.
- Architect the data flywheel: the closed-loop system where real customer conversations feed back into training data collection, curation, annotation, model retraining, and evaluation.
- Own and maintain data pipelines and infrastructure spanning Snowflake, AWS S3, ETL/ELT pipelines (Airflow), and integration with ML training workflows on AWS SageMaker.
- Work directly with LLM, NLU, and Agentic systems teams to understand training data requirements and translate these into concrete dataset specifications and pipeline configurations.
- Design data quality frameworks that directly improve model outcomes: content-based deduplication, diversity-maximizing sampling, confidence-based filtering using NLU scores and behavioral signals, and dedicated NLU improvement corpus extraction from low-confidence and no-match production data.
- Identify gaps in production training data and define requirements for external data acquisition; design data augmentation strategies for underrepresented languages, domains, or conversational patterns.
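To make the data-quality responsibilities above concrete, here is a minimal sketch of two of the named techniques: content-based deduplication and confidence-based filtering for extracting an NLU-improvement corpus. The record fields (`text`, `nlu_confidence`) and the 0.4 threshold are illustrative assumptions, not the employer's actual schema or cutoffs.

```python
import hashlib

def content_hash(text: str) -> str:
    """Hash whitespace/case-normalized text so near-identical records share a key."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(records: list[dict]) -> list[dict]:
    """Content-based deduplication: keep the first record per content hash."""
    seen: set[str] = set()
    unique = []
    for rec in records:
        key = content_hash(rec["text"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def low_confidence_corpus(records: list[dict], threshold: float = 0.4) -> list[dict]:
    """Confidence-based filtering: pull low-confidence conversations
    into a dedicated NLU-improvement corpus (threshold is illustrative)."""
    return [r for r in records if r.get("nlu_confidence", 1.0) < threshold]

records = [
    {"text": "I want to cancel my order", "nlu_confidence": 0.92},
    {"text": "i want to  cancel my order", "nlu_confidence": 0.91},  # near-duplicate
    {"text": "uh can you do the thing", "nlu_confidence": 0.21},
]
unique = deduplicate(records)          # near-duplicate collapses away
hard_cases = low_confidence_corpus(unique)  # low-confidence record extracted
```

In a production pipeline these steps would typically run as Airflow tasks over Snowflake tables rather than Python lists, but the selection logic is the same.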