Senior Data Architect - LLM/ML Data Infrastructure

Poland, Portugal, Spain, Czechia, Greece | Full-Time | Senior
Salary not disclosed

Job Details

Experience
5+ years
Required Skills
Python, SQL, ETL, Snowflake, Airflow, Data modeling, dbt

Requirements

  • 5+ years in data architecture, data engineering, or LLM/ML data infrastructure, with demonstrated ownership of production data systems serving ML/AI model development.
  • Strong understanding of ML training data requirements.
  • Deep experience with data modeling, schema design, and data pipeline architecture.
  • Strong proficiency with Snowflake, AWS S3, and ETL/ELT orchestration tools (Airflow, dbt, or similar).
  • Experience defining annotation requirements and managing data annotation workflows.
  • Experience with data cataloging, metadata management, and dataset discovery at scale.
  • Strong SQL and Python skills for data pipeline development and data quality analysis.
  • Experience with data quality frameworks: deduplication, sampling strategies, diversity optimization (a minimal sketch follows this list).
  • Master's degree or PhD in Computer Science, Data Engineering, Information Systems, or a related field.
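
To make the data quality requirement concrete, here is a minimal Python sketch of content-based deduplication plus a diversity-capped sampler over conversation records. Everything in it is an illustrative assumption rather than the employer's actual tooling: the record fields (`text`, `intent`), the normalization rule, and the per-group cap are hypothetical.

```python
import hashlib
import random
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return " ".join(text.lower().split())


def dedupe(records: list[dict]) -> list[dict]:
    """Keep only the first record for each distinct normalized-content hash."""
    seen: set[str] = set()
    unique = []
    for rec in records:
        digest = hashlib.sha256(normalize(rec["text"]).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique


def diversity_sample(records: list[dict], per_group: int, seed: int = 0) -> list[dict]:
    """Cap each group's contribution (here grouped by intent label) so that
    frequent intents do not dominate the training mix."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for rec in records:
        groups[rec["intent"]].append(rec)
    rng = random.Random(seed)
    sample: list[dict] = []
    for recs in groups.values():
        rng.shuffle(recs)
        sample.extend(recs[:per_group])
    return sample


if __name__ == "__main__":
    raw = [
        {"text": "I want to cancel my order", "intent": "cancel_order"},
        {"text": "i want to  cancel my order", "intent": "cancel_order"},  # near-duplicate
        {"text": "Where is my package?", "intent": "track_order"},
    ]
    print(diversity_sample(dedupe(raw), per_group=1))
```

In production, exact-hash deduplication like this is usually complemented by near-duplicate detection (e.g. MinHash), but the cap-per-group idea carries over directly to diversity-optimized sampling.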

Responsibilities

  • Own the Training Environment data architecture end-to-end: dataset design and schema for all ML training pipelines, including dialog corpora for LLM training, conversational steps for NLU models, annotated evaluation sets, and whole-call recordings for speech-to-speech model development.
  • Define and govern data selection and sampling strategy: establish criteria that determine which production conversations have the highest training value, including diversity-optimized sampling, confidence-based filtering, edge-case prioritization, and deduplication strategies.
  • Build and maintain the data catalog and dataset discovery infrastructure: enable ML engineers across LLM, NLU, Speech, and Agentic teams to find, understand, and use training data without friction.
  • Define annotation pipeline architecture: establish requirements for data labeling — intent annotation, entity tagging, dialog act classification, task completion scoring, and agentic reasoning evaluation — across internal annotators and external vendors.
  • Architect the data flywheel: the closed-loop system where real customer conversations feed back into training data collection, curation, annotation, model retraining, and evaluation.
  • Own and maintain data pipelines and infrastructure spanning Snowflake, AWS S3, and Airflow-orchestrated ETL/ELT, including integration with ML training workflows on AWS SageMaker (see the pipeline sketch after this list).
  • Work directly with LLM, NLU, and Agentic systems teams to understand training data requirements and translate these into concrete dataset specifications and pipeline configurations.
  • Design data quality frameworks that directly improve model outcomes: content-based deduplication, diversity-maximizing sampling, confidence-based filtering using NLU scores and behavioral signals, and extraction of a dedicated NLU improvement corpus from low-confidence and no-match production data (see the filtering sketch after this list).
  • Identify gaps in production training data and define requirements for external data acquisition; design data augmentation strategies for underrepresented languages, domains, or conversational patterns.
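
The pipeline ownership bullet above maps naturally onto an orchestrated extract-curate-train flow. The sketch below outlines one using Airflow's TaskFlow API; the DAG name, schedule, S3 paths, and task bodies are hypothetical placeholders, and real Snowflake and SageMaker provider hooks/operators would replace the stubs.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False, tags=["training-data"])
def training_data_refresh():
    """Hypothetical daily flow: Snowflake -> curated S3 dataset -> SageMaker training."""

    @task
    def extract_conversations() -> str:
        # Placeholder: in practice, query Snowflake (e.g. via the Snowflake
        # provider hook) for the previous day's production conversations.
        return "s3://example-bucket/raw/conversations.parquet"  # hypothetical path

    @task
    def curate(raw_path: str) -> str:
        # Placeholder: deduplicate, filter by NLU confidence, rebalance for
        # diversity, and write the curated dataset back to S3.
        return raw_path.replace("/raw/", "/curated/")

    @task
    def trigger_training(curated_path: str) -> None:
        # Placeholder: launch a SageMaker training job against the curated
        # dataset (e.g. via the Amazon provider's SageMaker operator).
        print(f"would launch SageMaker training on {curated_path}")

    trigger_training(curate(extract_conversations()))


training_data_refresh()
```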
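
Likewise, the confidence-based filtering described above reduces to a small selection rule. This is a minimal sketch assuming per-turn NLU output with an intent field and a confidence score; the field names and the 0.5 cutoff are invented for illustration.

```python
LOW_CONFIDENCE = 0.5  # hypothetical cutoff, tuned in practice against annotation budget


def nlu_improvement_corpus(turns: list[dict]) -> list[dict]:
    """Select the production turns most likely to teach the NLU model something:
    outright no-matches plus low-confidence predictions."""
    return [
        turn for turn in turns
        if turn.get("nlu_intent") is None
        or turn.get("nlu_confidence", 1.0) < LOW_CONFIDENCE
    ]


if __name__ == "__main__":
    turns = [
        {"text": "cancel it", "nlu_intent": "cancel_order", "nlu_confidence": 0.97},
        {"text": "undo the thing from before", "nlu_intent": "cancel_order", "nlu_confidence": 0.31},
        {"text": "blarg?", "nlu_intent": None},
    ]
    print(nlu_improvement_corpus(turns))  # keeps the last two turns
```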