Senior Research Data Engineer

Remote - US, Full-Time, Senior
Salary: $178,800 - $198,600 a year

Job Details

Experience
5+ years
Required Skills
Python, SQL, Machine Learning, MLflow, Data Engineering, Databricks, HIPAA, PySpark

Requirements

  • 5+ years building production data systems, with at least 2 years supporting ML or AI workloads.
  • Advanced Python, SQL, and PySpark/Databricks proficiency.
  • Expertise in Databricks ecosystem: Delta Lake, Unity Catalog, Spark tuning, and MLflow.
  • Strong understanding of AI domains: embeddings, tokenization, feature engineering, and point-in-time correctness.
  • Experience transforming diverse data modalities including unstructured content (PDFs, text, logs) into model-ready forms.
  • Knowledge of AI-friendly data formats (e.g., Parquet, Hugging Face datasets) and storage optimization.
  • Experience with data quality and synthesis pipelines (e.g., Snorkel, MinHash/LSH, LLM-based synthetic data).
  • Proficiency in pipeline orchestration (e.g., Airflow, Databricks Workflows, Dagster, or Prefect).
  • Experience handling regulated or sensitive data (e.g., HIPAA) and de-identification concepts.
  • Strong documentation skills and ability to elicit technical requirements from cross-functional stakeholders.
  • Bachelor’s degree in computer science, data science, engineering, statistics, or a related field.

Responsibilities

  • Own the gold data layer by transforming silver tables into curated, semantically rich, and documented datasets.
  • Reverse-engineer data semantics by analyzing source code and stored procedures and by interviewing product and clinical experts.
  • Bridge technical data definitions with AI researcher requirements to create efficient foundations for model R&D.
  • Curate datasets across modalities, including unstructured text and tabular data, with metadata and point-in-time features.
  • Develop reusable transformations in Databricks/Spark as observable, scheduled workloads.
  • Automate quality, filtering, and synthesis pipelines, including programmatic labeling, near-duplicate detection, and synthetic data generation.
  • Maintain versioned, reproducible dataset snapshots with clear lineage using tools like Unity Catalog.