Senior Research Data Engineer

Remote - US, Full-Time, Senior

Salary: $178,800 - $198,600 a year

Job Details

Experience: 5+ years
Required Skills: Python, SQL, Machine Learning, MLflow, Data Engineering, Databricks, HIPAA, PySpark

Requirements

5+ years building production data systems, with at least 2 years supporting ML or AI workloads.
Advanced Python, SQL, and PySpark/Databricks proficiency.
Expertise in Databricks ecosystem: Delta Lake, Unity Catalog, Spark tuning, and MLflow.
Strong understanding of core AI data concepts: embeddings, tokenization, feature engineering, and point-in-time correctness.
Experience transforming diverse data modalities including unstructured content (PDFs, text, logs) into model-ready forms.
Knowledge of AI-friendly data formats (e.g., Parquet, Hugging Face datasets) and storage optimization.
Experience with data quality and synthesis pipelines (e.g., Snorkel, MinHash/LSH, LLM-based synthetic data).
Proficiency in pipeline orchestration (e.g., Airflow, Databricks Workflows, Dagster, or Prefect).
Experience handling regulated or sensitive data (e.g., HIPAA) and de-identification concepts.
Strong documentation skills and ability to elicit technical requirements from cross-functional stakeholders.
Bachelor’s degree in computer science, data science, engineering, statistics, or a related field.

Responsibilities

Own the gold data layer by transforming silver tables into curated, semantically rich, and documented datasets.
Reverse-engineer data semantics by analyzing source code and stored procedures and by interviewing product and clinical experts.
Bridge technical data definitions with AI researcher requirements to create efficient foundations for model R&D.
Curate datasets across modalities, including unstructured text and tabular data, with metadata and point-in-time features.
Develop reusable transformations in Databricks/Spark as observable, scheduled workloads.
Automate quality, filtering, and synthesis pipelines, including programmatic labeling, near-duplicate detection, and synthetic data generation.
Maintain versioned, reproducible dataset snapshots with clear lineage using tools like Unity Catalog.
