Senior Research Data Engineer
Remote - US, Full-Time, Senior
Salary: $178,800 - $198,600 a year
Job Details
- Experience: 5+ years
- Required Skills: Python, SQL, Machine Learning, MLflow, Data engineering, Databricks, HIPAA, PySpark
Requirements
- 5+ years building production data systems, with at least 2 years supporting ML or AI workloads.
- Advanced Python, SQL, and PySpark/Databricks proficiency.
- Expertise in Databricks ecosystem: Delta Lake, Unity Catalog, Spark tuning, and MLflow.
- Strong understanding of AI data concepts: embeddings, tokenization, feature engineering, and point-in-time correctness.
- Experience transforming diverse data modalities including unstructured content (PDFs, text, logs) into model-ready forms.
- Knowledge of AI-friendly data formats (e.g., Parquet, Hugging Face datasets) and storage optimization.
- Experience with data quality and synthesis pipelines (e.g., Snorkel, MinHash/LSH, LLM-based synthetic data).
- Proficiency in pipeline orchestration (e.g., Airflow, Databricks Workflows, Dagster, or Prefect).
- Experience handling regulated or sensitive data (e.g., HIPAA) and de-identification concepts.
- Strong documentation skills and ability to elicit technical requirements from cross-functional stakeholders.
- Bachelor’s degree in computer science, data science, engineering, statistics, or a related field.
Responsibilities
- Own the gold data layer by transforming silver tables into curated, semantically rich, and documented datasets.
- Reverse-engineer data semantics by analyzing source code and stored procedures and by interviewing product and clinical experts.
- Bridge technical data definitions with AI researcher requirements to create efficient foundations for model R&D.
- Curate datasets across modalities, including unstructured text and tabular data, with metadata and point-in-time features.
- Develop reusable transformations in Databricks/Spark as observable, scheduled workloads.
- Automate quality, filtering, and synthesis pipelines, including programmatic labeling, near-duplicate detection, and synthetic data generation.
- Maintain versioned, reproducible dataset snapshots with clear lineage using tools like Unity Catalog.