LLM Data Engineer | United States | Fully Remote

Posted 2024-11-07

💎 Seniority level: Senior, 3-5 years

📍 Location: United States

🔍 Industry: Artificial Intelligence and Data Engineering

🏢 Company: Halo Media

⏳ Experience: 3-5 years

🪄 Skills: AWS, Python, GCP, Snowflake, Algorithms, Azure, Data Engineering, Data Science, Spark, Collaboration

Requirements:
  • Master's degree in Computer Science, Data Science, or a related field.
  • 3-5 years of work experience in data engineering, preferably in AI/ML contexts.
  • Proficiency in Python, JSON, HTTP, and related tools.
  • Strong understanding of LLM architectures, training processes, and data requirements.
  • Experience with RAG systems, knowledge base construction, and vector databases.
  • Familiarity with embedding techniques, similarity search algorithms, and information retrieval concepts.
  • Hands-on experience with data cleaning, tagging, and annotation processes.
  • Knowledge of data crawling techniques and associated ethical considerations.
  • Familiarity with Snowflake and its integration in AI/ML pipelines.
  • Experience with various vector store technologies and their applications in AI.
  • Understanding of data lakehouse concepts and architectures.
  • Excellent communication, collaboration, and problem-solving skills.
  • Ability to translate business needs into technical solutions.
  • Passion for innovation and commitment to ethical AI development.
  • Experience building LLM pipelines with frameworks such as LangChain.
Responsibilities:
  • Design, implement, and maintain an end-to-end multi-stage data pipeline for LLMs, including Supervised Fine Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) data processes.
  • Identify, evaluate, and integrate diverse data sources and domains to support the Generative AI platform.
  • Develop and optimize data processing workflows for chunking, indexing, ingestion, and vectorization for both text and non-text data.
  • Benchmark and implement various vector stores, embedding techniques, and retrieval methods.
  • Create a flexible pipeline supporting multiple embedding algorithms, vector stores, and search types.
  • Implement and maintain auto-tagging systems and data preparation processes for LLMs.
  • Develop tools for text and image data crawling, cleaning, and refinement.
  • Collaborate with cross-functional teams to ensure data quality and relevance for AI/ML models.
  • Work with data lakehouse architectures to optimize data storage and processing.
  • Integrate and optimize workflows using Snowflake and various vector store technologies.