Data Engineer (United States)

Posted 7 months ago

💎 Seniority level: Junior, 1-3 years

📍 Location: United States

🔍 Industry: Data Technology

🗣️ Languages: English

⏳ Experience: 1-3 years

🪄 Skills: AWS, Python, SQL, ETL, Git, Snowflake, Software Architecture, Airflow, Pandas, Communication Skills

🔧 Requirements

Bachelor's in Computer Science, Data Science, Engineering or similar technical discipline (or commensurate work experience); Master's degree preferred.
1-3 years of Python programming (with Pandas experience).
Experience with CSV, JSON, Parquet, and other common data formats.
Data cleaning and structuring (ETL experience).
Knowledge of APIs (REST and SOAP), HTTP protocols, and API security best practices.
Experience with SQL, Git, and Airflow.
Strong written and oral communication skills.
Excellent attention to detail.
Ability to learn and adapt quickly.

💡 Responsibilities

Collaborate with internal project managers, sales directors, account managers, and clients’ stakeholders to identify requirements and build external data-driven solutions.
Perform data appends, extracts, and analyses to deliver curated datasets and insights to clients to help achieve their business objectives.
Understand and keep current with external data landscapes such as consumer, business, and property data.
Engage in projects involving entity detection, record linking, and data modelling.
Design scalable code blocks using Demyst’s APIs/SDKs that can be leveraged across production projects.
Govern releases, change management and maintenance of production solutions in close coordination with clients' IT teams.

Posted 5 months ago

📍 United States

🔍 Artificial Intelligence and Data Engineering

🔧 Requirements

Master's degree in Computer Science, Data Science, or a related field.
3-5 years of work experience in data engineering, preferably in AI/ML contexts.
Proficiency in Python, JSON, HTTP, and related tools.
Strong understanding of LLM architectures, training processes, and data requirements.
Experience with RAG systems, knowledge base construction, and vector databases.
Familiarity with embedding techniques, similarity search algorithms, and information retrieval concepts.
Hands-on experience with data cleaning, tagging, and annotation processes.
Knowledge of data crawling techniques and associated ethical considerations.
Familiarity with Snowflake and its integration in AI/ML pipelines.
Experience with various vector store technologies and their applications in AI.
Understanding of data lakehouse concepts and architectures.
Excellent communication, collaboration, and problem-solving skills.
Ability to translate business needs into technical solutions.
Passion for innovation and commitment to ethical AI development.
Experience building LLM pipelines using frameworks such as LangChain.

💡 Responsibilities

Design, implement, and maintain an end-to-end, multi-stage data pipeline for LLMs, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) data processes.
Identify, evaluate, and integrate diverse data sources and domains to support the Generative AI platform.
Develop and optimize data processing workflows for chunking, indexing, ingestion, and vectorization for both text and non-text data.
Benchmark and implement various vector stores, embedding techniques, and retrieval methods.
Create a flexible pipeline supporting multiple embedding algorithms, vector stores, and search types.
Implement and maintain auto-tagging systems and data preparation processes for LLMs.
Develop tools for text and image data crawling, cleaning, and refinement.
Collaborate with cross-functional teams to ensure data quality and relevance for AI/ML models.
Work with data lakehouse architectures to optimize data storage and processing.
Integrate and optimize workflows using Snowflake and various vector store technologies.

AWS, Python, GCP, Snowflake, Algorithms, Azure, Data engineering, Data science, Spark, Collaboration, JSON

Posted 5 months ago