Research Crawling Engineer

MLabs
AI data accessibility
Location: New York, New York, United States; Florida, United States; Pennsylvania, United States; Poland; Romania. 6-hour overlap with EST.
Employment type: Full-Time
Level: Middle
Salary: 80,000 - 175,000 USD per year

Job Details

Required Skills
Python, Java, Machine Learning, C++, Go, Rust, NLP, Playwright, Distributed Systems

Requirements

  • Extensive programming experience in Go, Rust, Python, Java, or C++
  • Proven experience in building web crawlers or large-scale data pipelines
  • Solid understanding of HTTP, networking protocols, and browser behavior
  • Familiarity with distributed systems and parallel processing techniques
  • Experience handling large datasets, ideally at the terabyte to petabyte scale
  • Demonstrated ability to debug and maintain systems within unstable or adversarial environments
  • Experience with NLP pipelines or dataset curation for machine learning (Preferred)
  • Familiarity with LLM pre-training data or retrieval systems (Preferred)
  • Practical experience with headless browsers (e.g., Playwright, Puppeteer, or Chrome DevTools Protocol) (Preferred)
  • Knowledge of proxy systems, IP rotation, and large-scale request orchestration (Preferred)
  • Background in data quality evaluation or benchmarking (Preferred)
  • Experience running workloads on cloud or bare-metal infrastructure (Preferred)

Responsibilities

  • Construct and maintain large-scale web crawlers across diverse domains.
  • Design high-throughput, fault-tolerant systems for data collection, managing volumes ranging from millions to billions of URLs per day.
  • Navigate anti-bot systems, rate limits, and dynamic, JavaScript-heavy websites.
  • Develop robust pipelines for data cleaning, deduplication, filtering, and normalization.
  • Build and maintain datasets specifically structured for research and machine learning model training.
  • Monitor and optimize crawl performance, coverage, and data quality through rapid iteration.
  • Collaborate with research teams to ensure data collection efforts align with modeling requirements.
  • Optimize infrastructure to ensure cost-efficiency, low latency, and reliability.