Research Crawling Engineer
MLabs
AI data accessibility
New York, New York, United States. Florida, United States. Pennsylvania, United States. Poland. Romania, 6-hour overlap with EST
Full-Time
Middle
Salary: 80,000 - 175,000 USD per year
Job Details
Required Skills
- Python, Java, Machine Learning, C++, Go, Rust, NLP, Playwright, Distributed Systems
Requirements
- Extensive programming experience in Go, Rust, Python, Java, or C++
- Proven experience in building web crawlers or large-scale data pipelines
- Solid understanding of HTTP, networking protocols, and browser behavior
- Familiarity with distributed systems and parallel processing techniques
- Experience handling large datasets, ideally at the terabyte to petabyte scale
- Demonstrated ability to debug and maintain systems within unstable or adversarial environments
- Experience with NLP pipelines or dataset curation for machine learning (Preferred)
- Familiarity with LLM pre-training data or retrieval systems (Preferred)
- Practical experience with headless browsers (e.g., Playwright, Puppeteer, or Chrome DevTools Protocol) (Preferred)
- Knowledge of proxy systems, IP rotation, and large-scale request orchestration (Preferred)
- Background in data quality evaluation or benchmarking (Preferred)
- Experience running workloads on cloud or bare-metal infrastructure (Preferred)
Responsibilities
- Construct and maintain large-scale web crawlers across diverse domains.
- Design high-throughput, fault-tolerant systems for data collection, managing volumes ranging from millions to billions of URLs per day.
- Navigate anti-bot systems, rate limits, and dynamic, JavaScript-heavy websites.
- Develop robust pipelines for data cleaning, deduplication, filtering, and normalization.
- Build and maintain datasets specifically structured for research and machine learning model training.
- Monitor and optimize crawl performance, coverage, and data quality through rapid iteration.
- Collaborate with research teams to ensure data collection efforts align with modeling requirements.
- Optimize infrastructure to ensure cost-efficiency, low latency, and reliability.