- Design, build, and operate a large-scale web crawler responsible for acquiring all openly accessible data on the internet
- Develop specialized deep crawlers targeting high-value sources to improve recall and coverage
- Own a long-term roadmap for data acquisition in collaboration with data researchers
- Build observability, monitoring, and debugging tooling to ensure reliability and transparency across crawl infrastructure
- Collaborate with pre-training, post-training, and evaluations teams to align data acquisition priorities with model training needs
- Build high-throughput ingestion pipelines for rapidly onboarding partner data and evaluating it for quality
AWS, Docker, Python