Designing a Web Crawler That Scales to 1 Billion Pages a Day

DEV Community·Gabriel Anhaia·about 1 month ago

#architecture #database #tutorial #host #frontier #self

Book: System Design Pocket Guide: Interviews
Also by me: Database Playbook
My project: Hermes IDE | GitHub, an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

One billion pages a day is about 11,574 pages every second. If your average page is 80 KB compressed, that's roughly 80 TB of HTML landing on object storage daily (back-of-envelope), and your dedup set carries roughly 30 billion URLs after a month. The interview answer that starts with "I'd use a queue and some workers" gets cut off well before you've covered the actual hard parts.

The right answer is boring in a useful way. A crawler at this scale is four problems pretending to be one. The frontier picks the next URL. The deduper rejects URLs you've already seen. The politeness layer keeps you from getting blocked. The fetcher pool pulls pages without melting your egress NICs. Get those four right and the rest is plumbing.…
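The back-of-envelope figures in the opening paragraph can be reproduced with a few lines of arithmetic. This is a minimal sketch under stated assumptions: decimal units (1 KB = 1,000 bytes), a 30-day month, and one new URL entering the dedup set per fetched page. The sustained ingest bandwidth at the end is derived from those same inputs and is not a figure from the article.

```python
# Back-of-envelope check of the crawl-volume figures quoted above.
PAGES_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400
AVG_PAGE_BYTES = 80 * 1000          # 80 KB compressed; decimal units are an assumption
DAYS_IN_MONTH = 30                  # assumption: "a month" means 30 days

pages_per_second = PAGES_PER_DAY / SECONDS_PER_DAY
daily_storage_tb = PAGES_PER_DAY * AVG_PAGE_BYTES / 1e12
# Assumption: each fetched page adds one new URL to the dedup set.
urls_after_month = PAGES_PER_DAY * DAYS_IN_MONTH
# Derived, not from the article: average inbound bandwidth for compressed HTML.
ingest_gbps = pages_per_second * AVG_PAGE_BYTES * 8 / 1e9

print(f"{pages_per_second:,.0f} pages/s")
print(f"{daily_storage_tb:,.0f} TB/day")
print(f"{urls_after_month:,} URLs in the dedup set")
print(f"{ingest_gbps:.1f} Gbit/s sustained ingest")
```

These are averages; real traffic is bursty, so the fetcher pool and network links would need headroom above the mean.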
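To make the first three of the four problems concrete, here is a toy, single-process sketch of how a frontier, a deduper, and a politeness layer could fit together. It is not the article's design: the `Frontier` class, its method names, the fixed per-host delay, and the in-memory `set` used for dedup are all hypothetical simplifications, and URL normalization, robots.txt handling, and the fetcher pool are left out.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Toy frontier: exact-match dedup plus a fixed per-host politeness delay."""

    def __init__(self, delay_s=1.0):
        self.delay_s = delay_s   # hypothetical minimum gap between hits to one host
        self.seen = set()        # at 30 billion URLs this would not fit in one process
        self.queues = {}         # host -> deque of pending URLs
        self.next_ok = {}        # host -> earliest time the host may be hit again
        self.ready = []          # min-heap of (earliest_fetch_time, host)

    def add(self, url, now):
        """Queue a URL unless it was already seen. Returns True if queued."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).hostname or ""
        if host not in self.queues:
            self.queues[host] = deque()
            # Respect any delay still owed from an earlier fetch to this host.
            heapq.heappush(self.ready, (max(now, self.next_ok.get(host, now)), host))
        self.queues[host].append(url)
        return True

    def next_url(self, now):
        """Return a URL whose host may be fetched at `now`, or None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        queue = self.queues[host]
        url = queue.popleft()
        self.next_ok[host] = now + self.delay_s
        if queue:
            heapq.heappush(self.ready, (self.next_ok[host], host))
        else:
            del self.queues[host]
        return url


frontier = Frontier(delay_s=1.0)
frontier.add("https://a.example/1", now=0.0)
frontier.add("https://a.example/2", now=0.0)
frontier.add("https://b.example/1", now=0.0)
duplicate_queued = frontier.add("https://a.example/1", now=0.0)

first = frontier.next_url(now=0.0)
second = frontier.next_url(now=0.0)
too_soon = frontier.next_url(now=0.5)
third = frontier.next_url(now=1.0)
```

The heap is keyed by host rather than by URL, so a host with many pending pages occupies one slot and cannot crowd out other hosts. Time is passed in explicitly as `now`, which keeps the sketch deterministic and easy to test.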
