I Tested 15 LLMs for Web Scraping and Built Heuristics Instead


DEV Community·Rohith M·27 days ago


#webdev #ai #scraping #model #problem #seconds


The problem nobody talks about: 600KB of DOM

When I started building a web scraper, the obvious move was to send the page to an LLM and ask it to extract the data. Simple, right? Wrong. A typical product listing page is 500–700KB of raw DOM. Sending that to any model means you're paying for ~150,000 tokens per page, waiting 15–30 seconds per request, and hitting context limits on anything complex. I hit this wall on page one.

Four months, 15 models, same result

I tested everything: GPT-4, GPT-4o, Gemini 1.5 Pro, Gemini Ultra, Claude 3 Opus, Claude 3.5 Sonnet, Mistral Large, Llama 3 70B, Cohere Command R+, and a handful of smaller fine-tuned models. The results were consistent:

- GPT-4 / Gemini Ultra: accurate, but 25–35 seconds per page
- Claude 3.5 Sonnet: best accuracy-to-latency ratio, still 5–10 seconds
- Smaller models: faster, but hallucinated field names constantly

No model solved the latency problem because I was asking them to solve the wrong problem.…
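To make the token figure concrete, here is a rough back-of-the-envelope sketch. It assumes about 4 characters per token and 1 KB = 1,000 bytes; both are rules of thumb rather than anything measured in the post, and real tokenizers vary, especially on markup-heavy HTML. The function name `estimate_tokens` is hypothetical.

```python
def estimate_tokens(dom_bytes: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for a page of raw DOM.

    Assumes ~4 characters per token (a common rule of thumb, not a
    measured value) and that the DOM is mostly single-byte text.
    """
    return round(dom_bytes / chars_per_token)


if __name__ == "__main__":
    # Page sizes taken from the 500-700KB range quoted above.
    for size_kb in (500, 600, 700):
        tokens = estimate_tokens(size_kb * 1000)
        print(f"{size_kb} KB of DOM -> ~{tokens:,} tokens")
```

Under these assumptions a 600KB page comes out at roughly 150,000 tokens, in line with the figure quoted above.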
