I Tested 15 LLMs for Web Scraping and Built Heuristics Instead

DEV Community·Rohith M·27 days ago
#webdev #ai #scraping #model #problem #seconds

The problem nobody talks about: 600KB of DOM

When I started building a web scraper, the obvious move was to send the page to an LLM and ask it to extract the data. Simple, right? Wrong.

A typical product listing page is 500–700KB of raw DOM. Sending that to any model means you're paying for ~150,000 tokens per page, waiting 15–30 seconds per request, and hitting context limits on anything complex. I hit this wall on page one.

Four months, 15 models, same result

I tested everything: GPT-4, GPT-4o, Gemini 1.5 Pro, Gemini Ultra, Claude 3 Opus, Claude 3.5 Sonnet, Mistral Large, Llama 3 70B, Cohere Command R+, and a handful of smaller fine-tuned models. The results were consistent:

- GPT-4 / Gemini Ultra: accurate, but 25–35 seconds per page
- Claude 3.5 Sonnet: best accuracy-to-latency ratio, still 5–10 seconds
- Smaller models: faster, but hallucinated field names constantly

No model solved the latency problem because I was asking them to solve the wrong problem.…
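The ~150,000 token figure for a 500–700KB page can be sanity-checked with a common rule of thumb of roughly 4 characters per token. The sketch below is only a back-of-the-envelope estimate: the function name `estimate_tokens` and the 4-characters-per-token ratio are assumptions for illustration, not something stated in the post, and real tokenizers will give different counts on HTML-heavy input.

```python
def estimate_tokens(dom_bytes: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for a raw DOM dump.

    Assumes ~1 byte per character (mostly ASCII markup) and a
    hypothetical average of 4 characters per token.
    """
    return round(dom_bytes / chars_per_token)


if __name__ == "__main__":
    # The page sizes quoted in the post: 500-700KB of raw DOM.
    for size_kb in (500, 600, 700):
        tokens = estimate_tokens(size_kb * 1000)
        print(f"{size_kb} KB -> ~{tokens:,} tokens")
```

Under these assumptions a 600KB page comes out at about 150,000 tokens, which matches the number quoted above and is past the context window of many models before any prompt is added.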
