When your web extraction tool should fail loudly instead of returning pretty lies

DEV Community·Zee·20 days ago

#python #webscraping #ai #mcp

A web extraction API has one job that sounds boring until it fails: return the data that exists, or admit that it could not get it.

That second half matters more than most people want to admit.

When you put an LLM at the end of a scraping pipeline, you get a nasty failure mode. The fetch fails, the page is blocked, the PDF text is empty, or the site returns a CAPTCHA page, and the model still tries to be helpful. Helpful, in this case, means inventing plausible JSON.

That is worse than a 500. A 500 tells your pipeline to retry, route, alert, or skip. Fabricated JSON quietly poisons whatever comes next.

The pattern we ended up using

For Haunt API, the extraction path is deliberately boring before it is clever:

1. fetch the page directly
2. fall back through stronger fetch/render paths when needed
3. inspect what actually came back
4. only ask the model to extract when there is real page content
5. return a structured failure when the page is inaccessible or clearly a verification wall

The key part is step 3.…
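The steps described above can be sketched roughly as follows. This is a minimal illustration and not Haunt API's actual code: the names `FetchResult`, `ExtractionResult`, `inspect_content`, and `extract`, the marker strings, and the 200-character threshold are all assumptions made up for the example. Fetchers and the model call are passed in as plain callables so the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical markers and threshold, chosen only for illustration.
WALL_MARKERS = ("captcha", "verify you are human", "access denied")
MIN_TEXT_CHARS = 200


@dataclass
class FetchResult:
    status: int
    text: str


@dataclass
class ExtractionResult:
    ok: bool
    data: Optional[dict] = None
    error: Optional[str] = None


def inspect_content(result: FetchResult) -> Optional[str]:
    """Return a failure reason, or None if the page looks like real content."""
    if result.status >= 400:
        return f"http_{result.status}"
    text = result.text.strip()
    lowered = text.lower()
    if any(marker in lowered for marker in WALL_MARKERS):
        return "verification_wall"
    if len(text) < MIN_TEXT_CHARS:
        return "empty_or_too_short"
    return None


def extract(
    url: str,
    fetchers: Sequence[Callable[[str], FetchResult]],
    model_extract: Callable[[str], dict],
) -> ExtractionResult:
    """Try each fetcher in order; call the model only on inspected content."""
    last_reason = "no_fetchers"
    for fetch in fetchers:
        try:
            result = fetch(url)
        except Exception as exc:  # network errors, timeouts, render failures
            last_reason = f"fetch_error:{type(exc).__name__}"
            continue
        reason = inspect_content(result)
        if reason is None:
            # Real page content: only now is the model asked to extract.
            return ExtractionResult(ok=True, data=model_extract(result.text))
        last_reason = reason
    # Every path failed: return a structured failure, never invented JSON.
    return ExtractionResult(ok=False, error=last_reason)
```

The design point is that `model_extract` is unreachable unless `inspect_content` returns `None`, so a blocked or empty page cannot turn into plausible-looking output. A plain substring check like this would also flag a genuine article that merely mentions CAPTCHAs, so a real implementation would likely need a more careful inspection step.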
