If Your Scraper Uses Regex on HTML, You're Already Broken

DEV Community·SIÁN Agency·18 days ago

If your "scraper" is a requests.get() followed by re.findall(r'<div class="price">.*?</div>', html), I have bad news. You don't have a scraper. You have a layout sensor.

The first time the dev team renames the class, adds a wrapper <span>, or A/B tests a new pricing component, your pipeline goes silent. Not loud, not error-throwing: silent. Empty rows in the dataset. No alarm. You find out a week later when a stakeholder asks why the dashboard looks weird.

I rebuilt our Idealista scraper this quarter and the regex stage was the thing I deleted first.

The 3-item checklist

Before you write another re.findall against HTML, check:

1. Is there a stable accessibility role or label? (getByRole('heading', { name: /price/i }) survives class renames.)
2. Is the data actually in the rendered page, or is it injected via JSON? (Often the JSON-LD <script> block has everything you need, no DOM walking.)
3. Can you assert the schema fails loud?…
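To make the second and third checks concrete, here is a minimal sketch using only the Python standard library. It pulls the price out of a JSON-LD <script> block and raises instead of returning an empty result. Everything named here is an assumption for illustration: SAMPLE is made-up markup, not a real Idealista page, and JsonLdCollector, SchemaError, and extract_price are hypothetical helpers. The offers.price shape follows the schema.org Product/Offer vocabulary, which a given site may or may not use. Reading "fails loud" as "raise an exception" is an interpretation, since the checklist is cut off at that point.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical page: the price class was renamed, but the JSON-LD block is intact.
SAMPLE = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Flat",
 "offers": {"@type": "Offer", "price": "350000", "priceCurrency": "EUR"}}
</script>
</head><body><div class="cost-v2"><span>350.000 EUR</span></div></body></html>
"""


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


class SchemaError(ValueError):
    """Raised when the page no longer carries the fields the pipeline depends on."""


def extract_price(html: str) -> float:
    """Return offers.price from the first usable JSON-LD block, or raise SchemaError."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks, keep looking
        # Only the plain-object shape is handled; lists and @graph are not.
        offers = data.get("offers") if isinstance(data, dict) else None
        if isinstance(offers, dict) and "price" in offers:
            return float(offers["price"])
    # The whole point: no silent empty row, the job stops here.
    raise SchemaError("no JSON-LD block with offers.price found")


if __name__ == "__main__":
    # The regex from the intro finds nothing and raises nothing.
    print(re.findall(r'<div class="price">.*?</div>', SAMPLE))
    # The JSON-LD path still works after the class rename.
    print(extract_price(SAMPLE))
```

On the sample, the regex returns an empty list without complaint, while extract_price still finds the value. Feed it a page with no matching JSON-LD and it raises SchemaError, so the failure shows up in your job logs rather than as blank rows a week later.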
