Menu

Post image 1
Post image 2
Post image 3
1 / 3
0

What to do when websites change and your spider doesn't know

DEV Community·John Rooney·22 days ago
#shQDD4wr
Reading 0:00
15s threshold

Empty-field-rate monitoring catches selectors that return nothing. It does not catch selectors that return something wrong. The most damaging form of schema drift is the kind where a selector keeps producing values, the values are syntactically reasonable, and they are no longer the values you wanted. A price selector that quietly starts returning the financing instalment instead of the sticker price will pass every non-empty check while corrupting your data for as long as the drift goes unnoticed. That is the failure mode this post is about. Drift comes in several flavours People talk about "schema drift" as if it were one thing, but in scraping practice there are several kinds of drift, each of which fails differently and demands a different defence.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Read More