A user on r/webscraping runs 30 scrapers. His biggest problem isn't that they crash. It's that they silently return wrong data.
"I run around 30 scrapers and the thing that annoys me most is when they don't crash but just silently return bad data because a selector changed or something on the site moved around. Sometimes I don't notice for days."
— pavlito88, r/webscraping (23 upvotes, 41 comments)
He asked the same follow-up question six times in that thread. Each time, someone suggested a monitoring approach. Each time, he found the gap:
"What about when the structure is fine but the selector shifted — you're still getting a string for price, but it's now pulling the wrong price from the page?"
Nobody had a complete answer. Prometheus doesn't help. Pydantic doesn't help. OTel doesn't help. They all validate shape. None of them validate semantics — is this the right price, from the right element, on the right page?
Here's what everyone told him to do, and where each approach breaks down:
| Approach | What it catches | What it misses |
|---|---|---|
| Row count threshold | Scraper returned 0 rows | Scraper returned 100 rows from wrong section |
| Type checking (pydantic) | Price has the wrong type | Price is a valid number, but from the ad block |
| Non-empty checks | Field is missing | Field is present but semantically wrong |
| Prometheus + Grafana | Scraper process crashed | Scraper ran fine, data is garbage |
| OpenTelemetry | Infrastructure health | "For data, you have to write the contract" — armanfixing |
The last one is the most revealing. Even the OTel advocate admitted: data correctness requires a contract. But who writes the contract? And what does it look like?
A health contract is a declarative spec that lives inside the program itself. Not in a separate monitoring system. Not in a dashboard. In the code.
// A .tap.js program with a health contract
export default {
  site: "ecommerce",
  name: "products",
  health: {
    // Layer 1: Shape validation (what everyone already does)
    min_rows: 10,
    non_empty: ["title", "price", "url"],
    unique: ["url"],

    // Layer 2: Semantic validation (what nobody does)
    range: {
      price: { min: 1, max: 50000 }, // catches ad prices ($0.01) and glitches ($999999)
      rating: { min: 0, max: 5 } // catches wrong-column data
    },
    pattern: {
      url: "^https://store\\.example\\.com/products/", // catches URLs from related-items section
      sku: "^[A-Z]{2}-\\d{6}$" // catches garbled SKUs
    },
    drift: {
      price: 50 // alert if median price shifts >50% between runs
    }
  },
  async tap(handle) {
    // ... extraction logic ...
  }
}
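To make Layer 1 concrete, here is a minimal sketch of how the shape rules (`min_rows`, `non_empty`, `unique`) could be evaluated against extracted rows. The `checkShape` helper and its message format are hypothetical illustrations, not tap's actual implementation.

```javascript
// Hypothetical helper: evaluates the Layer 1 (shape) rules of a health contract.
// The function name and the violation message format are assumptions.
function checkShape(rows, health) {
  const violations = [];

  // min_rows: the run must return at least this many rows.
  if (health.min_rows !== undefined && rows.length < health.min_rows) {
    violations.push(`min_rows: ${rows.length} < ${health.min_rows}`);
  }

  // non_empty: each listed field must be present and not blank.
  for (const field of health.non_empty ?? []) {
    const empty = rows.filter((row) => {
      const value = row[field];
      return value === undefined || value === null || String(value).trim() === "";
    }).length;
    if (empty > 0) {
      violations.push(`non_empty: "${field}" empty in ${empty}/${rows.length} rows`);
    }
  }

  // unique: no two rows may share a value for the listed field.
  for (const field of health.unique ?? []) {
    const distinct = new Set(rows.map((row) => row[field])).size;
    if (distinct < rows.length) {
      violations.push(`unique: "${field}" repeated in ${rows.length - distinct} rows`);
    }
  }

  return violations;
}

// Sample rows (made up): the second row has a blank title and a duplicate URL.
const shapeRows = [
  { title: "Desk lamp", price: 24.99, url: "https://store.example.com/products/1" },
  { title: "", price: 0.99, url: "https://store.example.com/products/1" },
];

const shapeViolations = checkShape(shapeRows, {
  min_rows: 1,
  non_empty: ["title", "price", "url"],
  unique: ["url"],
});
console.log(shapeViolations);
```

Note that the 0.99 price in the second row passes every shape rule. That gap is what the semantic checks are for.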
Layer 2 consists of three semantic checks, each catching failures the previous one can't:
range: numeric bounds per column

The site redesigns. Your .price selector now matches a promotional badge showing "$0.99 shipping" instead of the product price. The value is a valid number. It's non-empty. Row count is fine. Pydantic passes.
But range: { price: { min: 1, max: 50000 } } catches it instantly: value 0.99 outside [1, 50000] in 3/25 rows.
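A range rule like this is cheap to evaluate. The sketch below assumes rows are plain objects with numeric fields. `checkRange` and the shape of its result are hypothetical, chosen to mirror the "value outside bounds in N/M rows" message above.

```javascript
// Hypothetical helper: counts rows whose numeric field falls outside [min, max].
// A value that is missing or not a number is also treated as a failure.
function checkRange(rows, range) {
  const violations = [];
  for (const [field, { min, max }] of Object.entries(range)) {
    const bad = rows
      .map((row) => row[field])
      .filter((v) => typeof v !== "number" || Number.isNaN(v) || v < min || v > max);
    if (bad.length > 0) {
      violations.push({ field, min, max, failed: bad.length, total: rows.length, sample: bad[0] });
    }
  }
  return violations;
}

// Sample rows (made up): two rows picked up the "$0.99 shipping" badge.
const rangeRows = [{ price: 24.99 }, { price: 0.99 }, { price: 129 }, { price: 0.99 }];
const rangeViolations = checkRange(rangeRows, { price: { min: 1, max: 50000 } });

for (const v of rangeViolations) {
  console.log(
    `range: "${v.field}" value ${v.sample} outside [${v.min}, ${v.max}] in ${v.failed}/${v.total} rows`
  );
}
```

Reporting a count and one sample value, rather than failing on the first bad row, makes it easier to tell a single glitch apart from a selector that has moved.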
pattern: regex per column

The site adds a "Related Products" section below the main listing. Your selector now grabs URLs from both sections. The URLs are valid strings. They're non-empty. But half of them point to a different product category.
pattern: { url: "^https://store\\.example\\.com/products/" } catches it: failed regex in 12/25 rows. The related-products URLs don't match the expected path prefix.
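The pattern check can be sketched the same way. This assumes each pattern is a regular-expression source string, as in the contract. The `checkPattern` name and result shape are hypothetical.

```javascript
// Hypothetical helper: counts rows whose field does not match the given regex.
function checkPattern(rows, pattern) {
  const violations = [];
  for (const [field, source] of Object.entries(pattern)) {
    // No "g" flag, so RegExp.prototype.test keeps no state between calls.
    const regex = new RegExp(source);
    const failed = rows.filter((row) => !regex.test(String(row[field] ?? ""))).length;
    if (failed > 0) {
      violations.push({ field, failed, total: rows.length });
    }
  }
  return violations;
}

// Sample rows (made up): the last URL came from a related-products section.
const patternRows = [
  { url: "https://store.example.com/products/desk-lamp" },
  { url: "https://store.example.com/products/floor-lamp" },
  { url: "https://store.example.com/accessories/bulb" },
];
const patternViolations = checkPattern(patternRows, {
  url: "^https://store\\.example\\.com/products/",
});
console.log(patternViolations);
```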
drift: cross-run distribution shift

This is the hardest scenario to catch. The site doesn't break. The selectors don't move. But the site starts showing a different sort order, putting sponsored products first. Your top-10 results shift from $50–$200 products to $5–$15 sponsored items.
Every row passes every validation. Shape is fine. Types are fine. Even ranges are fine (the values are within bounds). But the distribution has shifted dramatically.
drift: { price: 50 } compares the median price against the previous run. If it shifts more than 50%, you get an alert. Not because any individual value is wrong — but because the population has changed.
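As a rough sketch of the comparison: compute the median of the column in both runs and measure the change as a percentage of the previous median. Measuring the shift relative to the previous run is an assumption here, as is how the previous run's rows are stored. `median` and `checkDrift` are hypothetical helpers.

```javascript
// Median of a list of numbers (average of the two middle values for even lengths).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Hypothetical helper: flags fields whose median moved by more than maxPercent
// relative to the previous run. Assumes both runs contain numeric values.
function checkDrift(previousRows, currentRows, drift) {
  const violations = [];
  for (const [field, maxPercent] of Object.entries(drift)) {
    const before = median(previousRows.map((row) => row[field]));
    const after = median(currentRows.map((row) => row[field]));
    // If the previous median is 0, this is Infinity (flagged) or NaN (not flagged).
    const shiftPercent = (Math.abs(after - before) * 100) / Math.abs(before);
    if (shiftPercent > maxPercent) {
      violations.push({ field, before, after, shiftPercent });
    }
  }
  return violations;
}

// Sample data (made up): organic results last run, sponsored items this run.
const previousRun = [{ price: 50 }, { price: 80 }, { price: 120 }, { price: 200 }];
const currentRun = [{ price: 5 }, { price: 8 }, { price: 12 }, { price: 15 }];

const driftViolations = checkDrift(previousRun, currentRun, { price: 50 });
console.log(driftViolations);
```

Every price in `currentRun` would pass the range check of 1 to 50000. Only the comparison against the previous run reveals the change.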
What tap doctor Does With This

$ tap doctor ecommerce products
✘ ecommerce/products FAIL score: 0.72
  range: "price" value 0.99 outside [1, 50000] in 3/25 rows
  ⚠ fingerprint: semantic_hash changed on .product-card
  ✓ non_empty: all fields present
  ✓ min_rows: 25 ≥ 10
  ✓ pattern: all URLs match
Two signals converge: the health contract caught the bad values, and the fingerprint diff detected the structural change that caused them. Together, they tell you exactly what happened and where to look.
Compare this to the current state of the art:
"I just do manual spot checks which obviously doesn't scale."
— pavlito88
"It is a manual process to recreate selectors each time we see a major redesign."
— No-Appointment9068 (36 upvotes), r/webscraping
"They have developers (like me) working 24/7 to keep them running."
— army_of_wan (10 upvotes), r/webscraping
Every monitoring approach suggested in that thread is external to the scraper: Prometheus, Grafana, Dagster, n8n, Slack alerts. They monitor the infrastructure. They can tell you the process died or the request timed out.
But data correctness is a domain problem, not an infrastructure problem. A price of $0.99 is a perfectly healthy HTTP response. The server returned 200. The JSON parsed. The field exists. Infrastructure monitoring has nothing to report.
The contract has to live where the domain knowledge lives — in the extraction code itself. That's why it's a field on the module, not a separate config file:
// The person who writes the extractor knows:
// - prices should be $1-$50,000
// - URLs should match /products/
// - SKUs follow a specific format
// Nobody else knows this.
health: {
  range: { price: { min: 1, max: 50000 } },
  pattern: { url: "^https://store\\.example\\.com/products/" },
}
When the site changes and the contract fails, tap doctor packages the diagnostics: the contract violation, the fingerprint diff showing what changed on the page, the current code, and the last 3 git commits. You (or your AI agent) read the diagnostics and decide how to fix it.
No retry loops. No black-box "intelligence." A contract that either passes or doesn't, with a precise explanation of why.
curl -fsSL https://taprun.dev/install.sh | sh

# Check your taps right now
tap doctor

# See which ones are silently broken
tap doctor --format json | jq '.results[] | select(.status == "fail")'