Health Contracts Catch What Pydantic Can't

April 13, 2026 · Leon Ting · 8 min read

A user on r/webscraping runs 30 scrapers. His biggest problem isn't that they crash. It's that they silently return wrong data.

"I run around 30 scrapers and the thing that annoys me most is when they don't crash but just silently return bad data because a selector changed or something on the site moved around. Sometimes I don't notice for days."
— pavlito88, r/webscraping (23 upvotes, 41 comments)

He asked the same follow-up question six times in that thread. Each time, someone suggested a monitoring approach. Each time, he found the gap:

"What about when the structure is fine but the selector shifted — you're still getting a string for price, but it's now pulling the wrong price from the page?"

Nobody had a complete answer. Prometheus doesn't help. Pydantic doesn't help. OpenTelemetry (OTel) doesn't help. They validate infrastructure health or data shape. None of them validate semantics: is this the right price, from the right element, on the right page?
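To make the gap concrete, here is a minimal sketch of a shape-only check. The function name `passesShapeCheck` and the sample rows are hypothetical, and the sketch assumes prices have already been parsed to numbers. It is not taken from any library; it only illustrates that a row with the wrong price can satisfy every type and non-empty rule.

```javascript
// Hypothetical shape check: right fields, right types, nothing empty.
function passesShapeCheck(row) {
  return (
    typeof row.title === "string" && row.title.length > 0 &&
    typeof row.price === "number" && Number.isFinite(row.price) &&
    typeof row.url === "string" && row.url.length > 0
  );
}

// Correct row: the product's real price (sample data, invented for illustration).
const goodRow = {
  title: "Desk lamp",
  price: 49.0,
  url: "https://store.example.com/products/desk-lamp",
};

// Wrong row: same product, but the selector now reads a "$0.99 shipping" badge.
const badRow = { ...goodRow, price: 0.99 };

console.log(passesShapeCheck(goodRow)); // true
console.log(passesShapeCheck(badRow));  // true, the shape check cannot tell them apart
```

Both rows pass, which is exactly the situation described in the thread: the structure is fine and the value is still wrong.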

The Gap in Every Monitoring Stack

Here's what everyone told him to do, and where each approach breaks down:

Approach	What it catches	What it misses
Row count threshold	Scraper returned 0 rows	Scraper returned 100 rows from wrong section
Type checking (Pydantic)	Price has the wrong type	Price is a valid number, but from the ad block
Non-empty checks	Field is missing	Field is present but semantically wrong
Prometheus + Grafana	Scraper process crashed	Scraper ran fine, data is garbage
OpenTelemetry	Infrastructure health	"For data, you have to write the contract" — armanfixing

The last one is the most revealing. Even the OTel advocate admitted: data correctness requires a contract. But who writes the contract? And what does it look like?

Health Contracts: The Missing Layer

A health contract is a declarative spec that lives inside the program itself. Not in a separate monitoring system. Not in a dashboard. In the code.

// A .tap.js program with a health contract
export default {
  site: "ecommerce",
  name: "products",

  health: {
    // Layer 1: Shape validation (what everyone already does)
    min_rows: 10,
    non_empty: ["title", "price", "url"],
    unique: ["url"],

    // Layer 2: Semantic validation (what nobody does)
    range: {
      price: { min: 1, max: 50000 },     // catches ad prices ($0.01) and glitches ($999999)
      rating: { min: 0, max: 5 }          // catches wrong-column data
    },
    pattern: {
      url: "^https://store\\.example\\.com/products/",  // catches URLs from related-items section
      sku: "^[A-Z]{2}-\\d{6}$"                          // catches garbled SKUs
    },
    drift: {
      price: 50                           // alert if median price shifts >50% between runs
    }
  },

  async tap(handle) {
    // ... extraction logic ...
  }
}
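The Layer 1 keys are the familiar shape checks. As a rough illustration of what evaluating them might involve, here is a sketch. It is not Tap's implementation: the function name `checkShape`, its signature, and the message wording are all assumptions made for this example.

```javascript
// Hypothetical evaluator for the Layer 1 keys (min_rows, non_empty, unique).
// Returns a list of human-readable failures; an empty list means the checks pass.
function checkShape(rows, health) {
  const failures = [];

  // min_rows: the run must return at least this many rows.
  if (health.min_rows !== undefined && rows.length < health.min_rows) {
    failures.push(`min_rows: ${rows.length} < ${health.min_rows}`);
  }

  // non_empty: listed fields must be present and not an empty string.
  for (const field of health.non_empty ?? []) {
    const empty = rows.filter(
      (r) => r[field] === undefined || r[field] === null || r[field] === ""
    ).length;
    if (empty > 0) {
      failures.push(`non_empty: "${field}" empty in ${empty}/${rows.length} rows`);
    }
  }

  // unique: listed fields must not repeat across rows.
  for (const field of health.unique ?? []) {
    const distinct = new Set(rows.map((r) => r[field])).size;
    if (distinct < rows.length) {
      failures.push(`unique: "${field}" has ${rows.length - distinct} duplicate value(s)`);
    }
  }

  return failures;
}

// Sample rows, invented for illustration: one empty title, one repeated URL.
const sampleRows = [
  { title: "Desk lamp", price: 49, url: "https://store.example.com/products/a" },
  { title: "", price: 19, url: "https://store.example.com/products/a" },
];

console.log(checkShape(sampleRows, { min_rows: 1, non_empty: ["title"], unique: ["url"] }));
```

Everything in this sketch operates on presence and counts. None of it looks at whether a value is plausible, which is what the second layer is for.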

Three semantic checks, each catching failures the previous one can't:

1. `range` — numeric bounds per column

The site redesigns. Your .price selector now matches a promotional badge showing "$0.99 shipping" instead of the product price. The value is a valid number. It's non-empty. Row count is fine. Pydantic passes.

But range: { price: { min: 1, max: 50000 } } catches it instantly: value 0.99 outside [1, 50000] in 3/25 rows.
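A range check like this is small enough to sketch in a few lines. The following is an illustration under assumptions, not Tap's code: `checkRange` is a hypothetical helper, and the message format is modeled on the violation text quoted above.

```javascript
// Hypothetical range check: for each configured column, count numeric values
// outside [min, max]. Non-numeric values are skipped here and left to other checks.
function checkRange(rows, range) {
  const failures = [];
  for (const [field, { min, max }] of Object.entries(range)) {
    const outOfBounds = rows.filter(
      (r) => typeof r[field] === "number" && (r[field] < min || r[field] > max)
    );
    if (outOfBounds.length > 0) {
      // Report the first offending value plus how many rows are affected.
      failures.push(
        `range: "${field}" value ${outOfBounds[0][field]} outside [${min}, ${max}] ` +
          `in ${outOfBounds.length}/${rows.length} rows`
      );
    }
  }
  return failures;
}

// Sample data, invented for illustration: 25 rows, 3 of which picked up the shipping badge.
const rangeRows = Array.from({ length: 25 }, (_, i) => ({ price: i < 3 ? 0.99 : 49 }));

console.log(checkRange(rangeRows, { price: { min: 1, max: 50000 } }));
// [ 'range: "price" value 0.99 outside [1, 50000] in 3/25 rows' ]
```

The bounds do not need to be tight. They only need to exclude values that could never be a real product price on this site.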

2. `pattern` — regex per column

The site adds a "Related Products" section below the main listing. Your selector now grabs URLs from both sections. The URLs are valid strings. They're non-empty. But half of them point to a different product category.

pattern: { url: "^https://store\\.example\\.com/products/" } catches it: failed regex in 12/25 rows. The related-products URLs don't match the expected path prefix.
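As a sketch of how such a pattern check could work, assuming the contract stores each regex as a string: `checkPattern` and the `/related/` sample URLs below are hypothetical, and the message wording is only modeled on the text above.

```javascript
// Hypothetical pattern check: compile each configured regex once and count
// the rows whose value does not match it.
function checkPattern(rows, pattern) {
  const failures = [];
  for (const [field, source] of Object.entries(pattern)) {
    const regex = new RegExp(source);
    // Missing values are treated as an empty string, which fails an anchored pattern.
    const failed = rows.filter((r) => !regex.test(String(r[field] ?? ""))).length;
    if (failed > 0) {
      failures.push(`pattern: "${field}" failed regex in ${failed}/${rows.length} rows`);
    }
  }
  return failures;
}

// Sample data, invented for illustration: 13 listing URLs and 12 from another section.
const patternRows = Array.from({ length: 25 }, (_, i) => ({
  url:
    i < 13
      ? `https://store.example.com/products/item-${i}`
      : `https://store.example.com/related/item-${i}`,
}));

console.log(checkPattern(patternRows, { url: "^https://store\\.example\\.com/products/" }));
// [ 'pattern: "url" failed regex in 12/25 rows' ]
```

Anchoring the regex with `^` matters here: without it, a URL that merely contains the expected prefix somewhere in a query string would pass.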

3. `drift` — cross-run distribution shift

This is the hardest scenario to catch. The site doesn't break. The selectors don't move. But the site starts showing a different sort order — putting sponsored products first. Your top-10 results shift from $50–$200 products to $5–$15 sponsored items.

Every row passes every validation. Shape is fine. Types are fine. Even ranges are fine (the values are within bounds). But the distribution has shifted dramatically.

drift: { price: 50 } compares the median price against the previous run. If it shifts more than 50%, you get an alert. Not because any individual value is wrong — but because the population has changed.
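The comparison can be sketched as follows. This is an illustration only: `median` and `checkDrift` are hypothetical helpers, the caller is assumed to supply the previous run's rows (how Tap stores them is not covered here), and the sketch assumes both runs are non-empty with a non-zero previous median.

```javascript
// Median of a list of numbers (copies before sorting so the input is untouched).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical drift check: compare each column's median with the previous run
// and flag a relative shift larger than the configured percentage.
// Assumes non-empty inputs and a non-zero previous median.
function checkDrift(previousRows, currentRows, drift) {
  const failures = [];
  for (const [field, maxPercent] of Object.entries(drift)) {
    const before = median(previousRows.map((r) => r[field]));
    const after = median(currentRows.map((r) => r[field]));
    const shift = (Math.abs(after - before) / Math.abs(before)) * 100;
    if (shift > maxPercent) {
      failures.push(
        `drift: "${field}" median moved ${shift.toFixed(1)}% (${before} -> ${after}), limit ${maxPercent}%`
      );
    }
  }
  return failures;
}

// Sample data, invented for illustration: organic results last run, sponsored items now.
const lastRun = [50, 80, 120, 150, 200].map((price) => ({ price }));
const thisRun = [5, 8, 10, 12, 15].map((price) => ({ price }));

console.log(checkDrift(lastRun, thisRun, { price: 50 }));
```

Every value in `thisRun` would pass a range of 1 to 50000, yet the median falls from 120 to 10, well past the 50% limit. The median is used rather than the mean so that a single outlier row does not trigger the alert on its own.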

What `tap doctor` Does With This

$ tap doctor ecommerce products
✘ ecommerce/products  FAIL  score: 0.72
  range: "price" value 0.99 outside [1, 50000] in 3/25 rows
  ⚠ fingerprint: semantic_hash changed on .product-card
  ✓ non_empty: all fields present
  ✓ min_rows: 25 ≥ 10
  ✓ pattern: all URLs match

Two signals converge: the health contract caught the bad values, and the fingerprint diff detected the structural change that caused them. Together, they tell you exactly what happened and where to look.

Compare this to the current state of the art:

"I just do manual spot checks which obviously doesn't scale."
— pavlito88

"It is a manual process to recreate selectors each time we see a major redesign."
— No-Appointment9068 (36 upvotes), r/webscraping

"They have developers (like me) working 24/7 to keep them running."
— army_of_wan (10 upvotes), r/webscraping

Why This Has to Be in the Program, Not in the Dashboard

Every monitoring approach suggested in that thread is external to the scraper: Prometheus, Grafana, Dagster, n8n, Slack alerts. They monitor the infrastructure. They can tell you the process died or the request timed out.

But data correctness is a domain problem, not an infrastructure problem. A price of $0.99 is a perfectly healthy HTTP response. The server returned 200. The JSON parsed. The field exists. Infrastructure monitoring has nothing to report.

The contract has to live where the domain knowledge lives — in the extraction code itself. That's why it's a field on the module, not a separate config file:

// The person who writes the extractor knows:
// - prices should be $1-$50,000
// - URLs should match /products/
// - SKUs follow a specific format
// Nobody else knows this.

health: {
  range: { price: { min: 1, max: 50000 } },
  pattern: { url: "^https://store\\.example\\.com/products/" },
}

When the site changes and the contract fails, tap doctor packages the diagnostics: the contract violation, the fingerprint diff showing what changed on the page, the current code, and the last 3 git commits. You (or your AI agent) read the diagnostics and decide how to fix it.

No retry loops. No black-box "intelligence." A contract that either passes or doesn't, with a precise explanation of why.

Your Scraper Is Broken Right Now — the silent failure problem and why it matters
The Interface Protocol — 8 operations that replace every browser automation SDK
Programs Beat Prompts — why AI should write code, not run it

Try it

curl -fsSL https://taprun.dev/install.sh | sh

# Check your taps right now
tap doctor

# See which ones are silently broken
tap doctor --format json | jq '.results[] | select(.status == "fail")'
