Here's the token cost of scraping the Hacker News top 30, a hundred times, across one drift event:
| Approach | Tokens for 100 queries, 1 drift |
|---|---|
| Agent reads the page every query (Claude-for-Chrome style) | 962,500 |
| Agent rewrites the scraper once per drift (tap fix style) | ~1,075 |
| Tap router — diff the drift, minimal patch (production, Sonnet) | 1,134 |
| Information-theoretic floor (oracle MDL) | 1 |
The gap between the top row and the third row is 849× at 100 queries. At 1,000 queries it's 8,488×. The gap grows linearly with queries per drift.
This isn't about smarter models — we used the same model for every arm. It's about where the LLM enters the loop.
Every AI agent that scrapes a page — Claude for Chrome, OpenAI Atlas, Browser Use, Skyvern, Stagehand — runs the same loop:
Steps 2 and 3 cost tokens. Every single invocation.
When the site changes — a class renamed, an element reordered, a selector moved — the agent doesn't know. It just reads the (now different) page and extracts from the (now different) DOM. Same token cost. Silent correctness drift.
This is the default architecture today. It's also the reason a scraper costs $1 per run.
We ran a controlled experiment comparing four approaches to recovering from a single drift event. Target: the hackernews/hot tap, scraping the HN front page. Drift: .athing → .athing_v99 in the scraper's selector. After drift, the scraper returns an empty array.
Arm A — naive LLM extraction. Every query: fetch the page HTML, feed it to the LLM with "return top 30 as JSON with these columns." No tap, no forge, no amortization. The Claude-for-Chrome / OpenAI-Atlas path.
Arm B — current tap fix. When the scraper breaks, a doctor check alarms. The agent reads the broken source + doctor diagnostic + a page-inspection snapshot, and rewrites the entire scraper source.
Arm C — Tap router. When the scraper breaks, a verifier V (cross-validated against an independent authoritative source — HN's Firebase API in this case) produces a structured drift report. The agent sees only the broken source + the one-line V report, and emits a minimal patch: {old_fragment, new_fragment}. No page HTML, no full source regeneration.
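One way to picture the minimal-patch step is a strict find-and-replace over the tap source. The sketch below is illustrative only: the `MinimalPatch` interface, the `applyMinimalPatch` function, and the one-line `brokenSource` are hypothetical names and data chosen to mirror the `{old_fragment, new_fragment}` shape described above, not the Tap router's actual implementation. It assumes a patch should be rejected unless `old_fragment` matches exactly once.

```typescript
// Hypothetical patch shape, mirroring the {old_fragment, new_fragment} pair described above.
interface MinimalPatch {
  old_fragment: string;
  new_fragment: string;
}

// Apply a patch only if old_fragment occurs exactly once in the source;
// zero matches means the patch is stale, several mean it is ambiguous.
function applyMinimalPatch(source: string, patch: MinimalPatch): string {
  if (patch.old_fragment.length === 0) {
    throw new Error("old_fragment must not be empty");
  }
  const occurrences = source.split(patch.old_fragment).length - 1;
  if (occurrences !== 1) {
    throw new Error(`expected exactly 1 match for old_fragment, found ${occurrences}`);
  }
  // A replacer function keeps "$" sequences in new_fragment literal.
  return source.replace(patch.old_fragment, () => patch.new_fragment);
}

// Illustrative one-line stand-in for the tap source (assumption, not the real tap).
const brokenSource = 'const rows = document.querySelectorAll(".athing_v99");';
const repaired = applyMinimalPatch(brokenSource, {
  old_fragment: ".athing_v99",
  new_fragment: ".athing",
});
console.log(repaired);
```

Refusing ambiguous matches is a design choice in this sketch: it keeps a tiny patch from silently rewriting more of the source than the model intended.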
Arm D — oracle MDL floor. Post-hoc Levenshtein distance between broken and correct source. A theoretical lower bound — the minimum information needed to communicate the repair.
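For readers who want to reproduce the flavor of the Arm D calculation, here is a minimal sketch of Levenshtein distance using the standard two-row dynamic program. The function name `levenshtein` is our own, and the sketch counts edits in UTF-16 code units; how the experiment converts an edit distance into a token count is not specified here, so treat this as the character-level idea only.

```typescript
// Classic Levenshtein distance: the minimum number of single-character
// insertions, deletions, and substitutions that turn `a` into `b`.
function levenshtein(a: string, b: string): number {
  // prev[j] = distance between the first i-1 characters of a and the first j of b.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i]; // turning i characters of a into "" takes i deletions
    for (let j = 1; j <= b.length; j++) {
      const substitution = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[b.length];
}

// The drifted selector differs from the correct one by the "_v99" suffix.
console.log(levenshtein(".athing_v99", ".athing"));
```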
The arms ran with Claude Haiku 4.5 as the LLM unless a row says otherwise (the Arm C figure comes from the production prompt on Sonnet), measuring actual usage:
| Arm | Raw-call tokens | What succeeded | Time |
|---|---|---|---|
| A — naive LLM, read HTML, extract JSON (Haiku) | 9,625 | 30/30 rows match Firebase | 20.6s |
| B — rewrite source from doctor signal (Haiku) | 1,075 | functional rewrite | 5.0s |
| C — V report + minimal patch (Sonnet, production prompt) | 1,134 | byte-exact source restore | 2.8s |
| D — oracle MDL floor | 1 | by construction | 0 |
Three observations.
Arm A is shockingly expensive per query. 9,625 tokens isn't a one-time cost. It's paid every time a user asks for the data. A daily HN scrape at 9,625 tokens per run is 3.5M tokens per year — for a site that's already publicly structured and has a free JSON API.
Arm B's minimal form can succeed on well-known sites. Our simulation gave Arm B only the doctor diagnostic and the broken source, with no page inspection, and Haiku correctly fixed the selector from its prior knowledge of HN's markup. But the rewrite introduced subtle deviations: the URL semantics changed from HN's item-page URL to the submission's outbound URL, and an unnecessary string-prefix strip appeared. These are the kinds of changes that pass immediate tests but shift the tap's contract with downstream code. For sites the LLM doesn't know from training, Arm B also needs page inspection, which adds ~3–10K tokens and closes most of the gap to Arm A.
Arm C achieves byte-exact restoration with one-tenth the input tokens. The LLM received 655 tokens: the broken source (2 KB) plus a one-line V report ("row count 0 below min_rows 20") plus a constraint ("prefer standard selectors"). It emitted a four-token patch: .athing_v99 → .athing. Applied to the broken source, the result was bit-for-bit identical to the original pre-drift source.
The Tap router's efficiency isn't a model choice. We ran the same prompt against Claude Sonnet and Opus too: Sonnet used 14,385 and Opus 19,886 agent-level tokens, and both were correct. The architecture works across models.
It's a context choice.
The verifier V, produced at forge time, captures a baseline of what the tap should output. When drift occurs, V cross-validates the live output against an independent authoritative source — for HN, the Firebase API; for Reddit, the Atom RSS feed. The disagreement is a structured signal, not "something seems wrong" but "the id field's observed value at row 0 doesn't match the authoritative id of the top story."
The LLM gets this drift signal plus the tap source. It doesn't need to re-read the page, because the page's truth is already encoded in the V report. The patch space is constrained: whatever the LLM emits must pass V when applied.
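The cross-validation idea can be sketched as a pure function that compares the tap's rows with rows from the authoritative source and returns a one-line report. Everything named here (`Row`, `DriftReport`, `verify`, the `minRows` parameter) is a hypothetical stand-in rather than the real V. The strict position-by-position id comparison is a simplifying assumption, since live rankings can shift between two fetches.

```typescript
// Hypothetical row and report shapes for illustration.
interface Row {
  id: string;
  title: string;
}

interface DriftReport {
  ok: boolean;
  message: string; // one-line signal handed to the LLM on failure
}

// Compare tap output against an independent authoritative source.
function verify(tapRows: Row[], authoritativeRows: Row[], minRows: number): DriftReport {
  if (tapRows.length < minRows) {
    return { ok: false, message: `row count ${tapRows.length} below min_rows ${minRows}` };
  }
  const limit = Math.min(tapRows.length, authoritativeRows.length);
  for (let i = 0; i < limit; i++) {
    if (tapRows[i].id !== authoritativeRows[i].id) {
      return {
        ok: false,
        message: `id at row ${i} is "${tapRows[i].id}", authoritative id is "${authoritativeRows[i].id}"`,
      };
    }
  }
  return { ok: true, message: "ok" };
}

// After the selector drift the tap returns an empty array.
const authoritative: Row[] = [{ id: "101", title: "Example story" }];
const driftReport = verify([], authoritative, 20);
console.log(driftReport.message);
```

The same function can double as the acceptance gate: a candidate patch is kept only if the patched tap's output makes `verify` return `ok: true`.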
This is the shape of the insight: drift detection + independent baseline + minimal-patch prompt replaces page re-inspection + source regeneration. The LLM's job shrinks from "re-derive the whole scraper" to "produce a tiny patch."
Per drift, Arm C is 8.5× cheaper than Arm A (9,625 vs. 1,134 tokens). That's a modest win. The real story is amortization.
Arm A pays on every query. Arm C pays once per drift, and the repaired scraper then runs deterministically, at zero tokens per query, until the next drift. Over N queries per drift:
| Queries / drift | Arm A cumulative | Arm C cumulative | Arm C advantage |
|---|---|---|---|
| 1 | 9,625 | 1,134 | 8.5× |
| 10 | 96,250 | 1,134 | 85× |
| 100 | 962,500 | 1,134 | 849× |
| 1,000 | 9,625,000 | 1,134 | 8,488× |
For a scraper that runs hourly (24 queries/day) on a site that drifts monthly (~700 queries/drift), Arm C is roughly 6,000× cheaper than Arm A (700 × 9,625 ≈ 6.7M tokens versus 1,134). The gap grows linearly with the invocation-to-drift ratio.
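The amortization arithmetic is simple enough to check by hand, and a small sketch makes it easy to plug in other workloads. The constants are the measured per-query and per-drift token counts from the results table. The function name `cumulativeTokens` and its return shape are our own, and the model assumes exactly one drift per window and zero tokens for deterministic runs.

```typescript
// Measured values from the results table above.
const ARM_A_TOKENS_PER_QUERY = 9_625; // paid on every query
const ARM_C_TOKENS_PER_DRIFT = 1_134; // paid once per drift

// Cumulative token cost over one drift window of `queriesPerDrift` queries.
function cumulativeTokens(queriesPerDrift: number): {
  armA: number;
  armC: number;
  advantage: number;
} {
  const armA = ARM_A_TOKENS_PER_QUERY * queriesPerDrift;
  const armC = ARM_C_TOKENS_PER_DRIFT; // independent of query count
  return { armA, armC, advantage: armA / armC };
}

for (const queries of [1, 10, 100, 1000]) {
  const { armA, armC, advantage } = cumulativeTokens(queries);
  console.log(`${queries} queries/drift: A=${armA}, C=${armC}, advantage=${advantage.toFixed(1)}x`);
}
```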
This is why Claude-for-Chrome-style agents don't scale to production scraping workloads. The economics work for one-shot agent tasks where amortization doesn't apply. They collapse when the same data gets fetched a hundred times.
We should be honest about what this shows and doesn't show.
One mutation, one tap, N=1. This is a first data point. The class-rename case is the simplest drift category. Harder mutations — schema changes, auth flow shifts, anti-bot additions — may close the A/C gap. More samples would give variance bounds. Expanding to 2 taps × 2 mutations × 3 arms × N=5 is on deck.
Haiku was enough for class-rename. Harder drifts probably need Sonnet or Opus for Arm C to produce a correct patch; token cost scales accordingly. The linear-amortization story still holds, but the per-drift constants shift.
Arm B measurement is a lower bound. Real tap fix includes a page inspection snapshot, which we omitted. Production Arm B probably sits between ~6–12K tokens — close to Arm A, not close to Arm C.
Arm C assumes V exists. We spent the prior experiment calibrating V: three pilot taps at FPR = 0, FNR = 0 across 390 samples. Our first attempt had 100% FPR on GitHub (wrong API endpoint) and 75% FPR on HN (job postings show no author in the DOM but do in Firebase). Without a working verifier, Arm C's promise doesn't hold.
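For clarity on what the calibration numbers mean, here is a minimal sketch of how FPR and FNR could be computed from labeled verifier runs. The `Sample` shape, the `errorRates` function, and the example data are hypothetical. The sketch assumes each sample records whether drift was really present and whether V raised an alarm.

```typescript
// Hypothetical labeled sample: ground truth plus the verifier's verdict.
interface Sample {
  drifted: boolean; // was drift actually injected or present?
  flagged: boolean; // did the verifier raise an alarm?
}

// FPR = FP / (FP + TN): alarms raised on healthy taps.
// FNR = FN / (FN + TP): real drifts the verifier missed.
// Each rate is NaN when there are no samples of the relevant class.
function errorRates(samples: Sample[]): { fpr: number; fnr: number } {
  let fp = 0;
  let tn = 0;
  let fn = 0;
  let tp = 0;
  for (const s of samples) {
    if (s.drifted) {
      if (s.flagged) tp++;
      else fn++;
    } else {
      if (s.flagged) fp++;
      else tn++;
    }
  }
  return {
    fpr: fp + tn === 0 ? NaN : fp / (fp + tn),
    fnr: fn + tp === 0 ? NaN : fn / (fn + tp),
  };
}

// Illustrative data only: three false alarms out of four healthy runs, one caught drift.
const calibrationRuns: Sample[] = [
  { drifted: false, flagged: true },
  { drifted: false, flagged: true },
  { drifted: false, flagged: true },
  { drifted: false, flagged: false },
  { drifted: true, flagged: true },
];
console.log(errorRates(calibrationRuns));
```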
tap fix end-to-end against the broken tap, compare to the minimal simulation above.

The reproducibility kit is open under Apache-2.0 in the public tap-skills repo, under experiments/:
- w0-verifier/: 3 pilot taps calibrated at FPR=0, FNR=0 over 390 samples
- w1-recover/: mutation injection, four arms, runner, judge

Arms A, B, C need an LLM endpoint (Anthropic API, Ollama local, or any OpenAI-compatible). Arm D runs without any network.

    deno run -A experiments/w0-verifier/run.ts      # calibrates V
    deno run -A experiments/w1-recover/runner.ts    # measures Arm D + attempts A

    brew install LeonTing1010/tap/taprun
    tap mcp stdio
    tap hackernews/hot