Compile Once. Run Forever. Diff the Drift.

April 24, 2026 · Leon Ting · 6 min read

Here's the token cost of scraping the Hacker News top 30, a hundred times, across one drift event:

Approach	Tokens for 100 queries, 1 drift
Agent reads the page every query (Claude-for-Chrome style)	962,500
Agent rewrites the scraper once per drift (`tap fix` style)	~1,075
Tap router — diff the drift, minimal patch (production, Sonnet)	1,134
Information-theoretic floor (oracle MDL)	1

The gap between the top row and the third row is 849× at 100 queries. At 1,000 queries it's 8,488×. The gap grows linearly with queries per drift.

(Prior draft quoted 1,350×, from an earlier prototype using Haiku with an abbreviated prompt. The 849× number is the measured production figure on Sonnet with the full Slice 11/12 heal prompt — taken through the same code path that ships in Tap today.)

This isn't about smarter models. Arms A and B ran on Haiku and Arm C on Sonnet, and Arm C's patch also came back correct on Haiku and Opus. It's about where the LLM enters the loop.

The setup

Every AI agent that scrapes a page — Claude for Chrome, OpenAI Atlas, Browser Use, Skyvern, Stagehand — runs the same loop:

1. User asks for data from a site.
2. Agent reads the page.
3. LLM extracts fields into structured output.
4. Return to user.

Steps 2 and 3 cost tokens. Every single invocation.

When the site changes — a class renamed, an element reordered, a selector moved — the agent doesn't know. It just reads the (now different) page and extracts from the (now different) DOM. Same token cost. Silent correctness drift.

This is the default architecture today. It's also the reason a scraper costs $1 per run.

Three architectures

We ran a controlled experiment comparing three architectures, plus a theoretical floor, on recovery from a single drift event. Target: the hackernews/hot tap, scraping the HN front page. Drift: .athing → .athing_v99 in the scraper's selector. After drift, the scraper returns an empty array.

Arm A — naive LLM extraction. Every query: fetch the page HTML, feed it to the LLM with "return top 30 as JSON with these columns." No tap, no forge, no amortization. The Claude-for-Chrome / OpenAI-Atlas path.

Arm B — current tap fix. When the scraper breaks, a doctor check alarms. The agent reads the broken source + doctor diagnostic + a page-inspection snapshot, and rewrites the entire scraper source.

Arm C — Tap router. When the scraper breaks, a verifier V (cross-validated against an independent authoritative source — HN's Firebase API in this case) produces a structured drift report. The agent sees only the broken source + the one-line V report, and emits a minimal patch: {old_fragment, new_fragment}. No page HTML, no full source regeneration.

Arm D — oracle MDL floor. Post-hoc Levenshtein distance between broken and correct source. A theoretical lower bound — the minimum information needed to communicate the repair.
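For readers who want to see what the Arm D measurement involves, here is a minimal sketch of a character-level Levenshtein distance. The two source strings are hypothetical one-line stand-ins, not the real tap source. The table reports Arm D as 1 in its token column; this sketch counts characters, so it illustrates the technique rather than reproducing that figure.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


# Hypothetical stand-ins for the broken and the correct scraper source.
broken = 'rows = document.querySelectorAll(".athing_v99")'
correct = 'rows = document.querySelectorAll(".athing")'
distance = levenshtein(broken, correct)
```

On these stand-ins the only difference is the `_v99` suffix, so the repair is a handful of character deletions no matter how long the rest of the source is.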

The numbers

Arms A and B ran on Claude Haiku 4.5; Arm C ran on Sonnet with the production prompt. Token counts are measured usage, not estimates:

Arm	Raw-call tokens	What succeeded	Time
A — naive LLM, read HTML, extract JSON (Haiku)	9,625	30/30 rows match Firebase	20.6s
B — rewrite source from doctor signal (Haiku)	1,075	functional rewrite	5.0s
C — V report + minimal patch (Sonnet, production prompt)	1,134	byte-exact source restore	2.8s
D — oracle MDL floor	1	by construction	0

Three observations.

Arm A is shockingly expensive per query. 9,625 tokens isn't a one-time cost. It's paid every time a user asks for the data. A daily HN scrape at 9,625 tokens per run is 3.5M tokens per year — for a site that's already publicly structured and has a free JSON API.

Arm B's minimal form can succeed on well-known sites. Our simulation gave Arm B only the doctor diagnostic and the broken source, with no page inspection, and Haiku correctly fixed the selector from its prior knowledge of HN's markup. But the rewrite introduced subtle deviations: URL semantics changed from HN's item-page URL to the submission's outbound URL, and an unnecessary string-prefix strip appeared. These are the kinds of changes that don't fail immediate tests but drift the tap's contract with downstream code. For sites the LLM doesn't know from training, Arm B also needs page inspection, which adds ~3–10K tokens and closes most of the gap to Arm A.

Arm C achieves byte-exact restoration with one-tenth the input tokens. The LLM received 655 tokens: the broken source (2 KB) plus a one-line V report ("row count 0 below min_rows 20") plus a constraint ("prefer standard selectors"). It emitted a four-token patch: .athing_v99 → .athing. Applied to the broken source, the result was bit-for-bit identical to the original pre-drift source.
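To make the {old_fragment, new_fragment} patch format concrete, here is a minimal sketch of applying one. The `Patch` class, the `apply_patch` helper, and the one-line source strings are hypothetical illustrations, not Tap's code. The rule that the old fragment must match exactly once is also an assumption, added so an ambiguous patch fails loudly instead of rewriting the wrong spot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    old_fragment: str
    new_fragment: str


def apply_patch(source: str, patch: Patch) -> str:
    """Swap old_fragment for new_fragment, refusing missing or ambiguous matches."""
    count = source.count(patch.old_fragment)
    if count != 1:
        raise ValueError(f"expected exactly 1 match for old_fragment, found {count}")
    return source.replace(patch.old_fragment, patch.new_fragment, 1)


# Hypothetical stand-ins for the pre-drift and broken scraper source.
original = 'const rows = document.querySelectorAll(".athing");'
broken = 'const rows = document.querySelectorAll(".athing_v99");'

patch = Patch(old_fragment=".athing_v99", new_fragment=".athing")
restored = apply_patch(broken, patch)
byte_exact = restored.encode("utf-8") == original.encode("utf-8")
```

Because the patch touches only the fragment it names, everything else in the source is carried over unchanged, which is what makes a byte-exact restore possible.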

Why Arm C works

The Tap router's efficiency isn't a model choice. We ran the same prompt against Claude Sonnet and Opus too: Sonnet used 14,385 and Opus 19,886 agent-level tokens (a different measure from the raw-call counts in the table), and both produced a correct patch. The architecture works across models.

It's a context choice.

The verifier V, produced at forge time, captures a baseline of what the tap should output. When drift occurs, V cross-validates the live output against an independent authoritative source: for HN, the Firebase API; for Reddit, the Atom RSS feed. The disagreement is a structured signal: not "something seems wrong" but "the id field's observed value at row 0 doesn't match the authoritative id of the top story."

The LLM gets this drift signal plus the tap source. It doesn't need to re-read the page, because the page's truth is already encoded in the V report. The patch space is constrained: whatever the LLM emits must pass V when applied.

This is the shape of the insight: drift detection + independent baseline + minimal-patch prompt replaces page re-inspection + source regeneration. The LLM's job shrinks from "re-derive the whole scraper" to "produce a tiny patch."
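A rough sketch of what a verifier in this style might look like, assuming the tap returns a list of row dicts with an `id` field and that the authoritative ids have already been fetched. The `verify` function, its parameters, and the sample ids are hypothetical, not Tap's actual V.

```python
from typing import Optional


def verify(rows: list[dict], authoritative_ids: list[str], min_rows: int = 20) -> Optional[str]:
    """Return a one-line drift report, or None when output matches the baseline."""
    if len(rows) < min_rows:
        return f"row count {len(rows)} below min_rows {min_rows}"
    for i, (row, expected) in enumerate(zip(rows, authoritative_ids)):
        observed = row.get("id")
        if str(observed) != str(expected):
            return f"id mismatch at row {i}: observed {observed}, authoritative {expected}"
    return None


# Made-up ids standing in for what an authoritative source would return.
authoritative = [str(1000 + i) for i in range(30)]
healthy_rows = [{"id": str(1000 + i)} for i in range(30)]
wrong_first_row = [{"id": "9999"}] + healthy_rows[1:]

empty_report = verify([], authoritative)
healthy_report = verify(healthy_rows, authoritative)
mismatch_report = verify(wrong_first_row, authoritative)
```

The same function can gate a candidate patch: apply the patch, rerun the tap, and accept the repair only if `verify` returns None.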

The linear-growth table

Per drift, Arm C is 8.5× cheaper than Arm A. That's a modest win. The real story is amortization.

Arm A pays every query. Arm C pays once per drift — and then the repaired scraper runs deterministically forever. Zero tokens per subsequent query. Over N queries per drift:

Queries / drift	Arm A cumulative	Arm C cumulative	Arm C advantage
1	9,625	1,134	8.5×
10	96,250	1,134	85×
100	962,500	1,134	849×
1,000	9,625,000	1,134	8,488×

For a scraper that runs hourly (24 queries/day) on a site that drifts monthly (~720 queries/drift), Arm C is roughly 6,000× cheaper than Arm A. The gap grows linearly with the invocation-to-drift ratio.
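The amortization arithmetic is simple enough to recompute. This sketch takes the two measured per-arm figures from the numbers table as constants and assumes, as the text does, that Arm C costs zero tokens per query once the scraper is repaired. The function names are illustrative.

```python
ARM_A_TOKENS_PER_QUERY = 9_625  # Arm A: paid on every query
ARM_C_TOKENS_PER_DRIFT = 1_134  # Arm C: paid once per drift event


def cumulative_tokens(queries_per_drift: int) -> tuple[int, int]:
    """Total tokens for (Arm A, Arm C) over one drift interval."""
    arm_a = ARM_A_TOKENS_PER_QUERY * queries_per_drift
    arm_c = ARM_C_TOKENS_PER_DRIFT  # flat: repaired scraper runs without the LLM
    return arm_a, arm_c


def advantage(queries_per_drift: int) -> float:
    """How many times cheaper Arm C is than Arm A at this query volume."""
    arm_a, arm_c = cumulative_tokens(queries_per_drift)
    return arm_a / arm_c


for n in (1, 10, 100, 1_000):
    a, c = cumulative_tokens(n)
    print(f"{n:>5}  {a:>10,}  {c:>6,}  {advantage(n):>9,.1f}x")
```

Swapping in your own queries-per-drift figure, or your own per-arm token counts, shows where the break-even sits for a given workload.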

This is why Claude-for-Chrome-style agents don't scale to production scraping workloads. The economics work for one-shot agent tasks where amortization doesn't apply. They collapse when the same data gets fetched a hundred times.

Caveats

We should be honest about what this shows and doesn't show.

One mutation, one tap, N=1. This is a first data point. The class-rename case is the simplest drift category. Harder mutations — schema changes, auth flow shifts, anti-bot additions — may close the A/C gap. More samples would give variance bounds. Expanding to 2 taps × 2 mutations × 3 arms × N=5 is on deck.

Haiku was enough for class-rename. Harder drifts probably need Sonnet or Opus for Arm C to produce a correct patch; token cost scales accordingly. The linear-amortization story still holds, but the per-drift constants shift.

Arm B measurement is a lower bound. Real tap fix includes a page inspection snapshot, which we omitted. Production Arm B probably sits at roughly 6–12K tokens: close to Arm A, not close to Arm C.

Arm C assumes V exists. We spent the prior experiment calibrating V: three pilot taps at FPR = 0, FNR = 0 across 390 samples. The first attempt had 100% FPR on GitHub (wrong API endpoint) and 75% FPR on HN (job postings show no author in the DOM but do in Firebase). Without a verifier substrate, Arm C's promise doesn't hold.
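For anyone calibrating their own verifier, this is a minimal sketch of how false-positive and false-negative rates fall out of labeled samples. The `error_rates` helper and the sample labels are made up for illustration; they are not the 390 calibration samples.

```python
def error_rates(samples: list[tuple[bool, bool]]) -> tuple[float, float]:
    """samples holds (verifier_flagged_drift, drift_actually_present) pairs.

    Returns (FPR, FNR). A rate with an empty denominator is reported as 0.0.
    """
    false_pos = sum(1 for flagged, actual in samples if flagged and not actual)
    false_neg = sum(1 for flagged, actual in samples if not flagged and actual)
    negatives = sum(1 for _, actual in samples if not actual)
    positives = sum(1 for _, actual in samples if actual)
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return fpr, fnr


# Made-up labels, for illustration only.
made_up = [
    (False, False),  # quiet on a healthy tap: true negative
    (True, False),   # alarm on a healthy tap: false positive
    (True, True),    # alarm on a drifted tap: true positive
    (True, True),    # alarm on a drifted tap: true positive
]
fpr, fnr = error_rates(made_up)
```

A verifier that alarms on every healthy sample, as in a wrong-endpoint case, scores an FPR of 1.0 under this definition.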

What's next

Phase 1 scale. 2 taps × 2 mutations × 3 arms × N=5 = 60 runs. Variance, per-mutation difficulty curve, cross-model comparison.
Arm B real measurement. Instrument the heal pipeline to emit token usage, run tap fix end-to-end against the broken tap, compare to the minimal simulation above.
Arm C productionization. The hypothetical router here is a prompt-engineering prototype. Real implementation needs V integration into the heal path, baseline fallback routes (the "multiple equivalent selectors per field" idea), and a repair cache so subsequent queries really are 0 tokens.

Reproduce it

The reproducibility kit is open under Apache-2.0 in the public tap-skills repo, under experiments/:

Verifier substrate: w0-verifier/ — 3 pilot taps calibrated at FPR=0, FNR=0 over 390 samples
Arm measurements: w1-recover/ — mutation injection, four arms, runner, judge

Arms A, B, C need an LLM endpoint (Anthropic API, Ollama local, or any OpenAI-compatible). Arm D runs without any network.

deno run -A experiments/w0-verifier/run.ts   # calibrates V
deno run -A experiments/w1-recover/runner.ts   # measures Arm D + attempts A

Install Tap

brew install LeonTing1010/tap/taprun
tap mcp stdio
tap hackernews/hot

This is a prototype experiment, not a production claim. Tap is a work-in-progress exploring the Compile-once-Run-forever-Diff-the-drift architecture. If you're interested in the space — AI agents, scraping economics, verifier design — the code is open, and I want feedback.