How Do Companies Keep Scrapers Reliable? I Asked the Internet.

April 7, 2026 · Leon Ting · 7 min read

Last week, someone on r/webscraping asked a question that's been on my mind for years:

"I'm looking for patterns or best practices for building low-maintenance scrapers. Right now it feels like every time a website updates its layout or class names, the scraper dies and I have to patch selectors again. How do you turn scrapers into actual reliable infrastructure instead of something constantly on fire?"
— r/webscraping, 63 upvotes, 40 comments

I read every comment in that thread. Then I read 11 more high-signal threads on Hacker News covering the same topic — Browser MCP, Vibium, Skyvern, Smooth, BrowserBook, Webctl, Muscle-Mem, and a few others. About 350 comments total.

Six independent builders — each shipping their own project, none citing the others — arrived at the same answer.

Here's what they're saying. Then here's why they're right.

Six Voices, One Answer

1. "Infrastructure, not scripts"

"Scraping is never 'done'. The big shift is treating scraping like infrastructure, not scripts. Config-driven extractors, fallback endpoints (API > HTML > browser), and retries by default. That's how you scale without everything being constantly on fire."
— HockeyMonkeey, r/webscraping
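
Here is a minimal sketch of that fallback chain in Python, assuming each source (API, HTML, browser) is wrapped in a zero-argument function that returns a list of rows. The names `with_retries` and `run_with_fallbacks` are illustrative, not from the thread or from any library.

```python
import time


def with_retries(fetch, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch() up to `attempts` times, backing off exponentially."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller fall back
            sleep(base_delay * 2 ** attempt)


def run_with_fallbacks(strategies, attempts=3, sleep=time.sleep):
    """Try (name, fetch) pairs in priority order, e.g. API > HTML > browser.

    Returns (name, rows) for the first strategy that yields any rows.
    """
    errors = []
    for name, fetch in strategies:
        try:
            rows = with_retries(fetch, attempts=attempts, sleep=sleep)
        except Exception as exc:
            errors.append(f"{name}: {exc!r}")
            continue
        if rows:
            return name, rows
        errors.append(f"{name}: returned no rows")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))
```

The order of the list is the config: moving a site from "browser" to "API" is a one-line change rather than a rewrite.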

2. "JSON-LD and schema markup before selectors"

"Not using selectors, it's the worst option. Use json-ld, schema markup, and others — then if everything fails, use selectors."
— yoperuy, r/webscraping
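
A rough sketch of the "structured data first" step, using only the Python standard library. It collects every `<script type="application/ld+json">` block on a page; the function name `extract_json_ld` is an assumption for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the parsed contents of application/ld+json script tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buf))
            except json.JSONDecodeError:
                return  # skip malformed blocks instead of failing the page
            self.items.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_json_ld(html):
    """Return a list of JSON-LD objects found in the HTML string."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.items
```

If this returns an empty list, that is the signal to move down the chain to selectors.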

3. "ARIA roles, not CSS"

"I rely heavily on ARIA roles/semantics (e.g. role=button name='Save') rather than injected IDs or CSS selectors. I find this makes the automation much more robust to UI changes."
— cosinusalpha, HN (author of Webctl)
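
To show why role-plus-name addressing survives a redesign, here is a deliberately simplified sketch on static HTML. It is not how Webctl works and it skips the full accessible-name rules: it assumes well-formed markup, knows only a few implicit roles, and takes the name from `aria-label` or the element's text. `find_by_role` is a hypothetical helper.

```python
from html.parser import HTMLParser

# A tiny, incomplete subset of implicit roles, for illustration only.
IMPLICIT_ROLES = {"button": "button", "nav": "navigation", "main": "main"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class RoleFinder(HTMLParser):
    def __init__(self, role, name):
        super().__init__()
        self.role = role
        self.name = name
        self.matches = []  # attribute dicts of matching elements
        self._depth = 0
        self._open = []    # (depth, attrs, text_parts) for open candidates

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get an end tag
        self._depth += 1
        attr_map = dict(attrs)
        role = attr_map.get("role") or IMPLICIT_ROLES.get(tag)
        if role == self.role:
            self._open.append((self._depth, attr_map, []))

    def handle_data(self, data):
        for _, _, parts in self._open:
            parts.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._open and self._open[-1][0] == self._depth:
            _, attr_map, parts = self._open.pop()
            text = " ".join("".join(parts).split())
            if (attr_map.get("aria-label") or text) == self.name:
                self.matches.append(attr_map)
        self._depth -= 1


def find_by_role(html, role, name):
    finder = RoleFinder(role, name)
    finder.feed(html)
    finder.close()
    return finder.matches
```

The same call, `find_by_role(html, "button", "Save")`, matches both `<button class="css-1x2y3z">Save</button>` and a later redesign that ships `<div role="button" aria-label="Save">`, even though every class name and tag changed. In a live browser you would use your automation tool's role-based locator instead of parsing HTML by hand.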

4. "Modern sites load JSON first. Just hit the JSON."

"In my experience, most modern websites load their data as JSON first and then render it on the page. Sometimes via a direct API call, other times the JSON is embedded in a <script> tag. Once you figure that out, you can parse the data directly. This method is far more stable — frequent changes in JSON structure are not common."
— Born-Professor9062, r/webscraping
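
For the embedded-JSON case, a short sketch that pulls the payload out of a `<script>` tag by its `id`. The `__NEXT_DATA__` id in the usage note is just one common example (Next.js pages); the function name and the page structure are assumptions, and a regex like this is only reasonable because the target is a single, well-delimited tag.

```python
import json
import re


def extract_embedded_json(html, script_id):
    """Return the parsed JSON inside <script id=script_id>, or None."""
    pattern = re.compile(
        r"<script\b[^>]*\bid=[\"']" + re.escape(script_id) + r"[\"'][^>]*>(.*?)</script>",
        re.DOTALL | re.IGNORECASE,
    )
    match = pattern.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # tag exists but does not hold valid JSON
```

Usage looks like `extract_embedded_json(page_html, "__NEXT_DATA__")`, after which you read fields by key rather than by CSS class.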

5. "MCP for triggers. Programs for execution."

"MCP is useful as a trigger layer — if you already have working scrapers and want your agent to kick them off and get structured data back. But as the execution engine itself? Nope. The actual scraping challenges (rate limits, anti-bot, retries) all live outside MCP."
— ScrapeerCom, r/webscraping

6. "AI writes the program. The program runs forever."

"I want to use LLM for what they're good for (edge cases, fuzzy instructions, data) and have it turn around to write reusable tools — so that the next time, it doesn't have to run the full LLM. It can use a cached program."
— joshstrange, HN (Muscle-Mem thread)
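
The "cached program" idea can be sketched in a few lines. This is not Muscle-Mem's API: `get_program`, `run_program`, and the convention that generated code defines an `extract(page)` function are all assumptions for illustration. `author` stands in for whatever model call writes the code.

```python
import hashlib


def get_program(task, author, store):
    """Return source code for `task`, calling `author` only on a cache miss.

    `store` is any dict-like object (in memory here; on disk in practice).
    """
    key = hashlib.sha256(task.encode("utf-8")).hexdigest()
    if key not in store:
        store[key] = author(task)  # the only place a model would be called
    return store[key]


def run_program(source, page):
    """Execute cached source and call its extract() function on `page`.

    exec() runs arbitrary code: only use it on programs you have reviewed
    or inside a sandbox.
    """
    namespace = {}
    exec(source, namespace)
    return namespace["extract"](page)
```

The second request for the same task never reaches `author`, which is the whole point: the model is paid once, at authoring time.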

And one more, because it's the cleanest summary of the whole debate:

"Right now most browser tools extend what's possible at the act layer. They make it easier for an LLM to click, type, and observe. That's useful, but it mostly enables one-off demos. Every run starts from scratch."
— hugs, HN (author of Vibium and Selenium)

The Pattern

These voices, plus another dozen I won't quote, converge on five principles:

API > DOM, whenever an API exists. Many modern sites load JSON first and render second. Hit the JSON. JSON shapes change rarely; generated CSS class names can change with any deploy.
Semantic > syntactic. When you must touch the DOM, use ARIA roles, schema.org markup, data-testids — anything that doesn't churn with the next CSS rebuild.
Programs > prompts. AI writes the extraction logic once. The program then runs with no model calls, so the AI cost per run is zero. AI is for authoring, not for runtime.
Health contracts > silent failure. Every scraper declares what "healthy" looks like. The system catches breakage before your data goes bad.
Detection + recovery > perfect selectors. You can't write a selector that never breaks. You can write a system that knows when something broke and re-heals it.

Notice none of these are about writing better selectors. They're all about what surrounds the selector — observability, validation, recovery, semantic addressing, deterministic execution.
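
The last two principles, health contracts and detect-then-recover, fit in a small sketch. The field names `min_rows` and `non_empty` mirror the sample output later in this post, but their exact semantics here are my assumption, and `HealthContract` and `run_checked` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class HealthContract:
    min_rows: int = 1
    non_empty: list = field(default_factory=list)

    def check(self, rows):
        """Return a list of problems; an empty list means healthy."""
        problems = []
        if len(rows) < self.min_rows:
            problems.append(
                f"min_rows: expected >= {self.min_rows}, got {len(rows)}"
            )
        for name in self.non_empty:
            missing = sum(1 for row in rows if row.get(name) in (None, ""))
            if missing:
                problems.append(f"non_empty: {missing} row(s) missing '{name}'")
        return problems


def run_checked(fetch, contract, heal=None):
    """Run fetch(), validate it, and try heal() once if validation fails."""
    rows = fetch()
    problems = contract.check(rows)
    if problems and heal is not None:
        rows = heal()  # e.g. switch strategy or regenerate the extractor
        problems = contract.check(rows)
    if problems:
        raise RuntimeError("unhealthy: " + "; ".join(problems))
    return rows
```

An empty result now raises instead of quietly writing zero rows downstream, and the recovery path is re-validated against the same contract before its data is trusted.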

Why This Pattern Wins

Each principle solves a different failure mode. Old fixes fight the environment. New fixes acknowledge the environment is hostile and build infrastructure that absorbs the hostility.

Failure mode	Old fix	New fix
Site redesign breaks selectors	Hire a dev to patch it	Switch to JSON endpoint or ARIA
Scraper returns empty data	Hope someone notices	Health contract fails loudly
AI agent burns $3,600/month	"Use a smaller model"	Compile to a program once, run at $0
Agent loops and retries	Add retry limits	Don't use an agent for deterministic work
Site adds bot detection	Rotate residential proxies	Use a real browser via an extension

Look at the right column. That's what one of the r/webscraping commenters called "infrastructure." It isn't a single feature. It's a whole layer that sits between your data needs and the chaos of the web.

What That Infrastructure Looks Like

Here's the same pattern, applied. This is exactly the loop the six voices above were describing — just packaged into commands you can run today.

# 1. Write the program once (AI does this)
$ tap forge "scrape Hacker News top stories"
✔ Saved: hackernews/hot.tap.js
  Strategy: API (api.hnpwa.com)
  Health: min_rows: 5, non_empty: ["title"]

# 2. Run forever at zero AI cost
$ tap hackernews hot
30 rows (245ms)  Cost: $0.00

# 3. Health checks catch silent failure
$ tap doctor
hackernews/hot    ✔ ok     30 rows  (245ms)
google/trends     ✘ fail   0 rows   min_rows: expected ≥5, got 0
github/trending   ✔ ok     25 rows  (1.2s)

# 4. Watch for legitimate changes (not silent ones)
$ tap watch hackernews hot --every 10m
2026-04-07T10:00  +added   "Show HN: Tap"  score=342
2026-04-07T10:10  +added   "Rust 2.0 announced"  score=128

# 5. Auto-heal when something breaks
$ tap doctor --auto
google/trends → re-forging strategy → ✔ healed (now 18 rows)

This is not a hypothesis. It's what happens when you take the six voices above and build the missing infrastructure underneath them.
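
Step 4, watching for legitimate changes, comes down to diffing two healthy snapshots. A minimal sketch, assuming each row is a dict with a stable key such as `title`; `diff_rows` is an illustrative name and says nothing about how `tap watch` is implemented.

```python
def diff_rows(previous, current, key="title"):
    """Return (added, removed) rows between two snapshots, matched on `key`."""
    prev_keys = {row[key] for row in previous}
    curr_keys = {row[key] for row in current}
    added = [row for row in current if row[key] not in prev_keys]
    removed = [row for row in previous if row[key] not in curr_keys]
    return added, removed
```

Only run the diff on snapshots that passed their health check; otherwise a broken scrape shows up as "everything was removed."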

What's New, What's Old

The interesting thing about this convergence isn't that any single insight is new. It isn't:

Cached programs: Voyager (2023) stored reusable agent skills as JavaScript programs in a skill library.
A11y trees: Stagehand has been using them since 2024.
API-first scraping: every veteran knows this rule.
Health checks: test automation has had assertions forever.
Semantic addressing: WAI-ARIA has been a W3C Recommendation since 2014.

What's new is that all five principles are landing in the same place at the same time, driven by the same forcing function: AI agents are too expensive and too unreliable to use as the runtime for deterministic interface work. Token costs exposed the truth that experienced scrapers already knew — the answer was never a smarter agent. It was always a thinner runtime under a smarter compiler.

The market is no longer asking "should we use agents or scripts?" The market is asking "how do we make scripts that don't need a human babysitter?"

That's the question Tap was built to answer. The answer was hiding in plain sight in the comments.

Programs Beat Prompts — why AI should write code, not run it
Your Scraper Is Broken Right Now — the silent failure problem in detail
Websites Change. Your Automation Shouldn't Stop. — the self-healing loop
Your AI Browser Agent Costs $3,600/month — the cost math

Try the pattern

# Install (macOS / Linux)
curl -fsSL https://taprun.dev/install.sh | sh

# Pull 200+ community taps
tap update

# Run one
tap hackernews hot

# Check health of all of them
tap doctor

# Watch one for changes
tap watch hackernews hot --every 10m
