Last week, someone on r/webscraping asked a question that's been on my mind for years:
"I'm looking for patterns or best practices for building low-maintenance scrapers. Right now it feels like every time a website updates its layout or class names, the scraper dies and I have to patch selectors again. How do you turn scrapers into actual reliable infrastructure instead of something constantly on fire?"
— r/webscraping, 63 upvotes, 40 comments
I read every comment in that thread. Then I read 11 more high-signal threads on Hacker News covering the same topic — Browser MCP, Vibium, Skyvern, Smooth, BrowserBook, Webctl, Muscle-Mem, and a few others. About 350 comments total.
Six independent builders — each shipping their own project, none citing the others — arrived at the same answer.
Here's what they're saying. Then here's why they're right.
"Scraping is never 'done'. The big shift is treating scraping like infrastructure, not scripts. Config-driven extractors, fallback endpoints (API > HTML > browser), and retries by default. That's how you scale without everything being constantly on fire."
— HockeyMonkeey, r/webscraping
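The fallback ordering in that quote (API, then HTML, then browser, with retries by default) can be sketched as a small priority chain. This is a minimal illustration, not anyone's production code: the three `from_*` functions are hypothetical stubs standing in for real fetchers, and the retry counts are arbitrary assumptions.

```python
import time
from typing import Callable

Strategy = Callable[[], list]


def with_retries(fetch: Strategy, attempts: int = 3, base_delay: float = 0.0) -> list:
    """Call fetch up to `attempts` times, backing off exponentially between failures."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # a real extractor would catch narrower errors
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"all {attempts} attempts failed") from last_error


def extract(strategies: list) -> tuple:
    """Try (name, fetch) pairs in priority order; return the first non-empty result."""
    errors = []
    for name, fetch in strategies:
        try:
            rows = with_retries(fetch)
        except RuntimeError as exc:
            errors.append(f"{name}: {exc}")
            continue
        if rows:
            return name, rows
        errors.append(f"{name}: returned no rows")
    raise RuntimeError("every strategy failed: " + "; ".join(errors))


# Hypothetical stubs: the API is down, the HTML path still works.
def from_api():
    raise ConnectionError("API endpoint is down")


def from_html():
    return [{"title": "Example story"}]


def from_browser():
    return [{"title": "Example story (rendered)"}]


STRATEGIES = [("api", from_api), ("html", from_html), ("browser", from_browser)]
used, rows = extract(STRATEGIES)
print(used, rows)
```

Because the strategy list is plain data, changing the priority order or adding a new source is a config change rather than a code change, which is roughly what "config-driven" means here.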
"Not using selectors, it's the worst option. Use json-ld, schema markup, and others — then if everything fails, use selectors."
— yoperuy, r/webscraping
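To make "structured data first, selectors last" concrete, here is a small sketch that reads a schema.org `Product` from a JSON-LD block using only the standard library. The sample page and the `extract_product` helper are assumptions made up for illustration; real pages may nest items in `@graph` or use lists for `offers`, which this sketch does not handle.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed block: skip it and let the caller fall back


def extract_product(html: str):
    """Return name and price from JSON-LD, or None to signal 'fall back to selectors'."""
    parser = JsonLdParser()
    parser.feed(html)
    for item in parser.items:
        if isinstance(item, dict) and item.get("@type") == "Product":
            offers = item.get("offers")
            price = offers.get("price") if isinstance(offers, dict) else None
            return {"name": item.get("name"), "price": price}
    return None


# Hypothetical page: the class names are noise, the JSON-LD is the stable part.
SAMPLE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Widget",
 "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}}
</script>
</head><body><div class="x9f2 price-v3">$19.99</div></body></html>
"""

product = extract_product(SAMPLE)
print(product)
```

A redesign that renames `price-v3` leaves this extractor untouched, since it never looks at the class attribute.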
"I rely heavily on ARIA roles/semantics (e.g. role=button name='Save') rather than injected IDs or CSS selectors. I find this makes the automation much more robust to UI changes."
— cosinusalpha, HN (author of Webctl)
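The idea of addressing elements by role and name can be sketched without a browser. The example below is a deliberately simplified stand-in: it treats `aria-label` or the element's text as the accessible name and knows only one implicit role, whereas the real accessible-name computation is far more involved. In practice a browser automation library's role locator (for example Playwright's `get_by_role`) would do this work; `RoleIndex` and `find_by_role` are hypothetical names.

```python
from html.parser import HTMLParser

# Simplified: the real implicit-role mapping covers many more elements.
IMPLICIT_ROLES = {"button": "button"}
VOID_TAGS = {"img", "input", "br", "hr", "meta", "link"}


class RoleIndex(HTMLParser):
    """Index elements by (role, name), ignoring class and id entirely."""

    def __init__(self):
        super().__init__()
        self._stack = []  # every open element, with or without a role
        self.found = []   # (role, name, attrs) for elements that have a role

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        role = attrs.get("role") or IMPLICIT_ROLES.get(tag)
        self._stack.append({"tag": tag, "role": role, "attrs": attrs, "text": []})

    def handle_data(self, data):
        for element in self._stack:  # text counts toward every open ancestor
            element["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack or self._stack[-1]["tag"] != tag:
            return  # sketch only: assumes well-formed nesting
        element = self._stack.pop()
        if element["role"]:
            text = " ".join("".join(element["text"]).split())
            name = element["attrs"].get("aria-label") or text
            self.found.append((element["role"], name, element["attrs"]))


def find_by_role(html: str, role: str, name: str) -> list:
    index = RoleIndex()
    index.feed(html)
    index.close()
    return [attrs for r, n, attrs in index.found if r == role and n == name]


# The same control before and after a hypothetical redesign.
BEFORE = '<div><button id="btn-4821" class="css-1x2y3z">Save</button></div>'
AFTER = '<section><div role="button" class="sc-aXbCd" aria-label="Save"><span>icon</span></div></section>'

print(len(find_by_role(BEFORE, "button", "Save")), len(find_by_role(AFTER, "button", "Save")))
```

Both versions resolve to one match even though the tag, id, and class all changed, which is the robustness the quote is describing.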
"In my experience, most modern websites load their data as JSON first and then render it on the page. Sometimes via a direct API call, other times the JSON is embedded in a <script> tag. Once you figure that out, you can parse the data directly. This method is far more stable — frequent changes in JSON structure are not common."
— Born-Professor9062, r/webscraping
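When the JSON is embedded in a `<script>` tag as an assignment, one way to pull it out is to locate the variable name and let the JSON decoder consume exactly one value from that point. The sketch below assumes a hypothetical marker, `window.__INITIAL_STATE__`, and a made-up page; real sites use different names and sometimes non-JSON JavaScript literals, which this approach will not parse.

```python
import json
import re

OPENING_BRACKET = re.compile(r"[{\[]")


def extract_embedded_json(html: str, marker: str):
    """Return the JSON value that follows `marker` in the page source, or None."""
    start = html.find(marker)
    if start == -1:
        return None
    # Skip ahead to the first '{' or '[' after the marker.
    match = OPENING_BRACKET.search(html, start + len(marker))
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever trails it.
        value, _end = json.JSONDecoder().raw_decode(html, match.start())
    except json.JSONDecodeError:
        return None
    return value


# Hypothetical page: the visible markup is a placeholder, the data is in the script.
PAGE = """
<div class="a1b2c3">Loading...</div>
<script>
  window.__INITIAL_STATE__ = {"stories": [{"title": "Show HN: Example", "score": 342}]};
  window.otherStuff = 1;
</script>
"""

state = extract_embedded_json(PAGE, "window.__INITIAL_STATE__")
titles = [story["title"] for story in state["stories"]]
print(titles)
```

Using `raw_decode` avoids writing a regex that tries to match balanced braces, which is the part of this job that usually goes wrong.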
"MCP is useful as a trigger layer — if you already have working scrapers and want your agent to kick them off and get structured data back. But as the execution engine itself? Nope. The actual scraping challenges (rate limits, anti-bot, retries) all live outside MCP."
— ScrapeerCom, r/webscraping
"I want to use LLM for what they're good for (edge cases, fuzzy instructions, data) and have it turn around to write reusable tools — so that the next time, it doesn't have to run the full LLM. It can use a cached program."
— joshstrange, HN (Muscle-Mem thread)
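The "pay for the model once, then reuse the result" idea can be sketched as a cache keyed by the task description. Everything here is an assumption for illustration: `generate_plan` is a stub standing in for an LLM call, the "plan" is just a list of fields, and this is not how Muscle-Mem or any other named project actually stores its programs.

```python
import hashlib
import json
import tempfile
from pathlib import Path

llm_calls = 0  # counts how often the expensive path is taken


def generate_plan(task: str) -> dict:
    """Stand-in for an LLM call that turns a fuzzy task into a deterministic plan."""
    global llm_calls
    llm_calls += 1
    return {"task": task, "fields": ["title", "score"]}


def get_plan(task: str, cache_dir: Path) -> dict:
    """Return the cached plan for `task`, generating and saving it on first use."""
    key = hashlib.sha256(task.encode("utf-8")).hexdigest()[:16]
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    plan = generate_plan(task)
    path.write_text(json.dumps(plan), encoding="utf-8")
    return plan


def run_plan(plan: dict, records: list) -> list:
    """Deterministic runtime: apply the plan with no model in the loop."""
    return [{field: record.get(field) for field in plan["fields"]} for record in records]


cache = Path(tempfile.mkdtemp())
records = [{"title": "Example", "score": 10, "noise": "x"}]

first = run_plan(get_plan("top stories", cache), records)   # generates the plan
second = run_plan(get_plan("top stories", cache), records)  # served from cache
print(llm_calls, first == second)
```

The second run never touches `generate_plan`, so its marginal model cost is zero; the model is only needed again when the cached plan stops working.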
And one more, because it's the cleanest summary of the whole debate:
"Right now most browser tools extend what's possible at the act layer. They make it easier for an LLM to click, type, and observe. That's useful, but it mostly enables one-off demos. Every run starts from scratch."
— hugs, HN (author of Vibium and Selenium)
These voices, plus another dozen I won't quote, converge on five principles.
Notice none of these are about writing better selectors. They're all about what surrounds the selector — observability, validation, recovery, semantic addressing, deterministic execution.
Each principle solves a different failure mode. Old fixes fight the environment. New fixes acknowledge the environment is hostile and build infrastructure that absorbs the hostility.
| Failure mode | Old fix | New fix |
|---|---|---|
| Site redesign breaks selectors | Hire a dev to patch it | Switch to JSON endpoint or ARIA |
| Scraper returns empty data | Hope someone notices | Health contract fails loudly |
| AI agent burns $3,600/month | "Use a smaller model" | Compile to a program once, run at $0 |
| Agent loops and retries | Add retry limits | Don't use an agent for deterministic work |
| Site adds bot detection | Rotate residential proxies | Use real browser via extension |
Look at the right column. That's what one of the r/webscraping commenters called "infrastructure." It isn't a single feature. It's a whole layer that sits between your data needs and the chaos of the web.
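As one concrete piece of that layer, a "health contract that fails loudly" can be as small as a function that raises when the rows look wrong. The sketch below borrows the `min_rows` and `non_empty` names from the `tap forge` output shown in this post, but the function itself is a hypothetical illustration, not Tap's implementation.

```python
class HealthCheckError(Exception):
    """Raised when extracted rows violate the contract."""


def check_health(rows: list, min_rows: int, non_empty: list) -> None:
    """Raise instead of returning silently when the data looks wrong."""
    if len(rows) < min_rows:
        raise HealthCheckError(f"min_rows: expected >= {min_rows}, got {len(rows)}")
    for index, row in enumerate(rows):
        for field in non_empty:
            if row.get(field) in (None, "", [], {}):
                raise HealthCheckError(f"non_empty: row {index} has empty {field!r}")


# A healthy run passes quietly.
good = [{"title": f"Story {i}"} for i in range(30)]
check_health(good, min_rows=5, non_empty=["title"])

# An empty result, the classic silent failure, becomes an explicit error.
try:
    check_health([], min_rows=5, non_empty=["title"])
    outcome = "ok"
except HealthCheckError as exc:
    outcome = str(exc)
print(outcome)
```

The point of raising rather than logging is that an empty result set stops the pipeline instead of flowing downstream as if it were valid data.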
Here's the same pattern, applied. This is exactly the loop the six voices above were describing — just packaged into commands you can run today.
    # 1. Write the program once (AI does this)
    $ tap forge "scrape Hacker News top stories"
    ✔ Saved: hackernews/hot.tap.js
      Strategy: API (api.hnpwa.com)
      Health: min_rows: 5, non_empty: ["title"]

    # 2. Run forever at zero AI cost
    $ tap hackernews hot
    30 rows (245ms)
    Cost: $0.00

    # 3. Health checks catch silent failure
    $ tap doctor
    hackernews/hot   ✔ ok    30 rows (245ms)
    google/trends    ✘ fail  0 rows
      min_rows: expected ≥5, got 0
    github/trending  ✔ ok    25 rows (1.2s)

    # 4. Watch for legitimate changes (not silent ones)
    $ tap watch hackernews hot --every 10m
    2026-04-07T10:00 +added "Show HN: Tap" score=342
    2026-04-07T10:10 +added "Rust 2.0 announced" score=128

    # 5. Auto-heal when something breaks
    $ tap doctor --auto
    google/trends → re-forging strategy → ✔ healed (now 18 rows)
This is not a hypothesis. It's what happens when you take the six voices above and build the missing infrastructure underneath them.
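The "watch for legitimate changes" step boils down to diffing two snapshots by a stable key. Here is a minimal sketch of that idea, reusing the two sample stories from the transcript above; `diff_snapshots` is a hypothetical helper and says nothing about how `tap watch` is actually implemented.

```python
def diff_snapshots(previous: list, current: list, key: str = "title") -> dict:
    """Compare two runs by a stable key and report rows that appeared or vanished."""
    prev_by_key = {row[key]: row for row in previous}
    curr_by_key = {row[key]: row for row in current}
    return {
        "added": [curr_by_key[k] for k in curr_by_key if k not in prev_by_key],
        "removed": [prev_by_key[k] for k in prev_by_key if k not in curr_by_key],
    }


at_10_00 = [{"title": "Show HN: Tap", "score": 342}]
at_10_10 = [
    {"title": "Show HN: Tap", "score": 342},
    {"title": "Rust 2.0 announced", "score": 128},
]

changes = diff_snapshots(at_10_00, at_10_10)
for row in changes["added"]:
    print(f'+added "{row["title"]}" score={row["score"]}')
```

A diff like this separates real changes in the data from the silent kind of change, where the scraper breaks and simply returns nothing.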
The interesting thing about this convergence isn't that any single insight is new. None of them is.
What's new is that all five principles are landing in the same place at the same time, driven by the same forcing function: AI agents are too expensive and too unreliable to use as the runtime for deterministic interface work. Token costs exposed the truth that experienced scrapers already knew — the answer was never a smarter agent. It was always a thinner runtime under a smarter compiler.
The market is no longer asking "should we use agents or scripts?" The market is asking "how do we make scripts that don't need a human babysitter?"
That's the question Tap was built to answer. The answer was hiding in plain sight in the comments.
    # Install (macOS / Linux)
    curl -fsSL https://taprun.dev/install.sh | sh

    # Pull 200+ community taps
    tap update

    # Run one
    tap hackernews hot

    # Check health of all of them
    tap doctor

    # Watch one for changes
    tap watch hackernews hot --every 10m
Getting started guide · GitHub · 200+ community taps included