Open r/webscraping on any given day and you'll find the same five questions in different costumes:
Five different surfaces, one shared architectural cause. And — for the per-user use cases at least — one boring exit that almost nobody in those threads mentions.
The standard scraping stack — Playwright/Puppeteer with a stealth plugin, rotating user agents, datacenter or residential proxies, a CAPTCHA-solving API, sometimes an SMS-forwarder for OTP — is a stack of compensations. Each layer exists because the layer above it is being detected as not-a-real-browser.
That's a losing arms race over time. Incapsula, Cloudflare, DataDome, PerimeterX all share the same incentive structure: discriminate "real human in a real browser" from "scripted automation," and they get richer signals every quarter. Those signals include mouse-jitter analysis, TLS fingerprinting (JA3/JA4), behavioral consistency across navigations, canvas/WebGL/AudioContext entropy, and residential-IP reputation services. Whatever stealth plugin you were running last month is in their training data this month.
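To make the TLS fingerprinting point concrete, here is a minimal sketch of how a JA3 fingerprint is derived: five fields from the ClientHello are joined into a string and hashed with MD5, with GREASE placeholder values dropped. The two client profiles below are made-up illustrative values, not captures from any real browser or HTTP library.

```python
import hashlib

# GREASE values (RFC 8701) are random placeholders; JA3 ignores them.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string: five comma-separated fields, values joined by '-'."""
    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    return ",".join(
        [str(version), join(ciphers), join(extensions), join(curves), join(point_formats)]
    )


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """JA3 fingerprint: MD5 hex digest of the JA3 string."""
    s = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(s.encode("ascii")).hexdigest()


# Hypothetical profiles: same cipher suites, different order and extensions.
browser_like = (771, [2570, 4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
script_like = (771, [4866, 4865, 4867], [0, 10, 11], [29, 23, 24], [0])

print(ja3_string(*browser_like))
print(ja3_hash(*browser_like) == ja3_hash(*script_like))
```

The order of cipher suites and extensions is part of the fingerprint, so a client that offers the same algorithms in a different order still hashes differently. A spoofed user agent string does nothing here, because the TLS handshake completes before any HTTP header is sent.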
The Reddit threads above are the visible top of the iceberg: every few weeks someone hits the next layer and posts. The OP of post #3 — at 50K req/day across 200 retailer domains — articulated it explicitly: "sites that worked fine 3 months ago are now fortress-level protected."
If detection is the problem, the cheapest fix is to not get detected — by actually being the thing detection systems look for: a real human browser with a real session, real cookies, real prior navigation history, real TLS fingerprint.
Concretely:
This is what the muted "reuse the cookie" reply in post #4 was gesturing at, and nobody elaborated. The technique is correct; the missing piece was "how do you actually do that without it falling apart every two days when cookies refresh or SSO redirects."
Tap (MIT, GitHub) is a Chrome extension plus CLI plus MCP server that does exactly the above. The interesting design choices:
No storage_state.json file to manage. Your cookies never leave the machine.
.plan.json is an array of typed ops: nav, fetch, eval, wait, etc. Compile-time AI helps you write them; runtime replay is deterministic and uses zero LLM tokens.
credentials: "page-session" means the fetch is dispatched from within the authenticated tab's context. Your live cookies attach automatically. Same-origin policy applies as if you typed the URL into the address bar yourself.
Three scenarios from the threads above, sketched as plan steps:
tap capture <url>: Tap forges a plan against the page you can now see.
tap run <name> on schedule. The session cookie carries auth; OTP doesn't re-trigger.
If the site eventually expires the session, you redo step 1. That's once a week or once a month depending on the site, not once per scrape.
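As a rough illustration of the flow above, the sketch below models a plan as an array of typed ops, validates it, and replays it with a check for an expired session. Only the op names (nav, fetch, eval, wait) and the "page-session" value come from this post; the field names (url, expr, ms), the example.com URLs, the expiry heuristic, and the handler interface are all assumptions made for the example and are not Tap's actual schema or runtime. Note that "page-session" is not one of the standard Fetch credentials modes.

```python
import json

# Op names come from the post; the per-op field names are assumptions.
REQUIRED_FIELDS = {
    "nav": {"url"},
    "fetch": {"url", "credentials"},
    "eval": {"expr"},
    "wait": {"ms"},
}


def validate_plan(plan):
    """Return a list of problems; an empty list means the plan is well-formed."""
    problems = []
    for i, step in enumerate(plan):
        op = step.get("op")
        if op not in REQUIRED_FIELDS:
            problems.append(f"step {i}: unknown op {op!r}")
            continue
        missing = REQUIRED_FIELDS[op] - set(step)
        if missing:
            problems.append(f"step {i}: {op} is missing {sorted(missing)}")
    return problems


def session_expired(status, final_url, login_markers=("/login", "/signin")):
    """Heuristic (assumed): 401/403 or a bounce to a login page means log in again."""
    return status in (401, 403) or any(m in final_url for m in login_markers)


def replay(plan, handlers):
    """Run each step through an injected handler; stop early on an expired session."""
    problems = validate_plan(plan)
    if problems:
        raise ValueError("; ".join(problems))
    results = []
    for step in plan:
        status, final_url, body = handlers[step["op"]](step)
        if session_expired(status, final_url):
            return {"ok": False, "reason": "session expired, redo the manual login",
                    "results": results}
        results.append(body)
    return {"ok": True, "results": results}


PLAN_JSON = """
[
  {"op": "nav",   "url": "https://example.com/orders"},
  {"op": "wait",  "ms": 500},
  {"op": "fetch", "url": "https://example.com/api/orders", "credentials": "page-session"},
  {"op": "eval",  "expr": "document.title"}
]
"""


def fake_handler(step):
    # Stand-in for the real browser: pretend every step succeeds.
    return 200, step.get("url", "https://example.com/orders"), f"ran {step['op']}"


plan = json.loads(PLAN_JSON)
handlers = {op: fake_handler for op in REQUIRED_FIELDS}
print(replay(plan, handlers))
```

Because the replay loop only walks a fixed list and dispatches on the op name, the same plan produces the same sequence of actions every run. The one branch that depends on the outside world is the expiry check, which is the signal to go back and log in by hand.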
Same as above. You solve the CAPTCHA once with your eyes. The challenge issues a session token that's good for as long as the site lets it. The automation never sees a CAPTCHA because it never logs in — it just inherits the logged-in tab's state.
This sidesteps the entire "$1 per 1,000 CAPTCHAs + 80% accuracy" subdebate, and the "train a CNN on 200 hand-annotated samples" subdebate, by changing the problem. You're not solving CAPTCHAs anymore; you're not encountering them.
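For a sense of what that subdebate is actually arguing over, here is the arithmetic on the quoted figures. It assumes the solver bills per attempt, that a failed attempt is simply retried, and that attempts succeed independently. None of those assumptions come from the threads.

```python
def cost_per_solved_captcha(price_per_1000_attempts, accuracy):
    """Expected spend per successful solve when failed attempts are retried."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    expected_attempts = 1 / accuracy  # mean of a geometric distribution
    return price_per_1000_attempts / 1000 * expected_attempts


per_solve = cost_per_solved_captcha(1.00, 0.80)
print(f"${per_solve * 1000:.2f} per 1,000 solved CAPTCHAs")
```

At 80% accuracy each solve takes 1.25 attempts on average, so the effective price is a quarter higher than the sticker price. With an inherited session that line item does not exist at all.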
Honest answer first: if your use case is 50K requests/day across 200 retailer domains, Tap is not your tool. That's a cloud-scale, multi-tenant, residential-proxy job — go pay Bright Data or ScraperAPI, or buy the data from an enterprise provider. Tap is built around one user × N sites, not N users × one site.
For per-user use cases — your own dashboards, your own SaaS accounts, your own paywalled subscriptions, niche sites you watch personally — Cloudflare and Incapsula don't fire. They're tuned to catch traffic that looks unlike normal human behavior. Your authenticated browser, navigating like you'd navigate, doesn't trip the heuristics. The infrastructure that costs SaaS scrapers thousands a month is invisible to single-seat use.
Local-first browser automation is the right tool when:
It's not the right tool for:
If you're in those categories, the existing tooling (Playwright + residential proxies + a CAPTCHA solver + sometimes an OTP relay) is genuinely the right shape, despite the arms-race cost. Don't pretend otherwise.
"Credentials in the cloud" is not a neutral architectural choice. It's the choice that makes most modern anti-scraping detection possible — because the cloud is where suspicious traffic comes from, by default. The moment your credentials live on the user's machine and your traffic comes from the user's actual residential IP through the user's actual TLS stack, you're indistinguishable from the user, because you are the user.
This isn't a fancy bypass. It's just the boring observation that the cheapest way to look human is to be operated by a human, even if only briefly, once per session.
r/webscraping has been having the same five conversations for years, each cycle getting harder as detection improves. The architectural exit ramp has been there the whole time. It's not for everyone — but for the per-user category, it's the cheapest, most durable, and least adversarial answer available.
Tell your agent a browser task on any site that needs your login — it runs in your real, already-logged-in Chrome and compiles it once into a deterministic, auditable .plan.json program: a versioned, reviewable record of exactly what it did. Every replay after is local, zero tokens, same result every time. Cookies and sessions never leave your machine — by architecture, not policy. Cloud browser SDKs can't match this; they need your session in their database to function. tap verify catches substrate drift before your data goes stale. Works with Claude Code, Cursor, Cline, Windsurf, and any MCP host. 70+ community taps.
curl -fsSL https://taprun.dev/install.sh | sh