Scraping behind login walls: stop fighting OTP, CAPTCHA, and Cloudflare Turnstile

May 16, 2026 · Leon Ting · 7 min read

Why the cheapest path through auth-gated scraping is not a better proxy

Open r/webscraping on any given day and you'll find the same five questions in different costumes:

1. "I built a phone-to-backend SMS forwarder so the scraper can complete OTP login" (50 upvotes, accidentally useful tool)
2. "Handling CAPTCHA in Playwright (Python)" (45 upvotes, 19 comments arguing between $1/1k CAPTCHA APIs and rolling your own CNN)
3. "Scraping blocked by Incapsula at 50K req/day", from a price-monitoring SaaS author who tried "rotating user agents, adding delays, the whole usual playbook" and concluded "the whole 'just use puppeteer with stealth' advice is not cutting it anymore"
4. "Any method to bypass OTP verification?" (top reply: "Reuse the cookie"; follow-up reply: "How?", never answered)
5. "OAuth2 PKCE + Cloudflare Turnstile, Invalid request after the challenge" (Keycloak debugging buried in an enterprise SSO flow)

Five different surfaces, one shared architectural cause. And — for the per-user use cases at least — one boring exit that almost nobody in those threads mentions.

Why the conventional playbook gets harder every quarter

The standard scraping stack — Playwright/Puppeteer with a stealth plugin, rotating user agents, datacenter or residential proxies, a CAPTCHA-solving API, sometimes an SMS-forwarder for OTP — is a stack of compensations. Each layer exists because the layer above it is being detected as not-a-real-browser.

That's a losing arms race over time. Incapsula, Cloudflare, DataDome, and PerimeterX all share the same incentive: distinguish "real human in a real browser" from "scripted automation." They get richer signals every quarter: mouse-jitter analysis, TLS fingerprinting (JA3/JA4), behavioral consistency across navigations, canvas/WebGL/AudioContext entropy, residential-IP reputation services. Whatever stealth plugin you were running last month is in their training data this month.
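To see why rotating user agents does nothing against TLS fingerprinting, here is a minimal sketch of how a JA3 hash is derived from a ClientHello. The handshake values are made up for illustration; only the format (five comma-separated fields, dash-joined decimal values, MD5 of the resulting string) follows the published JA3 description, and GREASE filtering is left out for brevity.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ClientHello:
    """The handshake fields JA3 looks at (values below are illustrative)."""
    tls_version: int
    ciphers: Tuple[int, ...]
    extensions: Tuple[int, ...]
    curves: Tuple[int, ...]
    point_formats: Tuple[int, ...]


def ja3_string(hello: ClientHello) -> str:
    # Five comma-separated fields; values inside a field are dash-joined decimals.
    fields = [
        str(hello.tls_version),
        "-".join(map(str, hello.ciphers)),
        "-".join(map(str, hello.extensions)),
        "-".join(map(str, hello.curves)),
        "-".join(map(str, hello.point_formats)),
    ]
    return ",".join(fields)


def ja3_hash(hello: ClientHello) -> str:
    return hashlib.md5(ja3_string(hello).encode("ascii")).hexdigest()


# Hypothetical handshake sent by an HTTP library.
script_hello = ClientHello(771, (4866, 4867, 4865), (0, 11, 10, 35), (29, 23, 24), (0,))
# Same library after "rotating the user agent": the handshake is unchanged,
# because User-Agent is an HTTP header and never enters the ClientHello.
script_hello_new_ua = ClientHello(771, (4866, 4867, 4865), (0, 11, 10, 35), (29, 23, 24), (0,))
# Hypothetical browser: different cipher order and extension set.
browser_hello = ClientHello(771, (4865, 4866, 4867), (0, 23, 65281, 10, 11), (29, 23, 24), (0,))

print(ja3_hash(script_hello) == ja3_hash(script_hello_new_ua))  # same fingerprint
print(ja3_hash(script_hello) == ja3_hash(browser_hello))        # different fingerprint
```

The script's fingerprint only changes when the TLS stack itself changes, which is why header-level disguises stop helping once a site looks at the handshake.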

The Reddit threads above are the visible tip of the iceberg: every few weeks someone hits the next layer and posts. The OP of post #3, running 50K req/day across 200 retailer domains, put it explicitly: "sites that worked fine 3 months ago are now fortress-level protected."

The boring exit: be a real browser, not a fake one

If detection is the problem, the cheapest fix is to not get detected, by actually being the thing detection systems look for: a real human browser with a real session, real cookies, real prior navigation history, and a real TLS fingerprint.

Concretely:

Log into the target site once, manually, in your real Chrome. You complete the OTP, you click the Turnstile checkbox, you let Incapsula score you as human. None of that is automated. None of it has to be.
Drive the automation from that authenticated session. The cookies, localStorage, and IndexedDB stay in your real browser profile. Your script doesn't need a proxy because it's not sending traffic from anywhere suspicious; it's literally your browser's tab making the request.
The challenge already passed. You're inside the moat. Re-prompting per-request is rare; most sites gate on "is there a valid post-OTP session cookie?" and stop checking once they see one.
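The "valid post-OTP session cookie" gate can be modeled in a few lines. This is a toy server-side model, not any real site's logic: the `ToySite` class, the one-week lifetime, and the 200/401 responses are all assumptions chosen to show the shape of the check.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToySite:
    """Hypothetical gate: challenge only when no live session exists."""
    session_ttl: float = 7 * 24 * 3600  # assumed one-week session lifetime
    sessions: Dict[str, float] = field(default_factory=dict)
    challenges_served: int = 0

    def login_with_otp(self, now: float) -> str:
        # The human completes OTP/CAPTCHA once; the site issues a session cookie.
        self.challenges_served += 1
        token = secrets.token_hex(16)
        self.sessions[token] = now + self.session_ttl
        return token

    def handle(self, cookie: Optional[str], now: float) -> int:
        # 200 while the session is live, 401 (log in again) otherwise.
        expires_at = self.sessions.get(cookie) if cookie else None
        if expires_at is not None and now < expires_at:
            return 200
        return 401


site = ToySite()
cookie = site.login_with_otp(now=0.0)

# A thousand automated requests riding on the one manual login.
statuses = [site.handle(cookie, now=float(t)) for t in range(1000)]
# Eight days later the session has lapsed and a human has to log in again.
after_expiry = site.handle(cookie, now=8 * 24 * 3600.0)
```

One challenge is served for the whole batch; the next one appears only when the session lapses.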

This is what the terse "reuse the cookie" reply in post #4 was gesturing at, though nobody elaborated. The technique is correct; the missing piece was how to actually do it without it falling apart every two days when cookies refresh or an SSO flow redirects you.

How Tap implements it

Tap (MIT, GitHub) is a Chrome extension plus CLI plus MCP server that does exactly the above. The interesting design choices:

Tap runs inside your authenticated Chrome. The extension hooks into your existing browser session. There is no headless instance, no remote profile, no storage_state.json file to manage. Your cookies never leave the machine.
Plans are JSON, not scripts. A saved .plan.json is an array of typed ops — nav, fetch, eval, wait, etc. Compile-time AI helps you write them; runtime replay is deterministic and uses zero LLM tokens.
Fetches use credentials: "page-session", meaning the fetch is dispatched from within the authenticated tab's context. Your live cookies attach automatically. The same-origin policy applies just as it would to a request made by the page's own scripts.
Replay is reproducible. A plan that worked today produces the same DOM/JSON tomorrow, modulo upstream data changes. When selectors drift, the doctor surfaces it; when they don't, the plan runs at the speed of HTTP, not the speed of an LLM.
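To make "an array of typed ops, replayed deterministically" concrete, here is a minimal sketch of a loader and a replay loop. The actual .plan.json schema is not shown in this post, so the field names (`op`, `url`, `ms`, `expr`) are hypothetical; only the op names and the "page-session" credentials value come from the text above. The replay records a trace instead of driving a browser.

```python
import json

ALLOWED_OPS = {"nav", "fetch", "eval", "wait"}  # op names taken from the text above

# Hypothetical plan; field names are illustrative, not Tap's actual schema.
PLAN_JSON = """
[
  {"op": "nav",   "url": "https://example.com/account"},
  {"op": "wait",  "ms": 500},
  {"op": "fetch", "url": "https://example.com/api/orders", "credentials": "page-session"},
  {"op": "eval",  "expr": "orders.length"}
]
"""


def load_plan(text: str) -> list:
    """Parse a plan and reject anything that is not an array of known ops."""
    plan = json.loads(text)
    if not isinstance(plan, list):
        raise ValueError("a plan must be a JSON array of ops")
    for i, step in enumerate(plan):
        if not isinstance(step, dict) or step.get("op") not in ALLOWED_OPS:
            raise ValueError(f"step {i}: unknown or missing op")
    return plan


def replay(plan: list) -> list:
    """Walk the ops in order and return a trace; no model call, no randomness."""
    trace = []
    for step in plan:
        args = {k: v for k, v in step.items() if k != "op"}
        trace.append(f"{step['op']} {json.dumps(args, sort_keys=True)}")
    return trace


plan = load_plan(PLAN_JSON)
first_run = replay(plan)
second_run = replay(plan)
```

Because the plan is plain data, two replays of the same file walk the same steps in the same order, and a diff of the JSON is a diff of the behavior.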

Three scenarios from the threads above, sketched as plan steps:

OTP-gated site (post #1, post #4)

1. Open Chrome, log into the site, complete the SMS OTP once, like a person.
2. tap capture <url>: Tap forges a plan against the page you can now see.
3. tap run <name> on a schedule. The session cookie carries auth; OTP doesn't re-trigger.

If the site eventually expires the session, you redo step 1. That's once a week or once a month depending on the site, not once per scrape.
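Knowing when to redo step 1 can be checked mechanically. A minimal sketch, assuming the site signals a dead session with a 401/403 or a redirect to a login page; the status codes, the login paths, and the function name are assumptions for illustration, not Tap behavior.

```python
from urllib.parse import urlparse


def session_expired(status: int, final_url: str, login_paths=("/login", "/signin")) -> bool:
    """Heuristic: did the site bounce us out of the authenticated area?"""
    if status in (401, 403):
        return True
    path = urlparse(final_url).path.rstrip("/").lower()
    # Match the login path itself or anything nested under it.
    return any(path == p or path.startswith(p + "/") for p in login_paths)


# Outcomes a scheduled run might observe (hypothetical values).
ok = session_expired(200, "https://example.com/account/orders")
bounced = session_expired(200, "https://example.com/login?next=/account/orders")
denied = session_expired(401, "https://example.com/api/orders")
```

A scheduled job could call a check like this after each run and notify you to log in again, rather than silently collecting login pages. Note that a 403 can also mean an anti-bot block, so treat the result as a prompt to look, not a diagnosis.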

CAPTCHA-on-login site (post #2)

Same as above. You solve the CAPTCHA once with your eyes. Passing the challenge yields a session token that stays valid for as long as the site honors it. The automation never sees a CAPTCHA because it never logs in; it just inherits the logged-in tab's state.

This sidesteps the entire "$1 per 1,000 CAPTCHAs + 80% accuracy" subdebate, and the "train a CNN on 200 hand-annotated samples" subdebate, by changing the problem. You're not solving CAPTCHAs anymore; you're not encountering them.

Incapsula / Cloudflare Turnstile (post #3, post #5)

Honest answer first: if your use case is 50K requests/day across 200 retailer domains, Tap is not your tool. That's a cloud-scale, multi-tenant, residential-proxy job — go pay Bright Data or ScraperAPI, or buy the data from an enterprise provider. Tap is built around one user × N sites, not N users × one site.

For per-user use cases (your own dashboards, your own SaaS accounts, your own paywalled subscriptions, niche sites you watch personally), Cloudflare and Incapsula generally don't fire. They're tuned to catch traffic that looks unlike normal human behavior. Your authenticated browser, navigating the way you'd navigate, doesn't trip the heuristics. The defenses that cost SaaS scrapers thousands a month to get around are effectively invisible to single-seat use.

The honest limits

Local-first browser automation is the right tool when:

You're scraping for yourself, your team, or one customer at a time.
The site allows a human session and you have a legitimate account.
The data isn't being re-sold to thousands of customers (where you'd want centralized infrastructure).
You can tolerate "log in once, refresh occasionally" instead of "fully headless 24/7 fleet."

It's not the right tool for:

Cloud-scale crawling at >5K req/day per site (residential proxy territory).
Multi-account fan-out — Tap is one browser per machine.
Sites with per-request session validation that re-prompts OTP/CAPTCHA every few hits (rare, mostly banking).
Headless server environments where there's no real Chrome to attach to.

If you're in those categories, the existing tooling (Playwright + residential proxies + a CAPTCHA solver + sometimes an OTP relay) is genuinely the right shape, despite the arms-race cost. Don't pretend otherwise.
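The two checklists above can be encoded as a small pre-flight check. This is only a sketch of the criteria as stated in this post; the `Workload` fields and the function name are made up for illustration, and the 5,000 req/day threshold is the figure quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    requests_per_day_per_site: int
    accounts_per_site: int            # distinct logins needed on the same site
    reprompts_every_few_requests: bool
    has_real_chrome: bool             # False on a headless server


def local_first_blockers(w: Workload) -> List[str]:
    """Return the reasons (if any) a local-first setup is the wrong shape."""
    blockers = []
    if w.requests_per_day_per_site > 5_000:
        blockers.append("cloud-scale volume")
    if w.accounts_per_site > 1:
        blockers.append("multi-account fan-out")
    if w.reprompts_every_few_requests:
        blockers.append("per-request OTP/CAPTCHA re-prompts")
    if not w.has_real_chrome:
        blockers.append("no real Chrome to attach to")
    return blockers


personal = Workload(300, 1, False, True)      # one person watching a few sites
fleet = Workload(50_000, 40, False, False)    # headless multi-account crawler

print(local_first_blockers(personal))  # no blockers
print(local_first_blockers(fleet))     # several blockers
```

An empty list means the workload fits the per-user category; any entry is a reason to reach for the conventional stack instead.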

The deeper architectural point

"Credentials in the cloud" is not a neutral architectural choice. It's the choice that makes most modern anti-scraping detection possible — because the cloud is where suspicious traffic comes from, by default. The moment your credentials live on the user's machine and your traffic comes from the user's actual residential IP through the user's actual TLS stack, you're indistinguishable from the user, because you are the user.

This isn't a fancy bypass. It's just the boring observation that the cheapest way to look human is to be operated by a human, even if only briefly, once per session.

r/webscraping has been having the same five conversations for years, each cycle getting harder as detection improves. The architectural exit ramp has been there the whole time. It's not for everyone — but for the per-user category, it's the cheapest, most durable, and least adversarial answer available.

Resources

Taprun: your agent runs the browser task — you keep the audit trail

Tell your agent a browser task on any site that needs your login — it runs in your real, already-logged-in Chrome and compiles it once into a deterministic, auditable .plan.json program: a versioned, reviewable record of exactly what it did. Every replay after is local, zero tokens, same result every time. Cookies and sessions never leave your machine — by architecture, not policy. Cloud browser SDKs can't match this; they need your session in their database to function. tap verify catches substrate drift before your data goes stale. Works with Claude Code, Cursor, Cline, Windsurf, and any MCP host. 70+ community taps.

curl -fsSL https://taprun.dev/install.sh | sh

