Open StackOverflow, filter to [playwright], sort by frequency. About every fifth question is some variant of this:
"I save storageState after login and load it on the next run, but the site logs me out anyway. What am I missing?"
The shape recurs on r/Playwright, on the Playwright GitHub issue tracker, in r/webscraping. The answers are usually some combination of "wait longer," "check your cookies," "maybe it's CSRF," "you need headless: false." Most of those answers are technically right about one case and wrong about most cases. After fixing this bug class enough times across enough sites, it stops looking like a collection of edge cases and starts looking like a single architectural mismatch.
This post is the architectural version of the answer: what a "logged-in" web session actually depends on, what Playwright captures, what it silently drops, and the operational cost of trying to simulate a session at all.
The web platform has accumulated, over twenty-five years, at least four orthogonal places where a server-issued session token can live in your browser:
- Cookies. Set via Set-Cookie on a response, sent back on every request that matches the domain/path/Secure/SameSite predicate. Visible to JavaScript unless HttpOnly.
- localStorage.
- sessionStorage.
- IndexedDB.
Plus a load-bearing fifth thing that isn't storage at all:
- In-memory state: the /api/me call the SPA makes on first paint, cookies set as side effects of OAuth callback URLs the user already navigated through. None of this survives a fresh navigation. It exists because the previous navigation happened.
"Are you logged in?" is the question "can the server identify you as a known user, on the next request you make from this page?" A real human's browser usually has all five mechanisms in agreement. Your Playwright run almost never does.
What storageState captures
From the Playwright docs:
storageState() returns storage state for this browser context, contains current cookies and local storage snapshot.
Cookies. localStorage. Two of the five.
The other three (sessionStorage, IndexedDB, in-memory state) are not in that snapshot. Playwright 1.51 and later can add IndexedDB through an opt-in indexedDB option on storageState(), but sessionStorage and in-memory state stay out of reach either way. If a site's login depends on any of them, your scraper will log in once during the priming run, save state, load state on run #2, and discover it's logged out again. The cookies are present; the access token in sessionStorage is not; the SPA's bootstrap code sees no token in sessionStorage, decides you're a fresh visitor, and redirects to the login page.
The Playwright team is not hiding this. The behavior is documented. The problem is that the documentation describes a snapshot, and developers reach for it assuming it captures the session. Those are different things.
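To see the gap concretely, it can help to list what a saved snapshot actually holds. The sketch below assumes the documented storage-state shape (a cookies array, plus origins that each carry localStorage entries); the function name summarizeStorageState and the sample data are hypothetical.

```javascript
// Summarize what a storageState() snapshot holds.
// Assumes the documented shape: { cookies: [...], origins: [{ origin, localStorage }] }.
function summarizeStorageState(state) {
  const cookies = (state.cookies ?? []).map((c) => `${c.domain}:${c.name}`);
  const localStorageKeys = (state.origins ?? []).flatMap((o) =>
    (o.localStorage ?? []).map((entry) => `${o.origin}:${entry.name}`)
  );
  return { cookies, localStorageKeys };
}

// Hand-written sample standing in for a real auth.json.
const example = {
  cookies: [
    { name: 'refresh_token', value: 'opaque', domain: 'app.example.com', path: '/', httpOnly: true },
  ],
  origins: [
    { origin: 'https://app.example.com', localStorage: [{ name: 'theme', value: 'dark' }] },
  ],
};

console.log(summarizeStorageState(example));
```

If the token your target site relies on does not show up in either list, reloading the snapshot will not restore the session.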
A real site (anonymized; this shape recurs across at least four B2B SaaS apps I've debugged this year). Login flow:
1. The SPA sends POST /api/auth/login.
2. The server responds with Set-Cookie: refresh_token=...; HttpOnly; SameSite=Strict and a JSON body containing { accessToken: "ey..." }.
3. The SPA reads accessToken from the body and stashes it in sessionStorage["accessToken"].
4. Subsequent API calls carry Authorization: Bearer ....
5. When it needs a fresh access token, the SPA calls POST /api/auth/refresh. The refresh cookie identifies the user; the server returns a new access token; the SPA stashes it in sessionStorage and continues.
This is a normal, defensible design. The refresh token gets HttpOnly + SameSite=Strict protection. The access token is short-lived and dies when the tab closes. The user gets persistence across tabs (refresh cookie) without giving a long-lived bearer token to document.cookie.
Now you run Playwright. page.goto('/login'), fill, submit, context.storageState({ path: 'auth.json' }). On run #2: browser.newContext({ storageState: 'auth.json' }). Navigate to the dashboard. You are not logged in.
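Spelled out, the two runs look roughly like the sketch below. It takes a Playwright Browser as a parameter; the URL, selectors, and function names are hypothetical stand-ins for whatever the real site uses.

```javascript
// Run #1: log in through the UI and save the snapshot.
// `browser` is a Playwright Browser; URL and selectors are assumptions.
async function primeSession(browser, { user, pass }, statePath = 'auth.json') {
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto('https://app.example.com/login');
  await page.fill('#email', user);
  await page.fill('#password', pass);
  await page.click('button[type="submit"]');
  await page.waitForURL('**/dashboard');
  // Writes cookies and localStorage; sessionStorage is not part of the file.
  await context.storageState({ path: statePath });
  await context.close();
}

// Run #2: start a fresh context from the saved snapshot.
async function reuseSession(browser, statePath = 'auth.json') {
  const context = await browser.newContext({ storageState: statePath });
  const page = await context.newPage();
  await page.goto('https://app.example.com/dashboard');
  // May already point at the login page if the SPA needs sessionStorage.
  return page.url();
}
```

In a real script the browser would come from something like chromium.launch(); it is passed in here so the flow stays easy to exercise with a stub.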
Why: auth.json contains the refresh cookie (cookies are captured) but the access token is in sessionStorage (not captured). The SPA bootstraps, looks for sessionStorage["accessToken"], finds nothing, decides you're unauthenticated, redirects to login. The refresh cookie is sitting right there in your request headers, perfectly valid, and the client-side code never asks for it because nothing told it the user was mid-session.
The "fix" in the StackOverflow answers is usually to wait for the refresh endpoint to fire automatically. That works on sites where the SPA bootstrap optimistically calls /api/auth/refresh on every load. It doesn't work when the bootstrap checks sessionStorage first and only calls refresh when it has reason to.
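The difference between those two bootstrap styles is small enough to model as two pure functions. This is a toy model, not any real site's code: the function names, the env shape, and the hasRefreshCookie flag are all made up for illustration.

```javascript
// Two hypothetical SPA bootstrap styles, reduced to pure functions.
// env.sessionStorage is a Map; env.hasRefreshCookie says whether the
// HttpOnly refresh cookie would accompany a call to /api/auth/refresh.

// Style 1: no token in sessionStorage? Try the refresh endpoint anyway.
function optimisticBootstrap(env) {
  if (env.sessionStorage.has('accessToken')) return 'dashboard';
  return env.hasRefreshCookie ? 'dashboard' : 'login';
}

// Style 2: trust sessionStorage first; no token means "fresh visitor".
function storageFirstBootstrap(env) {
  return env.sessionStorage.has('accessToken') ? 'dashboard' : 'login';
}

// What a context restored from storageState looks like to the page:
// the cookie is back, sessionStorage is empty.
const restored = { sessionStorage: new Map(), hasRefreshCookie: true };

console.log(optimisticBootstrap(restored));   // dashboard
console.log(storageFirstBootstrap(restored)); // login
```

Same cookie jar, same snapshot, opposite outcomes; which one you get depends entirely on code you don't control.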
You can manually round-trip the missing storage:
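// After login: copy the sessionStorage keys that storageState() does not capture.
const sessionData = await page.evaluate(() => ({
  accessToken: sessionStorage.getItem('accessToken'),
  // …other keys you've reverse-engineered
}));
// On reload: restore them before any page script runs.
// (sessionData has to be persisted between runs, e.g. next to auth.json.)
await page.addInitScript((data) => {
  for (const [k, v] of Object.entries(data)) {
    // getItem() returns null for absent keys; setItem would store the string "null".
    if (v !== null) sessionStorage.setItem(k, v);
  }
}, sessionData);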
This works. It also rots fast. Every time the site rotates which storage key holds which token, your scraper breaks. Every time they add IndexedDB to the mix, your scraper breaks. Every time they introduce a service worker that intercepts the /api/me call and caches the response, your scraper breaks. You're now maintaining a parallel, reverse-engineered model of someone else's auth implementation. That model has a half-life measured in deploys.
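If you do go down this road, one way to shorten the debugging loop is to fail at capture time instead of on the next run. The guard below is a sketch under the assumption that you keep an explicit list of the keys you depend on; EXPECTED_KEYS and both function names are hypothetical.

```javascript
// Fail loudly when the keys you reverse-engineered are no longer present.
// EXPECTED_KEYS is an assumption about one particular site, not a standard.
const EXPECTED_KEYS = ['accessToken'];

// Returns the expected keys whose captured value is null or undefined.
function missingKeys(snapshot, expected = EXPECTED_KEYS) {
  return expected.filter((k) => snapshot[k] === null || snapshot[k] === undefined);
}

// Throws if anything is missing; otherwise returns the snapshot unchanged.
function assertSnapshotComplete(snapshot, expected = EXPECTED_KEYS) {
  const missing = missingKeys(snapshot, expected);
  if (missing.length > 0) {
    throw new Error(`sessionStorage keys missing after login: ${missing.join(', ')}`);
  }
  return snapshot;
}
```

Running assertSnapshotComplete on the captured object right after login turns a silent key rotation into an immediate error. It does not make the reverse-engineered model any less fragile.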
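chromium.launchPersistentContext(userDataDir, options) reuses a Chrome user-data directory across runs. Cookies, localStorage, IndexedDB, service worker registrations: everything Chrome writes to the profile persists, because you're running against the same on-disk Chrome profile both times. The one caveat is sessionStorage, which is scoped to a tab and so carries over only if Chrome restores the previous session's tabs.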
This actually works. The catch is operational: a persistent context dir is a stateful artifact. It holds cookies that expire, service workers that get invalidated, cache entries that go stale. CI runs that share a persistent context across multiple jobs hit race conditions on the lock file. Persistent contexts also can't be containerized cleanly — the whole point is on-disk state, which doesn't survive a fresh container.
And the credential exposure surface gets larger: anything that can read userDataDir can impersonate the user on every site they were logged into. If that directory ends up in a CI artifact, a build cache, a developer's home directory backed up to a corporate sync — the blast radius is your entire browser history's worth of sessions.
The third option is to skip the simulation entirely. Connect Playwright (or a CDP client) to the Chrome the user is already running, the one they logged in to last week, the one with all five storage mechanisms already in their normal lived-in state.
The Chrome DevTools Protocol exposes this directly. So does the Chrome extension API. The trade-off space is different from headless Playwright:
| | storageState reload | Persistent context | Attach to real Chrome |
|---|---|---|---|
| Captures all 5 storage mechanisms | No (2 of 5) | Yes | Yes |
| Reproducible across machines | Yes (small file) | Partial (large dir) | No (machine-bound) |
| Survives credential rotation by the site | No (breaks) | Until next forced re-auth | Yes (user re-auths normally) |
| Containerizable | Yes | Awkward | No |
| Credential blast radius | One site, in a small file | All sites, in a large dir | None — credentials don't leave the user's machine |
| Right for… | Sites with cookie-only auth | Sites with sessionStorage/SW auth, single-machine CI | Per-user automation, anything credential-sensitive |
The third column is the architecture Taprun defaults to and the reason this whole bug class disappears when you use it. You aren't simulating a session. You're driving the actual session the user is already in.
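For the CDP route specifically, Playwright's chromium.connectOverCDP can attach to a Chrome that was started with remote debugging enabled. The sketch below is not Tap's implementation (Tap uses an extension, described further down); the endpoint, the port, and the function name are assumptions, and how Chrome gets started with debugging enabled is left out.

```javascript
// Attach to a Chrome that is already running with remote debugging enabled,
// e.g. started with --remote-debugging-port=9222 (the port is an assumption).
// `chromium` is Playwright's chromium object, passed in by the caller.
async function attachToRunningChrome(chromium, endpoint = 'http://127.0.0.1:9222') {
  const browser = await chromium.connectOverCDP(endpoint);
  // Over CDP, the browser's existing default context is exposed here.
  const [context] = browser.contexts();
  if (!context) throw new Error('No browser context found on ' + endpoint);
  // These are the tabs that are already open, with whatever state they hold.
  return { browser, context, pages: context.pages() };
}
```

A caller would pass in the chromium export from the playwright package and then pick the tab it wants from pages. Nothing is logged in, saved, or reloaded; the pages carry the session they already have.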
The decision tree is shorter than it looks:
- Cookie-only auth? (document.cookie auth, no JS-side token in sessionStorage.) storageState reload is fine.
- Auth that depends on sessionStorage or a service worker, on a single CI machine? Use a persistent context.
- Per-user automation, or anything credential-sensitive? Attach to real Chrome.
That last row is what "scrape behind login" usually means in practice. The user has an account. They've logged in. They don't want to hand the cookie jar to a CI job or a vendor. They want a tool that runs in their browser, scoped to whatever they're already authorized to see. The right architecture is to stop pretending the scraper is a separate browser at all.
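One possible reading of the "Right for…" row in the table, written as a function. The flag names and return strings are hypothetical, and checking credential sensitivity first is my interpretation, not something the table states.

```javascript
// Pick a session strategy from three yes/no questions about the job.
// One interpretation of the comparison table's "Right for…" row.
function chooseStrategy({ cookieOnlyAuth, perUser, credentialSensitive }) {
  // Per-user or credential-sensitive work stays on the user's own machine.
  if (perUser || credentialSensitive) return 'attach-to-real-chrome';
  // Cookie-only auth is fully covered by the snapshot file.
  if (cookieOnlyAuth) return 'storageState';
  // Otherwise the token lives somewhere the snapshot does not reach.
  return 'persistent-context';
}

console.log(chooseStrategy({ cookieOnlyAuth: true, perUser: false, credentialSensitive: false }));
// storageState
```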
For completeness — this is the thing the post is implicitly about. Tap's runtime is a Chrome extension that exposes the user's already-authenticated tabs to a local MCP server. Plans (.plan.json) reference site state by URL and DOM selector, not by re-authentication script. The plan never sees the cookies; the extension never sends them anywhere; the agent that called the plan doesn't have credential scope. Re-login bugs of the kind in this post can't happen because there's no re-login: the user logged in last Tuesday, the session is whatever Chrome says it is right now, and the scraper inherits that state instead of constructing its own.
The cost is real: a Tap plan only runs on the machine where the user is logged in. You can't lift it to a CI runner and have it scrape on behalf of user accounts at scale. For per-user automation, that's the constraint you want. For mass anonymous scraping, you want a different tool.
If the bug-class enumeration in this post matches what you've been fighting — the Playwright re-login loop, the sessionStorage trap, the storageState that worked yesterday and doesn't today — try driving real Chrome for one of the per-user scrapers in your fleet and see whether the bug class survives the move. In my experience across roughly a dozen sites with this shape, it doesn't.
Tap is a local-first browser MCP — your agent drives your already-logged-in Chrome, plans compile once to .plan.json, replays are deterministic and free. taprun.dev · github.