Why your Playwright scraper keeps getting logged out

May 28, 2026 · Leon Ting · 8 min read · storageState, sessionStorage, service workers, and the boring fix

Open StackOverflow, filter to [playwright], sort by frequency. About every fifth question is some variant of this:

"I save storageState after login and load it on the next run, but the site logs me out anyway. What am I missing?"

The shape recurs on r/Playwright, on the Playwright GitHub issue tracker, in r/webscraping. The answers are usually some combination of "wait longer," "check your cookies," "maybe it's CSRF," "you need headless: false." Most of those answers are technically right about one case and wrong about most cases. After fixing this bug class enough times across enough sites, it stops looking like a collection of edge cases and starts looking like a single architectural mismatch.

This post is the architectural version of the answer: what a "logged-in" web session actually depends on, what Playwright captures, what it silently drops, and the operational cost of trying to simulate a session at all.

What "logged in" actually depends on

The web platform has accumulated, over twenty-five years, at least four orthogonal places where a server-issued session token can live in your browser:

Cookies: the original. Set-Cookie on a response, sent back on every request that matches the cookie's domain, path, Secure, and SameSite rules. Visible to JavaScript unless HttpOnly.
localStorage — origin-scoped, persists indefinitely until the user (or your code) clears it. Modern SPA login flows often stash a refresh token here.
sessionStorage — origin-scoped and tab-scoped. Wiped when the tab closes. Many SPAs put short-lived access tokens here so they die when the user closes the tab — a deliberate security property.
IndexedDB — origin-scoped structured storage. Used by service workers, by larger SPAs (Gmail, Linear, Notion), by anything that wants to cache auth-related state across reloads.

Plus a load-bearing fifth thing that isn't storage at all:

In-memory state on a live navigation — CSRF tokens issued by the page render, JWTs returned by an /api/me call the SPA makes on first paint, cookies set as side effects of OAuth callback URLs the user already navigated through. None of this survives a fresh navigation. It exists because the previous navigation happened.
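To see which of these stores a given site actually touches, you can list the key names from a logged-in page. A minimal sketch, assuming `page` is a Playwright Page that is already logged in; the function name `auditAuthStores` is mine, and in-memory state is by definition not something this can enumerate:

```javascript
// Collect the names (not values) of everything that could be holding auth
// state for the page's current origin. `page` is a Playwright Page.
async function auditAuthStores(page) {
  // Cookies the browser would send to the page's current URL.
  const cookies = await page.context().cookies(page.url());
  const webStorage = await page.evaluate(async () => ({
    localStorage: Object.keys(localStorage),
    sessionStorage: Object.keys(sessionStorage),
    // indexedDB.databases() is not available in every browser; fall back to [].
    indexedDB: indexedDB.databases
      ? (await indexedDB.databases()).map((db) => db.name)
      : [],
  }));
  return { cookies: cookies.map((c) => c.name), ...webStorage };
}
```

Anything that shows up under sessionStorage or indexedDB with a token-looking name is a hint that a cookie-only snapshot will not be enough.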

"Are you logged in?" is the question "can the server identify you as a known user, on the next request you make from this page?" A real human's browser usually has all five mechanisms in agreement. Your Playwright run almost never does.

What `storageState` captures

From the Playwright docs:

storageState() returns storage state for this browser context, contains current cookies and local storage snapshot.

Cookies. localStorage. Two of the five, by default.

The other three (sessionStorage, IndexedDB, in-memory state) are not in the default snapshot. Recent Playwright releases (1.51 and later) add an opt-in indexedDB: true option to storageState(), but sessionStorage and in-memory state have no equivalent. If a site's login depends on any of the missing pieces, your scraper will log in once during the priming run, save state, load state on run #2, and discover it's logged out again. The cookies are present; the access token in sessionStorage is not; the SPA's bootstrap code sees no token in sessionStorage, decides you're a fresh visitor, and redirects to the login page.

The Playwright team is not hiding this. The behavior is documented. The problem is that the documentation describes a snapshot, and developers reach for it assuming it captures the session. Those are different things.
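It helps to look at the file itself. The sketch below summarizes a parsed storageState file; the { cookies, origins } layout is the format Playwright writes, while the function name and the sample values are made up for illustration:

```javascript
// Summarize a parsed storageState file: which cookies and which localStorage
// keys it holds. There is no sessionStorage field to read.
function summarizeStorageState(state) {
  const cookies = (state.cookies || []).map((c) => `${c.domain}:${c.name}`);
  const localStorageKeys = (state.origins || []).flatMap((o) =>
    (o.localStorage || []).map((item) => `${o.origin}:${item.name}`)
  );
  return { cookies, localStorageKeys };
}

// Hypothetical contents of auth.json after a login on app.example.com.
const sample = {
  cookies: [{ name: 'refresh_token', value: '…', domain: 'app.example.com', path: '/' }],
  origins: [
    { origin: 'https://app.example.com', localStorage: [{ name: 'theme', value: 'dark' }] },
  ],
};
console.log(summarizeStorageState(sample));
```

If the token you care about is not in either list, reloading this file cannot restore it.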

An example: the JWT-in-sessionStorage trap

A real site (anonymized; this shape recurs across at least four B2B SaaS apps I've debugged this year). Login flow:

User submits credentials to POST /api/auth/login.
Server responds with Set-Cookie: refresh_token=...; HttpOnly; SameSite=Strict and a JSON body containing { accessToken: "ey..." }.
The SPA's auth interceptor reads accessToken from the body and stashes it in sessionStorage["accessToken"].
Every XHR after that pulls the token from sessionStorage and sets Authorization: Bearer ....
When the access token expires (15 minutes), the interceptor calls POST /api/auth/refresh. The refresh cookie identifies the user; the server returns a new access token; the SPA stashes it in sessionStorage and continues.

This is a normal, defensible design. The refresh token gets HttpOnly + SameSite=Strict protection. The access token is short-lived and dies when the tab closes. The user gets persistence across tabs (refresh cookie) without giving a long-lived bearer token to document.cookie.
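For readers who have not written one of these, the client side of that flow looks roughly like the sketch below. This is my reconstruction, not the site's code: `createAuthClient`, the injected `fetchImpl` and `storage` parameters, and the choice to refresh on a 401 (rather than on a timer) are all assumptions.

```javascript
// Hypothetical SPA auth client for the flow above. In a browser you would pass
// fetchImpl = fetch and storage = sessionStorage; both are injected here so the
// sketch also runs outside a browser.
function createAuthClient({ fetchImpl, storage }) {
  async function login(credentials) {
    // Same-origin request: the browser stores the HttpOnly refresh cookie itself.
    const res = await fetchImpl('/api/auth/login', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(credentials),
    });
    const { accessToken } = await res.json();
    storage.setItem('accessToken', accessToken);
  }

  async function refresh() {
    // No body needed: the refresh cookie identifies the user.
    const res = await fetchImpl('/api/auth/refresh', { method: 'POST' });
    if (!res.ok) throw new Error('refresh failed');
    const { accessToken } = await res.json();
    storage.setItem('accessToken', accessToken);
  }

  async function request(url, init = {}) {
    // Read the token at send time so a retry picks up the refreshed value.
    const send = () =>
      fetchImpl(url, {
        ...init,
        headers: { ...init.headers, Authorization: `Bearer ${storage.getItem('accessToken')}` },
      });
    let res = await send();
    if (res.status === 401) {
      await refresh();
      res = await send();
    }
    return res;
  }

  return { login, refresh, request };
}
```

Note where the access token lives for its whole life: in `storage`, which in the real app is sessionStorage.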

Now you run Playwright. page.goto('/login'), fill, submit, context.storageState({ path: 'auth.json' }). On run #2: browser.newContext({ storageState: 'auth.json' }). Navigate to the dashboard. You are not logged in.

Why: auth.json contains the refresh cookie (cookies are captured) but the access token is in sessionStorage (not captured). The SPA bootstraps, looks for sessionStorage["accessToken"], finds nothing, decides you're unauthenticated, redirects to login. The refresh cookie is sitting right there in your request headers, perfectly valid, and the client-side code never asks for it because nothing told it the user was mid-session.
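Spelled out as code, the two runs look something like this. It is a sketch under assumptions: `browser` is a Playwright Browser, and the base URL, selectors, and /dashboard path are placeholders you would replace with the real site's.

```javascript
// Run #1: log in and snapshot. URL, selectors, and credentials are placeholders.
async function primeSession(browser, { baseURL, username, password }) {
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto(`${baseURL}/login`);
  await page.fill('#username', username);
  await page.fill('#password', password);
  await page.click('button[type="submit"]');
  await page.waitForURL(`${baseURL}/dashboard`);
  // Writes cookies and localStorage; the sessionStorage token is not included.
  await context.storageState({ path: 'auth.json' });
  await context.close();
}

// Run #2: start a fresh context from the snapshot and report where we land.
async function reuseSession(browser, { baseURL }) {
  const context = await browser.newContext({ storageState: 'auth.json' });
  const page = await context.newPage();
  await page.goto(`${baseURL}/dashboard`);
  // Give the SPA time to run its client-side redirect, if it has one.
  await page.waitForLoadState('networkidle');
  const landedOn = page.url();
  await context.close();
  return landedOn;
}
```

On a site shaped like the one above, `reuseSession` would be expected to come back with the login URL even though the refresh cookie was sent with the request.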

The "fix" in the StackOverflow answers is usually "wait for the refresh endpoint to fire automatically." That works on sites where the SPA bootstrap optimistically calls /api/auth/refresh on every load. It doesn't work when the bootstrap checks sessionStorage first and only calls refresh when it has reason to.
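The difference between those two kinds of site fits in a few lines. Both functions below are hypothetical bootstrap routines, not code from any real app; `storage` stands in for sessionStorage and `refresh` for a call to POST /api/auth/refresh that resolves with a new access token.

```javascript
// Variant A: always try the refresh cookie first. A storageState reload works
// here, because the cookie alone is enough to get a new access token.
async function bootstrapOptimistic({ storage, refresh }) {
  try {
    storage.setItem('accessToken', await refresh());
    return 'dashboard';
  } catch {
    return 'login';
  }
}

// Variant B: trust sessionStorage only. An empty sessionStorage is read as
// "fresh visitor", so the valid refresh cookie is never used.
async function bootstrapStorageFirst({ storage }) {
  if (storage.getItem('accessToken') === null) return 'login';
  return 'dashboard';
}
```

With an empty `storage` and a working `refresh`, variant A ends on the dashboard and variant B ends on the login page. "Wait longer" only helps against variant A.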

Three fixes, ranked by leverage

Fix 1: capture more than storageState

You can manually round-trip the missing storage:

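// Run #1, after login: read the token out of the live page.
const sessionData = await page.evaluate(() => ({
  accessToken: sessionStorage.getItem('accessToken'),
  // …other keys you've reverse-engineered
}));

// On reload: seed sessionStorage before any page script runs.
await page.addInitScript((data) => {
  for (const [k, v] of Object.entries(data)) {
    // getItem() returns null for a missing key; setItem() would store "null".
    if (v !== null) sessionStorage.setItem(k, v);
  }
}, sessionData);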

This works. It also rots fast. Every time the site rotates which storage key holds which token, your scraper breaks. Every time they add IndexedDB to the mix, your scraper breaks. Every time they introduce a service worker that intercepts the /api/me call and caches the response, your scraper breaks. You're now maintaining a parallel, reverse-engineered model of someone else's auth implementation. That model has a half-life measured in deploys.
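If you do go this route, the snapshot has to survive between processes, and it should only be replayed on the origin it came from. A sketch of both halves, assuming a Playwright Page and BrowserContext; the function names and the file layout are mine, and the file holds live tokens, so treat it like auth.json:

```javascript
const fs = require('node:fs');

// Run #1: dump every sessionStorage key for the page's origin to disk.
async function saveSessionStorage(page, file) {
  const origin = new URL(page.url()).origin;
  const entries = JSON.parse(await page.evaluate(() => JSON.stringify(sessionStorage)));
  // Owner-only permissions: the file contains bearer tokens.
  fs.writeFileSync(file, JSON.stringify({ origin, entries }), { mode: 0o600 });
}

// Run #2: replay the snapshot before any page script runs, but only when the
// document being loaded belongs to the origin the snapshot was taken from.
async function restoreSessionStorage(context, file) {
  const snapshot = JSON.parse(fs.readFileSync(file, 'utf8'));
  await context.addInitScript(({ origin, entries }) => {
    if (window.location.origin !== origin) return;
    for (const [key, value] of Object.entries(entries)) {
      window.sessionStorage.setItem(key, value);
    }
  }, snapshot);
}
```

This removes the hard-coded key list, which helps with key renames. It does nothing for the IndexedDB and service-worker cases, and it still replays tokens that may have expired since run #1.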

Fix 2: persistent context

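chromium.launchPersistentContext(userDataDir, options) reuses a Chrome user-data directory across runs. Cookies, localStorage, IndexedDB, and service worker registrations all persist, because you're running against the same on-disk Chrome profile both times. sessionStorage is the exception: it is scoped to a tab, so a freshly launched run generally starts with it empty. A site that gates on a sessionStorage token still needs the round-trip from Fix 1, or a refresh call that repopulates it.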

This actually works. The catch is operational: a persistent context dir is a stateful artifact. It holds cookies that expire, service workers that get invalidated, cache entries that go stale. CI runs that share a persistent context across multiple jobs hit race conditions on the lock file. Persistent contexts also can't be containerized cleanly — the whole point is on-disk state, which doesn't survive a fresh container.

And the credential exposure surface gets larger: anything that can read userDataDir can impersonate the user on every site they were logged into. If that directory ends up in a CI artifact, a build cache, a developer's home directory backed up to a corporate sync — the blast radius is your entire browser history's worth of sessions.
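One way to shrink that surface is to give each site its own profile directory and lock its permissions down. A minimal sketch; the `pw-profile-<site>` naming scheme and the helper name are assumptions, and the permission bits only mean something on POSIX systems:

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Create (or reuse) a profile directory dedicated to one site, readable only
// by the current user.
function prepareProfileDir(baseDir, siteName) {
  const dir = path.join(baseDir, `pw-profile-${siteName}`);
  fs.mkdirSync(dir, { recursive: true, mode: 0o700 });
  // mkdir's mode is masked by umask and ignored when the directory already
  // exists, so set the permissions explicitly.
  fs.chmodSync(dir, 0o700);
  return dir;
}

// Usage (requires the playwright package; '/var/lib/scraper' is a placeholder):
// const { chromium } = require('playwright');
// const dir = prepareProfileDir('/var/lib/scraper', 'acme');
// const context = await chromium.launchPersistentContext(dir, { headless: true });
```

A per-site directory caps the blast radius at one site's sessions. It does not help if the directory itself is copied into a build cache or artifact, so exclude it from both.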

Fix 3: drive your real Chrome

The third option is to skip the simulation entirely. Connect Playwright (or a CDP client) to the Chrome the user is already running, the one they logged in to last week, the one with all five storage mechanisms already in their normal lived-in state.
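With Playwright, attaching looks roughly like the sketch below. It assumes Chrome was started with --remote-debugging-port=9222 (recent Chrome versions may restrict that flag on the default profile, so check yours); the port, the URL prefix, and the helper name `findOpenPage` are placeholders.

```javascript
// Find an already-open tab on the target site in a browser we attached to.
// `browser` is whatever chromium.connectOverCDP() returned.
function findOpenPage(browser, urlPrefix) {
  for (const context of browser.contexts()) {
    for (const page of context.pages()) {
      if (page.url().startsWith(urlPrefix)) return page;
    }
  }
  return null; // the user has no tab open on that site
}

// Usage (requires the playwright package and a Chrome listening on port 9222):
// const { chromium } = require('playwright');
// const browser = await chromium.connectOverCDP('http://localhost:9222');
// const page = findOpenPage(browser, 'https://app.example.com/');
// if (page) console.log(await page.title());
```

Reusing the tab the user already has open is the point: its sessionStorage and in-memory state are the real ones, not a reconstruction.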

The Chrome DevTools Protocol exposes this directly. So does the Chrome extension API. The trade-off space is different from headless Playwright:

| | storageState reload | Persistent context | Attach to real Chrome |
| --- | --- | --- | --- |
| Captures all 5 storage mechanisms | No (2 of 5 by default) | No (on-disk stores only; sessionStorage and in-memory state start empty) | Yes, while the tab stays open |
| Reproducible across machines | Yes (small file) | Partial (large dir) | No (machine-bound) |
| Survives credential rotation by the site | No (breaks) | Until next forced re-auth | Yes (user re-auths normally) |
| Containerizable | Yes | Awkward | No |
| Credential blast radius | One site, in a small file | All sites, in a large dir | None: credentials don't leave the user's machine |
| Right for… | Sites with cookie-only auth | Sites with IndexedDB/SW auth, single-machine CI | Per-user automation, anything credential-sensitive |

The third column is the architecture Taprun defaults to and the reason this whole bug class disappears when you use it. You aren't simulating a session. You're driving the actual session the user is already in.

When each fix is actually right

The decision tree is shorter than it looks:

Is the site cookie-only? (Cookie-based auth, no JS-side token in sessionStorage.) storageState reload is fine.
Does the site use IndexedDB or a service worker for auth, and do you control the CI environment? Persistent context is fine if you accept the operational complexity. For sessionStorage tokens, pair it with the round-trip from Fix 1, because a new tab starts with sessionStorage empty.
Is this a per-user automation? Does the user need to keep their credentials? Is the site behind MFA or OTP? Will the site rotate auth implementation faster than you can maintain a reverse-engineered model? Drive the real Chrome.

That last case is what "scrape behind login" usually means in practice. The user has an account. They've logged in. They don't want to hand the cookie jar to a CI job or a vendor. They want a tool that runs in their browser, scoped to whatever they're already authorized to see. The right architecture is to stop pretending the scraper is a separate browser at all.
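If you want the tree as something you can paste into a planning script, here is one reading of it. The flag names are mine, and so is the ordering: I check the per-user questions first because they override the others, which the list above implies but does not state.

```javascript
// One possible encoding of the decision tree. All flags default to false.
function chooseAuthStrategy({
  perUser = false,          // automation runs on behalf of one real user
  behindMfa = false,        // login needs MFA or OTP
  authChangesOften = false, // site reworks its auth faster than you can track
  cookieOnly = false,       // no JS-side token outside cookies
  controlsCi = false,       // you own the machine the profile dir lives on
} = {}) {
  if (perUser || behindMfa || authChangesOften) return 'attach to real Chrome';
  if (cookieOnly) return 'storageState reload';
  if (controlsCi) return 'persistent context';
  // JS-side auth state and no stable machine to keep a profile on.
  return 'attach to real Chrome';
}
```

For example, `chooseAuthStrategy({ cookieOnly: true })` picks the storageState reload, while adding `behindMfa: true` to the same call switches it to attaching to the real browser.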

What this looks like in Taprun

For completeness — this is the thing the post is implicitly about. Tap's runtime is a Chrome extension that exposes the user's already-authenticated tabs to a local MCP server. Plans (.plan.json) reference site state by URL and DOM selector, not by re-authentication script. The plan never sees the cookies; the extension never sends them anywhere; the agent that called the plan doesn't have credential scope. Re-login bugs of the kind in this post can't happen because there's no re-login: the user logged in last Tuesday, the session is whatever Chrome says it is right now, and the scraper inherits that state instead of constructing its own.

The cost is real: a Tap plan only runs on the machine where the user is logged in. You can't lift it to a CI runner and have it scrape on behalf of many user accounts at scale. For per-user automation, that's the constraint you want. For mass anonymous scraping, you want a different tool.

If the bug-class enumeration in this post matches what you've been fighting — the Playwright re-login loop, the sessionStorage trap, the storageState that worked yesterday and doesn't today — try driving real Chrome for one of the per-user scrapers in your fleet and see whether the bug class survives the move. In my experience across roughly a dozen sites with this shape, it doesn't.

Tap is a local-first browser MCP — your agent drives your already-logged-in Chrome, plans compile once to .plan.json, replays are deterministic and free. taprun.dev · github.
