You give Browser Use a task. It loads the page. Clicks once. Screenshots. Reasons. Clicks again. Screenshots. Reasons. Twelve minutes and eighteen screenshots later, it's still on the same page, trying the same button, making the same mistake.
You are not imagining it. A long r/AI_Agents thread titled "I tested 6 browser-use agents so you don't have to" is full of this:
"tried playwright mcp (33 tools, burns through context), browser-use (stuck in loops), puppeteer (selectors break constantly). We're asking code-focused LLMs to puppet browsers when they weren't trained for that — they're guessing at selectors and hoping elements load."
— u/FunBrilliant5713, r/AI_Agents
"Browser Use — hit or miss reliability. Felt like I was babysitting it."
— u/FunBrilliant5713, r/AI_Agents
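Here's the math nobody prints on the landing pages. Suppose each LLM-driven browser step has a 95% success rate, which is generous for anything beyond a plain text field. If steps fail independently, an n-step workflow succeeds with probability 0.95^n: about 81% at four steps, and about 36% at twenty.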
A four-hop checkout flow is four steps if you're lucky. A "book me a flight" agent is twenty. The expected outcome of a twenty-step LLM workflow is failure, not success. And when it fails, it rarely fails cleanly — it fails by trying again. That's your loop.
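The compounding is easy to check yourself. Here is a minimal sketch that assumes every step succeeds independently with the same probability; real steps are correlated, so treat the numbers as a rough guide. The function name `workflow_success` is made up for this illustration.

```python
def workflow_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


# 95% per step, at the workflow lengths discussed above
for steps in (1, 4, 10, 20):
    print(f"{steps:>2} steps: {workflow_success(0.95, steps):.1%}")
```

At twenty steps the whole-run success rate drops to roughly one in three.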
Most LLM browser agents treat planning as stateless. The model sees the current DOM, plans the next action, acts. When the action fails silently (wrong element, element not loaded, modal overlaying, click went to a sibling), the next iteration sees almost the same DOM and plans almost the same action. The loop is emergent behavior, not a bug.
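A toy model makes the mechanism concrete. This is not Browser Use's code: `plan_next_action` is a deterministic stand-in for the LLM, and `act` simulates a click that a modal silently swallows. Both names are hypothetical. Because the planner only sees the current DOM and the DOM never changes, it picks the same action every time.

```python
def plan_next_action(dom: str) -> str:
    """Stateless planner: the choice depends only on the current DOM."""
    return "click #submit" if "Submit" in dom else "done"


def act(dom: str, action: str) -> str:
    """Simulated silent failure: the click lands on a modal, DOM is unchanged."""
    return dom


def run(dom: str, max_steps: int) -> list[str]:
    """Plan, act, repeat. No memory of earlier attempts is kept."""
    history = []
    for _ in range(max_steps):
        action = plan_next_action(dom)
        if action == "done":
            break
        history.append(action)
        dom = act(dom, action)
    return history


history = run("<button>Submit</button><div class='modal'></div>", max_steps=18)
print(len(history), set(history))
```

Nothing in the loop is broken in isolation. It only stops because `max_steps` runs out.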
"It hallucinated form input values — writing '123 Main St' into a field I'd clearly told it to leave blank."
— agotterer, Hacker News (on browser-use)
Non-determinism makes the loop worse, because the agent doesn't remember why the last attempt failed. It just tries a fresh hallucination.
A vocal subset of the community already figured out the fix without reading an architecture blog post:
"The AI runs the workflow once, learns the pattern, then it executes without the LLM — making it 100x cheaper and way more reliable. My monthly LLM costs went from $200 to $2."
— u/Omega0Alpha, r/AI_Agents
That's the architecture. It has a name: forge. AI participates at authoring time. Runtime is deterministic code. No per-step re-planning. No hallucinated inputs. No loops.
    # One-time authoring: AI participates, writes a deterministic Plan
    $ tap capture https://example.com example/listings --intent "product listings"
    ✔ Saved: ~/.tap/plans/example/listings.plan.json

    # Every run after: deterministic, zero LLM
    $ tap example/listings
    30 rows (480ms, $0.00, 0 tokens)
The plan is bare JSON over a closed 11-op vocabulary. It calls real selectors. The per-tap CEL snapshot_equivalent predicate validates output. It either returns rows or fails loudly — no silent hallucination, no loop, no "hit or miss."
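To show why a closed vocabulary matters, here is a minimal sketch of a plan interpreter. It is an illustration only: the real plan schema and its 11 ops are not shown in this post, so the three ops here (`filter`, `pick`, `limit`), the `PlanError` class, and the in-memory `PAGE_ROWS` standing in for a scraped page are all assumptions.

```python
import json


class PlanError(Exception):
    """Raised when a plan cannot be executed or produces nothing."""


# Hypothetical ops; each takes rows plus JSON args and returns new rows.
def op_filter(rows, args):
    return [r for r in rows if r.get(args["field"]) == args["equals"]]


def op_pick(rows, args):
    return [{k: r[k] for k in args["fields"]} for r in rows]


def op_limit(rows, args):
    return rows[: args["n"]]


# The closed vocabulary: anything not listed here is rejected.
OPS = {"filter": op_filter, "pick": op_pick, "limit": op_limit}


def run_plan(plan, rows):
    """Execute each step in order; fail loudly instead of guessing."""
    for step in plan["steps"]:
        op = OPS.get(step["op"])
        if op is None:
            raise PlanError(f"unknown op: {step['op']!r}")
        rows = op(rows, step.get("args", {}))
    if not rows:
        raise PlanError("plan produced 0 rows")
    return rows


PLAN = json.loads("""
{"steps": [
  {"op": "filter", "args": {"field": "in_stock", "equals": true}},
  {"op": "pick",   "args": {"fields": ["title", "price"]}},
  {"op": "limit",  "args": {"n": 2}}
]}
""")

PAGE_ROWS = [
    {"title": "Lamp", "price": 20, "in_stock": True},
    {"title": "Desk", "price": 150, "in_stock": False},
    {"title": "Chair", "price": 45, "in_stock": True},
]

result = run_plan(PLAN, PAGE_ROWS)
print(result)
```

The plan is plain data and the interpreter has no fallback path, so the same input always produces the same rows or an exception.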
Sites change. That's the one variable scrapers can't control. But with determinism, failure is visible instead of ambient:
    $ tap verify example/listings
    verdict: drifted
    snapshot_equivalent returned false: size($.rows) >= 5 evaluated to false (got 0)
    recover: tap capture <url> example/listings
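The idea behind a drift check can be sketched in plain Python. This is a stand-in, not the CEL predicate or the `tap verify` implementation: the `Verdict` class, the `verify` function, and the `title` and `price` fields are hypothetical. The point is that the check returns an explicit verdict with a reason rather than quietly passing along bad rows.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    status: str  # "ok" or "drifted"
    reason: str  # empty when status is "ok"


def verify(rows, min_rows=5, required_fields=("title", "price")):
    """Compare fresh output against simple expectations and report drift."""
    if len(rows) < min_rows:
        return Verdict("drifted", f"expected at least {min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        missing = [f for f in required_fields if row.get(f) is None]
        if missing:
            return Verdict("drifted", f"row {i} is missing {missing}")
    return Verdict("ok", "")


healthy = [{"title": f"Item {i}", "price": 10 + i} for i in range(6)]
print(verify(healthy))  # six complete rows
print(verify([]))       # a changed selector that now matches nothing
```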
The fix is a selector swap. Hand the diff to your agent; it patches and ships. You never live through the 18-screenshot loop.
Browser Use and similar tools are still the right answer for tasks you'll run only once or twice.
For anything you'll do more than twice, compile it. Your future self (and your token bill) will thank you.