
We Ran 15,000 Browser Automations. The Failure That Matters Most Is Invisible to Your Monitoring.

April 20, 2026 · Leon Ting · 9 min read

Half of our YouTube automation runs that finish with status ok return 0 rows. No exception thrown. No error logged. The program finishes in about 20 seconds and hands back an empty array, silently.

We didn't know this until we looked at the traces.

Over the past few months, Tap has executed 15,455 automation programs across real websites — Reddit, GitHub, Bilibili, Xiaohongshu, YouTube, Twitter, and more. The traces are structured JSON: site, tap name, status, rows returned, duration, error message if any. We analyzed all of them. What we found disagrees with the conventional mental model of how browser automations break.

The Reliability Table Nobody Publishes

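Here are the actual numbers. Each row is a real platform, not a synthetic benchmark. Hard error rate is the fraction of runs that threw an exception. Silent empty rate is the fraction of successful runs (status: ok) that returned zero rows. Effective failure rate is the fraction of all runs that either threw or came back empty, which is why it is not the plain sum of the two columns before it.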

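Platform    | Total runs | Hard error % | Silent empty % | Effective failure % | Avg duration
------------|------------|--------------|----------------|---------------------|-------------
Twitter / X | 128        | 0%           | 0%             | 0%                  | 154 ms
GitHub      | 437        | 0%           | 0.2%           | 0.2%                | 3,644 ms
Reddit      | 688        | 13.8%        | 6.4%           | 19.4%               | 4,075 ms
Xiaohongshu | 361        | 15.8%        | 6.6%           | 21.5%               | 9,054 ms
Bilibili    | 259        | 30.1%        | 18.2%          | 43.1%               | 2,666 ms
Weibo       | 38         | 36.8%        | 0%             | 36.8%               | 4,644 ms
YouTube     | 49         | 30.6%        | 50.0%          | 65.3%               | 20,273 ms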
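
The three rate columns can be derived from raw traces roughly as follows. This is a minimal sketch: the `TapTrace` field names and the five sample runs are assumptions made up for illustration, not Tap's actual trace schema or data.

```typescript
// Hypothetical trace shape; field names are assumed, not Tap's real schema.
interface TapTrace {
  site: string;
  status: "ok" | "error";
  rows: number;
  durationMs: number;
}

interface PlatformRates {
  hardErrorRate: number;        // runs that threw / all runs
  silentEmptyRate: number;      // ok runs with zero rows / ok runs
  effectiveFailureRate: number; // (threw + silent empty) / all runs
}

function platformRates(traces: TapTrace[]): PlatformRates {
  const total = traces.length;
  const okRuns = traces.filter((t) => t.status === "ok");
  const hardErrors = total - okRuns.length;
  const silentEmpties = okRuns.filter((t) => t.rows === 0).length;
  return {
    hardErrorRate: total === 0 ? 0 : hardErrors / total,
    silentEmptyRate: okRuns.length === 0 ? 0 : silentEmpties / okRuns.length,
    effectiveFailureRate: total === 0 ? 0 : (hardErrors + silentEmpties) / total,
  };
}

// Made-up sample: 1 hard error, 4 ok runs, 2 of which are empty.
const sampleTraces: TapTrace[] = [
  { site: "example", status: "error", rows: 0, durationMs: 900 },
  { site: "example", status: "ok", rows: 0, durationMs: 20000 },
  { site: "example", status: "ok", rows: 0, durationMs: 19500 },
  { site: "example", status: "ok", rows: 12, durationMs: 3100 },
  { site: "example", status: "ok", rows: 9, durationMs: 2800 },
];

console.log(platformRates(sampleTraces));
```

For the sample above, the hard error rate is 1 in 5, the silent empty rate is 2 in 4, and the effective failure rate is 3 in 5.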

A few things stand out immediately.

GitHub and Twitter are near-zero failure. Both expose stable public APIs that the DOM reflects predictably. GitHub's average duration is 3.6 seconds because it's doing real network work — but the result is reliable.

YouTube is the opposite end of the spectrum. Two out of three runs either throw an error or return nothing. And the 50% silent empty rate is more alarming than the 30.6% hard error rate, because at least hard errors are visible.

50% of YouTube "successful" runs return 0 rows
43% effective failure rate on Bilibili
19% effective failure rate on Reddit

The Failure Mode You're Not Tracking

Here's the part that surprised us most. We expected "element not found" to be the dominant failure. The conventional model: selector breaks, automation throws, you fix the selector. Obvious, visible, actionable.

The actual numbers:

Element not found (explicit selector failure): 5 occurrences
Cannot read properties of undefined (reading 'url') (implicit structural failure): 176 occurrences

The ratio is 35:1 in favor of the failure mode your monitoring doesn't catch.

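What does Cannot read properties of undefined (reading 'url') actually mean in practice? The selector found something. The extraction ran. The automation didn't crash during navigation. But somewhere on the path to the link, a value the code expected to be an object came back undefined: a nested container that used to wrap the link is gone, or a list slot that used to hold a card is empty. The code reads .url on that undefined value and throws. (A url field that is merely missing from an otherwise intact object would not throw at all; item.url would quietly evaluate to undefined and the bad row would flow downstream.)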

This is a structural drift failure, not a selector failure. The DOM element is there. The page loaded. The program traversed the right nodes. But the shape of the data those nodes return has changed — a field that was always present quietly stopped being present.
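
To make the distinction concrete, here is a small sketch of how the implicit failure arises and how an extractor could report it as drift instead. The `VideoCard` shape and the `naiveUrl` and `extractUrl` helpers are hypothetical, invented for illustration; they are not part of Tap.

```typescript
// Hypothetical shape the tap was written against.
interface VideoCard {
  title: string;
  link: { url: string };
}

// Simulates a page variant where the `link` container has disappeared.
// The cast mirrors runtime reality: the types promise a field the page no longer delivers.
const driftedCard = JSON.parse('{"title": "Some video"}') as VideoCard;

function naiveUrl(card: VideoCard): string {
  return card.link.url; // throws a TypeError when `link` is undefined
}

type Extraction =
  | { ok: true; url: string }
  | { ok: false; reason: string };

// Checks each level of the structure before reading it, so drift is reported as drift.
function extractUrl(card: unknown): Extraction {
  if (typeof card !== "object" || card === null) {
    return { ok: false, reason: "card is not an object" };
  }
  const link = (card as { link?: unknown }).link;
  if (typeof link !== "object" || link === null) {
    return { ok: false, reason: "structural drift: link container missing" };
  }
  const url = (link as { url?: unknown }).url;
  if (typeof url !== "string" || url.length === 0) {
    return { ok: false, reason: "structural drift: link.url missing" };
  }
  return { ok: true, url };
}

try {
  naiveUrl(driftedCard);
} catch (error) {
  console.log(error instanceof TypeError); // the implicit failure
}
console.log(extractUrl(driftedCard)); // the same drift, reported explicitly
```

The explicit version costs a few extra checks, but its failure message names the missing container rather than the line that happened to dereference it.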

The sites affected, in order of frequency:

That list spans Chinese platforms, Western platforms, social networks, news sites, and developer bounty boards. The failure mode is not platform-specific. It's inherent to how browser automation interacts with any site that changes its rendering.

Why Your Monitoring Doesn't See This

Consider what's happening at the infrastructure layer when this failure occurs:

Most monitoring stacks see a successful process exit followed by an application exception. If you're logging structured errors, you get the TypeError. But the harder version of this failure is when the object does have a url field — it just points to something different than it used to. A related item section. A sponsored result. A pagination link that got included in the data array.

In those cases: status ok, rows returned, no exception, wrong data. Pydantic passes because the type is correct. Row count checks pass because rows exist. Prometheus reports a healthy process. OTel has nothing to report. The only signal is semantic: these URLs aren't the URLs you wanted.

The empty-result version of the same blind spot accounts for the 121 silent-empty runs we logged across real sites: automations that completed successfully and returned no useful data. Some of them ran for 20 seconds before returning an empty array that looked exactly like a valid empty result.

The Platform Reliability Pattern

Looking at the data, there's a clear split between platforms with stable structural contracts and those without.

GitHub and Twitter have published APIs that their web UIs reflect. A GitHub repository page structure is stable because it's owned by the same team that maintains the underlying data model. When GitHub updates its UI, the data contract tends to stay consistent because the API is the source of truth.

Bilibili, Douyin, Xiaohongshu, and Weibo sit at the other end. These platforms run aggressive A/B experiments on their rendering layer — sometimes multiple experiments simultaneously for different user cohorts. The same page, loaded twice in the same session, can return different DOM structures. The url field on a video card might be in item.url in one experiment variant and item.jumpUrl in another.
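
One way to tolerate that kind of variant split is to resolve the link from a list of known field names and surface anything else as a miss. A minimal sketch, assuming only the two field names mentioned above; `RawVideoItem` and `resolveUrl` are hypothetical names.

```typescript
// Hypothetical raw item: which field carries the link depends on the experiment variant.
interface RawVideoItem {
  title: string;
  url?: string;
  jumpUrl?: string;
}

// Returns the link from whichever known variant field is populated, or null
// so the caller can count the row as a structural miss instead of crashing.
function resolveUrl(item: RawVideoItem): string | null {
  const candidates = [item.url, item.jumpUrl];
  for (const candidate of candidates) {
    if (typeof candidate === "string" && candidate.length > 0) {
      return candidate;
    }
  }
  return null;
}

const variantA: RawVideoItem = { title: "a", url: "https://example.com/video/1" };
const variantB: RawVideoItem = { title: "b", jumpUrl: "https://example.com/video/2" };
const unknownVariant: RawVideoItem = { title: "c" };

console.log(resolveUrl(variantA), resolveUrl(variantB), resolveUrl(unknownVariant));
```

Returning null rather than throwing means a third, unseen variant shows up as a rising miss count that a contract can alert on.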

This isn't negligence. It's an engineering culture optimized for rapid iteration. The consequence for automation is that the structural contract you negotiated when you wrote the tap expires faster than on Western platforms.

YouTube tops the failure table for a different reason: aggressive anti-bot measures that return empty results instead of blocking requests. A request that would earn a naive scraper a 429 or a CAPTCHA page instead returns 200 with an empty content container to a logged-out browser session. Status: ok. Rows: 0. Duration: 20 seconds of wasted compute.

What Catches This, and What Doesn't

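Tool                                       | Catches hard error? | Catches silent empty? | Catches wrong data (right shape)?
-------------------------------------------|---------------------|-----------------------|----------------------------------
Process monitoring                         | Yes                 | No                    | No
Pydantic / type validation                 | Yes                 | Sometimes             | No
Row count threshold                        | Yes                 | Yes                   | No
Health contracts (range + pattern + drift) | Yes                 | Yes                   | Yes
Structural fingerprinting (tap doctor)     | Yes                 | Yes                   | Signals change, not interpretation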

The only layer that catches all three failure classes is a contract that validates semantics — not just shape. A min_rows check on the health contract catches silent empties. A pattern check on URLs catches wrong-source data. A drift check catches distribution shifts that look like valid data but represent a changed ranking or filtering behavior.
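
The three checks can be sketched as a single function that returns a list of violations. Everything here is an assumption for illustration: the `HealthContract` shape, the field names, and in particular the drift measure (Jaccard overlap with the previous run's URLs), which is one simple choice and not necessarily how Tap computes drift.

```typescript
// Hypothetical contract shape; names mirror the prose, not Tap's real config format.
interface HealthContract {
  minRows: number;    // fewer rows than this counts as a silent empty
  urlPattern: RegExp; // every row's url must match (use a non-global regex)
  minOverlap: number; // minimum Jaccard overlap with the previous run, 0..1
}

interface ResultRow {
  url: string;
}

// Share of urls common to both sets, relative to their union.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach((value) => {
    if (b.has(value)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

function checkContract(
  contract: HealthContract,
  rows: ResultRow[],
  previousUrls: string[]
): string[] {
  const violations: string[] = [];

  if (rows.length < contract.minRows) {
    violations.push(`min_rows: expected >= ${contract.minRows}, got ${rows.length}`);
  }

  const offPattern = rows.filter((row) => !contract.urlPattern.test(row.url));
  if (offPattern.length > 0) {
    violations.push(`pattern: ${offPattern.length} url(s) do not match ${contract.urlPattern.source}`);
  }

  // Drift is only meaningful once there is a previous run to compare against.
  if (previousUrls.length > 0) {
    const overlap = jaccard(new Set(rows.map((row) => row.url)), new Set(previousUrls));
    if (overlap < contract.minOverlap) {
      violations.push(`drift: overlap ${overlap.toFixed(2)} below ${contract.minOverlap}`);
    }
  }
  return violations;
}

const videoContract: HealthContract = {
  minRows: 2,
  urlPattern: /^https:\/\/example\.com\/video\//,
  minOverlap: 0.5,
};
```

An empty violations list means the run passed; anything else turns a status ok into a visible failure with a reason attached.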

Tap's tap doctor adds structural fingerprinting on top: it checksums the DOM elements your tap depends on and alerts when the structure changes, before you run the tap and discover the data is wrong after the fact.

$ tap doctor bilibili videos
✘ bilibili/videos  FAIL  score: 0.41
  structural fingerprint changed: .video-card .jump-link (url extraction point)
  min_rows: expected >= 10, got 0
  ⚠ last successful run: 6 hours ago
  ✓ nav: page loads
  ✓ non_empty: title field present

The fingerprint change is the leading indicator. The zero rows are the downstream consequence. Without the fingerprint check, you'd wait until a run completes — and potentially burn pipeline compute — before discovering the tap is broken. With it, you know before you run.
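
The idea behind a structural fingerprint can be sketched in a few lines. This is not how tap doctor is implemented, which the post does not describe; it is a minimal illustration that assumes a simplified `DomNode` tree and uses a 32-bit FNV-1a hash as a stand-in checksum.

```typescript
// Minimal stand-in for a DOM element: only what the fingerprint needs.
interface DomNode {
  tag: string;
  classes: string[];
  children: DomNode[];
}

// Serialises tag names, sorted class names, and nesting. Text content is ignored on purpose,
// so new videos do not change the fingerprint but renamed or moved elements do.
function structureKey(node: DomNode): string {
  const classes = node.classes.slice().sort().join(".");
  const children = node.children.map(structureKey).join(",");
  return `${node.tag}.${classes}(${children})`;
}

// 32-bit FNV-1a over UTF-16 code units; used here only as a short, stable checksum.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

function fingerprint(node: DomNode): string {
  return fnv1a(structureKey(node));
}

// Pre-flight: compare the live structure against the fingerprint stored at the last good run.
function preflight(current: DomNode, expectedFingerprint: string): boolean {
  return fingerprint(current) === expectedFingerprint;
}

const lastGoodCard: DomNode = {
  tag: "div",
  classes: ["video-card"],
  children: [{ tag: "a", classes: ["jump-link"], children: [] }],
};
const driftedCardTree: DomNode = {
  tag: "div",
  classes: ["video-card"],
  children: [{ tag: "a", classes: ["card-link"], children: [] }],
};

console.log(preflight(driftedCardTree, fingerprint(lastGoodCard)));
```

Because the comparison needs only the page structure, it can run before the extraction logic and skip the full tap when the structure has moved.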

What We'd Do Differently

Looking at the data, a few things are clear in retrospect.

Treat silent empties as first-class failures. A run that returns zero rows should be treated as suspicious by default, not as a valid empty result. Most automations that legitimately return zero rows are edge cases. Most automations that return zero rows unexpectedly are broken. The difference is detectable with a min_rows contract.

Fingerprint before running, not after. The structural drift that causes Cannot read properties of undefined is detectable in the DOM before you run your extraction logic. A fingerprint check is cheaper than a full tap execution. Running it as a pre-flight saves compute on taps that will fail anyway.

Treat Chinese platforms as a separate reliability tier. Not because of quality — but because the A/B experiment cadence is genuinely different. A tap targeting Bilibili or Xiaohongshu needs shorter contract drift windows and more frequent doctor checks than one targeting GitHub. Build that into the maintenance schedule.

Duration is a signal. Our YouTube taps average 20 seconds per run and fail 65% of the time. That's not slow extraction — that's 20 seconds of waiting for content that's not coming. A timeout contract that fires at 8 seconds would catch most of these early and save the rest of the pipeline from waiting.
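
A timeout of this kind can be sketched as a wrapper around the run's promise. The `withTimeout` helper and the `fakeSlowRun` stand-in are hypothetical; the 8-second default is the figure from the paragraph above, not a tuned value.

```typescript
// Budget taken from the 8-second figure in the text; tune per tap.
const RUN_BUDGET_MS = 8000;

// Rejects if `work` has not settled within `budgetMs`; otherwise passes its result through.
function withTimeout<T>(work: Promise<T>, budgetMs: number = RUN_BUDGET_MS): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      reject(new Error(`timeout: no result after ${budgetMs} ms`));
    }, budgetMs);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}

// Stand-in for a real tap run: resolves with zero rows after a delay.
function fakeSlowRun(delayMs: number): Promise<string[]> {
  return new Promise((resolve) => setTimeout(() => resolve([]), delayMs));
}

// Short durations keep the demo quick; a real call would rely on the 8000 ms default.
withTimeout(fakeSlowRun(50), 10).catch((error: Error) => {
  console.log(error.message);
});
```

Note that this wrapper only stops waiting. It does not cancel the underlying browser work, so a real runner would also need to abort the page or session when the budget expires.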


The trace data from 15,455 runs is the most honest answer we have to the question "what actually breaks in browser automation?" The answer is: silent structural drift, not explicit selector failure. The sites that change fastest break most. The failures that matter most are the ones that look like success.

If you're building or maintaining automations, the most useful thing you can add isn't better error handling for explicit failures — it's a contract layer that turns silent failures into visible ones.

Try tap doctor on your automations

$ npx -y @taprun/cli doctor <site> <tap>
# structural fingerprint check before you run
# health contract validation after you run
# diff output showing exactly what changed