The Interface Protocol: 8 Operations That Replace Every Browser Automation SDK

April 13, 2026 · Leon Ting · 7 min read

Playwright has 400+ API methods. Puppeteer has 300+. Selenium has its own taxonomy of WebDriver, WebElement, Actions, Options. Every browser automation framework invents its own surface area, and every program written against one framework stays tied to that framework until someone rewrites it.

Tap takes a different approach. 8 core operations. That's the entire interface between a tap program and any runtime. A Chrome extension implements them. A Playwright runtime implements them. A macOS desktop runtime implements them. The programs don't know or care which one is running underneath.

The Problem: SDK Lock-in

Write a Playwright script today. Tomorrow you need it to run inside a Chrome extension (because the site needs real login cookies). Rewrite. Next month you need to automate a native macOS app. Rewrite again.

Each framework carries implicit assumptions:

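Playwright assumes a browser it launched and controls (over CDP for Chromium, over its own protocols for Firefox and WebKit).
Puppeteer assumes a browser process, typically Chrome, that it launched or attached to.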
Selenium assumes a WebDriver binary running somewhere.
AppleScript assumes macOS with accessibility permissions.

Your automation logic — "click this button, read that table, type into that field" — is identical across all of them. The code is completely different. The abstraction layer is missing.

8 Core Operations

Every interface interaction — browser, desktop, mobile — reduces to eight irreducible operations:

eval          Execute code in the target context, return result
pointer       Move / click / drag at coordinates
keyboard      Type text or press key combinations
nav           Navigate to a URL
wait          Wait for a condition (time, selector, network idle)
screenshot    Capture the current visual state
run           Execute another tap program
capabilities  Report what this runtime supports

That's it. Not 400 methods. Not a class hierarchy. Eight functions that every runtime implements, and every tap program calls.

The insight is irreducibility. We didn't start with "what features should we support?" and enumerate. We started with "what is the minimum set of operations from which every other operation can be composed?" and reduced.

17 Built-in Operations, Composed from 8

If 8 operations are all you have, how do you click a button by its text? How do you fill a form? How do you upload a file?

You compose:

// click("Submit") is not a core operation.
// It's composed from two core operations:

click(target) =
  eval(find target → get coordinates)
  + pointer(x, y, 'click')

// fill("#field", "hello") types a value into a field:

fill(selector, value) =
  eval(find element → focus it)
  + keyboard(value)

// upload a file:

upload(selector, path) =
  eval(find file input)
  + runtime-specific file injection

Tap ships 17 built-in operations composed from core: click, type, fill, hover, scroll, pressKey, select, upload, dialog, fetch, find, cookies, download, waitFor, waitForNetwork, ssrState, storage. Each one is defined in terms of the same 8 primitives, so it works on any runtime that supports the primitives it needs. A few, such as upload and cookies, also depend on a runtime-specific step or capability.
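The compositions above can be sketched in a few lines of JavaScript. This is an illustration, not Tap's source: makeRecordingRuntime is a hypothetical stand-in that stubs only the three core operations click and fill need, and its eval receives a function over a fake element table instead of running code in a real page.

```javascript
// Hypothetical recording runtime: stubs eval, pointer, and keyboard only.
function makeRecordingRuntime(elements) {
  const calls = [];
  return {
    calls,
    // Assumption: "code" is a function that receives the fake element table.
    async eval(code) {
      calls.push(["eval"]);
      return code(elements);
    },
    async pointer(x, y, action) {
      calls.push(["pointer", x, y, action]);
    },
    async keyboard(text) {
      calls.push(["keyboard", text]);
    },
  };
}

// click(target) = eval(find target, get coordinates) + pointer(x, y, "click")
async function click(runtime, target) {
  const point = await runtime.eval((els) => {
    const el = els[target];
    return el ? { x: el.x, y: el.y } : null;
  });
  if (!point) throw new Error(`click: target not found: ${target}`);
  await runtime.pointer(point.x, point.y, "click");
}

// fill(selector, value) = eval(find element, focus it) + keyboard(value)
async function fill(runtime, selector, value) {
  const found = await runtime.eval((els) => {
    const el = els[selector];
    if (!el) return false;
    el.focused = true;
    return true;
  });
  if (!found) throw new Error(`fill: element not found: ${selector}`);
  await runtime.keyboard(value);
}
```

Neither function knows what sits behind runtime.eval or runtime.pointer, which is the property that lets the same composition run on different runtimes.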

A runtime can override a built-in for performance — Chrome's extension runtime uses CDP for file uploads because it's faster than the composed path — but it doesn't have to. The default composition works everywhere.
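One way to picture the override mechanism is a lookup that prefers the runtime's own implementation and otherwise falls back to the default composition. The sketch below is an assumption about shape only: the overrides property, resolveBuiltin, and the "composed"/"native" return markers are hypothetical names for the demo, and it uses click instead of upload because the text spells out click's composition.

```javascript
// Safe own-property lookup, so inherited keys like "toString" never match.
const own = (obj, key) =>
  obj != null && Object.prototype.hasOwnProperty.call(obj, key) ? obj[key] : undefined;

// Default compositions, built only from core operations.
const defaultBuiltins = {
  async click(runtime, target) {
    // Placeholder lookup string; a real runtime would evaluate real code.
    const { x, y } = await runtime.eval(`locate:${target}`);
    await runtime.pointer(x, y, "click");
    return "composed"; // marker for the demo only
  },
};

// Prefer the runtime's override, else use the default composition.
function resolveBuiltin(runtime, name) {
  const impl = own(runtime.overrides, name) ?? own(defaultBuiltins, name);
  if (typeof impl !== "function") throw new Error(`unknown built-in: ${name}`);
  return (...args) => impl(runtime, ...args);
}

// A runtime with no overrides: the composed path is used.
const plainRuntime = {
  async eval() { return { x: 1, y: 2 }; },
  async pointer() {},
};

// A runtime that swaps in its own faster click (hypothetical).
const fastRuntime = {
  ...plainRuntime,
  overrides: {
    async click() { return "native"; }, // marker for the demo only
  },
};
```

Programs call the resolved function either way, so an override changes performance characteristics without changing the program.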

What This Means for Programs

A .tap.js program calls built-in operations. It never calls Playwright methods, Chrome APIs, or AppleScript commands. This makes every tap program runtime-agnostic by construction:

// This tap reads the page DOM, so it runs on the browser-backed
// runtimes (Chrome extension, Playwright) without a single line changed

export default {
  site: "github",
  name: "trending",

  async tap(handle) {
    await handle.nav("https://github.com/trending");
    const repos = await handle.eval(() => {
      return [...document.querySelectorAll("article.Box-row")]
        .map(el => ({
          name: el.querySelector("h2 a")?.textContent?.trim(),
          stars: el.querySelector("span.d-inline-block")?.textContent?.trim(),
        }));
    });
    return { rows: repos };
  }
}

Switch runtimes with a flag:

$ tap github trending                        # Default runtime: Chrome extension
$ tap github trending --runtime playwright   # Headless, CI-friendly
$ tap github trending --runtime chrome       # Chrome extension, named explicitly: user's real browser, logged in

Same program. Same output. Different runtime underneath. The program doesn't know and doesn't need to know.
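In code, "same program, different runtime" amounts to handing one program object to different handles. The sketch below assumes nothing about Tap's real runner: runTap, makeStubA, and makeStubB are hypothetical, the stubs implement only nav and eval, and the returned rows are made-up data.

```javascript
// A tap program in the same shape as the example above.
const trending = {
  site: "example",
  name: "trending",
  async tap(handle) {
    await handle.nav("https://example.com/trending");
    const rows = await handle.eval("listRepos");
    return { rows };
  },
};

// Stub runtime A: tracks the current URL in a single variable.
function makeStubA() {
  let currentUrl = null;
  return {
    async nav(url) { currentUrl = url; },
    async eval() {
      return currentUrl ? [{ name: "owner/repo", stars: "42" }] : [];
    },
  };
}

// Stub runtime B: different internals (a history list), same contract.
function makeStubB() {
  const history = [];
  return {
    async nav(url) { history.push(url); },
    async eval() {
      return history.length > 0 ? [{ name: "owner/repo", stars: "42" }] : [];
    },
  };
}

// Hypothetical runner: selects a runtime by name, as a --runtime flag might.
async function runTap(program, runtimes, runtimeName) {
  const make = runtimes[runtimeName];
  if (typeof make !== "function") throw new Error(`unknown runtime: ${runtimeName}`);
  return program.tap(make());
}
```

The program body never branches on which stub it received; only the runner knows.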

Why Not Just Use Playwright Everywhere?

Playwright is excellent. It's the best browser automation library ever built. But it's a library, not a protocol. The difference matters:

Capability                               Library (Playwright)        Protocol (Tap)
Browser you control                      Yes                         Yes
User's real browser (cookies, sessions)  Limited (attach over CDP)   Yes (Chrome runtime)
Native desktop apps                      No                          Yes (macOS runtime)
Mobile apps (future)                     No                          Planned (protocol is extensible)
CI / headless                            Yes                         Yes (Playwright runtime)
Program portability                      Locked to Playwright        Any runtime

A protocol abstracts the runtime. A library is the runtime. When you write against a library, you get that library's capabilities and constraints. When you write against a protocol, you get every present and future runtime that implements it.

The TCP/IP Analogy

Before TCP/IP won out, the major network vendors each had their own protocol suite. IBM had SNA. DEC had DECnet. Novell had IPX/SPX. Applications written for one couldn't talk to another. TCP/IP defined a minimal, composable interface, and over the following decades the proprietary suites were either retired or carried over IP networks.

Browser automation in 2026 is pre-TCP/IP networking. Every vendor has its own interface. Programs are locked to vendors. The abstraction layer that should exist — a minimal protocol for interface operations — doesn't.

Tap's 8 core operations are that layer. Not because 8 is a magic number, but because 8 is what you get when you reduce instead of enumerate. You don't need 400 methods. You need the minimum set from which 400 methods can be composed.

How a New Runtime Gets Added

Implementing a Tap runtime is implementing 8 functions:

// A runtime is this interface. Nothing more.

interface TapRuntime {
  eval(code: string): Promise<any>
  pointer(x: number, y: number, action: string): Promise<void>
  keyboard(text: string, opts?: KeyOpts): Promise<void>
  nav(url: string): Promise<void>
  wait(condition: WaitCondition): Promise<void>
  screenshot(opts?: ScreenshotOpts): Promise<Buffer>
  run(site: string, name: string, args?: object): Promise<Result>
  capabilities(): Promise<Capabilities>
}

Implement these 8 methods and you get all 17 built-in operations for free. click, fill, upload, scroll — they all compose from your 8 implementations without any additional code.
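A rough sketch of how "for free" could work: a wrapper checks that a runtime exposes the eight core methods, then layers built-ins on top using only those methods. createHandle and makeMemoryRuntime are hypothetical names, the two built-ins shown are simplified compositions, and the `locate:` string is a placeholder for real lookup code.

```javascript
const CORE_OPS = [
  "eval", "pointer", "keyboard", "nav",
  "wait", "screenshot", "run", "capabilities",
];

// Verify the eight core methods exist, then attach composed built-ins.
function createHandle(runtime) {
  const missing = CORE_OPS.filter((op) => typeof runtime[op] !== "function");
  if (missing.length > 0) {
    throw new Error(`runtime is missing core operations: ${missing.join(", ")}`);
  }
  const handle = {};
  for (const op of CORE_OPS) {
    handle[op] = (...args) => runtime[op](...args);
  }
  // Built-ins written once, against the core operations only.
  handle.click = async (target) => {
    const { x, y } = await runtime.eval(`locate:${target}`);
    await runtime.pointer(x, y, "click");
  };
  handle.pressKey = (key) => runtime.keyboard(key);
  return handle;
}

// In-memory runtime for illustration; every method is a stub that logs.
function makeMemoryRuntime() {
  const log = [];
  return {
    log,
    async eval(code) { log.push(`eval ${code}`); return { x: 5, y: 7 }; },
    async pointer(x, y, action) { log.push(`pointer ${x},${y} ${action}`); },
    async keyboard(text) { log.push(`keyboard ${text}`); },
    async nav(url) { log.push(`nav ${url}`); },
    async wait() { log.push("wait"); },
    async screenshot() { return new Uint8Array(0); },
    async run(site, name) { log.push(`run ${site}/${name}`); return { rows: [] }; },
    async capabilities() { return { eval: true, pointer: true, keyboard: true }; },
  };
}
```

A new runtime author would write only the second function; the first is shared by every runtime.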

Tap's Chrome extension runtime is ~600 lines. The Playwright runtime is ~400 lines. The macOS runtime (JXA + CGEvent + Accessibility API) is ~500 lines. That's how small a runtime is when the protocol is right.

Capability Negotiation

Not every runtime supports everything. The macOS runtime can't do cookies (there's no browser). A headless Playwright runtime can't access a user's logged-in session. A mobile runtime might not support keyboard the same way.

This is what capabilities() is for:

// Chrome extension runtime
capabilities() {
  return {
    eval: true,
    pointer: true,
    keyboard: true,
    cookies: true,       // can read/write browser cookies
    upload: true,        // CDP file injection
    sessions: true,      // user's real login sessions
  }
}

// Playwright runtime
capabilities() {
  return {
    eval: true,
    pointer: true,
    keyboard: true,
    cookies: true,
    upload: true,
    sessions: false,     // no user sessions, fresh context
    headless: true,      // can run without display
  }
}

A tap can declare requires: ["sessions"] and Tap will route it to a runtime that has sessions. A CI pipeline can request --runtime playwright knowing it supports headless. The protocol handles the negotiation; the program stays clean.
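The routing step can be sketched as a filter over reported capabilities. This is one plausible reading of the negotiation, not Tap's actual router: routeTap, missingCapabilities, and the runtime list are hypothetical, and the capability maps are copied from the two examples above.

```javascript
// Names in `required` that the runtime does not report as true.
function missingCapabilities(required, caps) {
  return required.filter((name) => caps[name] !== true);
}

// Pick the first runtime that satisfies the tap's `requires` list.
// `preferred` mimics an explicit --runtime choice and fails on a mismatch.
async function routeTap(tap, runtimes, preferred) {
  const required = tap.requires ?? [];
  const candidates = preferred
    ? runtimes.filter((rt) => rt.name === preferred)
    : runtimes;
  if (preferred && candidates.length === 0) {
    throw new Error(`unknown runtime: ${preferred}`);
  }
  for (const rt of candidates) {
    const caps = await rt.capabilities();
    if (missingCapabilities(required, caps).length === 0) return rt.name;
  }
  throw new Error(`no runtime provides: ${required.join(", ")}`);
}

const runtimes = [
  {
    name: "playwright",
    capabilities: async () => ({
      eval: true, pointer: true, keyboard: true,
      cookies: true, upload: true, sessions: false, headless: true,
    }),
  },
  {
    name: "chrome",
    capabilities: async () => ({
      eval: true, pointer: true, keyboard: true,
      cookies: true, upload: true, sessions: true,
    }),
  },
];
```

With these maps, a tap that requires sessions skips the Playwright entry and lands on Chrome, while forcing Playwright for that same tap fails with an error instead of running without a login.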

Protocol vs. Implementation

The deepest design principle behind the interface protocol: the tap API is the protocol, not the implementation.

The Chrome extension is not Tap. Playwright is not Tap. The macOS runtime is not Tap. They are implementations of the protocol. The protocol is the 8 operations, the 17 built-in compositions, and the capability negotiation contract.

This distinction is why Tap programs survive runtime changes. When Chrome ships a breaking API change, the Chrome runtime updates and zero tap programs change. When a new platform appears (Android, iOS, Electron), a new runtime implements 8 methods, and every existing tap program whose required capabilities that runtime reports can run on it unchanged.

This is what "programs beat prompts" looks like at the architecture level. The program is written against a protocol, not a vendor. The protocol is minimal enough to implement anywhere. The implementation is swappable. The program is permanent.

Programs Beat Prompts — why AI should write code, not run it
Composable Taps Are Just JavaScript — why the pipeline DSL is JS, not YAML
Deterministic vs AI Browser Automation — compiler model vs interpreter model

Try it now

# Install
curl -fsSL https://taprun.dev/install.sh | sh

# Same tap, different runtimes
tap github trending                         # Chrome
tap github trending --runtime playwright    # Headless
