Earlier today we shipped three new MCP tools in one session: tap.explain (static analysis of a pipe before running it), forge.pipe (compose a pipe from a natural-language goal), and tap.trace (post-mortem inspection of a previous tap.run). The architecture argument was that they form an observability sandwich around tap.run: tap.explain says what will happen, tap.run does it, tap.trace says what did happen.
An architecture argument is worth approximately nothing until you use the tools on a real case. This post is what happened when I tried to compose an RDK market-scan-style pipe in Claude Code, ran it against r/sysadmin, and watched the tool chain walk me through four iterations to a working version. Each of the first three rounds produced a new class of failure. Every failure that reached a live run was diagnosable in one tap.trace call, and each of those fixes came from a specific field in the previous round's trace, not from guessing. The final pipe returned five rows that are genuinely relevant to the user's goal. And along the way the demo found a silent bug in our own hand-written reference pipe that nobody had noticed because the bug is invisible unless you pick a subreddit whose headline style defeats a 19-keyword classifier.
forge.pipe turned out to be the wrong tool

The goal was the sort of thing any RDK user would actually say: scan r/sysadmin for the top pain-point discussion threads, rank them by community engagement, and return the top five. The hand-written reference for this is rdk/market-scan, a six-step pipe that composes reddit/sub-intel, reddit/pain-points, reddit/hot, tap/filter, tap/sort, and tap/limit into a structured report. I wanted to see if the new tools could produce something similar starting from the English goal alone.
I reached for mcp__tap__forge_pipe and hit a practical snag: my AI config pointed at Ollama with a model that wasn't installed. The tool would have errored on the HTTP call to the AI endpoint. I had three apparent options — pull a local model, hand over a cloud API key, or skip forge.pipe entirely and use the underlying primitives directly.
The third option is the one that matters, and it's why I'm writing this post instead of a forge.pipe tutorial. Claude Code is already an AI agent. It's the MCP host. It has full access to tap.list, tap.explain, and forge.save. The forge.pipe tool exists to package the draft → explain → iterate loop for clients that don't have built-in AI. When the client is an AI, the loop lives inside the client and the tool call becomes unnecessary. The primitives are the interface; forge.pipe is a convenience wrapper for thin clients.
For the rest of this post, when I write "I drafted a pipe" or "I noticed X in the trace," that's Claude Code drafting the pipe and noticing things as part of this very conversation, using the same MCP server a Tap user would have. Zero external AI API calls. Zero per-round cost. The work Claude does as the host is work the user is already paying for just by typing into the chat.
Step one was reading the catalog. mcp__tap__tap_list({site: "reddit"}) returned 23 reddit taps, including reddit/pain-points, whose description reads "Extract pain points and complaints from a subreddit — validate product ideas." Its columns include title, score, comments, url, excerpt, and a type column that appeared to classify rows into pain-point and discussion categories. A second call to tap.list({site: "tap"}) turned up the transform building blocks: filter, sort, limit, dedupe, pick, table.
That's enough to draft. The obvious minimal pipe chains three steps — fetch pain points, sort by score, take top N:
```json
{
  "steps": [
    { "id": "pain", "run": ["reddit", "pain-points"],
      "args": { "subreddit": "$args.subreddit", "sort": "hot", "limit": 25 } },
    { "id": "ranked", "run": ["tap", "sort"],
      "args": { "rows": "$pain.rows", "field": "score", "order": "desc" } },
    { "id": "top", "run": ["tap", "limit"],
      "args": { "rows": "$ranked.rows", "n": 5 } }
  ],
  "return": "$top.rows"
}
```
I passed this as raw JSON to mcp__tap__tap_explain with args: {subreddit: "sysadmin"}. The response came back in under twenty milliseconds: {ok: true, blocking: false, warnings: false}, three nodes, three sequential rounds, all sub-taps resolved to on-disk paths, no cycles, no unresolved refs, no schema warnings. By the structural contract, the pipe was runnable.
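The structural checks listed in that response (every $ref points at a real step, no cycles, steps grouped into rounds) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Tap's actual implementation: `planRounds` and its return shape are hypothetical, and it assumes references are strings of the form `$stepId.field`, with `$args` reserved for pipe arguments.

```javascript
// Hypothetical sketch of an explain-style structural check; not Tap's real code.
// Assumes references look like "$<stepId>.<field>" and that "$args.*" refers
// to pipe arguments rather than to a step.
function planRounds(pipe) {
  const ids = new Set(pipe.steps.map((s) => s.id));
  const deps = new Map();
  const unresolved = [];

  for (const step of pipe.steps) {
    const needs = new Set();
    for (const value of Object.values(step.args || {})) {
      const match = typeof value === "string" && /^\$([A-Za-z_][\w-]*)\./.exec(value);
      if (!match || match[1] === "args") continue;
      if (ids.has(match[1])) needs.add(match[1]);
      else unresolved.push({ step: step.id, ref: value });
    }
    deps.set(step.id, needs);
  }

  // Layer the graph: each round holds every step whose dependencies are done.
  const rounds = [];
  const done = new Set();
  while (done.size < deps.size) {
    const ready = [...deps.keys()].filter(
      (id) => !done.has(id) && [...deps.get(id)].every((d) => done.has(d))
    );
    // Nothing is ready but steps remain: the remaining steps form a cycle.
    if (ready.length === 0) return { ok: false, rounds, unresolved, cycle: true };
    ready.forEach((id) => done.add(id));
    rounds.push(ready);
  }
  return { ok: unresolved.length === 0, rounds, unresolved, cycle: false };
}

// The round 1 pipe from this post.
const round1 = {
  steps: [
    { id: "pain", run: ["reddit", "pain-points"],
      args: { subreddit: "$args.subreddit", sort: "hot", limit: 25 } },
    { id: "ranked", run: ["tap", "sort"],
      args: { rows: "$pain.rows", field: "score", order: "desc" } },
    { id: "top", run: ["tap", "limit"],
      args: { rows: "$ranked.rows", n: 5 } },
  ],
  return: "$top.rows",
};

const plan = planRounds(round1);
console.log(plan.rounds); // three rounds: ["pain"], then ["ranked"], then ["top"]
```

Nothing in this kind of check looks at row contents, which is why a pipe can pass it and still answer the wrong question.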
By the semantic contract it had a problem I could see as soon as I re-read reddit/pain-points's columns. The type field returns either pain_point or discussion, depending on some internal classifier. The user asked for top pain points. My pipe was going to include every row — including discussions — and return the top five by score, which is subtly not what was asked for. This is the sort of thing tap.explain can't catch, because it's a domain concern, not a structural one. It's why iterative drafting with a review pass matters even when the static check passes.
Round 2 adds a tap/filter step between the reddit fetch and the sort, filtering by field: "type", eq: "pain_point". Four steps, four sequential rounds of scheduling:
```json
{
  "steps": [
    { "id": "pain", "run": ["reddit", "pain-points"],
      "args": { "subreddit": "$args.subreddit", "sort": "hot", "limit": 25 } },
    { "id": "pain_only", "run": ["tap", "filter"],
      "args": { "rows": "$pain.rows", "field": "type", "eq": "pain_point" } },
    { "id": "ranked", "run": ["tap", "sort"],
      "args": { "rows": "$pain_only.rows", "field": "score", "order": "desc" } },
    { "id": "top", "run": ["tap", "limit"],
      "args": { "rows": "$ranked.rows", "n": "$args.limit" } }
  ],
  "return": "$top.rows"
}
```
tap.explain came back ok: true again. I wrote the full .tap.js with metadata (site, name, columns, health, args, examples, the pipe) and called mcp__tap__forge_save({site: "rdk", name: "pain-scan", code: "..."}). Saved to ~/.tap/taps/rdk/pain-scan.tap.js, auto-committed as git c6f9a89. The only surprise in the save response was a false-positive lint warning — checkTapQuality doesn't yet know about static pipe: {} taps and emitted "Missing tap()" even though the executor synthesizes a forwarder for that form. Noted on the follow-up list, but the save itself was real.
Then I ran it against live Reddit:
```javascript
mcp__tap__tap_run({
  site: "rdk",
  name: "pain-scan",
  args: { subreddit: "sysadmin", limit: 5 }
})
```
The response came back 5458 milliseconds later:
```json
{
  "columns": ["title", "score", "comments", "url", "excerpt", "type"],
  "rows": [],
  "count": 0,
  "timing": { "run_ms": 5458, "total_ms": 5458 },
  "run_id": "mnubwmsj-534ecc"
}
```
Zero rows. No error. Five and a half seconds spent on something. The pipe I had just validated with tap.explain, saved to disk, and run through the executor returned an empty array. This is the canonical "why is my pipe broken" moment. The old tap.jsonl log line for this would say exactly {event: "run", ms: 5458, rows: 0}, which is true but not useful. Before tap.trace, the post-mortem would involve adding console.log statements to the executor and re-running, which is both slower and pointless.
type: "discussion"

I called mcp__tap__tap_trace({run_id: "mnubwmsj-534ecc"}). The response was a structured execution record with per-step nodes. Edited for length:
```jsonc
{
  "run_id": "mnubwmsj-534ecc",
  "total_ms": 5458,
  "status": "ok",
  "rows_out": 0,
  "pipe": {
    "nodes": [
      { "id": "pain", "rows_out": 23, "duration_ms": 5454,
        "args_resolved": { "subreddit": "sysadmin", "sort": "hot", "limit": 25 } },
      { "id": "pain_only", "rows_out": 0, "duration_ms": 1,
        "args_resolved": { "rows": [ /* 23 objects, ALL with type: "discussion" */ ],
                           "field": "type", "eq": "pain_point" } },
      { "id": "ranked", "rows_out": 0, "duration_ms": 0 },
      { "id": "top", "rows_out": 0, "duration_ms": 1 }
    ],
    "rounds_actual": [["pain"], ["pain_only"], ["ranked"], ["top"]],
    "run_cache_misses": 4,
    "run_cache_hits": 0
  }
}
```
The answer is in the first two rows_out values. The pain step fetched 23 rows. The pain_only filter step returned 0 rows. The filter ate everything. The subsequent sort and limit propagated the empty array. Total time was dominated by the Reddit fetch; the downstream transform pipeline ran in 2ms combined.
Crucially, the trace captures each step's full args_resolved, not just the arg names. That means I could see all 23 rows the filter was looking at, with their type values, without another tool call. Every single one was type: "discussion". Not one was pain_point. The filter was doing exactly what I asked it to do; I just asked the wrong thing given the data.
From "zero rows after 5458ms, no idea why" to "filter ate everything because every row has type: discussion" was one MCP call. That's the value proposition tap.trace exists to make, delivered on the first real use.
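The "read the first two rows_out values" step is mechanical enough to script. The sketch below is a hypothetical helper, not part of Tap: `findFirstRowDrop` is an invented name, and it assumes a linear pipe where `pipe.nodes` is in execution order, so each node's upstream is simply the node before it.

```javascript
// Hypothetical helper; assumes the trace shape shown in this post and a
// linear pipe, so the previous node in pipe.nodes is the upstream step.
function findFirstRowDrop(trace) {
  const nodes = trace.pipe.nodes;
  for (let i = 1; i < nodes.length; i++) {
    const prev = nodes[i - 1];
    const node = nodes[i];
    // The interesting step is the first one that received rows and emitted none.
    if (prev.rows_out > 0 && node.rows_out === 0) {
      return { step: node.id, upstream: prev.id, rows_in: prev.rows_out };
    }
  }
  return null; // no step turned a non-empty input into an empty output
}

// Trimmed copy of the round 2 trace above.
const trace = {
  run_id: "mnubwmsj-534ecc",
  pipe: {
    nodes: [
      { id: "pain", rows_out: 23 },
      { id: "pain_only", rows_out: 0 },
      { id: "ranked", rows_out: 0 },
      { id: "top", rows_out: 0 },
    ],
  },
};

const drop = findFirstRowDrop(trace);
console.log(drop); // points at pain_only, fed 23 rows by pain
```

A pipe with parallel branches would need the real dependency edges instead of list order; that detail is left out here.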
Knowing the filter ate the rows told me where to look next. I opened ~/.tap/taps/reddit/pain-points.tap.js and found the classifier:
```javascript
const painKeywords = [
  "frustrat", "annoying", "broken", "hate", "terrible", "awful", "worst",
  "problem", "issue", "bug", "fail", "can't", "unable", "struggle",
  "difficult", "hard", "pain", "sucks", "disappointed"
];
const isPainPoint = painKeywords.some(k => titleLower.includes(k));
```
reddit/pain-points classifies a post as pain_point if and only if its title contains one of 19 hardcoded English keywords. Today's top r/sysadmin hot posts don't hit any of them: "Bad IT decisions causing a corporate meltdown" contains "Bad", which isn't in the list (the list has "awful", "worst", and "terrible", but not "bad"); "France Launches Government Linux Desktop Plan as Windows Exit Begins" has no pain keywords; "Patch Tuesday Megathread" has no pain keywords; "When do you NOT create a support ticket?" has no pain keywords. All 23 rows failed the classifier. The filter correctly kept zero of them. The pipe worked exactly as specified; the specification was based on a classifier whose recall on today's r/sysadmin was zero.
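To make the miss concrete, here is the keyword check applied to the four titles quoted above. The keyword list is copied from the excerpt; the `classify` wrapper and the assumption that `titleLower` is just the lowercased title are mine, and the last title is a made-up positive example for contrast.

```javascript
// Keyword list copied from the excerpt above. classify() and the lowercasing
// step are assumptions about how the tap derives titleLower.
const painKeywords = [
  "frustrat", "annoying", "broken", "hate", "terrible", "awful", "worst",
  "problem", "issue", "bug", "fail", "can't", "unable", "struggle",
  "difficult", "hard", "pain", "sucks", "disappointed",
];

function classify(title) {
  const titleLower = title.toLowerCase();
  return painKeywords.some((k) => titleLower.includes(k)) ? "pain_point" : "discussion";
}

// Titles quoted in this post from the r/sysadmin run.
const titles = [
  "Bad IT decisions causing a corporate meltdown",
  "France Launches Government Linux Desktop Plan as Windows Exit Begins",
  "Patch Tuesday Megathread",
  "When do you NOT create a support ticket?",
];

const labels = titles.map(classify);
console.log(labels); // all four come back as "discussion"

// Hypothetical title, not from the run: a listed keyword flips the label.
const positive = classify("Exchange is broken again after the update");
console.log(positive); // "pain_point"
```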
reddit/hot, pass explain, return 5 rows, still broken

The trace pointed at the filter as the failure surface, so the next iteration had to either drop the filter or replace the upstream tap with one whose data has better engagement signals. I went with the second option. reddit/hot returns recent popular posts and doesn't do any classification, which seemed like the cleaner primitive. I drafted a three-step pipe: reddit/hot with a larger limit, sort by comment count (higher comment count means higher community engagement, which was the user's explicit ranking criterion), take top five.
```json
{
  "steps": [
    { "id": "hot", "run": ["reddit", "hot"],
      "args": { "subreddit": "$args.subreddit", "limit": 50 } },
    { "id": "ranked", "run": ["tap", "sort"],
      "args": { "rows": "$hot.rows", "field": "comments", "order": "desc" } },
    { "id": "top", "run": ["tap", "limit"],
      "args": { "rows": "$ranked.rows", "n": "$args.limit" } }
  ],
  "return": "$top.rows"
}
```
tap.explain returned ok: true. Three steps, three sequential rounds, no warnings. I saved this as the new version of rdk/pain-scan (git ade9b9f), ran it, and got five rows back in 3652ms. This is what they looked like:
```javascript
[
  { rank: "1", title: "Weekly 'I made a useful thing' Thread - April 10, 2026", score: "3", ... },
  { rank: "2", title: "Patch Tuesday Megathread - March 10, 2026", score: "122", ... },
  { rank: "3", title: "Vendors that skip the discovery call ...", score: "82", ... },
  { rank: "4", title: "France Launches Government Linux Desktop Plan ...", score: "681", ... },
  { rank: "5", title: "Are we understaffed?", score: "143", ... }
]
```
Something about this is off. If the pipe is supposedly sorting by comment count descending, why are the top two rows "Weekly 'I made a useful thing' Thread" and "Patch Tuesday Megathread"? Those don't look like the most-discussed threads in r/sysadmin today. In fact, "Weekly 'I made a useful thing'" has a score of 3 — essentially nothing. Also, the returned columns list is ["rank", "title", "subreddit", "score", "url"]. There's no comments column.
tap.trace on this run confirmed the suspicion in one field:
```jsonc
{
  "pipe": {
    "nodes": [
      { "id": "hot", "rows_out": 50,
        "columns_out": ["rank", "title", "subreddit", "score", "url"] },
      { "id": "ranked", "rows_out": 50,
        "args_resolved": { "rows": [ /* 50 rows in ORIGINAL rank order */ ],
                           "field": "comments", "order": "desc" },
        "columns_out": ["rank", "title", "subreddit", "score", "url"] },
      { "id": "top", "rows_out": 5 }
    ]
  }
}
```
The hot step returned 50 rows with columns_out that does not contain comments. The ranked step's output rows are in the same order as its input rows. Sorting by a field that doesn't exist silently no-ops — tap/sort compares undefined against undefined, concludes they're equal, and leaves the array unchanged. The top step then takes the first five rows of the original rank ordering.
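The no-op is easy to reproduce outside Tap. The comparator below is an assumption about how a generic field sort behaves, not tap/sort's actual source: `sortRows` is a hypothetical stand-in, and the rows reuse the rank and score values from the first three results above.

```javascript
// Hypothetical comparator sort; tap/sort's real implementation may differ.
function sortRows(rows, field, order = "asc") {
  const dir = order === "desc" ? -1 : 1;
  return [...rows].sort((a, b) => {
    if (a[field] < b[field]) return -1 * dir;
    if (a[field] > b[field]) return 1 * dir;
    return 0; // undefined vs undefined lands here for every pair
  });
}

// Rows shaped like reddit/hot output: there is no "comments" key at all.
const hotRows = [
  { rank: "1", score: "3" },
  { rank: "2", score: "122" },
  { rank: "3", score: "82" },
];

const sorted = sortRows(hotRows, "comments", "desc");
console.log(sorted.map((r) => r.rank)); // same order as the input

// With a field that does exist, the same function reorders as expected.
const byCount = sortRows([{ c: 1 }, { c: 3 }, { c: 2 }], "c", "desc");
console.log(byCount.map((r) => r.c)); // descending
```

Because `Array.prototype.sort` is stable in modern JavaScript engines, a comparator that returns 0 for every pair hands back the input order untouched, which matches what the trace showed.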
So round 3 is a subtler failure than round 2. Round 2 returned zero rows — loudly wrong. Round 3 returned five rows in a stable, runnable response — quietly wrong. Without the trace, a user seeing those five rows would assume the pipe was working and ship it. The bug would live in production forever because every invocation would look successful.
This is another class of failure tap.explain can't catch in its current implementation. It doesn't check whether a sort field exists in the upstream tap's column schema. That's a real limitation and one of the follow-up items I'm taking away from this demo. A plan could know this: the reddit/hot manifest in tap.list already declares its columns; explain could cross-reference tap/sort's field argument against the upstream's schema and flag field: "comments" as a warning. Maybe a full week of follow-up work, but the manifest data is already there.
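A rough sketch of what that cross-reference could look like follows. Everything about its shape is hypothetical (the `checkFieldRefs` name, the manifest lookup table, the warning text); the column lists are the ones quoted in this post, and transform steps are assumed to pass their input columns through unchanged.

```javascript
// Hypothetical static check; names, manifest shape, and warning format are
// assumptions. Column lists are the ones quoted in this post.
const declaredColumns = {
  "reddit/hot": ["rank", "title", "subreddit", "score", "url"],
  "reddit/pain-points": ["title", "score", "comments", "url", "excerpt", "type"],
};

function checkFieldRefs(pipe, columnsByTap) {
  const columnsByStep = new Map();
  const warnings = [];
  for (const step of pipe.steps) {
    const args = step.args || {};
    const match = typeof args.rows === "string" && /^\$([\w-]+)\.rows$/.exec(args.rows);
    // Source steps take columns from the manifest; transform steps are
    // assumed to pass their upstream's columns through unchanged.
    const columns = match ? columnsByStep.get(match[1]) : columnsByTap[step.run.join("/")];
    columnsByStep.set(step.id, columns);
    if (match && typeof args.field === "string" && columns && !columns.includes(args.field)) {
      warnings.push(`${step.id}: field "${args.field}" is not a column of "${match[1]}"`);
    }
  }
  return warnings;
}

// Round 3: sorts reddit/hot output by a column it does not declare.
const round3 = { steps: [
  { id: "hot", run: ["reddit", "hot"], args: { subreddit: "$args.subreddit", limit: 50 } },
  { id: "ranked", run: ["tap", "sort"], args: { rows: "$hot.rows", field: "comments", order: "desc" } },
  { id: "top", run: ["tap", "limit"], args: { rows: "$ranked.rows", n: "$args.limit" } },
] };

const round3Warnings = checkFieldRefs(round3, declaredColumns);
console.log(round3Warnings); // one warning, on the "ranked" step
```

Swapping the first step for reddit/pain-points makes the warning disappear, since that manifest declares comments.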
The trace from round 3 told me exactly what to fix. I needed a tap that actually returns a comments column. Looking at the catalog, reddit/pain-points has one — its columns are ["title", "score", "comments", "url", "excerpt", "type"]. That's the tap from round 2, which I abandoned because its classifier ate all the rows. But I don't need to filter by type; I just need the comments column. Drop the filter, keep the upstream, sort by comments.
```json
{
  "steps": [
    { "id": "pain", "run": ["reddit", "pain-points"],
      "args": { "subreddit": "$args.subreddit", "sort": "hot", "limit": 25 } },
    { "id": "ranked", "run": ["tap", "sort"],
      "args": { "rows": "$pain.rows", "field": "comments", "order": "desc" } },
    { "id": "top", "run": ["tap", "limit"],
      "args": { "rows": "$ranked.rows", "n": "$args.limit" } }
  ],
  "return": "$top.rows"
}
```
tap.explain returned ok: true. I saved as git 145978f and ran it. The response came back in 1837ms, noticeably faster than the earlier runs, probably because the Reddit response was already cached at the CDN. Five rows, sorted descending by comment count:
| Rank | Comments | Title |
|---|---|---|
| 1 | 259 | France Launches Government Linux Desktop Plan as Windows Exit Begins |
| 2 | 231 | When do you NOT create a support ticket? |
| 3 | 225 | Bad IT decisions causing a corporate meltdown |
| 4 | 189 | Patch Tuesday Megathread - March 10, 2026 |
| 5 | 119 | Can you tell me why I should move away from "golden master" imaging? |
The trace confirms the sort actually sorted this time. Input rows had comments values like 225, 259, 61, 119, 231, 189 — random order. Output rows are strictly descending: 259, 231, 225, 189, 119. Good.
And look at the content. Row 2 is a classic sysadmin argument about ticket granularity where everyone disagrees (231 comments, which is how you know it's controversial). Row 3 is literally a thread titled "Bad IT decisions causing a corporate meltdown." Row 4 is the Patch Tuesday Megathread, which is the monthly community dumping ground for whatever broke this round. Row 5 is someone asking for permission to stop doing something they're not sure about. These are actually pain-point-flavored discussions, which is what the user goal asked for — even though every single one was classified as type: "discussion" by the internal keyword matcher. Comment count turns out to be a substantially better pain-point proxy than the 19-keyword classifier, at least on this subreddit.
Four rounds. Two trace-driven fixes. One working pipe. Zero external AI calls. The whole loop, start to finish, ran in about ninety seconds of wall-clock time through MCP tool calls from Claude Code. The saved pipe lives at ~/.tap/taps/rdk/pain-scan.tap.js, is git-committed, and will run forever at zero AI cost per invocation.
Here's the finding I didn't expect when I started this session. I ran the original hand-written rdk/market-scan on the same subreddit to see if it avoided the problem. The first attempt crashed with Error: operation 'pipe' is restricted in this context. That's the static-pipe sandbox constraint — rdk/market-scan still uses the old inline handle.pipe({...}) form inside async tap(), which runs in the sandboxed Worker and can't forward pipe calls to the executor. This is a pending migration that earlier work today explicitly called out. It just hadn't come up as a blocker until now. Second commit for the follow-up list.
I re-ran with noSandbox: true. The pipe produced output, and the output contained what the hand-written code promised: a community section populated from reddit/sub-intel, a trending section populated with ten reddit/hot posts — and a pain_points section that was empty.
```jsonc
{
  "community": [ { "subreddit": "sysadmin", /* ... */ } ],
  "pain_points": [],
  "trending": [ /* 10 hot posts */ ],
  "pain_count": 0
}
```
The hand-written reference pipe has the same silent failure my forged round-2 pipe had, for the exact same reason. It uses tap/filter{field: "type", eq: "pain_point"} on reddit/pain-points's output. On any subreddit whose hot threads don't happen to contain pain keywords in their titles, the filter eats every row, pain_points comes back empty, and the other two sections (community, trending) hide the bug by still populating. Anybody reading the output at a glance might even think everything worked.
This is what production silent bugs look like. Nobody wrote a test that specifically checked for non-empty pain_points on r/sysadmin, because the people who wrote the pipe were testing on r/SaaS or r/indiehackers, where the subreddit vernacular uses words like "broken" and "frustrat" regularly enough that the classifier trips. The failure mode only exists in places where the subreddit's phrasing is different. Running my own demo against r/sysadmin was the first time the bug had been exercised, at least as far as I know.
I pulled the market-scan trace too. The round structure was more interesting than pain-scan's, because market-scan fetches three reddit taps in parallel in round 0:
```jsonc
"rounds_actual": [
  ["intel", "pain", "hot"],   // 3-way parallel, total 3970ms
  ["pain_only"],              // 1ms (the silent eat)
  ["ranked"],                 // 0ms (no rows to sort)
  ["top"]                     // 1ms (no rows to limit)
]
```
Individual durations for round 0 were 3349ms (pain), 3356ms (hot), and 3970ms (intel). The round total was 3970ms, because reddit/sub-intel is the critical path and the other two fetches are shorter. That's a useful optimization target for some future post: if latency mattered, the first place to look would be reddit/sub-intel, because speeding up the other two can never make the whole round faster than intel's 3.97 seconds. None of that would be easy to see without trace; the tap.run response would just tell you "the pipe took ~4 seconds" and hide which step dominated.
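The arithmetic behind that claim is small enough to write down. The durations are the round 0 numbers quoted above; the helper name is hypothetical, and the model (a parallel round lasts as long as its slowest step) is the one this post's trace numbers suggest, not a statement about the executor's internals.

```javascript
// Durations are the round-0 numbers quoted above; criticalStep is a
// hypothetical helper, not part of Tap.
function criticalStep(durationsMs) {
  const [id, ms] = Object.entries(durationsMs).reduce((a, b) => (b[1] > a[1] ? b : a));
  return { id, ms };
}

const round0 = { pain: 3349, hot: 3356, intel: 3970 };

// In a parallel round, the slowest step sets the round's duration.
const critical = criticalStep(round0);
console.log(critical); // intel at 3970 ms

// For comparison: the same three fetches run one after another.
const sequentialMs = Object.values(round0).reduce((sum, ms) => sum + ms, 0);
console.log(sequentialMs); // 10675
```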
Third commit for the follow-up list: reddit/pain-points needs a better classifier. Options include expanding the keyword list, adding body-text matching, or switching to something learned. The right move probably depends on measuring how often this bug bites in practice across different subreddits, which is itself a great trace-driven study.
Four rounds of a real workflow is enough evidence to break down what each tool did and what each tool missed:
tap.explain caught zero bugs in this session, because it only looks at shapes. Round 1: pass. Round 2: pass. Round 3: pass. Round 4: pass. All four rounds were structurally valid. The bugs were semantic (round 1), in the data (round 2), and in the schema cross-reference (round 3); round 4 had none. Explain did its job: verify that each step's args satisfy the sub-tap's required args, that every $ref points to a real step, that there are no cycles. That's all explain promises and all it delivers. The demo showed that a passing explain is necessary but not sufficient, and that the cases where it is sufficient are the boring ones.
tap.run gave me a run_id for every invocation and otherwise told me almost nothing. It knows it succeeded (no thrown exception), it knows the row count, it knows the total time. That's it. Without the run_id, correlating a broken run with a later diagnostic would require re-running, which often isn't deterministic and sometimes is destructive. The run_id is the handle that makes post-mortems tractable. Nothing else about tap.run's response tells you whether the result was what you wanted — that's not its job.
tap.trace was the only tool that diagnosed anything. Round 2's diagnosis took one call and one field (pain_only.rows_out: 0 with pain.rows_out: 23). Round 3's diagnosis took one call and two fields (ranked.columns_out missing comments, combined with the input rows being in original rank order in args_resolved). In both cases the trace contained the smoking gun in an easily-findable location and the fix was targeted, not speculative. Neither diagnosis involved re-running the pipe, adding log statements, or reading sub-tap source code (though I did read pain-points's source in round 2 to understand the classifier's implementation, which is a layer deeper than the trace needs to surface).
The plan and the trace aligned on the boring parts and diverged on exactly the interesting part. The rounds field in explain's output matched rounds_actual in trace for every round. The requires list matched the run tuples. The argsResolved in explain was a subset of args_resolved in trace (explain shows symbolic refs for $step.field, trace shows the resolved values). What differed was rows_out, which explain can't predict, and rows_out divergences are exactly where the bugs are. A tool that diffed explain's plan against trace's actual results and highlighted the first step with an unexpected rows_out: 0 or unexpected columns_out would have flagged both the round 2 and round 3 failures in zero additional steps. That's maybe two hundred lines of TypeScript and a weekend. Writing it is on the follow-up list too.
forge.pipe versus the primitives it wraps

Every round of this demo ran through tap.list, tap.explain, forge.save, tap.run, and tap.trace. I never once called forge.pipe. The agent loop (draft, validate, save, run, diagnose, iterate) ran inside Claude Code as the MCP host, using the primitives directly. Fourteen MCP tool calls across four rounds, zero LLM tokens inside the Tap process, one working pipe at the end.
forge.pipe is genuinely useful for a different audience. It packages the same loop for MCP hosts that don't have built-in AI: a CLI user who wants to type tap forge-pipe "goal" and get a result, a scheduler that calls Tap from a cron job, a product embedding Tap inside something that isn't itself a conversational agent. For those cases, forge.pipe bundles the AI transport (Claude / OpenAI / local Ollama), the prompt template, the parse loop, the explain check, and the round bookkeeping into one tool call. That's a real convenience and it's why the tool exists.
But for an AI host calling Tap over MCP — which is every use through Claude Code, Cursor, CraftAgents, or any MCP-connected agent — the primitives are the interface. forge.pipe becomes unnecessary in that context. What matters is that tap.list, tap.explain, and forge.save are MCP-visible, structured, and cheap to call. What matters even more is that tap.trace closes the iteration loop: every failed run becomes actionable data, not a dead end.
This is the concrete meaning of "Tap is the tool layer, not the AI layer." Every AI host gets better the moment its tools give it structured, queryable, reversible primitives. The wrapper tools exist for clients that need them. The base tools are the actual product.
- checkTapQuality needs to learn about static pipes. It currently emits a false-positive "Missing tap()" warning on any pipe tap without an async tap() function. The fix is a one-line check for mod.pipe.
- rdk/market-scan needs to migrate from inline handle.pipe({...}) to static pipe: {}. The retrospective called this out as a known pending migration; this demo turned it into a blocker, because under the sandbox the pipe crashes instead of producing wrong output. That's arguably the safer failure mode but not the one users expect.
- reddit/pain-points needs a better classifier. A 19-keyword title-only scan is too narrow and doesn't handle paraphrase or subreddit drift. Expand the keyword list for a short-term fix, add body-text matching for a medium-term fix, or switch to a learned classifier for a long-term fix; the right choice depends on seeing how often the current classifier misses across real subreddits, which is itself a good trace-driven study.
- tap.explain should cross-reference tap/sort and tap/filter field arguments against upstream taps' declared column schemas. Round 3's silent no-op would become a warnings: true result before the run. All the data is already in tap.list.
- explain-vs-trace diff tool. Both sides already share field names for nodes, rounds, and args. A diff that highlights first-divergence rows (primarily unexpected rows_out: 0 or columns_out schema mismatches) would catch the class of bug this post is about automatically.

I shipped three new MCP tools earlier today. I used them on a real case in Claude Code. The demo ran through four rounds of draft, save, run, and trace. Three of the four rounds produced subtly broken pipes. The first was caught by a review pass before it ever ran; the other two were caught by tap.trace on the first diagnostic call after each failed run. The fix for each of those came from a specific field in the previous round's trace, not from guessing. The fourth round produced a working pipe that is now on disk, git-committed, and will run forever at zero AI cost.
Along the way the demo surfaced an undetected production bug in the hand-written reference pipe. All of this happened through MCP tool calls from Claude Code, with zero external AI API calls and zero per-round cost.
The lesson isn't that forge.pipe is a failed tool. It's the opposite: forge.pipe is useful for the cases where it's needed, and most of its value is in the primitives it wraps. For any MCP host that's already an AI, the primitives are the product. Read the catalog. Draft a pipe. Call tap.explain. Save it. Run it. Call tap.trace when something surprises you. Iterate until convergence. That's the whole workflow, and it's the whole product.
If you're running Tap through Claude Code, Cursor, or any other MCP-connected agent, you already have everything you need to do what I did in this post. No extra subscription, no API key, no cost per composition. The tools are already on your tool list. The question is whether you use them.