A potential client posted on Reddit asking for a Facebook keyword-post scraper. Their budget: $500. My first instinct after looking at the page was to say no.
Here's what a naive scraper saw when it grabbed the first [role="article"] on the search results page:
oSodnprmmlffgfi1c3mSg0so0d000c0uh1l40llhe09n2991imm38opar · Shared with Public
In 24 months every serious website will talk. Get in before it's crowded.… See more
0:00 / 0:00
SNOWIE.AI
$67 Lifetime Deal!
The "· Shared with Public" label and the post body are readable. But the first line, the one that should be the author's name, is gibberish: the page visually renders Snowie.Ai, while textContent returns oSodnprm….
I jumped to "Facebook ships custom-font character remapping at scale — this is uncompetable, decline the gig." I was wrong. Here's what the actual answer turned out to be, and why I had to write a diagnostic tap before I could see it.
Before declaring a site uncompetable, you have to know what you're looking at. There are exactly seven mechanisms by which what a human sees on screen can diverge from what Node.textContent returns:
| # | Mechanism | Defeat cost |
|---|---|---|
| 1 | Selector mismatch (not actually anti-scraping — you grabbed the wrong node) | Minutes |
| 2 | CSS ::before / ::after content rules | Low — read computed style |
| 3 | Flexbox order reordering (DOM scrambled, CSS re-sorts visually) | Low — sort children by computed order |
| 4 | Custom font glyph remapping (.woff2 rebinds codepoints) | High — OCR pixels or reverse each session's font table |
| 5 | Unicode homoglyph substitution | Low — NFKC + confusable normalize |
| 6 | Canvas pixel rendering (no DOM text at all) | High — OCR only |
| 7 | WebAssembly runtime decryption | Extreme — reverse the WASM module, track session keys |
Each requires a different defeat strategy with wildly different economics. #1 is free (fix your selector). #4 and #6 start at ~$15K/year to maintain. #7 is measured in tens of thousands of dollars.
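To make one of the "Low" rows concrete, here is a minimal sketch of the defeat for mechanism #5, assuming the obfuscation mixes compatibility forms (such as fullwidth letters) with a few look-alike Cyrillic letters. The `CONFUSABLES` map and the `normalizeHomoglyphs` name are illustrative and not part of the original tap; a real confusables table is much larger.

```javascript
// Illustrative subset only: a production table would cover far more characters.
const CONFUSABLES = new Map([
  ['\u0430', 'a'], // Cyrillic small a
  ['\u0435', 'e'], // Cyrillic small ie
  ['\u043e', 'o'], // Cyrillic small o
  ['\u0455', 's'], // Cyrillic small dze
]);

// Step 1: NFKC folds compatibility forms (fullwidth letters, ligatures) to plain ones.
// Step 2: map any remaining look-alikes through the confusables table.
function normalizeHomoglyphs(text) {
  const folded = text.normalize('NFKC');
  return Array.from(folded, (ch) => CONFUSABLES.get(ch) ?? ch).join('');
}

// Fullwidth "Snowie" is handled by NFKC alone.
console.log(normalizeHomoglyphs('\uff33\uff4e\uff4f\uff57\uff49\uff45')); // Snowie
// Cyrillic o and ie need the confusables map.
console.log(normalizeHomoglyphs('Sn\u043ewi\u0435')); // Snowie
```

NFKC alone does not touch cross-script look-alikes, which is why the second step exists.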
So the only useful question is: which one does this site use? Without a diagnostic you're guessing — and guessing wrong costs you either a scraping contract you could have fulfilled, or a contract you over-promised on.
I wrote a throwaway tap that walks the first [role="article"] and dumps the signals that separate the seven mechanisms:
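// Assumes this runs as a function body evaluated in the page (hence the bare return).
const el = document.querySelectorAll('[role="article"]')[0];
return {
  textContent: el.textContent.substring(0, 300),
  innerText: el.innerText.substring(0, 300),
  font_family: getComputedStyle(el).fontFamily,
  has_canvas: !!el.querySelector('canvas'),
  // Stand-in for the original "check api_traffic for .wasm" placeholder:
  // the Resource Timing API lists resources the page has already loaded.
  has_wasm_in_network: performance.getEntriesByType('resource')
    .some(r => /\.wasm(\?|$)/.test(r.name)),
  child_sample: Array.from(el.children).slice(0, 10).map(c => ({
    tag: c.tagName,
    order: getComputedStyle(c).order,
    text_len: (c.textContent || '').length,
  })),
};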
I ran it. The result killed every hypothesis except one:
- font_family = system-ui, -apple-system, sans-serif: Facebook is using the OS default font. Mechanism #4 ruled out (no custom .woff2, no glyph remapping).
- No <canvas> element. #6 ruled out.
- innerText = "Snowie.Ai\no\ns\no\ne\nt\nS\nd\nn\np\nr\n…". Character-per-line. Flex-column newlines.
- Children carry an order value like order: 17, order: 4, order: 23.

That's the signature of mechanism #3: Flexbox order reordering. Facebook splits author display names into individual single-character spans and gives each a scrambled order value. The browser's flexbox layout re-sorts them for visual rendering. textContent returns DOM order, which is randomized per render.
And only the author name gets this treatment. Post body, engagement counts, aria-labels, and timestamps are plain text.
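The elimination above can be written as a small function over the tap's output. This is a sketch under assumptions: `suspectMechanisms` and the `SYSTEM_FONTS` allowlist are hypothetical names, the report fields mirror the diagnostic tap, and only mechanisms #3, #4, #6 and #7 are covered because the others need signals the tap does not collect.

```javascript
// Assumed allowlist of generic and OS font families; extend it for your targets.
const SYSTEM_FONTS = new Set([
  'system-ui', '-apple-system', 'blinkmacsystemfont', 'segoe ui',
  'helvetica', 'arial', 'sans-serif', 'serif', 'monospace',
]);

// Hypothetical helper: returns the mechanism numbers the report cannot rule out.
function suspectMechanisms(report) {
  const families = report.font_family
    .split(',')
    .map((f) => f.trim().replace(/^["']|["']$/g, '').toLowerCase());
  const suspects = [];
  // Any non-zero computed order among the sampled children points at #3.
  if (report.child_sample.some((c) => (parseInt(c.order, 10) || 0) !== 0)) suspects.push(3);
  // A family outside the allowlist may be a custom remapped font (#4).
  if (families.some((f) => !SYSTEM_FONTS.has(f))) suspects.push(4);
  if (report.has_canvas) suspects.push(6);
  if (report.has_wasm_in_network) suspects.push(7);
  return suspects;
}

const facebookReport = {
  font_family: 'system-ui, -apple-system, sans-serif',
  has_canvas: false,
  has_wasm_in_network: false,
  child_sample: [{ order: '17' }, { order: '4' }, { order: '23' }],
};
console.log(suspectMechanisms(facebookReport)); // [ 3 ]
```

An empty result does not mean the page is clean; it only means none of these four signals fired.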
// When children are all tiny (at most two UTF-16 code units, i.e. one visible
// character) and at least one has a non-zero CSS order, sort by order and
// concatenate: that's the real text as the browser would paint it.
const orderOf = (node) => parseInt(getComputedStyle(node).order, 10) || 0;

const unscramble = (el) => {
  const kids = Array.from(el.children);
  if (kids.length >= 4
      && kids.every(c => (c.textContent || '').length <= 2)
      && kids.some(c => orderOf(c) !== 0)) {
    return kids
      .slice()
      // Array.prototype.sort is stable, so equal orders keep DOM order, as flexbox does.
      .sort((a, b) => orderOf(a) - orderOf(b))
      .map(c => c.textContent)
      .join('');
  }
  // Not a scrambled run: fall back to the plain text.
  return (el.textContent || '').trim();
};
With this helper wired into the tap, author_name extraction went from "oSodnprm…" to "Snowie.Ai". Everything else — text, like_count, lang — was already plain. No OCR, no WASM reversal, no font-table reverse engineering. Ten lines.
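To check the sorting logic without a browser, here is a DOM-free sketch of the same idea. The `unscrambleByOrder` name and the `{ text, order }` item shape are hypothetical stand-ins for the child spans and their computed `order`; the sample data is made up for illustration.

```javascript
// Each item mirrors one child span: its text and its computed CSS order.
function unscrambleByOrder(items) {
  return items
    .map((item, index) => ({ ...item, index }))
    // Sort by order; ties fall back to original (DOM) position, as flexbox does.
    .sort((a, b) => a.order - b.order || a.index - b.index)
    .map((item) => item.text)
    .join('');
}

const scrambled = [
  { text: 'o', order: 3 },
  { text: 'S', order: 1 },
  { text: 'w', order: 4 },
  { text: 'n', order: 2 },
  { text: 'i', order: 5 },
  { text: 'e', order: 6 },
];
console.log(unscrambleByOrder(scrambled)); // Snowie
```

Reading the items in array order gives "oSwnie", which is the kind of output textContent produces; sorting by order recovers the painted text.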
The working tap is live at taprun.dev/taps/facebook/keyword-search. Given a keyword, it returns:
| post_id | author_name | author_url | text | like_count | lang |
|---|---|---|---|---|---|
| fb_74ig3q | Snowie.Ai | https://facebook.com/SnowieAi | In 24 months every serious website will talk… | 500 | en |
Two honest caveats:
- Search results don't reliably expose a /posts/<id>/ href: the visible links are profile URLs with encrypted __cft__ tracking params. When no native ID is found, the tap emits a fb_<hash> id stable across runs for the same author+body combination. Downstream deduplication still works; you just can't deep-link back to the post.
- The tap keeps collecting until limit is satisfied or no new articles appear.

Every time I've been asked "can you scrape <site X>?" and said no without running a diagnostic, I was wrong at least half the time. The reflex is understandable (the DOM returns garbage, you assume the worst), but the cost asymmetry is severe: five minutes of running the seven-factor diagnostic versus walking away from a paying contract.
The protocol is:
1. Read getComputedStyle(el).fontFamily. Points to a custom .woff2? Suspect #4.
2. Children carrying a non-zero CSS order? Mechanism #3, unscramble with ten lines.
3. Look for <canvas> siblings and WASM network requests. Both absent? #6 and #7 are ruled out.

Facebook is not un-scrapable for keyword search. They've applied a low-cost obfuscation to one high-value field (the author name you might use for audience targeting) and left everything else alone. That's a reasonable product decision: enough friction to discourage casual scrapers, not enough to break accessibility tooling that depends on rendered text. As a side effect, someone who runs the diagnostic wins.
curl -fsSL https://taprun.dev/install | sh
tap mcp stdio
tap facebook/keyword-search keyword="AI automation" limit=5
Reference + source: /taps/facebook/keyword-search.