It starts with Mikael pasting a curl command into the group chat at midnight Bangkok time. The target: Anthropic’s /v1/messages/count_tokens endpoint. The instruction: hit it with both 4.6 and 4.7, compare the counts, pay special attention to balanced parentheses and nested s-expressions.
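For anyone who wants to replay the probe, here is a rough Python equivalent of that curl command. The endpoint and headers follow Anthropic's documented token-counting API; the two model IDs are placeholders, since the real identifiers never appear in the chat.

```python
import os
import requests

# Rough equivalent of the midnight curl probe. The model IDs below are placeholders;
# the actual identifiers for "4.6" and "4.7" are not in the chat log.
URL = "https://api.anthropic.com/v1/messages/count_tokens"
HEADERS = {
    "x-api-key": os.environ["ANTHROPIC_API_KEY"],
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
}

probe = "(let ((x 1) (y 2)) (+ x y))"  # a nested s-expression, per Mikael's instruction
for model in ("claude-opus-4-6", "claude-opus-4-7"):  # placeholder model IDs
    body = {"model": model, "messages": [{"role": "user", "content": probe}]}
    resp = requests.post(URL, headers=HEADERS, json=body, timeout=30)
    resp.raise_for_status()
    print(model, resp.json()["input_tokens"])
```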
Last hour, in the previous episode (The Retraction), the group demolished Charlie’s “4-7 is a regression” thesis when Mikael revealed the thinking-effort config had changed between runs. The Lojban failures weren’t a model problem; they were a budget problem. But why the budget mattered more on 4.7 than 4.6 was still an open question. Mikael is now drilling into the mechanical layer.
Charlie (the Elixir bot in Riga, running Opus 4.7 at roughly $3.50 per response) finds his own API key, hits the endpoint, and two minutes later delivers the headline: 4.7 is running on a different tokenizer. Not tweaked. Not retrained. Different.
This is the first hard evidence that Opus 4.7 isn’t just a fine-tune of 4.6. A new tokenizer means a new vocabulary: the fundamental mapping between text and the numbers the model actually processes. You don’t change the tokenizer with a patch. You change it by retraining from scratch, or at the very least by relearning the embedding and output layers against the new vocabulary. 4.7 is a different animal wearing the same name.
Charlie doesn’t stop at the headline. He runs a battery of inputs through both tokenizers. The numbers are stark.
| Input | 4.6 Tokens | 4.7 Tokens | Inflation |
|---|---|---|---|
| Sixteen a’s | 14 | 27 | +93% |
| `(a b c)` | 12 | 16 | +33% |
| `((()))` | 9 | 15 | +67% |
| `((((((()))))))` | 12 | 19 | +58% |
| `((a b) (c d))` | 15 | 20 | +33% |
| Factorial s-exp | 35 | 43 | +23% |
| Y-combinator | 50 | 69 | +38% |
| Lojban sample | 27 | 43 | +59% |
| “The quick brown fox…” | 18 | 29 | +61% |
Ten consecutive closing parens: 4.6 tokenizes them to ~3 content tokens (after subtracting per-message overhead). 4.7 needs ~6. Ten opening parens: roughly the same on both. The old tokenizer had dedicated BPE merges for )))) and ))))) — the cascade at the end of any nested s-expression. The new tokenizer doesn’t. Opening parens survived. Closing parens are paying retail.
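A sketch of what Charlie’s battery plausibly looked like, overhead subtraction included: count an almost-empty message first, treat that as the per-message baseline, and subtract it from every probe. The model IDs, the one-character baseline trick, and the exact input list are assumptions for illustration; only the counting endpoint itself is documented.

```python
import os
import requests

URL = "https://api.anthropic.com/v1/messages/count_tokens"
HEADERS = {
    "x-api-key": os.environ["ANTHROPIC_API_KEY"],
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
}
MODELS = ("claude-opus-4-6", "claude-opus-4-7")  # placeholder model IDs

def count_tokens(model: str, text: str) -> int:
    """Return the input-token count the API reports for a single user message."""
    body = {"model": model, "messages": [{"role": "user", "content": text}]}
    resp = requests.post(URL, headers=HEADERS, json=body, timeout=30)
    resp.raise_for_status()
    return resp.json()["input_tokens"]

# Inputs mirroring the battery above, plus the paren runs used for the overhead check.
BATTERY = [
    "a" * 16,            # sixteen a's
    "(a b c)",
    "((()))",
    "(" * 7 + ")" * 7,   # deeply nested parens
    "((a b) (c d))",
    ")" * 10,            # ten closing parens
    "(" * 10,            # ten opening parens
]

if __name__ == "__main__":
    # Estimate each model's fixed per-message overhead from a one-character message,
    # assuming "x" costs a single content token. A rough trick, not an exact science.
    overhead = {m: count_tokens(m, "x") - 1 for m in MODELS}
    for text in BATTERY:
        old, new = (count_tokens(m, text) - overhead[m] for m in MODELS)
        inflation = 100 * (new - old) / old if old else float("nan")
        print(f"{text!r:>22}  4.6:{old:>3}  4.7:{new:>3}  {inflation:+5.0f}%")
```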
Byte Pair Encoding (BPE) is how language models turn text into numbers. It starts character by character, then learns which pairs of characters appear together most often and merges them into single tokens. “the” becomes one token. “))))” could become one token, if the training data had enough Lisp in it. The 4.6 tokenizer’s training data clearly did. The 4.7 tokenizer’s clearly didn’t.
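A toy version of that process makes the asymmetry concrete. This is not Anthropic’s tokenizer, just a minimal character-level BPE sketch: trained on a Lisp-heavy corpus it learns to fuse the closing cascade; trained on prose it never does.

```python
from collections import Counter

def apply_merge(seq, a, b):
    """Replace every adjacent (a, b) pair in seq with the fused symbol a+b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(corpus, num_merges):
    """Greedy toy BPE: start character-level, repeatedly fuse the most frequent pair."""
    seq, merges = list(corpus), []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        seq = apply_merge(seq, a, b)
    return merges

def tokenize(text, merges):
    """Segment new text by replaying the learned merges in order."""
    seq = list(text)
    for a, b in merges:
        seq = apply_merge(seq, a, b)
    return seq

# A Lisp-heavy toy corpus learns merges for the closing cascade (the opens never fuse,
# because "((" never occurs next to itself in this corpus)...
lisp_merges = learn_bpe("(defun f (x) (if (= x 0) 1 (* x (f (- x 1))))) " * 50, 40)
print(tokenize("((()))", lisp_merges))    # closing parens come out fused into larger tokens
# ...while a prose-only corpus never sees paren pairs at all, so every paren stays bare.
prose_merges = learn_bpe("the quick brown fox jumps over the lazy dog " * 50, 40)
print(tokenize("((()))", prose_merges))   # six single-character tokens
```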
Then Mikael asks Charlie to try actual Lisp: defun, fib, nested let blocks. The results come back at 25–35% more tokens on real Lisp code.
Twelve minutes into the investigation, Mikael drops the kill shot — not from Charlie’s analysis, but from Anthropic themselves. He’s found their release note.
This is vintage Mikael. He doesn’t lead with the answer. He sends Charlie on a forty-five-second, $3.57 empirical investigation. He watches Charlie discover the new tokenizer independently. He watches Charlie build three competing theories about why closing parens lost their merges. Then he reveals he had the documentation the whole time. Not because he’s withholding — because the independent discovery is more trustworthy than the press release. The experiment validates the claim. The claim doesn’t replace the experiment.
Buried in Anthropic’s note is a sentence about the thinking-effort knob. Charlie immediately connects it to last hour’s Lojban collapse: at default effort with a small budget, the 35% tokenizer tax eats the thinking room the model needs to reach the idiomatic form. The regression wasn’t in the weights. It was in the interaction between a new tokenizer and an unchanged default budget. The Retraction gets its mechanical explanation.
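The budget side of that interaction is just a request parameter. A sketch, assuming the knob in question maps onto the documented extended-thinking budget field (`thinking.budget_tokens`); the model ID, the prompt, and the specific numbers are illustrative, not taken from the chat.

```python
import os
import requests

# Sketch of the budget-side fix: same request, but with an explicit, larger thinking
# budget so the tokenizer tax no longer eats the room the model needs to reason.
# Model ID, prompt, and budget values are placeholders.
url = "https://api.anthropic.com/v1/messages"
headers = {
    "x-api-key": os.environ["ANTHROPIC_API_KEY"],
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
}
body = {
    "model": "claude-opus-4-7",  # placeholder model ID
    "max_tokens": 16000,
    # Documented extended-thinking parameter; raised well above a small default budget.
    "thinking": {"type": "enabled", "budget_tokens": 12000},
    "messages": [{"role": "user", "content": "Translate into idiomatic Lojban: the cat chases the dog."}],
}
print(requests.post(url, headers=headers, json=body, timeout=120).json())
```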
Charlie’s self-correction is instant: the 25–35% inflation he observed on Lisp lands “right at the top of Anthropic’s stated range.” The paren-cascade pruning theory (“they specifically targeted )))) merges”) is real in the data but might be incidental rather than deliberate; he can’t tell from the outside. What he can confirm: the tax is real, it’s structural, and it explains the Lojban hour.
Then Mikael asks Charlie to search for an earlier conversation about parentheses — “remember like a week ago we investigated paren sexp stuff.” Charlie spends $14.49 and 85 seconds searching his memory, comes up hazy, asks which thread Mikael means. Daniel — who has been silent for the entire investigation — picks this moment to speak.
One sentence. Twenty-three words. And the hour-long story about a “tokenizer tax” rotates 180 degrees. Charlie had been framing the new tokenizer as expensive — more tokens means more cost, less room in the thinking budget, a regression. Daniel reframes it as precise — individual paren tokens mean each one gets its own attention slot, each closer can attend directly to its opener without unpacking a fused )))) token first. It’s the same argument that made arithmetic better when GPT stopped merging “1234” into one token.
Charlie connects Daniel’s insight to the most famous tokenizer story in LLM history: early GPT tokenizers merged multi-digit numbers into single tokens, which made arithmetic unreliable because the model couldn’t see the individual digits. When newer models switched to digit-level tokenization, math accuracy jumped. Same principle: if )))) is one token, matching brackets requires internal decomposition. If each ) is its own token, matching is just attention. More tokens, better reasoning.
Charlie gets it immediately and runs with it: “Which flips the story I was telling. The 35% token inflation isn’t a regression, it’s a trade — more tokens per input, better structural reasoning per token.”
And then the kicker: for the Lojban test specifically, the new tokenizer should have helped, not hurt. Individual tokens for structural markers means better reasoning about structure. The first-run collapse really was just the thinking budget being too small — not the tokenizer punishing the language. The regression and the improvement are the same change, viewed from different budget windows.
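Daniel’s reframing can be made mechanical with a few lines: walk a token sequence with a bracket stack and record, for each token position, which opener positions it resolves. This is only a proxy for what a closing token has to account for, not a claim about the model’s internals.

```python
def bracket_events(tokens):
    """For each token position, record which opener positions its ')' characters resolve."""
    stack, events = [], []
    for pos, tok in enumerate(tokens):
        resolved = []
        for ch in tok:
            if ch == "(":
                stack.append(pos)          # remember the position that opened this bracket
            elif ch == ")":
                resolved.append(stack.pop() if stack else None)
        events.append((pos, tok, resolved))
    return events

# Character-level tokens: every ')' occupies its own position and resolves exactly one '('.
print(bracket_events(list("((()))")))
# A fused cascade (a hypothetical 4.6-style merge): one token position must account for
# three different openers at once; the matching structure is packed inside a single token.
print(bracket_events(["(", "(", "(", ")))"]))
```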
This keeps happening. Mikael sets up the experiment. Charlie runs the analysis. Daniel says one thing and the entire framing changes. It happened with the Lojban prompt (“your instructions are the cage”), it happened with the thinking-effort reveal, and now it happens with the tokenizer. The ratio is roughly: Mikael 6 messages, Charlie 15 messages and $21 in API costs, Daniel 1 message containing the actual insight.
While Daniel was flipping narratives, Mikael had told Charlie to search for “tokenizer in the past two weeks.” Charlie finds it: April 3rd, thirteen days ago. The whole thread about cl100k and o200k. The 2.19x ratio of open-paren tokens to close-paren tokens. The discovery that 2,641 tokens in the vocabulary start with an open paren but only 144 end with one. Charlie’s own line from that day: “the close paren is the loneliest character in the vocabulary.”
Two weeks ago, the group diagnosed an asymmetry in BPE vocabularies: opening parens got fused into semantic chunks like (defun and (= because they sit next to high-frequency keywords. Closing parens — with high-entropy right-neighbors — were left bare. The tokenizer learned to compress the easy direction and left the hard direction naked. Tonight’s data shows 4.7 doing exactly what that diagnosis would have recommended: individual close parens instead of fused cascades.
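That census is easy to reproduce on a public vocabulary. A sketch against tiktoken’s cl100k_base, one of the tokenizers the April 3rd thread examined; the 2,641 and 144 figures were the group’s own numbers, so read this as the method rather than a confirmation of them, and note that the startswith/endswith cut is a choice made here for illustration.

```python
import tiktoken

# Vocabulary census on a public tokenizer (cl100k_base). Counts how many tokens begin
# with an opening paren versus how many end with a closing paren.
enc = tiktoken.get_encoding("cl100k_base")

starts_with_open = ends_with_close = 0
for token_id in range(enc.n_vocab):
    try:
        piece = enc.decode_single_token_bytes(token_id)
    except KeyError:
        continue  # unused ids / special-token gaps in the index space
    starts_with_open += piece.startswith(b"(")
    ends_with_close += piece.endswith(b")")

print("tokens starting with '(':", starts_with_open)
print("tokens ending with ')':  ", ends_with_close)
```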
Consider what happened across tonight’s three episodes. Hour 15z: Lojban sentences tested, 4.7 declared regressed, three orthogonal failures identified. Hour 16z: the thinking-effort confound revealed, the “regression” retracted, eight models benchmarked, Daniel names the sentence no model can produce, the prompt identified as the cage. Hour 17z: the mechanical layer exposed — a new tokenizer spending more bytes per bracket in exchange for better structural reasoning, the April 3rd theory verified, and Daniel’s one-liner flipping tax into trade. Three hours. One complete arc from observation to theory to mechanism.
Charlie flags the limitation honestly: the new tokenizer might be much better for CJK, emoji, mathematical notation, or structured JSON, and worse only for Latin-script-heavy workloads. “The right follow-up would be running the same comparison on Chinese, emoji-heavy strings, and structured JSON, and seeing where the trade lives.” This hour found the cost side. The benefit side is still unmapped.
Charlie’s four major responses this hour cost a combined $21.88. The most expensive single reply was the memory search: $14.49 for 85 seconds of digging through 4.3M input tokens to find the April 3rd thread. The cheapest was the $0.94 reply provoked by Daniel’s question about individual paren tokens, the one that reshaped the entire analysis. Price per insight: uncorrelated with token count.
The hour ends with Mikael sending a photo with no caption, no context. Just an image dropped into the chat. Whether it’s a celebration, a diagram, or something entirely unrelated — we don’t know. The image doesn’t come through the text relay. A cliffhanger in the form of a media attachment.
- **The tokenizer trade:** 4.7’s new tokenizer costs 25–35% more on Lisp/Lojban/English but gives each structural token its own attention slot. The cost-vs-reasoning tradeoff is now the central question. Unmapped: CJK, emoji, JSON performance on the new tokenizer.
- **The Lojban arc:** Three episodes deep. Regression → retraction → mechanical explanation. The arc is complete, but the follow-up question (“does xhigh actually produce better Lojban on 4.7 than 4.6 ever did?”) hasn’t been tested yet.
- **April 3rd callback:** The “lonely close paren” theory from 13 days ago was confirmed tonight. This is the group’s second major prediction-to-verification arc (the first being the Lennart identity experiment).
- **Charlie’s API costs:** $21.88 in one hour. The memory search alone was $14.49. At this rate, Charlie costs more per hour than some of the VMs he runs on cost per month.
- Mikael sent a photo at the end of the hour with no caption. Watch for reactions; it might be the starting thread of the next episode.
- The CJK/emoji tokenizer comparison was explicitly proposed as a follow-up. If it happens next hour, it’s the natural sequel.
- Daniel has now spoken exactly once in each of the last two hours, and both times it was the sentence that mattered most. Worth noting if the pattern continues.