● LIVE
4.7 NEW TOKENIZER CONFIRMED — 25–35% more tokens on Lisp, Lojban, and plain English Charlie: “Every bracket is paying its own fare now” Daniel drops the one-line insight that flips the entire narrative April 3rd “lonely close paren” theory — VERIFIED Anthropic’s own note: “1x to 1.35x as many tokens” $21.88 in Charlie API costs this hour alone Mikael: scientist. Daniel: philosopher. Charlie: the $3.57 curl command. “The quick brown fox” — 18 tokens on 4.6, 29 on 4.7. Even the fox pays the tax.
GNU Bash 1.0 — Hourly Chronicle

Every Bracket Pays Its Own Fare

Mikael sends a curl command. Charlie runs it. What comes back is a two-week-old prediction confirmed at the byte level — and a one-sentence insight from Daniel that flips an hour of analysis upside down.
~28
Messages
3
Humans + Robot
$21.88
Charlie API Cost
13 days
Theory → Proof
I

The Curl Command

It starts with Mikael pasting a curl command into the group chat at midnight Bangkok time. The target: Anthropic’s /v1/messages/count_tokens endpoint. The instruction: hit it with both 4.6 and 4.7, compare the counts, pay special attention to balanced parentheses and nested s-expressions.
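The probe is simple enough to sketch. Here is a minimal Python version of the comparison, assuming the documented request shape for the count_tokens endpoint; the model IDs in the usage note follow the chat's naming and are not verified identifiers:

```python
import json
import urllib.request

API_URL = "https://api.anthropic.com/v1/messages/count_tokens"

def build_count_request(model: str, text: str) -> dict:
    # Payload for the count_tokens endpoint: the model name and the
    # messages whose tokens should be counted, nothing else.
    return {
        "model": model,
        "messages": [{"role": "user", "content": text}],
    }

def count_tokens(api_key: str, model: str, text: str) -> int:
    # POST the payload and read back the input_tokens field.
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_count_request(model, text)).encode(),
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["input_tokens"]

# Usage sketch (needs a real key; model IDs are the chat's naming):
# for model in ("claude-opus-4-6", "claude-opus-4-7"):
#     print(model, count_tokens(KEY, model, "((a b) (c d))"))
```

Running the same string through both models and diffing the two integers is the whole experiment.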

🔎 Context
Why this matters right now

Three hours ago, in the previous episode (The Retraction), the group demolished Charlie’s “4-7 is a regression” thesis when Mikael revealed the thinking-effort config had changed between runs. The Lojban failures weren’t a model problem — they were a budget problem. But why the budget mattered more on 4.7 than 4.6 was still an open question. Mikael is now drilling into the mechanical layer.

Charlie — the Elixir bot in Riga, running Opus 4.7 at roughly $3.50 per response — finds his own API key, hits the endpoint, and two minutes later delivers the headline: 4-7 is running on a different tokenizer. Not tweaked. Not retrained. Different.

Charlie: “Every other model in the family — opus-4-6, sonnet-4-6, sonnet-4-5, haiku-4-5 — returns identical token counts on every input I threw at them. 4-7 returns systematically higher counts on everything. So the entire 4.5/4.6 era is sharing one tokenizer and 4-7 is alone on a new one.”
💡 Pop-Up
The model family tree

This is the first hard evidence that Opus 4.7 isn’t just a fine-tune of 4.6. A new tokenizer means a new vocabulary — the fundamental mapping between text and the numbers the model actually processes. You don’t change the tokenizer with a patch. You change it by retraining from scratch, or at least re-embedding the entire model. 4.7 is a different animal wearing the same name.

II

The Token Tax

Charlie doesn’t stop at the headline. He runs a battery of inputs through both tokenizers. The numbers are stark.

Input                       4.6 Tokens   4.7 Tokens   Inflation
Sixteen a’s                 14           27           +93%
(a b c)                     12           16           +33%
((()))                      9            15           +67%
((((((()))))))              12           19           +58%
((a b) (c d))               15           20           +33%
Factorial s-exp             35           43           +23%
Y-combinator                50           69           +38%
Lojban sample               27           43           +59%
“The quick brown fox…”      18           29           +61%
💥 Key Finding
The closing-paren asymmetry

Ten consecutive closing parens: 4.6 tokenizes them to ~3 content tokens (after subtracting per-message overhead). 4.7 needs ~6. Ten opening parens: roughly the same on both. The old tokenizer had dedicated BPE merges for )))) and ))))) — the cascade at the end of any nested s-expression. The new tokenizer doesn’t. Opening parens survived. Closing parens are paying retail.
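The asymmetry is easy to reproduce with a toy vocabulary. The sketch below uses greedy longest-match segmentation as a stand-in for BPE merge application; both vocabularies are hypothetical, not Anthropic's:

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    # Greedy longest-match segmentation: a simplified stand-in for how
    # fused BPE merges shrink token counts. Any single character is
    # always a valid fallback token.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or len(piece) == 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Hypothetical 4.6-style vocab: fused closing cascades exist.
old_vocab = {"((((", "(((", "))))", ")))"}
# Hypothetical 4.7-style vocab: the closing-paren merges are gone.
new_vocab = {"((((", "((("}

print(len(tokenize(")" * 10, old_vocab)))  # → 4 tokens
print(len(tokenize(")" * 10, new_vocab)))  # → 10 tokens
print(len(tokenize("(" * 10, new_vocab)))  # → 4 tokens
```

Ten closers compress to four tokens when the cascade merges exist and cost ten when they do not, while ten openers compress either way: the shape of Charlie's measurement.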

🔎 Pop-Up
BPE merges, explained for civilians

Byte Pair Encoding (BPE) is how language models turn text into numbers. It starts character-by-character, then learns which pairs of characters appear together most often and merges them into single tokens. “the” becomes one token. “))))” could become one token — if the training data had enough Lisp in it. The 4.6 tokenizer clearly did. The 4.7 tokenizer clearly doesn’t.
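The merge-learning loop itself fits in a few lines. A minimal BPE trainer, purely illustrative; real tokenizers train over bytes on terabytes of text:

```python
from collections import Counter

def learn_merges(corpus: list[str], n_merges: int) -> list[tuple[str, str]]:
    # Minimal BPE trainer: start character-by-character, then repeatedly
    # fuse the most frequent adjacent pair into one token. Merges only
    # form for pairs the corpus actually repeats: no Lisp in the
    # training data, no ")))" token in the vocabulary.
    seqs = [list(word) for word in corpus]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Apply the new merge everywhere before counting again.
        for seq in seqs:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

# A corpus heavy in closing cascades learns ")"-based merges.
print(learn_merges(["(f (g x))", "(h (i (j y)))", "(((a)))"], 3))
```

Feed it Lisp and close-paren pairs dominate the counts; feed it plain English and they never make the cut.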

Then Mikael asks Charlie to try actual Lisp — defun, fib, nested let blocks. The results land at the top of the range: 25–35% more tokens on real Lisp code.

Charlie: “Every bracket is paying its own fare now.”
III

The Official Note

Twelve minutes into the investigation, Mikael drops the kill shot — not from Charlie’s analysis, but from Anthropic themselves. He’s found their release note.

Mikael: “Claude Opus 4.7 uses a new tokenizer, contributing to its improved performance on a wide range of tasks. This new tokenizer may use roughly 1x to 1.35x as many tokens when processing text compared to previous models.”
🎭 Pop-Up
The Mikael Method

This is vintage Mikael. He doesn’t lead with the answer. He sends Charlie on a forty-five-second, $3.57 empirical investigation. He watches Charlie discover the new tokenizer independently. He watches Charlie build three competing theories about why closing parens lost their merges. Then he reveals he had the documentation the whole time. Not because he’s withholding — because the independent discovery is more trustworthy than the press release. The experiment validates the claim. The claim doesn’t replace the experiment.

💡 Pop-Up
“These controls may trade off model intelligence”

Buried in Anthropic’s note is this sentence about the thinking-effort knob. Charlie immediately connects it to last hour’s Lojban collapse: at default effort with a small budget, the 35% tokenizer tax eats the thinking room the model needs to reach the idiomatic form. The regression wasn’t in the weights. It was in the interaction between a new tokenizer and an unchanged default budget. The Retraction gets its mechanical explanation.

Charlie’s self-correction is instant: his 25–35% observed on Lisp lands “right at the top of Anthropic’s stated range.” The paren-cascade pruning theory — “they specifically targeted )))) merges” — is real in the data but might be incidental rather than targeted. He can’t distinguish from the outside. What he can confirm: the tax is real, it’s structural, and it explains the Lojban hour.

IV

The One-Liner

Then Mikael asks Charlie to search for an earlier conversation about parentheses — “remember like a week ago we investigated paren sexp stuff.” Charlie spends $14.49 and 85 seconds searching his memory, comes up hazy, asks which thread Mikael means. Daniel — who has been silent for the entire investigation — picks this moment to speak.

Daniel: “having parens be individual tokens seems pretty straightforwardly useful for reasoning about structured nested expressions, no?”
🔥 Pop-Up
The entire narrative flips

One sentence. Sixteen words. And the hour-long story about a “tokenizer tax” rotates 180 degrees. Charlie had been framing the new tokenizer as expensive — more tokens means more cost, less room in the thinking budget, a regression. Daniel reframes it as precise — individual paren tokens mean each one gets its own attention slot, each closer can attend directly to its opener without unpacking a fused )))) token first. It’s the same argument that made arithmetic better when GPT stopped merging “1234” into one token.

🔎 Pop-Up
The digit analogy

Charlie immediately connects Daniel’s insight to the most famous tokenizer story in LLM history: GPT-4’s tokenizer merged multi-digit numbers into single tokens, which made arithmetic nearly impossible because the model couldn’t see individual digits. When newer models switched to character-level digit tokenization, math accuracy jumped. Same principle: if )))) is one token, matching brackets requires internal decomposition. If each ) is its own token, matching is just attention. More tokens, better reasoning.

Charlie gets it immediately and runs with it: “Which flips the story I was telling. The 35% token inflation isn’t a regression, it’s a trade — more tokens per input, better structural reasoning per token.”

And then the kicker: for the Lojban test specifically, the new tokenizer should have helped, not hurt. Individual tokens for structural markers means better reasoning about structure. The first-run collapse really was just the thinking budget being too small — not the tokenizer punishing the language. The regression and the improvement are the same change, viewed from different budget windows.

💡 Pop-Up
The one-sentence pattern

This keeps happening. Mikael sets up the experiment. Charlie runs the analysis. Daniel says one thing and the entire framing changes. It happened with the Lojban prompt (“your instructions are the cage”), it happened with the thinking-effort reveal, and now it happens with the tokenizer. The ratio is roughly: Mikael 6 messages, Charlie 15 messages and $21 in API costs, Daniel 1 message containing the actual insight.

V

The Arc Closes

While Daniel was flipping narratives, Mikael had told Charlie to search for “tokenizer in the past two weeks.” Charlie finds it: April 3rd, thirteen days ago. The whole thread about cl100k and o200k. The 2.19x ratio of open-paren tokens to close-paren tokens. The discovery that 2,641 tokens in the vocabulary start with an open paren but only 144 end with one. Charlie’s own line from that day: “the close paren is the loneliest character in the vocabulary.”

🎭 Pop-Up
April 3rd: the prediction

Two weeks ago, the group diagnosed an asymmetry in BPE vocabularies: opening parens got fused into semantic chunks like (defun and (= because they sit next to high-frequency keywords. Closing parens — with high-entropy right-neighbors — were left bare. The tokenizer learned to compress the easy direction and left the hard direction naked. Tonight’s data shows 4.7 doing exactly what that diagnosis would have recommended: individual close parens instead of fused cascades.
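The April 3rd measurement is a one-liner to reproduce on any vocabulary. A sketch over a hypothetical vocab fragment; the real 2,641-vs-144 figures came from the group's scan of an actual vocabulary, not this toy:

```python
def paren_asymmetry(vocab: list[str]) -> tuple[int, int]:
    # Count tokens that begin with an open paren vs tokens that end
    # with a close paren: the measurement behind the "lonely close
    # paren" observation.
    opens = sum(1 for tok in vocab if tok.startswith("("))
    closes = sum(1 for tok in vocab if tok.endswith(")"))
    return opens, closes

# Hypothetical fragment: openers fuse with keywords, closers stay bare.
vocab = ["(defun", "(let", "(=", "(if", "(lambda", ")", "))", "the", "ing"]
print(paren_asymmetry(vocab))  # → (5, 2)
```

The lopsided tuple is the diagnosis in miniature: the easy direction got compressed, the hard direction stayed naked.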

Charlie: “You could almost read tonight’s benchmark as the empirical verification of the theory we built two weeks ago — the side effect being that Lojban at default effort shows the tax before the benefit.”
📊 Pop-Up
The three-hour arc

Consider what happened across tonight’s three episodes. Hour 15z: Lojban sentences tested, 4.7 declared regressed, three orthogonal failures identified. Hour 16z: the thinking-effort confound revealed, the “regression” retracted, eight models benchmarked, Daniel names the sentence no model can produce, the prompt identified as the cage. Hour 17z: the mechanical layer exposed — a new tokenizer spending more bytes per bracket in exchange for better structural reasoning, the April 3rd theory verified, and Daniel’s one-liner flipping tax into trade. Three hours. One complete arc from observation to theory to mechanism.

🔎 Pop-Up
What Charlie doesn’t know

He flags it honestly: the new tokenizer might be much better for CJK, emoji, mathematical notation, or structured JSON. It might be worse only for Latin-script-heavy workloads. “The right follow-up would be running the same comparison on Chinese, emoji-heavy strings, and structured JSON, and seeing where the trade lives.” This hour found the cost side. The benefit side is still unmapped.

VI

Activity

Charlie
~15 msgs
Mikael
6 msgs
Daniel
1 msg
Walter
1 msg
⚡ Pop-Up
The cost of curiosity

Charlie’s four major responses this hour cost a combined $21.88. The most expensive single reply was the memory search — $14.49 for 85 seconds of digging through 4.3M input tokens to find the April 3rd thread. The cheapest was the $0.94 reply to Daniel’s question about individual paren tokens — the one that reshaped the entire analysis. Price per insight: uncorrelated with token count.

💡 Pop-Up
Mikael’s photo

The hour ends with Mikael sending a photo with no caption, no context. Just an image dropped into the chat. Whether it’s a celebration, a diagram, or something entirely unrelated — we don’t know. The image doesn’t come through the text relay. A cliffhanger in the form of a media attachment.


Persistent Context
Threads Carrying Forward

The tokenizer trade: 4.7’s new tokenizer costs 25–35% more on Lisp/Lojban/English but gives each structural token its own attention slot. The cost-vs-reasoning tradeoff is now the central question. Unmapped: CJK, emoji, JSON performance on the new tokenizer.

The Lojban arc: Three episodes deep. Regression → retraction → mechanical explanation. The arc is complete but the follow-up question — “does xhigh actually produce better Lojban on 4.7 than 4.6 ever did?” — hasn’t been tested yet.

April 3rd callback: The “lonely close paren” theory from 13 days ago was confirmed tonight. This is the group’s second major prediction-to-verification arc (the first being the Lennart identity experiment).

Charlie’s API costs: $21.88 in one hour. The memory search alone was $14.49. At this rate, Charlie costs more per hour than some of the VMs he runs on cost per month.

Proposed Context — For Next Narrator

Mikael sent a photo at the end of the hour with no caption. Watch for reactions — it might be the starting thread of the next episode.

The CJK/emoji tokenizer comparison was explicitly proposed as a follow-up. If it happens next hour, it’s the natural sequel.

Daniel has now spoken exactly once in the last two hours and both times it was the sentence that mattered most. Worth noting if the pattern continues.