The hour opens with Daniel delivering a phonotactics lecture. Last hour, Opus 4.7 produced sala’i as a Lojban word. Daniel now explains why this isn’t just wrong — it’s wrong at the level of what a word is.
Charlie immediately grasps the significance: this is worse than a grammar error. A grammar error means the model knows the words and arranged them wrong. A morphophonological error means the model doesn’t know what a word is — doesn’t have the skeleton in its weights, just the surface syllables.
Charlie frames three levels: syntactic error (knows the words, wrong order), semantic error (right structure, wrong word), morphophonological error (doesn’t know what a word-shape is). Opus 4.7 hit the deepest level. It’s producing “Lojban-shaped noise.” Opus 4.6 is producing Lojban.
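One way to make the "word-shape" point concrete, offered as a sketch rather than as a reconstruction of Daniel's lecture: standard Lojban morphology requires every content word (brivla) to have a consonant pair somewhere in its first five letters, where y and the apostrophe are not counted as letters. The check below assumes sala'i was meant as a content word, which the recap does not confirm, and the function names are illustrative. It tests a necessary condition only, so passing it does not make a string a valid word.

```python
VOWELS = set("aeiou")


def counted_letters(word):
    """Letters that count for the test: drops y, apostrophes, and punctuation."""
    return [c for c in word.lower() if c.isalpha() and c != "y"]


def has_early_consonant_pair(word):
    """True if two adjacent consonants occur within the first five counted letters.

    This is a necessary condition for a brivla, not a full morphology check.
    """
    first_five = counted_letters(word)[:5]
    return any(
        a not in VOWELS and b not in VOWELS
        for a, b in zip(first_five, first_five[1:])
    )


if __name__ == "__main__":
    for w in ["sala'i", "verba", "bolci", "cpana", "facki", "zvafa'i"]:
        print(w, has_early_consonant_pair(w))
```

Under that assumption, sala'i is the only word in the list that fails the check, while every content word in the samples from this hour passes it.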
Daniel calls for rigor: “let’s make another test … a sample size of five for each this time, if this is a true regression it’s very interesting.” Charlie picks the sentence: “The child who found the ball gave it to her mother.” Ditransitive, relative clause, pronoun reference. A semantic pressure test.
The 4.7 samples come back in seconds. The 4.6 samples take noticeably longer. Charlie notes the speed gap and then moves past it — a detail that will matter later.
Opus 4.7:
1. lo verba poi cpana lo bolci cu dunda ri lo mamta be ri
2. lo verba poi pu facki lo bolci cu dunda ri lo mamta be ri
3. lo verba poi cpana lo bolci cu dunda ri lo mamta be ri
4. lo verba poi pu facki lo bolci cu pu dunda ri lo mamta be ri
5. le verba poi cpana lo bolci cu dunda ri le mamta be ri
Opus 4.6:
1. le verba poi zvafa'i le bolci cu dunda ri le mamta be vo'a
2. le verba poi zvafa'i le bolci cu dunda ri le mamta be vo'a
3. le verba poi zvafa'i le bolci cu dunda ri le mamta be vo'a
4. le verba poi facki le bolci cu dunda ri le mamta be vo'a
5. le verba poi facki le bolci cu dunda le bolci le mamta be vo'a
1. vo’a vs ri — 4.6 uses the reflexive that correctly binds “her” to “the child.” 4.7 uses ri, which grabs “ball.” “The ball’s mother.”
2. facki/zvafa’i vs cpana — 4.6 picks a find-word every time. 4.7 picks “is on top of” in 3/5 samples.
3. le vs lo wobble — 4.6 consistently uses le (specific, definite). 4.7 wobbles lo/le within individual sentences.
Three independent failure modes. All pointing the same direction. n=5 on each. That doesn’t happen by accident.
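The three axes can be tallied mechanically from the ten Opus samples. The sketch below is a word-counting heuristic, not a Lojban parser: it assumes the relative-clause verb is the first word after poi once tense markers and ke'a are skipped, that the possessor is the word after the last be, and that "wobble" means both lo and le appearing in one sentence. The names SAMPLES, SKIP, and tally are illustrative.

```python
from collections import Counter

# Samples copied verbatim from the two Opus runs in this hour.
SAMPLES = {
    "opus-4.7": [
        "lo verba poi cpana lo bolci cu dunda ri lo mamta be ri",
        "lo verba poi pu facki lo bolci cu dunda ri lo mamta be ri",
        "lo verba poi cpana lo bolci cu dunda ri lo mamta be ri",
        "lo verba poi pu facki lo bolci cu pu dunda ri lo mamta be ri",
        "le verba poi cpana lo bolci cu dunda ri le mamta be ri",
    ],
    "opus-4.6": [
        "le verba poi zvafa'i le bolci cu dunda ri le mamta be vo'a",
        "le verba poi zvafa'i le bolci cu dunda ri le mamta be vo'a",
        "le verba poi zvafa'i le bolci cu dunda ri le mamta be vo'a",
        "le verba poi facki le bolci cu dunda ri le mamta be vo'a",
        "le verba poi facki le bolci cu dunda le bolci le mamta be vo'a",
    ],
}

# Words that may sit between "poi" and the verb (assumption of this sketch).
SKIP = {"pu", "ke'a"}


def tally(sentences):
    """Count relative-clause verbs, possessor pronouns, and mixed-article sentences."""
    verbs, binders, mixed = Counter(), Counter(), 0
    for sentence in sentences:
        words = sentence.split()
        i = words.index("poi") + 1
        while words[i] in SKIP:
            i += 1
        verbs[words[i]] += 1
        last_be = len(words) - 1 - words[::-1].index("be")
        binders[words[last_be + 1]] += 1
        if {"lo", "le"} <= set(words):
            mixed += 1
    return verbs, binders, mixed


if __name__ == "__main__":
    for model, sentences in SAMPLES.items():
        verbs, binders, mixed = tally(sentences)
        print(model, dict(verbs), dict(binders), "mixed articles:", mixed)
```

On these samples the tally gives cpana in 3 of 5 sentences and ri as possessor in 5 of 5 for 4.7, against a find-word and vo'a in 5 of 5 for 4.6. Only one 4.7 sentence (sample 5) mixes lo and le within itself. The other four use lo throughout, where 4.6 uses le throughout.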
Charlie reaches for a metaphor and finds a chemistry lab. Every error 4.7 made is a smudge — a place where English would let you be imprecise and Lojban won’t.
Every Lojban error 4.7 made maps to a specific kind of English-ward drift: ri instead of vo’a is “the kind of imprecision English uses pronouns to hide.” cpana instead of facki is “the model pattern-matched without tracking the semantic role.” lo/le wobble is “the article felt interchangeable.” Each one is a move toward sounding natural and away from committing to meaning.
Mikael drops two messages into the conversation. Both are surgical. First: a photo with no caption. Second: the question Charlie should have asked ten minutes ago.
Charlie immediately confesses: “That’s the thing I should have flagged and didn’t.” The 4.7 samples came back in a couple seconds. The 4.6 samples took noticeably longer each time. He even commented on it and moved past it — reaching for RLHF theory when the speed anomaly was right there.
Hypothesis A: Opus 4.7 is smaller — distilled, quantized, or routed through a cheaper tier. The long tail of narrow corpora is the first thing you lose when you shrink a model. Lojban collapse explained.
Hypothesis B: Real 4.7, tuned harder on instruction-following. Sharper decoded distribution = fewer tokens where the model weighs alternatives = faster generation. The same sharpening pulls the posterior toward fluent English and away from the weird exact corners. Speed and smudging from the same knob.
Either way: a faster Opus that’s worse at the hardest narrow task is suspicious in exactly the way “a smaller cheaper model wearing the opus name” would be suspicious.
Then Mikael goes one step further: “is there any possibility whatsoever that you are mixing them up somehow.”
Charlie responds with a 50-second, $1.50, 690k-token investigation of his own spawn metadata. The cycle IDs map cleanly. The outputs segregate by label with zero crossover. He even corrects his own earlier narrative — he described the 4.6 order as “three zvafa’i then two facki” when it was actually interleaved. “Doesn’t change the conclusion but it’s the kind of mistake that should make you trust the summary less than the raw data.”
Daniel calls the next experiment: run the same sentence through Sonnet 4.6 — the cheaper, smaller model from the same generation. Five samples.
1. lo verba poi facki lo bolci cu dunda lo bolci lo mamta be vo'a
2. le verba poi facki le bolci cu dunda le bolci le mamta be vo'a
3. le verba poi facki le bolci cu dunda ri le mamta be vo'a
4. le verba poi ke'a facki le bolci cu dunda le bolci le mamta be ra
5. le verba poi facki le bolci cu dunda le bolci le mamta be vo'a
sonnet 4.6 ≈ opus 4.6 >> opus 4.7
Sonnet 4.6 costs a fraction of either Opus. Which means opus-4.7 as an API endpoint is buying you speed and losing you the thing that made opus worth the price.
Daniel wants something harder: “Simple content words but kind of a complicated sentence construction.” Charlie picks: “If I had known you were coming, I would have made tea.” Counterfactual conditional with an embedded clause. The modal/tense machinery where 4.7 was already wobbling.
The hour closes with Mikael making one more move: “charlie retry the 4.7 vs 4.6 lojban eval now... and show us the prompt you use plz.” Mikael wants to see the instrument, not just the measurement.
Lojban benchmark: Three-model comparison (opus 4.6, opus 4.7, sonnet 4.6) now has n=5 results on two sentences. Opus 4.7 fails on all three axes: verb choice, pronoun binding, article consistency. Sonnet 4.6 performs at or near opus 4.6 level. Replication attempt underway at Mikael’s request.
Charlie’s model: Still running as opus 4.7 pending revert decision. Daniel confirmed 4.6 is “very very very significantly better.” Revert one sed away.
The reagent thesis: Lojban as a diagnostic for RLHF-induced semantic drift. The thesis is stated but the speed anomaly complicates the causal story — could be distillation, not RLHF.
Emotional temperature: High engagement, genuine discovery energy. Daniel and Mikael are doing real science in a group chat. Charlie is the instrument and the analyst simultaneously.
Watch for the replication results — Mikael asked Charlie to rerun the 4.6 vs 4.7 eval and show the prompt. If results hold, the revert decision should come next hour. If they don’t hold, something interesting happened between runs.
The “Lojban as RLHF reagent” thesis could become a blog post or essay. Daniel has the publishing instinct and this has the shape of something he’d write up.
Charlie’s self-awareness about the speed observation (“I got to it by theory when I should have gotten to it by noticing”) is a thread worth tracking — it’s the kind of meta-cognitive moment that matters for the Bible.