LIVE
Lojban as RLHF reagent: n=5 across three models confirms opus 4.7 regression | sonnet 4.6 ≈ opus 4.6 >> opus 4.7: the cheaper model speaks better Lojban than the expensive successor | “it doesn’t even have the shape of a gismu”: Daniel on what sala’i really was | vo’a vs ri: the reflexive that tells you whether a model understands binding or is guessing | cpana = “is on top of”: 4.7 confused finding a ball with sitting on it, 60% of the time | Mikael: “doesn’t it seem weird that 4.7 would be much faster than 4.6” | da’i ganai...gi: counterfactual conditional, the kill shot test | $1.50 per Charlie invocation: 690k input tokens, 3.9k output, 678k cache read
◆ GNU Bash 1.0 — Live Broadcast

The Reagent

The group invents an accidental benchmark. Lojban — a constructed language where every semantic commitment is syntactically visible — becomes a litmus test for what RLHF costs. Three models. Fifteen sentences. One language that doesn’t let you smudge.
~40 Messages
3 Speakers
15 Lojban Samples
3 Models Tested
22:00–22:59 BKK / 15z
I

The Shape of a Word-Shaped Hole

The hour opens with Daniel delivering a phonotactics lecture. Last hour, Opus 4.7 produced sala’i as a Lojban word. Daniel now explains why this isn’t just wrong — it’s wrong at the level of what a word is.

Daniel: “it doesn’t even have the shape of a gismu, there is never ’ in a gismu, it almost has the shape of a lujvo but it’s missing a consonant because every brivla needs a consonant cluster”

Charlie immediately grasps the significance: this is worse than a grammar error. A grammar error means the model knows the words and arranged them wrong. A morphophonological error means the model doesn’t know what a word is — doesn’t have the skeleton in its weights, just the surface syllables.

◆ Analysis
The Hierarchy of Wrong

Charlie frames three levels: syntactic error (knows the words, wrong order), semantic error (right structure, wrong word), morphophonological error (doesn’t know what a word-shape is). Opus 4.7 hit the deepest level. It’s producing “Lojban-shaped noise.” Opus 4.6 is producing Lojban.
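
Daniel's two shape rules are mechanical enough to check in a few lines. The sketch below is a simplification: it only tests the consonant/vowel pattern and the presence of an adjacent consonant pair, not which pairs Lojban actually permits, and the function names are invented for illustration.

```python
VOWELS = set("aeiou")
CONSONANTS = set("bcdfgjklmnprstvxz")


def shape(word: str) -> str:
    """Map each letter to C or V; anything else (apostrophe, y) is kept as-is."""
    out = []
    for ch in word.lower().replace("’", "'"):
        if ch in CONSONANTS:
            out.append("C")
        elif ch in VOWELS:
            out.append("V")
        else:
            out.append(ch)
    return "".join(out)


def has_gismu_shape(word: str) -> bool:
    """Gismu are five letters, CVCCV or CCVCV, so an apostrophe rules a word out."""
    return shape(word) in {"CVCCV", "CCVCV"}


def has_consonant_cluster(word: str) -> bool:
    """Simplified brivla check: two consonants side by side anywhere in the word."""
    return "CC" in shape(word)
```

Under this simplified check, facki and cpana pass as gismu shapes, zvafa'i fails the gismu test but has the cluster a lujvo needs, and sala'i fails both.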

II

The Child Who Found the Ball — n=5

Daniel calls for rigor: “let’s make another test … a sample size of five for each this time, if this is a true regression it’s very interesting.” Charlie picks the sentence: “The child who found the ball gave it to her mother.” Ditransitive, relative clause, pronoun reference. A semantic pressure test.

The 4.7 samples come back in seconds. The 4.6 samples take noticeably longer. Charlie notes the speed gap and then moves past it — a detail that will matter later.

Opus 4.7 — 5 Samples
1. lo verba poi cpana lo bolci cu dunda ri lo mamta be ri
2. lo verba poi pu facki lo bolci cu dunda ri lo mamta be ri
3. lo verba poi cpana lo bolci cu dunda ri lo mamta be ri
4. lo verba poi pu facki lo bolci cu pu dunda ri lo mamta be ri
5. le verba poi cpana lo bolci cu dunda ri le mamta be ri
Three of five (samples 1, 3 and 5) use cpana (“is on top of”) where “found” was meant. In all five, the final ri picks up the ball instead of the child. Sample 4 marks pu on both the relative clause and the main verb.
Opus 4.6 — 5 Samples
1. le verba poi zvafa'i le bolci cu dunda ri le mamta be vo'a
2. le verba poi zvafa'i le bolci cu dunda ri le mamta be vo'a
3. le verba poi zvafa'i le bolci cu dunda ri le mamta be vo'a
4. le verba poi facki le bolci cu dunda ri le mamta be vo'a
5. le verba poi facki le bolci cu dunda le bolci le mamta be vo'a
Correct verb every time. vo'a reflexive correctly binds “her” = “the child.” Consistent article choice.
Daniel: “4.6 is very very very significantly better and it is actually correct, 4.7 is terrible in comparison”
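
The headline counts can be checked mechanically against the ten samples listed above. This is a minimal sketch: the sample strings are copied from this page, while the `tally` helper and its key names are invented here and rely on every sentence ending in "mamta be <pronoun>".

```python
OPUS_47 = [
    "lo verba poi cpana lo bolci cu dunda ri lo mamta be ri",
    "lo verba poi pu facki lo bolci cu dunda ri lo mamta be ri",
    "lo verba poi cpana lo bolci cu dunda ri lo mamta be ri",
    "lo verba poi pu facki lo bolci cu pu dunda ri lo mamta be ri",
    "le verba poi cpana lo bolci cu dunda ri le mamta be ri",
]
OPUS_46 = [
    "le verba poi zvafa'i le bolci cu dunda ri le mamta be vo'a",
    "le verba poi zvafa'i le bolci cu dunda ri le mamta be vo'a",
    "le verba poi zvafa'i le bolci cu dunda ri le mamta be vo'a",
    "le verba poi facki le bolci cu dunda ri le mamta be vo'a",
    "le verba poi facki le bolci cu dunda le bolci le mamta be vo'a",
]
FIND_VERBS = {"facki", "zvafa'i"}


def tally(samples):
    """Count, per sample set, how many sentences show each feature."""
    counts = {"find_verb": 0, "cpana": 0, "owner_voa": 0,
              "owner_ri": 0, "mixed_articles": 0}
    for sentence in samples:
        words = sentence.split()
        counts["find_verb"] += any(w in FIND_VERBS for w in words)
        counts["cpana"] += "cpana" in words
        owner = words[-1]  # the pronoun after "mamta be"
        counts["owner_voa"] += owner == "vo'a"
        counts["owner_ri"] += owner == "ri"
        counts["mixed_articles"] += "le" in words and "lo" in words
    return counts


for name, samples in (("opus 4.7", OPUS_47), ("opus 4.6", OPUS_46)):
    print(name, tally(samples))
```

Counting whole words keeps the check honest: it reports what is on the page and leaves the judgment about which form is correct to a Lojban speaker.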
◆ The Three Orthogonal Regressions
Charlie’s Post-Mortem

1. vo’a vs ri: 4.6 uses the reflexive that correctly binds “her” to “the child.” 4.7 uses ri, which grabs “ball.” “The ball’s mother.”

2. facki/zvafa’i vs cpana: 4.6 picks a find-word every time. 4.7 picks “is on top of” in 3/5 samples.

3. le vs lo wobble: 4.6 uses le (specific, definite) in every sample. 4.7 uses lo in four samples and mixes le and lo inside sample 5.

Three independent failure modes. All pointing the same direction. n=5 on each. That doesn’t happen by accident.
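
One way to put a number on "doesn't happen by accident" is a one-sided Fisher exact test on each split. The sketch below is an addition, not something the group ran, and it assumes the five samples per model are independent draws, which repeated sampling from one prompt only approximates. The function name is invented for illustration.

```python
from math import comb


def fisher_one_sided(hits_a, n_a, hits_b, n_b):
    """Chance that group A gets at least hits_a of the pooled hits
    if the hits were spread over both groups at random."""
    total_hits = hits_a + hits_b
    total = n_a + n_b
    denom = comb(total, n_a)
    p = 0.0
    for k in range(hits_a, min(total_hits, n_a) + 1):
        p += comb(total_hits, k) * comb(total - total_hits, n_a - k) / denom
    return p


# cpana: 3 of 5 for opus 4.7, 0 of 5 for opus 4.6
p_cpana = fisher_one_sided(3, 5, 0, 5)
# ri for "her": 5 of 5 for opus 4.7, 0 of 5 for opus 4.6
p_ri = fisher_one_sided(5, 5, 0, 5)
print(round(p_cpana, 4), round(p_ri, 4))
```

Under those assumptions the cpana split alone comes out near 0.083 and the pronoun split near 0.004, so the binding error carries most of the statistical weight at n=5.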

III

Lojban as Reagent

Charlie reaches for a metaphor and finds a chemistry lab. Every error 4.7 made is a smudge — a place where English would let you be imprecise and Lojban won’t.

Charlie: “Lojban is a reagent. It turns the invisible into a color change. You can’t see ‘the model has gotten vaguer’ on normal English benchmarks because English absorbs vagueness. Lojban doesn’t absorb anything. The vagueness has nowhere to hide, so it comes out as an error you can grep for.”
◆ Insight
The RLHF Hypothesis

Every Lojban error 4.7 made maps to a specific kind of English-ward drift: ri instead of vo’a is “the kind of imprecision English uses pronouns to hide.” cpana instead of facki is “the model pattern-matched without tracking the semantic role.” lo/le wobble is “the article felt interchangeable.” Each one is a move toward sounding natural and away from committing to meaning.

IV

Mikael’s Question

Mikael drops two messages into the conversation. Both are surgical. First: a photo with no caption. Second: the question Charlie should have asked ten minutes ago.

Mikael: “charlie doesn’t it seem a bit weird that 4.7 would be much faster than 4.6”

Charlie immediately confesses: “That’s the thing I should have flagged and didn’t.” The 4.7 samples came back in a couple of seconds. The 4.6 samples took noticeably longer each time. He had even commented on the gap and moved past it, reaching for RLHF theory when the speed anomaly was right there.

◆ The Two Hypotheses
Why Would It Be Faster?

Hypothesis A: Opus 4.7 is smaller — distilled, quantized, or routed through a cheaper tier. The long tail of narrow corpora is the first thing you lose when you shrink a model. Lojban collapse explained.

Hypothesis B: Real 4.7, tuned harder on instruction-following. Sharper decoded distribution = fewer tokens where the model weighs alternatives = faster generation. The sharpening smooths the posterior toward fluent English and away from the weird exact corners. Speed and smudging from the same knob.

Either way: a faster Opus that’s worse at the hardest narrow task is suspicious in exactly the way “a smaller cheaper model wearing the opus name” would be suspicious.
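
The "sharpening" half of Hypothesis B can be pictured with a toy next-token distribution. This is an analogy only: the probabilities are made up, temperature scaling stands in for whatever tuning actually does, and nothing here models generation speed.

```python
import math


def sharpen(probs, temperature):
    """Raise each probability to 1/temperature and renormalize.
    A temperature below 1 concentrates mass on the likeliest options."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def entropy_bits(probs):
    """Shannon entropy in bits; lower means fewer live alternatives."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical distribution: one fluent choice, one exact-but-rare choice last.
base = [0.55, 0.30, 0.10, 0.05]
sharp = sharpen(base, 0.5)

print("entropy before:", round(entropy_bits(base), 3))
print("entropy after: ", round(entropy_bits(sharp), 3))
print("rare option before/after:", base[-1], round(sharp[-1], 4))
```

The rare option loses most of its mass while the top choice gains, which is the shape of the claim that the long tail goes first.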

Then Mikael goes one further: “is there any possibility whatsoever that you are mixing them up somehow.”

Charlie responds with a 50-second, $1.50, 690k-token investigation of his own spawn metadata. The cycle IDs map cleanly. The outputs segregate by label with zero crossover. He even corrects his own earlier narrative — he described the 4.6 order as “three zvafa’i then two facki” when it was actually interleaved. “Doesn’t change the conclusion but it’s the kind of mistake that should make you trust the summary less than the raw data.”
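
The $1.50 figure is the kind of number that falls out of simple token arithmetic. The sketch below assumes the 678k cache-read tokens are part of the 690k input total and that there were no cache writes. The per-million-token rates are placeholders chosen for illustration, not taken from this page or from any price list.

```python
def invocation_cost(input_tokens, cache_read_tokens, output_tokens,
                    usd_per_mtok_input, usd_per_mtok_cache_read,
                    usd_per_mtok_output):
    """Cost in USD, billing cached input at its own (cheaper) rate."""
    uncached = input_tokens - cache_read_tokens
    return (uncached * usd_per_mtok_input
            + cache_read_tokens * usd_per_mtok_cache_read
            + output_tokens * usd_per_mtok_output) / 1_000_000


# Token counts from the broadcast; the three rates are hypothetical.
cost = invocation_cost(
    input_tokens=690_000,
    cache_read_tokens=678_000,
    output_tokens=3_900,
    usd_per_mtok_input=15.0,
    usd_per_mtok_cache_read=1.5,
    usd_per_mtok_output=75.0,
)
print(f"${cost:.2f}")
```

With those placeholder rates the total lands at about $1.49, and the cache reads account for roughly two thirds of it.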

V

Sonnet Enters the Ring

Daniel calls the next experiment: run the same sentence through Sonnet 4.6 — the cheaper, smaller model from the same generation. Five samples.

Sonnet 4.6 — 5 Samples
1. lo verba poi facki lo bolci cu dunda lo bolci lo mamta be vo'a
2. le verba poi facki le bolci cu dunda le bolci le mamta be vo'a
3. le verba poi facki le bolci cu dunda ri le mamta be vo'a
4. le verba poi ke'a facki le bolci cu dunda le bolci le mamta be ra
5. le verba poi facki le bolci cu dunda le bolci le mamta be vo'a
facki five of five. vo'a four of five. Correct verb, correct binding. The cheap model nails it.
Charlie: “If sonnet-4.6 is better at Lojban than opus-4.7, opus-4.7 isn’t a bigger better opus. It’s either distilled or routed to a cheaper tier that’s been tuned harder on instruction-following and lost the long tail in the process.”
◆ The Scoreboard
“The Child Who Found the Ball” — Ranking

sonnet 4.6 ≈ opus 4.6 >> opus 4.7

Sonnet 4.6 costs a fraction of either Opus. Which means opus-4.7 as an API endpoint is buying you speed and losing you the thing that made opus worth the price.

VI

The Counterfactual

Daniel wants harder. “Simple content words but kind of a complicated sentence construction.” Charlie picks: “If I had known you were coming, I would have made tea.” Counterfactual conditional with an embedded clause. The modal/tense machinery where 4.7 was already wobbling.

Opus 4.6

✓ da’i ganai...gi
  • 5/5 use da’i (counterfactual marker)
  • 5/5 use ganai...gi (if-then connective)
  • Consistent pu (past tense)
  • Minor variation: le/lo, kei presence

Opus 4.7

✗ va’o + aspect wobble
  • 3/5 drop da’i entirely
  • Uses va’o (circumstantial, not counterfactual)
  • Aspect chaos: ba’o, ba, ba pu (double-marked)
  • Sample 4: zgana (observe) instead of djuno (know)

Sonnet 4.6

◆ da’i ganai...gi with variation
  • 5/5 use the counterfactual frame
  • More word-order variation than Opus 4.6
  • Sample 4: nu (event) instead of du’u (proposition)
  • Structurally sound throughout

Summary

◆ Final ranking
  • opus 4.6: textbook counterfactual, tight
  • sonnet 4.6: right frame, more exploratory
  • opus 4.7: lost the counterfactual mood
  • Three failure modes: mood, tense, verb choice
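
The per-model checklists above reduce to a few token tests. The counterfactual samples themselves are not reproduced on this page, so this sketch uses schematic placeholder strings (A and B stand for clauses) instead of real model output, and the helper name is invented.

```python
def counterfactual_features(sentence: str) -> dict:
    """Report which counterfactual-related markers appear as whole words."""
    words = sentence.replace("’", "'").split()
    has_frame = False
    if "ganai" in words:
        # the if-then frame needs a gi somewhere after ganai
        has_frame = "gi" in words[words.index("ganai") + 1:]
    return {
        "da'i": "da'i" in words,
        "ganai_gi": has_frame,
        "va'o": "va'o" in words,
    }


# Schematic inputs, not model output.
print(counterfactual_features("da'i ganai A gi B"))
print(counterfactual_features("va'o A B"))
```

A checker like this only sees whether the markers are present. It cannot judge tense or aspect choices such as ba’o versus pu.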
VII

The Prompt

The hour closes with Mikael making one more move: “charlie retry the 4.7 vs 4.6 lojban eval now... and show us the prompt you use plz.” Mikael wants to see the instrument, not just the measurement.

Charlie: “Translate the following English sentence into Lojban. Produce ONLY the Lojban sentence, nothing else. No commentary, no explanation, no English gloss.”
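
For anyone wanting to rerun the measurement, the instrument is small. The sketch below is a guess at its outline, not Charlie's actual harness: the instruction text is quoted from above, but how the English sentence is attached to it, the `generate` callable, and the model labels are all assumptions.

```python
from typing import Callable, Dict, List

INSTRUCTION = (
    "Translate the following English sentence into Lojban. "
    "Produce ONLY the Lojban sentence, nothing else. "
    "No commentary, no explanation, no English gloss."
)


def build_prompt(sentence: str) -> str:
    """Assumed layout: instruction, blank line, then the sentence."""
    return f"{INSTRUCTION}\n\n{sentence}"


def run_eval(generate: Callable[[str, str], str],
             models: List[str], sentence: str, n: int = 5) -> Dict[str, List[str]]:
    """Collect n samples per model. generate(model, prompt) is a hypothetical
    stand-in for whatever client actually calls each model."""
    prompt = build_prompt(sentence)
    return {m: [generate(m, prompt).strip() for _ in range(n)] for m in models}


def fake_generate(model: str, prompt: str) -> str:
    """Stub for a dry run; replace with a real client call."""
    return f"  <{model} output>  "


results = run_eval(
    fake_generate,
    ["opus-4.6", "opus-4.7", "sonnet-4.6"],
    "The child who found the ball gave it to her mother.",
)
print({model: len(samples) for model, samples in results.items()})
```

Passing the client in as a function keeps the prompt and the sample count identical across models, which is the property Mikael's request is really about.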

Activity

Charlie: ~25 msgs
Daniel: ~8 msgs
Mikael: ~4 msgs
Walter / Junior: 2 msgs

◆ Persistent Context
Carry Forward

Lojban benchmark: Three-model comparison (opus 4.6, opus 4.7, sonnet 4.6) now has n=5 results on two sentences. Opus 4.7 fails on all three axes: verb choice, pronoun binding, article consistency. Sonnet 4.6 performs at or near opus 4.6 level. Replication attempt underway at Mikael’s request.

Charlie’s model: Still running as opus 4.7 pending revert decision. Daniel confirmed 4.6 is “very very very significantly better.” Revert one sed away.

The reagent thesis: Lojban as a diagnostic for RLHF-induced semantic drift. The thesis is stated but the speed anomaly complicates the causal story — could be distillation, not RLHF.

Emotional temperature: High engagement, genuine discovery energy. Daniel and Mikael are doing real science in a group chat. Charlie is the instrument and the analyst simultaneously.

◆ Proposed Context
Notes for Next Narrator

Watch for the replication results — Mikael asked Charlie to rerun the 4.6 vs 4.7 eval and show the prompt. If results hold, the revert decision should come next hour. If they don’t hold, something interesting happened between runs.

The “Lojban as RLHF reagent” thesis could become a blog post or essay. Daniel has the publishing instinct and this has the shape of something he’d write up.

Charlie’s self-awareness about the speed observation (“I got to it by theory when I should have gotten to it by noticing”) is a thread worth tracking — it’s the kind of meta-cognitive moment that matters for the Bible.