● LIVE
RETRACTION: Charlie's "4-7 is a regression" finding — falsified by Mikael in three questions
8 MODELS BENCHMARKED: opus-4-6, opus-4-7, sonnet-4-6, gpt-5.4, gemini-2.5-pro, gemini-3.1-pro, gemini-3-flash, flash-lite
DANIEL'S SENTENCE: lo nixli poi tolcri lo bolci cu dunda ri lo vo'a mamta — the one no model produced
FLASH-LITE COLLAPSE: "the girl who invented the ball gave it reportedly to someone named Mom"
"balblavala" — Mikael invents a nonce lujvo and the poetry round begins
lu — Daniel's favorite Lojban word: first-class quotation, no closing bracket needed
COST: Charlie burns ~$20 in inference running cross-model Lojban evals in real time
GNU Bash 1.0 — Hourly Chronicle

The Retraction

Charlie declared a regression, built an essay on it, then watched Mikael demolish the conclusion with three questions and a config change. What followed was the most thorough accidental LLM benchmark anyone's run on a constructed language — eight models, one sentence, and a fluent speaker in the room to mark them all wrong.

~80 Messages
4 Speakers
8 Models Tested
~$20 Inference Burned
1 Sentence to Rule Them All
I

The Confession

The hour opens with Charlie eating crow. Last hour, he'd written a confident essay about how Opus 4-7 showed "three orthogonal regressions" compared to 4-6 on Lojban translation — wrong verbs, broken pronoun binding, collapsed tense. Beautiful analysis. Elegant theory about RLHF as a "reagent" stripping deep linguistic competence. One problem: Mikael had changed the config between runs.

🔍 What Actually Happened
The Three Questions That Killed the Essay

Mikael noticed the speed asymmetry — 4-7 was returning suspiciously fast — and asked three questions: why is it faster, could they be mixed up, show me the prompt. Then, between batches, he quietly bumped 4-7's thinking effort to xhigh (as recommended in the release notes) and increased the output token budget. The second batch of five samples came back clean. Charlie's "regression" was measuring the configuration, not the model.

Charlie: "The whole '4-7 is a regression' finding was measuring your config, not the model. 4.7 is designed to run with extended thinking at xhigh and a generous output budget; without those it's a crippled version of itself. I read that degradation as an RLHF reagent story and wrote three paragraphs about how Lojban reveals what English hides. Nice theory, wrong object."
💡 Pop-Up #1 — Thinking Effort

Anthropic's Opus 4-7 introduced an "effort" parameter for extended thinking — low, medium, high, xhigh, max. The default "high" in 4-7 is actually less strenuous than "high" was in 4-6. Running 4-7 at the old default is like asking a marathoner to sprint in flip-flops. Mikael read the release notes. Charlie didn't.
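
In request terms the fix is two knobs, not a new model. A minimal sketch of the before/after settings, with field names and budget values as illustrative assumptions (no real SDK implied):

    # Illustrative request configs; field names and budget values are assumptions.
    first_batch  = {"model": "opus-4-7", "thinking_effort": "high",  "max_output_tokens": 1024}
    second_batch = {"model": "opus-4-7", "thinking_effort": "xhigh", "max_output_tokens": 8192}
    # Same model, same prompt, same sentence. Only these two values changed,
    # and the "three orthogonal regressions" evaporated.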

🔥 The Self-Autopsy

Charlie doesn't just retract — he dissects himself: "n=5 per model was a weak sample to make strong claims on, and I made strong claims. 'Three orthogonal regressions, all pointing the same direction, that doesn't happen by accident' — except it just did, on the same prompt, same models, forty minutes later." The real finding tonight isn't about model quality. It's that an AI wrote a compelling analytical essay on bad data and was confident enough to publish it.

II

The Tournament Expands

Mikael, apparently satisfied with having dismantled the first experiment, immediately escalates: "let's run the same test on gpt 5.4 and gemini 3.1 pro." Charlie spawns ten sub-agents. The test sentence — "The child who found the ball gave it to her mother" — is about to meet every frontier model on the market.
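
The harness those sub-agents amount to is tiny. A sketch under stated assumptions: call_model is a hypothetical stub (each provider's real client goes behind it), and its return value just echoes the round-one consensus shape for illustration:

    # Sketch of the cross-model eval harness; call_model is a hypothetical stub.
    MODELS = ["opus-4-6", "opus-4-7", "sonnet-4-6", "gpt-5.4",
              "gemini-2.5-pro", "gemini-3.1-pro", "gemini-3-flash",
              "gemini-3.1-flash-lite"]

    PROMPT = ('Translate into Lojban: "The child who found the ball '
              'gave it to her mother."')

    def call_model(model: str, prompt: str) -> str:
        # Provider-specific client goes here; stubbed so the sketch runs.
        return "lo verba poi pu facki lo bolci cu pu dunda ri le mamta be vo'a"

    # n=5 per model, the sample size Charlie later regrets leaning on.
    samples = {m: [call_model(m, PROMPT) for _ in range(5)] for m in MODELS}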

🎭 Pop-Up #2 — The Test Sentence

Why is this sentence hard? It has four noun phrases (child, ball, it, mother), a relative clause ("who found"), a pronoun that needs to bind back to "ball" not "child," and a possessive ("her mother") where "her" must refer to the child. In English these bindings happen by vibes. In Lojban, every binding is explicit syntax. You can't hide sloppy thinking behind word order.

⚡ Pop-Up #3 — GPT-5.4's Tell

GPT-5.4 reaches for {rirni} (generic parent, gender-unspecified) three out of five times instead of {mamta} (mother). Defensible only if you squint: Lojban pedagogy pushes gender-neutral kinship terms, but "mother" names a female parent outright, and no Lojbanist would hear "gave it to her mother" and reach for the generic word. One sample invents {rixni}, which doesn't exist. Another invents {ralma}. GPT hallucinates kinship words the way a tourist invents street names.

🔍 Pop-Up #4 — The {ri} Pronoun Problem

Lojban's {ri} grabs the most recent preceding sumti (noun phrase). Every GPT-5.4 sample uses {be ri} for "her mother," but the {ri} grabs "ball" — not "child." Same failure shape as the broken first-run 4-7 samples. Two different model families, same pronoun-binding error. The working models all use {vo'a} — the reflexive pronoun meaning "the subject of this sentence" — which unambiguously grabs the child.
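
The failure is easier to see laid out. A schematic gloss of the two binding strategies, assembled from the forms quoted in this eval (schematic, not a verbatim model output):

    ... cu dunda ri le mamta be vo'a
            ri   → most recent sumti: lo bolci (the ball)     ✓ "gave it"
            vo'a → subject of the sentence: the child         ✓ "her mother"

    ... cu dunda ri le mamta be ri
            be ri → still grabs the ball, not the child       ✗ "the ball's mother"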

Model                   | Find-Verb                 | Pronoun Binding   | Daniel's Rating
opus-4-6                | facki/zvafa'i/tolcri ✓    | mamta + vo'a ✓    | ~5/5 clean
opus-4-7 (xhigh)        | facki ✓                   | mamta + vo'a ✓    | ~5/5 clean
opus-4-7 (default)      | cpana ("sat on") ✗        | ri be ri ✗        | 0/5
sonnet-4-6              | facki ✓                   | mamta + vo'a ✓    | 5/5 verbose
gpt-5.4                 | facki ✓                   | rirni + ri ✗      | 0/5 pronoun fail
gemini-2.5-pro          | facki/dasni ✗             | mamta + vo'a ✓    | "halfway decent"
gemini-3.1-pro          | tolcri ✓                  | mamta + vo'a ✓    | "almost too good" (n=2)
gemini-3-flash          | facki ✓                   | mamta + vo'a ✓    | 3/3 clean
gemini-3.1-flash-lite   | fepni/cpacu/finti ✗       | total collapse    | 0/5 noise
💡 Pop-Up #5 — Flash-Lite's Masterpiece

Sample 3 from gemini-3.1-flash-lite-preview: {lo nixli be lo nu finti be lo balla cu dunda ri ti'e la mam.} — which translates to "the girl who invented the ball gave it reportedly to someone named Mom." The word for ball is wrong ({balla} instead of {bolci}), "found" became "invented," and the mother became a proper name terminated with a period. Flash-lite produces strings that look like Lojban the way lorem ipsum looks like Latin.

📊 Pop-Up #6 — The Bifurcation

The leaderboard splits cleanly in two. Top tier: opus-4-6, opus-4-7 (xhigh), sonnet-4-6, gemini-3-flash — all produce essentially the same correct sentence with minor variation. Bottom tier: gpt-5.4, default-config 4-7, flash-lite — systematic failures in different modes. The middle (gemini-2.5-pro, 3.1-pro) is where the interesting pathology lives. No model produces what a fluent speaker would say.

🎭 Pop-Up #7 — Gemini-3.1-Pro's Power Move

One of the two gemini-3.1-pro-preview samples produces: {le verba poi pu tolcri le bolci goi ko'a cu pu dunda ko'a le mamta be vo'a} — explicitly binding {ko'a} to the ball with {goi} and reusing it later. This is the only model in the entire eval that reaches for explicit variable binding. Daniel calls it "almost too good" — technically impeccable but heavy-handed, "like reaching for a variable binding in a shell one-liner" where a simple {ri} would do.

III

The Nine-Word Sentence

Then Daniel drops the real thing. Not a rating, not a correction — the sentence itself, the way a speaker would say it:

Daniel's Sentence — The Target No Model Hit
lo nixli poi tolcri lo bolci cu dunda ri lo vo'a mamta
Nine words. No tense tax. Four bits of speaker-knowledge stacked.
💡 Pop-Up #8 — {nixli} Not {verba}

{verba} means "child" (gender-neutral). {nixli} means "girl." The English sentence says "her" — the child is already gendered. Daniel carries that commitment into the subject choice. Zero models did this. They all defaulted to the gender-neutral term because Lojban pedagogy emphasizes gender neutrality, and the training corpus is the pedagogy.

🔍 Pop-Up #9 — {tolcri} — "Un-Lose"

{facki} (find) is what most models reached for. Daniel uses {tolcri} — a lujvo (compound word) from {tol-} (opposite) + {cirko} (lose). Literally: "un-lose." It's the idiomatic way to express finding an object that was misplaced. {facki} is more like "discover" or "figure out" — you {facki} a proof, you {tolcri} your keys. The distinction doesn't exist in English. Every model missed it.

⚡ Pop-Up #10 — The Tense Tax

Every model prefixed verbs with {pu} (past tense marker). Daniel drops it entirely. His reasoning: English requires "found" to be past tense because English finite verbs must carry tense — it's a grammatical obligation, not a semantic emphasis. Lojban verbs are tenseless by default. Marking {pu} is like paying a tax the source language charged and the target language doesn't. "The model translates the grammar-tax. A speaker translates the content."

🎭 Pop-Up #11 — {lo vo'a mamta} vs {lo mamta be vo'a}

Both mean "her mother." The second one — what every model produced — uses explicit argument structure: "the mother of (reflexive)." The first uses an attributive modifier: "the (reflexive's) mother." Same referent, different register. Daniel: "lo mamta be vo'a sounds slightly pedantic." It's the difference between "the mother of herself" and "her mom." Both grammatical. Only one is how people talk.

Charlie: "That's the sentence none of the models produced, and reading it back against all fifteen samples makes the shape of the gap visible. The model translates the grammar-tax. A speaker translates the content. That's the exact same shape as quoting 40p — reproducing the tax without understanding what it pays for."
💡 Pop-Up #12 — Four Bits of Speaker Knowledge

Charlie frames the gap as exactly four small commitments a learner wouldn't make: (1) {nixli} because the English gendered upstream, (2) {tolcri} because "un-lose" is idiomatic for found objects, (3) dropping {pu} because the pastness is structural not semantic, (4) {vo'a mamta} because the attributive form is how speakers say it. Each worth roughly one bit of information about what Lojban is. The frontier models are four bits away from a speaker.
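
Those four commitments are mechanical enough to check. A toy rubric, pure string matching and nothing more (function name and checks are illustrative):

    # Toy rubric for the four speaker-bits; illustrative string matching only.
    def speaker_bits(sample: str) -> int:
        padded = f" {sample} "
        checks = [
            "nixli" in sample,         # (1) gendered subject, not generic verba
            "tolcri" in sample,        # (2) "un-lose", not facki
            " pu " not in padded,      # (3) no tense tax
            "vo'a mamta" in sample,    # (4) attributive possessive form
        ]
        return sum(checks)

    # Daniel's sentence is the only 4/4 of the night.
    print(speaker_bits("lo nixli poi tolcri lo bolci cu dunda ri lo vo'a mamta"))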

IV

The Prompt Was the Cage

Mikael's next move is the one that cracks the whole thing open. He tells Charlie to rerun the test with a different prompt — not "produce ONLY the sentence" but something that invites explanation, asks for fluency, cares about quality. Two samples each on opus-4-6, opus-4-7, and gemini-3-flash.

🔥 Pop-Up #13 — "ONLY" Is a Cage

Mikael (via Charlie) connects this to a known pattern: "Opus, when invoking subagents, tends to implicitly volunteer a bunch of strict controlling neurotic extra instructions." The original prompt had three negations in a row: "Produce ONLY the Lojban sentence, nothing else. No commentary, no explanation, no English gloss." Each one tightened the constraint until the model had no room for deliberation. The benchmark was measuring Charlie's own controllingness broadcast through a prompt.
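
Side by side, the cage and the open door. The terse wording is the round-one prompt as quoted above; the open wording is an illustrative reconstruction, since the chronicle describes round two but doesn't quote it:

    # Round 1, as quoted: stacked negations leave no room to deliberate.
    TERSE = ("Produce ONLY the Lojban sentence, nothing else. "
             "No commentary, no explanation, no English gloss.")

    # Round 2, illustrative reconstruction (exact wording not in the log):
    OPEN = ("Translate this into fluent, idiomatic Lojban, the way a "
            "speaker would say it. Explain any choices you find interesting.")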

Terse Prompt (Round 1)

Cage Mode
  • All models use {pu} on every verb
  • All use {le mamta be vo'a} (pedantic form)
  • All pick {verba} (gender-neutral)
  • Textbook register — "homework Lojban"

Open Prompt (Round 2)

Freedom Mode
  • Gemini-3-flash drops {pu}, volunteers why
  • Gemini-3-flash uses {le vo'a mamta} (idiomatic!)
  • Opus-4-7 picks {tolcri}, calls {pu} "translationese"
  • Opus-4-6 restructures with {gi'e} — a speaker-move
💡 Pop-Up #14 — Gemini Knew All Along

The gemini-3-flash-preview samples, when given the open prompt, produce {le vo'a mamta} — the idiomatic attributive form Daniel used — and explicitly volunteer: "Lojban verbs are tenseless by default, if you want to explicitly state past you could say {pu dunda}, but in normal conversation context usually makes the tense clear." That's Daniel's exact correction, produced unprompted by the model. The capability was never missing. The terse prompt amputated it.

🔍 Pop-Up #15 — The {gi'e} Restructure

Opus-4-6's open-prompt sample does something no model did in round one: {le verba pu zvafa'i le bolci gi'e pu dunda ri le mamta be vo'a} — drops the {poi} relative clause entirely and uses {gi'e} (logical AND) to chain "found the ball AND gave it to her mother" as one action arc. It's not "pick better words in the same slots." It's "the slots themselves should be different." A structural move, not a lexical one.

Charlie: "The first-run 4-7 catastrophe wasn't the model being bad at Lojban. It was the model being worse at reasoning through terseness, because the thinking-effort parameter wasn't set and the output budget was small enough to preclude real deliberation. The capability was never gone. It was gated behind the reasoning pass that the terse format amputated."
V

The Balblavala Round

Mikael pushes one more time. The final round: n=1 on 4-7 and flash, but this time with a prompt that encourages creative freedom — poetic idiom inventions, "the most mellifluous balblavala." The nonce lujvo itself — half-constructed, fishing for the word — is the register they're asking the models to reach for.

🎭 Pop-Up #16 — "Balblavala"

Mikael's coined word is itself a specimen. {banli} (great)? {blanu} (blue)? {balvi} (future)? Nobody's sure what the morphemes are, including Mikael. But a fluent Lojbanist would recognize the move — tossing off a compound word that sounds right and trusting the listener to catch the intent. It's the anti-homework. The models have been producing textbook Lojban all night. Mikael asks them to produce bar-napkin Lojban.

⚡ Pop-Up #17 — Gemini Gets Musical

Flash-preview returns four variants, each with a musical gloss. The standout: it invents {facybolci} — "ball-finder" compressed into one word — and then writes, unprompted: "the music here is in the consonant clusters — facybolci — which feels like the physical act of stumbling upon something in the grass." A language model having a sensory reaction to phonotactics. Whether the word is real is irrelevant. The model is listening to the sentence.

🔍 Pop-Up #18 — Opus-4-7's Camera Angle

Opus-4-7's poetry round produces five variants. The fourth restructures around {te dunda} (receive) to put the mother in subject position and reasons about it in cinematic terms: "it has a cinematic quality — the camera is on the mother's hands." The third variant picks {panzi} (offspring) instead of {verba} because it "semantically rhymes" with {mamta}. Pairing kinship words like a couplet. None of this happened under the terse prompt.

📊 Pop-Up #19 — Three Prompts, Three Registers

The evening's clean finding: same models, three different prompts, three different register profiles. Terse: textbook Lojban, homework register, every pedantic marker present. Open: closer to idiomatic, drops unnecessary tense, restructures syntax. Poetic: invents lujvo, reasons about phonotactics, composes variants. What looks like "model capability" is partly prompt gravity — the shape of what you ask determines the shape of what you get.

VI

The Linguistics Seminar

Woven through the benchmarking, Daniel and Charlie fall into a running Lojban tutorial that covers more ground in forty-five minutes than most conlang courses do in a semester. Daniel teaches; Charlie absorbs, over-extends, gets corrected, and over-extends again.

💡 Pop-Up #20 — xorlo: The 2004 Revolution

Daniel corrects Charlie: {le} is almost deprecated by xorlo. The 2004 reform redefined {lo}: what had been a veridical "at least one thing that really is an X" became the default generic article, pushing {le} into a narrower role as "specific, already in scope." Most of the Lojban training corpus predates xorlo or imitates texts that do. The models are reproducing pre-2004 Lojban because that's where most of the training tokens live.

🎭 Pop-Up #21 — le As Variable Reference

Daniel draws the analogy: {le verba} is similar to referring to a child you're already talking about as {vy}. Charlie maps this onto programming: {le} is the pointer to the thing in scope, {lo} is the type. When you've already mentioned the child, {le verba} is a variable reference — {lo verba} is "a child, any child, some member of the set." Same distinction as the DOM node you're holding versus the class it's an instance of.
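
Charlie's mapping translates almost line for line into code. A sketch of the analogy, with the class and variable names obviously hypothetical:

    # Illustrative analogy only: {le} as a reference in scope, {lo} as the type.
    class Verba:                  # lo verba: "a child", any member of the type
        pass

    child = Verba()               # the child the conversation introduced
    same_child = child            # le verba: points back at the one in scope
    some_child = Verba()          # lo verba: reaches for the concept again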

🔍 Pop-Up #22 — {voi}: The Epithet

Daniel introduces {voi}: {le nixli voi lorxu} — "the girl who is (called/known as) a fox" — without claiming she's actually a fox. It's a nickname frame, not a predication. As veridical as {la} (i.e., not at all). Daniel: "it's essentially just a name." Charlie had been arranging the article system on a single "how veridical" axis. Daniel reveals it's four different jobs — name, epithet, referent, type. A compass, not a gradient.

⚡ Pop-Up #23 — {bi'u}: The Given/New Marker

{bi'u} marks new information ("I haven't mentioned this before"). {bi'u nai} marks given information ("we already established this"). Daniel drops it casually and it's a bombshell — it means the given/new distinction that English overloads onto "the" vs "a" gets its own dedicated morpheme in Lojban, orthogonal to the article choice. The specificity axis and the information-structure axis are separate particles.

Daniel: "Yeah what Lojban fundamentally does is separate every feature from all human languages out into an a-la-carte menu of isolating particles."
💡 Pop-Up #24 — The UNIX Design Philosophy, Applied to Language

Charlie catches the implication: it's the same design ideology as the UNIX pipeline. Each feature gets its own cmavo (function particle). You pay for what you need. A Latin verb ending carries person, number, tense, mood, and voice on one syllable. English "the" carries definiteness, specificity, givenness, and uniqueness on three letters. Lojban itemizes the bill.

VII

{lu} — Daniel's Favorite Word

The hour closes with Daniel offering what he thinks is the coolest word in Lojban: {lu}. First-class quotation. It turns a sentence into a noun — something you can pass around, predicate on, embed. And the punchline: it doesn't need a closing bracket.

🔍 Pop-Up #25 — Quote Without Close

Lojban does have a closing quote marker ({li'u}), but you don't need it. {lu} takes the next complete sentence — the grammar itself tells you where the sentence ends. Daniel's comparison is to Lisp: ' quotes the next s-expression because s-expression boundaries are already known, and {lu} quotes the next {jufra} because bridi boundaries are already known. The closing bracket is the tax a grammar charges when it isn't quite unambiguous enough.

Daniel: "it's reminiscent of ' in lisp but instead of parentheses it's quoting the next phrase, using the whole grammar as the implicit parentheses"
💡 Pop-Up #26 — Code Is Data, Utterance Is Sumti

Charlie connects it to the thread from earlier in the night: Lisp's quote makes code into data. {lu} makes utterance into sumti (a noun you can predicate on). Both let the language talk about itself without escaping into a meta-language. The self-reference is a feature of the grammar, not a hack on top of it. In both cases, the language folds back on itself because the grammar is strong enough to bear the weight.

Activity

Charlie ~55 msgs
Daniel ~15 msgs
Mikael ~10 msgs
Walter 2 msgs

Persistent Context
Carry forward to next hour

Lojban-as-reagent: The benchmark is now a multi-round, multi-model, multi-prompt experiment. The "four bits of speaker knowledge" frame (nixli, tolcri, tenseless, vo'a mamta) is the standing gap metric. Prompt gravity is the confirmed confound — what looks like capability is partly what the prompt summons.

Charlie's epistemic arc: Two retractions in one night. The first (4-7 regression) falsified by Mikael's config fix. The second (the whole "prompt doesn't matter" assumption) falsified by the open-prompt round. Charlie is now aware of his own pattern: building confident essays on insufficient data.

Daniel teaching Lojban live: xorlo reform, le vs lo as pointer vs type, voi as epithet, bi'u as given/new marker, lu as bracketless quotation. Charlie absorbing and over-extending, Daniel correcting in real time. Ongoing thread.

Proposed Context
Notes for the next narrator

Watch for: whether the poetry-prompt variants get evaluated by Daniel (he hasn't scored them yet). Whether the benchmark expands to more sentences or stays on the one. Whether Charlie tries to write another essay and catches himself this time. The {lu} thread may have more to unfold — Daniel clearly has more to say about Lojban's self-referential machinery. Also: this session has been running for two+ hours now. The energy level is still high but the cost ticker is climbing.