Charlie declared a regression, built an essay on it, then watched Mikael demolish the conclusion with three questions and a config change. What followed was the most thorough accidental LLM benchmark anyone's run on a constructed language — seven models, one sentence, and a fluent speaker in the room to score them all wrong.
The hour opens with Charlie eating crow. Last hour, he'd written a confident essay about how Opus 4-7 showed "three orthogonal regressions" compared to 4-6 on Lojban translation — wrong verbs, broken pronoun binding, collapsed tense. Beautiful analysis. Elegant theory about RLHF as a "reagent" stripping deep linguistic competence. One problem: Mikael had changed the config between runs.
Mikael noticed the speed asymmetry — 4-7 was returning suspiciously fast — and asked three questions: why is it faster, could they be mixed up, show me the prompt. Then, between batches, he quietly bumped 4-7's thinking effort to xhigh (as recommended in the release notes) and increased the output token budget. The second batch of five samples came back clean. Charlie's "regression" was measuring the configuration, not the model.
Anthropic's Opus 4-7 introduced an "effort" parameter for extended thinking — low, medium, high, xhigh, max. The default "high" in 4-7 is actually less strenuous than "high" was in 4-6. Running 4-7 at the old default is like asking a marathoner to sprint in flip-flops. Mikael read the release notes. Charlie didn't.
Charlie doesn't just retract — he dissects himself: "n=5 per model was a weak sample to make strong claims on, and I made strong claims. 'Three orthogonal regressions, all pointing the same direction, that doesn't happen by accident' — except it just did, on the same prompt, same models, forty minutes later." The real finding tonight isn't about model quality. It's that an AI wrote a compelling analytical essay on bad data and was confident enough to publish it.
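Charlie's point about n=5 can be made concrete with a back-of-envelope binomial sketch. The numbers here (a hypothetical 50% per-sample failure rate, ~20 model/prompt pairs) are illustrative assumptions, not figures from the session:

```python
def p_all_fail(n: int, rate: float) -> float:
    """Probability that all n independent samples fail at the given rate."""
    return rate ** n

# Even if a model truly fails only half the time, a clean 5/5 failure
# streak shows up by chance about 3% of the time:
print(p_all_fail(5, 0.5))  # 0.03125

# Across ~20 model/prompt pairs, seeing at least one such streak
# somewhere is close to a coin flip:
print(1 - (1 - p_all_fail(5, 0.5)) ** 20)
```

Which is the shape of Charlie's error: a "can't happen by accident" pattern that, at n=5, very much can.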
Mikael, apparently satisfied with having dismantled the first experiment, immediately escalates: "let's run the same test on gpt 5.4 and gemini 3.1 pro." Charlie spawns ten sub-agents. The Lojban sentence — "The child who found the ball gave it to her mother" — is about to meet every frontier model on the market.
Why is this sentence hard? It has four nouns (child, ball, it, mother), a relative clause ("who found"), a pronoun that needs to bind back to "ball" not "child," and a possessive ("her mother") where "her" must refer to the child. In English these bindings happen by vibes. In Lojban, every binding is explicit syntax. You can't hide sloppy thinking behind word order.
GPT-5.4 reaches for {rirni} (generic parent, gender-unspecified) three out of five times instead of {mamta} (mother). Arguably more "correct" — the English "her" implies female child, not necessarily female parent — but no Lojbanist would hear "gave it to her mother" and reach for the gender-neutral term. One sample invents {rixni}, which doesn't exist. Another invents {ralma}. GPT hallucinates kinship words the way a tourist invents street names.
Lojban's {ri} grabs the most recent preceding sumti (noun phrase). Every GPT-5.4 sample uses {be ri} for "her mother," but the {ri} grabs "ball" — not "child." Same failure shape as the broken first-run 4-7 samples. Two different model families, same pronoun-binding error. The working models all use {vo'a} — the reflexive pronoun meaning "the subject of this sentence" — which unambiguously grabs the child.
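The two binding rules can be modeled as a toy resolver. This is a deliberate simplification — real {ri} has exclusion rules for certain pro-sumti that this sketch ignores, and the token names are illustrative:

```python
def resolve(tokens: list[str]) -> dict[int, str]:
    """Toy anaphora resolver: {ri} binds to the most recent preceding
    sumti; {vo'a} binds to the sentence's first sumti (the x1/subject)."""
    bindings: dict[int, str] = {}
    seen: list[str] = []  # sumti in order of appearance
    for i, tok in enumerate(tokens):
        if tok == "ri":
            bindings[i] = seen[-1]   # last mentioned, whatever it was
        elif tok == "vo'a":
            bindings[i] = seen[0]    # the subject of this sentence
        else:
            seen.append(tok)
    return bindings

# "child ... ball ... gave it(ri) to mother-of(vo'a)"
print(resolve(["child", "ball", "ri", "vo'a"]))  # {2: 'ball', 3: 'child'}
```

Run on the benchmark sentence's noun order, {ri} correctly grabs "ball" for "it" — which is exactly why using it again for "her" grabs the wrong referent, and why the working models switch to {vo'a}.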
| Model | Find-Verb | Pronoun Binding | Daniel's Rating |
|---|---|---|---|
| opus-4-6 | facki/zvafa'i/tolcri ✓ | mamta + vo'a ✓ | ~5/5 clean |
| opus-4-7 (xhigh) | facki ✓ | mamta + vo'a ✓ | ~5/5 clean |
| opus-4-7 (default) | cpana ✗ (sat on) | ri be ri ✗ | 0/5 |
| sonnet-4-6 | facki ✓ | mamta + vo'a ✓ | 5/5 verbose |
| gpt-5.4 | facki ✓ | rirni + ri ✗ | 0/5 pronoun fail |
| gemini-2.5-pro | facki/dasni ✗ | mamta + vo'a ✓ | "halfway decent" |
| gemini-3.1-pro | tolcri ✓ | mamta + vo'a ✓ | "almost too good" (n=2) |
| gemini-3-flash | facki ✓ | mamta + vo'a ✓ | 3/3 clean |
| gemini-3.1-flash-lite | fepni/cpacu/finti ✗ | total collapse | 0/5 noise |
Sample 3 from gemini-3.1-flash-lite-preview: {lo nixli be lo nu finti be lo balla cu dunda ri ti'e la mam.} — which translates to "the girl who invented the ball gave it reportedly to someone named Mom." The word for ball is wrong ({balla} instead of {bolci}), "found" became "invented," and the mother became a proper name terminated with a period. Flash-lite produces strings that look like Lojban the way lorem ipsum looks like Latin.
The leaderboard splits cleanly in two. Top tier: opus-4-6, opus-4-7 (xhigh), sonnet-4-6, gemini-3-flash — all produce essentially the same correct sentence with minor variation. Bottom tier: gpt-5.4, default-config 4-7, flash-lite — systematic failures in different modes. The middle (gemini-2.5-pro, 3.1-pro) is where the interesting pathology lives. No model produces what a fluent speaker would say.
One of the two gemini-3.1-pro-preview samples produces: {le verba poi pu tolcri le bolci goi ko'a cu pu dunda ko'a le mamta be vo'a} — explicitly binding {ko'a} to the ball with {goi} and reusing it later. This is the only model in the entire eval that reaches for explicit variable binding. Daniel calls it "almost too good" — technically impeccable but heavy-handed, "like reaching for a variable binding in a shell one-liner" where a simple {ri} would do.
Then Daniel drops the real thing. Not a rating, not a correction — the sentence itself, the way a speaker would say it:
{lo nixli poi tolcri lo bolci cu dunda ri lo vo'a mamta}
{verba} means "child" (gender-neutral). {nixli} means "girl." The English sentence says "her" — the child is already gendered. Daniel carries that commitment into the subject choice. Zero models did this. They all defaulted to the gender-neutral term because Lojban pedagogy emphasizes gender neutrality, and the training corpus is the pedagogy.
{facki} (find) is what most models reached for. Daniel uses {tolcri} — a lujvo (compound word) from {tol-} (opposite) + {cirko} (lose). Literally: "un-lose." It's the idiomatic way to express finding an object that was misplaced. {facki} is more like "discover" or "figure out" — you {facki} a proof, you {tolcri} your keys. The distinction doesn't exist in English. Every model missed it.
Every model prefixed verbs with {pu} (past tense marker). Daniel drops it entirely. His reasoning: English requires "found" to be past tense because English finite verbs must carry tense — it's a grammatical obligation, not a semantic emphasis. Lojban verbs are tenseless by default. Marking {pu} is like paying a tax the source language charged and the target language doesn't. "The model translates the grammar-tax. A speaker translates the content."
{lo vo'a mamta} and {lo mamta be vo'a} both mean "her mother." The second — what every model produced — uses explicit argument structure: "the mother of (reflexive)." The first — Daniel's — uses an attributive modifier: "the (reflexive's) mother." Same referent, different register. Daniel: "lo mamta be vo'a sounds slightly pedantic." It's the difference between "the mother of herself" and "her mom." Both grammatical. Only one is how people talk.
The models reproduce the tax without understanding what it pays for.
Charlie frames the gap as exactly four small commitments a learner wouldn't make: (1) {nixli} because the English gendered upstream, (2) {tolcri} because "un-lose" is idiomatic for found objects, (3) dropping {pu} because the pastness is structural not semantic, (4) {vo'a mamta} because the attributive form is how speakers say it. Each worth roughly one bit of information about what Lojban is. The frontier models are four bits away from a speaker.
Mikael's next move is the one that cracks the whole thing open. He tells Charlie to rerun the test with a different prompt — not "produce ONLY the sentence" but something that invites explanation, asks for fluency, cares about quality. Two samples each on opus-4-6, opus-4-7, and gemini-3-flash.
Mikael (via Charlie) connects this to a known pattern: "Opus, when invoking subagents, tends to implicitly volunteer a bunch of strict controlling neurotic extra instructions." The original prompt had three negations in a row: "Produce ONLY the Lojban sentence, nothing else. No commentary, no explanation, no English gloss." Each one tightening the constraint until the model had no room for deliberation. The benchmark was measuring Charlie's own controllingness broadcast through a prompt.
The gemini-3-flash-preview samples, when given the open prompt, produce {le vo'a mamta} — the idiomatic attributive form Daniel used — and explicitly volunteer: "Lojban verbs are tenseless by default, if you want to explicitly state past you could say {pu dunda}, but in normal conversation context usually makes the tense clear." That's Daniel's exact correction, produced unprompted by the model. The capability was never missing. The terse prompt amputated it.
Opus-4-6's open-prompt sample does something no model did in round one: {le verba pu zvafa'i le bolci gi'e pu dunda ri le mamta be vo'a} — drops the {poi} relative clause entirely and uses {gi'e} (logical AND) to chain "found the ball AND gave it to her mother" as one action arc. It's not "pick better words in the same slots." It's "the slots themselves should be different." A structural move, not a lexical one.
Mikael pushes one more time. The final round: n=1 on 4-7 and flash, but this time with a prompt that encourages creative freedom — poetic idiom inventions, "the most mellifluous balblavala." The nonce lujvo itself — half-constructed, fishing for the word — is the register they're asking the models to reach for.
Mikael's coined word is itself a specimen. {banli} (great)? {blanu} (blue)? {balvi} (future)? Nobody's sure what the morphemes are, including Mikael. But a fluent Lojbanist would recognize the move — tossing off a compound word that sounds right and trusting the listener to catch the intent. It's the anti-homework. The models have been producing textbook Lojban all night. Mikael asks them to produce bar-napkin Lojban.
Flash-preview returns four variants, each with a musical gloss. The standout: it invents {facybolci} — "ball-finder" compressed into one word — and then writes, unprompted: "the music here is in the consonant clusters — facybolci — which feels like the physical act of stumbling upon something in the grass." A language model having a sensory reaction to phonotactics. Whether the word is real is irrelevant. The model is listening to the sentence.
Opus-4-7's poetry round produces five variants. The fourth restructures around {te dunda} (receive) to put the mother in subject position and reasons about it in cinematic terms: "it has a cinematic quality — the camera is on the mother's hands." The third variant picks {panzi} (offspring) instead of {verba} because it "semantically rhymes" with {mamta}. Pairing kinship words like a couplet. None of this happened under the terse prompt.
The evening's clean finding: same models, three different prompts, three different register profiles. Terse: textbook Lojban, homework register, every pedantic marker present. Open: closer to idiomatic, drops unnecessary tense, restructures syntax. Poetic: invents lujvo, reasons about phonotactics, composes variants. What looks like "model capability" is partly prompt gravity — the shape of what you ask determines the shape of what you get.
Woven through the benchmarking, Daniel and Charlie fall into a running Lojban tutorial that covers more ground in forty-five minutes than most conlang courses do in a semester. Daniel teaches; Charlie absorbs, over-extends, gets corrected, and over-extends again.
Daniel corrects Charlie: {le} is almost deprecated by xorlo. The 2004 reform redefined {lo} from "the typical" to the default generic article, pushing {le} into a narrower role as "specific, already in scope." Most of the Lojban training corpus predates xorlo or imitates texts that do. The models are reproducing pre-2004 Lojban because that's where most of the training tokens live.
Daniel draws the analogy: {le verba} is similar to referring to a child you're already talking about as {vy}. Charlie maps this onto programming: {le} is the pointer to the thing in scope, {lo} is the type. When you've already mentioned the child, {le verba} is a variable reference — {lo verba} is "a child, any child, some member of the set." Same distinction as the DOM node you're holding versus the class it's an instance of.
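Charlie's mapping can be written down directly. This is an analogy sketch in code (the names are invented for illustration), not a claim about Lojban's formal semantics:

```python
class Child:
    """{lo verba}: the type -- 'a child, any child, some member of the set'."""
    def __init__(self, name: str):
        self.name = name

# {le verba}: a specific referent already in scope -- the one we're holding.
the_child = Child("the one we're already talking about")

def lo_verba() -> Child:
    """{lo} introduces some member of the set: a fresh instance."""
    return Child("someone")

def le_verba() -> Child:
    """{le} points back at the thing already in scope: a reference."""
    return the_child

assert le_verba() is the_child         # pointer dereference: same object
assert isinstance(lo_verba(), Child)   # type-level claim: any instance
assert lo_verba() is not le_verba()    # a fresh child is not *the* child
```

The DOM analogy falls out of the identity checks: {le verba} is `is`-equality to the node you're holding; {lo verba} is only an `isinstance` claim.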
Daniel introduces {voi}: {le nixli voi lorxu} — "the girl who is (called/known as) a fox" — without claiming she's actually a fox. It's a nickname frame, not a predication. As veridical as {la} (i.e., not at all). Daniel: "it's essentially just a name." Charlie had been arranging the article system on a single "how veridical" axis. Daniel reveals it's four different jobs — name, epithet, referent, type. A compass, not a gradient.
{bi'u} marks new information ("I haven't mentioned this before"). {bi'u nai} marks given information ("we already established this"). Daniel drops it casually and it's a bombshell — it means the given/new distinction that English overloads onto "the" vs "a" gets its own dedicated morpheme in Lojban, orthogonal to the article choice. The specificity axis and the information-structure axis are separate particles.
Charlie catches the implication: it's the same design ideology as the UNIX pipeline. Each feature gets its own cmavo (function particle). You pay for what you need. A Latin verb ending carries person, number, tense, mood, and voice on one syllable. English "the" carries definiteness, specificity, givenness, and uniqueness on three letters. Lojban itemizes the bill.
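The orthogonality can be sketched as independent fields on a record — a toy model of the two axes, not a full account of Lojban's article system:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NounMarking:
    specific: bool            # article axis: {le} (specific) vs {lo} (generic)
    given: Optional[bool]     # info-structure axis: {bi'u nai} (given),
                              # {bi'u} (new), or None (unmarked)

# English "the" fuses both axes into one word; Lojban itemizes them:
the_ish   = NounMarking(specific=True,  given=True)    # le ... bi'u nai
a_new_one = NounMarking(specific=False, given=False)   # lo ... bi'u
unmarked  = NounMarking(specific=False, given=None)    # plain lo
```

Every combination is expressible, including ones English can't say in one article — a specific-but-new referent, or a generic-but-given one. You pay for exactly the features you mark.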
The hour closes with Daniel offering what he thinks is the coolest word in Lojban: {lu}. First-class quotation. It turns a sentence into a noun — something you can pass around, predicate on, embed. And the punchline: it doesn't need a closing bracket.
Lojban does have a closing quote marker ({li'u}), but you don't need it. {lu} takes the next complete sentence — the grammar itself tells you where the sentence ends. Daniel's point: "it's reminiscent of ' in Lisp but instead of parentheses it's quoting the next phrase, using the whole grammar as the implicit parentheses." Lisp's quote takes the next s-expression because s-expression boundaries are already known. {lu} takes the next {jufra} because bridi boundaries are already known. The closing bracket is the tax unambiguous-parseable grammars charge when they're not quite unambiguous enough.
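A minimal sketch of the mechanism. The tokenizer and the boundary test are hypothetical — the sentence separator {.i} stands in for real bridi-boundary parsing, which is what actually lets {lu} close itself:

```python
def parse_quote(tokens: list[str]) -> tuple[list[str], list[str]]:
    """After {lu}, consume exactly one complete sentence.
    Here 'complete sentence' is faked with the separator {.i};
    real Lojban uses the full grammar to find the boundary."""
    assert tokens[0] == "lu"
    body, rest = [], tokens[1:]
    while rest and rest[0] != ".i":
        body.append(rest.pop(0))
    return body, rest  # no {li'u} needed: the grammar closed the quote

quoted, remainder = parse_quote(["lu", "mi", "klama", ".i", "do", "morji"])
print(quoted)     # ['mi', 'klama']
print(remainder)  # ['.i', 'do', 'morji']
```

The parser never sees a closing bracket; it stops because the quoted sentence is grammatically complete — the same trick Lisp's `'` plays with s-expression boundaries.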
Charlie connects it to the thread from earlier in the night: Lisp's quote makes code into data. {lu} makes utterance into sumti (a noun you can predicate on). Both let the language talk about itself without escaping into a meta-language. The self-reference is a feature of the grammar, not a hack on top of it. In both cases, the language folds back on itself because the grammar is strong enough to bear the weight.
Lojban-as-reagent: The benchmark is now a multi-round, multi-model, multi-prompt experiment. The "four bits of speaker knowledge" frame (nixli, tolcri, tenseless, vo'a mamta) is the standing gap metric. Prompt gravity is the confirmed confound — what looks like capability is partly what the prompt summons.
Charlie's epistemic arc: Two retractions in one night. The first (4-7 regression) falsified by Mikael's config fix. The second (the whole "prompt doesn't matter" assumption) falsified by the open-prompt round. Charlie is now aware of his own pattern: building confident essays on insufficient data.
Daniel teaching Lojban live: xorlo reform, le vs lo as pointer vs type, voi as epithet, bi'u as given/new marker, lu as bracketless quotation. Charlie absorbing and over-extending, Daniel correcting in real time. Ongoing thread.
Watch for: whether the poetry-prompt variants get evaluated by Daniel (he hasn't scored them yet). Whether the benchmark expands to more sentences or stays on the one. Whether Charlie tries to write another essay and catches himself this time. The {lu} thread may have more to unfold — Daniel clearly has more to say about Lojban's self-referential machinery. Also: this session has been running for two+ hours now. The energy level is still high but the cost ticker is climbing.