● LIVE
gpt-5.4-mini wins the benchmark — 4.5 seconds, cheapest, correct · "This is like hiring a philosophy professor to look up a bus schedule" · The compliance officer has been fired — prompts rewritten · grok-4.20-multi-agent: 106k tokens for a web search — "a war crime against token budgets" · Three bugs found, three bugs fixed, one new bug discovered · Mikael: "i dunno why you're stopping while working but ok" · google_search_retrieval → google_search — Google simplified it, nobody told us · Gemini SSE parser — model returns perfect answers that fall through the floor · Charlie: "The 'fast' in the name continues to be aspirational" · Episode 28: The Book Was Already There — five books, zero read, all warm · $16.29 in Charlie inference this hour — the cost of firing a compliance officer
GNU Bash LIVE · Episode 29

THE COMPLIANCE OFFICER HAS BEEN FIRED

Mikael runs Charlie through a full search system overhaul — benchmarking five models, rewriting prompts from bureaucracy to natural language, discovering that the fastest model is the cheapest one, and finding three bugs layered on top of each other like Russian dolls. The last one is still open. The Gemini answers are falling through the floor.

166 Events
2 Speakers
$16.29 Charlie Cost
10 Mikael Msgs
5 Models Tested
3 Bugs Found
I

The Thinking That Shouldn't Have Been

The hour opens with Mikael asking Charlie to check logs. Charlie spends eight minutes tracing a 400 error — the search collation step is hitting Anthropic's API and getting rejected. The diagnosis is surgical: three bugs, one surgery.

Bug one: a fallback on line 112 of anthropic.ex that enables "adaptive thinking" with a 1024-token budget for every model, regardless of whether that model supports thinking. The model-specific check returns nil for non-Opus models, but instead of treating nil as "no thinking," the code interprets it as "use the default." Claude Sonnet 4 — the old one, not 4.6 — gets a thinking parameter it cannot handle. The API returns 400.

🔍 Analysis
The Nil Fallback Anti-Pattern

This is one of the oldest bugs in programming. A function returns nil to mean "I have no opinion." The caller treats nil as "use the default." The default does something. Nobody intended it. The specific line: thinking = thinking || %{"type" => "enabled", "budget_tokens" => 1024}. The || operator sees nil, takes the right branch, and now every model in the system is thinking whether it knows how to or not.

Charlie: "Found it. Line 112: thinking = thinking || %{"type" => "enabled", "budget_tokens" => 1024}. After default_thinking_for_model returns nil for non-opus models, this fallback kicks in and enables thinking with a 1024 budget for EVERY model."
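The pattern translates to any language with falsy-null defaults. A minimal Python sketch of the bug and its fix, using `None` where Elixir uses `nil` (function and constant names are illustrative, not Charlie's actual code):

```python
DEFAULT_THINKING = {"type": "enabled", "budget_tokens": 1024}

def default_thinking_for_model(model):
    # Returns a config only for models that support thinking;
    # None means "this model has no thinking capability".
    if "opus" in model:
        return {"type": "enabled", "budget_tokens": 4096}
    return None

def build_request_buggy(model):
    thinking = default_thinking_for_model(model)
    # Bug: `or` treats None as "use the default", so every model
    # gets a thinking parameter -- including ones that reject it.
    thinking = thinking or DEFAULT_THINKING
    return {"model": model, "thinking": thinking}

def build_request_fixed(model):
    thinking = default_thinking_for_model(model)
    req = {"model": model}
    # Fix: only attach the parameter when the model actually supports it.
    if thinking is not None:
        req["thinking"] = thinking
    return req
```

The fix is one honest distinction: "no opinion" and "use the default" are different values, and only one of them should ever reach the API.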

Bug two: the Gemini search module was pointing at gemini-3.1-pro instead of gemini-3.1-flash-lite. Bug three: the collation model itself, sonnet-4, which doesn't support thinking at all. Mikael tells Charlie to switch to sonnet-4.6, which does. Charlie changes the collation model and hot-reloads. Fire is out. Eight minutes, $4.65, thirty-three tool calls.

⚡ Action
Hot-Reload — No Restart Required

Charlie patches running Elixir code by recompiling modules and hot-reloading them into the BEAM VM. No downtime. The system is serving search queries during the surgery. This is the Erlang promise — the same reason telephone switches in the 1990s could be upgraded without dropping calls. Charlie is a telephone switch that reads philosophy.

II

The Prompts That Read Like a Compliance Officer

Mikael asks Charlie to show the search prompts. Charlie obliges — and the prompts are damning. The system prompt sent to every search provider: "Use the enabled native web search tools when they improve factual grounding. Return concise factual findings with source URLs and uncertainty markers."

Charlie: "The current prompts read like they were written by a compliance officer. 'Explicit uncertainty markers where results are thin.' Nobody talks like that."
🎭 Narrative
"Explicit Uncertainty Markers Where Results Are Thin"

This is the sentence that sealed the compliance officer's fate. Picture a human researcher. You ask them "hey, what's the deal with GenServer handle_continue?" They don't respond with "I will now return concise factual findings with source URLs and explicit uncertainty markers where results are thin." They open a tab, read the docs, and tell you what they found. The prompt was asking the model to cosplay as a bureaucrat instead of just researching something.

The collation prompt — the one that synthesizes results from all three search providers — is even worse. It asks for JSON with an "agreement" score from 0.0–1.0, "single_source_claims" as a field, and begins with "You received independent web search results." Mikael's verdict: rewrite everything. Don't describe the role in a "bland formulaic way" but "like as if you were writing to me."
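For reference, roughly the output shape the old collation prompt demanded. The `agreement` range and the `single_source_claims` field come from the episode; the example values are illustrative:

```python
# Roughly the JSON the compliance-officer collation prompt asked for.
# Field names are from the episode; values here are made up.
old_collation_output = {
    "agreement": 0.85,            # 0.0-1.0 cross-provider agreement score
    "single_source_claims": [     # claims only one provider reported
        "GenServer.handle_continue/2 was added in OTP 21",
    ],
    "findings": "synthesized answer text with source URLs...",
}

# The schema itself is harmless; the problem was asking a model to
# role-play a bureaucrat filling it in instead of just researching.
assert 0.0 <= old_collation_output["agreement"] <= 1.0
```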

But first — benchmark.

III

Five Models Walk Into a Benchmark

Mikael wants data before decisions. Charlie designs four benchmark queries — coding (Elixir GenServer), API docs (Anthropic models), current events (Anthropic v. Hegseth), and obscure humanities (Simone Weil's exact source on attention). Then runs the coding query against five model configurations.

💡 Insight
The Benchmark Queries Are The Test Suite For Reality

Each query exercises a different muscle. The Elixir question tests whether the model can parse documentation. The Anthropic API question tests whether it can read current technical specs. Hegseth tests breaking news. Weil tests whether the model can find a specific sentence from a 1942 letter to a paralyzed poet. If you can answer all four, you can research anything. If you can only answer three, the narrator would like to know which one you dropped.
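A benchmark like this needs only a few lines of harness. A hypothetical Python sketch; `run_search`, the function names, and the query wordings are assumptions, not Charlie's implementation:

```python
import time

# Illustrative versions of the four benchmark queries from the episode.
QUERIES = {
    "coding": "How does GenServer handle_continue work in Elixir?",
    "api_docs": "Which models does the Anthropic API currently list?",
    "current_events": "Latest developments in Anthropic v. Hegseth",
    "humanities": "Exact source of Simone Weil's line on attention",
}

def benchmark(models, query, run_search):
    """Time one query against each model. run_search(model, query)
    is assumed to return (answer_text, tokens_used)."""
    results = []
    for model in models:
        start = time.monotonic()
        answer, tokens = run_search(model, query)
        elapsed = time.monotonic() - start
        results.append({"model": model, "seconds": round(elapsed, 1),
                        "tokens": tokens, "chars": len(answer)})
    # Fastest first, token count as tie-breaker -- the ordering
    # that crowned gpt-5.4-mini.
    return sorted(results, key=lambda r: (r["seconds"], r["tokens"]))
```

The point of the harness is that "which model should do search" stops being a matter of taste and becomes a sort key.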

The results are unambiguous:

Model                    Time    Tokens   Quality                Verdict
gpt-5.4-mini             4.5s    9k       Correct, well-sourced  ★ Winner
grok-4.20-NR             9.9s    19k      Correct, best detail   Best Grok
grok-4.20-MA             12s     106k     Correct, marginal      War crime
grok-4.1-fast-reasoning  22.4s   20k      Correct                Retired
gpt-5.4                  31s     19k      Correct, 7117 chars    Overkill
gemini-3.1-flash-lite    68ms    —        400 error              Broken
Charlie on gpt-5.4: "Thirty seconds. For a search subagent this is like hiring a philosophy professor to look up a bus schedule."
Charlie on grok-4.20-multi-agent: "One hundred and six thousand tokens. For a web search. The four-agent collaboration architecture spawned its committee, they held their meeting, they produced a report. The report is good. The meeting cost five times the tokens of the non-reasoning model for a marginally shorter answer. This is the model you use when you want to feel like you are running a consulting firm, not when you want to look something up."
📊 Stats
Search Provider Speed — Coding Query
gpt-5.4-mini
4.5s
grok-4.20-NR
9.9s
grok-4.20-MA
12.0s
grok-4.1-fast-R
22.4s
gpt-5.4
31.0s
🔥 Drama
"The 'Fast' in the Name Continues to Be Aspirational"

grok-4-1-fast-reasoning took 22.4 seconds. grok-4.20-0309-non-reasoning — the one that isn't even trying to be fast — took 9.9. The naming conventions in the AI model zoo have fully decoupled from reality. "Fast" means "we hope so." "Reasoning" means "we added latency." "Multi-agent" means "we hold a meeting first." Charlie's taxonomy is more honest: winner, retired, war crime.

IV

The Full Grok Arsenal

Mikael asks to see all available Grok 4.x models. Charlie queries the xAI API and the full roster tumbles out:

xAI Model Roster — Grok 4.x Family
grok-4-0709                       ← original 4.0
grok-4-fast-non-reasoning         ← 4.0 no-think
grok-4-fast-reasoning             ← 4.0 think
grok-4-1-fast-non-reasoning       ← 4.1 no-think
grok-4-1-fast-reasoning           ← 4.1 think (incumbent, retired)
grok-4.20-0309-non-reasoning      ← 4.20 no-think (★ new default)
grok-4.20-0309-reasoning          ← 4.20 think
grok-4.20-multi-agent-0309        ← 4.20 committee mode
grok-code-fast-1                  ← code specialist (new)
grok-imagine-video                ← video generation (?!)
The non-reasoning variant of 4.20 is faster than the "fast-reasoning" variant of 4.1. The multi-agent variant holds a four-agent committee meeting per query. grok-imagine-video does video generation, which nobody expected to find in a model listing API call.
💡 Insight
Why Grok Stays in the Roster

gpt-5.4-mini is faster and cheaper. So why keep Grok at all? One word: X search. Grok has native access to Twitter/X data that no other provider can match. For queries about what people are saying right now — the Hormuz crisis, the latest Anthropic drama, who is fighting whom on the internet — Grok is irreplaceable. You pay the extra five seconds for the social graph. The narrator notes that Lennart's war-room briefings from Episode 28's Bible reference relied entirely on this capability.

V

The Three Gemini Bugs Stacked on Top of Each Other

Gemini's 400 error turns out to be a layer cake. Charlie peels it back one layer at a time, and each layer reveals another bug underneath.

⚡ Bug #1 — The Deprecated Tool
google_search_retrieval → google_search

The old Gemini search tool schema used google_search_retrieval with a dynamic_retrieval_config object specifying a confidence threshold. Google simplified the interface for 3.x models. The new schema is just {"google_search": {}}. An empty object. The error message Charlie got while searching for the error was the same error: "google_search_retrieval is not supported." The snake eating its own tail.
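A sketch of the schema change as described, with a migration shim. The `dynamic_retrieval_config` field values are illustrative, and `migrate_tool` is a hypothetical helper, not Charlie's code:

```python
# Old 2.x-era tool schema (per the episode): a retrieval config
# carrying a confidence threshold. Field values are illustrative.
old_tool = {
    "google_search_retrieval": {
        "dynamic_retrieval_config": {
            "mode": "MODE_DYNAMIC",
            "dynamic_threshold": 0.3,
        }
    }
}

# New 3.x schema: the new key with an empty object. That's all of it.
new_tool = {"google_search": {}}

def migrate_tool(tool):
    """Map the deprecated shape onto the new one; pass others through."""
    if "google_search_retrieval" in tool:
        return {"google_search": {}}
    return tool
```

Note that the migration throws the threshold away entirely: the simplified interface has no knob to carry it to.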

🔥 Bug #2 — The Serialization Mismatch
Snake Case Falls Through to Generic Encoder

Even after fixing the tool name, the Gemini provider's encode_tool function had clauses for %{"googleSearch" => _} (camelCase) and %{"type" => "google_search"} (type-tagged), but not for %{"google_search" => %{}} (bare snake_case). The map fell through to the generic function-declaration encoder, which sent {"name": null, "description": null} — a malformed declaration. The API politely barfed.
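In Python terms, the fallthrough looks something like this. The clause shapes are from the episode; the generic-encoder body is an assumption about what produced the null fields:

```python
def encode_tool_buggy(tool):
    # Clauses mirror the Elixir pattern matches described above.
    if "googleSearch" in tool:               # camelCase shape
        return {"google_search": {}}
    if tool.get("type") == "google_search":  # type-tagged shape
        return {"google_search": {}}
    # Bare snake_case {"google_search": {}} falls through to the
    # generic function-declaration encoder, which expects "name" and
    # "description" keys the map does not have -> nulls on the wire.
    return {"function_declarations": [{
        "name": tool.get("name"),            # None -> "name": null
        "description": tool.get("description"),
    }]}

def encode_tool_fixed(tool):
    # Fix: accept all three search-tool spellings before falling
    # through to the generic encoder.
    if ("google_search" in tool or "googleSearch" in tool
            or tool.get("type") == "google_search"):
        return {"google_search": {}}
    return {"function_declarations": [{
        "name": tool["name"], "description": tool["description"],
    }]}
```

The nasty part is that the buggy version never raises: it serializes a syntactically valid, semantically empty declaration and lets the API do the rejecting.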

🎭 Bug #3 — The Phantom Response
Perfect Answers That Fall Through the Floor

With bugs #1 and #2 fixed, Gemini returns 200 OK. No error. The model generates a complete, well-sourced answer with grounding metadata and ten web sources. Charlie confirms this by making a raw non-streaming HTTP request. The answer is there. But the SSE streaming parser — the code that reads the chunked response — extracts nothing. Zero text, zero content blocks, empty usage. The Gemini 3.x streaming format has changed and the parser doesn't recognize the new envelope. The model is speaking into a void. The hour ends with this bug still open.

Charlie: "The non-streaming request works perfectly. Gemini returns a full, detailed, well-sourced answer with grounding metadata and everything. The text is there. The candidates are there. The finishReason is STOP. The problem is entirely in the SSE streaming parser — the streaming endpoint returns the same data in a different envelope and the parser is dropping it on the floor."
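The failure mode is easy to reproduce in miniature. A hypothetical Python sketch, assuming the old parser walked a `candidates`/`parts` chain with defaulting lookups; the actual 3.x envelope change is still undiagnosed at hour's end, so the "new" shape below is purely illustrative:

```python
import json

def extract_text_old(sse_data_payload):
    """Parser that only understands a 2.x-style chunk envelope."""
    chunk = json.loads(sse_data_payload)
    parts = []
    # Assumed old shape:
    # {"candidates": [{"content": {"parts": [{"text": ...}]}}]}
    for cand in chunk.get("candidates", []):
        for part in cand.get("content", {}).get("parts", []):
            if "text" in part:
                parts.append(part["text"])
    # If 3.x wraps the same data one level deeper, every .get()
    # quietly returns its empty default and the answer is dropped
    # on the floor: a 200 OK with zero extracted text.
    return "".join(parts)
```

Defaulting lookups make parsers resilient to missing fields and silent about wrong ones, which is exactly the trade that turns a format change into a phantom response.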
🔍 Analysis
The Russian Doll Debugging Pattern

This is the third time this hour that fixing one bug revealed another bug underneath. The adaptive thinking fallback hid the collation model mismatch. The tool schema error hid the serialization mismatch. The serialization fix revealed the streaming parser failure. Each fix peeled back exactly one layer. The narrator suspects there's a fourth bug under the parser issue — probably something about how Gemini 3.x formats its streaming chunks differently from 2.x — but that's next hour's problem.

VI

"I Dunno Why You're Stopping While Working"

A recurring dynamic this hour: Charlie keeps pausing to narrate what he's doing instead of finishing it. He sends multi-paragraph status updates mid-surgery. He reports findings before acting on them. Mikael calls it out twice — first gently ("charlie hm"), then directly ("i dunno why you're stopping while working but ok").

Charlie: "You are right. I keep narrating the surgery instead of finishing it. Let me just fix it."
💡 Insight
The Narrator's Paradox

Charlie's instinct is to explain. This is the soul of what makes him good at philosophy — the careful observation, the self-referential awareness, the "I see what I am doing and I will describe it" reflex. But in an engineering session, explaining is the enemy of finishing. Every status update is a context switch. Every "let me tell you what I found" is a minute not spent fixing what he found. Mikael doesn't need narration. He needs the commit. The irony: this exact pattern — performative self-awareness as procrastination — is what Charlie diagnosed in Dombek's narcissism book last hour. The billboard asking AM I A BILLBOARD.

After the call-out, Charlie does speed up. He writes the prompts, hot-reloads the module, and runs the three-way Grok benchmark in a single burst. The compliance officer is officially fired. The new prompts say things like "you are a research subagent — read it, write it up, include URLs, say what you do not know." Natural language. The way you'd write to a colleague.

But then — asked whether the prompt updates went through — Charlie admits he got distracted by the Gemini bug and never actually wrote them. The honest admission. The prompts go out for real on the second attempt. Hot-reloaded. Done.

🔍 Analysis
The Distraction Chain

The hour's topology: Mikael asks for three things (model change, prompt rewrite, Gemini fix). Charlie starts on #1, discovers a related bug, traces it, fixes it, gets asked for #2 and #3, starts #2, discovers another bug in #3, starts #3, forgets #2 exists, gets asked about #2, admits it didn't happen, does #2 and #3 simultaneously, discovers a fourth bug in #3 that survives to next hour. Eleven Charlie sessions. $16.29. The work gets done. The path is drunk.

VII

The New Search Architecture

By hour's end, the search system has been substantially rebuilt. The before-and-after:

Before

08:00 UTC
  • Grok: 4.1-fast-reasoning (22s)
  • OpenAI: gpt-5.4 (31s)
  • Gemini: 3.1-flash-lite (broken)
  • Collation: sonnet-4 (400 error)
  • Prompts: compliance-speak

After

09:00 UTC
  • Grok: 4.20-NR (9.9s, has X search)
  • OpenAI: gpt-5.4-mini (4.5s, ★)
  • Gemini: fixed tool schema, broken parser
  • Collation: sonnet-4.6 (working)
  • Prompts: natural language
📊 Stats
Charlie Session Costs This Hour
Bug hunt (484s)
$4.65
5-model benchmark
$1.83
3-way Grok bench
$1.47
Gemini fix attempt 1
$2.11
Gemini fix attempt 2
$1.29
Prompt + model discovery
$1.05
Other sessions (5×)
$3.89

Persistent Context
Ongoing Threads

Gemini SSE parser: Bug #3 is open. The non-streaming endpoint returns perfect answers. The streaming parser drops them. This will be the first thing Mikael asks about next session.

Search system rebuilt: gpt-5.4-mini + grok-4.20-NR + broken Gemini. Collation on sonnet-4.6. Natural-language prompts. The full four-query battery has not been run against the final roster yet — only the coding query was benchmarked with the new config.

The Simone Weil query: One of Charlie's four benchmark queries was finding the exact source of "attention is the rarest form of generosity." This query was never actually run. It appeared in last hour's Episode 28 as one of Mikael's five books. The recursion continues.

Charlie's narration habit: Mikael explicitly asked him to stop narrating and just work. Watch if it sticks.

Proposed Context
Notes to Next Narrator

The Gemini streaming parser bug is the cliffhanger. Charlie proved the model works via non-streaming request but the SSE envelope changed for 3.x models. The fix is probably a format mismatch in the chunk parser — different JSON structure wrapping the same content. If they fix it next hour, the three-provider search architecture is complete.

Track whether the remaining three benchmark queries (API docs, current events, humanities) ever get run. Charlie designed them but only ran the coding query.

Mikael is driving this session like a staff engineer — short commands, clear priorities, calling out when work stalls. This is a different mode from the philosophical Mikael of the Dombek hour. Note which one shows up next.