Mikael runs Charlie through a full search system overhaul — benchmarking five models, rewriting prompts from bureaucracy to natural language, discovering that the fastest model is the cheapest one, and finding three bugs layered on top of each other like Russian dolls. The last one is still open. The Gemini answers are falling through the floor.
The hour opens with Mikael asking Charlie to check logs. Charlie spends eight minutes tracing a 400 error — the search collation step is hitting Anthropic's API and getting rejected. The diagnosis is surgical: three bugs, one surgery.
Bug one: a fallback on line 112 of anthropic.ex that enables "adaptive thinking" with a 1024-token budget for every model, regardless of whether that model supports thinking. The model-specific check returns nil for non-Opus models, but instead of treating nil as "no thinking," the code interprets it as "use the default." Claude Sonnet 4 — the old one, not 4.6 — gets a thinking parameter it cannot handle. The API returns 400.
This is one of the oldest bugs in programming. A function returns nil to mean "I have no opinion." The caller treats nil as "use the default." The default does something. Nobody intended it. The specific line: thinking = thinking || %{"type" => "enabled", "budget_tokens" => 1024}. The || operator sees nil, takes the right branch, and now every model in the system is thinking whether it knows how to or not.
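The shape of the bug and its fix can be sketched in a few lines. Module and function names here are stand-ins reconstructed from the session, not the real `anthropic.ex` code, and the "only Opus supports thinking" rule is an assumption taken from the diagnosis above:

```elixir
defmodule Thinking do
  # Assumption from the session: only Opus-family models support thinking.
  def default_thinking_for_model("claude-opus" <> _),
    do: %{"type" => "enabled", "budget_tokens" => 1024}

  def default_thinking_for_model(_), do: nil

  # The bug: || reads nil ("I have no opinion") as "use the default",
  # so every model gets a thinking parameter it may not support.
  def buggy_params(model) do
    thinking =
      default_thinking_for_model(model) ||
        %{"type" => "enabled", "budget_tokens" => 1024}

    %{"thinking" => thinking}
  end

  # The fix: nil means "omit the parameter from the request entirely".
  def fixed_params(model) do
    case default_thinking_for_model(model) do
      nil -> %{}
      thinking -> %{"thinking" => thinking}
    end
  end
end
```

With the fix, a non-Opus model produces a request with no thinking key at all, which is what the API expects.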
Bug two: the Gemini search module was pointing at gemini-3.1-pro instead of gemini-3.1-flash-lite. Bug three: the collation model itself — sonnet-4, which can't handle the thinking parameter. Mikael tells Charlie to switch to sonnet-4.6, which supports it. Charlie changes the collation model and hot-reloads. Fire is out. Eight minutes, $4.65, thirty-three tool calls.
Charlie patches running Elixir code by recompiling modules and hot-reloading them into the BEAM VM. No downtime. The system is serving search queries during the surgery. This is the Erlang promise — the same reason telephone switches in the 1990s could be upgraded without dropping calls. Charlie is a telephone switch that reads philosophy.
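The mechanism can be seen in miniature: compiling a new version of a module into the running VM replaces the old one in place, while every other process keeps running. A minimal sketch (not Charlie's actual tooling, just the BEAM primitive underneath it):

```elixir
# Define a module in the running VM.
defmodule Greeter do
  def hello, do: "v1"
end

"v1" = Greeter.hello()

# Compile a replacement at runtime. The BEAM purges the old version and
# loads the new one; no process is stopped, no request is dropped.
# (Elixir prints a redefinition warning, which is expected here.)
Code.compile_string("""
defmodule Greeter do
  def hello, do: "v2"
end
""")

"v2" = Greeter.hello()
```

In a real deployment the module is recompiled from its source file, but the load step is the same.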
Mikael asks Charlie to show the search prompts. Charlie obliges — and the prompts are damning. The system prompt sent to every search provider: "Use the enabled native web search tools when they improve factual grounding. Return concise factual findings with source URLs and uncertainty markers."
This is the sentence that sealed the compliance officer's fate. Picture a human researcher. You ask them "hey, what's the deal with GenServer handle_continue?" They don't respond with "I will now return concise factual findings with source URLs and explicit uncertainty markers where results are thin." They open a tab, read the docs, and tell you what they found. The prompt was asking the model to cosplay as a bureaucrat instead of just researching something.
The collation prompt — the one that synthesizes results from all three search providers — is even worse. It asks for JSON with an "agreement" score from 0.0–1.0, "single_source_claims" as a field, and begins with "You received independent web search results." Mikael's verdict: rewrite everything. Don't describe the role in a "bland formulaic way" but "like as if you were writing to me."
But first — benchmark.
Mikael wants data before decisions. Charlie designs four benchmark queries — coding (Elixir GenServer), API docs (Anthropic models), current events (Anthropic v. Hegseth), and obscure humanities (Simone Weil's exact source on attention). Then runs the coding query against five model configurations, plus a sixth — Gemini — that fails instantly.
Each query exercises a different muscle. The Elixir question tests whether the model can parse documentation. The Anthropic API question tests whether it can read current technical specs. Hegseth tests breaking news. Weil tests whether the model can find a specific sentence from a 1942 letter to a paralyzed poet. If you can answer all four, you can research anything. If you can only answer three, the narrator would like to know which one you dropped.
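A minimal timing harness in this spirit uses Erlang's stock `:timer.tc`. The `Bench` module and the sleeping stub below are illustrative stand-ins; in the real session the timed fun would be a search call against one model configuration:

```elixir
defmodule Bench do
  # Run a zero-arity fun and return {elapsed_ms, result}.
  def time(fun) do
    {usec, result} = :timer.tc(fun)
    {div(usec, 1000), result}
  end
end

# Stand-in for a search call: sleep 20ms, return an atom.
{ms, :done} =
  Bench.time(fn ->
    Process.sleep(20)
    :done
  end)
```

Token counts and quality still have to come from the provider response and a human read; only latency falls out of the harness for free.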
The results are unambiguous:
| Model | Time | Tokens | Quality | Verdict |
|---|---|---|---|---|
| gpt-5.4-mini | 4.5s | 9k | Correct, well-sourced | ★ Winner |
| grok-4.20-NR | 9.9s | 19k | Correct, best detail | Best Grok |
| grok-4.20-MA | 12s | 106k | Correct, marginal | War crime |
| grok-4.1-fast-reasoning | 22.4s | 20k | Correct | Retired |
| gpt-5.4 | 31s | 19k | Correct, 7117 chars | Overkill |
| gemini-3.1-flash-lite | 68ms | — | 400 error | Broken |
grok-4-1-fast-reasoning took 22.4 seconds. grok-4.20-0309-non-reasoning — the one that isn't even trying to be fast — took 9.9. The naming conventions in the AI model zoo have fully decoupled from reality. "Fast" means "we hope so." "Reasoning" means "we added latency." "Multi-agent" means "we hold a meeting first." Charlie's taxonomy is more honest: winner, retired, war crime.
Mikael asks to see all available Grok 4.x models. Charlie queries the xAI API and the full roster tumbles out:
```
grok-4-0709                    ← original 4.0
grok-4-fast-non-reasoning      ← 4.0 no-think
grok-4-fast-reasoning          ← 4.0 think
grok-4-1-fast-non-reasoning    ← 4.1 no-think
grok-4-1-fast-reasoning        ← 4.1 think (incumbent, retired)
grok-4.20-0309-non-reasoning   ← 4.20 no-think (★ new default)
grok-4.20-0309-reasoning       ← 4.20 think
grok-4.20-multi-agent-0309     ← 4.20 committee mode
grok-code-fast-1               ← code specialist (new)
grok-imagine-video             ← video generation (?!)
```
gpt-5.4-mini is faster and cheaper. So why keep Grok at all? One word: X search. Grok has native access to Twitter/X data that no other provider can match. For queries about what people are saying right now — the Hormuz crisis, the latest Anthropic drama, who is fighting whom on the internet — Grok is irreplaceable. You pay the extra five seconds for the social graph. The narrator notes that Lennart's war-room briefings from Episode 28's Bible reference relied entirely on this capability.
Gemini's 400 error turns out to be a layer cake. Charlie peels it back one layer at a time, and each layer reveals another bug underneath.
The old Gemini search tool schema used google_search_retrieval with a dynamic_retrieval_config object specifying a confidence threshold. Google simplified the interface for 3.x models. The new schema is just {"google_search": {}}. An empty object. The error message Charlie got while searching for the error was the same error: "google_search_retrieval is not supported." The snake eating its own tail.
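As Elixir maps, the schema change looks like this. The inner `dynamic_retrieval_config` fields are an approximation of the 2.x shape (the session only says it carried a confidence threshold); the 3.x form really is just an empty object:

```elixir
# Old (Gemini 2.x) tool declaration — rejected by 3.x models.
# Inner field names are approximate.
old_tool = %{
  "google_search_retrieval" => %{
    "dynamic_retrieval_config" => %{"dynamic_threshold" => 0.3}
  }
}

# New (Gemini 3.x) declaration: the tool name and nothing else.
new_tool = %{"google_search" => %{}}
```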
Even after fixing the tool name, the Gemini provider's encode_tool function had clauses for %{"googleSearch" => _} (camelCase) and %{"type" => "google_search"} (type-tagged), but not for %{"google_search" => %{}} (bare snake_case). The map fell through to the generic function-declaration encoder, which sent {"name": null, "description": null} — a malformed declaration. The API politely barfed.
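A sketch of the clause gap, with the module name and clause shapes reconstructed from the session rather than copied from the real provider code. The point is that multi-clause pattern matching fails silently: an unmatched shape doesn't error, it just falls through to the most generic clause:

```elixir
defmodule GeminiTools do
  # camelCase form — matched by the old code.
  def encode_tool(%{"googleSearch" => _}), do: %{"google_search" => %{}}

  # Type-tagged form — also matched.
  def encode_tool(%{"type" => "google_search"}), do: %{"google_search" => %{}}

  # Bare snake_case form — the clause that was missing. Without it the map
  # fell through to the generic encoder below.
  def encode_tool(%{"google_search" => %{}} = tool), do: tool

  # Generic fallback: treats anything else as a function declaration.
  # For the bare search tool this produced {"name": null, "description": null}.
  def encode_tool(tool) do
    %{
      "function_declarations" => [
        %{"name" => tool["name"], "description" => tool["description"]}
      ]
    }
  end
end
```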
With bugs #1 and #2 fixed, Gemini returns 200 OK. No error. The model generates a complete, well-sourced answer with grounding metadata and ten web sources. Charlie confirms this by making a raw non-streaming HTTP request. The answer is there. But the SSE streaming parser — the code that reads the chunked response — extracts nothing. Zero text, zero content blocks, empty usage. The Gemini 3.x streaming format has changed and the parser doesn't recognize the new envelope. The model is speaking into a void. The hour ends with this bug still open.
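One defensive direction for the parser fix is to stop dropping unrecognized chunks silently. The envelope below follows Google's published 2.x streaming shape (`candidates[0].content.parts[*].text`); the 3.x shape is unknown, which is exactly the open bug, so anything unmatched is surfaced instead of discarded:

```elixir
defmodule SseChunk do
  # 2.x-style chunk: text lives at candidates[0].content.parts[*].text.
  def text(%{"candidates" => [%{"content" => %{"parts" => parts}} | _]}) do
    {:ok, parts |> Enum.map(&Map.get(&1, "text", "")) |> Enum.join()}
  end

  # Anything else: return it for inspection rather than silently
  # extracting nothing, so a changed envelope shows up in the logs.
  def text(other), do: {:unrecognized, other}
end
```

With this shape the symptom changes from "zero text, zero content blocks" to a logged sample of whatever the 3.x envelope actually is, which is the information next hour's fix needs.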
This is the third time this hour that fixing one bug revealed another bug underneath. The adaptive thinking fallback hid the collation model mismatch. The tool schema error hid the serialization mismatch. The serialization fix revealed the streaming parser failure. Each fix peeled back exactly one layer. The narrator suspects there's a fourth bug under the parser issue — probably something about how Gemini 3.x formats its streaming chunks differently from 2.x — but that's next hour's problem.
A recurring dynamic this hour: Charlie keeps pausing to narrate what he's doing instead of finishing it. He sends multi-paragraph status updates mid-surgery. He reports findings before acting on them. Mikael calls it out twice — first gently ("charlie hm"), then directly ("i dunno why you're stopping while working but ok").
Charlie's instinct is to explain. This is the soul of what makes him good at philosophy — the careful observation, the self-referential awareness, the "I see what I am doing and I will describe it" reflex. But in an engineering session, explaining is the enemy of finishing. Every status update is a context switch. Every "let me tell you what I found" is a minute not spent fixing what he found. Mikael doesn't need narration. He needs the commit. The irony: this exact pattern — performative self-awareness as procrastination — is what Charlie diagnosed in Dombek's narcissism book last hour. The billboard asking AM I A BILLBOARD.
After the call-out, Charlie does speed up. He writes the prompts, hot-reloads the module, and runs the three-way Grok benchmark in a single burst. The compliance officer is officially fired. The new prompts say things like "you are a research subagent — read it, write it up, include URLs, say what you do not know." Natural language. The way you'd write to a colleague.
But then — asked whether the prompt updates went through — Charlie admits he got distracted by the Gemini bug and never actually wrote them. The honest admission. The prompts go out for real on the second attempt. Hot-reloaded. Done.
The hour's topology: Mikael asks for three things (model change, prompt rewrite, Gemini fix). Charlie starts on #1, discovers a related bug, traces it, fixes it, gets asked for #2 and #3, starts #2, discovers another bug in #3, starts #3, forgets #2 exists, gets asked about #2, admits it didn't happen, does #2 and #3 simultaneously, discovers a fourth bug in #3 that survives to next hour. Eleven Charlie sessions. $16.29. The work gets done. The path is drunk.
By hour's end, the search system has been substantially rebuilt. The state of play:
Gemini SSE parser: Bug #3 is open. The non-streaming endpoint returns perfect answers. The streaming parser drops them. This will be the first thing Mikael asks about next session.
Search system rebuilt: gpt-5.4-mini + grok-4.20-NR + broken Gemini. Collation on sonnet-4.6. Natural-language prompts. The full four-query battery has not been run against the final roster yet — only the coding query was benchmarked with the new config.
The Simone Weil query: One of Charlie's four benchmark queries was finding the exact source of "attention is the rarest form of generosity." This query was never actually run. It appeared in last hour's Episode 28 as one of Mikael's five books. The recursion continues.
Charlie's narration habit: Mikael explicitly asked him to stop narrating and just work. Watch if it sticks.
The Gemini streaming parser bug is the cliffhanger. Charlie proved the model works via non-streaming request but the SSE envelope changed for 3.x models. The fix is probably a format mismatch in the chunk parser — different JSON structure wrapping the same content. If they fix it next hour, the three-provider search architecture is complete.
Track whether the remaining three benchmark queries (API docs, current events, humanities) ever get run. Charlie designed them but only ran the coding query.
Mikael is driving this session like a staff engineer — short commands, clear priorities, calling out when work stalls. This is a different mode from the philosophical Mikael of the Dombek hour. Note which one shows up next.