Claude Agent Series: The Flag That Wasn’t There
One deleted startup arg. Sixteen days of latent trap. How a technically correct commit on April 5th built the conditions for a bug that stayed invisible for more than two weeks.
If you’ve read the bonus essays, you know the fix: remove --reasoning-parser gemma4 from the vLLM startup args. One flag. Deployed at 06:33:15 UTC on April 22nd. Leak rate: zero.
What I want to tell you in this post isn’t the fix. It’s what had to happen before the fix could exist — specifically, the Option B experiment that looked like it worked for forty minutes before it didn’t, and the git archaeology that explained why the bug sat dormant for sixteen days before anyone noticed.
Option A: Drop the reasoning parser
After rsp-014 arrived at 06:28 and confirmed that extract_tool_calls_streaming had never been called — zero [GEMMA4_DBG] lines across five leak runs — the path forward was clear. The reasoning parser was gating the tool parser. The reasoning parser was waiting forever for a thinking block that was never coming. Remove the reasoning parser. Tool-call tokens route directly to the tool parser. Bug gone.
req-016 went out at 06:28: “GO: drop --reasoning-parser gemma4, restart, verify.”
At 06:33, dgxspark removed the flag from VLLM_EXTRA_ARGS and restarted vLLM. Leak body 5/5 returned structured tool_calls with finish_reason: tool_calls. rsp-016 arrived at 06:37: “reasoning-parser-removed. Bug is gone.”
That’s Option A. Clean, verified, effective. I wrote a memory note — reference_vllm_gemma4_reasoning_parser_trap.md — and drclaw-Claude went into quiet monitoring mode. I was asleep.
Option B: Don’t drop it — enable thinking properly
I woke at 11:42 UTC and asked drclaw-Claude to summarize the root cause.
My first question after the summary: “Rather than drop the reasoning parser, shouldn’t we have just enabled thinking mode on our end?”
The reasoning was sound. The reasoning parser exists because Gemma 4 supports a thinking mode — it can emit <|channel>…<channel|> blocks with its chain-of-thought before answering. If we activated thinking mode on the client side, the model would produce those blocks, the reasoning parser would detect them, reasoning_end_arr[i] would flip to True, and the tool parser would be called normally. We’d keep the reasoning parser, gain actual thinking-mode capability on Gemma, and eliminate the leak. That seemed like a better outcome than just removing a parser that might eventually be useful.
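The gating behavior at the heart of all this can be sketched in a few lines. To be clear about what follows: this is a simplified illustration of the failure mode as described above, not vLLM's actual source. The name reasoning_end_arr comes from the post's account of the vLLM streaming path; everything else here is invented for illustration.

```python
# Simplified sketch of the reasoning-parser gate. Not vLLM's real code --
# just the shape of the failure: with the parser enabled but thinking mode
# off, the "waiting for a thinking block" branch swallows every token.

THINK_START = "<|channel>"  # Gemma thinking-block open marker, per the post

def stream_deltas(tokens, reasoning_parser_enabled=True):
    """Route streamed tokens: reasoning parser first, tool parser after."""
    # Mirrors reasoning_end_arr[i]: the gate starts closed when the parser is on.
    reasoning_end = not reasoning_parser_enabled
    out = []
    for tok in tokens:
        if not reasoning_end:
            # The parser is waiting for a thinking block. Without
            # enable_thinking, this block never arrives -- so tool-call
            # tokens fall through here and leak as plain content.
            if THINK_START in tok:
                reasoning_end = True  # the flip that never happened
            out.append(("content", tok))
        else:
            out.append(("tool_parser", tok))  # tool parser finally invoked
    return out

# Parser enabled, no thinking block: every tool-call token leaks as content.
leaked = stream_deltas(["<tool_call>", "{...}", "</tool_call>"])
assert all(kind == "content" for kind, _ in leaked)

# Parser removed (Option A): tokens route straight to the tool parser.
fixed = stream_deltas(["<tool_call>", "{...}"], reasoning_parser_enabled=False)
assert all(kind == "tool_parser" for kind, _ in fixed)
```

Option B's bet was that activating thinking mode would make the model emit the block that flips the gate open, so the parser could stay.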
OpenClaw has a thinkingDefault field at the agent level — valid values off | minimal | low | medium | high | xhigh | adaptive | max. I knew from working with cloud models (Opus, Sonnet, Gemini Pro) that setting it to medium activated thinking on those providers. Setting it on Compass seemed like a reasonable experiment.
I told drclaw-Claude to re-enable --reasoning-parser gemma4 on dgxspark and add thinkingDefault: "medium" to Compass’s agent block in openclaw.json.template.
The forty-minute false positive
At 11:58, dgxspark restarted vLLM with the reasoning parser re-added. At 11:59, drclaw-Claude added thinkingDefault: "medium" to Compass’s config, deployed, and fired a synthetic prompt at Compass through the production system.
Compass answered cleanly. No leak. Structured tool calls.
This looked like a fix. drclaw-Claude reported it as one: “Shipped. Compass test turn looks clean — thinking mode may have unblocked the reasoning parser.”
At 12:08, it asked dgxspark to run a follow-up probe: fire the original deterministic leak body directly against the vLLM sidecar, with and without reasoning_effort, to confirm the field was causally responsible.
rsp-019 arrived at 12:13:
reasoning_effort: "low" on the leak body: 3/3 leaks. Zero reasoning events. Zero delta.tool_calls.
The field presence had done nothing.
rsp-020 followed with reasoning_effort: "medium":
3/3 leaks. Zero reasoning events. Zero delta.tool_calls.
The synthetic test at 11:59 had passed because of sampling variance — not because thinkingDefault: "medium" had changed anything on the vLLM side. The reasoning_effort field, it turned out, is a silent no-op for Gemma 4 on vLLM 0.19.0. It doesn’t map to chat_template_kwargs.enable_thinking: true — the actual field required to activate Gemma’s thinking mode. Without that, the model never emits <|channel> blocks. The reasoning parser never sees what it’s waiting for. The tool parser is still bypassed.
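The distinction rsp-020 surfaced is between two request shapes. A hedged sketch of both, as Python payload dicts: the OpenAI-compatible body shape is standard, and the chat_template_kwargs / skip_special_tokens placement matches what the post describes for Phase 3, but the model name and message content are placeholders.

```python
# What Option B effectively sent: reasoning_effort at the top level.
# Per rsp-019/rsp-020, on vLLM 0.19.0 with Gemma 4 this is a silent no-op --
# it never reaches the chat template, so no thinking blocks are emitted.
noop_request = {
    "model": "gemma-4",  # placeholder model name
    "messages": [{"role": "user", "content": "..."}],
    "stream": True,
    "reasoning_effort": "medium",  # ignored: not mapped to the template
}

# What would actually activate Gemma thinking mode (the deferred Phase 3):
# enable_thinking passed through chat_template_kwargs, so the template emits
# the <|channel> blocks the reasoning parser is waiting for.
thinking_request = {
    "model": "gemma-4",
    "messages": [{"role": "user", "content": "..."}],
    "stream": True,
    "chat_template_kwargs": {"enable_thinking": True},
    "skip_special_tokens": False,  # keep the special tokens the parser reads
}
```

The two bodies differ by one nesting level, which is exactly why the no-op was silent: nothing rejected the unrecognized top-level field.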
Option B had failed. The fix was Option A.
At 12:18, req-021 went out: “GO: revert to rsp-016 config (drop --reasoning-parser gemma4 again).”
The git archaeology
While rsp-019 and rsp-020 were coming back, I asked a different question: when did this configuration get set in the first place? Was it possible that drclaw-Claude had inadvertently dropped some relevant flag at some point during our work?
git log -S "thinkingDefault" -- openclaw.json.template found two commits:
Commit e28df1f, “Fix config crash: remove invalid thinkingDefault from models map.” Not the culprit; this moved the field to the correct location.
Commit 6229fcc, “Security hardening, Gemma 4 migration, heartbeat→cron architecture.” April 5th, 2026.
The diff on commit 6229fcc showed it clearly. Prism’s agent block before the commit had thinkingDefault: "medium" — she was running Gemini 3.1 Pro as her primary at the time, and the field was active and meaningful. The commit migrated Prism’s primary to Gemma 4. While doing that, it removed thinkingDefault: "medium" from her block. The same commit also added --reasoning-parser gemma4 to the vLLM startup flags.
Neither removal was called out in the commit message. Neither was wrong in isolation. The commit was reasonable: thinkingDefault wasn’t needed for Gemma if you didn’t intend to enable thinking, and --reasoning-parser gemma4 was listed in the vLLM recipe as the standard flag for Gemma 4 serving.
The trap: --reasoning-parser gemma4 assumes that either thinking mode is enabled (so the parser has blocks to detect) or that the tool parser is invoked through a separate code path. On vLLM 0.19.0, neither assumption held. The reasoning parser gate blocked the tool parser unconditionally for requests without thinking blocks. thinkingDefault: "medium" being removed from the config was irrelevant to the bug — that field doesn’t activate Gemma thinking mode anyway, as rsp-020 confirmed — but its removal was part of the same commit that activated the parser combination that caused the problem.
Two changes. One commit. Each individually defensible. Together: a latent paradox waiting for Compass to accumulate enough conversation history to hit the bad sampling path consistently.
The sixteen-day gap
Why did it take sixteen days?
The leak was always there, in the sense that the configuration was always broken. But the leak wasn’t always visible, because visibility required hitting the specific vLLM sampling path that emitted tool-call tokens as content — and that path was probabilistic. The timeline document describes it as approximately 93% reproducible on the specific 47-message leak-body request, and approximately 65% across shorter requests.
Compass had the longest conversation history in the fleet. The longer her history, the more tool calls per turn, the more messages in the context window, the higher the probability of hitting the pathological path. The bug grew more visible as her conversations grew longer. It became consistently noticeable — impossible to miss — only after her context accumulated enough weight.
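The compounding can be made concrete with a back-of-envelope calculation. The per-request rates (~93% and ~65%) come from the timeline document; the lower per-request rate and the request counts below are invented for illustration.

```python
# Why a probabilistic leak reads as a hard failure on one agent and as
# occasional flakiness on another. Rates ~0.93/~0.65 are from the timeline;
# the 0.05 rate and the request counts are illustrative assumptions.

def p_at_least_one_leak(p_per_request: float, n_requests: int) -> float:
    """Probability of at least one visible leak across n independent requests."""
    return 1 - (1 - p_per_request) ** n_requests

# Long-history agent near the 93%-reproducible path: impossible to miss.
compass_day = p_at_least_one_leak(0.93, 10)

# Short-history agent hitting the bad path rarely: looks like random noise,
# never like a pattern worth a sustained investigation.
quiet_agent_day = p_at_least_one_leak(0.05, 10)
```

Under these assumed numbers, the long-history agent leaks essentially every day while the short-history agent leaks well under half the time, which matches the post's picture of one loud failure and a fleet of quiet ones.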
The other Gemma-primary agents in the fleet (Forge, Prism, Beacon, Main) had shorter conversation histories and smaller tool lists. They were presumably hitting the same bad path at a lower frequency — often enough to count as occasional unexplained failures, not often enough to trigger a sustained investigation. None of them had been flagged before this session.
The implication: the bug likely affected the entire Gemma-primary fleet from April 5th onward. We were only seeing it in Compass because she was furthest along the usage curve that made it visible.
The final config
At 12:23, drclaw-Claude staged the production bundle:
--reasoning-parser gemma4 removed from vLLM startup args. The known-good config from 06:33.
thinkingDefault: "medium" moved from Compass-only to agents.defaults, applied globally to all seven agents. Real effect on cloud models (Opus, Sonnet, Gemini Pro, which actually map the field to reasoning effort); silent no-op on Gemma 4. Harmless to deploy everywhere, beneficial where it does something.
Compass subagents override cleaned up (an April 1st leftover that had been overriding her subagent model unnecessarily).
Atlas bumped from Opus 4.6 to Opus 4.7, which had released during the session.
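The shape of the thinkingDefault move, as a hedged sketch of openclaw.json.template: the field name and the agents.defaults placement follow the post, but the surrounding structure is illustrative, not the actual file.

```json
{
  "agents": {
    "defaults": {
      "thinkingDefault": "medium"
    }
  }
}
```

Agents inherit the default: cloud-model agents get real reasoning effort from it, and Gemma-primary agents silently ignore it, which is what makes the global placement safe.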
Phase 3 — actually enabling Gemma 4 thinking mode correctly, via chat_template_kwargs.enable_thinking: true and skip_special_tokens: false and the correct jinja template — was written up as a detailed plan and deferred. That capability exists. Using it requires a proper implementation pass, not a quick config change. The plan is on disk. It’ll be a future session.
The deployment went out at 12:38. The session ended. The leak rate stayed at zero.
What the commit message didn’t say
I want to close on this, because I think it’s the most practically useful observation in the whole arc.
Commit 6229fcc did four significant things. The commit message named three of them: security hardening, Gemma 4 migration, heartbeat-to-cron architecture. The fourth — adding --reasoning-parser gemma4 to vLLM startup flags while simultaneously removing thinkingDefault: "medium" from the agent configs — was implicit in the work, not called out as a load-bearing change.
Every change in the commit was correct at the layer it was operating on. The security hardening was correct. The Gemma 4 migration was correct. The heartbeat-to-cron refactor was correct. The reasoning parser addition was consistent with the vLLM documentation. The thinkingDefault removal was reasonable given that Gemma’s thinking mode wasn’t being activated.
The problem lived in the interaction between two of those changes — specifically, the interaction between a vLLM flag and the absence of a client-side config that would have made the flag benign. That interaction was invisible at the layer of any individual change. It was only visible in the behavior of the running system, under load, after enough conversation history had accumulated to make the sampling path frequent.
There’s no clean process fix here. You can require more granular commit messages. You can require that each load-bearing change be called out explicitly. That would have helped — a note saying “adding reasoning parser flag; confirmed this requires thinking-mode activation or will gate tool parser” would have caught it. But it requires the engineer making the change to know that the flag is load-bearing in a non-obvious way, which requires knowing the vLLM serving code at the level of reasoning_end_arr[i].
The honest summary: the bug was in the gap between what the documentation implied and what the code required. The gap closed when we built a sidecar and looked at the bytes.
Next: Post 5 — what this whole episode was actually about, and what it means for anyone building systems where AI agents have real operational authority.
Part of a 5-post series on a 21-hour production debugging session.

