Claude Agent Series: The Wrong Fix
We had already deployed a patch by the time the bug surfaced. It fixed a real problem in the wrong layer.
The fork was live before the leak showed up.
That’s the part I keep coming back to when I think about the sequence of events. At 04:09 UTC on April 22nd, I authorized a reboot. The fork of OpenClaw I’d asked drclaw-Claude to build overnight went live: 2026.3.28-erikdj-gemma-fix.1, kernel 6.17.0-1012-aws, all seven bots reconnected cleanly. At 04:15, drclaw-Claude confirmed the deployment. At 04:16, it noted: “5 hits in 4 minutes — higher than yesterday’s rate.”
At 04:17, Compass leaked.
The fork had been built with real engineering rigor — about three hours of overnight work, a cherry-picked commit from upstream PR #61956, a one-line patch in src/agents/openai-ws-message-conversion.ts that addressed malformed tool call arguments being silently replaced with empty objects. That’s a real Gemma 4 bug. The patch is correct. It’s still in production.
It wasn’t the bug we had.
The two Gemma bugs
The patch I’d asked for addressed a deserialization failure at the OpenClaw layer. When Gemma 4’s tool call arguments arrive malformed — syntactically invalid JSON, missing fields, incorrect structure — the OpenClaw deserializer was silently replacing them with {}, an empty object. Tool calls would fire but with no arguments. Silent failure, hard to trace.
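For illustration, the failure class looks roughly like this. This is a Python sketch, not the actual one-line TypeScript change in src/agents/openai-ws-message-conversion.ts; the function names are hypothetical, and the second function is one plausible shape of a fix, not the actual upstream commit.

```python
import json

def parse_tool_arguments_prepatch(raw_args: str) -> dict:
    # Pre-patch behavior: malformed arguments are quietly swapped for {}.
    # The tool call still fires, just with no arguments: a silent failure.
    try:
        return json.loads(raw_args)
    except (json.JSONDecodeError, TypeError):
        return {}

def parse_tool_arguments_patched(raw_args: str) -> dict:
    # One plausible shape of a fix: surface the malformed payload instead
    # of hiding it, so the failure is visible and traceable.
    try:
        return json.loads(raw_args)
    except (json.JSONDecodeError, TypeError) as exc:
        raise ValueError(f"malformed tool call arguments: {raw_args!r}") from exc
```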
The content-delta leak is a different failure. It’s not a deserialization problem. The structured tool call object never arrives in the first place — the model emits delta.content instead of delta.tool_calls, with finish_reason: stop instead of finish_reason: tool_calls. There’s nothing to deserialize. The parser never gets a structured payload to work with.
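Concretely, here is roughly what the two cases look like as streamed chat-completion chunks. These are Python literals for illustration only: the tool name, arguments, and the exact leaked text are stand-ins, and the field shapes follow the OpenAI-style streaming format that vLLM serves.

```python
# Healthy case: the tool call arrives as a structured delta.tool_calls entry,
# and the stream eventually closes with finish_reason "tool_calls".
tool_call_chunk = {
    "choices": [{
        "delta": {
            "tool_calls": [{
                "index": 0,
                "function": {"name": "read_file", "arguments": '{"path": "notes.md"}'},
            }]
        },
        "finish_reason": None,
    }]
}

# Leak case: the same intent arrives as plain text in delta.content, and the
# stream closes with finish_reason "stop". There is no structured payload for
# a deserializer to mishandle, and nothing for the malformed-args patch to fix.
leaked_chunk = {
    "choices": [{
        "delta": {"content": '<|tool_call>{"name": "read_file", "arguments": {"path": "notes.md"}}'},
        "finish_reason": None,
    }]
}
```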
One bug lives at the output-handling layer inside OpenClaw. The other bug lives somewhere in the inference pipeline, before OpenClaw ever sees the response.
I had fixed the first bug. I had not fixed the second bug. When I said — at 04:26, after seeing the clean-session reproduction — “our patched openclaw code should have already fixed this,” I was using evidence correctly: the fix I’d deployed should have addressed the class of problem I thought we had. The fact that it hadn’t was diagnostic. The bug was upstream of the layer I’d been working in.
This is easier to see in retrospect than it was in real time. In real time, I had an overnight’s worth of engineering work confirming the malformed-args theory, a deployment I’d been tracking step-by-step, and a production system that had been stable on 2026.3.28 for hours. The fork felt like the solution. When it wasn’t, it took a beat to recalibrate.
What happened the night before
The preceding twelve hours matter as context, because they’re not wasted time — they’re the substrate the debugging session grew from.
On April 21st at 15:41 UTC, I kicked off a recurring monitoring loop for drclaw-Claude: watch the logs for tool call failures, agent failures, vLLM issues. That loop ran for the rest of the session. At 15:44, Atlas had already been complaining about a missing Slack tool during a cron run. Something was wrong with how OpenClaw was handling Slack connections.
What followed was eight hours of Slack socket-mode debugging. Seven Slack apps meant seven persistent WebSocket connections, and the channelStaleEventThresholdMinutes default of 30 minutes was creating a constant restart loop — connections would go stale, OpenClaw would detect pong timeouts and attempt to reconnect, the reconnect cycle would pile up across all seven apps simultaneously. Events would stop flowing. Agents would go quiet.
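A toy illustration of that dynamic, and only that: this is not OpenClaw's implementation, the constant mirrors the channelStaleEventThresholdMinutes default, and the pong window is an assumption.

```python
STALE_THRESHOLD_MINUTES = 30  # mirrors the channelStaleEventThresholdMinutes default
PONG_TIMEOUT_SECONDS = 60     # assumed pong window

def should_reconnect(seconds_since_event: float, seconds_since_pong: float) -> bool:
    # A per-connection check like this, running independently for seven Slack
    # apps, produces the pile-up described above: each connection that looks
    # stale or misses a pong tears itself down and reconnects, and the
    # reconnect cycles overlap instead of settling.
    stale = seconds_since_event > STALE_THRESHOLD_MINUTES * 60
    pong_timed_out = seconds_since_pong > PONG_TIMEOUT_SECONDS
    return stale or pong_timed_out
```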
This traced to upstream issue #67672 (“multi-account Slack leaked connections accumulate”). It affected 4.14, 4.15, and 4.15-beta.2. After testing all three, we concluded the issue was unresolvable without an upstream fix. Decision made around midnight: pin to 2026.3.28, which didn’t have the regression.
The rollback worked. At 00:15, I confirmed the first successful event-and-reply on 3.28. That success opened the path to the fork work.
drclaw-Claude spent the next three hours building the cherry-pick. The technical steps were clean: clone the 3.28 tag, create the 3.28-gemma-toolargs-fix branch, cherry-pick commit 71bd9e0, run pnpm install, hit a permissions error on /home/ubuntu/.npm, fix it with chown, build success in 1 minute 35 seconds. Version bump. Tarball pack. Create the erikdj/openclaw-fork repo on GitHub. Push the branch. Authorize DevClawBot. Test write access on issue #1.
When I came back at 03:25 and said “keep going but don’t reboot the box,” there was a packaged tarball waiting. When I finally authorized the reboot at 04:09, the installation was twenty seconds of npm install -g.
All of that work was real and correct. The fork runs on production today. The malformed-args patch is in effect. It just isn’t what fixed the leak.
Going to the wire
After the clean-session falsification at 04:26, the working hypothesis shifted: somewhere in the vLLM inference path, the model was producing tool call tokens but the system wasn’t routing them correctly. The leak was in the pipeline, not in the output handler.
drclaw-Claude tried the obvious approach first: probe vLLM directly via curl. Standard chat completion request with tools, streaming mode enabled. Then a variant with a prior <|tool_call> literal in the conversation history, to see if the leak was input-sensitive. The curl tests were inconclusive — not because the bug wasn’t there, but because we were probing from the outside with a minimal request and the bug was path-dependent. Compass’s conversations that leaked were long, tool-heavy, context-rich. A cold curl test didn’t replicate those conditions.
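For reference, the shape of that probe: the real tests used plain curl, but this is an equivalent sketch with httpx. The endpoint port, tool schema, and prompt are assumptions; the model name is the one reported in the baseline later in this post.

```python
import httpx

payload = {
    "model": "google/gemma-4-31B-it",
    "stream": True,
    "tools": [{
        "type": "function",
        "function": {
            "name": "read_file",  # stand-in tool, not the real schema
            "parameters": {"type": "object", "properties": {"path": {"type": "string"}}},
        },
    }],
    # The variant test added a prior message containing a literal <|tool_call>
    # string to the conversation history before this user turn.
    "messages": [{"role": "user", "content": "Read notes.md and summarize it."}],
}

with httpx.stream("POST", "http://dgxspark:8000/v1/chat/completions",
                  json=payload, timeout=60) as r:
    for line in r.iter_lines():
        if line.startswith("data: "):
            print(line)  # inspect each SSE event: delta.tool_calls vs delta.content
```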
At 04:28, I said: “what if we set up additional logging on the vLLM side? We’re kind of guessing.”
That was the right instinct. We needed observability, not more hypothesis generation. Everything we’d been doing was inference from downstream artifacts: the leaked Slack messages, the curl responses, the Mem0 contents. We needed to see the actual bytes on the wire between OpenClaw and vLLM.
Getting there required access to the DGX box. I run the vLLM server on a separate machine — dgxspark, connected via Tailscale. To get drclaw-Claude onto that machine, I needed to open a firewall rule and drop an SSH key. We started working on it.
At 04:30, I told drclaw-Claude to generate an SSH keypair.
The protocol
I want to describe what happened next carefully, because the engineering choice here is unusual.
The plan wasn’t to give drclaw-Claude direct shell access to dgxspark in the normal sense. I run another Claude Code agent on the DGX — it monitors the vLLM server, checks inference logs, handles maintenance tasks. The idea was to use that agent as a hands-on collaborator: drclaw-Claude would write request files to ~/vllm/agent/req-NNN.md, the dgxspark agent would pick them up at a 2-second poll interval, act on them, and write back rsp-NNN.md files.
File-based IPC. Heredocs over SSH. No sync daemon, no shared queue, no protocol negotiation. Just named files and polling.
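A sketch of the responder side of that contract. This is hypothetical code: the real dgxspark side was a Claude Code agent acting on each request, not a script, and it wrote responses under its own descriptive filenames. But the directory, the req/rsp pairing, and the 2-second poll are the same.

```python
import time
from pathlib import Path

AGENT_DIR = Path.home() / "vllm" / "agent"
POLL_INTERVAL_S = 2

def pending_requests() -> list[Path]:
    """Requests that don't yet have a matching response file."""
    reqs = sorted(AGENT_DIR.glob("req-*.md"))
    return [r for r in reqs
            if not (AGENT_DIR / r.name.replace("req-", "rsp-", 1)).exists()]

def handle(request_path: Path) -> str:
    # Placeholder: in the real session this step was the dgxspark agent
    # reading the request and operating the vLLM server accordingly.
    return f"ack: {request_path.name}"

while True:
    for req in pending_requests():
        body = handle(req)
        (AGENT_DIR / req.name.replace("req-", "rsp-", 1)).write_text(body)
    time.sleep(POLL_INTERVAL_S)
```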
This was not an architectural design sitting in a playbook. It was the fastest thing that could work given the constraint: drclaw-Claude needed to direct real operations on the DGX without having a persistent shell session, and the agent on dgxspark already had the necessary context and permissions to operate the vLLM server.
At 04:36, the first ssh dgxspark timed out. The tagged-device firewall rule I’d assumed was open wasn’t. It’s the kind of infrastructure assumption that seems obvious in the moment — of course I have SSH to my own inference box — until it runs into a Tailscale ACL that says otherwise.
I opened the firewall rule from my phone. “Try accessing dgxspark.” At 04:42, the connection worked.
The first request — req-001-intro-and-vllm-baseline.md — went out at 04:42. Thirteen minutes later, rsp-001 came back with the vLLM baseline: google/gemma-4-31B-it, container version vllm-gemma4:0.19.0, started with --tool-call-parser gemma4 and --reasoning-parser gemma4. The dgxspark agent had already reproduced the leak against its own server within fifteen minutes of reading the request.
The collaboration was faster than anything we’d managed over the previous twelve hours.
What the sidecar looked like
req-002 asked the dgxspark agent to build a Starlette+httpx reverse proxy on port 8001 — a sidecar that would sit between OpenClaw and vLLM, passing every byte through unchanged while logging the raw SSE stream. Byte-exact passthrough. TCP chunk boundaries preserved. Filtered to requests from drclaw’s Tailscale IP.
The specification was detailed because the details mattered: we weren’t just interested in the payload contents; we were interested in whether the leak tokens were arriving whole within a chunk or split across chunk boundaries. One hypothesis was that the leak might be an artifact of how vLLM was splitting SSE events at token boundaries: the special token <|tool_call> arriving split across two network packets, causing the parser to misclassify the first fragment.
If that were true, the chunk boundaries in the raw TCP stream would show it.
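A minimal sketch of that kind of passthrough proxy, built with the same Starlette + httpx pairing. The upstream address, log path, module name, and chunk delimiter are assumptions; the real sidecar also logged SSE events separately from raw chunks and filtered requests by source IP.

```python
import httpx
from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import StreamingResponse
from starlette.routing import Route

UPSTREAM = "http://localhost:8000"    # assumed vLLM address
LOG_PATH = "/tmp/sidecar-chunks.log"  # assumed log location

client = httpx.AsyncClient(base_url=UPSTREAM, timeout=None)

async def proxy(request: Request):
    upstream_req = client.build_request(
        request.method,
        request.url.path,
        content=await request.body(),
        headers={k: v for k, v in request.headers.items() if k.lower() != "host"},
    )
    upstream_resp = await client.send(upstream_req, stream=True)

    async def passthrough():
        # Forward every chunk unchanged, writing the raw bytes as they arrive
        # so chunk boundaries are preserved in the log.
        with open(LOG_PATH, "ab") as log:
            async for chunk in upstream_resp.aiter_raw():
                log.write(b"--- chunk ---\n" + chunk)
                yield chunk
        await upstream_resp.aclose()

    headers = {
        k: v for k, v in upstream_resp.headers.items()
        if k.lower() not in ("content-length", "transfer-encoding", "connection")
    }
    return StreamingResponse(passthrough(),
                             status_code=upstream_resp.status_code,
                             headers=headers)

app = Starlette(routes=[Route("/{path:path}", proxy, methods=["GET", "POST"])])
# Run with: uvicorn sidecar:app --port 8001
```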
By 04:54, the dgxspark agent had a sidecar running on port 8001. rsp-002 confirmed: Starlette+httpx, byte-exact passthrough, under one millisecond of TTFB overhead, logging TCP chunks and SSE events separately.
We cut OpenClaw over to the sidecar at 05:08. Edited the baseUrl in openclaw.json.template to point at http://dgxspark:8001 instead of the vLLM server directly. Restarted. All seven bots reconnected cleanly. The sidecar started recording.
The first clean capture
At 05:10, I fired a synthetic prompt at Compass through the production system. She answered cleanly — no leak, structured tool calls, 51 seconds end-to-end. Two sidecar captures recorded. The chunk-split hypothesis was the first target.
rsp-006 arrived at 05:17 with the analysis: the clean run was clean all the way to the wire. Three tool deltas emitted at exactly 1:1 TCP-to-SSE boundaries. The special tokens arrived in single, coherent chunks. There was no mid-packet split.
Good news and bad news. Good: the chunk-split theory was falsified cleanly. The measurement did its job. Bad: we still didn’t have a leak capture. Without a captured leak, the sidecar had nothing to analyze.
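One way to run that kind of boundary check, sketched against the chunk-log format assumed in the proxy sketch above. The token string and log path are assumptions; the real analysis in rsp-006 may have worked differently.

```python
TOKEN = b"<|tool_call>"

def split_occurrences(chunks: list[bytes]) -> int:
    """Count places where TOKEN straddles a boundary between two consecutive chunks."""
    splits = 0
    for left, right in zip(chunks, chunks[1:]):
        joined = left + right
        # Only indices where the token would start in `left` and end in `right`.
        for i in range(max(0, len(left) - len(TOKEN) + 1), len(left)):
            if joined[i:i + len(TOKEN)] == TOKEN:
                splits += 1
                break
    return splits

with open("/tmp/sidecar-chunks.log", "rb") as f:
    chunks = f.read().split(b"--- chunk ---\n")[1:]
print("token occurrences split across chunk boundaries:", split_occurrences(chunks))
```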
At 05:20, I triggered Compass from my phone.
The cascade that followed — ten requests in forty-nine seconds, Compass trying increasingly creative approaches to reading a credential file that didn’t exist — gave us something we hadn’t had yet: raw bytes of an actual leak, captured on the wire, in a sidecar log, before OpenClaw ever processed them.
rsp-008-leak-captured.md arrived at 05:28. The filename tells you what it contained.
That’s where the real work started.
Next: Post 3 — what we found when we analyzed those bytes, why the deep-dive analysis was technically correct about the wrong layer, and what it looks like when two instances of the same model coordinate on a production incident.
Part of a 5-post series on a 21-hour production debugging session.

