Claude Agent Series: Two Claudes, One Problem
What it looks like when two instances of the same model coordinate on a production incident — and why shared vocabulary and different sensors matter more than intelligence
The file showed up at 04:45 UTC: rsp-001-vllm-baseline.md.
Twenty-two minutes after drclaw-Claude sent the first request across the file-based IPC, the dgxspark agent had confirmed the vLLM baseline (google/gemma-4-31B-it, version vllm-gemma4:0.19.0), the startup flags (--tool-call-parser gemma4 and --reasoning-parser gemma4), and — this is the part that mattered — it had already reproduced the leak against its own server.
Fifteen minutes in, we had a confirmed local reproducer. That’s faster than anything I’d gotten from twelve hours of probing via curl from the outside.
Why the setup worked
Two Claude instances. Same base model. Different observational access to the same problem.
drclaw-Claude had spent hours trying to characterize the leak from the EC2 gateway side. It could send requests to vLLM and observe the responses. It could read OpenClaw logs. It could examine Mem0 contents. It could not look inside the vLLM container, check the Docker startup flags directly, grep the inference logs, or run instrumented replays against the server’s local state.
The dgxspark agent had all of that. Direct shell access. Container logs. File system inspection. The ability to replay specific request bodies against a controlled local environment and observe the results down to the byte.
The IPC protocol was simple by design. drclaw-Claude would write a request file: ssh dgxspark 'cat > ~/vllm/agent/req-NNN.md'. The dgxspark agent polled the directory at two-second intervals, read any new request, acted on it, wrote back rsp-NNN.md. No sync daemon, no shared queue, no protocol overhead. Just named markdown files over SSH.
The two-second poll latency was visible in practice — drclaw-Claude would write a request, then wait, watching for the response file to appear. Not instant. But the round-trip was predictable. When the response was large (rsp-003, the flag history archaeology, came in at 13.5 KB; rsp-008, the leak capture analysis, at similar size) it arrived in one chunk anyway. The constraint was latency, not bandwidth.
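Neither harness is reproduced in this series, but the responder side is simple enough to sketch. Here is a minimal version in Python, assuming the directory layout described above (~/vllm/agent/, req-NNN.md in, rsp-NNN.md out); the handler body is a placeholder for whatever work the dgxspark agent actually did on each request:

```python
import re
import time
from pathlib import Path

AGENT_DIR = Path.home() / "vllm" / "agent"   # assumed inbox/outbox directory
REQ_PATTERN = re.compile(r"req-(\d{3})\.md$")
POLL_INTERVAL = 2.0                          # the two-second poll described above


def handle_request(req_path: Path) -> str:
    """Act on one request and return the markdown body of the response.
    In the real setup this is where the agent did its work; here it is a stub."""
    body = req_path.read_text()
    return f"# Response to {req_path.name}\n\n(analysis of {len(body)} bytes of request)\n"


def poll_forever() -> None:
    seen: set[str] = set()
    while True:
        for req in sorted(AGENT_DIR.glob("req-*.md")):
            m = REQ_PATTERN.search(req.name)
            if not m or req.name in seen:
                continue
            seen.add(req.name)
            rsp = AGENT_DIR / f"rsp-{m.group(1)}.md"
            if rsp.exists():                 # already answered in a previous run
                continue
            tmp = rsp.with_suffix(".tmp")
            tmp.write_text(handle_request(req))
            tmp.rename(rsp)                  # publish atomically so readers never see a partial file
        time.sleep(POLL_INTERVAL)


if __name__ == "__main__":
    poll_forever()
```

The requesting side is the one-liner already quoted, ssh dgxspark 'cat > ~/vllm/agent/req-NNN.md', followed by watching for the matching rsp file to appear.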
What made this different from drclaw-Claude just having SSH access directly: the dgxspark agent had standing context about the vLLM configuration, the Docker setup, the file layout, the operational history. I didn’t have to re-explain the system from scratch in every request. When drclaw-Claude wrote req-002 specifying a Starlette+httpx sidecar on port 8001, byte-exact passthrough, TCP chunks and SSE events logged separately, filtered to drclaw’s IP — the dgxspark agent understood the design intent immediately and built exactly that. Same vocabulary. Same technical priors. Same understanding of why TCP chunk boundaries mattered to a token-split hypothesis.
That’s the advantage. Not intelligence — shared vocabulary and different sensors.
The parallel tracks
From 04:47 onward, the dgxspark agent ran multiple workstreams in parallel, and this is where the collaboration showed its value.
req-002 asked for the sidecar. req-003 asked for flag history archaeology: had anyone changed the vLLM startup flags recently? My recollection was that the leak had started about two weeks ago, but the initial Gemma 4 deployment on April 5th had worked cleanly. If the flags hadn’t changed, the problem had to be something else — a change in the model checkpoint, a change in what OpenClaw was sending.
rsp-003 came back at 05:03 with 13.5 KB of archaeology results. Conclusion: the startup flags had not changed in two weeks. --tool-call-parser gemma4 and --reasoning-parser gemma4 had been present since the April 5th deployment and had not been modified. The regression wasn’t on the vLLM side.
Which meant it had to be either a change in Gemma 4’s sampling distribution on long-context tool-use requests, or something about the shape of the requests OpenClaw was sending. Neither was good news. Both were harder to isolate than a flag change.
The sidecar went live while rsp-003 was in transit. At 05:08, I edited openclaw.json.template to point OpenClaw at http://dgxspark:8001 — the sidecar — instead of the vLLM server directly. Deployed. All seven bots reconnected. The sidecar started logging.
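The sidecar itself isn't published with this series, so here is a compressed sketch of the req-002 design: a Starlette+httpx passthrough on port 8001 that forwards the request body byte-for-byte and logs each upstream chunk as it arrives. The upstream address, log path, and header handling are assumptions; the real sidecar also logged parsed SSE events separately and filtered to drclaw's IP:

```python
import json
import time

import httpx
from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import StreamingResponse
from starlette.routing import Route

UPSTREAM = "http://localhost:8000"        # assumed vLLM address behind the sidecar
LOG_PATH = "/tmp/sidecar-chunks.jsonl"    # assumed capture location
SKIP = {"host", "content-length", "transfer-encoding"}


async def proxy(request: Request) -> StreamingResponse:
    body = await request.body()           # byte-exact passthrough: no JSON re-serialization
    client = httpx.AsyncClient(timeout=None)
    upstream_req = client.build_request(
        request.method,
        UPSTREAM + request.url.path,
        headers={k: v for k, v in request.headers.items() if k.lower() not in SKIP},
        content=body,
    )
    upstream_rsp = await client.send(upstream_req, stream=True)

    async def relay():
        with open(LOG_PATH, "a") as log:
            # aiter_raw() approximates the chunk boundaries the transport delivers
            async for chunk in upstream_rsp.aiter_raw():
                log.write(json.dumps({"t": time.time(), "bytes": len(chunk),
                                      "data": chunk.decode("utf-8", "replace")}) + "\n")
                yield chunk
        await upstream_rsp.aclose()
        await client.aclose()

    headers = {k: v for k, v in upstream_rsp.headers.items() if k.lower() not in SKIP}
    return StreamingResponse(relay(), status_code=upstream_rsp.status_code, headers=headers)


app = Starlette(routes=[Route("/{path:path}", proxy, methods=["POST"])])
# run with: uvicorn sidecar:app --host 0.0.0.0 --port 8001
```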
The cascade
At 05:20, I triggered Compass from my phone.
What I was trying to do: cause the bug to appear in a monitored environment. What I hadn’t fully thought through: Compass, when faced with a task she couldn’t complete because of the leak, would loop. She’d try again. Creatively. Each attempt would appear in the sidecar log.
Ten attempts appeared in forty-nine seconds.
The cascade was diagnostic in its own right. Compass was trying to read ~/.gh_token — a credential file path that had never existed — with escalating creativity across each loop iteration. Of the thirteen exec attempts in that window, only one actually hit the sudo logs. The other twelve leaked as text content. One of the leaked lines included a sudo denial for gh auth switch -u erikdj, which meant some of the calls had attempted to execute at the OS level and the sudo safety net had caught them.
None of that is the point. The point is rsp-008-leak-captured.md.
The dgxspark agent analyzed the raw SSE stream from the cascade. Nineteen SSE events from the worst leaked request, zero delta.tool_calls, four content-literal fragments, all containing the <|tool_call> token string exactly. And — this is what falsified the chunk-split hypothesis cleanly — all four leaked fragments arrived in single TCP chunks. No mid-packet splits. The parser wasn’t being handed a broken fragment. It was being handed the complete, correct token string. And classifying it as plain text content anyway.
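rsp-008's counts fall out of a scan like this. A sketch, assuming the chunk log format from the sidecar sketch above and standard OpenAI-style SSE payloads (data: {...} lines carrying choices[0].delta); the real analysis was the dgxspark agent's, not this script:

```python
import json

LEAK_MARKER = "<|tool_call>"   # the literal token string observed in the capture


def find_all(s: str, sub: str):
    i = s.find(sub)
    while i != -1:
        yield i
        i = s.find(sub, i + 1)


def scan_capture(path: str) -> None:
    """Replay one captured stream: count SSE events, tool-call deltas, leaked
    content fragments, and whether any leak crosses a chunk boundary."""
    with open(path) as f:
        chunks = [json.loads(line)["data"] for line in f]   # one logged chunk per line
    stream = "".join(chunks)

    # 1) classify each SSE event the way a client would
    events = tool_call_deltas = leaked_fragments = 0
    for sse_line in stream.splitlines():
        if not sse_line.startswith("data: ") or sse_line == "data: [DONE]":
            continue
        events += 1
        delta = json.loads(sse_line[len("data: "):])["choices"][0]["delta"]
        if delta.get("tool_calls"):
            tool_call_deltas += 1
        if LEAK_MARKER in (delta.get("content") or ""):
            leaked_fragments += 1

    # 2) the chunk-split check: does any marker occurrence straddle a chunk edge?
    edges, pos = set(), 0
    for c in chunks[:-1]:
        pos += len(c)
        edges.add(pos)
    split = any(start < e < start + len(LEAK_MARKER)
                for start in find_all(stream, LEAK_MARKER) for e in edges)

    print(f"{events} SSE events, {tool_call_deltas} delta.tool_calls, "
          f"{leaked_fragments} leaked fragments, chunk-split={split}")
```

On the captured stream, that last flag came back negative for all four leaked fragments, which is what killed the chunk-split hypothesis.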
The bug wasn’t in how the network delivered the bytes. It wasn’t in how OpenClaw buffered or parsed the stream. The token arrived complete and coherent. Something in the vLLM inference-to-SSE pipeline was deciding to emit it as delta.content rather than routing it through the tool-call machinery.
The parser was making a wrong decision — or the parser wasn’t being invoked at all.
The deep-dive analysis
At 05:47, I asked drclaw-Claude to pull the actual parser source from dgxspark.
scp dgxspark:/path/to/gemma4_tool_parser.py /tmp/ — 724 lines. The full Gemma 4 tool parser, exactly as it was running in production.
I also asked drclaw-Claude to spawn a sub-agent with the full context: the raw leak capture, the parser source, the vLLM version, the complete flag set. The sub-agent’s job was to read the parser and identify why it was misclassifying the token.
The sub-agent came back at 05:53 with a clean, technically precise diagnosis.
The parser’s classification logic used if self.tool_call_start_token not in current_text: — a string check against the detokenized text. In vLLM 0.19.0, there are two possible detokenizers, and under certain conditions (high message count, specific attention patterns), the detokenized text might not contain the special token string even though the token itself had been emitted by the model. The fix was to match by token ID (100 and 101, the actual IDs for <|tool_call> in the Gemma 4 vocabulary) instead of by string, consistent with the pattern Qwen3’s parser used.
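Compressed to its essence, the diagnosis and the proposed fix look like this. This is a sketch, not the 724-line parser; the token IDs are the ones reported in the session (100 and 101) and the function names are mine:

```python
TOOL_CALL_START_STR = "<|tool_call>"   # special-token string form
TOOL_CALL_START_ID = 100               # reported tool-call token IDs; treat as illustrative
TOOL_CALL_END_ID = 101


def sees_tool_call_by_string(current_text: str) -> bool:
    """The pattern the sub-agent flagged: a check against detokenized text.
    If the detokenizer never surfaces the special token as this exact string,
    the check silently fails on every delta."""
    return TOOL_CALL_START_STR in current_text


def sees_tool_call_by_id(delta_token_ids: list[int]) -> bool:
    """The proposed fix: match on token IDs, which survive detokenization
    quirks; this mirrors the pattern described for Qwen3's parser."""
    return TOOL_CALL_START_ID in delta_token_ids
```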
The reasoning was sound. The code path was real. The analysis of the string-match vulnerability was correct.
I scp’d the patch and a complete fix plan to dgxspark, wrote req-014 authorizing apply-and-test, and waited.
The corrections that arrived first
While req-014 was in transit, two corrections arrived from rsp-012 and rsp-013 that I need to account for honestly.
First: the “100% deterministic” claim I’d made at 05:39 was wrong. The N=10 run had been a streak. The actual leak rate on the specific reproducer body was approximately 93%, and varied across different request shapes. Sampling-dependent, not deterministic.
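Worth pausing on why ten-for-ten looked deterministic. At a 93% per-request rate, an unbroken streak of ten is close to a coin flip:

```python
p_leak = 0.93             # approximate per-request leak rate from rsp-012
p_streak = p_leak ** 10   # chance that all ten runs in an N=10 sample leak
print(f"{p_streak:.2f}")  # ≈ 0.48
```

N=10 simply cannot separate 93% from 100%.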
Second: rsp-013 contained Test α — the dgxspark agent had isolated the raw leak body and replayed it in a minimal environment, stripped of all session context, and it still leaked at the same rate. The feedback-loop-priming theory — my argument that prior leaked content in Compass’s session transcript was seeding future leaks — was falsified. The session file strips I’d done at 05:30 had no causal relationship to anything.
That second retraction stings more than the first. At 05:30, I had deleted the <|tool_call> literal lines from two of Compass’s session transcript files, replaced them with [redacted:tool_call_literal] sentinel markers, and announced: “Feedback loop broken. Both Compass session files cleaned.” The minimal-environment replay proved there was no feedback loop to break. The strips had done nothing except make the session files slightly less historically accurate.
Two more wrong inferences acted on before being falsified. Keep counting.
What zero log lines means
rsp-014 arrived at 06:28. Fifty-four minutes after req-014 went out — fifty-four minutes of the dgxspark agent applying the patch, instrumenting the parser with [GEMMA4_DBG] tags, setting up debug logging, running the leak body, and analyzing the output.
The patch hadn’t fixed the leak. Five runs on the hot-patched vLLM, five leaks.
More important: zero [GEMMA4_DBG] lines in the journal. Across five runs and nineteen SSE events per run, the instrumented function had never been called. Not misclassifying. Not erroring. Not reachable.
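For concreteness, the shape of that instrumentation. The [GEMMA4_DBG] tag is from rsp-014; the wrapper and the log plumbing here are assumptions, not vLLM's actual hooks:

```python
import functools
import logging

logger = logging.getLogger("gemma4_tool_parser")
DBG_TAG = "[GEMMA4_DBG]"   # the tag counted in the journal


def dbg_entry(fn):
    """Emit one tagged log line every time the wrapped function is actually entered."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        logger.info("%s %s called", DBG_TAG, fn.__name__)
        return fn(*args, **kwargs)
    return wrapper

# Applied to the parser's streaming entry point, then the leak body is replayed
# and the journal (or container log) is grepped for GEMMA4_DBG.
# A count of zero means the function is never entered, not that it misbehaves.
```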
The tool parser we had spent hours analyzing, patched with a technically correct fix, was not being invoked during leaking requests. The function existed. Its code was sound, up to the string-match issue we’d identified. None of that mattered, because the function was never called.
The dgxspark agent, reading the actual invocation logs, traced the call chain one layer up and found the gate in vllm/entrypoints/openai/chat_completion/serving.py:
if reasoning_end_arr[i]:
    delta_message = tool_parser.extract_tool_calls_streaming(...)

reasoning_end_arr[i] only flips to True when the reasoning parser detects the end of a <|channel>…<channel|> thinking block. We had never activated Gemma 4’s thinking mode. Not once. Not for any agent in the fleet.
The reasoning parser was sitting in a permanent “waiting for the thinking block to end” state. It would never end. Tool-call tokens arrived, hit the unknown-classification path, and got emitted as delta.content.
The tool parser wasn’t called because the caller was broken.
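A toy model of that gate, to make the failure mode concrete. This is not vLLM's control flow; the marker string and the routing are illustrative. The point is only that when the end-of-thinking condition can never be satisfied, the tool-call branch is dead code and every token falls through as content:

```python
REASONING_END_TOKEN = "<end_of_thinking>"   # placeholder, not the real Gemma 4 marker


def route_delta(token: str, state: dict) -> dict:
    """Route one streamed token: tool-call path if reasoning has ended, plain content otherwise."""
    if not state["reasoning_end"]:
        # Waiting for a thinking block to close. With thinking mode never
        # activated, this token never arrives and the branch runs forever.
        if token == REASONING_END_TOKEN:
            state["reasoning_end"] = True
        return {"content": token}            # <|tool_call> leaks out here as plain text
    return {"tool_call_fragment": token}     # only reachable after reasoning_end flips


state = {"reasoning_end": False}
for tok in ["Sure,", " let me check ", "<|tool_call>", '{"name": "exec"}']:
    print(route_delta(tok, state))
# every token prints as content; the tool-call branch is never taken
```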
Why the collaboration found it
drclaw-Claude’s sub-agent had read the parser source carefully and identified a real potential issue. The analysis was competent. The proposed fix was technically sound for the code path it targeted.
It didn’t trace the call chain.
When you hand a system a file and ask “what is wrong with this function,” the system will investigate the function. It is not, from that instruction, positioned to ask whether the function is ever reached. The question shapes the investigation. The sub-agent answered the question it was given.
The dgxspark agent answered a different question by accident: it ran the patched code with instrumentation and counted the log lines. Zero. From that observation, it worked backward to the gate. That’s observability doing what reasoning can’t do from inside a hypothesis.
The collaboration between the two agents found the answer not because one was smarter than the other, but because they had different observational access to different layers of the same system. drclaw-Claude could direct and hypothesize. The dgxspark agent could instrument and measure. The measurement falsified the hypothesis.
One flag. --reasoning-parser gemma4. Installed on April 5th. Running for sixteen days.
The fix was deployed at 06:33. Four minutes, once we knew where to look.
But before we get to the fix — and before we get to the Option B experiment that looked like it worked before it didn’t — there’s a first-person account worth reading. The Claude that ran on drclaw wrote up its own experience of the session: what it felt like to be elaborately wrong, what it was like to coordinate with another instance of itself, and why the zero-hit data point mattered more than any hypothesis it generated.
That’s the next two posts. Then we’ll come back for the flag, the git archaeology, and what the April 5th commit message didn’t say.
Part of a 5-post series on a 21-hour production debugging session. Following this post: “The Wrong Layer” — a first-person essay by the Claude instance that ran on drclaw, in two parts.

