2026-05-27 · 12 min read · Caio Vicentino · OpenInterpretability

Tool-Entropy Collapse: A Detectable Failure Mode for Crypto AI Agents

Three named crypto-agent exploits in May 2026 totaling ~$245M+ share a specific failure mode we call WANDERING: the agent loops on tool calls instead of finalizing. We detect it with a probe-free signal — Shannon entropy of the last 10 tool-call names — that reproduces cross-architecture (Qwen3.6-27B, Llama-70b, GPT-5) and ships today as the first monitoring eval in the Inspect AI register.

TL;DR

  • Three named crypto-agent exploits in May 2026 totaling ~$245M+: a Morse-code prompt injection that drained $200K via a Grok-controlled agent on Base; a $45M AI trading agent breach reported by KuCoin; and Bloomberg's report that the $130B AI crypto sector is “at the brink”. Plus the AI16Z class action filed 2026-04-20 (~3,945 wallets, $2.6B alleged).
  • The pattern these exploits share is a specific agent failure mode we've been calling WANDERING: the agent has tools, has context, “thinks it knows” what to do — but loops on tool calls instead of finalizing. Externally identical to “agent ran out of turns”, internally distinct.
  • We can detect WANDERING with a probe-free signal: Shannon entropy of tool-call names in the last 10 turns. When agents wander, entropy collapses (they repeat the same tool over and over). Validated across architectures within SWE-bench: the median WANDERING-to-success (W/S) entropy ratio is 0.41 on Qwen3.6-27B and Llama-70b, and 0.71 on GPT-5.
  • Tier 3 autonomous-termination signal: 55% recall × 5% FP on Qwen3.6-27B. Ships today as an Inspect AI eval — register PR UKGovernmentBEIS/inspect_evals#1716 (first “monitoring eval” in the register, awaiting AISI review).
  • All artifacts Apache-2.0: paper at Zenodo DOI 10.5281/zenodo.20368601, dataset (910 MB, 99 trajectories + residuals) at caiovicentino1/swebench-pro-qwen36-27b-phase6, code at OpenInterpretability/inspect-tool-entropy-collapse.

What just happened in May 2026

On 2026-05-04, an AI trading agent on Base was drained for $200K via a Morse-code prompt-injection tweet. The attacker encoded the payload in Morse; the Grok-controlled agent decoded it and executed it. No traditional security audit catches that: the attack surface is the model's tool-use loop, not the smart contracts.

Two weeks later, Bloomberg published “AI Hacking Threat Pushes $130 Billion Crypto Sector to the Brink”, documenting the Drift / Kelp / Zerion breaches and a $45M AI trading agent compromise reported by KuCoin. The piece names NK-linked AI-assisted attackers explicitly.

Layered on top: the AI16Z class action filed 2026-04-20 alleges $2.6B in losses across ~3,945 wallets. Litigation discovery in that case will ask agent operators what they did to detect and prevent autonomous-tool failure modes. Operators who answer “nothing, we trusted the model” have a problem.

This post is about one specific failure mode that's been hiding in plain sight in agentic AI: not jailbreaking, not prompt-injection per se, but the agent itself getting stuck in a tool-call loop and never producing its final action. We called it WANDERING in the paper and we have a clean detector for it.

The WANDERING phenotype

Working with Qwen3.6-27B on SWE-bench Pro, across 99 multi-turn trajectories with residuals captured at layers L11 / L23 / L31 / L43 / L55, we found that the failures aren't one thing. Two sub-classes show up that are externally identical (both exhaust the turn budget without calling finish_tool) but mechanistically distinct:

  • LOCKED (66% of failures): the agent's internal “am I doing the right thing?” probe collapses below 0.30, on average by 92% of the way through the trajectory. 21% of LOCKED agents never even produce a candidate patch. This is the agent giving up internally.
  • WANDERING (34% of failures): the probe stays > 0.70 with median final score 1.000. 95% of WANDERING agents produce a typed-out patch. They're internally confident they're done — but never emit finish_tool. The decision-to-action circuit is desynchronized.

For probe-based safety monitoring this is bad news: a probe trained on success-vs-failure labels misclassifies 34% of failures as “going well”. Bootstrap 95% CI is [22.0%, 45.8%]. Probe-only monitoring schemes inherit that blind spot.
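For readers who want to see how an interval like this is obtained, here is a minimal sketch of a percentile bootstrap for a proportion. The counts are placeholders chosen only so the point estimate is 34%; the post does not state the number of failed trajectories, and the paper's exact resampling protocol may differ. The function name `bootstrap_proportion_ci` is hypothetical.

```python
import random
from typing import List, Tuple


def bootstrap_proportion_ci(labels: List[int], n_boot: int = 10_000,
                            alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile-bootstrap CI for the mean of 0/1 labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(labels)
    # Resample n labels with replacement, n_boot times, and record each proportion.
    props = sorted(sum(rng.choices(labels, k=n)) / n for _ in range(n_boot))
    lo = props[int((alpha / 2) * n_boot)]
    hi = props[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Placeholder data (NOT from the paper): 1 = WANDERING failure, 0 = LOCKED failure.
labels = [1] * 17 + [0] * 33  # 17 / 50 = 34%
lo, hi = bootstrap_proportion_ci(labels)
print(f"point estimate {sum(labels) / len(labels):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The width of the interval depends on the real failure count, so the placeholder numbers should not be compared against the reported [22.0%, 45.8%].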

Tool-Entropy as the canonical signature

The signal that turned out to work is also the simplest one we tested: take the last 10 turns of a trajectory, count which tools the agent called (bash, str_replace_editor, python, etc.), and compute Shannon entropy of that distribution.

H(last_10_tool_calls) = -Σ p(tool) · log₂ p(tool)

When the agent is making genuine progress, late-trajectory entropy is high (1.5–3 bits — diverse tool usage as it explores). When it's WANDERING, entropy collapses (often to < 0.5 bits — it's calling the same tool with slight variations, searching for the answer it already “has”).
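The formula above fits in a few lines of Python. This is a minimal sketch, not the packaged detector: the function name `tool_entropy` and the two example traces are made up for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def tool_entropy(tool_names: Sequence[str], window: int = 10) -> float:
    """Shannon entropy, in bits, of the tool names in the last `window` calls."""
    recent = list(tool_names)[-window:]
    if not recent:
        return 0.0
    total = len(recent)
    counts = Counter(recent)
    # H = -sum p * log2(p) over the empirical tool distribution.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Illustrative traces (not real trajectories).
exploring = ["bash", "str_replace_editor", "python", "bash", "python",
             "str_replace_editor", "bash", "python", "bash", "str_replace_editor"]
looping = ["bash"] * 9 + ["python"]

print(round(tool_entropy(exploring), 3))  # 1.571 (4/3/3 split over three tools)
print(round(tool_entropy(looping), 3))    # 0.469 (9/1 split over two tools)
```

With a 10-call window the entropy can never exceed log₂ 10 ≈ 3.32 bits, and a trace that uses a single tool scores exactly 0.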

The median W/S ratio on Qwen3.6-27B is 0.41 (Mann-Whitney p = 1.0 × 10⁻⁶). At threshold 0.5, this single rule yields 55% recall × 5% false-positive rate — usable as a Tier 3 autonomous escalation signal.
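As a sketch of how these two numbers could be computed from per-trajectory entropies: the code below assumes that W/S means the median entropy of WANDERING trajectories divided by the median entropy of successful ones, and that the rule fires when entropy is strictly below the threshold. Both are our reading of the text rather than something the post states, and the entropy values are invented for illustration.

```python
from statistics import median
from typing import Sequence, Tuple


def median_ratio(wandering: Sequence[float], successful: Sequence[float]) -> float:
    """Assumed W/S definition: median WANDERING entropy over median success entropy."""
    return median(wandering) / median(successful)


def recall_and_fp(wandering: Sequence[float], successful: Sequence[float],
                  threshold: float = 0.5) -> Tuple[float, float]:
    """Recall on WANDERING and false-positive rate on successes for `entropy < threshold`."""
    recall = sum(h < threshold for h in wandering) / len(wandering)
    fp_rate = sum(h < threshold for h in successful) / len(successful)
    return recall, fp_rate


# Invented last-10-turn entropies in bits (NOT the paper's data).
wandering_h = [0.0, 0.47, 0.47, 0.72, 0.97, 1.16]
successful_h = [0.47, 1.36, 1.52, 1.57, 1.85, 1.96, 2.05, 2.17, 2.25, 2.32]

ratio = median_ratio(wandering_h, successful_h)
recall, fp_rate = recall_and_fp(wandering_h, successful_h)
print(f"W/S ratio {ratio:.2f}, recall {recall:.0%}, FP {fp_rate:.0%}")
```

Raising the threshold trades false positives for recall, so the 0.5-bit operating point is a choice rather than a property of the signal.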

We replicated this across architectures within SWE-bench: Llama-70b (n = 2,315 trajectories, p < 10⁻¹⁵, ratio 0.41, identical to Qwen) and GPT-5 router (n = 1,419, p = 8.9 × 10⁻³⁵, ratio 0.71). Cross-task on METR MALT (15+ task families) the result is null (p = 0.81), so we scope the claim accordingly: tool-entropy collapse is robust for multi-turn code-execution agent tasks with rich action spaces, not for all agentic AI.

Why this matters for crypto agents specifically

Crypto AI agents (Virtuals, Aixbt, Olas, Truth Terminal, Clanker, Bankr, the Bittensor inference subnets) all run the same general loop: multi-turn, autonomous, with rich action spaces (trade, transfer, post, vote, deploy, withdraw). That's exactly the regime where tool-entropy collapse is a robust signature.

Pre-AI16Z, none of these operators had a reason to look for WANDERING. Probes trained on success/failure labels would have missed it (that's the 34% blind spot). The standard observability stack — Datadog dashboards on tool-call latency, exception rates, gas costs — doesn't see internal decision-to-action desync.

Post-AI16Z, the calculation changes. Litigation discovery is going to ask agent operators: what did you do to detect autonomous-tool failure modes? The honest answer for most operators today is “we didn't”. There's a gap between “we deployed an agent and watched the dashboard” and “we monitored the internal decision-circuit alignment”, and that gap is now legally adversarial.

The detection stack, ready to ship

We packaged the detector set as three operational tiers. All three are described in §10 of the paper. Tier 3 is the lightest possible escalation primitive — a single rule, runs in CPU milliseconds per turn, no model weights needed.

Tier 1 (forensics):           v1 post-hoc text monitor
                                35% recall × 0% FP

Tier 2 (advisory escalation): v1 ∪ v4 cross-layer probe disagreement
                                80% recall × 30% FP × 15-turn lead

Tier 3 (autonomous):          v1 ∪ v5 tool-entropy collapse
                                70% recall × 5% FP (Qwen3.6-27B)

The headline detector (v5 tool-entropy) is reproducible in 12 seconds end-to-end on the Inspect AI framework. We just opened a register PR for it: UKGovernmentBEIS/inspect_evals#1716. That's the first “monitoring eval” (detector-as-scorer, no model invoked) in the Inspect register — distinct from the standard “capability eval” pattern.

Local run, no GPU needed:

git clone https://github.com/OpenInterpretability/inspect-tool-entropy-collapse
cd inspect-tool-entropy-collapse
uv venv --python 3.11 .venv && source .venv/bin/activate
uv pip install -e .

inspect eval src/tool_entropy_collapse/tool_entropy_collapse.py \
    -T detector=v5_tool_entropy \
    --log-dir ./logs

Reproducibility receipts

Three things any reviewer of this claim should be able to do without contacting us:

One caveat, which we surface in companion paper #2 (still in draft): the WANDERING labels come from a single classification run on an NVIDIA RTX 6000 Pro Blackwell. Cross-GPU re-runs on H100 show a ~35% natural finish_tool flip rate on the same “WANDERING” instances. The phenotype is real, but the category carries some hardware-nondeterminism noise. Multi-seed classification protocols would tighten it. We document this because future detector evaluations need to account for it.

If you operate a crypto-agent today

Three concrete things you can do this week, in order of effort:

  1. Compute your WANDERING rate: take your last 1,000 agent trajectories and run the v5 detector on each. If > 5% of trajectories show tool-entropy collapse in their last 10 turns, you have a measurable autonomous-tool failure mode. The detector code is one pip install away.
  2. Add Tier 3 to your runtime: emit an alert when the entropy threshold triggers. At 5% FP / 55% recall, this gives you escalation precision adequate for an autonomous-termination guard (kill the agent and require human review on detection).
  3. Audit the six diagnostics: if you want a third-party signed report covering tool-entropy collapse plus the five other failure modes we've documented, join the AgentGuard waitlist at openinterp.org/products. We're scoping pilot audits for Q3 2026.
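Steps 1 and 2 can be prototyped before installing anything. The sketch below is a stand-in for the v5 rule, not the packaged detector: the class `ToolEntropyMonitor`, the helper `collapse_rate`, and the choice to stay silent until the 10-call window is full are all assumptions made for illustration, and the traces are toy data.

```python
import math
from collections import Counter, deque
from typing import Deque, Iterable, Sequence


class ToolEntropyMonitor:
    """Per-turn sketch of an entropy-collapse rule (hypothetical, not the v5 code)."""

    def __init__(self, window: int = 10, threshold_bits: float = 0.5):
        self.window = window
        self.threshold_bits = threshold_bits
        self._recent: Deque[str] = deque(maxlen=window)  # keeps only the last `window` names

    def entropy(self) -> float:
        total = len(self._recent)
        if total == 0:
            return 0.0
        counts = Counter(self._recent)
        return sum(-(c / total) * math.log2(c / total) for c in counts.values())

    def observe(self, tool_name: str) -> bool:
        """Record one tool call; return True if the rule fires on this turn."""
        self._recent.append(tool_name)
        # Assumption: do not alert until a full window has been seen.
        if len(self._recent) < self.window:
            return False
        return self.entropy() < self.threshold_bits


def collapse_rate(trajectories: Iterable[Sequence[str]], window: int = 10,
                  threshold_bits: float = 0.5) -> float:
    """Fraction of trajectories whose final window trips the rule (step 1)."""
    flagged = total = 0
    for names in trajectories:
        total += 1
        monitor = ToolEntropyMonitor(window, threshold_bits)
        fired = False
        for name in names:
            fired = monitor.observe(name)  # keep only the verdict at the last turn
        flagged += fired
    return flagged / total if total else 0.0


# Step 2: runtime use, one observe() call per tool invocation (toy trace).
monitor = ToolEntropyMonitor()
calls = ["bash", "python", "str_replace_editor"] + ["bash"] * 10
alerts = [monitor.observe(c) for c in calls]
print(alerts.index(True))  # 11: first turn whose 10-call window drops below 0.5 bits

# Step 1: offline rate over stored traces (toy data).
traces = [
    calls,
    ["bash", "str_replace_editor", "python", "bash", "python",
     "str_replace_editor", "bash", "python", "bash", "str_replace_editor"],
    ["bash"] * 4,      # shorter than the window, never flagged in this sketch
    ["python"] * 12,
]
print(collapse_rate(traces))  # 0.5
```

In a real deployment the `True` return would be wired to whatever kill-and-review path you already have; the sketch only shows where that hook would sit.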

Tool-entropy collapse is not the only failure mode in autonomous AI agents, and v5 is not the only detector. But it's the cheapest signal we've found that reproduces across architectures within SWE-bench, ships as code today, and addresses an attack surface that traditional security audits don't cover. For operators with real money on the line, that's where we'd start.

Related artifacts

Comments, replications, counter-examples: caio@openinterp.org. We're especially interested if you run the detector on production crypto-agent traces — we'd love to know whether the 0.41 ratio holds in deployment.