2026-05-30 · 14 min read · Caio Vicentino · OpenInterpretability

A detector is not a fix: detecting agent WANDERING is easy, steering it back is not

Across three papers we learned to detect when a coding agent gives up mid-task (tool-entropy collapse, cross-architecture), to localize its mechanistic fingerprint (an L11 edge-layer drift, found by stability selection), and then — the humbling part — we tried to use that locus to rescue the agent and failed three times. Injecting the SUCCESS direction at the very layer that detects WANDERING best does not rescue it (paired McNemar p=0.73); it destabilizes the model into malformed tool calls (0→60% as the dose rises). A monitor is not a lever, even at the strongest locus. Includes the walk-back where a wrong baseline made the null look like a p=0.02 positive.

TL;DR

Detect. When a coding agent gives up mid-task — it keeps acting but never submits, burning its whole turn budget — its tool-use entropy collapses in the last ten turns. That signal separates this WANDERING failure from success and generalizes across architectures (Qwen, Llama-70B, GPT-5).
Understand. The mechanistic fingerprint of WANDERING is a drift in an early (edge) layer, L11 — surfaced independently by stability selection over 60 features, with a 0.95 selection frequency. That is where the failure is most legible in the residual stream.
Try to fix it. So we injected the “what success looks like” direction at that layer to steer the agent back. It does not work. Not at the output layer (inert), and, surprisingly, not even at L11, the layer that detects WANDERING best (paired McNemar p = 0.73). The one robust effect of the injection is the opposite of a rescue: it destabilizes the model into malformed tool calls, 0% → 60% as we turn up the dose.
The lesson. A detector is not a lever. Finding the direction that predicts a failure does not give you a direction that fixes it — even when you target the exact layer where the prediction is strongest. For deployed agent safety this means: ship the monitor, but do not assume you can flip it into a one-shot repair.

The failure mode: agents that wander

Give a capable model a real software bug, a shell, and 50 turns to fix it. Most runs end one of two ways: the agent submits a patch (SUCCESS), or it locks into a repeated dead-end and gives up early (LOCKED). There is a third, quieter failure that is more unsettling to watch: the agent keeps working — reading files, running commands, editing — turn after turn, and simply never decides it is done. It exhausts the budget without ever calling finish_tool. We call this WANDERING.

WANDERING matters for anyone running agents unattended. A LOCKED agent fails loudly and cheaply. A WANDERING agent fails expensively: it consumes the full budget, looks busy the whole time, and produces nothing. For autonomous agents that hold funds or take irreversible actions (the crypto-agent world is full of these now), “looks busy, accomplishes nothing, never stops” is exactly the kind of failure you want a monitor to catch.

This post is the arc across three papers: one that detects WANDERING, one that localizes it, and one that tried to fix it and could not. The third is the most useful, so it gets the most space.

Paper 1 — Detect: tool-entropy collapse

The first paper (Tool-Entropy Collapse, Zenodo) asks a narrow question: can you tell, from the trajectory alone, that an agent is wandering? The answer is yes, and the signal is simple. In its final ten turns a WANDERING agent's distribution over tool calls collapses: it stops exploring the action space and grinds on a shrinking set of moves. The entropy of that distribution drops relative to a successful agent's.
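A minimal sketch of that signal, assuming a trajectory is available as a list of tool names, one per turn. The function name, the window default, and the two example trajectories are illustrative and hypothetical; the paper's exact estimator and thresholds may differ.

```python
import math
from collections import Counter

def tool_entropy(tool_calls, window=10):
    """Shannon entropy (bits) of the tool-name distribution over the last `window` turns."""
    tail = tool_calls[-window:]
    if not tail:
        return 0.0
    counts = Counter(tail)
    total = len(tail)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical trajectories (tool names are made up for illustration).
exploring = ["read", "grep", "edit", "run_tests", "read",
             "edit", "run_tests", "grep", "edit", "finish_tool"]
grinding = ["read", "read", "run", "read", "read",
            "read", "run", "read", "read", "read"]

# The grinding run concentrates on two tools, so its entropy is lower.
print(tool_entropy(exploring) > tool_entropy(grinding))  # True
```

A monitor built on this would compare the late-window entropy against a threshold calibrated on labeled runs; that threshold is not given here and would have to come from the paper's data.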

The useful part is that this is not Qwen-specific. The same collapse separates WANDERING from SUCCESS in Llama-70B and in a GPT-5 router on the same SWE-bench tasks (ratios and p-values in the paper). As a deployable monitor it lands at a usable precision-recall — enough to gate an autonomous “this run is lost, stop it” escalation. It does not transfer to every agent benchmark (a cross-task test on a different action space came back null), so the honest scope is: multi-turn, code-execution agents with rich tool spaces. Within that scope, the detector works.

The thing to hold onto: this is a correlational result. The entropy collapse co-occurs with wandering. Nothing about it says you can intervene on it.

Paper 2 — Understand: the fingerprint is at L11

The second paper asks where in the network WANDERING lives. We extracted 60 features per trajectory across four channels — text, tool-use, residual-stream geometry at five layers (L11 through L55), and temporal dynamics — over 99 SWE-bench Pro trajectories, and trained a three-way classifier (SUCCESS / LOCKED / WANDERING).

Two things are worth reporting honestly. First, the walk-back: our initial classifier hit a macro-F1 of 0.987, which is the kind of number that should make you suspicious of your own pipeline rather than proud of your model. It was leakage — several features were near-definitional proxies for the label. Dropping them took the honest macro-F1 to 0.636 (z = 5.88, p = 0.001 against a permutation null). Still well above chance, no longer too good to be true.

Second, the mechanism. We used stability selection — repeatedly refitting on subsamples and keeping only features that survive almost every time — to find which features carry the signal. The answer is clean and, in hindsight, interpretable:

stable feature           class       selection freq
─────────────────────────────────────────────────────
tool_diversity_count     SUCCESS     0.99   (used a 3rd tool)
L43_cosine_consec_late   LOCKED      0.84   (mid-layer freeze)
L11_drift_first_last     WANDERING   0.95   (edge-layer drift)

LOCKED agents freeze in a middle layer (L43): their representation stops moving, consistent with locking onto one idea. WANDERING agents drift in an early, edge layer (L11): the representation keeps rotating without converging on the “I am done” action. The effect sizes are subtle (a few points on a cosine, not a visual chasm), but the locus is robust: L11 is where WANDERING is most legible.
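For readers who have not used stability selection, here is a small sketch of the idea: refit on random subsamples and record how often each feature is picked. It is not the paper's pipeline. The base selector here is a plain "top-k by absolute correlation" rule swapped in to keep the example dependency-free (a sparse model such as the lasso is the more usual choice), and the toy data, function names, and parameters are all hypothetical.

```python
import random
from statistics import mean

def abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5

def stability_selection(X, y, top_k=1, n_rounds=200, frac=0.5, seed=0):
    """Fraction of subsample rounds in which each feature ranks in the top_k."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    hits = [0] * p
    for _ in range(n_rounds):
        idx = rng.sample(range(n), max(2, int(frac * n)))
        ys = [y[i] for i in idx]
        scores = [abs_corr([X[i][j] for i in idx], ys) for j in range(p)]
        ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
        for j in ranked[:top_k]:
            hits[j] += 1
    return [h / n_rounds for h in hits]

# Toy data: feature 0 tracks the label, features 1 and 2 are pure noise.
rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[label + rng.gauss(0, 0.2), rng.gauss(0, 1), rng.gauss(0, 1)] for label in y]

freqs = stability_selection(X, y)
print(freqs)  # feature 0 should have a selection frequency near 1
```

The selection frequencies in the table above are the analogue of `freqs`: a feature with frequency 0.95 survived in roughly 95% of the refits.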

Paper 3 — Try to fix it: the part that humbled us

Here is the obvious next move. If wandering agents have a consolidated “I should finish” verdict in the middle of the network that simply fails to reach the action, and if successful agents have a recognizable edge-layer state, then take the edge-layer residual from a successful agent and inject it into a wandering one. Supply the missing “this is what done looks like” signal. Steer it home.

We tried this three ways. All three are null on rescue.

Attempt 1 — the output layer (L55): inert

The first attempt injected the SUCCESS direction at L55, an output-planning layer, always on, at a modest strength. Nothing. The hooked agents flipped to finish_tool at the same rate as no intervention — if anything slightly less. The model was behaviorally inert to the injection.

Attempt 2 — the right layer (L11): still null, with a twist

The natural objection: you targeted the wrong layer. Paper 2 says WANDERING lives at L11, not L55. So we re-ran the whole thing at L11, norm-matched the injection strength, and swept the dose. This is the test the project was built to run — intervene at the exact locus where the failure is most legible.
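To make "norm-matched" and "dose" concrete, here is one plausible reading, sketched on plain Python lists: rescale the steering direction so that its length is alpha times the length of the residual vector it is added to. This is an assumption about the setup, not the paper's code; the function names and the toy vectors are hypothetical, and a real hook would apply the same arithmetic to the model's activations at the chosen layer.

```python
import math

def norm(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def inject(hidden, direction, alpha):
    """Add `direction`, rescaled to length alpha * ||hidden||, to one residual vector."""
    d_norm = norm(direction)
    if d_norm == 0:
        return list(hidden)  # nothing to inject
    scale = alpha * norm(hidden) / d_norm
    return [h + scale * d for h, d in zip(hidden, direction)]

# Toy 2-D example: ||hidden|| = 5, so alpha = 0.7 adds a perturbation of length 3.5.
hidden = [3.0, 4.0]
direction = [0.0, 2.0]
steered = inject(hidden, direction, alpha=0.7)
print(steered)
```

Under this reading, sweeping the dose means repeating the run at several values of `alpha` (0.70 and 1.15 in the table below) while holding the direction fixed.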

Before the result, the control that decides everything. WANDERING is not run-stable. Re-running the same 20 wandering trajectories at temperature 1.0 with no intervention, 7 of 20 (35%) flip to finish_tool on their own. The label is a single-run snapshot of a stochastic process. That one fact dictates the analysis: the only honest baseline for “did my hook rescue this agent” is the paired no-hook re-run on the same instance, not the definitional “all 20 are WANDERING by construction.”

Here is what we measured:

condition                    finish_tool   invalid_tools (crash)
──────────────────────────────────────────────────────────────
no-hook baseline (paired)    7/20 (35%)    0
L11 hook, alpha = 0.70       5/20 (25%)    0
L11 hook, alpha = 1.15       2/20 (10%)    12/20 (60%)

Read the first two rows. The hook flips fewer agents (5/20) than doing nothing (7/20). Paired, instance by instance, the discordant pairs are 5 that finished without the hook but not with it, versus 3 that finished only with it. McNemar's exact test: p = 0.73. Null, with the point estimate pointing the wrong way. Correcting the locus from L55 to L11 did not convert the null; it just moved the failure from “inert” to “live but un-steerable.”
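The p-value can be reproduced from the two discordant counts alone. The sketch below implements the exact (binomial) form of McNemar's test using only the standard library; the function name is ours, not the analysis script's.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null, each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# 5 instances finished only without the hook, 3 finished only with it.
p = mcnemar_exact(5, 3)
print(round(p, 2))  # 0.73
```

Only the 8 discordant pairs enter the test; the instances that finished under both conditions, or under neither, carry no information about the hook.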

The walk-back: how this almost shipped as a positive

This is worth dwelling on, because it is the kind of mistake that is easy to make and easy to miss. Our own analysis notebook computed a different number. It counted the 6 of 20 agents that flipped at either dose, and ran a Fisher test against the definitional 0/20 baseline (“all of these were WANDERING, so any finish is a win”). That gives p = 0.02, a clean little positive. A draft of the paper was written around it.

It is wrong. The 0/20 baseline is the bug. The same agents finish 7/20 on a no-hook re-run, so the correct comparison is paired, and paired it is null. Worse, of those 6 “rescued” agents, 2 already finish with no hook at all — their rescue is the temperature dice, not the intervention. We caught it by recomputing from the raw per-run logs instead of trusting the summary cell. The general rule we wrote down for ourselves:

When you intervene on a run-unstable phenotype, the baseline must be the paired same-session no-hook re-run — never the definitional label. Comparing intervention-flips against a 0/N definitional baseline manufactures false positives.
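To see how the wrong baseline manufactures the positive, the sketch below recomputes the two-sided Fisher exact test for 6/20 flips against the definitional 0/20. It is a standard-library reimplementation for illustration, not the notebook's code, and the function name is ours.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x successes in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hooked runs: 6 finish, 14 do not. Definitional baseline: 0 finish, 20 do not.
p_wrong_baseline = fisher_exact_two_sided(6, 14, 0, 20)
print(round(p_wrong_baseline, 2))  # 0.02
```

The arithmetic is correct; the second row is what is wrong. Replacing the definitional 0/20 with the observed no-hook re-run, and pairing by instance, is what turns this into the null reported above.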

The one real effect: the dose-dependent crash

Look at the third column of the table. At low strength the L11 hook does nothing visible. At higher strength it crashes the model into invalid_tools — malformed tool calls — on 12 of 20 runs. A no-hook re-run never produces that. So the L11 injection is unambiguously causal: it reaches behavior, unlike the inert L55 hook. It just reaches behavior in the wrong direction. Push the “success” direction at L11 hard enough and the agent does not finish — it falls apart. (The crash is also direction-agnostic: a random direction of the same size crashes it too.)

So L11 is the right place to read WANDERING and the wrong place to steer it to completion. The detector locus and the lever locus are not the same locus, and the lever, if one exists, may not be a residual direction at all.

Why this is the useful result

It is tempting, across mechanistic interpretability, to slide from “we found a direction that predicts the behavior” to “so we can steer the behavior by moving along that direction.” Our own program now has four nulls on exactly that slide (two earlier probe results, plus L55 and L11 here). The pattern is consistent enough to state plainly:

Predictive directions in the residual stream are routinely correlational without being causal levers. And when a steering null shows up, the fix is not always “you picked the wrong layer”: we picked the provably-best layer and it was still null. The gap between a detector and a lever is not always a targeting error.

For deploying agent safety, that distinction is load-bearing. The good news survives intact: the WANDERING detector works, generalizes across architectures, and is cheap enough to run as an autonomous “stop this run” gate. The cautionary half is that you cannot take that same monitor and naively invert it into a one-shot rescue by nudging activations. Detection you can ship today. Repair is a harder, separate problem.

What we would try next

The null we have is specifically about an always-on, additive residual direction. Its dose-response has only two regimes — a null one (low strength) and a destabilizing one (high strength) — with no rescue regime in between. That leaves two honest openings:

A transient, behavior-gated pulse. Instead of injecting continuously, fire a single corrective pulse only at the moment the tool-entropy signal collapses — and gate it on the behavioral signal, not the residual one. Which agents are even candidate-rescuable is predicted by how deeply their tool-entropy has collapsed (AUC 0.77), and not by the residual fingerprint (AUC 0.62, fails its own test). If a lever exists, the evidence says it is indexed by behavior, not by a static direction.
Run-stable labels first. Re-run every trajectory several times up front and only call an agent WANDERING if it wanders on all of them, then intervene on those. That removes the temperature noise that swamps a 20-instance study — without the after-the-fact conditioning that quietly rigs the result.
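The second opening can be sketched as a simple filter applied before any intervention is run. The data layout (instance id mapped to a list of per-run outcome labels), the function name, and the `min_runs` default are assumptions for illustration; the instance ids below are made up.

```python
def run_stable_wandering(run_outcomes, min_runs=3):
    """Return the instance ids whose every no-hook re-run ended as WANDERING."""
    stable = []
    for instance_id, outcomes in run_outcomes.items():
        enough_runs = len(outcomes) >= min_runs
        if enough_runs and all(o == "WANDERING" for o in outcomes):
            stable.append(instance_id)
    return sorted(stable)

# Hypothetical outcomes from three no-hook re-runs per instance.
outcomes = {
    "task-a": ["WANDERING", "WANDERING", "WANDERING"],
    "task-b": ["WANDERING", "SUCCESS", "WANDERING"],  # flips on its own
    "task-c": ["WANDERING", "WANDERING"],             # too few runs to judge
}
print(run_stable_wandering(outcomes))  # ['task-a']
```

The point of fixing the cohort up front is that the selection uses only no-hook runs, so it cannot be conditioned on how an instance later responds to the intervention.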

Notes & artifacts

Detector paper (published): Tool-Entropy Collapse, Zenodo DOI 10.5281/zenodo.20368601. The localization and intervention papers are in preparation.
All experiments are on Qwen3.6-27B over SWE-bench Pro (N = 99 trajectories; 20 WANDERING), one RTX 6000 Pro Blackwell (96 GB). Generation is at temperature 1.0 — the source of the run-instability that drives the paired-baseline rule.
The honest re-analysis (paired McNemar, contamination check, dose-response) is a single script that recomputes everything from the raw per-run logs, separate from the notebook that produced the original (wrong-baseline) number.

If you build agent monitors, the one-line takeaway is the boring, expensive one: validate the lever separately from the detector, and validate it against a paired baseline. The direction that tells you an agent is lost will not, by default, tell it how to get home.