Blog
Research notes, replications, and methodology lessons, including the runs where we got it wrong the first time. Those tend to be the most useful posts.
- 2026-05-30 · 14 min read · Caio Vicentino · OpenInterpretability
A detector is not a fix: detecting agent WANDERING is easy, steering it back is not
Across three papers we learned to detect when a coding agent gives up mid-task (tool-entropy collapse, cross-architecture), to localize its mechanistic fingerprint (an L11 edge-layer drift, found by stability selection), and then — the humbling part — we tried to use that locus to rescue the agent and failed three times. Injecting the SUCCESS direction at the very layer that detects WANDERING best does not rescue it (paired McNemar p=0.73); it destabilizes the model into malformed tool calls (0→60% as the dose rises). A monitor is not a lever, even at the strongest locus. Includes the walk-back where a wrong baseline made the null look like a p=0.02 positive.
agents, WANDERING, SWE-bench Pro, steering, causal, null-result, Qwen3.6-27B, agent-safety, methodology
- 2026-05-27 · 12 min read · Caio Vicentino · OpenInterpretability
Tool-Entropy Collapse: A Detectable Failure Mode for Crypto AI Agents
Three named crypto-agent exploits in May 2026 totaling ~$245M+ share a specific failure mode we call WANDERING: the agent loops on tool calls instead of finalizing. We detect it with a probe-free signal — Shannon entropy of the last 10 tool-call names — that reproduces cross-architecture (Qwen3.6-27B, Llama-70b, GPT-5) and ships today as the first monitoring eval in the Inspect AI register.
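As a rough illustration of the signal described in this summary, the sketch below computes the Shannon entropy of the tool names in the most recent calls. It is a minimal sketch and not the post's implementation: the function name `tool_entropy`, the use of base-2 logarithms (bits), and the handling of traces shorter than the window are all assumptions made here.

```python
import math
from collections import Counter


def tool_entropy(tool_calls, window=10):
    """Shannon entropy, in bits, of tool names over the last `window` calls.

    Hypothetical helper: the name, the log base, and the short-trace
    behaviour are illustrative assumptions, not taken from the post.
    """
    recent = tool_calls[-window:]
    if not recent:
        return 0.0  # no calls yet, so there is no distribution to measure
    n = len(recent)
    counts = Counter(recent)
    # H = sum_i p_i * log2(1 / p_i), with p_i = count_i / n
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# An agent repeating one tool scores 0 bits; alternating between two scores 1 bit.
looping = ["run_tests"] * 10
alternating = ["read_file", "run_tests"] * 5
scores = (tool_entropy(looping), tool_entropy(alternating))
```

A low value means the recent calls are concentrated on very few tools, which is the looping pattern the summary calls WANDERING. Any alert threshold would have to be calibrated on real traces; none is given here.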
agent-safety, tool-entropy, WANDERING, crypto-agents, SWE-bench, cross-architecture, Inspect AI
- 2026-04-25 · 18 min read · Caio Vicentino · OpenInterpretability
Entity-recognition features in Qwen3.6-27B — a replication, and a methodology lesson
AUROC 0.84 for the "I know this entity" feature on Qwen3.6-27B — vs Ferrando 2024's 0.73 on Gemma-2-2B-IT. Two-day arc with three controls: a tokenization confound that gave fake AUROC=1.0, single-feature steering that didn't move calibration, and multi-feature top-200 ablation that beats the random-K null at 4-8σ — but the LLM judge shows the "less hedging" is purely additional incorrect answers, not correct ones. We found a hallucination-induction mechanism, not a calibration knob.
SAE, hallucination, Qwen3.6-27B, Ferrando 2024, methodology, steering, circuits, controls