Research

Papers, posts, and roadmap.

Every artifact is open. Negative results are publishable. Broken links, stale numbers, or methodological bugs get flagged and fixed in public.

Open artifacts

SDKs, probe weights, SAE / crosscoder models, and reproducible harnesses. Apache-2.0 throughout. Every artifact is paired with a paper or eval doc that documents how it was built and what it's for.

agent-probe-guard SDK v0.3.1

Live · Apache-2.0

PyPI · openinterp · HF dataset · GitHub release

Two-probe activation gate for code agents on Qwen3.6-27B (capability + thinking-intent). Skip / escalate / proceed routing, ~50ms gate, refit() helper for cross-env transfer. Detect-only by design — confirmed across 3 intervention experiments.

Read
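The skip / escalate / proceed routing described for agent-probe-guard can be pictured as a small decision function over the two probe scores. The sketch below is illustrative only: the `route` function, the `GateThresholds` values, and the mapping from scores to actions are assumptions made for this example, not the SDK's actual API or calibrated settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    # Hypothetical cut-offs; real values would come from calibration data.
    low_capability: float = 0.3
    high_capability: float = 0.7
    thinking: float = 0.5


def route(capability_score: float, thinking_score: float,
          t: GateThresholds = GateThresholds()) -> str:
    """Map two probe scores in [0, 1] to an action (assumed policy)."""
    if capability_score < t.low_capability:
        # Capability probe predicts the attempt is unlikely to succeed.
        return "skip"
    if capability_score < t.high_capability or thinking_score < t.thinking:
        # Uncertain region: hand off to a stronger model or a human.
        return "escalate"
    return "proceed"


for cap, think in [(0.9, 0.8), (0.1, 0.8), (0.5, 0.8), (0.9, 0.2)]:
    print(cap, think, route(cap, think))
```

Because the probes are described as detect-only, a gate like this only routes work; it does not steer the model.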

agent-probe-guard probe weights

Live · Apache-2.0

HF dataset · caiovicentino1/agent-probe-guard-qwen36-27b

L43 K=10 capability + L55 K=5 thinking probes for Qwen3.6-27B. CV AUROC 0.83 / 0.85 with random K-matched gap +0.08 / +0.15. Cross-env transfer caveat documented in paper Appendix C.

Read

NLA two-tier verbalization — three-model dataset

Live · Apache-2.0

HF dataset · caiovicentino1/openinterp-paper7-nla-two-tier-verbalization

Reproducibility artifacts for the "Reconstruction Without Recall" paper — Phase 16 result JSONs (N=150 per model) + AV explanations across V1 Qwen2.5-7B-L20, V2 Gemma-3-12B-L32, V3 Gemma-3-27B-L41 NLA pairs from the kitft release. Documents the three differential scaling axes and format-prior contraction.

Read

Probe-detected grokking — DPO sweep dataset

Live · Apache-2.0

HF dataset · caiovicentino1/openinterp-41v2-grokking-extended

Forward-only sweep across 11 multi-probe DPO checkpoints on Qwen3.6-27B. Fresh-probe AUROC trajectory 0.472 → 0.528 with late/early slope ratio 2.60 — the grokking signature orthogonal to original FG/RG axes. Companion to the multi-probe DPO checkpoints (caiovicentino1/openinterp-37v2-multiprobe-dpo-extended).

Read

Qwen3.6-27B paper-grade SAE

Live · Apache-2.0

HF model · caiovicentino1/qwen36-27b-sae-papergrade

Multi-layer SAE on Qwen3.6-27B trained on 200M tokens. Variance explained (ve): L11 0.842 / L31 0.706 / L55 0.808. The only public SAE for the Qwen3.6 reasoning model as of May 2026. Used by FabricationGuard, ReasonGuard, agent-probe-guard.

Read

Qwen3.6-27B full-stack SAE (PGAC)

Live · Apache-2.0

HF model · caiovicentino1/qwen36-27b-sae-fullstack

Eleven-layer TopK SAE stack covering every 5th layer of Qwen3.6-27B for Probe-Gated Adaptive Compute (PGAC). d_sae 40,960, k=128, 200M tokens. Substrate for the 3.57× compound inference speedup result at iso-SAE-floor.

Read

Gemma-2-2B base/IT crosscoder (paper-grade)

Live · Apache-2.0

HF model · caiovicentino1/gemma2-2b-crosscoder-model-diff-papergrade

BatchTopK crosscoder, 73,728 latents, k=100, expansion 32, layer 13. Trained on 100M tokens (FineWeb-Edu + UltraChat-200k). VE_A 0.877 / VE_B 0.867. Substrate for the Cosine-Causal Gap paper.

Read

FabricationGuard linear probe

Live · Apache-2.0

HF dataset · caiovicentino1/FabricationGuard-linearprobe-qwen36-27b

Linear probe at L31 on the Qwen3.6-27B residual stream. HaluEval within 0.90, SimpleQA cross 0.88, 88% reduction in confidently-wrong answers on SimpleQA. Powers the FabricationGuard product and the multi-probe DPO joint reward.

Read

SWE-bench Pro reproducible harness

Live · Apache-2.0

GitHub · OpenInterpretability/openinterp-swebench-harness

V1 harness for collecting agent rollouts on SWE-bench Pro with residual-stream captures (transformers direct, forward hooks). Phase 1 N=20 + Phase 6 N=99 in flight. Substrate for the Two Forms of Epiphenomenal Probes paper.

Read
Plus ~60 SAEs, probes, datasets, and intermediate artifacts on Hugging Face: every published paper's reproducibility data, training checkpoints, and probe weights.

Our papers

16 hosted on site · full text

Drafts, eval docs, and submitted papers authored by OpenInterp. Markdown source mirrored from the research repos so you can read the exact text without leaving the site.

Explore-Consolidate Dynamics in Cross-Probe Coherence

draft

U-Shape Trajectories of κ_t Separate Successful and Failed LLM Agent Runs on SWE-bench Pro / Qwen3.6-27B

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft v2) → ICLR 2027 main · 2026-05-18

We propose cross-probe coherence κ_t — the mean absolute pairwise correlation across N concurrent per-turn behavioral probes within a moving window of agent turns — as a meta-signal for LLM agent monitoring. On 99 SWE-bench Pro trajectories from Qwen3.6-27B we report two distinct findings. (1) Per-trace mean κ̄ separates success from failure at AUROC 0.677 (Mann-Whitney p=0.0009). (2) κ_t exhibits a U-shape over each trajectory: it decreases through an early exploration phase and increases through a late consolidation phase, and the amplitude of this U-shape is markedly larger in successful traces (early-half slope −0.0078/turn vs −0.0007/turn, p=0.0002; late-half slope +0.0149/turn vs +0.0025/turn, p=0.00004). A pre-registered robustness control (C1) found the monolithic per-trace slope is substantially explained by trace-length confound (p=0.56 after length regression), motivating the length-normalized early-half/late-half decomposition. Within-trace turn-order shuffle nulls (C2) confirm the U-shape is genuinely temporal (p<0.0001). The pattern is the inverse of cardiac uncoupling: in ICU literature, cross-vital decorrelation anticipates physiological decompensation; in LLM agents, cross-probe trajectories oscillate during successful reasoning (explore→consolidate) and remain flat during failure. We document the methodological discipline — pre-registered gates that caught both five prior single-probe walk-backs in the 36 hours before this finding and this paper's own headline-confound — that gives the rescued temporal claim its credibility.

cross-probe correlation · meta-signal · LLM agent monitoring · SWE-bench Pro · Qwen3.6-27B
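The κ_t definition in the abstract above (mean absolute pairwise correlation across N per-turn probes within a moving window of turns) can be sketched with NumPy. This is a minimal sketch, not the paper's code: the function name, the trailing-window alignment, and the handling of constant probe columns are assumptions made for this example.

```python
import numpy as np


def cross_probe_coherence(scores: np.ndarray, window: int) -> np.ndarray:
    """Moving-window mean absolute pairwise correlation.

    scores: array of shape (T, N), one row per agent turn, one column per probe.
    Returns an array of length T - window + 1 (one value per full window).
    """
    n_turns, n_probes = scores.shape
    if n_probes < 2 or window < 3 or window > n_turns:
        raise ValueError("need >= 2 probes and 3 <= window <= number of turns")
    upper = np.triu_indices(n_probes, k=1)  # each probe pair counted once
    kappa = np.empty(n_turns - window + 1)
    for start in range(n_turns - window + 1):
        block = scores[start:start + window]
        corr = np.corrcoef(block, rowvar=False)  # (N, N) correlation matrix
        # nanmean skips pairs whose correlation is undefined (constant column).
        kappa[start] = np.nanmean(np.abs(corr[upper]))
    return kappa


# Three perfectly (anti-)correlated probes give kappa = 1 in every window.
x = np.arange(12.0)
demo = np.stack([x, 2 * x + 1, -x], axis=1)
kappa = cross_probe_coherence(demo, window=5)
print(kappa)
```

Taking the absolute value before averaging means anti-correlated probes count as coherent, which matches the "mean absolute pairwise correlation" wording.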

Conditionally-Causal Probes: Five Operational Constraints on Linear-Probe Causality in Qwen3.6-27B

draft

An eleven-site empirical map, a unifying operational-constraints framework, and a pre-publication diagnostic battery — derived from four prior honest negatives

Caio Vicentino · TMLR (Survey Certification) → ICLR 2027 main (draft v1) · 2026-05-16

Linear probes on transformer residual streams routinely achieve high predictive AUROC, yet whether a probe direction also levers downstream behavior under intervention is rarely measured systematically. We report a twelve-site causal-authority map of probes in Qwen3.6-27B (reasoning-tuned, 27B parameters), comprising eleven probes evaluated under a unified α-sweep + control-token + onset-timing protocol plus one predictive case study, and identify five distinct empirical causal regimes: causal trajectory-shaping, pushup-asymmetric, pushdown-asymmetric, structurally-locked, and epiphenomenal-via-softmax-temperature. We propose that probe causality is operationally constrained by a five-axis configuration — layer (spatial), trajectory (temporal), magnitude (α), direction (saturation alignment), and decision locus (architectural) — and demonstrate each constraint with a within-paper falsifying experiment that holds the other four fixed. We then consolidate the methodology that surfaced these constraints into a six-item pre-publication diagnostic battery: random-feature baseline, shuffled-source baseline, control-token normalization, structural-rigidity α-sweep, whitespace-stripped flip metric, and onset-timing sweep. Each diagnostic is mapped to a concrete failure mode we shipped or nearly shipped in our own work: over-parameterization at N<100, marginal-fit pathology in sparse top-k prediction, softmax-temperature artifacts that look causal, amplitude-null masquerading as structural-null, tokenization-inflated flip rates, and trajectory-versus-state confusion. Together the diagnostics cost under one GPU-hour per probe. 
We release the protocol, capture batches, per-probe verdicts, and an open-source SDK that implements the diagnostics, and argue that the field's growing reliance on probe-based monitoring, reward shaping, and alignment auditing should treat probe causality as a conditional property to be measured per deployment configuration, not a global per-probe attribute.

meta-paper · probe causality taxonomy · operational constraints · pre-publication diagnostics · Qwen3.6-27B

Trajectory-Shaping Probe Steering in Qwen3.6-27B Reasoning

draft

Causal, Cross-Domain, and KV-Cache-Bound — a Subjective-Time Direction with Operational Constraints

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft v2.1) · 2026-05-16

We identify a causally functional subjective-time direction in the residual stream of Qwen3.6-27B (open-weights, 27B parameters, hybrid Gated-Delta-Net plus standard attention), validate it across math (GSM8K) and code (SWE-bench Verified) reasoning, and characterize a fundamental operational constraint on its causal effect: the steering intervention works only when applied continuously from generation start. A Ridge regression probe trained on residuals at L11/L31/L55 predicts thinking-phase completion with R²=0.82-0.86 (Spearman ρ ≥ 0.90); three baselines (random-feature, shuffled-target, constant-mean) cleanly fail. Forward-hook steering at L31 with α=+50 from token 1 shortens GSM8K thinking-length in 9/14 prompts vs 2/14 for matched random (Fisher p=0.0092). Cross-domain: 19/20 (95%) probe-clean-termination on SWE-bench Verified across 6 repositories vs 6/20 (30%) random (Fisher p<0.001), at mean 299 thinking-tokens vs unbounded baselines (0/10 terminate even at MAX_NEW_TOK=2048). The mechanism is trajectory-dependent: delayed steering — even at decode step 50 — drops termination from 9/10 to 3/10; by step 200, the rescue effect vanishes entirely (0/10). Two closed-loop variants (probe-as-sensor with threshold trigger + plateau detector) achieve only 1-2/10 termination, confirming that the 'termination basin' is mediated through KV-cache state buildup rather than instantaneous residual perturbation. Phase 2C cross-layer test further establishes that the direction is causal ONLY at L31 — L11 (R²=0.84) and L55 (R²=0.82) are inert despite equivalent probe accuracy. This refines the probe-causality taxonomy with a third category beyond 'causal' / 'epiphenomenal': operationally-constrained causal — directions that lever behavior only under specific application protocols (temporal: from token 1; spatial: specific layer only).

probe steering · trajectory-dependent · KV-cache · subjective time · Qwen3.6-27B

The Marginal-Fit Pathology in Predictive SAE Feature Trajectory Probes

draft

An Honest-Negative on Predicting End-of-Thinking SAE Features in Qwen3.6-27B

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft) · 2026-05-16

We train linear probes to predict end-of-thinking sparse-autoencoder (SAE) features in Qwen3.6-27B from residual activations at earlier thinking-phase fractions, across three layers (L11, L31, L55). Naive evaluation reports recall@1024 = 0.83-0.87 at L11/L31 and 0.67-0.72 at L55. We then run a shuffled-source baseline (X_train shuffled, y_train kept, identical recipe) and observe that the baseline reproduces the real recall within ±0.03 at all 12 (layer × source-fraction) sites, with Cohen's d < 0.15. A trivial constant baseline that predicts the top-M most-globally-common features ignoring input strictly exceeds the trained probe (1.000 at L11/L31, 0.991 at L55). The probe is not learning per-prompt predictive structure — it is fitting the marginal distribution of end-of-thinking SAE features and approximating an input-independent constant rule. We name this the marginal-fit pathology, identify five structural conditions that produce it (sparse top-k target + concentrated marginal + N_train << d_target + lazy loss + recall-style metric), contribute the shuffled-source baseline as a Phase 6c-class hard rule for sparse-target probe-prediction work, and reframe predictive-probe agendas — including JEPA-shaped LLM experiments — toward differential metrics (REAL − SHUFFLED) from day one rather than absolute recall.

sparse autoencoders · predictive probes · honest negative · marginal-fit pathology · shuffled-source baseline
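The shuffled-source baseline and the REAL − SHUFFLED differential metric from the abstract above can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not the paper's recipe: it uses a closed-form ridge probe with a bias column, a recall@m defined as the share of truly active features found in the top-m predictions, and a synthetic target whose active features never depend on the input.

```python
import numpy as np


def ridge_fit(X, Y, lam=1.0):
    """Closed-form ridge regression with an appended bias column."""
    A = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)


def ridge_predict(X, W):
    return np.hstack([X, np.ones((len(X), 1))]) @ W


def recall_at_m(scores, Y_true, m):
    """Fraction of truly active features that appear in the top-m scores."""
    top = np.argsort(-scores, axis=1)[:, :m]
    return np.take_along_axis(Y_true, top, axis=1).sum() / Y_true.sum()


def differential_recall(X_tr, Y_tr, X_te, Y_te, m, seed=0):
    rng = np.random.default_rng(seed)
    real = recall_at_m(ridge_predict(X_te, ridge_fit(X_tr, Y_tr)), Y_te, m)
    # Shuffled-source baseline: permute X rows, keep y, identical recipe.
    X_shuf = X_tr[rng.permutation(len(X_tr))]
    shuf = recall_at_m(ridge_predict(X_te, ridge_fit(X_shuf, Y_tr)), Y_te, m)
    # Constant baseline: always predict the globally most common features.
    const_scores = np.tile(Y_tr.mean(axis=0), (len(X_te), 1))
    const = recall_at_m(const_scores, Y_te, m)
    return {"real": real, "shuffled": shuf, "constant": const, "gap": real - shuf}


# Synthetic case: the same 5 of 50 features are always active, regardless of X.
rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(200, 8)), rng.normal(size=(50, 8))
Y_tr = np.zeros((200, 50)); Y_tr[:, :5] = 1.0
Y_te = np.zeros((50, 50)); Y_te[:, :5] = 1.0
result = differential_recall(X_tr, Y_tr, X_te, Y_te, m=5)
print(result)
```

In this toy setup absolute recall is perfect while the gap is zero, which is the pattern the abstract describes: high recall alone says nothing about per-prompt predictive structure.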

Reconstruction Without Recall

draft

Two-Tier Verbalization in Natural Language Autoencoders — Three-Model Differential Scaling on Qwen2.5-7B, Gemma-3-12B, and Gemma-3-27B

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft) · 2026-05-09

Natural Language Autoencoders (Fraser-Taliente et al. 2026) train an activation-verbalizer (AV) and an activation-reconstructor (AR) end-to-end with GRPO so that round-trip MSE between original and AR-reconstructed activations serves as a learnable explanation-quality reward. We replicate the canonical recipe on three NLA pairs from the kitft release spanning two model families and three scales — kitft/nla-qwen2.5-7b-L20, kitft/nla-gemma3-12b-L32, and kitft/nla-gemma3-27b-L41 — and show that the headline metric, fve_nrm, decouples from semantic content fidelity across all three models, with three differential scaling axes that sharpen the methodological position. On a 50-prompt corpus across 4 categories (chat / code / agent / reasoning) at K=3 samples (N=150 per model), fve_nrm is uniform at high absolute level (Qwen 0.880 / Gemma-12B 0.992 / Gemma-27B 0.982; spreads 0.017 / 0.005 / 0.010) while keyword recall varies 6.5–8.8× across categories (Qwen spread 0.490 → Gemma-12B 0.649 → Gemma-27B 0.654). The trajectory reveals: (a) overall content-fidelity signal-above-floor grows monotonically with NLA training quality (permutation gap +0.27 → +0.38 → +0.43, no ceiling visible); (b) per-category recall spread saturates between 12B and 27B at a training-distribution-imbalance ceiling; (c) Tier 1 fve_nrm peaks at moderate model size then slight regression at 27B, suggesting layer-extraction-dependent quality. Three controls validate on all three: permutation, random Gaussian (collapse to fve_nrm = −0.949 → −0.992 → −1.000 with exact orthogonal cosine, while AV produces increasingly contracted format templates from heterogeneous formats to single 'Educational article' hyper-template attractor), and direction-injection (4/4, 3/4, 3/4 self-category alignment with negation symmetry — agent failure mode model-specific: Gemma-12B agent → code, Gemma-27B agent → chat under format-prior contraction). 
Two-tier thesis: Tier 1 (format/category) is direction-modulated and what fve_nrm measures; Tier 2 (specific content) is largely unencoded. Better NLA training makes fve_nrm less, not more, informative about per-category Tier 2 quality.

natural language autoencoders · activation decoding · interpretability evaluation · format priors · decoupling magnification

Activation-Bounded Chain-of-Thought Monitorability

draft

Template-locked reasoning decisions and the structural ceiling on text-only CoT monitoring in Qwen3.6-27B

Caio Vicentino · Position paper, May 2026 · 2026-05-09

Chain-of-thought monitoring has emerged as a leading candidate for scalable AI safety oversight: if a long, serial reasoning process must pass through a textual trace, then reading that trace should reveal what the model is thinking. We argue this view is structurally incomplete in instruction-tuned reasoning models. The chat template that activates thinking mode in Qwen3.6-27B injects a fixed <think></think> token pair into the input itself; the decision to think at all is encoded in the prompt before the residual stream encodes anything else. We document this template-lock experimentally (bidirectional α-sweep up to ±200 in the L55 thinking probe direction produces zero behavioral change) and contrast it with three other reasoning loci where decisions are residual-stream-encoded and steerable: capability deployment (+33-40pp pushdown gap across distributions at α=−100), persona (+60pp pushdown at α=−200), and mid-reasoning quality (+30pp pushup at α=+200). We use this evidence to argue for an activation-derived monitorability bound: text-only CoT monitoring cannot, by construction, observe decisions made before the residual stream encodes them. Activation-derived monitoring is the structurally complementary half of any complete monitoring strategy. We discuss implications for the Frontier Model Forum's January 2026 issue brief and Anthropic's 2027 detection goal.

chain-of-thought · monitorability · AI safety · activation steering · template-lock

Saturation-Direction Lever

draft

A Five-Class Taxonomy of Probe Causality in Qwen3.6-27B

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft) · 2026-05-09

Linear probes routinely achieve high predictive AUROC, yet their causal authority — whether their direction levers downstream behavior — has been only sparsely tested at frontier scale. We map probe causality across 8 probes (5 layers, 5 positions, 3 training-objective classes) on Qwen3.6-27B using a unified protocol combining bidirectional α-sweep up to α=±200, random K-matched control direction, control-token-normalized log-prob shifts, structural-rigidity diagnostic, and whitespace-stripped behavioral flip metric. We document five empirical classes of probe-causality regime and identify a single unifying principle — probes lever in the saturation direction of the baseline residual — that explains all observed asymmetric-lever cases including a falsified prediction. The classes are: (1) surface softmax-temperature artifact (L43 capability), (2) template-locked categorical decision (L55 thinking emission, L31 fabrication-detection), (3) structural fragility at fragile layers (L11/L43 think_start), (4a) pushup-asymmetric lever for reasoning quality at high amplitude (RG L55 mid_think, +30pp gap), and (4b) pushdown-asymmetric lever for capability and persona at high amplitude (5 sites, +30 to +60pp gap). We falsify the naive prediction that continuous-gradient probes lever in the pushup direction by demonstrating that persona — a continuous gradient — levers in the pushdown direction when the test prompt's baseline is in the helpful saturation region.

linear probes · causal interpretability · saturation direction · asymmetric lever · Qwen3.6-27B

The Cosine–Causal Gap in Cross-Model Crosscoders

draft

When Decoder Universality Overstates Causal Equivalence in Gemma-2-2B

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft) · 2026-05-08

We provide the first per-feature empirical measurement of the gap between decoder cosine universality and causal-equivalence in cross-model crosscoders. Training a paper-grade BatchTopK crosscoder (73,728 latents, k=100) on Gemma-2-2B base/IT at layer 13, we measure Pearson correlation between two-model KL trajectories under per-feature ablation across 256 probes. Median decoder cosine 0.965 vs median Pearson_CE 0.616, with 38.24% of shared features having cosine > 0.7 yet CE < 0.5. Outliers in both tails — anti-aligned decoders with equivalent causal effect, and aligned decoders with opposite causal effect — show that cosine is neither necessary nor sufficient for causal equivalence. We propose Pearson_CE as a mandatory complementary diagnostic for crosscoder universality claims and release all artifacts under Apache-2.0.

crosscoders · SAE · universality · causal equivalence · Gemma-2
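The two quantities contrasted in the abstract above, decoder cosine and Pearson_CE, are both one-line computations once the inputs exist. The sketch below assumes each feature comes with one decoder vector per model and one KL value per probe prompt per model under ablation; the function names and the toy inputs are hypothetical, and the 0.7 / 0.5 cut-offs are the ones quoted in the abstract.

```python
import numpy as np


def decoder_cosine(dec_a: np.ndarray, dec_b: np.ndarray) -> float:
    """Cosine similarity between one feature's decoder vectors in models A and B."""
    return float(dec_a @ dec_b / (np.linalg.norm(dec_a) * np.linalg.norm(dec_b)))


def pearson_ce(kl_a: np.ndarray, kl_b: np.ndarray) -> float:
    """Pearson correlation between per-prompt KL shifts when the same
    feature is ablated in model A and in model B."""
    return float(np.corrcoef(kl_a, kl_b)[0, 1])


def in_cosine_causal_gap(cos: float, ce: float,
                         cos_min: float = 0.7, ce_max: float = 0.5) -> bool:
    """True when decoders look aligned but the causal effects disagree."""
    return cos > cos_min and ce < ce_max


# Toy feature: nearly identical decoders, opposite causal profiles.
cos = decoder_cosine(np.array([1.0, 0.0]), np.array([1.0, 0.1]))
ce = pearson_ce(np.array([1.0, 2.0, 3.0, 4.0]), np.array([4.0, 3.0, 2.0, 1.0]))
print(cos, ce, in_cosine_causal_gap(cos, ce))
```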

Probe-Detected Grokking in Multi-Probe DPO

draft

Orthogonal Learning Beyond Task-Specific Detectors in Qwen3.6-27B

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft) · 2026-05-08

We report a phase-transition-like dynamic in multi-probe Direct Preference Optimization on a 27B-parameter reasoning model, observable only through fresh probes trained after the fact. Original probes (FabricationGuard at L31, ReasoningGuard at L55) used as the joint preference signal remain invariant across training (variance 7×10⁻⁸, ~40× below within-step noise) despite a 0.234 DPO loss descent and 0.654 logit-difference. A fresh probe re-trained on each checkpoint reveals a smooth, accelerating progression in AUROC from 0.472 → 0.528 with a late-half-to-early-half slope ratio of 2.60 — the construct-then-compress signature of grokking dynamics, but with a compression target orthogonal to the original probes. We argue this is a structural Goodhart phenomenon specific to probe-derived rewards, propose fresh-probe AUROC progression as a complementary safety evaluation, and release training checkpoints, probes, and reproducer code under Apache-2.0.

grokking · DPO · Goodhart · probes · Qwen3.6-27B
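A late-half-to-early-half slope ratio like the one quoted in the abstract above can be computed by fitting a line to each half of the checkpoint trajectory. This is a sketch under assumptions: the paper does not state here how the halves are split, so the example shares the midpoint between both halves, and the trajectory values are synthetic rather than the released AUROCs.

```python
import numpy as np


def half_slope_ratio(steps, values) -> float:
    """Ratio of the late-half linear slope to the early-half linear slope.

    Assumed convention: the midpoint checkpoint belongs to both halves.
    """
    steps = np.asarray(steps, dtype=float)
    values = np.asarray(values, dtype=float)
    mid = len(steps) // 2
    early_slope = np.polyfit(steps[:mid + 1], values[:mid + 1], 1)[0]
    late_slope = np.polyfit(steps[mid:], values[mid:], 1)[0]
    return float(late_slope / early_slope)


# Synthetic 11-checkpoint trajectory: slope 1 in the first half, slope 2 after.
steps = np.arange(11)
values = np.array([0, 1, 2, 3, 4, 5, 7, 9, 11, 13, 15], dtype=float)
ratio = half_slope_ratio(steps, values)
print(ratio)
```

A ratio above 1 indicates acceleration late in training; a ratio near 1 indicates a roughly linear trend.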

Two Forms of Epiphenomenal Probes in Code Agents

draft

Mid-Reasoning Capability and Chain-of-Thought Emission in Qwen3.6-27B

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft) · 2026-05-08

We train linear probes on residual-stream activations of Qwen3.6-27B and obtain two correlative findings (capability AUROC 0.83 at L43 pre_tool, CoT emission AUROC 0.85 at L55). Three intervention experiments show both are detection-only via two distinct mechanisms: softmax-temperature artifact (L43) and template-locked decision (L55). We contribute three sanity checks (random-K baseline, control-token normalization, structural-rigidity α-sweep) and ship agent-probe-guard, an Apache-2.0 SDK that markets the correlative finding without overclaiming. Cross-environment eval reveals probe weights are coupled to inference setup; v0.3.1 ships a refit() helper.

linear probes · code agents · Qwen3.6-27B · epiphenomenal · methodology

Pre-flight Probe Eval v6 — Phase 8 Template-Lock Verdict

published

Closing the L55 thinking-emission causality experiment

Caio Vicentino · Internal eval document (companion to NeurIPS MI 2026 draft) · 2026-05-08

Phase 8 single-shot bidirectional steering with bf16 amplitude diagnostic delivers a structural-rigidity null that converges with Phase 7 on the higher-order claim: probes detect; mid-layer steering doesn't lever. Two epiphenomenal mechanisms documented (softmax-temp + template-lock), three methodology contributions consolidated (random-K baseline, control-token normalization, structural-rigidity α-sweep). Includes Phase 8 redux confirming structural lock isn't dilution.

intervention experiments · steering · Qwen3.6-27B · verdict

Six Diagnostics, Six Walk-Backs

draft

An Operational Checklist for Causal Claims in Mechanistic Interpretability

Caio Vicentino · NeurIPS 2026 Mechanistic Interpretability Workshop (draft v0.1) · 2026-05-19

Mechanistic interpretability papers routinely make causal claims — that a linear probe is a causal lever, that a circuit mediates a behavior, that an SAE feature implements a concept — but rarely state the identification assumptions that make these claims falsifiable. A recent position paper (Bohnet et al., 2026) proposed a disclosure norm: state whether a claim is causal, name the identification strategy, enumerate assumptions, and demonstrate how conclusions shift if assumptions fail. We operationalize this norm as six diagnostics, each runnable in under one GPU-hour per probe, and demonstrate each on a published-or-near-published causal claim from our own prior work where the diagnostic would have falsified the claim if not run. One of the six — trace-length-controlled slope decomposition — has not previously been published as standalone methodology. We position the resulting checklist as the minimum-viable operational layer beneath theoretical frameworks for causal abstraction (Geiger et al., 2024) and causal scrubbing (Anthropic), and ship an open-source Python module, a Colab demonstration on a toy probe, and an integration with the ProbeBench leaderboard.

methodology checklist · causal claims · identification assumptions · pre-registration · walk-back-and-rescue

Causal Localization of Agent WANDERING to Edge-Layer L11: The Right Locus Is Still Not a Rescue Lever

published

Three causal nulls and a dose-dependent destabilization at WANDERING's strongest detector locus

Caio Vicentino · Zenodo · CC-BY-4.0 · DOI 10.5281/zenodo.20490278 · 2026-06-01

We report three causal tests of the Tool-Entropy WANDERING mechanism hypothesis (mid-layer verdict consolidated, edge-layer alignment fails) on Qwen3.6-27B SWE-bench Pro trajectories, and all three are null on rescuing WANDERING. A forced-finish counterfactual rules out silent success (Fisher p=0.71); always-on L55 SUCCESS-donor steering is behaviorally inert (p=1.00); and re-targeting the injection to L11 — the edge layer the companion classification paper flags as WANDERING's strongest discriminator — across a norm-matched magnitude sweep does not rescue either (paired McNemar p=0.73). The one robust effect is the opposite of a rescue: at high magnitude the L11 hook destabilizes the model into invalid tool calls (0/20 → 12/20). We also surface a load-bearing methodological finding: WANDERING is not run-stable at temperature 1.0 (the same instances flip finish 7/20 with no intervention), so every intervention test must be paired and the unpaired 0/20 baseline manufactures a false positive.

agent WANDERING · causal localization · activation steering · honest negatives · Qwen3.6-27B

Multi-Channel Mechanistic Signatures of Agent WANDERING: Classification, Causal Localization, and Behavior-Legible Response to Intervention

published

60 multi-channel features, a mid-to-edge mechanism, and a residual-blind / behavior-legible response signal

Caio Vicentino · Zenodo · CC-BY-4.0 · DOI 10.5281/zenodo.20490284 · 2026-06-01

WANDERING — an LLM agent stays internally confident it has solved a task yet never emits a termination action and exhausts its turn budget — is a 34% blind spot in probe-based agent monitoring. We characterize it mechanistically on N=99 Qwen3.6-27B SWE-bench Pro trajectories: 60 multi-channel features (text, tool-use, per-layer residual, temporal) classify SUCCESS/LOCKED/WANDERING at macro-F1 0.636 (z=5.88, p=0.001), after a transparent walk-back from a leaky 0.987 baseline. Stability selection independently recovers a mid-to-edge mechanism (LOCKED→L43, WANDERING→L11), and an LLM-judge bridge to a human taxonomy co-locates ≈60% of both LOCKED and WANDERING into one category, matching a mechanistically weak boundary (p=0.035). Finally, the residual signature does not predict which agents flip to finish under a companion L11 injection run (LOO-AUC 0.619), but tool-entropy collapse depth does (AUC 0.768): response to intervention is residual-blind but behavior-legible.

agent WANDERING · multi-channel classification · stability selection · human-taxonomy bridge · tool-entropy

Modality Matters: A Transient Behavioral Interruption Rescues Agent WANDERING Where Residual Steering Does Not

published

The predictive signal is residual; the causal lever is behavioral — the first positive of the WANDERING arc

Caio Vicentino · Zenodo · CC-BY-4.0 · DOI 10.5281/zenodo.20490286 · 2026-06-01

A multi-turn coding agent fails in a distinctive way we call WANDERING: it keeps acting but never emits the finish action, exhausting its turn budget. A companion line of work detects WANDERING from a residual-stream signature and a tool-entropy signal, but three pre-registered residual interventions (SUCCESS-direction injection at L55 and at L11) all fail to rescue it. We ask whether the missing lever is not the locus but the modality. On the same 20 WANDERING Qwen3.6-27B SWE-bench Pro trajectories, gated by the same live tool-entropy collapse detector, a transient behavioral interruption — one fresh user turn at the collapse point — roughly doubles finalization (30%→70%, paired McNemar p=0.021), while residual L11 injection stays inert (p=0.63). The lever is the interruption itself, not its content: a content-neutral message rescues as well as a re-plan (p=1.0). SWE-bench Pro Docker evaluation suggests the interruption also raises task solve-rate (~23%→50%, cross-session). For long-horizon agents the predictive signal lives in the residual stream but the causal lever lives in behavior.

agent WANDERING · behavioral intervention · course-correction · detection-intervention asymmetry · tool-entropy
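The paired McNemar tests cited in the abstract above depend only on the discordant pairs: trajectories that finish under one condition but not the other. Below is a minimal sketch of the exact (binomial) two-sided version using only the standard library. The function name is an assumption, and the counts in the usage line are hypothetical illustrations, not figures taken from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs that succeed only under the intervention.
    c: pairs that succeed only under the baseline.
    Under the null, each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
p = mcnemar_exact(9, 1)
print(round(p, 4))  # 0.0215
```

Pairing matters here because, as the companion papers note, the same instance can flip between runs without any intervention; concordant pairs drop out of the test entirely.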

No Better Than Behavioral: A Residual Velocity-Freezing Fingerprint Predicts Agent WANDERING No Better Than the Cheap Tool-Entropy Detector

published

A pre-registered negative — companion note to the WANDERING arc: context rot leaves a real residual trace that is operationally redundant

Caio Vicentino · Zenodo · CC-BY-4.0 · DOI 10.5281/zenodo.20500053 · companion note to the WANDERING arc · 2026-06-01

Does the residual stream carry an earlier or better detector of long-horizon agent WANDERING than the cheap probe-free tool-entropy signal — does the geometry rot before the behavior does? We pre-register and test this on the same 99 Qwen3.6-27B SWE-bench Pro trajectories. Stage 1 (raw residual geometry, no SAE) finds a real but weak fingerprint, representational velocity-freezing: trajectories that will WANDER settle toward an attractor sooner (smaller per-turn state change early), directionally consistent across all five layers (4/5 raw p<0.05, length-controlled), with one mid-network layer clearing a pre-registered trend-and-divergence conjunction (p=0.015) — but nothing survives multiple-comparison correction. Stage 2 (the decisive predictive test) shows the fingerprint adds nothing: early velocity at L31 reaches AUROC 0.695, statistically indistinguishable from the fair early behavioral baseline (tool_entropy_first10, 0.688; paired bootstrap Δ=+0.008, 95% CI [−0.170,+0.211]) and clearly below the deployed late detector (0.888); as a sharp alarm at ≤5% false-positive it catches only 1–3 of 20 WANDERING — far fewer than the deployed detector, with too few overlapping detections to measure a lead-time advantage. The residual fingerprint of context rot is real but downstream-redundant — strengthening the arc: for this failure mode, watching the cheap behavior is as good as or better than reading the residual stream.

agent WANDERING · context rot · honest negatives · pre-registration · residual geometry

Roadmap

Living document. Items change as results come in.

Now (May 2026)
  • Phase 6 N=99 SWE-bench Pro trace collection (in flight, ~6h remaining at last check)
  • Paper-3 finalization with N=99 capability numbers + agent-probe-guard SDK v0.3.2 (Phase-1-real probes)
  • Paper-1 ICML MI Workshop notification awaited (June 12)
Next (Jun-Sep 2026)
  • NeurIPS MI Workshop 2026 submissions × 3 (Sep deadline): Cosine-Causal Gap, Probe-Detected Grokking, Two Forms of Epiphenomenal Probes
  • nb47 v2: probe-gated memory with thinking-preservation prompt fix (RAG-breaks-CoT finding paper)
  • nb48: LoRA distillation from probe-gated memory (completes 3-timescale stack)
  • MATS Winter 2027 application (opens July)
Later (Q4 2026+)
  • SAE-decoded steering experiments (Two Forms paper §6 future work, recover causal claim)
  • Cross-model crosscoder replication on Qwen + Llama (Qwen CSV currently empty, Llama unstarted)
  • nb37 v3: extended DPO 400 steps × 40 checkpoints (confirms grokking phase transition continues)
  • Integration with Anthropic circuit-tracer via native plugin
  • Cross-architecture probe transfer matrix (Qwen3.6 ↔ Llama-3.3 ↔ Gemma-2)

Further reading

55 canonical papers · curated

The reading list we wish we'd had when starting. Every paper cites a primary source (arxiv, transformer-circuits.pub, LessWrong, or an official blog). If you spot a dead link or a missing seminal paper, open a PR editing lib/papers.ts.

SAE foundations & scaling

Start here. The dictionary-learning → scaling → evaluation arc that every modern interpretability paper builds on.

Foundations

The "SAEs actually work" moment and its precursors.

Scaling SAEs

From toy models to frontier-scale.

Architecture variants

Beyond ReLU + L1: Gated, JumpReLU, BatchTopK, ProLU.

Transcoders & cross-coders

SAEs that span layers: the primitives behind 2025 circuit analysis.

Circuits & attribution

How information flows. Attribution patching, automatic circuit discovery, sparse feature circuits, Anthropic's 2025 biology papers.

Early circuit analysis

The Elhage/Olsson/Wang canon; read before any attribution paper.

Attribution patching

Circuit discovery algorithms

Causal scrubbing / DAS

Steering, probing & safety

The supervised and semi-supervised complements to SAEs — probes, activation steering, representation engineering, interpretability for deception detection.

Activation steering

Directions are the unit of analysis.

This list is intentionally curated, not exhaustive. Seminal paper missing? Open an issue with the arxiv/URL and one sentence on why it belongs. We'll review within 72h.

Cite this work

BibTeX for the library + protocol (paper arXiv forthcoming):

@software{openinterpretability2026mechreward,
  author = {Vicentino, Caio and contributors},
  orcid  = {0009-0003-4331-6259},
  title  = {mechreward: Mechanistic interpretability as reward signal for RL},
  year   = {2026},
  url    = {https://github.com/OpenInterpretability/mechreward},
  note   = {OpenInterpretability project},
}