The Atlas — every public mech-interp finding, hashed and citable.
A read-only index of probes, causality verdicts, and honest-negative findings — published via openinterp-mcp. Every entry ships with a content-only sha256, an HF dataset (when applicable), and an optional Zenodo DOI.
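One way to read "content-only sha256" is a digest taken over an entry's content fields in a canonical serialization, so that key order and volatile metadata do not change the hash. The sketch below works under that assumption; the field names and the `VOLATILE_FIELDS` set are hypothetical and are not the registry's documented schema.

```python
import hashlib
import json

# Hypothetical set of volatile fields excluded from the hash (an assumption,
# not the registry's documented schema).
VOLATILE_FIELDS = {"published_at", "doi", "sha256"}


def content_sha256(entry: dict) -> str:
    """Hash only the content fields of an entry, independent of key order."""
    content = {k: v for k, v in entry.items() if k not in VOLATILE_FIELDS}
    # sort_keys plus fixed separators give a canonical byte representation.
    canonical = json.dumps(
        content, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative entries: same content, different key order and timestamp.
entry_a = {
    "title": "My probe",
    "numbers": {"auroc": 0.91, "n_samples": 240},
    "published_at": "2025-01-01",
}
entry_b = {
    "published_at": "2025-06-30",
    "numbers": {"n_samples": 240, "auroc": 0.91},
    "title": "My probe",
}

assert content_sha256(entry_a) == content_sha256(entry_b)
```

Under this scheme, republishing identical content yields the same digest, while any change to a content field yields a different one.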
Probe result
Detection-tier probe artifacts — joblib + scaler + metadata, often with cross-task AUROC.
FabricationGuard v2 — L31 cross-task hallucination probe (Qwen3.6-27B)
A linear probe on the layer-31 residual stream detects confident hallucinations across tasks. HaluEval within-task 0.90, SimpleQA cross-task 0.88. 88% reduction in confident-wrong answers on SimpleQA. ~1ms sklearn p95 inference.
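For readers unfamiliar with this kind of artifact, here is a minimal sketch of how a scaler plus linear probe scores one activation vector. It assumes the artifact amounts to per-feature standardization followed by a logistic-regression probe, which is an assumption about the artifact's shape rather than something the entry states. All numbers are toy values, and the 4-dimensional vector stands in for a much wider residual stream.

```python
import math


def probe_score(activation, mean, scale, weights, bias):
    """Standardize a residual-stream vector, then apply a logistic linear probe."""
    z = [(a - m) / s for a, m, s in zip(activation, mean, scale)]
    logit = sum(w * zi for w, zi in zip(weights, z)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy scaler statistics and probe weights (illustrative, not from any artifact).
mean = [0.0, 1.0, -1.0, 0.5]
scale = [1.0, 2.0, 0.5, 1.0]
weights = [0.8, -0.4, 1.2, 0.0]
bias = -0.1

score = probe_score([0.5, 1.0, -0.5, 2.0], mean, scale, weights, bias)
print(f"probe score: {score:.3f}")
```

The score is a probability-like value in (0, 1); a deployment would pick a threshold on it from held-out data.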
ReasonGuard v0.2 — L55 mid_think CoT faithfulness probe (Qwen3.6-27B)
Position-of-faithfulness probe at the L55 mid_think token. AUROC 0.888 within GSM8K, 0.605 cross-task on StrategyQA. Honest narrow-scope finding: domain-bound, not universal.
CoTGuard v1 — CoT faithfulness probe via Lanham-2023 truncation (Qwen3.6-27B)
Linear probe trained on Lanham-2023 truncation-induced unfaithful CoT signal. Detection-tier probe — pending Phase 8 causality verdict (template-locked under steering).
agent-probe-guard v0.1 — L43 pre_tool detection probe (epiphenomenal-softmax under steering)
Detection-tier probe for tool-call success in SWE-bench traces. AUROC 0.83 at N=42 with random-feature baseline gap +0.27. Causality protocol verdict is epiphenomenal-softmax: probe DETECTS but cannot LEVER (paper-6 Phase 7 finding).
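The AUROC and baseline-gap figures quoted in these entries can be made concrete with a small sketch. It assumes "random-feature baseline gap" means the probe's AUROC minus the AUROC obtained from scores that carry no label information; that reading is an assumption. The data below is synthetic and is not Atlas data.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
labels = [1] * 21 + [0] * 21  # synthetic, balanced, N=42

# Synthetic probe scores: positives are shifted upward on average.
probe_scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]
# Baseline scores drawn independently of the labels.
random_scores = [rng.gauss(0.0, 1.0) for _ in labels]

probe_auroc = auroc(labels, probe_scores)
gap = probe_auroc - auroc(labels, random_scores)
print(f"probe AUROC={probe_auroc:.2f}, gap vs random baseline={gap:+.2f}")
```

At small N the baseline AUROC can sit noticeably away from 0.5 by chance, which is one reason to report the gap alongside the raw AUROC.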
Finding
Principles, methodology results, and findings without a single probe artifact attached.
Probe-detected grokking in multi-probe DPO (Qwen3.6-27B nb37 v2)
Phase transition (ratio 2.596) in fresh-probe AUROC across 11 nb37 v2 checkpoints. Original FG/RG probes show ZERO effect — DPO learning orthogonal to task-probe axes. Construct-then-compress pattern.
NLA two-tier verbalization — uniform fve_nrm decoupled from category-spread recall (Qwen2.5-7B + Gemma-3-12B)
N=150. Reconstruction fve_nrm UNIFORM 0.880 across chat/code/reasoning/agent. Keyword recall MASSIVELY category-dependent (chat 0.578 / agent 0.088 = 6.5×). Better-trained NLA → smaller fve_nrm spread but LARGER recall spread (decoupling magnification).
Saturation-direction principle — 5 empirical classes of probe causality (Qwen3.6-27B)
Unifies 8 probes into 5 causality classes. Saturation-direction principle: probes lever in the direction of baseline residual saturation. The L55 reversal in Phase 11e (pushdown→pushup when saturation flips) strongly confirms the principle.
Adversarial / negative
Causality verdicts that walked back claims (epiphenomenal, template-locked, OOD failures). Honest-negatives are first-class citizens.
Capability locus on Qwen3.6-27B SWE-bench Pro — 4/4 pre_tool/turn_end sites pushdown-asymmetric
α-sweep [-200,+200] on L23/L31/L43/L55 capability probes. All 4 sites show pushdown-asymmetric levers (+34 to +60pp gap vs random control). First causal verdict on capability axis. Refines paper-3 §4.1 L43 finding (was N=54 inflated).
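A minimal sketch of an α-sweep with a random control follows. It assumes steering means adding α times a unit-norm direction to a residual vector, and that the reported gap is the probe-direction effect minus the random-direction effect at the same α. The `behavior` function is a hypothetical stand-in (a projection); a real run would measure task outcomes after a forward pass with the steering applied.

```python
import math
import random


def unit(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def steer(residual, direction, alpha):
    """Add alpha times the unit-norm direction to a residual vector."""
    d = unit(direction)
    return [r + alpha * di for r, di in zip(residual, d)]


def sweep_gap(residual, probe_dir, control_dir, alphas, behavior):
    """Per-alpha behavior difference: probe direction minus control direction."""
    return {
        a: behavior(steer(residual, probe_dir, a))
        - behavior(steer(residual, control_dir, a))
        for a in alphas
    }


rng = random.Random(0)
dim = 8  # toy width; real residual streams are far wider
residual = [rng.gauss(0, 1) for _ in range(dim)]
probe_dir = [rng.gauss(0, 1) for _ in range(dim)]
control_dir = [rng.gauss(0, 1) for _ in range(dim)]  # random control direction

probe_unit = unit(probe_dir)


def behavior(h):
    """Hypothetical stand-in metric: projection onto the probe direction."""
    return sum(x * p for x, p in zip(h, probe_unit))


gaps = sweep_gap(residual, probe_dir, control_dir, [-200, -100, 0, 100, 200], behavior)
```

This toy metric is linear, so the gaps are symmetric in α by construction. An asymmetric lever of the kind described in the entry could only show up when the metric comes from the model's actual behavior.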
L55 CoT-Integrity probe is template-locked epiphenomenal (Qwen3.6-27B)
N=240. AUROC 0.91. Bidirectional steering up to α=+200 (>‖residual‖) produces ZERO behavioral change for probe AND random direction. Mechanism: enable_thinking=False chat template injects <think></think> in input tokens — thinking decision is not in residual stream.
L43 pre_tool probe is softmax-temp epiphenomenal (Qwen3.6-27B SWE-bench)
Triple-source convergent verdict on the L43 pre_tool probe direction. (1) log-prob proxy with control-token norm: Δrel ≈ 0; (2) single-shot α=+5: 4/4 failures select the same tool; (3) continuous α=+5: 3/4 keep the same tool. Probe DETECTS but does not LEVER.
Multi-probe ensemble OOD walk-back — 0/3 cross-distribution generalization (Qwen3.6-27B)
Cross-distribution test on TruthfulQA + StrategyQA + TriviaQA. 0/3 survive, mean lift −0.002. The nb45 +6.7pp gain was a within-distribution effect. The ProbePack universal-middleware framing has been publicly walked back. The single FG probe remains valid OOD on factual tasks (TriviaQA 0.710).
Publish your finding to the Atlas
In your agent session (Claude Code / Cursor / Cline / OpenHands / Aider), once attached to a Colab backend via openinterp-mcp:
from openinterp_mcp.publish import publish

publish(
    title="My probe — what it does in one line",
    type="probe-result",
    model_id="Qwen/Qwen3.6-27B-Instruct",
    numbers={"auroc": 0.91, "n_samples": 240},
    methodology_check={"verdict": "weak-causal", "baselines_run": [...]},
    hf_repo_id="myuser/my-probe-artifact",
)
# → HF dataset created + Zenodo DOI minted + PR opened against the registry