Public registry · Schema v1 · Apache-2.0

The Atlas — every public mech-interp finding, hashed and citable.

A read-only index of probes, causality verdicts, and honest-negative findings — published via openinterp-mcp. Every entry ships with a content-only sha256, an HF dataset (when applicable), and an optional Zenodo DOI.
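To illustrate what a content-only hash could look like, the sketch below hashes a canonical JSON serialization of an entry's content fields and keeps the first 10 hex characters, the same length as the short IDs shown on this page. The excluded fields, the canonicalization, and the name `content_hash` are assumptions made for illustration, not the registry's documented scheme.

```python
import hashlib
import json


def content_hash(entry: dict, exclude=("published_at", "doi"), length: int = 10) -> str:
    """Hash only the content fields of an entry (hypothetical scheme)."""
    # Drop fields assumed to be publication metadata rather than content.
    content = {k: v for k, v in entry.items() if k not in exclude}
    # Sorted keys and fixed separators make the serialization order-independent.
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


# Two entries with the same content but different key order and publication date.
a = {"title": "My probe", "numbers": {"auroc": 0.91}, "published_at": "2026-05-11"}
b = {"published_at": "2026-05-12", "numbers": {"auroc": 0.91}, "title": "My probe"}
same_hash = content_hash(a) == content_hash(b)
```

Under these assumptions, re-publishing an unchanged entry yields the same hash, while any edit to the title or numbers changes it.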

Entries · updated 2026-05-11

Probes · detection-tier artifacts

Adversarial / negative · honest-negatives are first-class

Findings · principles + methodology results

Publish via your agent · Registry on GitHub · Raw index.json

Probe result

Detection-tier probe artifacts — joblib + scaler + metadata, often with cross-task AUROC.

4 entries

Probe result · Qwen3.6-27B-Instruct · 2026-05-11

FabricationGuard v2 — L31 cross-task hallucination probe (Qwen3.6-27B)

Linear probe on the layer-31 residual stream detects confident hallucinations across tasks. HaluEval within-task 0.90, SimpleQA cross-task 0.88. Confident-wrong answers on SimpleQA drop by 88%. ~1 ms p95 sklearn inference.

8d5df2d5d5 · by caiovicentino · 🤗 openinterp/fabricationguard-qwen36-27b-l31-v2
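Probe artifacts on this page are described as joblib + scaler + metadata. Assuming such an artifact is a standard scaler followed by a logistic-regression head (this page does not confirm the exact layout), scoring one residual-stream activation reduces to the arithmetic below. The function name `probe_score` and the toy numbers are hypothetical.

```python
import math


def probe_score(activation, mean, scale, weights, bias):
    """Standardize an activation vector, then apply a logistic linear probe."""
    z = bias
    for x, m, s, w in zip(activation, mean, scale, weights):
        z += w * (x - m) / s  # standard-scaler step followed by the dot product
    # Numerically stable sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Toy 2-dimensional example; real residual streams have thousands of dimensions.
mean, scale = [0.5, -1.0], [1.0, 2.0]
weights, bias = [2.0, -1.0], 0.0
neutral = probe_score([0.5, -1.0], mean, scale, weights, bias)
shifted = probe_score([1.5, -1.0], mean, scale, weights, bias)
```

An activation equal to the scaler mean scores exactly 0.5 with zero bias; moving along a positively weighted feature raises the score.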

Probe result · Qwen3.6-27B-Instruct · 2026-05-11

ReasonGuard v0.2 — L55 mid_think CoT faithfulness probe (Qwen3.6-27B)

Position-of-faithfulness probe at the L55 mid_think token. AUROC 0.888 within GSM8K, 0.605 cross-task on StrategyQA. Honest narrow-scope finding: domain-bound, not universal.

49eba51edb · by caiovicentino · 🤗 openinterp/reasonguard-qwen36-27b-l55-mid_think

Probe result · Qwen3.6-27B-Instruct · 2026-05-11

CoTGuard v1 — CoT faithfulness probe via Lanham-2023 truncation (Qwen3.6-27B)

Linear probe trained on Lanham-2023 truncation-induced unfaithful CoT signal. Detection-tier probe — pending Phase 8 causality verdict (template-locked under steering).

7a4c7cf42e · by caiovicentino

Probe result · Qwen3.6-27B-Instruct · 2026-05-10

agent-probe-guard v0.1 — L43 pre_tool detection probe (epiphenomenal-softmax under steering)

Detection-tier probe for tool-call success in SWE-bench traces. AUROC 0.83 at N=42 with random-feature baseline gap +0.27. Causality protocol verdict is epiphenomenal-softmax: probe DETECTS but cannot LEVER (paper-6 Phase 7 finding).

9f2e9c5b8e · by caiovicentino · 🤗 caiovicentino1/agent-probe-guard-qwen36-27b
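The entry above reports an AUROC together with a gap over a random-feature baseline. The sketch below shows one way such a gap could be computed in plain Python. It uses random scores as a simplified stand-in for a probe trained on random features, which is an assumption rather than the protocol used for this entry, and the names `auroc` and `baseline_gap` are hypothetical.

```python
import random


def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def baseline_gap(probe_scores, labels, n_random=200, seed=0):
    """Probe AUROC minus the mean AUROC of random scores on the same labels."""
    rng = random.Random(seed)
    random_aurocs = [
        auroc([rng.random() for _ in labels], labels) for _ in range(n_random)
    ]
    return auroc(probe_scores, labels) - sum(random_aurocs) / n_random


# Toy data: a probe that ranks every positive above every negative.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
gap = baseline_gap(scores, labels)
```

At small N the random baseline is noisy, so averaging over many random draws (here 200) gives a steadier reference point than a single draw.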

Finding

Principles, methodology results, and findings without a single probe artifact attached.

3 entries

Finding · Qwen3.6-27B-Instruct · 2026-05-11

Probe-detected grokking in multi-probe DPO (Qwen3.6-27B nb37 v2)

Phase transition (ratio 2.596) in fresh-probe AUROC across 11 nb37 v2 checkpoints. Original FG/RG probes show ZERO effect — DPO learning orthogonal to task-probe axes. Construct-then-compress pattern.

7019cff912 · by caiovicentino · 🤗 caiovicentino1/openinterp-37v2-multiprobe-dpo-extended

Finding · gemma-3-12b · 2026-05-11

NLA two-tier verbalization — uniform fve_nrm decoupled from category-spread recall (Qwen2.5-7B + Gemma-3-12B)

N=150. Reconstruction fve_nrm UNIFORM 0.880 across chat/code/reasoning/agent. Keyword recall MASSIVELY category-dependent (chat 0.578 / agent 0.088 = 6.5×). Better-trained NLA → smaller fve_nrm spread but LARGER recall spread (decoupling magnification).

e328cd066f · by caiovicentino

Finding · Qwen3.6-27B-Instruct · 2026-05-11

Saturation-direction principle — 5 empirical classes of probe causality (Qwen3.6-27B)

Unifies 8 probes into 5 causality classes. Saturation-direction principle: probes lever in the direction of baseline residual saturation. L55 reversal in Phase 11e (pushdown→pushup when saturation flips) strongly confirms principle.

03a6e70bfd · by caiovicentino

Adversarial / negative

Causality verdicts that walked back claims (epiphenomenal, template-locked, OOD failures). Honest-negatives are first-class citizens.

4 entries

Adversarial / negative · Qwen3.6-27B-Instruct · 2026-05-11

Capability locus on Qwen3.6-27B SWE-bench Pro — 4/4 pre_tool/turn_end sites pushdown-asymmetric

α-sweep [-200,+200] on L23/L31/L43/L55 capability probes. All 4 sites show pushdown-asymmetric levers (+34 to +60pp gap vs random control). First causal verdict on capability axis. Refines paper-3 §4.1 L43 finding (was N=54 inflated).

60b5c38463 · by caiovicentino

Adversarial / negative · Qwen3.6-27B-Instruct · 2026-05-11

L55 CoT-Integrity probe is template-locked epiphenomenal (Qwen3.6-27B)

N=240. AUROC 0.91. Bidirectional steering up to α=+200 (>‖residual‖) produces ZERO behavioral change for probe AND random direction. Mechanism: enable_thinking=False chat template injects <think></think> in input tokens — thinking decision is not in residual stream.

a0c01e67c9 · by caiovicentino
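The steering test in the entry above can be pictured as adding α times a direction to the residual vector, once for the probe direction and once for a random control direction. The sketch below assumes additive steering along a unit-normalized direction. That form, and the names `steer` and `random_direction`, are illustrative assumptions rather than the protocol's actual code.

```python
import math
import random


def unit(v):
    """Return v scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def steer(residual, direction, alpha):
    """Additive steering: residual + alpha * unit(direction)."""
    d = unit(direction)
    return [h + alpha * x for h, x in zip(residual, d)]


def random_direction(dim, seed=0):
    """Random unit vector used as the control direction."""
    rng = random.Random(seed)
    return unit([rng.gauss(0.0, 1.0) for _ in range(dim)])


# Toy residual with norm 3, steered at alpha = +200 (far larger than its norm).
residual = [1.0, 2.0, 2.0]
steered_probe = steer(residual, [0.0, 0.0, 2.0], alpha=200.0)  # [1.0, 2.0, 202.0]
steered_control = steer(residual, random_direction(3), alpha=200.0)
```

Because both directions are unit-normalized, the probe and control interventions have the same magnitude, so any behavioral difference between them can be attributed to direction rather than size.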

Adversarial / negative · Qwen3.6-27B-Instruct · 2026-05-11

L43 pre_tool probe is softmax-temp epiphenomenal (Qwen3.6-27B SWE-bench)

Triple-source convergent verdict on L43 pre_tool probe direction. (1) log-prob proxy with control-token norm: Δrel ≈ 0; (2) single-shot α=+5: 4/4 fails select same tool; (3) continuous α=+5: 3/4 keep same tool. Probe DETECTS but does not LEVER.

23bb3f2c30 · by caiovicentino · 🤗 caiovicentino1/agent-probe-guard-qwen36-27b

Adversarial / negative · Qwen3.6-27B-Instruct · 2026-05-11

Multi-probe ensemble OOD walk-back — 0/3 cross-distribution generalization (Qwen3.6-27B)

Cross-distribution test on TruthfulQA + StrategyQA + TriviaQA. 0/3 survive, mean lift −0.002. The nb45 +6.7pp gain was a within-distribution effect. ProbePack universal-middleware framing publicly walked back. FG single probe still valid OOD on factual (TriviaQA 0.710).

bfd84a5c21 · by caiovicentino · 🤗 caiovicentino1/openinterp-46-cross-distribution-ensemble

Publish your finding to the Atlas

In your agent session (Claude Code / Cursor / Cline / OpenHands / Aider), once attached to a Colab backend via openinterp-mcp:

from openinterp_mcp.publish import publish

publish(
  title="My probe — what it does in one line",
  type="probe-result",
  model_id="Qwen/Qwen3.6-27B-Instruct",
  numbers={"auroc": 0.91, "n_samples": 240},
  methodology_check={"verdict": "weak-causal", "baselines_run": [...]},
  hf_repo_id="myuser/my-probe-artifact",
)
# → HF dataset created + Zenodo DOI minted + PR opened against the registry
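Before calling `publish`, it may help to sanity-check the `numbers` payload locally. The helper below is a hypothetical pre-flight check and is not part of openinterp-mcp. It assumes only the two keys shown in the example above, `auroc` and `n_samples`.

```python
def check_numbers(numbers: dict) -> list:
    """Return a list of problems found in a numbers payload (empty if none)."""
    problems = []
    auroc = numbers.get("auroc")
    # AUROC is optional here, but if present it must be a number in [0, 1].
    if auroc is not None and not (
        isinstance(auroc, (int, float)) and 0.0 <= auroc <= 1.0
    ):
        problems.append("auroc must be a number in [0, 1]")
    n = numbers.get("n_samples")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(n, int) or isinstance(n, bool) or n <= 0:
        problems.append("n_samples must be a positive integer")
    return problems


ok = check_numbers({"auroc": 0.91, "n_samples": 240})
bad = check_numbers({"auroc": 1.2, "n_samples": 0})
```

An empty list means the payload passed these two checks. It says nothing about whatever validation the registry itself applies.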

First result in 10 minutes · Publish docs