Public registry · Schema v1 · Apache-2.0

The Atlas — every public mech-interp finding, hashed and citable.

A read-only index of probes, causality verdicts, and honest-negative findings — published via openinterp-mcp. Every entry ships with a content-only sha256, an HF dataset (when applicable), and an optional Zenodo DOI.
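
To make "content-only sha256" concrete, here is a minimal sketch of how such a hash could be computed. The canonicalization (sorted-key compact JSON), the excluded field names, and the 10-character short ID are assumptions for illustration; the registry's actual scheme is not specified on this page.

```python
import hashlib
import json

# Hypothetical set of fields treated as metadata rather than content.
# The registry's real exclusion list is not documented here.
VOLATILE_FIELDS = frozenset({"published_at", "doi", "hf_repo_id"})


def content_sha256(entry: dict) -> str:
    """Hash only the content fields of an entry, independent of key order."""
    content = {k: v for k, v in entry.items() if k not in VOLATILE_FIELDS}
    # Sorted keys and compact separators give a deterministic byte string.
    canonical = json.dumps(
        content, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def short_id(entry: dict, length: int = 10) -> str:
    """Truncated hex digest, assumed to resemble the short IDs shown per entry."""
    return content_sha256(entry)[:length]
```

Under these assumptions, two entries with the same content but different publication metadata or key order hash to the same value, which is what makes the hash citable.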

11 Entries · updated 2026-05-11
4 Probes · detection-tier artifacts
4 Adversarial / negative · honest-negatives are first-class
3 Findings · principles + methodology results

Probe result

Detection-tier probe artifacts — joblib + scaler + metadata, often with cross-task AUROC.

4 entries
Probe result · Qwen3.6-27B-Instruct · 2026-05-11

FabricationGuard v2 — L31 cross-task hallucination probe (Qwen3.6-27B)

A linear probe on the layer-31 residual stream detects confident hallucinations across tasks: 0.90 within-task on HaluEval, 0.88 cross-task on SimpleQA. Confident-wrong answers on SimpleQA drop by 88%. sklearn inference takes ~1 ms at p95.

8d5df2d5d5 · by caiovicentino · 🤗 openinterp/fabricationguard-qwen36-27b-l31-v2

Probe result · Qwen3.6-27B-Instruct · 2026-05-11

ReasonGuard v0.2 — L55 mid_think CoT faithfulness probe (Qwen3.6-27B)

Position-of-faithfulness probe at the L55 mid_think token. AUROC 0.888 within GSM8K, 0.605 cross-task on StrategyQA. Honest narrow-scope finding: domain-bound, not universal.

49eba51edb · by caiovicentino · 🤗 openinterp/reasonguard-qwen36-27b-l55-mid_think

Probe result · Qwen3.6-27B-Instruct · 2026-05-11

CoTGuard v1 — CoT faithfulness probe via Lanham-2023 truncation (Qwen3.6-27B)

Linear probe trained on Lanham-2023 truncation-induced unfaithful CoT signal. Detection-tier probe — pending Phase 8 causality verdict (template-locked under steering).

7a4c7cf42e · by caiovicentino

Probe result · Qwen3.6-27B-Instruct · 2026-05-10

agent-probe-guard v0.1 — L43 pre_tool detection probe (epiphenomenal-softmax under steering)

Detection-tier probe for tool-call success in SWE-bench traces. AUROC 0.83 at N=42 with random-feature baseline gap +0.27. Causality protocol verdict is epiphenomenal-softmax: probe DETECTS but cannot LEVER (paper-6 Phase 7 finding).

9f2e9c5b8e · by caiovicentino · 🤗 caiovicentino1/agent-probe-guard-qwen36-27b
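
The AUROC figures and the random-feature baseline gap quoted in the entries above can be illustrated with a small sketch. It uses the pairwise (Mann-Whitney) definition of AUROC in plain Python. The function names and toy scores are made up for illustration; they are not the published probes' data or evaluation code.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def baseline_gap(
    labels: Sequence[int],
    probe_scores: Sequence[float],
    random_scores: Sequence[float],
) -> float:
    """AUROC of the probe minus AUROC of a random-feature control on the same labels."""
    return auroc(labels, probe_scores) - auroc(labels, random_scores)


# Toy data only: six examples, three positive and three negative.
toy_labels = [0, 0, 1, 1, 0, 1]
toy_probe_scores = [0.1, 0.4, 0.8, 0.9, 0.3, 0.35]
toy_random_scores = [0.5, 0.2, 0.4, 0.6, 0.7, 0.1]
toy_gap = baseline_gap(toy_labels, toy_probe_scores, toy_random_scores)
```

The pairwise form is quadratic in the number of examples, which is acceptable at sample sizes like the N=42 quoted above; a rank-based implementation would be the usual choice for larger sets.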

Finding

Principles, methodology results, and findings without a single probe artifact attached.

3 entries

Adversarial / negative

Causality verdicts that walked back claims (epiphenomenal, template-locked, OOD failures). Honest-negatives are first-class citizens.

4 entries
Adversarial / negative · Qwen3.6-27B-Instruct · 2026-05-11

Capability locus on Qwen3.6-27B SWE-bench Pro — 4/4 pre_tool/turn_end sites pushdown-asymmetric

α-sweep [-200,+200] on L23/L31/L43/L55 capability probes. All 4 sites show pushdown-asymmetric levers (+34 to +60pp gap vs random control). First causal verdict on capability axis. Refines paper-3 §4.1 L43 finding (was N=54 inflated).

60b5c38463 · by caiovicentino

Adversarial / negative · Qwen3.6-27B-Instruct · 2026-05-11

L55 CoT-Integrity probe is template-locked epiphenomenal (Qwen3.6-27B)

N=240. AUROC 0.91. Bidirectional steering up to α=+200 (>‖residual‖) produces ZERO behavioral change for probe AND random direction. Mechanism: enable_thinking=False chat template injects <think></think> in input tokens — thinking decision is not in residual stream.

a0c01e67c9 · by caiovicentino

Adversarial / negative · Qwen3.6-27B-Instruct · 2026-05-11

L43 pre_tool probe is softmax-temp epiphenomenal (Qwen3.6-27B SWE-bench)

Triple-source convergent verdict on L43 pre_tool probe direction. (1) log-prob proxy with control-token norm: Δrel ≈ 0; (2) single-shot α=+5: 4/4 fails select same tool; (3) continuous α=+5: 3/4 keep same tool. Probe DETECTS but does not LEVER.

23bb3f2c30 · by caiovicentino · 🤗 caiovicentino1/agent-probe-guard-qwen36-27b

Adversarial / negative · Qwen3.6-27B-Instruct · 2026-05-11

Multi-probe ensemble OOD walk-back — 0/3 cross-distribution generalization (Qwen3.6-27B)

Cross-distribution test on TruthfulQA + StrategyQA + TriviaQA. 0/3 survive, mean lift −0.002. The nb45 +6.7pp was a within-distribution effect. ProbePack universal-middleware framing publicly walked back. FG single probe still valid OOD on factual (TriviaQA 0.710).

bfd84a5c21 · by caiovicentino · 🤗 caiovicentino1/openinterp-46-cross-distribution-ensemble
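
The causality verdicts above rest on activation steering: a scaled direction is added to the residual stream, and behavior is compared between the probe direction and a random control. Below is a minimal sketch of that arithmetic in plain Python on toy vectors. The function names are illustrative, the five-point α grid is an assumption (only the [-200, +200] range comes from the entries), and no model is run here.

```python
import math
import random
from typing import Dict, List, Sequence


def unit(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def steer(residual: Sequence[float], direction: Sequence[float], alpha: float) -> List[float]:
    """Return residual + alpha * unit(direction); alpha sets the size of the shift."""
    if len(residual) != len(direction):
        raise ValueError("residual and direction must have the same dimension")
    d = unit(direction)
    return [h + alpha * x for h, x in zip(residual, d)]


def random_direction(dim: int, seed: int = 0) -> List[float]:
    """Random unit vector used as the control direction."""
    rng = random.Random(seed)
    return unit([rng.gauss(0.0, 1.0) for _ in range(dim)])


# Assumed grid; the entries only state the sweep range.
ALPHAS = [-200.0, -100.0, 0.0, 100.0, 200.0]


def sweep(residual: Sequence[float], direction: Sequence[float],
          alphas: Sequence[float] = ALPHAS) -> Dict[float, List[float]]:
    """Steered residuals for each alpha, keyed by alpha."""
    return {a: steer(residual, direction, a) for a in alphas}
```

In a real experiment the steered vector would be written back into the forward pass at the chosen layer and token site, and the behavioral outcome compared between `sweep(h, probe_dir)` and `sweep(h, random_direction(len(h)))`. A probe whose direction changes behavior no more than the random control detects without acting as a lever.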

Publish your finding to the Atlas

In your agent session (Claude Code / Cursor / Cline / OpenHands / Aider), once attached to a Colab backend via openinterp-mcp:

from openinterp_mcp.publish import publish

publish(
  title="My probe — what it does in one line",
  type="probe-result",
  model_id="Qwen/Qwen3.6-27B-Instruct",
  numbers={"auroc": 0.91, "n_samples": 240},
  methodology_check={"verdict": "weak-causal", "baselines_run": [...]},
  hf_repo_id="myuser/my-probe-artifact",
)
# → HF dataset created + Zenodo DOI minted + PR opened against the registry