Probe result · epiphenomenal-softmax · Qwen/Qwen3.6-27B-Instruct · 2026-05-10 · by caiovicentino

agent-probe-guard v0.1 — L43 pre_tool detection probe (epiphenomenal-softmax under steering)

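Detection-tier probe for tool-call success in SWE-bench traces. It reaches AUROC 0.83 on N=42 samples, a gap of +0.27 over the random-feature baseline. The causality protocol verdict is epiphenomenal-softmax: the probe detects tool-call success, but steering along it produces no behavioral flips, so it cannot be used as a lever (paper-6, Phase 7 finding).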

Numbers

auroc: 0.830
n_samples: 42
gap_over_random_baseline_pca10: 0.269
control_token_normalized_delta_rel: -0.046
behavioral_flips_alpha_5: 0
sklearn_p95_inference_latency_ms: 0.190
verdict_class: epiphenomenal-softmax

Methodology check

Output of the causality_protocol primitive when it was run on this artifact. See paper-6 for the 3-baseline methodology and the 5-class verdict spec.

verdict: epiphenomenal-softmax
real_auroc: 0.83
random_baseline_mean: 0.561
auroc_gap: 0.269
delta_rel_max: -0.046
flip_rate_at_max_alpha: 0
baselines_run: random_direction_random_acts
caveat: v0 protocol; predates the three-baseline upgrade in openinterp-mcp v0.0.3
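The auroc_gap above is the difference between real_auroc and random_baseline_mean. The sketch below recomputes it and shows one standard way to compute AUROC (the pairwise Mann-Whitney form). It is an illustration only: the `auroc` helper is hypothetical and is not the causality_protocol implementation, which is not shown on this page.

```python
def auroc(labels, scores):
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Hypothetical helper for illustration; the
    actual protocol code may compute AUROC differently.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Values reported in the methodology check above.
real_auroc = 0.83
random_baseline_mean = 0.561

# The reported gap is the plain difference, rounded to three decimals.
auroc_gap = round(real_auroc - random_baseline_mean, 3)
print(auroc_gap)
```

With only N=42 samples, a pairwise computation like this is cheap, and it makes explicit that 0.5 corresponds to chance ranking.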

Artifacts

probe.joblib · manifest.json · README.md

These files live in the linked HF dataset.

Cite

The sha256 below covers the manifest content only. To verify it, re-hash the JSON manifest with manifest_sha256 set to null and sort_keys=True; the result should match the digest. Zenodo DOI pending.

manifest_sha256: 9f2e9c5b8e4fbb7c7eb4c9290d927e76341006c9d63277da05f4d21a0ab26c9b
Atlas URL: https://openinterp.org/atlas/9f2e9c5b8e
Raw manifest: https://raw.githubusercontent.com/OpenInterpretability/registry/main/atlas/2026/9f2e9c5b8e.json
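The re-hash check described under Cite can be sketched as follows. The page specifies only that manifest_sha256 is set to null and that sort_keys=True is used; the remaining choices here (default `json.dumps` separators and ASCII escaping, UTF-8 encoding) are assumptions and may need adjusting to reproduce the published digest. The helper names and the toy manifest fields are hypothetical.

```python
import copy
import hashlib
import json


def content_sha256(manifest: dict) -> str:
    """Hash a manifest with its own digest field nulled out."""
    body = copy.deepcopy(manifest)
    body["manifest_sha256"] = None  # serialized as JSON null
    # Assumption: json.dumps defaults apart from sort_keys, then UTF-8 bytes.
    canonical = json.dumps(body, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(manifest: dict) -> bool:
    """Return True if the stored digest matches the recomputed one."""
    return manifest.get("manifest_sha256") == content_sha256(manifest)


# Toy manifest with hypothetical fields, for illustration only.
toy = {"verdict": "epiphenomenal-softmax", "auroc": 0.83, "manifest_sha256": None}
toy["manifest_sha256"] = content_sha256(toy)
print(verify(toy))
```

Nulling the digest field before hashing is what makes the hash content-only: the digest can be stored inside the same JSON document it describes without changing the value it verifies.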

Reproduce this in your agent

In an agent session attached to your Colab via openinterp-mcp:

from openinterp_mcp.atlas import load_entry

entry = load_entry("9f2e9c5b8e")
print(entry.methodology_check)

# Re-run the causality protocol against the linked HF artifact:
from openinterp_mcp.judge import reproduce
reproduce(entry, hf_repo_id="caiovicentino1/agent-probe-guard-qwen36-27b")