Adversarial / negative · epiphenomenal-softmax · Qwen/Qwen3.6-27B-Instruct · 2026-05-11 · by caiovicentino
L43 pre_tool probe is softmax-temp epiphenomenal (Qwen3.6-27B SWE-bench)
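Three independent tests converge on the same verdict for the L43 pre_tool probe direction: (1) log-prob proxy with control-token normalization: Δrel ≈ 0; (2) single-shot intervention at α=+5: 4/4 failing cases select the same tool; (3) continuous intervention at α=+5: 3/4 keep the same tool. The probe DETECTS but does not LEVER: the direction carries a readable signal (AUROC 0.830), but pushing activations along it does not reliably change which tool is selected.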
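As a rough illustration of how counts such as "4/4 select the same tool" map to a flip rate, here is a minimal sketch. It assumes the flip rate is the fraction of cases whose selected tool differs between the unsteered and steered runs; that definition, the function name `flip_rate`, and the tool names are hypothetical and not taken from the causality protocol itself. Only the 4/4 and 3/4 counts come from the entry.

```python
def flip_rate(baseline_tools, steered_tools):
    """Fraction of cases where the steered run picks a different tool."""
    if len(baseline_tools) != len(steered_tools):
        raise ValueError("both runs must cover the same cases")
    if not baseline_tools:
        raise ValueError("need at least one case")
    flips = sum(b != s for b, s in zip(baseline_tools, steered_tools))
    return flips / len(baseline_tools)


# Hypothetical tool names; only the same/different pattern mirrors the entry.
baseline = ["edit_file", "edit_file", "run_tests", "search"]
single_shot = list(baseline)                                      # 4/4 same tool
continuous = ["edit_file", "edit_file", "run_tests", "edit_file"]  # 3/4 same tool

print(flip_rate(baseline, single_shot))  # 0.0
print(flip_rate(baseline, continuous))   # 0.25
```

Under this assumed definition, the single-shot result corresponds to the reported flip_rate_at_alpha_5 of 0.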
Numbers
auroc: 0.830
n_samples: 42
layer: 43
position: pre_tool
delta_rel_max: -0.046
flip_rate_at_alpha_5: 0
verdict_class: epiphenomenal-softmax
Methodology check
Output of the causality_protocol primitive when it was run on this artifact. See paper-6 for the 3-baseline methodology and the 5-class verdict spec.
- verdict: epiphenomenal-softmax
- baselines_run:
  - random_direction_random_acts
  - control_token_normalization: ✓ yes
  - structural_rigidity_sweep: ✗ no
Artifacts
phase7_micro_pilot.json
phase7_continuous.json
These files live in the linked HF dataset.
Cite
Content-only sha256 below. Verifiable: re-hash the JSON manifest (with manifest_sha256 set to null, sort_keys=True) and you get the same digest. Zenodo DOI pending.
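The re-hash described above can be sketched as follows. This is a minimal sketch, not the registry's own tooling: it assumes the manifest is serialized with Python's default `json.dumps` separators and ASCII escaping and encoded as UTF-8 before hashing. If the registry uses different serialization settings, the digest will not match. The helper names `content_sha256` and `verify` and the toy manifest fields are hypothetical.

```python
import hashlib
import json


def content_sha256(manifest: dict) -> str:
    """Hash a manifest with its own digest field set to null."""
    stripped = dict(manifest)            # shallow copy; caller's dict is untouched
    stripped["manifest_sha256"] = None   # serialized as JSON null
    # Assumption: default json.dumps separators, UTF-8 bytes.
    payload = json.dumps(stripped, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def verify(manifest: dict) -> bool:
    """True if the recorded digest equals the recomputed one."""
    return manifest.get("manifest_sha256") == content_sha256(manifest)


# Toy manifest for illustration only (not the real entry).
toy = {"layer": 43, "position": "pre_tool", "manifest_sha256": None}
toy["manifest_sha256"] = content_sha256(toy)
print(verify(toy))  # True
```

To check the real entry, one would load the raw manifest JSON into a dict and pass it to `verify`. Because the digest field is nulled before hashing, the digest can be stored inside the same file it covers.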
manifest_sha256: 23bb3f2c303b120e2689f5dbe1c5d55ea40f25e36754f546b493fe52fb30e1d3
Atlas URL: https://openinterp.org/atlas/23bb3f2c30
Raw manifest: https://raw.githubusercontent.com/OpenInterpretability/registry/main/atlas/2026/23bb3f2c30.json
Reproduce this in your agent
In an agent session attached to your Colab via openinterp-mcp:
from openinterp_mcp.atlas import load_entry
entry = load_entry("23bb3f2c30")
print(entry.methodology_check)
# Re-run the causality protocol against the linked HF artifact:
from openinterp_mcp.judge import reproduce
reproduce(entry, hf_repo_id="caiovicentino1/agent-probe-guard-qwen36-27b")