Adversarial / negative · epiphenomenal-softmax · Qwen/Qwen3.6-27B-Instruct · 2026-05-11 · by caiovicentino

L43 pre_tool probe is softmax-temp epiphenomenal (Qwen3.6-27B SWE-bench)

Three independent checks converge on one verdict for the L43 pre_tool probe direction: (1) a log-prob proxy with control-token normalization gives Δrel ≈ 0; (2) under single-shot steering at α=+5, 4/4 failed runs still select the same tool; (3) under continuous steering at α=+5, 3/4 keep the same tool. The probe DETECTS the signal but does not LEVER tool choice.

Numbers

auroc: 0.830
n_samples: 42
layer: 43
position: pre_tool
delta_rel_max: -0.046
flip_rate_at_alpha_5: 0
verdict_class: epiphenomenal-softmax
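
One way to read flip_rate_at_alpha_5 is as the fraction of steered runs whose selected tool differs from the unsteered run. The entry does not spell out the definition, so the sketch below is an assumption, and the tool names are hypothetical placeholders. Only the 4/4 and 3/4 counts come from the summary above.

```python
from typing import Sequence


def flip_rate(baseline_tools: Sequence[str], steered_tools: Sequence[str]) -> float:
    """Fraction of runs whose selected tool changed under steering (assumed definition)."""
    if len(baseline_tools) != len(steered_tools):
        raise ValueError("need one steered choice per baseline choice")
    if not baseline_tools:
        raise ValueError("need at least one run")
    flips = sum(b != s for b, s in zip(baseline_tools, steered_tools))
    return flips / len(baseline_tools)


# Hypothetical tool names; they stand in for whatever tools the agent exposes.
baseline = ["edit_file", "run_tests", "search", "edit_file"]
single_shot = list(baseline)  # 4/4 runs keep the same tool
continuous = ["edit_file", "run_tests", "search", "open_file"]  # 3/4 keep it

print(flip_rate(baseline, single_shot))  # 0.0
print(flip_rate(baseline, continuous))   # 0.25
```

Under this reading, the reported value of 0 matches the single-shot result, and the continuous result would correspond to 0.25.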

Methodology check

Output of the causality_protocol primitive when it was run on this artifact. See paper-6 for the 3-baseline methodology and the 5-class verdict spec.

verdict: epiphenomenal-softmax
baselines_run
random_direction_random_acts
control_token_normalization: ✓ yes
structural_rigidity_sweep: ✗ no

Artifacts

phase7_micro_pilot.json
phase7_continuous.json

These files live in the linked HF dataset.

Cite

The sha256 below covers the manifest content only. To verify it, re-hash the JSON manifest with manifest_sha256 set to null and keys sorted (sort_keys=True); the digest should match. Zenodo DOI pending.

manifest_sha256: 23bb3f2c303b120e2689f5dbe1c5d55ea40f25e36754f546b493fe52fb30e1d3
Atlas URL: https://openinterp.org/atlas/23bb3f2c30
Raw manifest: https://raw.githubusercontent.com/OpenInterpretability/registry/main/atlas/2026/23bb3f2c30.json
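
The re-hash check described above can be sketched with the Python standard library. This is a minimal sketch, not the registry's own tooling: it assumes the manifest is serialized with json.dumps default separators and UTF-8 encoding, which the entry does not confirm beyond sort_keys=True. The helper names content_sha256 and verify_manifest are hypothetical.

```python
import copy
import hashlib
import json


def content_sha256(manifest: dict) -> str:
    """Hash the manifest with its own digest field nulled out.

    Assumption: default json.dumps separators and UTF-8 bytes; the real
    registry may serialize differently.
    """
    body = copy.deepcopy(manifest)
    body["manifest_sha256"] = None  # serialized as JSON null
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def verify_manifest(manifest: dict) -> bool:
    """True when the stored digest matches the recomputed content hash."""
    return content_sha256(manifest) == manifest.get("manifest_sha256")


# Toy manifest for illustration only; it is not the real atlas entry.
toy = {"verdict_class": "epiphenomenal-softmax", "layer": 43, "manifest_sha256": None}
toy["manifest_sha256"] = content_sha256(toy)
print(verify_manifest(toy))
```

To check the real entry, load the JSON from the Raw manifest URL into a dict and pass it to verify_manifest. If the digest does not match, the serialization assumptions above are the first thing to revisit.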

Reproduce this in your agent

In an agent session attached to your Colab via openinterp-mcp:

from openinterp_mcp.atlas import load_entry

entry = load_entry("23bb3f2c30")
print(entry.methodology_check)

# Re-run the causality protocol against the linked HF artifact:
from openinterp_mcp.judge import reproduce
reproduce(entry, hf_repo_id="caiovicentino1/agent-probe-guard-qwen36-27b")