Adversarial / negative · epiphenomenal-softmax · Qwen/Qwen3.6-27B-Instruct · 2026-05-11 · by caiovicentino

L43 pre_tool probe is softmax-temp epiphenomenal (Qwen3.6-27B SWE-bench)

Three independent tests converge on the same verdict for the L43 pre_tool probe direction. (1) Log-prob proxy with control-token normalization: Δrel ≈ 0. (2) Single-shot α=+5: 4/4 failing cases select the same tool. (3) Continuous α=+5: 3/4 keep the same tool. The probe detects the signal but does not act as a lever on tool choice.
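The two α=+5 counts above can be read as flip rates. The sketch below is a minimal illustration, assuming a "flip" means the tool selected with α=+5 applied differs from the tool selected without it. The function name `flip_rate` and the tool names are hypothetical; only the counts (4/4 and 3/4 unchanged) come from this entry.

```python
from typing import Sequence


def flip_rate(baseline_tools: Sequence[str], intervened_tools: Sequence[str]) -> float:
    """Fraction of cases where the intervened run picks a different tool."""
    if len(baseline_tools) != len(intervened_tools):
        raise ValueError("both runs must cover the same cases")
    if not baseline_tools:
        raise ValueError("need at least one case")
    flips = sum(b != s for b, s in zip(baseline_tools, intervened_tools))
    return flips / len(baseline_tools)


# Hypothetical tool names for four failing cases (not taken from the artifacts).
baseline = ["edit_file", "run_tests", "search", "edit_file"]

# Single-shot alpha=+5: 4/4 select the same tool as before.
single_shot = list(baseline)

# Continuous alpha=+5: 3/4 keep the same tool, so one case changes.
continuous = ["edit_file", "run_tests", "search", "run_tests"]

single_shot_rate = flip_rate(baseline, single_shot)
continuous_rate = flip_rate(baseline, continuous)
print(single_shot_rate, continuous_rate)  # 0.0 0.25
```

With only four cases per condition, a single changed selection moves the rate by 0.25, so these figures are coarse.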


Numbers

auroc: 0.830
n_samples:
layer:
position: pre_tool
delta_rel_max: -0.046
flip_rate_at_alpha_5:
verdict_class: epiphenomenal-softmax

Methodology check

Output of the causality_protocol primitive when it was run on this artifact. See paper-6 for the 3-baseline methodology and the 5-class verdict spec.

verdict: epiphenomenal-softmax
baselines_run: random_direction_random_acts
control_token_normalization: ✓ yes
structural_rigidity_sweep: ✗ no
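One possible reading of the "softmax-temp" label in the title, offered here as an assumption rather than as the definition from the 5-class verdict spec, is that moving along the probe direction behaves like a change in softmax temperature: it changes how confident the distribution over tools is without changing which tool ranks first. The toy below illustrates only that general property of temperature scaling. The logits are hypothetical and nothing in it is taken from the artifacts.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical logits for three candidate tools.
tool_logits = [2.0, 1.0, 0.5]

sharp = softmax(tool_logits, temperature=0.5)  # more confident
flat = softmax(tool_logits, temperature=2.0)   # less confident

# The top-ranked tool is the same in both cases; only its probability moves.
print(argmax(sharp), argmax(flat))
print(sharp[0] > flat[0])
```

Under this reading, a probe could track the confidence shift well (hence a usable AUROC) while an intervention along the same direction leaves the selected tool unchanged.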

Artifacts

phase7_micro_pilot.json
phase7_continuous.json

These files live in the linked HF dataset.

Cite

The digest below is a content-only SHA-256. To verify it, set manifest_sha256 to null in the JSON manifest, serialize with sort_keys=True, and re-hash; the result should match the digest. Zenodo DOI pending.

manifest_sha256

23bb3f2c303b120e2689f5dbe1c5d55ea40f25e36754f546b493fe52fb30e1d3
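The verification step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the entry specifies only that manifest_sha256 is set to null and that sort_keys=True is used, so the remaining serialization details here (default `json.dumps` separators, default `ensure_ascii`, UTF-8 encoding) are guesses and may need adjusting to reproduce the published digest. The helper names and the toy manifest are hypothetical.

```python
import copy
import hashlib
import json


def content_sha256(manifest: dict) -> str:
    """Hash the manifest with its own digest field nulled out."""
    stripped = copy.deepcopy(manifest)
    stripped["manifest_sha256"] = None  # serialized as JSON null
    # Assumption: default separators and ASCII escaping; only sort_keys is documented.
    canonical = json.dumps(stripped, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(manifest: dict) -> bool:
    """True when the stored digest matches the recomputed content hash."""
    return content_sha256(manifest) == manifest.get("manifest_sha256")


# Toy manifest for illustration only; it is not the real registry entry.
toy = {"verdict_class": "epiphenomenal-softmax", "auroc": 0.83, "manifest_sha256": None}
toy["manifest_sha256"] = content_sha256(toy)

tampered = dict(toy, auroc=0.9)

print(verify(toy))       # True
print(verify(tampered))  # False
```

Nulling the digest field before hashing is what makes the hash "content-only": the stored digest never feeds into its own computation.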

Atlas URL

https://openinterp.org/atlas/23bb3f2c30

Raw manifest

https://raw.githubusercontent.com/OpenInterpretability/registry/main/atlas/2026/23bb3f2c30.json

Reproduce this in your agent

In an agent session attached to your Colab via openinterp-mcp:

from openinterp_mcp.atlas import load_entry

entry = load_entry("23bb3f2c30")
print(entry.methodology_check)

# Re-run the causality protocol against the linked HF artifact:
from openinterp_mcp.judge import reproduce
reproduce(entry, hf_repo_id="caiovicentino1/agent-probe-guard-qwen36-27b")
