Adversarial / negativeepiphenomenal-templateQwen/Qwen3.6-27B-Instruct2026-05-11 · by caiovicentino

L55 CoT-Integrity probe is template-locked epiphenomenal (Qwen3.6-27B)

N=240. AUROC 0.91. Bidirectional steering up to α=+200 (>‖residual‖) produces ZERO behavioral change for probe AND random direction. Mechanism: enable_thinking=False chat template injects <think></think> in input tokens — thinking decision is not in residual stream.

Paper Manifest Raw JSON

Numbers

auroc

0.910

n_samples

240

layer

position

mid_think

alpha_tested_max

200

behavioral_change_rate

verdict_class

epiphenomenal-template

Methodology check

Output of the causality_protocol primitive when it was run on this artifact. See paper-6 for the 3-baseline methodology and the 5-class verdict spec.

verdict: epiphenomenal-template
baselines_run: random_direction_random_actsrandom_direction_real_acts
control_token_normalization: ✓ yes
structural_rigidity_sweep: ✓ yes
amplitude_tested_x_norm: 2

Artifacts

phase8_results.jsonphase8_redux_top5.json

Cite

Content-only sha256 below. Verifiable: re-hash the JSON manifest (with manifest_sha256 set to null, sort_keys=True) and you get the same digest. Zenodo DOI pending.

manifest_sha256

a0c01e67c97d6d1beed9539b9259d774c6d67cd8bd7f9dbfafb552572fb48663

Atlas URL

https://openinterp.org/atlas/a0c01e67c9

Raw manifest

https://raw.githubusercontent.com/OpenInterpretability/registry/main/atlas/2026/a0c01e67c9.json

Reproduce this in your agent

In an agent session attached to your Colab via openinterp-mcp:

from openinterp_mcp.atlas import load_entry

entry = load_entry("a0c01e67c9")
print(entry.methodology_check)

# Re-run the causality protocol against the linked HF artifact:
# (no HF artifact attached — replicate from methodology alone)

First result in 10 minutes