Adversarial / negative · Qwen/Qwen3.6-27B-Instruct · 2026-05-11 · by caiovicentino

Multi-probe ensemble OOD walk-back — 0/3 cross-distribution generalization (Qwen3.6-27B)

Cross-distribution test of the multi-probe ensemble on TruthfulQA, StrategyQA, and TriviaQA. The ensemble lift survives on 0 of 3 datasets, with a mean lift of −0.002. The +6.7pp reported in nb45 was a within-distribution effect. The ProbePack universal-middleware framing is publicly walked back. The FG single probe remains valid OOD on factual questions (TriviaQA 0.710).

Numbers

datasets_tested: 3
datasets_generalized: 0
mean_lift: -0.002
fg_single_probe_triviaqa: 0.710
walked_back_claim: ProbePack universal-middleware
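
The summary fields above can be read as a per-dataset comparison followed by an aggregate. The sketch below shows one way such a summary might be computed. It assumes that "lift" means ensemble score minus single-probe score and that a dataset counts as generalized when its lift exceeds a threshold; the function name, the `min_lift` parameter, and the threshold value are hypothetical, and the scores are placeholders, not the nb46 results.

```python
from statistics import mean


def summarize_lift(scores, min_lift=0.0):
    """Summarize ensemble lift over a baseline across datasets.

    scores maps dataset name -> (single_probe_score, ensemble_score).
    Assumption: lift = ensemble - single probe; the entry does not
    state the exact definition or the survival criterion.
    """
    lifts = {name: ens - single for name, (single, ens) in scores.items()}
    generalized = [name for name, lift in lifts.items() if lift > min_lift]
    return {
        "datasets_tested": len(lifts),
        "datasets_generalized": len(generalized),
        "mean_lift": mean(list(lifts.values())),
    }


# Placeholder values for illustration only; NOT the nb46 results.
example_scores = {
    "dataset_a": (0.70, 0.69),
    "dataset_b": (0.60, 0.60),
    "dataset_c": (0.55, 0.56),
}

# min_lift=0.02 is a hypothetical survival threshold.
summary = summarize_lift(example_scores, min_lift=0.02)
print(summary)
```

With a threshold in place, small positive lifts on individual datasets do not count as generalization, which is why a near-zero mean lift and a 0/3 count can appear together.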

Artifacts

nb46_cross_distribution_ensemble

These files live in the linked HF dataset.

Cite

The content-only sha256 is listed below. To verify it, re-hash the JSON manifest with manifest_sha256 set to null and sort_keys=True; the digest should match. A Zenodo DOI is pending.

manifest_sha256: bfd84a5c21c8a80b7078ba6a7c7cc437fb0cf9c123a7fea269439be22369094e
Atlas URL: https://openinterp.org/atlas/bfd84a5c21
Raw manifest: https://raw.githubusercontent.com/OpenInterpretability/registry/main/atlas/2026/bfd84a5c21.json
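
The re-hash check described in this section can be sketched in a few lines of Python. This is a minimal sketch, not the registry's own tooling: it assumes the digest is taken over `json.dumps(..., sort_keys=True)` with Python's default separators and ASCII escaping, encoded as UTF-8. The entry only states the null field and `sort_keys=True`, so the other serialization details are assumptions, and the toy manifest fields are hypothetical.

```python
import hashlib
import json


def content_sha256(manifest: dict) -> str:
    """Hash a manifest with its own digest field nulled out."""
    stripped = dict(manifest)           # shallow copy; input is left untouched
    stripped["manifest_sha256"] = None  # serialized as JSON null
    # Assumption: default separators and ensure_ascii, UTF-8 bytes.
    canonical = json.dumps(stripped, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(manifest: dict) -> bool:
    """Return True if the stored digest matches the recomputed one."""
    return content_sha256(manifest) == manifest.get("manifest_sha256")


# Toy manifest with hypothetical fields, used only to exercise the round trip.
toy = {
    "title": "toy entry",
    "numbers": {"datasets_tested": 3},
    "manifest_sha256": None,
}
toy["manifest_sha256"] = content_sha256(toy)
print(verify(toy))
```

To check the real entry, one would download the raw manifest, parse it with `json.load`, and pass the result to `verify`. A mismatch does not necessarily mean tampering; it may only mean the registry serializes the manifest differently from what is assumed here.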

Reproduce this in your agent

In an agent session attached to your Colab via openinterp-mcp:

from openinterp_mcp.atlas import load_entry

entry = load_entry("bfd84a5c21")
print(entry.methodology_check)

# Re-run the causality protocol against the linked HF artifact:
from openinterp_mcp.judge import reproduce
reproduce(entry, hf_repo_id="caiovicentino1/openinterp-46-cross-distribution-ensemble")