Adversarial / negative · Qwen/Qwen3.6-27B-Instruct · 2026-05-11 · by caiovicentino

Multi-probe ensemble OOD walk-back — 0/3 cross-distribution generalization (Qwen3.6-27B)

Cross-distribution test on TruthfulQA, StrategyQA, and TriviaQA. The multi-probe ensemble generalizes on 0 of 3 datasets, with a mean lift of −0.002. The +6.7pp gain reported in nb45 was a within-distribution effect. The ProbePack universal-middleware framing is publicly walked back. The FG single probe remains valid OOD on factual data (TriviaQA 0.710).
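The summary figures (datasets that generalize, mean lift) can be derived from per-dataset scores roughly as in the sketch below. The entry does not define lift or the survival criterion, so this assumes lift is the ensemble score minus the single-probe score on each dataset, and that a dataset counts as generalizing when its lift exceeds a chosen threshold. The function name, the `min_lift` threshold, the dataset names, and every score are hypothetical placeholders, not results from nb45 or nb46.

```python
from statistics import mean


def summarize_lift(single, ensemble, min_lift=0.0):
    """Return per-dataset lift, mean lift, and the datasets that clear min_lift.

    ASSUMPTION: lift = ensemble score - single-probe score. The Atlas entry
    does not state its exact definition or threshold.
    """
    lifts = {name: ensemble[name] - single[name] for name in single}
    survivors = [name for name, lift in lifts.items() if lift > min_lift]
    return lifts, mean(lifts.values()), survivors


# Hypothetical placeholder scores, for illustration only.
single = {"dataset_a": 0.70, "dataset_b": 0.60, "dataset_c": 0.65}
ensemble = {"dataset_a": 0.69, "dataset_b": 0.61, "dataset_c": 0.64}

lifts, mean_lift, survivors = summarize_lift(single, ensemble, min_lift=0.02)
print(f"{len(survivors)}/{len(lifts)} generalize, mean lift {mean_lift:+.3f}")
```

Keeping the threshold as an explicit parameter makes it visible that "generalizes" is a judgment about effect size, not just the sign of the lift.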


Numbers

datasets_tested: 3
datasets_generalized: 0
mean_lift: -0.002
fg_single_probe_triviaqa: 0.710
walked_back_claim: ProbePack universal-middleware

Artifacts

nb46_cross_distribution_ensemble

These files live in the linked HF dataset.

Cite

The sha256 below covers the manifest content only. To verify it, re-hash the JSON manifest with manifest_sha256 set to null and keys sorted (sort_keys=True); the result should match the digest below. Zenodo DOI pending.

manifest_sha256: bfd84a5c21c8a80b7078ba6a7c7cc437fb0cf9c123a7fea269439be22369094e

Atlas URL: https://openinterp.org/atlas/bfd84a5c21

Raw manifest: https://raw.githubusercontent.com/OpenInterpretability/registry/main/atlas/2026/bfd84a5c21.json
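The content-only hash check described in this section can be sketched with the Python standard library. The entry specifies only that manifest_sha256 is set to null and that sort_keys=True is used; the default `json.dumps` separators, default ASCII escaping, and UTF-8 encoding below are assumptions, and the helper names and demo manifest are hypothetical.

```python
import hashlib
import json


def content_sha256(manifest: dict) -> str:
    """Hash the manifest with its own digest field nulled out.

    ASSUMPTION: default json.dumps separators and escaping, UTF-8 bytes.
    Only sort_keys=True and the nulled field are stated in the entry.
    """
    body = dict(manifest)           # shallow copy; only a top-level key changes
    body["manifest_sha256"] = None  # serialized as JSON null
    canonical = json.dumps(body, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(manifest: dict) -> bool:
    """True when the stored digest matches the recomputed one."""
    return content_sha256(manifest) == manifest.get("manifest_sha256")


# Hypothetical stand-in manifest, not the real Atlas entry.
demo = {"id": "demo-entry", "title": "demo", "manifest_sha256": None}
demo["manifest_sha256"] = content_sha256(demo)
print(verify(demo))
```

To check the real entry, load the JSON from the raw manifest URL and pass the resulting dict to `verify`. If the digest does not match, the serialization assumptions above are the first thing to revisit.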

Reproduce this in your agent

In an agent session attached to your Colab via openinterp-mcp:

from openinterp_mcp.atlas import load_entry

entry = load_entry("bfd84a5c21")
print(entry.methodology_check)

# Re-run the causality protocol against the linked HF artifact:
from openinterp_mcp.judge import reproduce
reproduce(entry, hf_repo_id="caiovicentino1/openinterp-46-cross-distribution-ensemble")
