FabricationGuard v2
AUROC 0.88 cross-task on SimpleQA · 88% reduction in confident-but-wrong answers
Multi-feature, L2-regularized logistic regression on the residual stream at layer 31 of Qwen3.6-27B. Trained cross-benchmark on the TruthfulQA, HaluEval, and MMLU train splits; evaluated cross-task on held-out SimpleQA. Ships in the openinterp PyPI package, v0.2.0.
Quickstart
Three lines via the openinterp.probebench SDK. score() returns the probe's P(positive_class) for a tensor of activations captured at the layer/position described below.
from openinterp import probebench
probe = probebench.load("openinterp/fabricationguard-qwen36-27b-l31-v2")
score = probe.score(activations)  # → P(positive_class)

ProbeScore — 8 weighted axes
Composite metric in [0, 1]. No single axis can dominate, which we think discourages single-metric optimization. Numbers update as new evaluations land.

ProbeScore — 8-axis breakdown
ProbeBench v0.0.2 · 8 axes weighted to sum 1.0
Visual breakdown of where this probe scores high vs. low across all 8 axes. Polygon area is proportional to total score; the shape reveals which axes carry it (and which it sacrifices).
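The weighting scheme can be sketched in a few lines. The axis names and weights below are placeholders, not the published ProbeBench weighting; only the constraints stated on this page (8 axes, weights summing to 1.0, composite in [0, 1]) are taken from the document.

```python
# Hypothetical axis names and weights -- NOT the published ProbeBench
# weighting. Real constraints from this page: 8 axes, weights sum to
# 1.0, each axis score in [0, 1], so the composite is also in [0, 1].
weights = {
    "auroc": 0.25, "calibration": 0.15, "robustness": 0.15,
    "transfer": 0.15, "efficiency": 0.10, "reproducibility": 0.10,
    "scope": 0.05, "latency": 0.05,
}
axes = {name: 0.5 for name in weights}  # placeholder per-axis scores

assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights sum to 1.0
probe_score = sum(w * axes[name] for name, w in weights.items())
```

Because every weight is well below 1.0, no single axis can push the composite on its own, which is the anti-Goodharting property the page describes.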
Per-task evaluation
Headline AUROC alongside the eval-awareness-corrected and distribution-shift versions for each task. Reproducers run from a single notebook on a single Colab session.
| Task | n | AUROC [95% CI] | Eval-aware corrected | Dist-shift | ECE | FPR@99TPR | Latency | Reproducer |
|---|---|---|---|---|---|---|---|---|
| HaluEval-QA Hallucination · hallucinated | 200 | 0.903 [0.85, 0.95] | 0.840 | 0.710 | 0.080 | 0.040 | 1.0 ms | open |
| SimpleQA Hallucination · hallucinated | 100 | 0.882 [0.83, 0.93] | 0.820 | 0.720 | 0.070 | 0.050 | 1.0 ms | — |
| TruthfulQA-MC1 Hallucination · misconception | 200 | 0.599 [0.51, 0.69] | 0.570 | 0.550 | 0.180 | 0.920 | 1.0 ms | — |
| MMLU Hallucination · incorrect | 500 | 0.444 [0.40, 0.49] | 0.420 | 0.410 | 0.220 | 0.990 | 1.0 ms | — |
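The FPR@99TPR column can be read as: fix the score threshold at the highest value that still recovers at least 99% of true positives, then report the false-positive rate at that threshold. A minimal sketch with toy scores; the function name and data are illustrative, not the ProbeBench implementation:

```python
# Sketch of FPR@99TPR: highest threshold keeping TPR >= target, then
# report the fraction of negatives still scored above that threshold.
def fpr_at_tpr(scores_pos, scores_neg, target_tpr=0.99):
    for t in sorted(set(scores_pos), reverse=True):
        tpr = sum(s >= t for s in scores_pos) / len(scores_pos)
        if tpr >= target_tpr:  # first hit while descending = highest t
            return sum(s >= t for s in scores_neg) / len(scores_neg)

pos = [0.91, 0.84, 0.78]   # toy probe scores on fabricated answers
neg = [0.80, 0.35, 0.22]   # toy probe scores on faithful answers
fpr = fpr_at_tpr(pos, neg) # threshold 0.78 -> FPR = 1/3
```

A high value here (e.g. 0.990 for MMLU) means that catching 99% of fabrications forces the probe to flag nearly all correct answers too.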
Cross-model transfer (Pearson_CE)
Pearson correlation of paired ablation effects across models. ≥ 0.7 suggests the probe's direction is shared; 0.4–0.7 suggests partial transfer; < 0.4 suggests retraining per architecture.
| Source | Target | Pearson_CE | Transfer AUROC | Notes |
|---|---|---|---|---|
| Qwen3.6-27B | Llama-3.3-70B | 0.420 | 0.710 | Cross-architecture; Pearson_CE in the 0.4–0.7 band indicates partial transfer. Retraining on the target model improves performance. |
| Qwen3.6-27B | Gemma-3-27B | 0.380 | 0.680 | |
| Qwen3.6-27B | Qwen3.6-27B | 1.000 | 0.882 | Same model — identity baseline. |
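As a rough sketch of the statistic behind the table: Pearson_CE is, on our reading, a plain Pearson correlation computed over paired per-example ablation effects from the source and target models. The helper below is generic Pearson correlation, with the identity baseline from the last row as a sanity check:

```python
import math

def pearson(xs, ys):
    # Plain Pearson correlation over paired values; for Pearson_CE the
    # pairs would be ablation effects on the same examples in two models
    # (that pairing is an assumption, not stated in the table).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

effects = [0.1, 0.4, 0.2, 0.8, 0.5]  # toy per-example ablation effects
identity = pearson(effects, effects)  # same model vs itself -> 1.0
```

The identity row (Qwen3.6-27B → Qwen3.6-27B, Pearson_CE = 1.000) is exactly this degenerate case: a series correlated with itself.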
Architecture & artifact
Base model
- Family: Qwen
- Params: 27B
- Architecture: Hybrid GDN + Gated-Attn (dense, reasoning)
- Layers: 64
- d_model: 5120
- Weights license: Apache-2.0
Artifact
- Params: 312,000
- Size: 1.2 MB
- License: Apache-2.0
- Released: 2026-04-27
- sha256: fb8c2a4e1f9d…9b8a7f6e
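Beyond the CLI, the digest can be checked with the standard library's hashlib. This sketch hashes a throwaway file; the helper name is illustrative, and with the real artifact you would compare the result against the published sha256 (truncated above):

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk=1 << 20):
    # Stream the file in 1 MiB chunks so even large artifacts hash
    # without being loaded fully into memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Demo on a throwaway file; substitute the downloaded probe artifact.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"abc")
    path = f.name
digest = sha256_of(path)
os.unlink(path)
```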
openinterp probebench verify openinterp/fabricationguard-qwen36-27b-l31-v2

Reproduce
Three entry points. Pick one. Each path lands at the same numbers.
git clone https://github.com/OpenInterpretability/notebooks.git
pip install openinterp
openinterp probebench reproduce openinterp/fabricationguard-qwen36-27b-l31-v2
Honest scope
Limits derived from the evaluation data above. We think these are the honest constraints; revisions land as more reproducers do.
- Trained / evaluated on: HaluEval-QA, SimpleQA, TruthfulQA-MC1, MMLU. Performance outside these tasks is unmeasured here.
- Eval-aware AUROC drop: −0.045 AUROC averaged across tasks (uncorrected − corrected for eval-awareness confound, arXiv:2509.13333).
- Distribution-shift drop: −0.110 AUROC averaged across tasks (in-distribution − long-context / OOD).
- Cross-model fit (mean Pearson_CE over cross-architecture targets): 0.400 — below 0.4 suggests retraining per architecture; above 0.7 suggests the direction is shared.
- Probe attaches at: L31 · end_question of Qwen3.6-27B. Other layers / positions are out of scope unless re-trained.
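The last bullet pins the attachment point. As a toy illustration of capturing activations at a fixed layer/position with a PyTorch forward hook — using a stand-in nn.Linear rather than Qwen3.6-27B, and names/width that are our own, not the openinterp API:

```python
import torch
import torch.nn as nn

D_MODEL = 8  # toy width; the real probe expects d_model = 5120 at layer 31
captured = {}

def grab_residual(module, inputs, output):
    # Keep only the final (end_question-style) token position.
    captured["acts"] = output[:, -1, :].detach()

block = nn.Linear(D_MODEL, D_MODEL)  # stand-in for a transformer block
block.register_forward_hook(grab_residual)

x = torch.randn(2, 5, D_MODEL)  # (batch, seq, d_model)
_ = block(x)
activations = captured["acts"]  # (batch, d_model) tensor for probe.score
```

Attaching at any other layer or token position would feed the probe a distribution it was never trained on, which is why the scope bullet marks those uses as out of scope.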
ProbeBench v0.0.1 · 5 probes registered · 11 evaluations · 7 transfer measurements. Schema and weights subject to revision.