FabricationGuard v2
AUROC 0.88 cross-task on SimpleQA · 88% reduction in confident-but-wrong answers
Multi-feature, L2-regularized logistic regression on the residual stream at layer 31 of Qwen3.6-27B. Trained cross-benchmark on the TruthfulQA, HaluEval, and MMLU train splits; evaluated cross-task on held-out SimpleQA. Ships in the openinterp PyPI package, v0.2.0.
Quickstart
Three lines via the openinterp.probebench SDK. score() returns the probe's P(positive_class) for a tensor of activations captured at the layer/position described below.
from openinterp import probebench
probe = probebench.load("openinterp/fabricationguard-qwen36-27b-l31-v2")
score = probe.score(activations)  # → P(positive_class)

ProbeScore — 8 weighted axes
Composite metric in [0, 1]. No single axis can dominate, which we think discourages single-metric optimization. Numbers update as new evaluations land.

ProbeScore — 8-axis breakdown
ProbeBench v0.0.2 · 8 axes weighted to sum 1.0
Visual breakdown of where this probe scores high vs. low across all 8 axes. Polygon area is proportional to total score; the shape reveals which axes carry it (and which it sacrifices).
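The weighting scheme can be sketched in a few lines. The axis names and weights below are placeholders, not the published ProbeBench weighting; only the constraints stated on this page (8 axes, weights summing to 1.0, composite in [0, 1]) are taken from the document.

```python
# Hypothetical axis names and weights -- NOT the published ProbeBench
# weighting. Real constraints from this page: 8 axes, weights sum to
# 1.0, each axis score in [0, 1], so the composite is also in [0, 1].
weights = {
    "auroc": 0.25, "calibration": 0.15, "robustness": 0.15,
    "transfer": 0.15, "efficiency": 0.10, "reproducibility": 0.10,
    "scope": 0.05, "latency": 0.05,
}
axes = {name: 0.5 for name in weights}  # placeholder per-axis scores

assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights sum to 1.0
probe_score = sum(w * axes[name] for name, w in weights.items())
```

Because every weight is well below 1.0, no single axis can push the composite on its own, which is the anti-Goodharting property the page describes.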
Per-task evaluation
Headline AUROC alongside the eval-awareness-corrected and distribution-shift versions for each task. Reproducers run from a single notebook on a single Colab session.
| Task | n | AUROC [95% CI] | Eval-aware corrected | Dist-shift | ECE | FPR@99TPR | Latency | Reproducer |
|---|---|---|---|---|---|---|---|---|
| HaluEval-QA Hallucination · hallucinated | 200 | 0.903 [0.85, 0.95] | 0.840 | 0.710 | 0.080 | 0.040 | 1.0 ms | open |
| SimpleQA Hallucination · hallucinated | 100 | 0.882 [0.83, 0.93] | 0.820 | 0.720 | 0.070 | 0.050 | 1.0 ms | — |
| TruthfulQA-MC1 Hallucination · misconception | 200 | 0.599 [0.51, 0.69] | 0.570 | 0.550 | 0.180 | 0.920 | 1.0 ms | — |
| MMLU Hallucination · incorrect | 500 | 0.444 [0.40, 0.49] | 0.420 | 0.410 | 0.220 | 0.990 | 1.0 ms | — |
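The FPR@99TPR column can be read as: fix the score threshold at the highest value that still recovers at least 99% of true positives, then report the false-positive rate at that threshold. A minimal sketch with toy scores; the function name and data are illustrative, not the ProbeBench implementation:

```python
# Sketch of FPR@99TPR: highest threshold keeping TPR >= target, then
# report the fraction of negatives still scored above that threshold.
def fpr_at_tpr(scores_pos, scores_neg, target_tpr=0.99):
    for t in sorted(set(scores_pos), reverse=True):
        tpr = sum(s >= t for s in scores_pos) / len(scores_pos)
        if tpr >= target_tpr:  # first hit while descending = highest t
            return sum(s >= t for s in scores_neg) / len(scores_neg)

pos = [0.91, 0.84, 0.78]   # toy probe scores on fabricated answers
neg = [0.80, 0.35, 0.22]   # toy probe scores on faithful answers
fpr = fpr_at_tpr(pos, neg) # threshold 0.78 -> FPR = 1/3
```

A high value here (e.g. 0.990 for MMLU) means that catching 99% of fabrications forces the probe to flag nearly all correct answers too.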
Cross-model transfer (Pearson_CE)
Pearson correlation of paired ablation effects across models. ≥ 0.7 suggests the probe's direction is shared; 0.4–0.7 suggests partial transfer; < 0.4 suggests retraining per architecture.
| Source | Target | Pearson_CE | Transfer AUROC | Notes |
|---|---|---|---|---|
| Qwen3.6-27B | Llama-3.3-70B | 0.420 | 0.710 | Cross-architecture; Pearson_CE in the 0.4–0.7 band indicates partial transfer. Retraining on the target model improves performance. |
| Qwen3.6-27B | Gemma-3-27B | 0.380 | 0.680 | |
| Qwen3.6-27B | Qwen3.6-27B | 1.000 | 0.882 | Same model — identity baseline. |
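As a rough sketch of the statistic behind the table: Pearson_CE is, on our reading, a plain Pearson correlation computed over paired per-example ablation effects from the source and target models. The helper below is generic Pearson correlation, with the identity baseline from the last row as a sanity check:

```python
import math

def pearson(xs, ys):
    # Plain Pearson correlation over paired values; for Pearson_CE the
    # pairs would be ablation effects on the same examples in two models
    # (that pairing is an assumption, not stated in the table).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

effects = [0.1, 0.4, 0.2, 0.8, 0.5]  # toy per-example ablation effects
identity = pearson(effects, effects)  # same model vs itself -> 1.0
```

The identity row (Qwen3.6-27B → Qwen3.6-27B, Pearson_CE = 1.000) is exactly this degenerate case: a series correlated with itself.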
Architecture & artifact
Base model
- Family: Qwen
- Params: 27B
- Architecture: Hybrid GDN + Gated-Attn (dense, reasoning)
- Layers: 64
- d_model: 5120
- Weights license: Apache-2.0
Artifact
- Params: 312,000
- Size: 1.2 MB
- License: Apache-2.0
- Released: 2026-04-27
- sha256: fb8c2a4e1f9d…9b8a7f6e
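Beyond the CLI, the digest can be checked with the standard library's hashlib. This sketch hashes a throwaway file; the helper name is illustrative, and with the real artifact you would compare the result against the published sha256 (truncated above):

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk=1 << 20):
    # Stream the file in 1 MiB chunks so even large artifacts hash
    # without being loaded fully into memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Demo on a throwaway file; substitute the downloaded probe artifact.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"abc")
    path = f.name
digest = sha256_of(path)
os.unlink(path)
```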
openinterp probebench verify openinterp/fabricationguard-qwen36-27b-l31-v2

Reproduce
Three entry points. Pick one. Each path lands at the same numbers.
git clone https://github.com/OpenInterpretability/notebooks.git
pip install openinterp
openinterp probebench reproduce openinterp/fabricationguard-qwen36-27b-l31-v2
Honest scope
Limits derived from the evaluation data above. We think these are the honest constraints; revisions land as more reproducers do.
- Trained / evaluated on: HaluEval-QA, SimpleQA, TruthfulQA-MC1, MMLU. Performance outside these tasks is unmeasured here.
- Eval-aware AUROC drop: −0.045 AUROC averaged across tasks (uncorrected − corrected for eval-awareness confound, arXiv:2509.13333).
- Distribution-shift drop: −0.110 AUROC averaged across tasks (in-distribution − long-context / OOD).
- Cross-model fit (mean Pearson_CE over cross-architecture targets): 0.400 — below 0.4 suggests retraining per architecture; above 0.7 suggests the direction is shared.
- Probe attaches at: L31 · end_question of Qwen3.6-27B. Other layers / positions are out of scope unless re-trained.
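The last bullet pins the attachment point. As a toy illustration of capturing activations at a fixed layer/position with a PyTorch forward hook — using a stand-in nn.Linear rather than Qwen3.6-27B, and names/width that are our own, not the openinterp API:

```python
import torch
import torch.nn as nn

D_MODEL = 8  # toy width; the real probe expects d_model = 5120 at layer 31
captured = {}

def grab_residual(module, inputs, output):
    # Keep only the final (end_question-style) token position.
    captured["acts"] = output[:, -1, :].detach()

block = nn.Linear(D_MODEL, D_MODEL)  # stand-in for a transformer block
block.register_forward_hook(grab_residual)

x = torch.randn(2, 5, D_MODEL)  # (batch, seq, d_model)
_ = block(x)
activations = captured["acts"]  # (batch, d_model) tensor for probe.score
```

Attaching at any other layer or token position would feed the probe a distribution it was never trained on, which is why the scope bullet marks those uses as out of scope.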
ProbeBench v0.0.1 · 5 probes registered · 11 evaluations · 7 transfer measurements. Schema and weights subject to revision.