v0.2.0 live · Apache-2.0

Stop hallucinations
before they ship.

Activation-probe fabrication detection for open-weights LLMs. ~1 ms scoring latency, AUROC 0.88 cross-task, −88% confident-wrong rate on factual QA. No LLM-judge tax. Open weights. Reproducible.

Install from PyPI Try the live demo View on GitHub

$pip install --upgrade "openinterp[full]"

requires Python ≥ 3.10 · adds torch + transformers + scikit-learn for `[full]` extras

Cross-task AUROC

0.88

held-out SimpleQA

Within-bench AUROC

0.90

HaluEval-QA

Confident-wrong drop

−88%

SimpleQA mitigation

Latency / score call

~1 ms

single matrix mul

Registered on ProbeBench v0.0.1

FabricationGuard is the reference probe in the hallucination category — currently #3 globally with a ProbeScore of 0.662.

View the full DNA card

live · real Qwen3.6-27B

Type any prompt. Watch it score.

Real activation-probe inference on Qwen3.6-27B running on HF ZeroGPU. No mocks, no pre-computed answers. Every prompt is a fresh forward pass. Cold start is ~3–5 min if the Space was idle; subsequent requests run in seconds.

Open in HF

Detection that actually generalizes.

A linear probe on the residual stream at layer 31 of Qwen3.6-27B. Trained on three benchmark train splits. Held out the fourth. Generalizes strongly to fabrication-style hallucination, fails honestly on unrelated cognitive tasks.

Detection AUROC across 4 public benchmarks

probe layer 31 · train/test split 80/20

Benchmark

Single SAE feat

LR within-bench

LR cross-bench (held-out)

Pass

TruthfulQA

misconception

0.556

0.536

0.599

HaluEval

fabrication

0.500

0.903

0.619

SimpleQA

entity-fabrication

0.494

0.706

0.882

MMLU

knowledge MC

0.544

0.631

0.444

Mitigation impact — abstain mode @ threshold 0.684

TruthfulQA

65%→32.5%−50% wrong

still wrong abstained

HaluEval

57.5%→27.5%−52% wrong

still wrong abstained

SimpleQA

85%→10%−88% wrong

still wrong abstained

We tell you what it can't do.

The probe linearly encodes a fabrication-vs-grounded signal. It does not encode “is this a popular misconception?” or “do I know which of 4 MC options is right?” — those are different cognitive tasks. We tested all four honestly and report the results.

Works for

Generation-fabrication in open QA (HaluEval-style)
Entity recall failures (SimpleQA-style obscure facts)
Customer-support fact lookups (company policy, refund rules)
Medical / legal / internal-docs Q&A grounding
Sales DB lookups (customer names, account facts)
Code-assistant API hallucination detection

Out of scope

Misconception resistance (TruthfulQA-style multiple choice)
Knowledge-gap MC selection (MMLU-style 4-way pickers)
Subjective / opinion questions
Multi-step reasoning failures (math, logic chains)
Toxic content / prompt injection (use Lakera, Bedrock Guardrails)
Closed-API models (GPT, Claude, Gemini)

Honest scoping is procurement-friendly. Compliance teams and EU AI Act risk registers accept “tested and excluded” far more readily than “works for everything.”

One forward pass. One scalar. One decision.

No second judge model. No retraining. No fine-tune. The guard rides on top of your existing inference pipeline — captures the residual at layer 31, multiplies by the probe, applies a calibrated threshold.

Prompt arrives

User query enters your model the normal way.

Forward pass

Single forward through Qwen3.6-27B. Hook captures residual at L31, last token.

Probe applied

StandardScaler + L2 LogisticRegression. Single matmul, ~1 ms on CPU.

Score in [0,1]

Higher = higher fabrication risk. Threshold calibrated cross-bench (0.684).

Decision

detect / warn / abstain — ship the response or replace with uncertainty.

5-line integration

from openinterp import FabricationGuard
from transformers import AutoModelForImageTextToText, AutoTokenizer

model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.6-27B", ...)
tok   = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B")

guard = FabricationGuard.from_pretrained("Qwen/Qwen3.6-27B").attach(model, tok)
out   = guard.generate("Who is Bambale Osby?", mode="abstain")

Versus the competition.

LLM-judge tools and proprietary platforms run an entire second model to score each output. We capture an activation that the model already computed and run a 1-ms matrix multiplication. Different cost structure entirely.

Methodology extends Anthropic's persona-vectors approach (Aug 2025, tested on 7-8B models) to Qwen3.6-27B (3-4× larger) with formal cross-task AUROC + mitigation-rate evaluation. Apache-2.0 production-grade implementation, not a proprietary platform.

Tool	AUROC	Latency	License
Patronus Lynx-70B	0.87 (HaluBench)	~100 ms	Apache-2.0
Vectara HHEM-2.1	~0.85	600 ms	Apache-2.0
Galileo Luna-2	proprietary	152 ms	closed
Goodfire Ember	proprietary	unknown	enterprise-only since Feb 2026
usOpenInterp FabricationGuard	0.88 cross / 0.90 within	~1 ms	Apache-2.0

¹ Patronus Lynx benchmark on HaluBench² Vectara HHEM-2.1 measured on RTX 3090³ Goodfire Ember pivoted to enterprise-only Feb 2026

Don't trust us. Reproduce it.

Every number on this page came from a single notebook on a single Colab session. Click below, run it yourself in ~50 minutes for ~R$10 in credits. The probe artifact and the reproducer are both Apache-2.0.

Notebook 31 — the probe reproducer

The probe behind the shipped SDK. AUROC 0.88 cross-task, mitigation analysis, full reproducer. ~50 min on RTX 6000 Pro · ~R$10.

▶ Open in Colab source on github · 31_hallucinationguard_v2_linear_probe.ipynb

Probe artifact

probe.joblib + meta.json + verdict.json + headline figure. 1.4 MB total. Apache-2.0.

huggingface.co · datasets · FabricationGuard

Source SAE

Qwen3.6-27B paper-grade SAE at L11 / L31 / L55. The only public set on hybrid GDN architecture.

huggingface.co · qwen36-27b-sae-papergrade