Stop hallucinations
before they ship.
Activation-probe fabrication detection for open-weights LLMs. ~1 ms scoring latency, AUROC 0.88 cross-task, −88% confident-wrong rate on factual QA. No LLM-judge tax. Open weights. Reproducible.
pip install --upgrade "openinterp[full]"

Requires Python ≥ 3.10 · the `[full]` extras add torch + transformers + scikit-learn
Type any prompt. Watch it score.
Real activation-probe inference on Qwen3.6-27B running on HF ZeroGPU. No mocks, no pre-computed answers. Every prompt is a fresh forward pass. Cold start is ~3–5 min if the Space was idle; subsequent requests run in seconds.
Powered by HF ZeroGPU · H200 partitioned · free for community use.
Detection that actually generalizes.
A linear probe on the residual stream at layer 31 of Qwen3.6-27B. Trained on three benchmark train splits. Held out the fourth. Generalizes strongly to fabrication-style hallucination, fails honestly on unrelated cognitive tasks.
Detection AUROC across 4 public benchmarks
Mitigation impact — abstain mode @ threshold 0.684
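The two headline metrics above are standard computations. As an illustrative sketch (with synthetic scores and labels standing in for real benchmark outputs, and the page's 0.684 threshold), this is how detection AUROC and the abstain-mode reduction in confident-wrong answers are measured:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic probe scores: 1 = fabricated answer, 0 = grounded answer.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = np.clip(labels * 0.4 + rng.normal(0.3, 0.2, size=1000), 0.0, 1.0)

# Detection quality: threshold-free ranking metric.
auc = roc_auc_score(labels, scores)
print(f"AUROC: {auc:.2f}")

# Abstain mode: refuse to answer whenever the probe score crosses
# the calibrated threshold (0.684 on this page).
THRESHOLD = 0.684
abstained = scores >= THRESHOLD

# "Confident-wrong" = fabricated answers the model emitted anyway.
confident_wrong_before = labels.sum()
confident_wrong_after = (labels.astype(bool) & ~abstained).sum()
reduction = 1 - confident_wrong_after / confident_wrong_before
print(f"confident-wrong reduction: {reduction:.0%}")
```

Abstaining trades coverage for safety: raising the threshold abstains less but lets more fabrications through, which is why the threshold is calibrated on held-out data rather than set by hand.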
We tell you what it can't do.
The probe linearly encodes a fabrication-vs-grounded signal. It does not encode “is this a popular misconception?” or “do I know which of 4 MC options is right?” — those are different cognitive tasks. We tested all four honestly and report the results.
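A linear probe in this sense is just logistic regression fit on residual-stream activations. The following sketch uses synthetic 4096-dimensional vectors with a planted "fabrication direction" as a stand-in for real layer-31 activations; the dimensionality and effect size are illustrative assumptions, not measured values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 4096                                 # assumed hidden size, for illustration

# Planted linear structure: fabricated examples are shifted along one direction.
direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=512)    # 1 = fabricated, 0 = grounded
acts = rng.normal(size=(512, d)) + np.outer(labels, direction) * 0.1

# The probe: plain logistic regression on the activation vectors.
probe = LogisticRegression(max_iter=1000).fit(acts[:400], labels[:400])
held_out = roc_auc_score(labels[400:], probe.predict_proba(acts[400:])[:, 1])
print(f"held-out AUROC: {held_out:.2f}")
```

If a concept is linearly encoded, a probe like this finds it; if it is not (as with misconception resistance or MC-option knowledge), held-out AUROC stays near chance no matter how much data you add, which is exactly the failure mode reported for the out-of-scope tasks below.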
Works for
- Generation-fabrication in open QA (HaluEval-style)
- Entity recall failures (SimpleQA-style obscure facts)
- Customer-support fact lookups (company policy, refund rules)
- Medical / legal / internal-docs Q&A grounding
- Sales DB lookups (customer names, account facts)
- Code-assistant API hallucination detection
Out of scope
- Misconception resistance (TruthfulQA-style multiple choice)
- Knowledge-gap MC selection (MMLU-style 4-way pickers)
- Subjective / opinion questions
- Multi-step reasoning failures (math, logic chains)
- Toxic content / prompt injection (use Lakera, Bedrock Guardrails)
- Closed-API models (GPT, Claude, Gemini)
Honest scoping is procurement-friendly. Compliance teams and EU AI Act risk registers accept “tested and excluded” far more readily than “works for everything.”
One forward pass. One scalar. One decision.
No second judge model. No retraining. No fine-tune. The guard rides on top of your existing inference pipeline — captures the residual at layer 31, multiplies by the probe, applies a calibrated threshold.
from openinterp import FabricationGuard
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.6-27B", ...)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B")
guard = FabricationGuard.from_pretrained("Qwen/Qwen3.6-27B").attach(model, tok)
out = guard.generate("Who is Bambale Osby?", mode="abstain")

Versus the competition.
LLM-judge tools and proprietary platforms run an entire second model to score each output. We capture an activation that the model already computed and run a 1-ms matrix multiplication. Different cost structure entirely.
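The cost claim is easy to see in isolation: scoring one generation is a single d-dimensional dot product plus a sigmoid, applied to an activation the forward pass already produced. A minimal sketch (assuming an illustrative hidden size of 4096; the vectors stand in for real probe weights and a captured layer-31 residual):

```python
import time
import numpy as np

d = 4096                                # assumed hidden size, for illustration
rng = np.random.default_rng(0)
probe_w = rng.normal(size=d)            # stands in for trained probe weights
probe_b = -0.5                          # stands in for the probe bias
residual = rng.normal(size=d)           # stands in for the captured residual

start = time.perf_counter()
score = 1 / (1 + np.exp(-(residual @ probe_w + probe_b)))   # sigmoid(w·x + b)
elapsed_ms = (time.perf_counter() - start) * 1e3
print(f"score={score:.3f}  elapsed={elapsed_ms:.3f} ms")
```

An LLM judge must instead run a full second forward pass over the candidate output, so its latency scales with the judge model's size rather than with one vector product.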
Methodology extends Anthropic's persona-vectors approach (Aug 2025, tested on 7-8B models) to Qwen3.6-27B (3-4× larger) with formal cross-task AUROC + mitigation-rate evaluation. Apache-2.0 production-grade implementation, not a proprietary platform.
| Tool | AUROC | Latency | Open weights | Multi-model | License |
|---|---|---|---|---|---|
| Patronus Lynx-70B | 0.87 (HaluBench) | ~100 ms | | | Apache-2.0 |
| Vectara HHEM-2.1 | ~0.85 | 600 ms | | | Apache-2.0 |
| Galileo Luna-2 | proprietary | 152 ms | | | closed |
| Goodfire Ember | proprietary | unknown | | | enterprise-only since Feb 2026 |
| OpenInterp FabricationGuard (us) | 0.88 cross / 0.90 within | ~1 ms | | | Apache-2.0 |
Don't trust us. Reproduce it.
Every number on this page came from a single notebook on a single Colab session. Click below, run it yourself in ~50 minutes for ~R$10 in credits. The probe artifact and the reproducer are both Apache-2.0.
Where we're going.
- Qwen3.6-27B probe
- PyPI ship
- CLI
- 3 modes
- Llama-3.3 probe
- Gemma-2 probe
- Pearson_CE cross-model transfer
- Multi-model API
- vLLM plugin
- SGLang plugin
- LangChain middleware
- OpenTelemetry GenAI
- Hosted Pro tier ($0.02/1M tok)
- Slack/PagerDuty/Datadog
- Audit reports
- Custom probe training
Stop hallucinations before they ship.
Open source. Apache-2.0 with patent grant. No signup. No API key. Just pip install.
pip install --upgrade "openinterp[full]"