ProbeBench Application · Medical AI

Fabrication detection for clinical LLM deployments

The 2026 medical AI category is moving fast: CZI Biohub committed $500M to AI biology on Apr 29, 2026, and EU AI Act Article 14 takes effect in August 2026. The bottleneck is not modeling; it is knowing when the model is making things up. FabricationGuard is one open standard for that signal: ~1 ms scoring latency, AUROC 0.88 cross-task, Apache-2.0.

Why this category is urgent in 2026

$500M just committed
CZI Biohub announced $500M in AI-biology funding on Apr 29, 2026; EvolutionaryScale was acquired; CZI's operating budget stands at $1B/year. Capital is flowing into AI for medical use cases.
EU AI Act enforcement
Article 14 (human oversight) takes effect in August 2026 for high-risk AI, including medical. Activation-probe abstention fits the regulatory pattern: a technical mechanism for operator intervention.
Hallucination is uniquely costly here
In open QA a fabricated answer is annoying; in medical AI it is a clinical incident. That asymmetric cost means probe-based abstention needs very different threshold tuning than in consumer applications.

What FabricationGuard does (in 60 seconds)

A linear logistic-regression probe on the residual stream at layer 31 of Qwen3.6-27B, scored at the end of the user question and trained cross-bench on TruthfulQA + HaluEval + MMLU + SimpleQA. It captures the "model is fabricating" signal that lives in the residual stream before generation begins, so abstention happens at zero token cost.
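Mechanically, scoring is a single dot product plus a sigmoid over one residual-stream vector. A minimal sketch in pure Python, with toy 4-dimensional weights standing in for the real layer-31 probe weights (the toy values are illustrative, not the shipped probe):

```python
import math

def fabrication_score(hidden, weights, bias):
    """Logistic-regression probe: sigmoid(w . h + b) over one residual vector."""
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy 4-dimensional example; the real probe reads the layer-31 residual stream.
score = fabrication_score(
    hidden=[0.2, -1.1, 0.7, 0.3],   # activations at the end-of-question token
    weights=[0.5, -0.4, 1.2, 0.0],  # learned probe weights (toy values)
    bias=-0.1,
)
assert 0.0 < score < 1.0  # a probability; abstain when it exceeds the threshold
```

Because the score is read before the first output token, the abstention decision costs no generation at all.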

  • HaluEval-QA: AUROC 0.903 (n=200)
  • SimpleQA: AUROC 0.882 (n=100)
  • TruthfulQA-MC1: AUROC 0.599 (n=200)
  • MMLU: AUROC 0.444 (n=500)

Mitigation analysis (notebook 31): in abstain mode at threshold 0.684, the confidently-wrong response rate drops by 88% on SimpleQA, 52% on HaluEval, and 50% on TruthfulQA. MMLU is "capability control" and out of scope by design (see honest scope below).
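The confidently-wrong metric behind those reductions can be reproduced on any labeled sample. A sketch with illustrative data (not the benchmark itself); abstained items count as safe rather than wrong:

```python
def confidently_wrong_rate(items, threshold=None):
    """Fraction of items answered wrongly; abstained items count as safe.

    items: (probe_score, answer_correct) pairs. With a threshold set, any
    item whose score exceeds it is abstained on rather than answered."""
    wrong = sum(
        1 for score, correct in items
        if not correct and (threshold is None or score <= threshold)
    )
    return wrong / len(items)

sample = [(0.9, False), (0.8, False), (0.3, False), (0.2, True), (0.1, True)]
base = confidently_wrong_rate(sample)                        # answers everything
mitigated = confidently_wrong_rate(sample, threshold=0.684)  # abstains on 0.9, 0.8
```

Here the rate falls from 0.6 to 0.2 because the two worst wrong answers sit above the threshold; the benchmark reductions above are the same computation at scale.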

Where FabricationGuard fits in clinical workflows

Clinical Q&A grounding

Patient-facing or clinician-facing chat: asked "What are the symptoms of long QT syndrome?", the model can fabricate a plausible-sounding but wrong list. FabricationGuard scores the residual stream at end-of-question and abstains when the fabrication probability exceeds the threshold.

HaluEval-QA AUROC 0.903
SimpleQA cross-task AUROC 0.882
Caveat: Calibrated for general open-domain factual QA. Medical-specific recalibration is recommended, with a lower threshold to reflect the asymmetric cost: medical false-confidence is more dangerous than over-abstention.
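One way to operationalize that recalibration is to pick the threshold minimizing expected cost on a labeled medical calibration set. A sketch under an assumed 50:1 false-confidence-to-over-abstention cost ratio (the ratio is a placeholder for your own cost model, not a recommendation):

```python
def pick_threshold(labeled, cost_fc=50.0, cost_abstain=1.0):
    """Choose the abstention threshold minimizing expected cost.

    labeled: (probe_score, is_fabrication) pairs.
    A fabrication answered at or below the threshold costs cost_fc
    (false confidence); any abstention above it costs cost_abstain."""
    candidates = sorted({s for s, _ in labeled} | {0.0, 1.0})

    def cost(t):
        return sum(
            cost_fc if fab and s <= t else cost_abstain if s > t else 0.0
            for s, fab in labeled
        )

    return min(candidates, key=cost)

data = [(0.9, True), (0.7, True), (0.6, False), (0.4, False), (0.2, False)]
t = pick_threshold(data)
```

A larger false-confidence cost pushes the chosen threshold down, abstaining more aggressively, which is the direction the caveat above argues for in medical deployments.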

Drug-interaction lookup

The model summarizes drug interactions for clinical reference. Rare interactions are long-tail factoids, and long-tail factoids are fabrication-prone. The probe surfaces cases where the model is generating plausible text without grounding.

SimpleQA: rare-entity factoid recall, where model fabrication probability is high
Caveat: This does not replace pharmacovigilance database lookup. Use it as a second-line confidence layer, not a primary source.

Evidence-grounded clinical summarization

The model summarizes a patient chart, study abstract, or guideline document. Fabrication happens when the model adds claims absent from the source; the probe captures this even before the model finishes generating.

HaluEval-QA: open-ended generation hallucination, directly applicable
Caveat: Best paired with retrieval-grounded NLI (e.g., HaluGate-style post-hoc verification) for highest reliability. The two methods are complementary; see /probebench/comparisons (in development).
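The complementary pairing can be as simple as sequential gating. A sketch of one possible policy (the `nli_entailed` flag stands in for whatever verdict your post-hoc verifier returns; this is not HaluGate's actual API):

```python
def grounded_verdict(probe_score, nli_entailed, threshold=0.684):
    """Complementary gating: the probe catches closed-book fabrication before
    generation; NLI catches claims not entailed by the retrieved source."""
    if probe_score > threshold:
        return "abstain"           # probe fired pre-generation, zero token cost
    if not nli_entailed:
        return "flag_for_review"   # NLI found an unsupported claim post hoc
    return "deliver"

assert grounded_verdict(0.9, True) == "abstain"
assert grounded_verdict(0.1, False) == "flag_for_review"
assert grounded_verdict(0.1, True) == "deliver"
```

The ordering matters: the probe is nearly free (~1 ms) and runs first, so NLI verification is only paid for on responses the probe let through.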

Biological / pharmaceutical reasoning

AI suggesting protein-target interactions, drug-repurposing candidates, or disease pathways. The model can fabricate relationships not present in its training data, which is clinically critical to flag.

Out-of-distribution generation: the probe surfaces low-confidence or fabricated assertions
Caveat: EvolutionaryScale-style biological reasoning was NOT in our training distribution. Recalibration on a domain-specific test set is necessary before clinical use.

Honest scope — what is not validated yet

We do not currently have medical-domain test sets in the registry. The probe is trained on general open-domain factual QA. Performance on medical-specific benchmarks (PubMedQA, MedQA, MIMIC-IV summarization) is not yet measured. Anyone deploying this in a clinical setting must:

  • Recalibrate threshold on a held-out medical test set with annotated halu/grounded labels (we recommend 200+ samples).
  • Estimate domain-shift AUROC: out-of-distribution AUROC for medical context is likely lower than the 0.88 cross-task headline.
  • Pair with retrieval grounding when ground truth exists (RAG, drug databases, guideline DB). FabricationGuard handles closed-book; HaluGate-style NLI handles grounded. The two are complementary.
  • Keep a human in the loop above any abstention threshold. Probe-based abstention is a filter, not a replacement for clinical review, especially in the EU AI Act Article 14 sense.
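For the domain-shift AUROC estimate in the second step, a stratified bootstrap over the 200+ labeled samples yields a confidence interval alongside the point estimate. A stdlib-only sketch on synthetic data (the data is illustrative, not a medical benchmark):

```python
import random

def auroc(pairs):
    """Probability a random fabricated item outscores a random grounded one."""
    pos = [s for s, fab in pairs if fab]
    neg = [s for s, fab in pairs if not fab]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(pairs, n_boot=200, seed=0):
    """Stratified bootstrap 95% CI: resample within each label class."""
    rng = random.Random(seed)
    pos = [p for p in pairs if p[1]]
    neg = [p for p in pairs if not p[1]]
    stats = sorted(
        auroc([rng.choice(pos) for _ in pos] + [rng.choice(neg) for _ in neg])
        for _ in range(n_boot)
    )
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]

# Synthetic held-out set: fabricated items tend to score higher than grounded.
rng = random.Random(1)
data = [(rng.uniform(0.4, 1.0), True) for _ in range(100)]
data += [(rng.uniform(0.0, 0.6), False) for _ in range(100)]
point = auroc(data)
lo, hi = bootstrap_ci(data)
```

A wide interval on your medical set is itself a finding: it means 200 samples are not enough to certify the probe for that specialty.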

If your team has annotated medical hallucination data and wants to run the probe to produce domain-specific numbers, we will publish those evaluations on ProbeBench (with citation to your team) at no cost.

Pilot framework — for medical-domain partners

We are looking for 2-3 partners (clinical research orgs, medical AI startups, hospital systems with ML teams) to run a 30-day pilot. Outcome: domain-validated AUROC numbers on your test set, published on ProbeBench with citation.

What we provide

  • ✓ FabricationGuard probe (Apache-2.0) integrated into your inference pipeline
  • ✓ Recalibration on your held-out test set (200-500 examples)
  • ✓ Threshold tuning by your asymmetric cost model (false-confidence vs over-abstention)
  • ✓ ECE / FPR@99TPR / AUROC with bootstrap CIs
  • ✓ Cross-model Pearson_CE if you use a non-Qwen model
  • ✓ Public publication of results on ProbeBench (or private if NDA)
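Of the metrics in that deliverables list, ECE is the least standardized, so for concreteness here is an equal-width-binning variant (the binning scheme is our assumption; the source does not pin one down):

```python
def ece(pairs, n_bins=10):
    """Expected calibration error with equal-width bins.

    pairs: (probe_score, is_fabrication). Per bin: |mean score - observed
    fabrication rate|, weighted by how many items land in the bin."""
    bins = [[] for _ in range(n_bins)]
    for score, label in pairs:
        bins[min(int(score * n_bins), n_bins - 1)].append((score, label))
    total = len(pairs)
    return sum(
        (len(b) / total)
        * abs(sum(s for s, _ in b) / len(b) - sum(y for _, y in b) / len(b))
        for b in bins if b
    )

# Toy check: the probe says 5% fabrication risk, but 10% of items are fabricated.
data = [(0.05, False)] * 9 + [(0.05, True)]
gap = ece(data)  # |0.05 - 0.10| = 0.05
```

Low AUROC shows ranking failure; high ECE shows the scores themselves cannot be read as probabilities, which matters when the threshold encodes a clinical cost model.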

What we need from you

  • ✓ 200-500 medical Q-A pairs with annotated hallucination labels
  • ✓ Description of clinical context (specialty, query distribution, deployment surface)
  • ✓ Asymmetric-cost specification (e.g., "over-abstention 5×, false-confidence 50×")
  • ✓ Engineering counterpart for ~2 weeks integration support
  • ✓ Permission to publish anonymized aggregate results (or NDA + private deliverable)

Pilot outcomes — what gets published

At the end of the pilot we co-author a post on openinterp.org/blog with the domain-specific numbers. Your team is named, and your task is registered as a new entry in /probebench/tasks with a SHA-256-hashed test set. A probe variant tuned to your domain becomes a registered probe alongside FabricationGuard v2, and your customers see the validation.
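SHA-256 pinning of a test set is plain content hashing. A sketch of how an entry's digest could be produced (canonical JSON with sorted keys is our assumed serialization here, not a documented registry format):

```python
import hashlib
import json

def pin_test_set(examples):
    """SHA-256 over a canonical serialization, so the registry can detect
    any later mutation of the pinned examples."""
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

examples = [
    {"question": "What are the symptoms of long QT syndrome?", "label": "grounded"},
]
digest = pin_test_set(examples)
assert len(digest) == 64  # hex digest; editing any example changes it
```

Sorted keys make the digest independent of dict ordering, so two parties serializing the same examples get the same pin.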

Medical AI roadmap on ProbeBench

  1. v0.1 (today)
    FabricationGuard general-domain — applicable to medical with caveat
    Linear probe trained on TruthfulQA/HaluEval/MMLU/SimpleQA. Cross-domain AUROC unmeasured for medical. Use with explicit recalibration.
  2. v0.2 (Q3 2026)
    Medical-domain test sets registered
    PubMedQA, MedQA-USMLE, MIMIC-IV-derived summarization halu test set added to /probebench/tasks. SHA-256 pinned.
  3. v0.3 (Q4 2026)
    MedicalFabricationGuard — domain-specialized probe
    Probe re-trained on medical Q-A corpus with halu labels from clinician annotations. Released as separate registry entry alongside general-domain v2.
  4. v1.0 (2027)
    Drug-interaction probe + clinical-grounding NLI ensemble
    Specialized probe for pharmaceutical fabrications + post-hoc grounding against pharmacovigilance DB. Combined ProbeScore reported.

Pilot inquiry

Email us a one-paragraph description of your medical AI deployment and your test-set availability.

caio@openinterp.org

Self-serve integration

Already deployed Qwen3.6-27B and want to drop in FabricationGuard?

pip install openinterp

from openinterp import FabricationGuard

guard = FabricationGuard.from_pretrained("Qwen/Qwen3.6-27B")
output = guard.generate(query, mode="abstain", threshold=0.5)  # abstains when fabrication probability > 0.5

ProbeBench v0.0.1 · Apache-2.0 · OpenInterp · 2026