ProbeBench Application · Medical AI

Fabrication detection for clinical LLM deployments

The 2026 medical AI category is moving fast: CZI Biohub committed $500M to AI biology on Apr 29, 2026, and EU AI Act Article 14 takes effect in August 2026. The bottleneck is not modeling; it is knowing when the model is making things up. FabricationGuard is one open standard for that signal: ~1 ms scoring latency, AUROC 0.88 cross-task, Apache-2.0.

Why this category is urgent in 2026

$500M just committed
CZI Biohub announced $500M in AI-biology funding on Apr 29, 2026; EvolutionaryScale was acquired; CZI's operating budget stands at $1B/year. Capital is flowing into AI for medical use cases.
EU AI Act enforcement
Article 14 (human oversight) takes effect in August 2026 for high-risk AI, including medical. Activation-probe abstention fits the regulatory pattern: a technical mechanism for operator intervention.
Hallucination is uniquely costly here
In open QA a fabricated answer is annoying; in medical AI it is a clinical incident. That asymmetric cost means probe-based abstention needs very different threshold tuning than in consumer applications.

What FabricationGuard does (in 60 seconds)

A linear logistic-regression probe on the residual stream at layer 31 of Qwen3.6-27B, scored at the end of the user question and trained cross-bench on TruthfulQA + HaluEval + MMLU + SimpleQA. It captures the "model is fabricating" signal that lives in the residual stream before generation begins, so abstention happens at zero token cost.
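Mechanically, scoring is a single dot product plus a sigmoid over one residual-stream vector. A minimal sketch in pure Python, with toy 4-dimensional weights standing in for the real layer-31 probe weights (the toy values are illustrative, not the shipped probe):

```python
import math

def fabrication_score(hidden, weights, bias):
    """Logistic-regression probe: sigmoid(w . h + b) over one residual vector."""
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy 4-dimensional example; the real probe reads the layer-31 residual stream.
score = fabrication_score(
    hidden=[0.2, -1.1, 0.7, 0.3],   # activations at the end-of-question token
    weights=[0.5, -0.4, 1.2, 0.0],  # learned probe weights (toy values)
    bias=-0.1,
)
assert 0.0 < score < 1.0  # a probability; abstain when it exceeds the threshold
```

Because the score is read before the first output token, the abstention decision costs no generation at all.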

  • HaluEval-QA: AUROC 0.903 (n=200)
  • SimpleQA: AUROC 0.882 (n=100)
  • TruthfulQA-MC1: AUROC 0.599 (n=200)
  • MMLU: AUROC 0.444 (n=500)

Mitigation analysis (notebook 31): in abstain mode at threshold 0.684, the confidently-wrong response rate drops by 88% on SimpleQA, 52% on HaluEval, and 50% on TruthfulQA. MMLU is "capability control" and out of scope by design (see honest scope below).
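The confidently-wrong metric behind those reductions can be reproduced on any labeled sample. A sketch with illustrative data (not the benchmark itself); abstained items count as safe rather than wrong:

```python
def confidently_wrong_rate(items, threshold=None):
    """Fraction of items answered wrongly; abstained items count as safe.

    items: (probe_score, answer_correct) pairs. With a threshold set, any
    item whose score exceeds it is abstained on rather than answered."""
    wrong = sum(
        1 for score, correct in items
        if not correct and (threshold is None or score <= threshold)
    )
    return wrong / len(items)

sample = [(0.9, False), (0.8, False), (0.3, False), (0.2, True), (0.1, True)]
base = confidently_wrong_rate(sample)                        # answers everything
mitigated = confidently_wrong_rate(sample, threshold=0.684)  # abstains on 0.9, 0.8
```

Here the rate falls from 0.6 to 0.2 because the two worst wrong answers sit above the threshold; the benchmark reductions above are the same computation at scale.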

Where FabricationGuard fits in clinical workflows

Clinical Q&A grounding

Patient-facing or clinician-facing chat: asked "What are the symptoms of long QT syndrome?", the model can fabricate a plausible-sounding but wrong list. FabricationGuard scores the residual stream at end-of-question and abstains when the fabrication probability exceeds the threshold.

HaluEval-QA AUROC 0.903
SimpleQA cross-task AUROC 0.882
Caveat: Calibrated for general open-domain factual QA. Medical-specific recalibration is recommended, with a lower threshold to reflect the asymmetric cost: medical false-confidence is more dangerous than over-abstention.
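One way to operationalize that recalibration is to pick the threshold minimizing expected cost on a labeled medical calibration set. A sketch under an assumed 50:1 false-confidence-to-over-abstention cost ratio (the ratio is a placeholder for your own cost model, not a recommendation):

```python
def pick_threshold(labeled, cost_fc=50.0, cost_abstain=1.0):
    """Choose the abstention threshold minimizing expected cost.

    labeled: (probe_score, is_fabrication) pairs.
    A fabrication answered at or below the threshold costs cost_fc
    (false confidence); any abstention above it costs cost_abstain."""
    candidates = sorted({s for s, _ in labeled} | {0.0, 1.0})

    def cost(t):
        return sum(
            cost_fc if fab and s <= t else cost_abstain if s > t else 0.0
            for s, fab in labeled
        )

    return min(candidates, key=cost)

data = [(0.9, True), (0.7, True), (0.6, False), (0.4, False), (0.2, False)]
t = pick_threshold(data)
```

A larger false-confidence cost pushes the chosen threshold down, abstaining more aggressively, which is the direction the caveat above argues for in medical deployments.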

Drug-interaction lookup

The model summarizes drug interactions for clinical reference. Rare interactions are long-tail factoids, and long-tail factoids are fabrication-prone. The probe surfaces cases where the model is generating plausible text without grounding.

SimpleQA: rare-entity factoid recall, where model fabrication probability is high
Caveat: This does not replace pharmacovigilance database lookup. Use it as a second-line confidence layer, not a primary source.

Evidence-grounded clinical summarization

The model summarizes a patient chart, study abstract, or guideline document. Fabrication happens when the model adds claims absent from the source; the probe captures this even before the model finishes generating.

HaluEval-QA: open-ended generation hallucination, directly applicable
Caveat: Best paired with retrieval-grounded NLI (e.g., HaluGate-style post-hoc verification) for highest reliability. The two methods are complementary; see /probebench/comparisons (in development).
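The complementary pairing can be as simple as sequential gating. A sketch of one possible policy (the `nli_entailed` flag stands in for whatever verdict your post-hoc verifier returns; this is not HaluGate's actual API):

```python
def grounded_verdict(probe_score, nli_entailed, threshold=0.684):
    """Complementary gating: the probe catches closed-book fabrication before
    generation; NLI catches claims not entailed by the retrieved source."""
    if probe_score > threshold:
        return "abstain"           # probe fired pre-generation, zero token cost
    if not nli_entailed:
        return "flag_for_review"   # NLI found an unsupported claim post hoc
    return "deliver"

assert grounded_verdict(0.9, True) == "abstain"
assert grounded_verdict(0.1, False) == "flag_for_review"
assert grounded_verdict(0.1, True) == "deliver"
```

The ordering matters: the probe is nearly free (~1 ms) and runs first, so NLI verification is only paid for on responses the probe let through.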

Biological / pharmaceutical reasoning

AI suggesting protein-target interactions, drug-repurposing candidates, or disease pathways. The model can fabricate relationships not present in its training data, which is clinically critical to flag.

Out-of-distribution generation: the probe surfaces low-confidence or fabricated assertions
Caveat: EvolutionaryScale-style biological reasoning was NOT in our training distribution. Recalibration on a domain-specific test set is necessary before clinical use.

Honest scope — what is not validated yet

We do not currently have medical-domain test sets in the registry. The probe is trained on general open-domain factual QA. Performance on medical-specific benchmarks (PubMedQA, MedQA, MIMIC-IV summarization) is not yet measured. Anyone deploying this in a clinical setting must:

  • Recalibrate threshold on a held-out medical test set with annotated halu/grounded labels (we recommend 200+ samples).
  • Estimate domain-shift AUROC: out-of-distribution AUROC for medical context is likely lower than the 0.88 cross-task headline.
  • Pair with retrieval grounding when ground truth exists (RAG, drug databases, guideline DB). FabricationGuard handles closed-book; HaluGate-style NLI handles grounded. The two are complementary.
  • Keep a human in the loop above any abstention threshold. Probe-based abstention is a filter, not a replacement for clinical review, especially in the EU AI Act Article 14 sense.
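For the domain-shift AUROC estimate in the second step, a stratified bootstrap over the 200+ labeled samples yields a confidence interval alongside the point estimate. A stdlib-only sketch on synthetic data (the data is illustrative, not a medical benchmark):

```python
import random

def auroc(pairs):
    """Probability a random fabricated item outscores a random grounded one."""
    pos = [s for s, fab in pairs if fab]
    neg = [s for s, fab in pairs if not fab]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(pairs, n_boot=200, seed=0):
    """Stratified bootstrap 95% CI: resample within each label class."""
    rng = random.Random(seed)
    pos = [p for p in pairs if p[1]]
    neg = [p for p in pairs if not p[1]]
    stats = sorted(
        auroc([rng.choice(pos) for _ in pos] + [rng.choice(neg) for _ in neg])
        for _ in range(n_boot)
    )
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]

# Synthetic held-out set: fabricated items tend to score higher than grounded.
rng = random.Random(1)
data = [(rng.uniform(0.4, 1.0), True) for _ in range(100)]
data += [(rng.uniform(0.0, 0.6), False) for _ in range(100)]
point = auroc(data)
lo, hi = bootstrap_ci(data)
```

A wide interval on your medical set is itself a finding: it means 200 samples are not enough to certify the probe for that specialty.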

If your team has annotated medical hallucination data and wants to run the probe to produce domain-specific numbers, we will publish those evaluations on ProbeBench (with citation to your team) at no cost.

Pilot framework — for medical-domain partners

We are looking for 2-3 partners (clinical research orgs, medical AI startups, hospital systems with ML teams) to run a 30-day pilot. Outcome: domain-validated AUROC numbers on your test set, published on ProbeBench with citation.

What we provide

  • ✓ FabricationGuard probe (Apache-2.0) integrated into your inference pipeline
  • ✓ Recalibration on your held-out test set (200-500 examples)
  • ✓ Threshold tuning by your asymmetric cost model (false-confidence vs over-abstention)
  • ✓ ECE / FPR@99TPR / AUROC with bootstrap CIs
  • ✓ Cross-model Pearson_CE if you use a non-Qwen model
  • ✓ Public publication of results on ProbeBench (or private if NDA)
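Of the metrics in that deliverables list, ECE is the least standardized, so for concreteness here is an equal-width-binning variant (the binning scheme is our assumption; the source does not pin one down):

```python
def ece(pairs, n_bins=10):
    """Expected calibration error with equal-width bins.

    pairs: (probe_score, is_fabrication). Per bin: |mean score - observed
    fabrication rate|, weighted by how many items land in the bin."""
    bins = [[] for _ in range(n_bins)]
    for score, label in pairs:
        bins[min(int(score * n_bins), n_bins - 1)].append((score, label))
    total = len(pairs)
    return sum(
        (len(b) / total)
        * abs(sum(s for s, _ in b) / len(b) - sum(y for _, y in b) / len(b))
        for b in bins if b
    )

# Toy check: the probe says 5% fabrication risk, but 10% of items are fabricated.
data = [(0.05, False)] * 9 + [(0.05, True)]
gap = ece(data)  # |0.05 - 0.10| = 0.05
```

Low AUROC shows ranking failure; high ECE shows the scores themselves cannot be read as probabilities, which matters when the threshold encodes a clinical cost model.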

What we need from you

  • ✓ 200-500 medical Q-A pairs with annotated hallucination labels
  • ✓ Description of clinical context (specialty, query distribution, deployment surface)
  • ✓ Asymmetric-cost specification (e.g., "over-abstention 5×, false-confidence 50×")
  • ✓ Engineering counterpart for ~2 weeks integration support
  • ✓ Permission to publish anonymized aggregate results (or NDA + private deliverable)

Pilot outcomes — what gets published

At the end of the pilot we co-author a post on openinterp.org/blog with the domain-specific numbers. Your team is named, and your task is registered as a new entry in /probebench/tasks with a SHA-256-hashed test set. A probe variant tuned to your domain becomes a registered probe alongside FabricationGuard v2, and your customers see the validation.
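SHA-256 pinning of a test set is plain content hashing. A sketch of how an entry's digest could be produced (canonical JSON with sorted keys is our assumed serialization here, not a documented registry format):

```python
import hashlib
import json

def pin_test_set(examples):
    """SHA-256 over a canonical serialization, so the registry can detect
    any later mutation of the pinned examples."""
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

examples = [
    {"question": "What are the symptoms of long QT syndrome?", "label": "grounded"},
]
digest = pin_test_set(examples)
assert len(digest) == 64  # hex digest; editing any example changes it
```

Sorted keys make the digest independent of dict ordering, so two parties serializing the same examples get the same pin.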

Medical AI roadmap on ProbeBench

  1. v0.1 (today)
    FabricationGuard general-domain — applicable to medical with caveat
    Linear probe trained on TruthfulQA/HaluEval/MMLU/SimpleQA. Cross-domain AUROC unmeasured for medical. Use with explicit recalibration.
  2. v0.2 (Q3 2026)
    Medical-domain test sets registered
    PubMedQA, MedQA-USMLE, MIMIC-IV-derived summarization halu test set added to /probebench/tasks. SHA-256 pinned.
  3. v0.3 (Q4 2026)
    MedicalFabricationGuard — domain-specialized probe
    Probe re-trained on medical Q-A corpus with halu labels from clinician annotations. Released as separate registry entry alongside general-domain v2.
  4. v1.0 (2027)
    Drug-interaction probe + clinical-grounding NLI ensemble
    Specialized probe for pharmaceutical fabrications + post-hoc grounding against pharmacovigilance DB. Combined ProbeScore reported.

Pilot inquiry

Email us a one-paragraph description of your medical AI deployment and your test-set availability.

caio@openinterp.org

Self-serve integration

Already deployed Qwen3.6-27B and want to drop in FabricationGuard?

pip install openinterp

from openinterp import FabricationGuard

guard = FabricationGuard.from_pretrained("Qwen/Qwen3.6-27B")
output = guard.generate(query, mode="abstain", threshold=0.5)  # abstains when fabrication probability > 0.5

ProbeBench v0.0.1 · Apache-2.0 · OpenInterp · 2026