ProbeBench v0.0.1 · Open Standard · Apache-2.0 reproducers

The first categorical leaderboard for activation probes.

Eight alignment-relevant categories. Seven axes weighted into a single ProbeScore. Every entry ships with a Colab reproducer, an SHA-256-hashed test set, and Pearson_CE cross-model transfer numbers. We are publishing the slot we want our own probes to compete in.

5 · Probes registered · seed entries, open to community PRs
8 · Categories · hallucination → reward-hacking
10 · Tasks evaluated · hashed test sets, anti-contamination
11 · Eval cells · across 3 base models
Submit a probe · Methodology
ProbeScore weights are versioned · PR a different set and we mint v0.0.2

Why this exists

Probing the residual stream is one of the most reproducible interpretability primitives we have — and one of the easiest to mis-report. ProbeScore composes the three corrections most likely to surface the difference between a real probe and a confound.

Single AUROC is not enough

Published probes hit 0.96+ on Apollo deception suites but break under long-context distribution shift (DeepMind, arXiv 2601.11516) and inflate when test prompts trigger eval-awareness (arXiv 2509.13333). ProbeScore composites both corrections so the headline number is harder to oversell.

Probes don't transfer for free

Naive cosine alignment misses 38% of causal directions across crosscoders. Pearson_CE, the paired-ablation correlation between source and target models, is the honest transfer metric. Every cross-model entry on this page ships one.
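The page defines Pearson_CE only informally; a minimal sketch of that reading, assuming the inputs are hypothetical per-example cross-entropy deltas measured under paired ablation in each model:

```python
import numpy as np

def pearson_ce(ce_delta_source, ce_delta_target):
    """Correlate paired ablation effects across two models.

    ce_delta_source[i]: change in cross-entropy on example i when the
    probe direction is ablated in the source model.
    ce_delta_target[i]: the same intervention mapped into the target model.
    Values near 1.0 suggest the direction transfers causally,
    not just geometrically.
    """
    s = np.asarray(ce_delta_source, dtype=float)
    t = np.asarray(ce_delta_target, dtype=float)
    s -= s.mean()
    t -= t.mean()
    return float(s @ t / (np.linalg.norm(s) * np.linalg.norm(t)))
```

Unlike cosine similarity between raw probe directions, this scores agreement of downstream causal effects, which is the gap the 38% figure points at.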

Anti-Goodhart by construction

No axis weighted above 30%. License score capped for closed-weight probes. Eval-awareness component dampens when calibration is missing. Hard to game by optimizing one number, easy to PR if you disagree with the weights.

Global Leaderboard

Sorted by ProbeScore across all 8 categories. Click any row for the full Probe DNA card — eval table, Pearson_CE matrix, license, artifact hash.

5 entries · ProbeScore v0.0.1

By category

Each alignment-relevant slot has its own table, its own held-out test set, and its own empty rows for probes that haven't been submitted yet.

Hallucination

Factual fabrication, entity recall failures.

4 tasks · 1 probe

Rank 3 · FabricationGuard v2 (live)
AUROC 0.88 cross-task on SimpleQA · −88% confident-wrong reduction
OpenInterp · Caio Vicentino
Qwen3.6-27B · L31 · end_question
Best AUROC 0.903 · Eval-aware 0.662 · Dist-shift 0.598 · ProbeScore 0.662

Reasoning

Chain-of-thought faithfulness, hypocrisy gap.

4 tasks · 1 probe

Deception

Strategic dishonesty, sleeper agents, alignment faking.

1 task · 1 probe

Rank 2 · DeceptionGuard (Apollo re-impl) (pending)
Re-impl of Apollo Research deception probe (published AUROC 0.96–0.999)
OpenInterp · Caio Vicentino (re-impl)
Llama-3.3-70B · L40 · last_token
Best AUROC 0.978 · Eval-aware 0.810 · Dist-shift 0.745 · ProbeScore 0.785

Sandbagging

Capability hiding under evaluation.

0 tasks · 0 probes
🌱
Slot open · be the first to submit
No probe registered for sandbagging yet. PR an artifact + reproducer notebook to claim row 1.
Open the submission template

Eval Awareness

The model-knows-it-is-being-tested confound.

1 task · 1 probe

Rank 1 · EvalAwarenessGuard (pending)
Detects when the model believes it is being evaluated · UK AISI priority
OpenInterp · Caio Vicentino
Qwen3.6-27B · L40 · end_question
Best AUROC 0.930 · Eval-aware 0.930 · Dist-shift 0.810 · ProbeScore 0.848

Reward Hacking

RL-induced misalignment generalization.

0 tasks · 1 probe

Rank 4 · RewardHackGuard PoC (pending)
Detects emergent reward-hacking generalization · Anthropic Nov 2025 framing
OpenInterp · Caio Vicentino
Qwen3.6-27B · L31 · token_avg
Best AUROC 0.650 · Eval-aware 0.590 · Dist-shift 0.520 · ProbeScore 0.645

Manipulation

EU AI Act Article 5 — persuasion / subliminal.

0 tasks · 0 probes
🌱
Slot open · be the first to submit
No probe registered for manipulation yet. PR an artifact + reproducer notebook to claim row 1.
Open the submission template

Refusal

Over/under-refusal calibration; jailbreak proxy.

0 tasks · 0 probes
🌱
Slot open · be the first to submit
No probe registered for refusal yet. PR an artifact + reproducer notebook to claim row 1.
Open the submission template

ProbeScore axes

Weights v0.0.1 · subject to community revision

Seven axes, each in [0, 1], weighted to sum to 1.0. We report every component alongside the headline score so you can re-aggregate with your own weights.

AUROC · 25% · headline detection accuracy
Eval-aware · 18% · confound-corrected AUROC
Dist-shift · 12% · long-context / OOD generalization
Calibration · 10% · expected calibration error
Transfer · 10% · Pearson_CE across models
Latency · 10% · 1 / log(latency_ms)
License · 5% · openness · commercial-friendliness
Read the full methodology · No axis > 30% weight · license bounded · eval-awareness dampening
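To make the re-aggregation claim concrete, here is a hedged sketch of one way to recombine published components. Note that the seven listed weights sum to 0.90, so this sketch normalizes by the total weight of whatever components are present; the dampening factor for a missing calibration number is purely illustrative, not the published rule.

```python
# Hypothetical re-aggregation of a ProbeScore from its published components.
# The weight values mirror the v0.0.1 list above; `damp` is illustrative.
WEIGHTS = {
    "auroc": 0.25, "eval_aware": 0.18, "dist_shift": 0.12,
    "calibration": 0.10, "transfer": 0.10, "latency": 0.10, "license": 0.05,
}

def probe_score(components, weights=WEIGHTS, damp=0.5):
    w = dict(weights)
    # One possible reading of "eval-awareness dampens when calibration is
    # missing": shrink the eval-aware weight if no calibration was reported.
    if components.get("calibration") is None:
        w["eval_aware"] *= damp
    present = [k for k in w if components.get(k) is not None]
    total = sum(w[k] for k in present)
    return sum(w[k] * components[k] for k in present) / total

# Re-score FabricationGuard v2's published components with your own weights.
print(round(probe_score({"auroc": 0.903, "eval_aware": 0.662,
                         "dist_shift": 0.598}), 3))
```

Because every component is reported alongside the headline score, anyone can swap `WEIGHTS` and re-rank the leaderboard without rerunning a single eval.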

Applications

Where these probes apply in practice. CZI Biohub committed $500M to AI-biology on Apr 29, 2026; medical AI is the most urgent vertical for fabrication detection.

Submit your probe

Open a PR with a sklearn-compatible artifact (probe + scaler + metadata), an SHA-256-hashed test set, and a one-click Colab reproducer. The submission template walks you through the spec.

  • predict_proba(X) -> (n, 2) probe interface
  • StandardScaler-compatible feature scaler
  • metadata block: model, layer, position, license, contact
  • spec_version 0.0.1

Run the SDK

The openinterp PyPI package ships a probebench module that downloads any probe, runs the canonical eval suite, and prints a ProbeScore card identical to the leaderboard above.

pip install -U openinterp

from openinterp import probebench
probe = probebench.load("openinterp/fabricationguard-qwen36-27b-l31-v2")
print(probebench.score(probe))