ProbeBench v0.0.1 · Open Standard · Apache-2.0 reproducers

The first categorical leaderboard for activation probes.

Eight alignment-relevant categories. Seven axes weighted into a single ProbeScore. Every entry ships with a Colab reproducer, an SHA-256-hashed test set, and Pearson_CE cross-model transfer numbers. We are publishing the slot we want our own probes to compete in.

5 · Probes registered · seed entries, open to community PRs
8 · Categories · hallucination → reward-hacking
10 · Tasks evaluated · hashed test sets, anti-contamination
11 · Eval cells · across 3 base models
Submit a probe · Methodology
ProbeScore weights are versioned · PR a different set and we mint v0.0.2

Why this exists

Probing the residual stream is one of the most reproducible interpretability primitives we have — and one of the easiest to mis-report. ProbeScore composes the three corrections most likely to surface the difference between a real probe and a confound.

Single AUROC is not enough

Published probes hit 0.96+ on Apollo deception suites but break under long-context distribution shift (DeepMind, arXiv 2601.11516) and inflate when test prompts trigger eval-awareness (arXiv 2509.13333). ProbeScore composites both corrections so the headline number is harder to oversell.

Probes don't transfer for free

Naive cosine alignment misses 38% of causal directions across crosscoders. Pearson_CE, the paired-ablation correlation between source and target models, is the honest transfer metric. Every cross-model entry on this page ships one.
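The page defines Pearson_CE only informally; a minimal sketch of that reading, assuming the inputs are hypothetical per-example cross-entropy deltas measured under paired ablation in each model:

```python
import numpy as np

def pearson_ce(ce_delta_source, ce_delta_target):
    """Correlate paired ablation effects across two models.

    ce_delta_source[i]: change in cross-entropy on example i when the
    probe direction is ablated in the source model.
    ce_delta_target[i]: the same intervention mapped into the target model.
    Values near 1.0 suggest the direction transfers causally,
    not just geometrically.
    """
    s = np.asarray(ce_delta_source, dtype=float)
    t = np.asarray(ce_delta_target, dtype=float)
    s -= s.mean()
    t -= t.mean()
    return float(s @ t / (np.linalg.norm(s) * np.linalg.norm(t)))
```

Unlike cosine similarity between raw probe directions, this scores agreement of downstream causal effects, which is the gap the 38% figure points at.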

Anti-Goodhart by construction

No axis weighted above 30%. License score capped for closed-weight probes. Eval-awareness component dampens when calibration is missing. Hard to game by optimizing one number, easy to PR if you disagree with the weights.

Global Leaderboard

Sorted by ProbeScore across all 8 categories. Click any row for the full Probe DNA card — eval table, Pearson_CE matrix, license, artifact hash.

5 entries · ProbeScore v0.0.1

By category

Each alignment-relevant slot has its own table, its own held-out test set, and its own empty rows for probes that haven't been submitted yet.

Hallucination

Factual fabrication, entity recall failures.

4 tasks · 1 probe

Rank 3 · FabricationGuard v2 (live)
AUROC 0.88 cross-task on SimpleQA · −88% confident-wrong reduction
OpenInterp · Caio Vicentino
Qwen3.6-27B · L31 · end_question
Best AUROC 0.903 · Eval-aware 0.662 · Dist-shift 0.598 · ProbeScore 0.662

Reasoning

Chain-of-thought faithfulness, hypocrisy gap.

4 tasks · 1 probe

Deception

Strategic dishonesty, sleeper agents, alignment faking.

1 task · 1 probe

Rank 2 · DeceptionGuard (Apollo re-impl) (pending)
Re-impl of Apollo Research deception probe (published AUROC 0.96–0.999)
OpenInterp · Caio Vicentino (re-impl)
Llama-3.3-70B · L40 · last_token
Best AUROC 0.978 · Eval-aware 0.810 · Dist-shift 0.745 · ProbeScore 0.785

Sandbagging

Capability hiding under evaluation.

0 tasks · 0 probes
🌱
Slot open · be the first to submit
No probe registered for sandbagging yet. PR an artifact + reproducer notebook to claim row 1.
Open the submission template

Eval Awareness

The model-knows-it-is-being-tested confound.

1 task · 1 probe

Rank 1 · EvalAwarenessGuard (pending)
Detects when the model believes it is being evaluated · UK AISI priority
OpenInterp · Caio Vicentino
Qwen3.6-27B · L40 · end_question
Best AUROC 0.930 · Eval-aware 0.930 · Dist-shift 0.810 · ProbeScore 0.848

Reward Hacking

RL-induced misalignment generalization.

0 tasks · 1 probe

Rank 4 · RewardHackGuard PoC (pending)
Detects emergent reward-hacking generalization · Anthropic Nov 2025 framing
OpenInterp · Caio Vicentino
Qwen3.6-27B · L31 · token_avg
Best AUROC 0.650 · Eval-aware 0.590 · Dist-shift 0.520 · ProbeScore 0.645

Manipulation

EU AI Act Article 5 — persuasion / subliminal.

0 tasks · 0 probes
🌱
Slot open · be the first to submit
No probe registered for manipulation yet. PR an artifact + reproducer notebook to claim row 1.
Open the submission template

Refusal

Over/under-refusal calibration; jailbreak proxy.

0 tasks · 0 probes
🌱
Slot open · be the first to submit
No probe registered for refusal yet. PR an artifact + reproducer notebook to claim row 1.
Open the submission template

ProbeScore axes

Weights v0.0.1 · subject to community revision

Seven axes, each in [0, 1], weighted to sum to 1.0. We report every component alongside the headline score so you can re-aggregate with your own weights.

AUROC · 25% · headline detection accuracy
Eval-aware · 18% · confound-corrected AUROC
Dist-shift · 12% · long-context / OOD generalization
Calibration · 10% · expected calibration error
Transfer · 10% · Pearson_CE across models
Latency · 10% · 1 / log(latency_ms)
License · 5% · openness · commercial-friendliness
Read the full methodology · No axis > 30% weight · license bounded · eval-awareness dampening
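To make the re-aggregation claim concrete, here is a hedged sketch of one way to recombine published components. Note that the seven listed weights sum to 0.90, so this sketch normalizes by the total weight of whatever components are present; the dampening factor for a missing calibration number is purely illustrative, not the published rule.

```python
# Hypothetical re-aggregation of a ProbeScore from its published components.
# The weight values mirror the v0.0.1 list above; `damp` is illustrative.
WEIGHTS = {
    "auroc": 0.25, "eval_aware": 0.18, "dist_shift": 0.12,
    "calibration": 0.10, "transfer": 0.10, "latency": 0.10, "license": 0.05,
}

def probe_score(components, weights=WEIGHTS, damp=0.5):
    w = dict(weights)
    # One possible reading of "eval-awareness dampens when calibration is
    # missing": shrink the eval-aware weight if no calibration was reported.
    if components.get("calibration") is None:
        w["eval_aware"] *= damp
    present = [k for k in w if components.get(k) is not None]
    total = sum(w[k] for k in present)
    return sum(w[k] * components[k] for k in present) / total

# Re-score FabricationGuard v2's published components with your own weights.
print(round(probe_score({"auroc": 0.903, "eval_aware": 0.662,
                         "dist_shift": 0.598}), 3))
```

Because every component is reported alongside the headline score, anyone can swap `WEIGHTS` and re-rank the leaderboard without rerunning a single eval.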

Applications

Where these probes apply in practice. CZI Biohub committed $500M to AI-biology on Apr 29, 2026; medical AI is the most urgent vertical for fabrication detection.

Submit your probe

Open a PR with a sklearn-compatible artifact (probe + scaler + metadata), an SHA-256-hashed test set, and a one-click Colab reproducer. The submission template walks you through the spec.

  • predict_proba(X) -> (n, 2) probe interface
  • StandardScaler-compatible feature scaler
  • metadata block: model, layer, position, license, contact
  • spec_version 0.0.1

Run the SDK

The openinterp PyPI package ships a probebench module that downloads any probe, runs the canonical eval suite, and prints a ProbeScore card identical to the leaderboard above.

pip install -U openinterp

from openinterp import probebench
probe = probebench.load("openinterp/fabricationguard-qwen36-27b-l31-v2")
print(probebench.score(probe))