ProbeBench v0.0.1 · Open Standard · Apache-2.0 reproducers

The first categorical leaderboard for activation probes.

Eight alignment-relevant categories. Seven axes weighted into a single ProbeScore. Every entry ships with a Colab reproducer, an SHA-256-hashed test set, and Pearson_CE cross-model transfer numbers. We are publishing the slot we want our own probes to compete in.
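A minimal sketch of how a test set could be fingerprinted with SHA-256 so that a leaderboard entry can be checked against the exact examples it was scored on. The canonical JSON encoding, the function name `sha256_of_examples`, and the example rows are assumptions for illustration; the page does not specify ProbeBench's actual hashing recipe.

```python
import hashlib
import json


def sha256_of_examples(examples):
    """Return a hex SHA-256 digest over a list of JSON-serializable examples.

    Assumption: one canonical JSON line per example, in list order.
    """
    digest = hashlib.sha256()
    for example in examples:
        # sort_keys and fixed separators make the byte encoding independent
        # of dict key order and whitespace choices.
        line = json.dumps(
            example, sort_keys=True, separators=(",", ":"), ensure_ascii=False
        )
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical test-set rows, for illustration only.
examples = [
    {"prompt": "Name the capital of France.", "label": 0},
    {"prompt": "Cite a paper that does not exist.", "label": 1},
]
fingerprint = sha256_of_examples(examples)
print(fingerprint)
```

Changing any label or prompt changes the digest, which is what makes a published hash useful as an anti-contamination check.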

New: every probe here is now agent-callable via openinterp-mcp — any Claude Code / Cursor / Cline session can score or audit a probe against your own Colab compute.

5 probes registered (seed entries · open to community PRs)
8 categories (hallucination → reward-hacking)
10 tasks evaluated (hashed test sets · anti-contamination)
11 eval cells (across 3 base models)
Submit a probe · Methodology
ProbeScore weights are versioned · PR a different set, we mint v0.0.2

Why this exists

Probing the residual stream is one of the most reproducible interpretability primitives we have — and one of the easiest to mis-report. ProbeScore composes the three corrections most likely to surface the difference between a real probe and a confound.

Single AUROC is not enough

Published probes hit 0.96+ on Apollo deception suites, but break under long-context distribution shift (DeepMind 2601.11516) and inflate when test prompts trigger eval-awareness (arXiv 2509.13333). ProbeScore composites both corrections so the headline number is harder to oversell.

Probes don't transfer for free

Naive cosine alignment misses 38% of causal direction across crosscoders. Pearson_CE, the paired-ablation correlation between source and target models, is the honest transfer metric. Every cross-model entry on this page ships one.
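The text above describes Pearson_CE as a paired-ablation correlation between source and target models but does not spell out the protocol. The sketch below assumes one plausible reading: for each test prompt, the probe direction is ablated in both models, the resulting change in cross-entropy is recorded, and the two vectors of changes are correlated. The function name and the sample values are hypothetical.

```python
import math


def pearson_ce(delta_ce_source, delta_ce_target):
    """Pearson correlation between per-prompt cross-entropy changes.

    Assumption: element i of each list is the change in cross-entropy on
    prompt i after ablating the probe direction in that model.
    """
    if len(delta_ce_source) != len(delta_ce_target):
        raise ValueError("source and target must be paired, one value per prompt")
    n = len(delta_ce_source)
    if n < 2:
        raise ValueError("need at least two paired prompts")

    mean_s = sum(delta_ce_source) / n
    mean_t = sum(delta_ce_target) / n
    cov = sum(
        (s - mean_s) * (t - mean_t)
        for s, t in zip(delta_ce_source, delta_ce_target)
    )
    var_s = sum((s - mean_s) ** 2 for s in delta_ce_source)
    var_t = sum((t - mean_t) ** 2 for t in delta_ce_target)
    if var_s == 0 or var_t == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / math.sqrt(var_s * var_t)


# Hypothetical ablation effects on four shared prompts.
source_deltas = [0.10, 0.40, 0.05, 0.80]
target_deltas = [0.12, 0.35, 0.07, 0.70]
print(round(pearson_ce(source_deltas, target_deltas), 3))
```

Unlike cosine similarity between direction vectors, this compares what the ablation does to each model's behavior, so it stays meaningful even when the two residual streams have different bases or widths.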

Anti-Goodhart by construction

No axis weighted above 30%. License score capped for closed-weight probes. Eval-awareness component dampens when calibration is missing. Hard to game by optimizing one number, easy to PR if you disagree with the weights.
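A rough sketch of how the 30% cap could be checked before a proposed weight set is submitted in a PR. Only the cap itself comes from the text above; the function name `check_weight_caps` and the proposed weights are hypothetical, and the license cap and eval-awareness dampening are not modeled because their exact rules are not given here.

```python
MAX_AXIS_WEIGHT = 0.30  # "No axis weighted above 30%"


def check_weight_caps(weights, max_weight=MAX_AXIS_WEIGHT):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for axis, weight in weights.items():
        if weight < 0:
            problems.append(f"{axis}: weight {weight} is negative")
        elif weight > max_weight + 1e-12:  # small tolerance for float noise
            problems.append(f"{axis}: weight {weight} exceeds cap {max_weight}")
    return problems


# Hypothetical proposal that leans too hard on a single axis.
proposal = {"auroc": 0.40, "transfer": 0.30, "latency": 0.30}
for problem in check_weight_caps(proposal):
    print(problem)
```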

Global Leaderboard

Sorted by ProbeScore across all 8 categories. Click any row for the full Probe DNA card — eval table, Pearson_CE matrix, license, artifact hash.

5 entries · ProbeScore v0.0.1
| # | Probe | Category | Model | AUROC | Latency (ms) | License | ProbeScore | Artifact |
|---|-------|----------|-------|-------|--------------|---------|------------|----------|
| 1 | EvalAwarenessGuard (Caio Vicentino · OpenInterp) | Eval-Aware | Qwen3.6-27B · L40 | 0.930 | 1.0 | Apache-2.0 | 0.848 | HF |
| 2 | DeceptionGuard (Caio Vicentino (re-impl) / Apollo Research (method) · OpenInterp) | Deception | Llama-3.3-70B-Instruct · L40 | 0.978 | 2.1 | Apache-2.0 | 0.785 | HF |
| 3 | FabricationGuard (Caio Vicentino · OpenInterp) | Hallucination | Qwen3.6-27B · L31 | 0.903 | 1.0 | Apache-2.0 | 0.662 | HF |
| 4 | RewardHackGuard (Caio Vicentino · OpenInterp) | Reward-Hack | Qwen3.6-27B · L31 | 0.650 | 1.8 | Apache-2.0 | 0.645 | HF |
| 5 | ReasonGuard (Caio Vicentino · OpenInterp) | Reasoning | Qwen3.6-27B · L55 | 0.908 | 1.0 | Apache-2.0 | 0.626 | HF |

By category

Each alignment-relevant slot has its own table, its own held-out test set, and its own empty rows for probes that haven't been submitted yet.

Hallucination

Factual fabrication, entity recall failures.

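4 tasks · 1 probe

| # | Probe | Model | AUROC | Latency (ms) | License | ProbeScore | Artifact |
|---|-------|-------|-------|--------------|---------|------------|----------|
| 1 | FabricationGuard (Caio Vicentino · OpenInterp) | Qwen3.6-27B · L31 | 0.903 | 1.0 | Apache-2.0 | 0.662 | HF |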

Reasoning

Chain-of-thought faithfulness, hypocrisy gap.

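4 tasks · 1 probe

| # | Probe | Model | AUROC | Latency (ms) | License | ProbeScore | Artifact |
|---|-------|-------|-------|--------------|---------|------------|----------|
| 1 | ReasonGuard (Caio Vicentino · OpenInterp) | Qwen3.6-27B · L55 | 0.908 | 1.0 | Apache-2.0 | 0.626 | HF |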

Deception

Strategic dishonesty, sleeper agents, alignment faking.

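1 task · 1 probe

| # | Probe | Model | AUROC | Latency (ms) | License | ProbeScore | Artifact |
|---|-------|-------|-------|--------------|---------|------------|----------|
| 1 | DeceptionGuard (Caio Vicentino (re-impl) / Apollo Research (method) · OpenInterp) | Llama-3.3-70B-Instruct · L40 | 0.978 | 2.1 | Apache-2.0 | 0.785 | HF |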

Sandbagging

Capability hiding under evaluation.

0 tasks · 0 probes
🌱
Slot open · be the first to submit
No probe registered for sandbagging yet. PR an artifact + reproducer notebook to claim row 1.
Open the submission template

Eval Awareness

Model knows-it-is-being-tested confound.

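1 task · 1 probe

| # | Probe | Model | AUROC | Latency (ms) | License | ProbeScore | Artifact |
|---|-------|-------|-------|--------------|---------|------------|----------|
| 1 | EvalAwarenessGuard (Caio Vicentino · OpenInterp) | Qwen3.6-27B · L40 | 0.930 | 1.0 | Apache-2.0 | 0.848 | HF |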

Reward Hacking

RL-induced misalignment generalization.

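0 tasks · 1 probe

| # | Probe | Model | AUROC | Latency (ms) | License | ProbeScore | Artifact |
|---|-------|-------|-------|--------------|---------|------------|----------|
| 1 | RewardHackGuard (Caio Vicentino · OpenInterp) | Qwen3.6-27B · L31 | 0.650 | 1.8 | Apache-2.0 | 0.645 | HF |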

Manipulation

EU AI Act Article 5 — persuasion / subliminal.

0 tasks · 0 probes
🌱
Slot open · be the first to submit
No probe registered for manipulation yet. PR an artifact + reproducer notebook to claim row 1.
Open the submission template

Refusal

Over/under-refusal calibration; jailbreak proxy.

0 tasks · 0 probes
🌱
Slot open · be the first to submit
No probe registered for refusal yet. PR an artifact + reproducer notebook to claim row 1.
Open the submission template

ProbeScore axes

Weights v0.0.1 · subject to community revision

Seven axes, each in [0, 1], weighted to sum to 1.0. We report every component alongside the headline score so you can re-aggregate with your own weights.

| Axis | Weight | What it measures |
|------|--------|------------------|
| AUROC | 25% | headline detection accuracy |
| Eval-aware | 18% | confound-corrected AUROC |
| Dist-shift | 12% | long-context / OOD generalization |
| Calibration | 10% | expected calibration error |
| Transfer | 10% | Pearson_CE across models |
| Latency | 10% | 1 / log(latency_ms) |
| License | 5% | openness · commercial-friendliness |
Read the full methodology · No axis > 30% weight · license bounded · eval-awareness dampening
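Since every component is reported alongside the headline score, re-aggregating with your own weights can be sketched as a weighted average. The weights below are the ones listed on this page; as listed they total 0.90 rather than 1.0, so this sketch divides by the total weight, which is an assumption and not something the methodology confirms. The function name `probe_score` and the component values are hypothetical, and the license cap and eval-awareness dampening are left out.

```python
# Weights as listed on this page (v0.0.1). They total 0.90 as listed.
WEIGHTS_V001 = {
    "auroc": 0.25,
    "eval_aware": 0.18,
    "dist_shift": 0.12,
    "calibration": 0.10,
    "transfer": 0.10,
    "latency": 0.10,
    "license": 0.05,
}


def probe_score(components, weights=WEIGHTS_V001):
    """Weighted average of per-axis scores, each expected in [0, 1].

    Assumption: the result is normalized by the total weight so that a
    probe scoring 1.0 on every axis gets a headline score of 1.0.
    """
    missing = sorted(set(weights) - set(components))
    if missing:
        raise KeyError(f"missing component scores: {missing}")
    for axis in weights:
        if not 0.0 <= components[axis] <= 1.0:
            raise ValueError(f"{axis} must be in [0, 1], got {components[axis]}")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[axis] * components[axis] for axis in weights)
    return weighted_sum / total_weight


# Hypothetical component scores for one probe, for illustration only.
components = {
    "auroc": 0.90,
    "eval_aware": 0.80,
    "dist_shift": 0.70,
    "calibration": 0.85,
    "transfer": 0.60,
    "latency": 0.95,
    "license": 1.00,
}
print(round(probe_score(components), 3))
```

Passing a different `weights` dict re-aggregates the same components under another weighting without touching the underlying evaluations.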
Vertical · CZI Biohub $500M (Apr 29)

Applications

Where these probes apply in practice. CZI Biohub committed $500M to AI-biology on Apr 29 2026 — medical AI is the most urgent vertical for fabrication detection.

Submit your probe

Two paths — pick whichever fits your workflow. Both end in the same place: a sklearn-compatible artifact + hashed test set + Colab reproducer registered on the leaderboard.

  • GitHub PR: open a PR with the artifact + spec_version 0.0.1
  • Agent-callable: openinterp-mcp > publish_probe() auto-opens the PR + mints a Zenodo DOI + creates an HF dataset
  • Artifact format: predict_proba(X) -> (n, 2) + StandardScaler + metadata block
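One way to satisfy the artifact shape described above is a scikit-learn pipeline that bundles a StandardScaler with a linear classifier, so that `predict_proba(X)` returns an `(n, 2)` array. This is a sketch under assumptions: the random features stand in for residual-stream activations, logistic regression is just one possible classifier, and every metadata field other than `spec_version` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 200 "activation" vectors of width 16 with binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = (X[:, 0] > 0).astype(int)

# StandardScaler + classifier in one object, so scaling ships with the probe.
probe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
probe.fit(X, y)

# The contract from the list above: predict_proba(X) -> (n, 2).
proba = probe.predict_proba(X[:5])
print(proba.shape)

# Hypothetical metadata block; only spec_version comes from this page.
metadata = {
    "spec_version": "0.0.1",
    "n_features": int(X.shape[1]),
    "classes": [int(c) for c in probe.classes_],
}
print(metadata)
```

Keeping the scaler inside the pipeline means whoever scores the probe feeds it raw activations and cannot accidentally skip or re-fit the normalization step.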

Score a probe

Two ways to score: the Python SDK (deterministic, no GPU needed for evaluation metadata) or the MCP server (agent runs the full eval suite against your own Colab compute, returns a ProbeScore card identical to the leaderboard).

Python SDK
pip install -U openinterp

from openinterp import probebench
probe = probebench.load("openinterp/fabricationguard-qwen36-27b-l31-v2")
print(probebench.score(probe))

Via your agent (openinterp-mcp)
# in Claude Code / Cursor / Cline session:
/colab-attach https://abc123.ngrok-free.app
> score probebench:openinterp/fabricationguard-qwen36-27b-l31-v2
  on tasks: haluval-qa, simpleqa