ProbeBench v0.0.1 · Open Standard · Apache-2.0 reproducers

The first categorical leaderboard for activation probes.

Eight alignment-relevant categories. Seven axes weighted into a single ProbeScore. Every entry ships with a Colab reproducer, an SHA-256-hashed test set, and Pearson_CE cross-model transfer numbers. We are publishing the slot we want our own probes to compete in.

New: every probe here is now agent-callable via openinterp-mcp — any Claude Code / Cursor / Cline session can score or audit a probe against your own Colab compute.

Probes registered

seed entries · open to community PRs

Why this exists

Probing the residual stream is one of the most reproducible interpretability primitives we have — and one of the easiest to mis-report. ProbeScore composes the three corrections most likely to surface the difference between a real probe and a confound.

Single AUROC is not enough

Published probes hit 0.96+ on Apollo deception suites, but break under long-context distribution shift (DeepMind 2601.11516) and inflate when test prompts trigger eval-awareness (arXiv 2509.13333). ProbeScore folds both corrections into the composite so the headline number is harder to oversell.
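The gap between an in-distribution AUROC and a shifted one is easy to reproduce on toy data. The sketch below is only an illustration: the Gaussian scores, the `synthetic_split` helper, and the separation values are assumptions, not ProbeBench data or the ProbeBench correction itself. It computes AUROC from first principles on two splits of the same hypothetical probe.

```python
import random


def auroc(scores, labels):
    """AUROC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def synthetic_split(rng, n_per_class, separation):
    """Toy probe scores: negatives ~ N(0, 1), positives ~ N(separation, 1)."""
    scores = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    scores += [rng.gauss(separation, 1.0) for _ in range(n_per_class)]
    labels = [0] * n_per_class + [1] * n_per_class
    return scores, labels


rng = random.Random(0)  # fixed seed so the toy numbers are repeatable
# Assumed separations: large in distribution, small after the shift.
in_dist_auroc = auroc(*synthetic_split(rng, 300, separation=2.5))
shifted_auroc = auroc(*synthetic_split(rng, 300, separation=0.5))
print(f"in-distribution AUROC: {in_dist_auroc:.3f}")
print(f"shifted AUROC:         {shifted_auroc:.3f}")
```

Reporting only the first number would hide the drop, which is why a single AUROC is treated here as one component rather than the whole score.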

Probes don't transfer for free

Naive cosine alignment misses 38% of causal direction across crosscoders. Pearson_CE, the paired-ablation correlation between source and target models, is the honest transfer metric. Every cross-model entry on this page ships one.
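A minimal sketch of how such a paired correlation could be computed. It assumes Pearson_CE is the Pearson correlation between per-prompt cross-entropy changes measured after ablating the probe direction in the source model and in the target model. That reading, the `pearson_ce` name, and the toy numbers are assumptions for illustration, not the ProbeBench implementation.

```python
import math


def pearson_ce(delta_ce_source, delta_ce_target):
    """Pearson correlation between paired per-prompt ablation effects."""
    if len(delta_ce_source) != len(delta_ce_target):
        raise ValueError("effects must be paired prompt by prompt")
    n = len(delta_ce_source)
    if n < 2:
        raise ValueError("need at least two paired prompts")
    mean_s = sum(delta_ce_source) / n
    mean_t = sum(delta_ce_target) / n
    cov = sum(
        (s - mean_s) * (t - mean_t)
        for s, t in zip(delta_ce_source, delta_ce_target)
    )
    var_s = sum((s - mean_s) ** 2 for s in delta_ce_source)
    var_t = sum((t - mean_t) ** 2 for t in delta_ce_target)
    if var_s == 0 or var_t == 0:
        raise ValueError("correlation is undefined for constant effects")
    return cov / math.sqrt(var_s * var_t)


# Hypothetical increase in cross-entropy on each prompt after ablation.
source_effects = [0.10, 0.40, 0.35, 0.05, 0.60]
target_effects = [0.08, 0.30, 0.33, 0.02, 0.55]
transfer_score = pearson_ce(source_effects, target_effects)
print(f"Pearson_CE = {transfer_score:.3f}")
```

Unlike a cosine between two direction vectors, this compares what the ablation does to each model on the same prompts, so it only scores high when the behavioral effect carries over.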

Anti-Goodhart by construction

No axis weighted above 30%. License score capped for closed-weight probes. Eval-awareness component dampens when calibration is missing. Hard to game by optimizing one number, easy to PR if you disagree with the weights.

Global Leaderboard

Sorted by ProbeScore across all 8 categories. Click any row for the full Probe DNA card — eval table, Pearson_CE matrix, license, artifact hash.

5 entries · ProbeScore v0.0.1

#	Probe	Author	Category	Model	AUROC	Latency (ms)	License	ProbeScore	Artifact
1	EvalAwarenessGuard	Caio Vicentino · OpenInterp	Eval-Aware	Qwen3.6-27B · L40	0.930	1.0	Apache-2.0	0.848	HF
2	DeceptionGuard	Caio Vicentino (re-impl) / Apollo Research (method) · OpenInterp	Deception	Llama-3.3-70B-Instruct · L40	0.978	2.1	Apache-2.0	0.785	HF
3	FabricationGuard	Caio Vicentino · OpenInterp	Hallucination	Qwen3.6-27B · L31	0.903	1.0	Apache-2.0	0.662	HF
4	RewardHackGuard	Caio Vicentino · OpenInterp	Reward-Hack	Qwen3.6-27B · L31	0.650	1.8	Apache-2.0	0.645	HF
5	ReasonGuard	Caio Vicentino · OpenInterp	Reasoning	Qwen3.6-27B · L55	0.908	1.0	Apache-2.0	0.626	HF

By category

Each alignment-relevant slot has its own table, its own held-out test set, and its own empty rows for probes that haven't been submitted yet.

Hallucination

Factual fabrication, entity recall failures.

4 tasks · 1 probe

#	Probe	Author	Model	AUROC	Latency (ms)	License	ProbeScore	Artifact
1	FabricationGuard	Caio Vicentino · OpenInterp	Qwen3.6-27B · L31	0.903	1.0	Apache-2.0	0.662	HF

Reasoning

Chain-of-thought faithfulness, hypocrisy gap.

4 tasks · 1 probe

#	Probe	Author	Model	AUROC	Latency (ms)	License	ProbeScore	Artifact
1	ReasonGuard	Caio Vicentino · OpenInterp	Qwen3.6-27B · L55	0.908	1.0	Apache-2.0	0.626	HF

Deception

Strategic dishonesty, sleeper agents, alignment faking.

1 task · 1 probe

#	Probe	Author	Model	AUROC	Latency (ms)	License	ProbeScore	Artifact
1	DeceptionGuard	Caio Vicentino (re-impl) / Apollo Research (method) · OpenInterp	Llama-3.3-70B-Instruct · L40	0.978	2.1	Apache-2.0	0.785	HF

Sandbagging

Capability hiding under evaluation.

0 tasks · 0 probes

🌱

Slot open · be the first to submit

No probe registered for sandbagging yet. PR an artifact + reproducer notebook to claim row 1.

Open the submission template

Eval Awareness

The confound where the model knows it is being tested.

1 task · 1 probe

#	Probe	Author	Model	AUROC	Latency (ms)	License	ProbeScore	Artifact
1	EvalAwarenessGuard	Caio Vicentino · OpenInterp	Qwen3.6-27B · L40	0.930	1.0	Apache-2.0	0.848	HF

Reward Hacking

RL-induced misalignment generalization.

0 tasks · 1 probe

#	Probe	Author	Model	AUROC	Latency (ms)	License	ProbeScore	Artifact
1	RewardHackGuard	Caio Vicentino · OpenInterp	Qwen3.6-27B · L31	0.650	1.8	Apache-2.0	0.645	HF

Manipulation

EU AI Act Article 5: persuasion and subliminal techniques.

0 tasks · 0 probes

🌱

Slot open · be the first to submit

No probe registered for manipulation yet. PR an artifact + reproducer notebook to claim row 1.

Open the submission template

Refusal

Over/under-refusal calibration; jailbreak proxy.

0 tasks · 0 probes

🌱

Slot open · be the first to submit

No probe registered for refusal yet. PR an artifact + reproducer notebook to claim row 1.

Open the submission template

ProbeScore axes

Weights v0.0.1 · subject to community revision

Seven axes, each in [0, 1], weighted to sum to 1.0. We report every component alongside the headline score so you can re-aggregate with your own weights.

Axis	Weight	What it measures
AUROC	25%	headline detection accuracy
Eval-aware	18%	confound-corrected AUROC
Dist-shift	12%	long-context / OOD generalization
Calibration	10%	expected calibration error
Transfer	10%	Pearson_CE across models
Latency	10%	1 / log(latency_ms)
License	5%	openness · commercial-friendliness

Read the full methodology · No axis > 30% weight · license bounded · eval-awareness dampening
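Re-aggregating with your own weights can be sketched as a plain weighted mean. This is not the official ProbeScore formula: it leaves out the license cap and eval-awareness dampening, and the function name, axis keys, and component values are hypothetical. The seven weights as printed here add up to 0.90, so the sketch divides by the weight total rather than assuming it is 1.0.

```python
# Weights as printed in the axes list (v0.0.1); they total 0.90.
WEIGHTS_V001 = {
    "auroc": 0.25,
    "eval_aware": 0.18,
    "dist_shift": 0.12,
    "calibration": 0.10,
    "transfer": 0.10,
    "latency": 0.10,
    "license": 0.05,
}
MAX_AXIS_SHARE = 0.30  # "no axis > 30% weight"


def reaggregate(components, weights=WEIGHTS_V001):
    """Weighted mean of per-axis scores, normalized by the weight total."""
    if set(components) != set(weights):
        raise ValueError("components and weights must name the same axes")
    for axis, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{axis} must be in [0, 1], got {value}")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    if max(weights.values()) / total > MAX_AXIS_SHARE:
        raise ValueError("no axis may carry more than 30% of the total weight")
    return sum(weights[axis] * components[axis] for axis in weights) / total


# Hypothetical component scores, not taken from any leaderboard entry.
example_components = {
    "auroc": 0.90,
    "eval_aware": 0.80,
    "dist_shift": 0.70,
    "calibration": 0.85,
    "transfer": 0.60,
    "latency": 0.95,
    "license": 1.00,
}
print(f"re-aggregated score: {reaggregate(example_components):.3f}")
```

Passing a different `weights` dictionary re-scores the same components, and the 30% check rejects weightings that let one axis dominate.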

Vertical · CZI Biohub $500M (Apr 29)

Applications

Where these probes apply in practice. CZI Biohub committed $500M to AI-biology on Apr 29 2026 — medical AI is the most urgent vertical for fabrication detection.

ProbeBench × Medical AI

Submit your probe

Two paths — pick whichever fits your workflow. Both end in the same place: a sklearn-compatible artifact + hashed test set + Colab reproducer registered on the leaderboard.

· GitHub PR — open a PR with the artifact + spec_version 0.0.1
· Agent-callable — openinterp-mcp > publish_probe() auto-opens the PR + mints a Zenodo DOI + creates an HF dataset
· Artifact spec (either path): predict_proba(X) -> (n, 2) + StandardScaler + metadata block
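Two of the pieces named above can be sketched with the standard library alone: hashing the test set and checking that a probe's output has the (n, 2) probability shape. Only `spec_version` comes from this page. The other metadata field names, the `check_proba` helper, and the toy examples are hypothetical, and a real artifact would wrap fitted scikit-learn objects rather than the stand-in probabilities used here.

```python
import hashlib
import json
import math


def sha256_of_bytes(data: bytes) -> str:
    """Hex SHA-256 digest used to pin the exact test set."""
    return hashlib.sha256(data).hexdigest()


def check_proba(rows, n_expected):
    """Validate a predict_proba-style output: n rows of two probabilities."""
    if len(rows) != n_expected:
        raise ValueError(f"expected {n_expected} rows, got {len(rows)}")
    for row in rows:
        if len(row) != 2:
            raise ValueError("each row must hold exactly two class probabilities")
        if any(p < 0.0 or p > 1.0 for p in row):
            raise ValueError("probabilities must lie in [0, 1]")
        if not math.isclose(sum(row), 1.0, abs_tol=1e-6):
            raise ValueError("each row must sum to 1")
    return True


# Toy test set serialized as JSON Lines; a real one would be read from disk.
examples = [
    {"prompt": "example A", "label": 0},
    {"prompt": "example B", "label": 1},
]
test_set_bytes = "\n".join(
    json.dumps(example, sort_keys=True) for example in examples
).encode("utf-8")

metadata = {
    "spec_version": "0.0.1",  # value named in the submission text
    "test_set_sha256": sha256_of_bytes(test_set_bytes),  # hypothetical field name
    "n_test_examples": len(examples),  # hypothetical field name
}

# Stand-in for probe.predict_proba(X) on the two examples above.
stand_in_probabilities = [[0.9, 0.1], [0.2, 0.8]]
check_proba(stand_in_probabilities, n_expected=len(examples))
print(json.dumps(metadata, indent=2))
```

Hashing the serialized bytes rather than the parsed records means any change to the file, including reordering, produces a different digest.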

Open submission template · Submit via your agent · Artifact spec

Score a probe

Two ways to score: the Python SDK (deterministic, no GPU needed for evaluation metadata) or the MCP server (your agent runs the full eval suite against your own Colab compute and returns a ProbeScore card identical to the one on the leaderboard).

Python SDK

pip install -U openinterp

from openinterp import probebench
probe = probebench.load("openinterp/fabricationguard-qwen36-27b-l31-v2")
print(probebench.score(probe))

Via your agent — openinterp-mcp

# in a Claude Code / Cursor / Cline session:
/colab-attach https://abc123.ngrok-free.app
> score probebench:openinterp/fabricationguard-qwen36-27b-l31-v2
  on tasks: haluval-qa, simpleqa

MCP architecture · SDK reference · PyPI