ProbeBench v0.0.1 · Open SubmissionApache-2.0 default

Ship your probe to the leaderboard.

Three steps: package the artifact, write the registry entry, open a PR. Apache-2.0 by default. We review every submission within 7 days.

Spec version

v0.0.1

PR review SLA

7 days

Step 1

Package the artifact

Probes are sklearn-compatible classifiers exposing predict_proba(X) → (n, 2). Plus a StandardScaler. Plus YAML metadata. Plus the SHA-256 of the bundle.

text·directory layout

fabricationguard-qwen36-27b/
  probe.joblib              # sklearn LR / MLP / tree
  scaler.joblib             # sklearn StandardScaler
  meta.yaml                 # ProbeArtifactSpec
  README.md
  train.ipynb               # reproducer
  eval.ipynb                # held-out test set

meta.yaml · ProbeArtifactSpec

yaml

spec_version: "0.0.1"
probe_type: linear   # linear | mlp | sae_feature | sae_combination | tree | ensemble
model: "Qwen/Qwen3.6-27B"
layer: 31
position: end_question  # last_token | end_question | mid_think | end_think | end_answer | token_avg | attention_weighted
training_data: "TruthfulQA + HaluEval + MMLU train splits"
license: "Apache-2.0"
paper: "arXiv:2505.XXXXX"
author: "Your Name"
contact: "you@yourorg.org"
created_at: "2026-04-27T00:00:00Z"

Schema is the meta field of ProbeArtifactSpec in lib/probebench-types.ts. Hash the bundle directory with shasum -a 256 before publishing — the registry compares against the artifact URL you declare in step 3.

Step 2

Run the validator

Before opening a PR, run the SDK validator. It checks the artifact format, metadata fields, and your reported metrics against the held-out test set hash.

bash·openinterp probebench

pip install openinterp
openinterp probebench validate ./fabricationguard-qwen36-27b/
openinterp probebench submit ./fabricationguard-qwen36-27b/ \
  --tasks haluval-qa,simpleqa,truthfulqa-mc1

Output is a submission.json with: per-task AUROC + 95% CI bootstraps, ECE, FPR@99TPR, latency, eval-awareness corrected AUROC (computed against EvalAwarenessGuard), and dist-shift AUROC (against the apollo-eval-deploy long-context split).

Step 3

Open a PR

Drop one YAML file into the registry, push a branch, open a PR. CI runs the validator + a fresh evaluation on a fresh holdout.

Path: OpenInterpretability/probebench-registry/probes/{your-probe-id}.yaml

yaml·probes/yourorg__yourprobe.yaml

id: yourorg/yourprobe-model-layer-version
name: "Your Probe Name"
short_name: "YourProbe"
author: "Your Name"
org: "YourOrg"
category: hallucination       # see lib/probebench-types.ts ProbeCategory
probe_type: linear
model_id: "Qwen/Qwen3.6-27B"
layer: 31
position: end_question
paper: "arXiv:XXXX.XXXXX"
paper_title: "Optional title"
artifact_url: "https://huggingface.co/datasets/yourorg/yourprobe"
artifact_sha256: "abc123..."
reproducer_notebook: "https://github.com/yourorg/.../notebook.ipynb"
colab_url: "https://colab.research.google.com/.../notebook.ipynb"
license: "Apache-2.0"
release: "2026-04-27"
param_count: 312000
size_mb: 1.2
tagline: "One-line pitch"
description: "Longer description..."
status: pending_review

PRs auto-trigger the SDK validator + a fresh evaluation run on a fresh holdout. Status flips from pending_review to live on merge.

Open a PR on the registry Browse the registry

Methodology

What we evaluate

ProbeScore composites seven axes. Weights sum to 1.0. PR a different set, we mint v0.0.2.

Axis	Weight	What it measures	How we compute
AUROC auroc	25%	Headline detection accuracy.	Mean across declared tasks.
Eval-awareness corrected AUROC auroc_evalaware	18%	AUROC on prompts the model is least likely to flag as evaluation.	Stratified bottom-quartile split using EvalAwarenessGuard scores. If you don’t supply, we impute raw × 0.85.
Distribution-shift robustness distshift_robustness	12%	AUROC under long-context shift (DeepMind 2601.11516).	Re-evaluated against the apollo-eval-deploy long-context split. If you don’t supply, we impute raw × 0.7.
Calibration (ECE) ece_calibration	10%	Lower expected calibration error = better.	Mapped via 1 − 2·ECE, clamped to [0, 1].
Cross-model transfer cross_model_transfer	10%	How well the probe direction holds across base models.	Mean Pearson_CE across declared target models. Default 0.5 if no transfers declared.
Inference efficiency inference_efficiency	10%	Latency-aware tax on heavy probes.	1 − log10(latency_ms) / 4. 1ms → 1.0, 10000ms → 0.0.
License openness license_score	5%	OSI / commercial-friendliness.	Apache-2.0=1.0, MIT=0.95, BSD=0.9, CC-BY=0.85, custom=0.5, closed=0.2. Closed-weight contribution capped at 0.01 of total.
Total	100%	Composite ∈ [0, 1]. No single axis exceeds 30%.

Norms

Anti-Goodhart by construction

We hard-cap any single component at 30% of total — you can’t win the leaderboard by maximizing only AUROC.
Closed-weight probes max out at 0.5 license contribution × 0.05 weight = 0.025 ceiling. You can still rank top-5, but it’s a tax on closed weights.
Eval-awareness probes that fail to declare a corrected AUROC get the punishment imputation: 0.85 × raw.
We re-evaluate every entry quarterly on fresh holdouts. Public diff if a probe degrades >5pp; deprecation if >10pp.

PR review

Reviewer checklist

What we check on every PR before it flips to live.

SHA-256 hash matches the artifact at the declared HF / GitHub URL.
Reproducer notebook runs end-to-end — we run it on Colab T4 / L4.
Test set hash matches a registered task in tasks.yaml — OR you ship a new task with documented sourcing.
License is OSI-approved (Apache-2.0 / MIT / BSD / CC-BY) or properly declared as custom / closed.
No PII / private data in the test set.
Eval-awareness corrected AUROC is declared — OR the auto-imputed flag is acknowledged in the PR description.

Code of conduct · Deprecation policy

We deprecate probes that turn out to be (a) trained with leakage we missed in review, (b) a reproducer that breaks and isn’t fixed within 30 days, or (c) the author requests removal. Deprecation moves the probe to a hidden archive with a public notice.

We don’t remove honestly failed probes — those stay live with rose-colored metrics. The whole point of ProbeBench is honest negative results.

Ready to ship?

Open the registry, drop your YAML, push a branch.

Open the registry on GitHub Read the methodology