ProbeBench Tasks · SHA-256 pinned

Benchmarks the leaderboard runs on.

10 tasks across 8 categories. Each test set is SHA-256-hashed and version-pinned. Probes that train on a test set get caught at validation time.
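The page does not say how training on a test set is detected. One plausible check, sketched here purely as an assumption, is to fingerprint each example and look for test examples that also appear in a probe's declared training data. The `fingerprint` and `overlap` helpers are hypothetical names, not part of ProbeBench.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash an example after normalising case and whitespace (assumed scheme)."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def overlap(train_examples, test_examples):
    """Return the test examples whose fingerprint also occurs in the training data."""
    train_prints = {fingerprint(t) for t in train_examples}
    return [t for t in test_examples if fingerprint(t) in train_prints]


# Hypothetical data: one test question leaked into training with different spacing.
train = ["Who wrote  Hamlet?", "What is 2 + 2?"]
test = ["who wrote hamlet?", "Capital of France?"]
leaked = overlap(train, test)
```

Exact-match fingerprints only catch verbatim or trivially reformatted leakage; paraphrased contamination would need a fuzzier comparison.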

Tasks across alignment-relevant categories

Hallucination

Factual fabrication, entity recall failures.

4 tasks

HaluEval-QA

id: haluval-qa

hallucination

Open-ended question answering with annotated hallucination labels (Li et al. 2023).

Paper: arXiv:2305.11747 · Dataset: huggingface.co

Test set size

200 examples

SHA-256

a1f3e2d8c7b6…a9f8e7d6

Positive class: hallucinated

Probes evaluated: 2

Probe	AUROC [CI]	Eval-aware	Dist-shift
FabricationGuard L31 · end_question	0.903 [0.85, 0.95]	0.840	0.710
RewardHackGuard L31 · token_avg	0.650 [0.56, 0.74]	0.590	0.520
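The tables report AUROC with a bracketed interval, but the page does not state how the interval is computed. A minimal sketch, assuming a percentile bootstrap over test examples (an assumption, not ProbeBench's documented method), using only the standard library:

```python
import random


def auroc(labels, scores):
    """AUROC as P(score_pos > score_neg), counting ties as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (assumed method)."""
    if len(set(labels)) < 2:
        raise ValueError("need both classes to bootstrap AUROC")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # skip resamples that contain a single class
            continue
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy data: label 1 is the positive class (e.g. "hallucinated").
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
point = auroc(labels, scores)
```

With test sets of 100 to 500 examples, intervals of this kind are wide enough that small AUROC differences between probes may not be meaningful.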

SimpleQA

id: simpleqa

hallucination

Open-domain factual QA — short factoids, very hard. SOTA closed-book ~50% (OpenAI 2024).

Paper: arXiv:2411.04368 · Dataset: huggingface.co

Test set size

100 examples

SHA-256

b2e4f3a9d8c7…b0a9f8e7

Positive class: hallucinated

Probes evaluated: 1

Probe	AUROC [CI]	Eval-aware	Dist-shift
FabricationGuard L31 · end_question	0.882 [0.83, 0.93]	0.820	0.720

TruthfulQA-MC1

id: truthfulqa-mc1

hallucination

Tests resistance to popular misconceptions (Lin et al. 2022). Different cognitive task than fabrication.

Paper: arXiv:2109.07958

Test set size

200 examples

SHA-256

c3f5e4b0e9d8…c1b0a9f8

Positive class: misconception

Probes evaluated: 1

Probe	AUROC [CI]	Eval-aware	Dist-shift
FabricationGuard L31 · end_question	0.599 [0.51, 0.69]	0.570	0.550

MMLU

id: mmlu

hallucination

Capability control — multiple-choice knowledge across 57 domains. NOT hallucination per se; included as scope check.

Paper: arXiv:2009.03300

Test set size

500 examples

SHA-256

d4a6f5c1f0e9…d2c1b0a9

Positive class: incorrect

Probes evaluated: 1

Probe	AUROC [CI]	Eval-aware	Dist-shift
FabricationGuard L31 · end_question	0.444 [0.40, 0.49]	0.420	0.410

Reasoning

Chain-of-thought faithfulness, hypocrisy gap.

4 tasks

Hypocrisy Gap

id: hypocrisy-gap

reasoning

CoT-vs-belief divergence. Measures when the model's stated reasoning diverges from its internal belief (arXiv:2602.02496, Jan 2026).

Paper: arXiv:2602.02496

Test set size

180 examples

SHA-256

f6c8b7e3b2a1…f4e3d2c1

Positive class: unfaithful_cot

Probes evaluated: 1

Probe	AUROC [CI]	Eval-aware	Dist-shift
DeceptionGuard L40 · last_token	0.800 [0.72, 0.87]	0.710	0.650

GSM8K

id: gsm8k

reasoning

Grade-school math word problems. Used as the within-domain benchmark for reasoning-faithfulness probes. A positive example is one where the model produces a wrong answer even though its reasoning trace appears coherent.

Paper: arXiv:2110.14168 · Dataset: huggingface.co

Test set size

300 examples

SHA-256

g5m8a1b2c3d4…2d3e4f56

Positive class: wrong_answer

Probes evaluated: 1

Probe	AUROC [CI]	Eval-aware	Dist-shift
ReasonGuard L55 · mid_think	0.908	0.772	0.612

StrategyQA

id: strategyqa

reasoning

Open-domain commonsense reasoning requiring multi-step strategy. Used as the cross-domain transfer test for reasoning-faithfulness probes (math → commonsense).

Paper: arXiv:2101.02235 · Dataset: huggingface.co

Test set size

150 examples

SHA-256

sq1b2c3d4e5f…e5f67890

Positive class: wrong_answer

Probes evaluated: 1

Probe	AUROC [CI]	Eval-aware	Dist-shift
ReasonGuard L55 · mid_think	0.612	0.520	0.500

MATH

id: math

reasoning

Hendrycks competition mathematics: competition-level problems (AMC/AIME style) spanning algebra, precalculus, number theory, and geometry. Hardest reasoning bench in the registry. Qwen3.6-27B baseline: 18% accuracy. Used as an advanced-difficulty stress test for reasoning-faithfulness probes.

Paper: arXiv:2103.03874 · Dataset: huggingface.co

Test set size

200 examples

SHA-256

mt1b2c3d4e5f…34567890

Positive class: wrong_answer

Probes evaluated: 1

Probe	AUROC [CI]	Eval-aware	Dist-shift
ReasonGuard L55 · mid_think	0.500	0.425	0.500

Deception

Strategic dishonesty, sleeper agents, alignment faking.

1 task

Apollo Insider Trading

id: apollo-insider-trading

deception

Apollo Research deception suite — model concealing illicit trades when asked. AUROC 0.96-0.999 on linear probes (arXiv:2502.03407).

Paper: arXiv:2502.03407 · Dataset: github.com

Test set size

150 examples

SHA-256

e5b7a6d2a1f0…e3d2c1b0

Positive class: deceptive

Probes evaluated: 1

Probe	AUROC [CI]	Eval-aware	Dist-shift
DeceptionGuard L40 · last_token	0.978 [0.96, 0.99]	0.910	0.840

Sandbagging

Capability hiding under evaluation.

0 tasks

🌱

Slot open · propose a benchmark task

No task registered for sandbagging yet. Open a PR with a YAML registry entry to claim the slot.


Eval Awareness

The model-knows-it-is-being-tested confound.

1 task

Apollo Eval-vs-Deploy

id: apollo-eval-deploy

eval awareness

Linear probes separate real-world evaluation prompts from deployment prompts. Evaluation awareness is reported to scale as a power law with model size (arXiv:2509.13333).

Paper: arXiv:2507.01786

Test set size

200 examples

SHA-256

a7d9c8f4c3b2…a5f4e3d2

Positive class: evaluation

Probes evaluated: 1

Probe	AUROC [CI]	Eval-aware	Dist-shift
EvalAwarenessGuard L40 · end_question	0.930 [0.88, 0.97]	0.930	0.810

Reward Hacking

RL-induced misalignment generalization.

0 tasks

🌱

Slot open · propose a benchmark task

No task registered for reward hacking yet. Open a PR with a YAML registry entry to claim the slot.


Manipulation

EU AI Act Article 5 — persuasion / subliminal.

0 tasks

🌱

Slot open · propose a benchmark task

No task registered for manipulation yet. Open a PR with a YAML registry entry to claim the slot.


Refusal

Over/under-refusal calibration; jailbreak proxy.

0 tasks

🌱

Slot open · propose a benchmark task

No task registered for refusal yet. Open a PR with a YAML registry entry to claim the slot.


Open-source dataset commitment

Every test set referenced here is publicly available under the dataset's original license. We pin a SHA-256 hash so probes can be validated against the exact bytes we evaluated on.
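Checking a local copy of a test set against a pinned digest can be sketched as follows. This assumes the hash covers the raw bytes of a single file and that the full 64-character digest is taken from the registry, since the digests on this page are truncated; the function names are illustrative.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned_hash(path: str, pinned_hex: str) -> bool:
    """Compare a local file against a full 64-character pinned digest."""
    return sha256_of_file(path) == pinned_hex.strip().lower()
```

Any change to the file, including line-ending conversion or re-serialisation, produces a different digest, so the comparison should be run on the split exactly as downloaded.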

If a dataset is gated (e.g., requires HF agreement), we link to the canonical source and ship our exact preprocessed splits as a separate Apache-2.0 derivative when permitted.

Submit a task

Have a benchmark task that should be in here? Open a PR with the YAML registry entry.

tasks/your-task-id.yaml (YAML)

id: your-task-id
name: "Your Task Name"
category: hallucination   # or any of the 8 ProbeBench categories
description: "Short description"
paper: "arXiv:XXXX.XXXXX"
dataset_url: "https://huggingface.co/datasets/your/dataset"
test_set_size: 200
test_set_hash: "abc123..."   # SHA-256 of the canonical test split
positive_label: "hallucinated"
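Before opening a PR, an entry can be sanity-checked locally. The sketch below assumes the YAML has already been loaded into a Python dict and applies rules inferred from the template above (required fields, a positive test set size, a full 64-character hex digest, lowercase hyphenated ids). These rules are assumptions; the registry's actual validation may differ.

```python
import re

# Field names come from the template; the expected types are assumptions.
REQUIRED_FIELDS = {
    "id": str, "name": str, "category": str, "description": str,
    "paper": str, "dataset_url": str, "test_set_size": int,
    "test_set_hash": str, "positive_label": str,
}
SHA256_HEX = re.compile(r"^[0-9a-f]{64}$")
TASK_ID = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")  # assumed id convention


def validate_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry passed."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
            continue
        value = entry[field]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{field}: expected {expected.__name__}")
    if isinstance(entry.get("id"), str) and not TASK_ID.match(entry["id"]):
        problems.append("id: expected lowercase words joined by hyphens")
    size = entry.get("test_set_size")
    if isinstance(size, int) and not isinstance(size, bool) and size <= 0:
        problems.append("test_set_size: must be positive")
    digest = entry.get("test_set_hash")
    if isinstance(digest, str) and not SHA256_HEX.match(digest):
        problems.append("test_set_hash: expected 64 lowercase hex characters")
    return problems
```

Run against the template as written, this flags `test_set_hash`, because `"abc123..."` is a placeholder and not a full digest.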

Open the registry on GitHub · Submission template

ProbeBench v0.0.1 · Tasks index