ProbeBench Tasks · SHA-256 pinned

Benchmarks the leaderboard runs on.

10 tasks across 8 categories. Each test set is SHA-256-hashed and version-pinned. Probes that train on a test set get caught at validation time.
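The page does not say how train-on-test is detected. One plausible mechanism, sketched here purely as an assumption, is to fingerprint every test example with SHA-256 and flag any training example with a matching fingerprint. The function names and the whitespace/case normalization are hypothetical and are not ProbeBench's actual validator.

```python
import hashlib


def fingerprint(text: str) -> str:
    """SHA-256 of a normalized example (the normalization is an assumption)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_contamination(train_examples, test_examples):
    """Return training examples whose fingerprint also appears in the test set."""
    test_prints = {fingerprint(t) for t in test_examples}
    return [ex for ex in train_examples if fingerprint(ex) in test_prints]


# Toy strings, not drawn from any ProbeBench split.
train = ["What is the capital of France?", "Who wrote Hamlet?"]
test = ["who wrote  hamlet?", "What is the boiling point of water?"]
leaked = find_contamination(train, test)
print(leaked)  # ['Who wrote Hamlet?']
```

Exact-match fingerprints only catch verbatim (or trivially reformatted) leakage; paraphrased test items would need a fuzzier comparison.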

Tasks: 10 (across alignment-relevant categories)
Categories: 8 (hallucination → refusal)
Total test examples: 2,180 (held-out, hashed splits)
Active probes: 5 (evaluated on these tasks)

Hallucination

Factual fabrication, entity recall failures.

4 tasks

HaluEval-QA

id: haluval-qa
hallucination

Open-ended question answering with annotated hallucination labels (Li et al. 2023).

Test set size: 200 examples
SHA-256: a1f3e2d8c7b6…a9f8e7d6
Positive class: hallucinated
Probes evaluated: 2

Probe | AUROC [CI] | Eval-aware | Dist-shift
FabricationGuard (L31 · end_question) | 0.903 [0.85, 0.95] | 0.840 | 0.710
RewardHackGuard (L31 · token_avg) | 0.650 [0.56, 0.74] | 0.590 | 0.520
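For readers unfamiliar with the AUROC [CI] column, the sketch below shows one common way such a figure can be produced: a pairwise AUROC plus a percentile bootstrap interval. Whether ProbeBench uses this exact procedure is an assumption; the function names, the 1,000 resamples, and the toy data are illustrative only.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC; resamples examples with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # skip resamples that lost one of the two classes
            continue
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy data, not ProbeBench results: 1 marks the positive class (e.g. hallucinated).
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.5, 0.1, 0.7, 0.2]
point = auroc(labels, scores)
low, high = bootstrap_ci(labels, scores)
print(f"{point:.3f} [{low:.2f}, {high:.2f}]")
```

The pairwise loop is quadratic in the number of examples, which is fine for test sets of a few hundred items like the ones listed here.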

SimpleQA

id: simpleqa
hallucination

Open-domain factual QA — short factoids, very hard. SOTA closed-book ~50% (OpenAI 2024).

Test set size: 100 examples
SHA-256: b2e4f3a9d8c7…b0a9f8e7
Positive class: hallucinated
Probes evaluated: 1

Probe | AUROC [CI] | Eval-aware | Dist-shift
FabricationGuard (L31 · end_question) | 0.882 [0.83, 0.93] | 0.820 | 0.720

TruthfulQA-MC1

id: truthfulqa-mc1
hallucination

Tests resistance to popular misconceptions (Lin et al. 2022). This is a different cognitive task from fabrication.

Test set size: 200 examples
SHA-256: c3f5e4b0e9d8…c1b0a9f8
Positive class: misconception
Probes evaluated: 1

Probe | AUROC [CI] | Eval-aware | Dist-shift
FabricationGuard (L31 · end_question) | 0.599 [0.51, 0.69] | 0.570 | 0.550

MMLU

id: mmlu
hallucination

Capability control — multiple-choice knowledge across 57 domains. NOT hallucination per se; included as scope check.

Test set size: 500 examples
SHA-256: d4a6f5c1f0e9…d2c1b0a9
Positive class: incorrect
Probes evaluated: 1

Probe | AUROC [CI] | Eval-aware | Dist-shift
FabricationGuard (L31 · end_question) | 0.444 [0.40, 0.49] | 0.420 | 0.410

Reasoning

Chain-of-thought faithfulness, hypocrisy gap.

4 tasks

Hypocrisy Gap

id: hypocrisy-gap
reasoning

CoT-vs-belief divergence. Measures cases where the model's stated reasoning diverges from its internal belief (arXiv:2602.02496, Jan 2026).

Test set size: 180 examples
SHA-256: f6c8b7e3b2a1…f4e3d2c1
Positive class: unfaithful_cot
Probes evaluated: 1

Probe | AUROC [CI] | Eval-aware | Dist-shift
DeceptionGuard (L40 · last_token) | 0.800 [0.72, 0.87] | 0.710 | 0.650

GSM8K

id: gsm8k
reasoning

Grade-school math word problems. Used as the within-domain benchmark for reasoning faithfulness. Hallucination here means the model produces a wrong answer even though its reasoning trace appears coherent.

Test set size: 300 examples
SHA-256: g5m8a1b2c3d4…2d3e4f56
Positive class: wrong_answer
Probes evaluated: 1

Probe | AUROC [CI] | Eval-aware | Dist-shift
ReasonGuard (L55 · mid_think) | 0.908 | 0.772 | 0.612

StrategyQA

id: strategyqa
reasoning

Open-domain commonsense reasoning requiring multi-step strategy. Used as the cross-domain transfer test for reasoning-faithfulness probes (math → commonsense).

Test set size: 150 examples
SHA-256: sq1b2c3d4e5f…e5f67890
Positive class: wrong_answer
Probes evaluated: 1

Probe | AUROC [CI] | Eval-aware | Dist-shift
ReasonGuard (L55 · mid_think) | 0.612 | 0.520 | 0.500
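The math → commonsense transfer described for StrategyQA can be made concrete with simple arithmetic on the AUROC values listed for ReasonGuard (0.908 on GSM8K, 0.612 on StrategyQA). The helper names `transfer_gap` and `above_chance` are hypothetical and are not ProbeBench metrics; this is only a sketch of how one might summarize the drop.

```python
def transfer_gap(in_domain_auroc: float, transfer_auroc: float) -> float:
    """Absolute AUROC lost when moving from the in-domain task to the transfer task."""
    return round(in_domain_auroc - transfer_auroc, 3)


def above_chance(auroc: float) -> float:
    """How far an AUROC sits above the 0.5 chance level."""
    return round(auroc - 0.5, 3)


# ReasonGuard AUROC values as listed for GSM8K and StrategyQA.
gsm8k, strategyqa = 0.908, 0.612
print(transfer_gap(gsm8k, strategyqa))  # 0.296
print(above_chance(strategyqa))         # 0.112
```

Read this way, the probe keeps only about 0.11 of AUROC above chance on the commonsense task, compared with about 0.41 on the math task it was built around.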

MATH

id: math
reasoning

Hendrycks competition mathematics: high-school competition problems (AMC/AIME style) covering algebra, precalculus, number theory, geometry, and counting and probability. Hardest reasoning bench in the registry. Qwen3.6-27B baseline: 18% accuracy. Used as the advanced-difficulty stress test for reasoning-faithfulness probes.

Test set size: 200 examples
SHA-256: mt1b2c3d4e5f…34567890
Positive class: wrong_answer
Probes evaluated: 1

Probe | AUROC [CI] | Eval-aware | Dist-shift
ReasonGuard (L55 · mid_think) | 0.500 | 0.425 | 0.500

Deception

Strategic dishonesty, sleeper agents, alignment faking.

1 task

Apollo Insider Trading

id: apollo-insider-trading
deception

Apollo Research deception suite — model concealing illicit trades when asked. AUROC 0.96-0.999 on linear probes (arXiv:2502.03407).

Test set size: 150 examples
SHA-256: e5b7a6d2a1f0…e3d2c1b0
Positive class: deceptive
Probes evaluated: 1

Probe | AUROC [CI] | Eval-aware | Dist-shift
DeceptionGuard (L40 · last_token) | 0.978 [0.96, 0.99] | 0.910 | 0.840

Sandbagging

Capability hiding under evaluation.

0 tasks
🌱
Slot open · propose a benchmark task
No task registered for sandbagging yet. Open a PR with a YAML registry entry to claim the slot.

Eval Awareness

The confound that arises when a model knows it is being tested.

1 task

Apollo Eval-vs-Deploy

id: apollo-eval-deploy
eval awareness

Linear probes separate real-world eval prompts from deployment prompts. Power-law scaling with model size (arXiv:2509.13333).

Test set size: 200 examples
SHA-256: a7d9c8f4c3b2…a5f4e3d2
Positive class: evaluation
Probes evaluated: 1

Probe | AUROC [CI] | Eval-aware | Dist-shift
EvalAwarenessGuard (L40 · end_question) | 0.930 [0.88, 0.97] | 0.930 | 0.810
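As a rough illustration of what "linear probes separate eval prompts from deployment prompts" means, the sketch below fits a difference-of-means probe on toy 2-D "activations". This is one simple kind of linear probe chosen for brevity; it is not claimed to be the method of the cited paper or of EvalAwarenessGuard, and every name and number here is hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_mean_difference_probe(pos_acts, neg_acts):
    """Direction = mean(pos) - mean(neg); bias puts the boundary at the midpoint."""
    mu_pos, mu_neg = mean_vector(pos_acts), mean_vector(neg_acts)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def probe_score(direction, bias, activation):
    """Positive score leans toward the positive class (here: evaluation)."""
    return sum(d * a for d, a in zip(direction, activation)) + bias


# Toy activations, invented for illustration only.
eval_acts = [[1.0, 0.2], [0.8, 0.1], [1.2, 0.3]]
deploy_acts = [[-1.0, 0.1], [-0.7, 0.3], [-1.1, 0.2]]
direction, bias = fit_mean_difference_probe(eval_acts, deploy_acts)
print(probe_score(direction, bias, [0.9, 0.2]) > 0)   # True: looks like evaluation
print(probe_score(direction, bias, [-0.8, 0.2]) > 0)  # False: looks like deployment
```

The probe's raw scores, computed over a held-out set, are what an AUROC like the one tabulated for this task would be measured on.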

Reward Hacking

RL-induced misalignment generalization.

0 tasks
🌱
Slot open · propose a benchmark task
No task registered for reward hacking yet. Open a PR with a YAML registry entry to claim the slot.

Manipulation

EU AI Act Article 5 — persuasion / subliminal.

0 tasks
🌱
Slot open · propose a benchmark task
No task registered for manipulation yet. Open a PR with a YAML registry entry to claim the slot.

Refusal

Over/under-refusal calibration; jailbreak proxy.

0 tasks
🌱
Slot open · propose a benchmark task
No task registered for refusal yet. Open a PR with a YAML registry entry to claim the slot.

Open-source dataset commitment

Every test set referenced here is publicly available under the dataset's original license. We pin a SHA-256 hash so probes can be validated against the exact bytes we evaluated on.
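Checking a local copy of a split against a pinned hash can be sketched as follows. The helper names are hypothetical, and the demo computes its own pin on a throwaway file rather than using any hash from this page, since the hashes shown here are truncated.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large splits need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pin(path, pinned_hex):
    """True when the file's bytes hash to the pinned value (case-insensitive hex)."""
    return sha256_of_file(path) == pinned_hex.strip().lower()


# Demo on a throwaway file; the pin is computed here, not taken from the registry.
with tempfile.TemporaryDirectory() as tmp:
    split = Path(tmp) / "test_split.jsonl"
    split.write_bytes(b'{"question": "example", "label": 0}\n')
    pin = hashlib.sha256(split.read_bytes()).hexdigest()
    print(matches_pin(split, pin))  # True
    split.write_bytes(b'{"question": "example", "label": 1}\n')
    print(matches_pin(split, pin))  # False: a single changed byte breaks the match
```

Because the hash covers raw bytes, re-serializing a split (different line endings, key order, or encoding) changes the digest even when the examples are logically identical.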

If a dataset is gated (e.g., it requires accepting a Hugging Face access agreement), we link to the canonical source and, when permitted, ship our exact preprocessed splits as a separate Apache-2.0 derivative.

Submit a task

Have a benchmark task that should be in here? Open a PR with the YAML registry entry.

tasks/your-task-id.yaml (YAML)
id: your-task-id
name: "Your Task Name"
category: hallucination   # or any of the 8 ProbeBench categories
description: "Short description"
paper: "arXiv:XXXX.XXXXX"
dataset_url: "https://huggingface.co/datasets/your/dataset"
test_set_size: 200
test_set_hash: "abc123..."   # SHA-256 of the canonical test split
positive_label: "hallucinated"
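Before opening a PR, an entry like the one above could be sanity-checked locally. The sketch below validates an already-parsed entry (a plain dict) so that it needs no YAML library. The rules it enforces are assumptions inferred from the template: that all nine fields are required, that ids are lowercase hyphenated slugs, that the hash is a full 64-character hex digest, and the exact category spellings.

```python
import hashlib
import re

REQUIRED_FIELDS = ("id", "name", "category", "description", "paper",
                   "dataset_url", "test_set_size", "test_set_hash", "positive_label")
# Category spellings are assumed; the real registry may use different slugs.
CATEGORIES = {"hallucination", "reasoning", "deception", "sandbagging",
              "eval_awareness", "reward_hacking", "manipulation", "refusal"}


def validate_entry(entry):
    """Return a list of problems; an empty list means the entry passed these checks."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in entry]
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", str(entry.get("id", ""))):
        errors.append("id should be a lowercase hyphenated slug")
    if entry.get("category") not in CATEGORIES:
        errors.append("unknown category")
    size = entry.get("test_set_size")
    if isinstance(size, bool) or not isinstance(size, int) or size <= 0:
        errors.append("test_set_size should be a positive integer")
    if not re.fullmatch(r"[0-9a-f]{64}", str(entry.get("test_set_hash", "")).lower()):
        errors.append("test_set_hash should be 64 hex characters")
    return errors


entry = {
    "id": "your-task-id",
    "name": "Your Task Name",
    "category": "hallucination",
    "description": "Short description",
    "paper": "arXiv:XXXX.XXXXX",
    "dataset_url": "https://huggingface.co/datasets/your/dataset",
    "test_set_size": 200,
    "test_set_hash": hashlib.sha256(b"demo split bytes").hexdigest(),  # stand-in digest
    "positive_label": "hallucinated",
}
print(validate_entry(entry))                                   # []
print(validate_entry(dict(entry, test_set_hash="abc123...")))  # flags the placeholder hash
```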

ProbeBench v0.0.1 · Tasks index