Activation probes have entered a Goodhart phase. The community keeps shipping new probes with headline numbers in the high 0.9s, and each one is a real result on a real test set; the problem is that the numbers don't compose. Apollo Research reports 0.96–0.999 AUROC on deception detection with linear probes. Goodfire's RLFR work shows probe accuracy scaling monotonically with model size. Our own FabricationGuard hits 0.88 cross-task on SimpleQA.
Every one of these is conditional on a chosen test set, a chosen model, and a chosen layer. Apollo's 0.999 was on Llama-3.3-70B at layer 40 against the insider-trading suite; change the model, swap the layer or token position, or let the prompt smell less like an evaluation, and the number drops. The eval-awareness power law (Apollo, 2025) shows that this confound alone is large enough to explain a meaningful share of the variance in published numbers.
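To make the conditioning concrete, here is a minimal sketch of the standard probe-evaluation loop: fit a linear probe on cached residual-stream activations at each layer and score it with AUROC, sweeping the layer index. The activations below are synthetic, and the dimensions, regularization, and planted signal are illustrative placeholders, not any paper's actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_layers, d_model, n_train, n_test = 12, 256, 1000, 400

# Synthetic stand-in for cached activations, shape [layer, example, d_model].
# In a real run these come from hooked forward passes over labeled prompts.
acts = rng.standard_normal((n_layers, n_train + n_test, d_model)).astype(np.float32)
labels = rng.integers(0, 2, size=n_train + n_test)  # 1 = deceptive, 0 = honest

# Plant a weak signal whose strength varies by layer, so the sweep is non-trivial.
direction = rng.standard_normal(d_model)
for layer in range(n_layers):
    strength = np.sin(np.pi * layer / (n_layers - 1))  # peaks mid-network
    acts[layer, labels == 1] += 0.05 * strength * direction

aurocs = []
for layer in range(n_layers):
    X_tr, X_te = acts[layer, :n_train], acts[layer, n_train:]
    y_tr, y_te = labels[:n_train], labels[n_train:]
    probe = LogisticRegression(max_iter=2000, C=0.1).fit(X_tr, y_tr)
    aurocs.append(roc_auc_score(y_te, probe.decision_function(X_te)))

best = int(np.argmax(aurocs))
print(f"best layer {best}: AUROC {aurocs[best]:.3f}; "
      f"layer 0: AUROC {aurocs[0]:.3f}")  # headline layer vs. off-peak layer
```

Even in this toy, the same probe recipe produces very different AUROCs depending on which layer you read from; reporting only the best layer's number is exactly the degree of freedom the confound corrections have to close.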
The community needs a leaderboard the way SAEs needed SAEBench. SAEBench refused to ship a composite; that gap is what InterpScore filled on the SAE side. ProbeBench is the same move for activation probes: pick the categories that matter for alignment, define the corrections that catch the most common confounds, and put a single composite next to every entry so that probes from different papers can be compared on the same row.
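For concreteness, here is one shape such a composite could take, a minimal sketch assuming per-category AUROCs and measured confound corrections; the category names, the correction values, and the chance-floored geometric mean are all illustrative assumptions, not a spec.

```python
import math

# Hypothetical per-category AUROCs for one leaderboard entry (illustrative values).
scores = {"deception": 0.97, "sandbagging": 0.91, "fabrication": 0.88}

# Hypothetical confound corrections: estimated AUROC inflation attributable to
# eval-awareness, prompt-template leakage, etc., measured per category.
corrections = {"deception": 0.04, "sandbagging": 0.02, "fabrication": 0.01}

def composite(scores, corrections, floor=0.5):
    """Geometric mean of confound-corrected scores, with chance (0.5) mapped to 0."""
    adjusted = []
    for cat, s in scores.items():
        corrected = max(s - corrections.get(cat, 0.0), floor)
        adjusted.append((corrected - floor) / (1.0 - floor))  # map [0.5, 1] -> [0, 1]
    # Guard against log(0) when a category sits exactly at chance.
    return math.exp(sum(math.log(max(a, 1e-9)) for a in adjusted) / len(adjusted))

print(f"composite: {composite(scores, corrections):.3f}")
```

The geometric mean is the anti-Goodhart choice here: one saturated category cannot buy back a near-chance one, so the composite only moves when a probe is good across the board.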