A single AUROC is not enough
Published probes reach an AUROC of 0.96 or higher on Apollo deception suites, but they break under long-context distribution shift (DeepMind 2601.11516), and their scores inflate when test prompts trigger eval-awareness (arXiv 2509.13333). ProbeScore folds both corrections into a single composite, so the headline number is harder to oversell.
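To make the idea concrete, here is a minimal sketch of how a composite could penalize a probe that only looks good on the standard split. It is an illustration under stated assumptions, not ProbeScore's actual formula: the function name `composite_score`, the split names, the toy scores, and the choice of a harmonic mean are all hypothetical.

```python
from statistics import harmonic_mean


def auroc(labels, scores):
    """Rank-based AUROC: the probability that a random positive example
    outscores a random negative one, with ties counted as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def composite_score(splits):
    """Hypothetical composite (an assumption, not the ProbeScore definition):
    the harmonic mean of per-split AUROCs, so one weak split pulls the
    headline number down more than an arithmetic mean would.

    `splits` maps a split name to a (labels, scores) pair.
    """
    per_split = {name: auroc(y, s) for name, (y, s) in splits.items()}
    return harmonic_mean(list(per_split.values())), per_split


# Toy data, invented for illustration only (1 = deceptive, 0 = honest).
splits = {
    "standard": ([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]),
    "long_context": ([1, 1, 0, 0], [0.9, 0.3, 0.4, 0.1]),
    "eval_unaware": ([1, 1, 0, 0], [0.6, 0.4, 0.5, 0.1]),
}
score, per_split = composite_score(splits)
print(per_split)
print(round(score, 3))
```

In this toy case the standard split is perfectly separated while the other two are not, so the composite lands below the standard-split AUROC. A stricter alternative would report the minimum across splits instead of a mean.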