Why do we still weight raw AUROC at 0.30 instead of 0? Because not every deployment is uncalibrated. For probes shipped to production where the user is benign — i.e., not adversarially trying to exploit eval-awareness — raw AUROC is closer to operational performance than the bottom-quartile corrected number. We weight both: 0.30 × raw + 0.20 × corrected. The 30% / 20% split exposes the gap without throwing the headline away.
What if a probe doesn't declare a corrected AUROC? Probes that fail to declare an eval-awareness corrected AUROC are auto-imputed at raw × 0.85 — a punishing default that disincentivizes silence. If the real correction is gentler than 15%, declare it and claim the points.
# ProbeScore weights v0.0.1
auroc = 0.30
auroc_evalaware = 0.20
distshift_robustness = 0.15
ece_calibration = 0.10
cross_model_transfer = 0.10
inference_efficiency = 0.10
license_score = 0.05
# missing → impute
corrected = raw × 0.85