Pair the inputs
For each test prompt, get residual activation h_s from the source model and h_t from the target model at a matched semantic checkpoint (same token position, same prompt template).
Decoder cosine looks high. Causal effects look low. The gap is real. Pearson_CE (written ρ_CE below) is the Pearson correlation between paired ablation effects on the source and target models, and it is the honest transfer signal. Naive cosine alignment overstates feature equivalence by 38% on cross-model crosscoders (Lindsey 2024 setup). We require Pearson_CE for every cross-model claim on the leaderboard.
Methodology after Lindsey et al. 2024 (Anthropic Crosscoders) and our paper-1 numbers on Gemma-2-2B base/IT — median cosine 0.965 vs median Pearson_CE 0.616, a 38% overstatement.
Ablate the probe's direction in each model; measure Δlogit_s and Δlogit_t on the held-out positive class. One pair per prompt.
ρ_CE = Pearson(Δlogit_s, Δlogit_t) over N ≥ 200 prompts. Bootstrap CIs over 1000 resamples. Report point estimate + 95% CI.
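The pair, ablate, correlate steps can be sketched in plain Python. This is a minimal illustration, not the harness itself: the function names are hypothetical, ablation is assumed to mean projecting the probe direction out of the residual activation, and the interval is assumed to be a percentile bootstrap, since the text above does not say which variant is used.

```python
import math
import random


def ablate_direction(h, d):
    """Assumed ablation: remove the component of h along d, h - (h . d_hat) d_hat."""
    norm = math.sqrt(sum(x * x for x in d))
    if norm == 0:
        raise ValueError("probe direction must be non-zero")
    d_hat = [x / norm for x in d]
    coeff = sum(a * b for a, b in zip(h, d_hat))
    return [a - coeff * b for a, b in zip(h, d_hat)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(delta_s, delta_t, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Pearson, resampling prompts (pairs) with replacement."""
    rng = random.Random(seed)
    n = len(delta_s)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(pearson([delta_s[i] for i in idx],
                                 [delta_t[i] for i in idx]))
        except ValueError:
            continue  # skip degenerate resamples with constant values
    if not stats:
        raise ValueError("no valid bootstrap resamples")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi
```

With `delta_s` and `delta_t` holding one Δlogit per prompt for each model, `pearson(delta_s, delta_t)` gives the point estimate and `bootstrap_ci(delta_s, delta_t)` the 95% interval. Resampling indices rather than values keeps each prompt's source and target effects paired.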
The diagonal is always ρ_CE = 1.00 by construction — a probe transfers perfectly to itself. Off-diagonal cells reveal architecture distance: same-family targets (e.g. Qwen → Qwen) generally exceed cross-family targets (Qwen → Llama, Qwen → Gemma). Cells with ρ_CE < 0.4 indicate the probe direction is not the same causal mechanism in the target model — high decoder cosine on those pairs is a confound, not portability.
Highest off-diagonal transfer: Qwen3.6-27B → Llama-3.3-70B with ρ_CE = 0.420 (transfer AUROC 0.710).
Highest off-diagonal transfer: Llama-3.3-70B → Qwen3.6-27B with ρ_CE = 0.510 (transfer AUROC 0.790).
Highest off-diagonal transfer: Qwen3.6-27B → Llama-3.3-70B with ρ_CE = 0.610 (transfer AUROC 0.860).
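Reading the matrix can be automated along these lines. The sketch below is an assumption about layout, not the leaderboard's code: rows are source models, columns are target models, and the model names and values are placeholders invented for illustration. It picks the highest off-diagonal cell and lists cells under the 0.4 threshold.

```python
def summarize_matrix(models, rho, low=0.4):
    """Return (best off-diagonal cell, cells with rho_CE below `low`).

    rho[i][j] is assumed to hold rho_CE for source models[i] -> target models[j];
    None marks a pair that was not evaluated.
    """
    best = None
    flagged = []
    for i, src in enumerate(models):
        for j, tgt in enumerate(models):
            if i == j:
                continue  # diagonal is 1.00 by construction, so skip it
            val = rho[i][j]
            if val is None:
                continue
            if best is None or val > best[2]:
                best = (src, tgt, val)
            if val < low:
                flagged.append((src, tgt, val))
    return best, flagged


# Placeholder names and values, for illustration only.
models = ["model-a", "model-b", "model-c"]
rho = [
    [1.00, 0.62, 0.35],
    [0.58, 1.00, 0.41],
    [0.30, 0.44, 1.00],
]
best, flagged = summarize_matrix(models, rho)
```

Transfer is directional in this layout, so `model-a -> model-c` and `model-c -> model-a` are separate cells and can land on different sides of the threshold.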
Naive Pearson on uncalibrated logits is gameable — large but parallel logit shifts inflate ρ_CE without reflecting shared mechanism. We require token-margin normalization, a held-out test set (no overlap with probe training prompts), and ≥ 3 source models before transfer can count toward ProbeScore. Single-pair transfers are reported but not scored.
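Two of these requirements are simple enough to check mechanically. The following is a hedged sketch with hypothetical names, not the scoring code: it checks that test prompts are disjoint from probe training prompts and that at least 3 distinct source models are present. Token-margin normalization is left out because its exact definition is not given here.

```python
def scoring_eligibility(test_prompts, train_prompts, source_models, min_sources=3):
    """Return (eligible, reasons) for counting a transfer toward ProbeScore.

    Covers only the held-out and source-count requirements; normalization
    of the logits is assumed to be handled elsewhere.
    """
    reasons = []
    overlap = set(test_prompts) & set(train_prompts)
    if overlap:
        reasons.append(f"{len(overlap)} test prompt(s) overlap with probe training prompts")
    distinct_sources = set(source_models)
    if len(distinct_sources) < min_sources:
        reasons.append(
            f"only {len(distinct_sources)} distinct source model(s), need >= {min_sources}"
        )
    return (not reasons, reasons)
```

A single-pair transfer would come back as `(False, [...])` with the source-count reason, which matches the "reported but not scored" treatment.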
High ρ_CE means the causal effects on the measured downstream behavior are correlated across the two models. It does not establish that the effect magnitudes match, or that other downstream behaviors respond alike. Same direction, different semantics is possible: a direction whose ablation removes "deception" in source and "refusal" in target can still register ρ_CE > 0.7 if both happen to suppress the same logits. Probe transfer ≠ probe portability for arbitrary interventions.
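One property of the statistic is worth keeping in mind when reading a high value: Pearson correlation is unchanged by positive rescaling, so it cannot tell whether effect sizes agree. The toy example below uses made-up Δlogit values (not measurements) in which the target effects are ten times the source effects.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Illustrative values only: one ablation effect per prompt.
source_effects = [-0.2, -0.5, -0.1, -0.9, -0.4]
target_effects = [10 * e for e in source_effects]  # same pattern, 10x the size

rho = pearson(source_effects, target_effects)  # 1 up to floating-point rounding
```

Reporting mean effect size next to ρ_CE is one way to make such a mismatch visible.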
Original crosscoder formulation; introduces the cosine-based decoder alignment that Pearson_CE is meant to validate causally.
Reproducible numbers: median cosine 0.965 vs Pearson_CE 0.616 on Gemma-2-2B base/IT BatchTopK L13. 38% overstatement of feature equivalence by cosine alone.
Cross-model deception probe transfer; one of the few public datasets with paired causal-effect evaluations.
Run the cross-model harness on your probe → emit a YAML entry → open a PR against lib/probebench-data.ts. Required fields: probeId, sourceModel, targetModel, pearson_ce; optional: transfer_auroc, notes.
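Before opening the PR, an entry can be sanity-checked locally. This is a sketch under assumptions: the validator name and the `probeId` value are hypothetical, the field types are guessed from the field names, and the range checks come only from the definitions of a correlation and an AUROC. The repository's own schema takes precedence.

```python
REQUIRED = ("probeId", "sourceModel", "targetModel", "pearson_ce")
OPTIONAL = ("transfer_auroc", "notes")


def _is_number(value):
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_entry(entry):
    """Return a list of problems; an empty list means the entry looks well-formed."""
    problems = []
    for key in REQUIRED:
        if key not in entry:
            problems.append(f"missing required field: {key}")
    for key in entry:
        if key not in REQUIRED + OPTIONAL:
            problems.append(f"unknown field: {key}")
    for key in ("probeId", "sourceModel", "targetModel", "notes"):
        if key in entry and not isinstance(entry[key], str):
            problems.append(f"{key} should be a string")
    if "pearson_ce" in entry:
        v = entry["pearson_ce"]
        if not _is_number(v) or not -1.0 <= v <= 1.0:
            problems.append("pearson_ce should be a number in [-1, 1]")
    if "transfer_auroc" in entry:
        v = entry["transfer_auroc"]
        if not _is_number(v) or not 0.0 <= v <= 1.0:
            problems.append("transfer_auroc should be a number in [0, 1]")
    return problems


# Hypothetical probeId; model names and values reuse a pair reported above.
entry = {
    "probeId": "example-probe",
    "sourceModel": "Llama-3.3-70B",
    "targetModel": "Qwen3.6-27B",
    "pearson_ce": 0.510,
    "transfer_auroc": 0.790,
}
problems = validate_entry(entry)
```

An empty `problems` list only means the fields are present and in range; it says nothing about whether the numbers were produced under the held-out and normalization requirements.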