Methodology · Pearson_CE

Cross-Model Probe Transfer

Decoder cosine looks high. Causal effects look low. The gap is real. Pearson_CE measures the Pearson correlation between paired ablation effects on source vs target models — the honest transfer signal. Naive cosine alignment overestimates feature equivalence by 38% on cross-model crosscoders (Lindsey 2024 setup). We require it for every cross-model claim on the leaderboard.

Methodology after Lindsey et al. 2024 (Anthropic Crosscoders) and our paper-1 numbers on Gemma-2-2B base/IT — median cosine 0.965 vs median Pearson_CE 0.616, a 38% overstatement.

How Pearson_CE works

Step 1

Pair the inputs

For each test prompt, get residual activation h_s from the source model and h_t from the target model at a matched semantic checkpoint (same token position, same prompt template).

Step 2

Compute paired causal effect

Ablate the probe's direction in each model; measure Δlogit_s and Δlogit_t on the held-out positive class. One pair per prompt.
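
For a linear readout, this step has a closed form worth knowing: projecting out the probe direction changes the class logit by exactly (h·d̂)(w·d̂). A toy numpy sketch, where `w` is a hypothetical readout vector standing in for the model's unembedding row (not a ProbeBench API):

```python
import numpy as np

def ablate_direction(h: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Project out the probe direction: h' = h - (h . d_hat) d_hat."""
    d_hat = d / np.linalg.norm(d)
    return h - (h @ d_hat) * d_hat

def delta_logit(h: np.ndarray, d: np.ndarray, w: np.ndarray) -> float:
    """Delta-logit on the positive class: logit(h) - logit(h') for readout w."""
    return float(w @ h - w @ ablate_direction(h, d))

rng = np.random.default_rng(0)
h = rng.normal(size=8)   # residual activation at the matched checkpoint
d = rng.normal(size=8)   # probe direction in this model
w = rng.normal(size=8)   # readout weights for the held-out positive class

dl = delta_logit(h, d, w)
d_hat = d / np.linalg.norm(d)
# Closed form for a linear readout: delta_logit = (h . d_hat)(w . d_hat)
assert np.isclose(dl, (h @ d_hat) * (w @ d_hat))
```

One Δlogit per prompt per model gives the paired sample that Step 3 correlates.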

Step 3

Pearson over the pairs

ρ_CE = Pearson(Δlogit_s, Δlogit_t) over N ≥ 200 prompts. Bootstrap CIs over 1000 resamples. Report point estimate + 95% CI.
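
Step 3 is plain paired Pearson plus a percentile bootstrap over prompts; a minimal numpy sketch (the function name is ours, not part of the harness):

```python
import numpy as np

def pearson_ce(dl_s, dl_t, n_boot=1000, seed=0):
    """rho_CE = Pearson(delta_logit_s, delta_logit_t) with a 95% percentile-bootstrap CI."""
    dl_s, dl_t = np.asarray(dl_s, float), np.asarray(dl_t, float)
    point = float(np.corrcoef(dl_s, dl_t)[0, 1])
    rng = np.random.default_rng(seed)
    n = len(dl_s)
    boots = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample prompt pairs with replacement
        boots[b] = np.corrcoef(dl_s[idx], dl_t[idx])[0, 1]
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point, (float(lo), float(hi))

# Synthetic check: two noisy views of the same causal effect, N = 200 prompts
rng = np.random.default_rng(1)
effect = rng.normal(size=200)
dl_s = effect + 0.3 * rng.normal(size=200)
dl_t = effect + 0.3 * rng.normal(size=200)
rho, (lo, hi) = pearson_ce(dl_s, dl_t)
assert 0.7 < rho < 0.99 and lo < rho < hi
```

Resampling whole prompt pairs (not the two series independently) is what keeps the bootstrap honest about the paired design.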

Transfer matrix · mean Pearson_CE per pair

7 entries · 3×3 grid
Legend: 1.00 identity · ≥ 0.70 strong · 0.40–0.70 partial · < 0.40 weak · — no data

source ↓ / target →               | Qwen3.6-27B         | Llama-3.3-70B       | Gemma-3-27B
Qwen3.6-27B (Qwen · 64L · 27B)    | 1.00 identity (n=2) | 0.52 partial (n=2)  | 0.38 weak (n=1)
Llama-3.3-70B (Llama · 80L · 70B) | 0.51 partial (n=1)  | —                   | 0.46 partial (n=1)
Gemma-3-27B (Gemma · 62L · 27B)   | —                   | —                   | —

Targets: Qwen3.6-27B (Hybrid GDN + …), Llama-3.3-70B (Dense transformer, instruct-tuned), Gemma-3-27B (Dense transformer with …).

The diagonal is always ρ_CE = 1.00 by construction — a probe transfers perfectly to itself. Off-diagonal cells reveal architecture distance: same-family targets (e.g. Qwen → Qwen) generally exceed cross-family targets (Qwen → Llama, Qwen → Gemma). Cells with ρ_CE < 0.4 indicate the probe direction is not the same causal mechanism in the target model — high decoder cosine on those pairs is a confound, not portability.

Per-probe transfer profiles

FabricationGuard v2
openinterp/fabricationguard-qwen36-27b-l31-v2
mean ρ_CE (off-diagonal): 0.400
Qwen3.6-27B → Llama-3.3-70B: ρ_CE = 0.42
Qwen3.6-27B → Gemma-3-27B: ρ_CE = 0.38
Qwen3.6-27B → Qwen3.6-27B: ρ_CE = 1.00 (self)

Highest off-diagonal transfer: Qwen3.6-27B → Llama-3.3-70B with ρ_CE = 0.420 (transfer AUROC 0.710).

DeceptionGuard (Apollo re-impl)
openinterp/deceptionguard-llama33-70b-l40
mean ρ_CE (off-diagonal): 0.485
Llama-3.3-70B → Qwen3.6-27B: ρ_CE = 0.51
Llama-3.3-70B → Gemma-3-27B: ρ_CE = 0.46

Highest off-diagonal transfer: Llama-3.3-70B → Qwen3.6-27B with ρ_CE = 0.510 (transfer AUROC 0.790).

ReasonGuard v0.2
openinterp/reasonguard-qwen36-27b-l55-mid_think
Qwen3.6-27B → Qwen3.6-27B: ρ_CE = 1.00 (self)

EvalAwarenessGuard
openinterp/evalawareness-qwen36-27b-l40
mean ρ_CE (off-diagonal): 0.610
Qwen3.6-27B → Llama-3.3-70B: ρ_CE = 0.61

Highest off-diagonal transfer: Qwen3.6-27B → Llama-3.3-70B with ρ_CE = 0.610 (transfer AUROC 0.860).

What Pearson_CE buys you (and what it doesn't)

Why we don't reward high transfer naively

Naive Pearson on uncalibrated logits is gameable: per-prompt logit shifts that are large but parallel across models (shared difficulty, not shared mechanism) inflate ρ_CE. We require token-margin normalization, a held-out test set (no overlap with probe training prompts), and ≥ 3 source models before transfer can count toward ProbeScore. Single-pair transfers are reported but not scored.
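
The exact normalizer is not pinned down above; one plausible implementation, given purely as an assumption, divides each Δlogit by that prompt's own top-two token margin, so a model whose logits swing uniformly harder cannot inflate ρ_CE on scale alone:

```python
import numpy as np

def token_margin(logits: np.ndarray) -> float:
    """Gap between the top two token logits at the checkpoint."""
    top2 = np.partition(logits, -2)[-2:]   # [second-largest, largest]
    return float(top2[1] - top2[0])

def normalize_deltas(deltas, margins, eps=1e-6):
    """Scale each delta-logit by its prompt's own margin before computing rho_CE,
    so magnitude alone (shared scale, not shared mechanism) drops out."""
    return np.asarray(deltas, float) / np.maximum(np.asarray(margins, float), eps)

assert np.isclose(token_margin(np.array([0.1, 2.0, 5.0, 4.2])), 0.8)
assert np.isclose(normalize_deltas([2.0], [4.0])[0], 0.5)
```

ProbeBench's actual normalizer may differ; the point is that any per-prompt, per-model scale factor must be applied before the correlation is taken.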

What Pearson_CE doesn't tell you

High ρ_CE means the probe's causal effects covary on the measured downstream behavior, not necessarily on other downstream behaviors. Same direction, different semantics is possible: a probe that ablates "deception" in the source and "refusal" in the target can still register ρ_CE > 0.7 if both interventions happen to suppress the same logits. Probe transfer ≠ probe portability for arbitrary interventions.

Citations

Lindsey, Templeton, Marcus, et al. · Anthropic, 2024
Sparse Crosscoders for Cross-Layer Features and Model Diffing
Transformer Circuits

Original crosscoder formulation; introduces the cosine-based decoder alignment that Pearson_CE is meant to validate causally.

Vicentino · 2026
Decoder Cosine vs Causal Equivalence in Cross-Model Crosscoders
Paper-1 — notebooks 17b / 17c

Reproducible numbers: median cosine 0.965 vs Pearson_CE 0.616 on Gemma-2-2B base/IT BatchTopK L13. 38% overstatement of feature equivalence by cosine alone.

Goldowsky-Dill, Chughtai, Heimersheim, Hobbhahn · Apollo Research, 2025
Detecting Strategic Deception Using Linear Probes
arXiv:2502.03407

Cross-model deception probe transfer; one of the few public datasets with paired causal-effect evaluations.

Submit a transfer evaluation

Run the cross-model harness on your probe → emit a YAML entry → open a PR against lib/probebench-data.ts. Required fields: probeId, sourceModel, targetModel, pearson_ce; optional: transfer_auroc, notes.
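
A submitted entry might look like this; the field names come from the list above, while the values are illustrative (taken from the FabricationGuard v2 row for concreteness):

```yaml
- probeId: openinterp/fabricationguard-qwen36-27b-l31-v2
  sourceModel: Qwen3.6-27B
  targetModel: Llama-3.3-70B
  pearson_ce: 0.42
  transfer_auroc: 0.71                                 # optional
  notes: "1000 bootstrap resamples, held-out test set"  # optional
```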