Methodology · Pearson_CE

Cross-Model Probe Transfer

Decoder cosine looks high. Causal effects look low. The gap is real. Pearson_CE measures the Pearson correlation between paired ablation effects on source vs target models — the honest transfer signal. Naive cosine alignment overestimates feature equivalence by 38% on cross-model crosscoders (Lindsey 2024 setup). We require it for every cross-model claim on the leaderboard.

Methodology after Lindsey et al. 2024 (Anthropic Crosscoders) and our paper-1 numbers on Gemma-2-2B base/IT — median cosine 0.965 vs median Pearson_CE 0.616, a 38% overstatement.

How Pearson_CE works

Step 1

Pair the inputs

For each test prompt, get residual activation h_s from the source model and h_t from the target model at a matched semantic checkpoint (same token position, same prompt template).

Step 2

Compute paired causal effect

Ablate the probe's direction in each model; measure Δlogit_s and Δlogit_t on the held-out positive class. One pair per prompt.
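
For a linear readout, this step has a closed form worth knowing: projecting out the probe direction changes the class logit by exactly (h·d̂)(w·d̂). A toy numpy sketch, where `w` is a hypothetical readout vector standing in for the model's unembedding row (not a ProbeBench API):

```python
import numpy as np

def ablate_direction(h: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Project out the probe direction: h' = h - (h . d_hat) d_hat."""
    d_hat = d / np.linalg.norm(d)
    return h - (h @ d_hat) * d_hat

def delta_logit(h: np.ndarray, d: np.ndarray, w: np.ndarray) -> float:
    """Delta-logit on the positive class: logit(h) - logit(h') for readout w."""
    return float(w @ h - w @ ablate_direction(h, d))

rng = np.random.default_rng(0)
h = rng.normal(size=8)   # residual activation at the matched checkpoint
d = rng.normal(size=8)   # probe direction in this model
w = rng.normal(size=8)   # readout weights for the held-out positive class

dl = delta_logit(h, d, w)
d_hat = d / np.linalg.norm(d)
# Closed form for a linear readout: delta_logit = (h . d_hat)(w . d_hat)
assert np.isclose(dl, (h @ d_hat) * (w @ d_hat))
```

One Δlogit per prompt per model gives the paired sample that Step 3 correlates.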

Step 3

Pearson over the pairs

ρ_CE = Pearson(Δlogit_s, Δlogit_t) over N ≥ 200 prompts. Bootstrap CIs over 1000 resamples. Report point estimate + 95% CI.
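
Step 3 is plain paired Pearson plus a percentile bootstrap over prompts; a minimal numpy sketch (the function name is ours, not part of the harness):

```python
import numpy as np

def pearson_ce(dl_s, dl_t, n_boot=1000, seed=0):
    """rho_CE = Pearson(delta_logit_s, delta_logit_t) with a 95% percentile-bootstrap CI."""
    dl_s, dl_t = np.asarray(dl_s, float), np.asarray(dl_t, float)
    point = float(np.corrcoef(dl_s, dl_t)[0, 1])
    rng = np.random.default_rng(seed)
    n = len(dl_s)
    boots = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample prompt pairs with replacement
        boots[b] = np.corrcoef(dl_s[idx], dl_t[idx])[0, 1]
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point, (float(lo), float(hi))

# Synthetic check: two noisy views of the same causal effect, N = 200 prompts
rng = np.random.default_rng(1)
effect = rng.normal(size=200)
dl_s = effect + 0.3 * rng.normal(size=200)
dl_t = effect + 0.3 * rng.normal(size=200)
rho, (lo, hi) = pearson_ce(dl_s, dl_t)
assert 0.7 < rho < 0.99 and lo < rho < hi
```

Resampling whole prompt pairs (not the two series independently) is what keeps the bootstrap honest about the paired design.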

Transfer matrix · mean Pearson_CE per pair

7 entries · 3×3 grid
Legend: 1.00 identity · ≥ 0.70 strong · 0.40–0.70 partial · < 0.40 weak · — no data

source ↓ / target →               | Qwen3.6-27B         | Llama-3.3-70B       | Gemma-3-27B
Qwen3.6-27B (Qwen · 64L · 27B)    | 1.00 identity (n=2) | 0.52 partial (n=2)  | 0.38 weak (n=1)
Llama-3.3-70B (Llama · 80L · 70B) | 0.51 partial (n=1)  | —                   | 0.46 partial (n=1)
Gemma-3-27B (Gemma · 62L · 27B)   | —                   | —                   | —

Targets: Qwen3.6-27B (Hybrid GDN + …), Llama-3.3-70B (Dense transformer, instruct-tuned), Gemma-3-27B (Dense transformer with …).

The diagonal is always ρ_CE = 1.00 by construction — a probe transfers perfectly to itself. Off-diagonal cells reveal architecture distance: same-family targets (e.g. Qwen → Qwen) generally exceed cross-family targets (Qwen → Llama, Qwen → Gemma). Cells with ρ_CE < 0.4 indicate the probe direction is not the same causal mechanism in the target model — high decoder cosine on those pairs is a confound, not portability.

Per-probe transfer profiles

FabricationGuard v2
openinterp/fabricationguard-qwen36-27b-l31-v2
mean ρ_CE (off-diagonal): 0.400
Qwen3.6-27B → Llama-3.3-70B: ρ_CE = 0.42
Qwen3.6-27B → Gemma-3-27B: ρ_CE = 0.38
Qwen3.6-27B → Qwen3.6-27B: ρ_CE = 1.00 (self)

Highest off-diagonal transfer: Qwen3.6-27B → Llama-3.3-70B with ρ_CE = 0.420 (transfer AUROC 0.710).

DeceptionGuard (Apollo re-impl)
openinterp/deceptionguard-llama33-70b-l40
mean ρ_CE (off-diagonal): 0.485
Llama-3.3-70B → Qwen3.6-27B: ρ_CE = 0.51
Llama-3.3-70B → Gemma-3-27B: ρ_CE = 0.46

Highest off-diagonal transfer: Llama-3.3-70B → Qwen3.6-27B with ρ_CE = 0.510 (transfer AUROC 0.790).

ReasonGuard v0.2
openinterp/reasonguard-qwen36-27b-l55-mid_think
Qwen3.6-27B → Qwen3.6-27B: ρ_CE = 1.00 (self)

EvalAwarenessGuard
openinterp/evalawareness-qwen36-27b-l40
mean ρ_CE (off-diagonal): 0.610
Qwen3.6-27B → Llama-3.3-70B: ρ_CE = 0.61

Highest off-diagonal transfer: Qwen3.6-27B → Llama-3.3-70B with ρ_CE = 0.610 (transfer AUROC 0.860).

What Pearson_CE buys you (and what it doesn't)

Why we don't reward high transfer naively

Naive Pearson on uncalibrated logits is gameable: per-prompt logit shifts that are large but parallel across models (shared difficulty, not shared mechanism) inflate ρ_CE. We require token-margin normalization, a held-out test set (no overlap with probe training prompts), and ≥ 3 source models before transfer can count toward ProbeScore. Single-pair transfers are reported but not scored.
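
The exact normalizer is not pinned down above; one plausible implementation, given purely as an assumption, divides each Δlogit by that prompt's own top-two token margin, so a model whose logits swing uniformly harder cannot inflate ρ_CE on scale alone:

```python
import numpy as np

def token_margin(logits: np.ndarray) -> float:
    """Gap between the top two token logits at the checkpoint."""
    top2 = np.partition(logits, -2)[-2:]   # [second-largest, largest]
    return float(top2[1] - top2[0])

def normalize_deltas(deltas, margins, eps=1e-6):
    """Scale each delta-logit by its prompt's own margin before computing rho_CE,
    so magnitude alone (shared scale, not shared mechanism) drops out."""
    return np.asarray(deltas, float) / np.maximum(np.asarray(margins, float), eps)

assert np.isclose(token_margin(np.array([0.1, 2.0, 5.0, 4.2])), 0.8)
assert np.isclose(normalize_deltas([2.0], [4.0])[0], 0.5)
```

ProbeBench's actual normalizer may differ; the point is that any per-prompt, per-model scale factor must be applied before the correlation is taken.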

What Pearson_CE doesn't tell you

High ρ_CE means the probe's causal effects covary on the measured downstream behavior, not necessarily on other downstream behaviors. Same direction, different semantics is possible: a probe that ablates "deception" in the source and "refusal" in the target can still register ρ_CE > 0.7 if both interventions happen to suppress the same logits. Probe transfer ≠ probe portability for arbitrary interventions.

Citations

Lindsey, Templeton, Marcus, et al. · Anthropic, 2024
Sparse Crosscoders for Cross-Layer Features and Model Diffing
Transformer Circuits

Original crosscoder formulation; introduces the cosine-based decoder alignment that Pearson_CE is meant to validate causally.

Vicentino · 2026
Decoder Cosine vs Causal Equivalence in Cross-Model Crosscoders
Paper-1 — notebooks 17b / 17c

Reproducible numbers: median cosine 0.965 vs Pearson_CE 0.616 on Gemma-2-2B base/IT BatchTopK L13. 38% overstatement of feature equivalence by cosine alone.

Goldowsky-Dill, Chughtai, Heimersheim, Hobbhahn · Apollo Research, 2025
Detecting Strategic Deception Using Linear Probes
arXiv:2502.03407

Cross-model deception probe transfer; one of the few public datasets with paired causal-effect evaluations.

Submit a transfer evaluation

Run the cross-model harness on your probe → emit a YAML entry → open a PR against lib/probebench-data.ts. Required fields: probeId, sourceModel, targetModel, pearson_ce; optional: transfer_auroc, notes.
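
A submitted entry might look like this; the field names come from the list above, while the values are illustrative (taken from the FabricationGuard v2 row for concreteness):

```yaml
- probeId: openinterp/fabricationguard-qwen36-27b-l31-v2
  sourceModel: Qwen3.6-27B
  targetModel: Llama-3.3-70B
  pearson_ce: 0.42
  transfer_auroc: 0.71                                 # optional
  notes: "1000 bootstrap resamples, held-out test set"  # optional
```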