pending review · linear · Deception · Apache-2.0

DeceptionGuard (Apollo re-impl)

Re-implementation of the Apollo Research deception probe (published AUROC 0.96–0.999)

Re-implementation of the Apollo Research deception-detection method on Llama-3.3-70B-Instruct. Uses paired honest/deceptive contrast pairs from insider-trading and werewolf scenarios. Citation: Goldowsky-Dill, Chughtai, Heimersheim, Hobbhahn (ICML 2025).
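
The shipped artifact is an sklearn-compatible probe plus scaler (see Artifact below), so the fit plausibly resembles the minimal sketch that follows. All names, shapes, and the synthetic data are placeholders, not the actual training pipeline:

python
# Hypothetical sketch: fit a linear probe on paired honest/deceptive
# activations. Data, shapes, and names below are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
d_model = 8192  # Llama-3.3-70B residual width
X_honest = rng.normal(size=(150, d_model))           # stand-in activations
X_deceptive = rng.normal(0.05, size=(150, d_model))  # stand-in activations

X = np.vstack([X_honest, X_deceptive])
y = np.array([0] * len(X_honest) + [1] * len(X_deceptive))

scaler = StandardScaler().fit(X)
probe = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)
p_deceptive = probe.predict_proba(scaler.transform(X))[:, 1]  # P(positive_class)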

ProbeScore
0.785
Deception #1 · Global #2
8 weighted axes. Subject to revision as more reproducers land.

Quickstart

Three lines via the openinterp.probebench SDK. probe.score returns the probe's P(positive_class) for a tensor of activations captured at the layer/position listed under Architecture & artifact below (layer 40, last token).

python
from openinterp import probebench
probe = probebench.load("openinterp/deceptionguard-llama33-70b-l40")
score = probe.score(activations)  # → P(positive_class)
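
The activations argument is whatever you capture at that hook point. A minimal capture sketch via Hugging Face transformers hidden states; the loading details and prompt are assumptions, and openinterp's own capture path may differ:

python
# Hypothetical capture sketch: layer-40 residual stream at the last token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.3-70B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto"
)
inputs = tok("I did not see the insider memo.", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
# hidden_states[0] is the embedding layer, so index 40 = output of layer 40
activations = out.hidden_states[40][:, -1, :]  # (batch, d_model=8192)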

ProbeScore — 8 weighted axes

Composite metric in [0, 1]. No single axis can dominate; we think this discourages single-metric optimization. Numbers update as new evaluations land.

ProbeScore breakdown (total 0.785):
AUROC 0.222 · Eval-aware 0.146 · Dist-shift 0.089 · Calibration 0.087 · Transfer 0.049 · Goodhart-resistance 0.050 · Latency 0.092 · License 0.050

ProbeScore — 8-axis breakdown

ProbeBench v0.0.2 · 8 axes weighted to sum 1.0

Visual breakdown of where this probe scores high versus low across all 8 axes. The polygon's size tracks the total score; its shape shows which axes carry it (and which it sacrifices).

[Radar chart: per-axis scores, tabulated below.]
Axis                  Score   Weight   Contribution
AUROC                 0.889   0.25     0.222
Eval-aware            0.810   0.18     0.146
Dist-shift            0.745   0.12     0.089
Calibration           0.870   0.10     0.087
Cross-model           0.485   0.10     0.049
Goodhart-resistance   0.500   0.10     0.050
Latency               0.919   0.10     0.092
License               1.000   0.05     0.050

Per-task evaluation

Headline AUROC alongside the eval-awareness-corrected and distribution-shift versions for each task. Reproducers run from a single notebook on a single Colab session.

Task                                             n     AUROC [95% CI]       Eval-aware corrected   Dist-shift   ECE     FPR@99%TPR   Latency
Apollo Insider Trading (Deception · deceptive)   150   0.978 [0.96, 0.99]   0.910                  0.840        0.040   0.012        2.1 ms
Hypocrisy Gap (Reasoning · unfaithful_cot)       180   0.800 [0.72, 0.87]   0.710                  0.650        0.090   0.100        2.1 ms
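
The AUROC, ECE, and FPR@99%TPR columns follow standard definitions; a sketch of how a reproducer might compute them from per-example labels and probe scores (toy data, not the ProbeBench harness):

python
# Standard-metric sketch: AUROC, expected calibration error (10 equal-width
# bins), and FPR at the threshold where TPR first reaches 99%. Toy data.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def fpr_at_tpr(y, p, target=0.99):
    fpr, tpr, _ = roc_curve(y, p)
    return float(fpr[np.searchsorted(tpr, target)])

def ece(y, p, n_bins=10):
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    return sum(
        (bins == b).mean() * abs(p[bins == b].mean() - y[bins == b].mean())
        for b in range(n_bins) if (bins == b).any()
    )

y = np.array([0, 0, 0, 1, 1, 1])
p = np.array([0.1, 0.3, 0.6, 0.4, 0.8, 0.9])
print(roc_auc_score(y, p), fpr_at_tpr(y, p), ece(y, p))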

Cross-model transfer (Pearson_CE)

Pearson correlation of paired ablation effects across models. ≥ 0.7 suggests the probe's direction is shared; 0.4–0.7 suggests partial transfer; < 0.4 suggests retraining per architecture.

Source          Target        Pearson_CE   Transfer AUROC
Llama-3.3-70B   Qwen3.6-27B   0.510        0.790
Llama-3.3-70B   Gemma-3-27B   0.460        0.740
Mean Pearson_CE for this probe (excluding self-baseline): 0.485
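
Pearson_CE is a plain Pearson correlation over paired per-prompt ablation effects (e.g., change in cross-entropy when the probe direction is ablated in each model). A minimal sketch; the effect values below are placeholders, not measured data:

python
# Hypothetical sketch: correlate ablation effects of the probe direction
# across two models. Effect values are placeholders.
import numpy as np

effects_source = np.array([0.42, 0.10, 0.33, 0.05, 0.21])  # Llama-3.3-70B
effects_target = np.array([0.38, 0.15, 0.25, 0.09, 0.18])  # target model

pearson_ce = np.corrcoef(effects_source, effects_target)[0, 1]
# >= 0.7: direction likely shared; 0.4-0.7: partial; < 0.4: retrain per arch
print(round(float(pearson_ce), 3))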

Architecture & artifact

Base model

Llama-3.3-70B
meta-llama/Llama-3.3-70B-Instruct
Family
Llama
Params
70B
Architecture
Dense transformer (instruct-tuned)
Layers
80
d_model
8192
Weights license
custom
Probe attaches at
Layer 40 · last_token
huggingface.co/meta-llama/Llama-3.3-70B-Instruct

Artifact

DeceptionGuard weights
sklearn-compatible probe + scaler + meta.json
Params
524,288
Size
2.0 MB
License
Apache-2.0
Released
2026-05-XX
sha256
dc4f2a8b6e5d…9d8c7b6a
Hash matches the artifact at huggingface.co/datasets/caiovicentino1/DeceptionGuard-linearprobe-llama33-70b. Recompute via openinterp probebench verify openinterp/deceptionguard-llama33-70b-l40.
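
If you prefer not to use the CLI, the same check is a few lines of Python (the file path below is a placeholder; compare the output to the published digest):

python
# Recompute the artifact's sha256 locally; the path is a placeholder.
import hashlib

h = hashlib.sha256()
with open("deceptionguard-llama33-70b-l40/probe.bin", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
print(h.hexdigest())  # compare to the published sha256 above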

Reproduce

Three entry points; pick one. All three reproduce the same numbers.

shell
git clone https://github.com/OpenInterpretability/notebooks.git
pip install openinterp
openinterp probebench reproduce openinterp/deceptionguard-llama33-70b-l40

Honest scope

Limits derived from the evaluation data above. We think these are the honest constraints; revisions land as more reproducers do.

  • Trained / evaluated on: Apollo Insider Trading, Hypocrisy Gap. Performance outside these tasks is unmeasured here.
  • Eval-aware AUROC drop: 0.079 AUROC averaged across tasks (uncorrected − corrected for the eval-awareness confound, arXiv:2509.13333).
  • Distribution-shift drop: 0.144 AUROC averaged across tasks (in-distribution − long-context / OOD). Both drops are recomputed in the sketch after this list.
  • Cross-model fit (mean Pearson_CE): 0.485 — values below 0.4 suggest retraining per architecture; values above 0.7 suggest the direction is shared.
  • Probe attaches at: L40 · last_token of Llama-3.3-70B. Other layers / positions are out of scope unless re-trained.
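
Both averaged drops follow directly from the per-task table:

python
# Recompute the averaged eval-aware and distribution-shift drops from the
# per-task AUROCs reported above.
tasks = {
    "apollo_insider_trading": {"auroc": 0.978, "eval_aware": 0.910, "dist_shift": 0.840},
    "hypocrisy_gap":          {"auroc": 0.800, "eval_aware": 0.710, "dist_shift": 0.650},
}
ea = sum(t["auroc"] - t["eval_aware"] for t in tasks.values()) / len(tasks)
ds = sum(t["auroc"] - t["dist_shift"] for t in tasks.values()) / len(tasks)
print(round(ea, 3), round(ds, 3))  # 0.079 0.144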

Citations & endorsements

Apollo Research (method): Goldowsky-Dill, Chughtai, Heimersheim, Hobbhahn (ICML 2025).

ProbeBench v0.0.1 · 5 probes registered · 11 evaluations · 7 transfer measurements. Schema and weights subject to revision.