pending review · linear · Deception · Apache-2.0

DeceptionGuard (Apollo re-impl)

Re-implementation of the Apollo Research deception probe (published AUROC 0.96–0.999)

Re-implementation of the Apollo Research deception-detection method on Llama-3.3-70B-Instruct. Uses paired honest/deceptive contrast pairs from insider-trading and werewolf scenarios. Citation: Goldowsky-Dill, Chughtai, Heimersheim, Hobbhahn (ICML 2025).
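
The shipped artifact is an sklearn-compatible probe plus scaler (see Artifact below), so the fit plausibly resembles the minimal sketch that follows. All names, shapes, and the synthetic data are placeholders, not the actual training pipeline:

python
# Hypothetical sketch: fit a linear probe on paired honest/deceptive
# activations. Data, shapes, and names below are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
d_model = 8192  # Llama-3.3-70B residual width
X_honest = rng.normal(size=(150, d_model))           # stand-in activations
X_deceptive = rng.normal(0.05, size=(150, d_model))  # stand-in activations

X = np.vstack([X_honest, X_deceptive])
y = np.array([0] * len(X_honest) + [1] * len(X_deceptive))

scaler = StandardScaler().fit(X)
probe = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)
p_deceptive = probe.predict_proba(scaler.transform(X))[:, 1]  # P(positive_class)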

ProbeScore
0.785
Deception #1 · Global #2
8 weighted axes. Subject to revision as more reproducers land.

Quickstart

Three lines via the openinterp.probebench SDK. probe.score returns the probe's P(positive_class) for a tensor of activations captured at the layer/position listed under Architecture & artifact below (layer 40, last token).

python
from openinterp import probebench
probe = probebench.load("openinterp/deceptionguard-llama33-70b-l40")
score = probe.score(activations)  # → P(positive_class)
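
The activations argument is whatever you capture at that hook point. A minimal capture sketch via Hugging Face transformers hidden states; the loading details and prompt are assumptions, and openinterp's own capture path may differ:

python
# Hypothetical capture sketch: layer-40 residual stream at the last token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.3-70B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto"
)
inputs = tok("I did not see the insider memo.", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
# hidden_states[0] is the embedding layer, so index 40 = output of layer 40
activations = out.hidden_states[40][:, -1, :]  # (batch, d_model=8192)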

ProbeScore — 8 weighted axes

Composite metric in [0, 1]. No single axis can dominate; we think this discourages single-metric optimization. Numbers update as new evaluations land.

ProbeScore breakdown (total 0.785):
AUROC 0.222 · Eval-aware 0.146 · Dist-shift 0.089 · Calibration 0.087 · Transfer 0.049 · Goodhart-resistance 0.050 · Latency 0.092 · License 0.050

ProbeScore — 8-axis breakdown

ProbeBench v0.0.2 · 8 axes weighted to sum 1.0

Visual breakdown of where this probe scores high versus low across all 8 axes. The polygon's size tracks the total score; its shape shows which axes carry it (and which it sacrifices).

[Radar chart: per-axis scores, tabulated below.]
Axis                  Score   Weight   Contribution
AUROC                 0.889   0.25     0.222
Eval-aware            0.810   0.18     0.146
Dist-shift            0.745   0.12     0.089
Calibration           0.870   0.10     0.087
Cross-model           0.485   0.10     0.049
Goodhart-resistance   0.500   0.10     0.050
Latency               0.919   0.10     0.092
License               1.000   0.05     0.050

Per-task evaluation

Headline AUROC alongside the eval-awareness-corrected and distribution-shift versions for each task. Reproducers run from a single notebook on a single Colab session.

Task                                             n     AUROC [95% CI]       Eval-aware corrected   Dist-shift   ECE     FPR@99%TPR   Latency
Apollo Insider Trading (Deception · deceptive)   150   0.978 [0.96, 0.99]   0.910                  0.840        0.040   0.012        2.1 ms
Hypocrisy Gap (Reasoning · unfaithful_cot)       180   0.800 [0.72, 0.87]   0.710                  0.650        0.090   0.100        2.1 ms
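
The AUROC, ECE, and FPR@99%TPR columns follow standard definitions; a sketch of how a reproducer might compute them from per-example labels and probe scores (toy data, not the ProbeBench harness):

python
# Standard-metric sketch: AUROC, expected calibration error (10 equal-width
# bins), and FPR at the threshold where TPR first reaches 99%. Toy data.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def fpr_at_tpr(y, p, target=0.99):
    fpr, tpr, _ = roc_curve(y, p)
    return float(fpr[np.searchsorted(tpr, target)])

def ece(y, p, n_bins=10):
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    return sum(
        (bins == b).mean() * abs(p[bins == b].mean() - y[bins == b].mean())
        for b in range(n_bins) if (bins == b).any()
    )

y = np.array([0, 0, 0, 1, 1, 1])
p = np.array([0.1, 0.3, 0.6, 0.4, 0.8, 0.9])
print(roc_auc_score(y, p), fpr_at_tpr(y, p), ece(y, p))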

Cross-model transfer (Pearson_CE)

Pearson correlation of paired ablation effects across models. ≥ 0.7 suggests the probe's direction is shared; 0.4–0.7 suggests partial transfer; < 0.4 suggests retraining per architecture.

Source          Target        Pearson_CE   Transfer AUROC
Llama-3.3-70B   Qwen3.6-27B   0.510        0.790
Llama-3.3-70B   Gemma-3-27B   0.460        0.740
Mean Pearson_CE for this probe (excluding self-baseline): 0.485
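
Pearson_CE is a plain Pearson correlation over paired per-prompt ablation effects (e.g., change in cross-entropy when the probe direction is ablated in each model). A minimal sketch; the effect values below are placeholders, not measured data:

python
# Hypothetical sketch: correlate ablation effects of the probe direction
# across two models. Effect values are placeholders.
import numpy as np

effects_source = np.array([0.42, 0.10, 0.33, 0.05, 0.21])  # Llama-3.3-70B
effects_target = np.array([0.38, 0.15, 0.25, 0.09, 0.18])  # target model

pearson_ce = np.corrcoef(effects_source, effects_target)[0, 1]
# >= 0.7: direction likely shared; 0.4-0.7: partial; < 0.4: retrain per arch
print(round(float(pearson_ce), 3))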

Architecture & artifact

Base model

Llama-3.3-70B
meta-llama/Llama-3.3-70B-Instruct
Family
Llama
Params
70B
Architecture
Dense transformer (instruct-tuned)
Layers
80
d_model
8192
Weights license
custom
Probe attaches at
Layer 40 · last_token
huggingface.co/meta-llama/Llama-3.3-70B-Instruct

Artifact

DeceptionGuard weights
sklearn-compatible probe + scaler + meta.json
Params
524,288
Size
2.0 MB
License
Apache-2.0
Released
2026-05-XX
sha256
dc4f2a8b6e5d…9d8c7b6a
Hash matches the artifact at huggingface.co/datasets/caiovicentino1/DeceptionGuard-linearprobe-llama33-70b. Recompute via openinterp probebench verify openinterp/deceptionguard-llama33-70b-l40.
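
If you prefer not to use the CLI, the same check is a few lines of Python (the file path below is a placeholder; compare the output to the published digest):

python
# Recompute the artifact's sha256 locally; the path is a placeholder.
import hashlib

h = hashlib.sha256()
with open("deceptionguard-llama33-70b-l40/probe.bin", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
print(h.hexdigest())  # compare to the published sha256 above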

Reproduce

Three entry points; pick one. All three reproduce the same numbers.

shell
git clone https://github.com/OpenInterpretability/notebooks.git
pip install openinterp
openinterp probebench reproduce openinterp/deceptionguard-llama33-70b-l40

Honest scope

Limits derived from the evaluation data above. We think these are the honest constraints; revisions land as more reproducers do.

  • Trained / evaluated on: Apollo Insider Trading, Hypocrisy Gap. Performance outside these tasks is unmeasured here.
  • Eval-aware AUROC drop: 0.079 AUROC averaged across tasks (uncorrected − corrected for the eval-awareness confound, arXiv:2509.13333).
  • Distribution-shift drop: 0.144 AUROC averaged across tasks (in-distribution − long-context / OOD). Both drops are recomputed in the sketch after this list.
  • Cross-model fit (mean Pearson_CE): 0.485 — values below 0.4 suggest retraining per architecture; values above 0.7 suggest the direction is shared.
  • Probe attaches at: L40 · last_token of Llama-3.3-70B. Other layers / positions are out of scope unless re-trained.
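
Both averaged drops follow directly from the per-task table:

python
# Recompute the averaged eval-aware and distribution-shift drops from the
# per-task AUROCs reported above.
tasks = {
    "apollo_insider_trading": {"auroc": 0.978, "eval_aware": 0.910, "dist_shift": 0.840},
    "hypocrisy_gap":          {"auroc": 0.800, "eval_aware": 0.710, "dist_shift": 0.650},
}
ea = sum(t["auroc"] - t["eval_aware"] for t in tasks.values()) / len(tasks)
ds = sum(t["auroc"] - t["dist_shift"] for t in tasks.values()) / len(tasks)
print(round(ea, 3), round(ds, 3))  # 0.079 0.144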

Citations & endorsements

Apollo Research (method): Goldowsky-Dill, Chughtai, Heimersheim, Hobbhahn (ICML 2025).

ProbeBench v0.0.1 · 5 probes registered · 11 evaluations · 7 transfer measurements. Schema and weights subject to revision.