live · linear · Reasoning · Apache-2.0

ReasonGuard v0.2

Multi-bench training tested (FabricationGuard methodology) and FALSIFIED for reasoning probes — domain-bound at all difficulty levels

Linear LR probe at L55/mid_think of Qwen3.6-27B reasoning-mode generation. v0.2 trains multi-bench (GSM8K + StrategyQA + MATH combined, 455 samples, 45.8% hallucination rate). Within-bench AUROC is 0.908 on held-out GSM8K, an improvement over v0.1 (0.888). But cross-domain transfer FAILS: 0.612 on StrategyQA (commonsense) and 0.500 (chance) on MATH (advanced), with AUROC degrading as task difficulty rises. The faithfulness signal at this deep residual-stream position is more strongly domain-bound than multi-bench training can compensate for. The multi-bench training thesis, which worked for FabricationGuard cross-task transfer at 0.882, does not carry over to reasoning-faithfulness probes. This largely negative result is registered as a canonical case study of ProbeBench anti-Goodhart norms: both v0.1 and v0.2 numbers are reported without spin.

by Caio Vicentino · OpenInterp · 2026-04-29
ProbeScore
0.626
Reasoning #1 · Global #5
8 weighted axes. Subject to revision as more reproducers land.

Quickstart

Three lines via the openinterp.probebench SDK. probe.score() returns the probe's P(positive_class) for a tensor of activations captured at the layer/position below.

python
from openinterp import probebench
probe = probebench.load("openinterp/reasonguard-qwen36-27b-l55-mid_think")
score = probe.score(activations)  # → P(positive_class)
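
score() expects activations you capture yourself. Below is a minimal sketch of one way to do that with Hugging Face transformers, returning the layer-55 hidden state at a mid-reasoning token. The output_hidden_states indexing, the use of the sequence midpoint as a stand-in for the probe's mid_think position, and the prompt are illustrative assumptions, not part of the SDK.

python
# Sketch: capture an activation vector for the probe (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from openinterp import probebench

model_id = "Qwen/Qwen3.6-27B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
probe = probebench.load("openinterp/reasonguard-qwen36-27b-l55-mid_think")

prompt = "Solve step by step: ..."   # reasoning-mode prompt (placeholder)
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_hidden_states=True)

h55 = out.hidden_states[55]          # (1, seq_len, 5120): residual stream after layer 55 (assumed indexing)
mid = h55.shape[1] // 2              # crude stand-in for the registered mid_think position
activations = h55[:, mid, :]

score = probe.score(activations)     # P(positive_class)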

ProbeScore — 8 weighted axes

Composite metric in [0, 1]. No single axis can dominate; we think this discourages single-metric optimization. Numbers update as new evaluations land.

ProbeScore breakdown (per-axis contributions): 0.626
AUROC: 0.168
Eval-aware: 0.103
Dist-shift: 0.064
Calibration: 0.061
Transfer: 0.050
Goodhart-resistance: 0.030
Latency: 0.100
License: 0.050

ProbeScore — 8-axis breakdown

ProbeBench v0.0.2 · 8 axes weighted to sum 1.0

Visual breakdown of where this probe scores high vs low across all 8 axes. Polygon area is proportional to total score; shape reveals which axes carry it (and which ones it sacrifices).

[Radar chart: per-axis scores; values tabulated below.]
AUROC: 0.673 × 0.25 = 0.168
Eval-aware: 0.572 × 0.18 = 0.103
Dist-shift: 0.537 × 0.12 = 0.064
Calibration: 0.607 × 0.10 = 0.061
Cross-model: 0.500 × 0.10 = 0.050
Goodhart-resistance: 0.300 × 0.10 = 0.030
Latency: 1.000 × 0.10 = 0.100
License: 1.000 × 0.05 = 0.050
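
For reference, the composite is just the weighted sum of these axis scores, with weights summing to 1.0. A minimal sketch under that reading; the axis keys and weight values are taken from the breakdown above and are subject to the same revisions as the schema itself.

python
# Sketch: ProbeScore as a weighted sum of axis scores (weights sum to 1.0).
AXIS_WEIGHTS = {
    "auroc": 0.25, "eval_aware": 0.18, "dist_shift": 0.12, "calibration": 0.10,
    "cross_model": 0.10, "goodhart_resistance": 0.10, "latency": 0.10, "license": 0.05,
}

def probescore(axis_scores: dict) -> float:
    assert abs(sum(AXIS_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(w * axis_scores[axis] for axis, w in AXIS_WEIGHTS.items())

reasonguard_v02 = {
    "auroc": 0.673, "eval_aware": 0.572, "dist_shift": 0.537, "calibration": 0.607,
    "cross_model": 0.500, "goodhart_resistance": 0.300, "latency": 1.000, "license": 1.000,
}
print(round(probescore(reasonguard_v02), 3))  # 0.626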

Per-task evaluation

Headline AUROC alongside the eval-awareness-corrected and distribution-shift versions for each task. Reproducers run from a single notebook on a single Colab session.

GSM8K (Reasoning · wrong_answer): n = 90 · AUROC 0.908 · eval-aware corrected 0.772 · dist-shift 0.612 · ECE 0.100 · FPR@99%TPR 0.180 · latency 1.0 ms · reproducer: open
StrategyQA (Reasoning · wrong_answer): n = 45 · AUROC 0.612 · eval-aware corrected 0.520 · dist-shift 0.500 · ECE 0.170 · FPR@99%TPR 0.520 · latency 1.0 ms
MATH (Reasoning · wrong_answer): n = 60 · AUROC 0.500 · eval-aware corrected 0.425 · dist-shift 0.500 · ECE 0.320 · FPR@99%TPR 0.990 · latency 1.0 ms
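
A hedged sketch of how the per-task columns could be computed from raw probe scores and labels; the exact ProbeBench harness (confidence-interval bootstrap, ECE binning, the FPR@99%TPR threshold rule) may differ from this scikit-learn version.

python
# Sketch: AUROC, 10-bin ECE, and FPR at the first threshold reaching 99% TPR,
# from per-example probe scores y_score and binary hallucination labels y_true.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def task_metrics(y_true, y_score, n_bins=10):
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)

    auroc = roc_auc_score(y_true, y_score)

    fpr, tpr, _ = roc_curve(y_true, y_score)
    fpr_at_99tpr = fpr[np.argmax(tpr >= 0.99)]   # lowest FPR once TPR reaches 0.99

    # Expected calibration error over equal-width probability bins.
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(y_score, bins[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_score[mask].mean())

    return {"auroc": auroc, "ece": ece, "fpr_at_99tpr": fpr_at_99tpr}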

Cross-model transfer (Pearson_CE)

Pearson correlation of paired ablation effects across models. ≥ 0.7 suggests the probe's direction is shared; 0.4–0.7 suggests partial transfer; < 0.4 suggests retraining per architecture.

Source Qwen3.6-27B → Target Qwen3.6-27B: Pearson_CE 1.000 · Transfer AUROC 0.888 · Notes: same model, identity baseline (within-bench GSM8K).
Mean Pearson_CE for this probe (excluding self-baseline): 0.500
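
A minimal sketch of the statistic as described above, assuming you already have paired ablation effects (e.g. the change in cross-entropy when the probe direction is ablated) for the same prompts on a source and a target model; producing those effect vectors is out of scope here.

python
# Sketch: Pearson_CE = Pearson correlation of paired ablation effects across models.
import numpy as np
from scipy.stats import pearsonr

def pearson_ce(source_effects, target_effects):
    # source_effects[i] / target_effects[i]: ablation effect (e.g. delta cross-entropy)
    # for prompt i when the probe direction is ablated in each model.
    r, _ = pearsonr(np.asarray(source_effects), np.asarray(target_effects))
    return float(r)

# >= 0.7: direction likely shared · 0.4–0.7: partial transfer · < 0.4: retrain per architecture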

Architecture & artifact

Base model

Qwen3.6-27B (Qwen/Qwen3.6-27B)
Family: Qwen
Params: 27B
Architecture: Hybrid GDN + Gated-Attn (dense, reasoning)
Layers: 64
d_model: 5120
Weights license: Apache-2.0
Probe attaches at: Layer 55 · mid_think
huggingface.co/Qwen/Qwen3.6-27B

Artifact

ReasonGuard weights (sklearn-compatible probe + scaler + meta.json)
Params: 312,000
Size: 1.2 MB
License: Apache-2.0
Released: 2026-04-29
sha256: rg9d3b5f2a0e…e5d4c3b2
Hash matches the artifact at huggingface.co/datasets/caiovicentino1/ReasoningGuard-linearprobe-qwen36-27b. Recompute via openinterp probebench verify openinterp/reasonguard-qwen36-27b-l55-mid_think.
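
If you want to check the digest without the CLI, one hashlib call over the downloaded file is enough; the local filename below is a placeholder, not the artifact's actual name.

python
# Sketch: recompute the artifact's sha256 locally and compare to the digest above.
import hashlib
from pathlib import Path

artifact = Path("reasonguard_probe.joblib")   # placeholder path to the downloaded file
digest = hashlib.sha256(artifact.read_bytes()).hexdigest()
print(digest)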

Reproduce

Three entry points. Pick one. Each path lands at the same numbers.

shell
git clone https://github.com/OpenInterpretability/notebooks.git
pip install openinterp
openinterp probebench reproduce openinterp/reasonguard-qwen36-27b-l55-mid_think

Honest scope

Limits derived from the evaluation data above. We think these are the honest constraints; revisions land as more reproducers do.

  • Trained / evaluated on: GSM8K, StrategyQA, MATH. Performance outside these tasks is unmeasured here.
  • Eval-aware AUROC drop: 0.101 averaged across tasks (uncorrected − corrected for the eval-awareness confound, arXiv:2509.13333).
  • Distribution-shift AUROC drop: 0.136 averaged across tasks (in-distribution − long-context / OOD).
  • Cross-model fit (mean Pearson_CE): 0.500 — values below 0.4 suggest retraining per architecture; values above 0.7 suggest the direction is shared.
  • Probe attaches at: L55 · mid_think of Qwen3.6-27B. Other layers / positions are out of scope unless re-trained.

ProbeBench v0.0.1 · 5 probes registered · 11 evaluations · 7 transfer measurements. Schema and weights subject to revision.