Entity-recognition v0.0.1 — the failed first try
How a 2× tokenization confound gave a fake AUROC=1.0
Educational "how NOT to do it" notebook. Synthetic Slavic-style fake-entity names had ~2× the token count of famous entities, so the best "feature" was simply counting subword tokens. Posted unchanged so the failure mode is reproducible. The fix is in 24b.
⏱ ~30 min · Colab T4
▸ Colab Free
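A minimal sketch of the confound, not the notebook's code: if the "unknown" names always tokenize to more subword pieces than the "known" ones, raw token count alone separates the classes perfectly. The token counts below are hypothetical illustration values, and `auroc` is a small helper written for this example.

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical subword-token counts per entity name (illustration only).
known_token_counts = [2, 3, 3, 4, 2, 3]   # famous entities
fake_token_counts = [6, 7, 5, 8, 6, 7]    # synthetic names, roughly 2x longer

scores = known_token_counts + fake_token_counts
labels = [0] * len(known_token_counts) + [1] * len(fake_token_counts)

# Token count is the only "feature" here, yet the classes never overlap.
confound_auroc = auroc(scores, labels)
print(confound_auroc)  # 1.0
```

A length-matched control set, or a check that the token-count baseline scores near 0.5, is one way to rule this out before trusting a feature's AUROC.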
Ferrando 2024 replication on Qwen3.6-27B
AUROC 0.8379 on real Wikidata entities (vs 0.732 baseline)
The methodology fix for 24. Uses real known/unknown Wikidata entities from javiferran/sae_entities, labels via attribute recall on the 27B model, applies the Pile noise filter (>2% rate dropped), and ranks single latents by Cohen's d. Surfaces feature L11/f61723 — first proper Ferrando replication at 27B scale.
⏱ ~2 h · Colab A100
▸ Colab Pro · A100 recommended
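The filter-then-rank step can be sketched as follows. This is an illustration under stated assumptions, not the notebook's implementation: the dictionary layout, the function names, and the choice to sort by absolute Cohen's d are all hypothetical, and the 2% threshold is taken from the description above.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference with a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled if pooled > 0 else 0.0

def rank_latents(acts_unknown, acts_known, noise_rate, max_noise_rate=0.02):
    """Drop latents that fire on more than max_noise_rate of generic text,
    then rank the rest by |d| (assumption: sign is ignored when ranking)."""
    kept = [k for k in acts_unknown if noise_rate[k] <= max_noise_rate]
    ds = {k: cohens_d(acts_unknown[k], acts_known[k]) for k in kept}
    return sorted(ds.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Toy activations for three hypothetical latents.
acts_unknown = {"a": [1.0, 1.2, 0.9, 1.1], "b": [0.1, 0.2, 0.1, 0.2], "c": [2.0, 2.1, 1.9, 2.0]}
acts_known = {"a": [0.1, 0.0, 0.2, 0.1], "b": [0.1, 0.2, 0.2, 0.1], "c": [0.0, 0.1, 0.0, 0.1]}
noise_rate = {"a": 0.005, "b": 0.01, "c": 0.10}  # "c" fires on 10% of generic text

ranking = rank_latents(acts_unknown, acts_known, noise_rate)
print(ranking)
```

In this toy data latent "c" has the largest separation but is removed by the noise filter, which is the point of applying the filter before ranking.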
Single-feature steering — the null result
Clamp ±5 on f61723 · no calibration effect
First steering test: clamp the entity-recognition feature to ±5 (additive ±2) at L11 and check whether refusal rate on unknown entities moves. It does not. Detection ≠ control. Sets up the multi-feature experiments in 26 / 27.
⏱ ~45 min · Colab A100
▸ Colab Pro · A100 recommended
Multi-feature steering — top-K (no controls)
−15pp on unknown refusal · would have shipped overclaimed
Ablate top-K (K∈{5,20,50,200}) features sorted by Cohen's d. Naive read: −15pp on unknown-entity refusal at K=200 — looks like a calibration knob. We almost shipped it before adding controls. The honest version is 27.
⏱ ~1.5 h · Colab A100
▸ Colab Pro · A100 recommended
Multi-feature steering with full controls
Random-K null + direction-sort + Claude judge → it induces hallucination
The walk-back. Six controls: random-K (R=30 draws), direction-sorted (top positive-d / top negative-d / mixed |d|), 3-way split, anti-feature, Claude Haiku judge, permutation test. Top-K is 4-8σ outside the random null, but the judge shows the "less hedging" is confident-wrong answers (62%→77% on incorrect refusal), not improved correctness. This is a hallucination-induction mechanism, not a calibration knob.
⏱ ~3 h · Colab A100
▸ Colab Pro · A100 + ANTHROPIC_API_KEY
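A rough sketch of the random-K null comparison, assuming the effect of ablating a feature set can be wrapped in a single function. Here `effect_fn` and the toy per-feature effects are hypothetical stand-ins for the real measurement (change in refusal rate after ablation); only the R=30 draw count comes from the description above.

```python
import random
from statistics import mean, stdev

def null_z_score(effect_fn, top_k_ids, all_ids, n_draws=30, seed=0):
    """Compare the top-K effect with the effect of n_draws random size-K sets."""
    rng = random.Random(seed)
    k = len(top_k_ids)
    null = [effect_fn(rng.sample(all_ids, k)) for _ in range(n_draws)]
    observed = effect_fn(top_k_ids)
    z = (observed - mean(null)) / stdev(null)
    return z, observed, null

# Toy stand-in: 20 "strong" features each shift refusal by -1pp,
# the remaining features contribute small deterministic jitter.
all_ids = list(range(1000))
toy_effect = {i: (-0.01 if i < 20 else 0.0001 * ((i % 7) - 3)) for i in all_ids}

def effect_fn(feature_ids):
    # Hypothetical: in a real run this would ablate the features and re-measure refusal.
    return sum(toy_effect[i] for i in feature_ids)

z, observed, null = null_z_score(effect_fn, list(range(20)), all_ids)
print(f"observed={observed:.3f}, z={z:.1f}")
```

A large |z| only shows that the top-K set is special relative to random sets; as the card notes, it takes the judge to tell whether the shift is better calibration or more confident errors.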
Paper baselines — Ferrando 2024 on Qwen3.6-27B
L31/f34957 0.81 · LR 0.887 · diff-of-means 0.859
The headline-numbers notebook for the ICML MI Workshop paper-1. Ferrando-style entity-recognition replication with 607 entities, per-layer scan across all 64 layers for linear probe + diff-of-means baselines, 95% bootstrap CI, HF resumable checkpoints. Cleanly compares single SAE feature vs supervised LR ceiling vs cross-bench generalization.
⏱ ~3 h · Colab A100
▸ Colab Pro · A100 recommended
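A 95% bootstrap confidence interval for an AUROC can be sketched as below. This is a percentile bootstrap over entities on toy scores, written as an assumption about the general approach; the notebook's resampling scheme, sample count, and variable names may differ.

```python
import random

def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample entities with replacement, recompute AUROC."""
    rng = random.Random(seed)
    n = len(scores)
    boots = []
    while len(boots) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # skip resamples that contain a single class
            continue
        boots.append(auroc([scores[i] for i in idx], ys))
    boots.sort()
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Toy scores (illustration only): label 1 = unknown entity, 0 = known entity.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.5, 0.3, 0.2, 0.1, 0.35]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

point = auroc(scores, labels)
lo, hi = bootstrap_ci(scores, labels)
print(f"AUROC={point:.2f}, 95% CI=[{lo:.2f}, {hi:.2f}]")
```

Reporting the interval alongside the point estimate makes it easier to judge whether gaps such as single feature vs linear probe exceed resampling noise.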
Sensitivity — refusal-only vs Ferrando labelling
Same residual capture · 2 labelling rules → which signal survives?
Ablation companion to 28. Re-uses the cached residual capture and swaps the labelling rule (refusal-only vs Ferrando-style confabulation-as-unknown). Builds the reviewer defence: shows that the L31/f34957 0.81 AUROC is robust to the choice of labelling rule, and falsifies an earlier "L11 best" claim from v0.0.2.
⏱ ~30 min · Colab T4
▸ Colab Free