2026-04-25 · 18 min read · Caio Vicentino · OpenInterpretability

Entity-recognition features in Qwen3.6-27B — a replication, and a methodology lesson

AUROC 0.84 for the "I know this entity" feature on Qwen3.6-27B — vs Ferrando 2024's 0.73 on Gemma-2-2B-IT. A two-day arc in three acts: a tokenization confound that gave a fake AUROC = 1.0, single-feature steering that didn't move calibration, and multi-feature top-200 ablation that beats the random-K null at 4–8σ — but the LLM judge shows the "less hedging" is purely additional incorrect answers, not correct ones. We found a hallucination-induction mechanism, not a calibration knob.

TL;DR

  • On caiovicentino1/qwen36-27b-sae-papergrade, the best single SAE latent classifies known vs. unknown Wikidata entities with AUROC 0.8379 at layer 11 — beating Ferrando 2024's 0.732 on Gemma-2-2B-IT (their L13).
  • The signal peaks at L11 (early stack, ~17% depth), not mid-stack. Different from Gemma-2-2B-IT's L9-mid finding. We don't have causal evidence for why yet — one hypothesis is that 27B reasoning-tuned models commit to entity recognition earlier; another is just architectural variance. Worth confirming on more 27B+ reasoning models.
  • The first run got AUROC = 1.0. That was a tokenization confound, not a breakthrough. The debug + fix is the more useful part of this post.
  • All artifacts published under Apache-2.0: hallucination_v0_0_2.json on HF, notebook 24b_hallucination_v002_ferrando_proper.ipynb.

What we're replicating

Ferrando et al. ICLR 2025, “Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models” showed that on Gemma-2-2B and Gemma-2-9B, individual SAE features can act as a probe for whether the model “knows” a given entity. Ablating those features changes the model's confabulation rate. It's the cleanest existing argument that SAE-based interpretability can produce a calibration signal for hallucinations.

We have three paper-grade SAEs on Qwen3.6-27B (HF / 200M tokens / L11, L31, L55 / TopK + AuxK / Apache-2.0) and wanted to know whether the same trick works on a much larger, reasoning-tuned model. We didn't know what the answer would be. The interesting bit was finding out.

v0.0.1 — how we got AUROC = 1.0 the wrong way

The first version of the test built a dataset by hand:

  • 200 unambiguously famous people (Tom Hanks, Albert Einstein, Lionel Messi, …)
  • 200 procedurally-generated synthetic names with Slavic / Latinate roots (Vlasik Korpel, Krenadia Brovner, …) — designed to look real but have zero Wikipedia presence

For each entity we ran a contrastive prompt, captured the SAE feature activations at the entity's last token, and trained a linear probe on the top-50 features by separation. The result:

AUROC = 1.0000 on all three layers (L11 / L31 / L55).

That number is suspicious. Real entity-recognition probes on 2B–9B models live in the 0.7–0.85 range. Perfect separation in interp is almost always a confound.

The diagnosis was a 30-second tokenization check:

KNOWN   mean=3.30 tokens   median=3   range=1–7
UNKNOWN mean=6.03 tokens   median=6   range=4–10
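
The check itself is a few lines — a sketch, assuming a Hugging Face tokenizer and the two v0.0.1 name lists (known, unknown) are in scope; the repo id is hypothetical:

import statistics
from transformers import AutoTokenizer

# Sketch of the 30-second tokenization check. `known` / `unknown` are
# the v0.0.1 name lists; use whichever tokenizer the model ships with.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B")  # hypothetical id

def length_stats(names):
    lengths = [len(tokenizer.encode(n, add_special_tokens=False)) for n in names]
    return (statistics.mean(lengths), statistics.median(lengths),
            min(lengths), max(lengths))

for label, names in (("KNOWN", known), ("UNKNOWN", unknown)):
    mean, median, lo, hi = length_stats(names)
    print(f"{label:7s} mean={mean:.2f} tokens   median={median}   range={lo}-{hi}")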

Synthetic names were tokenizing to roughly twice as many tokens as real names. Their last tokens were random subword fragments ('cu', 'nov', 'ic') while the famous-name last tokens were real word endings. The SAE was happily detecting that — the highest-separation feature fired on “long sequence of rare subwords” via attention, not on entity recognition.

The wrong AUROC = 1.0, in other words, was perfectly explained by the surface-tokenization difference. We never had to think about what the model knew.

The lesson

We had read Ferrando's abstract and our own prior research notes — but we had not read the dataset-construction section of the paper, nor opened their javiferran/sae_entities repository. A 30-minute look at either would have surfaced the critical detail: their known and unknown entities both come from the same Wikidata distribution. They never mix synthetic names with real ones.

We've added a hard rule to our memory: before writing a notebook that replicates a published method, read the paper's dataset section and any open replication first. Cost: 30 minutes. Saves: hours of compute on a result that doesn't generalize.

v0.0.2 — what worked

The corrected pipeline mirrors Ferrando's controls:

  1. Real Wikidata entities only. We pulled their pre-processed JSON files (player.json, movie.json, city.json, song.json) — same surface distribution for both classes. No synthetic names anywhere.
  2. Model-defined labels. For each candidate entity, we asked Qwen3.6-27B three attribute questions (place of birth?, director?, etc.) and scored the answers. ≥2/3 correct + zero refusals → known. 0/3 correct + ≥1 refusal → unknown. Anything in between → discarded. That follows their filter_known_unknown protocol exactly.
  3. Pile noise filter. We ran ~2 000 random tokens from NeelNanda/pile-10k through each SAE and dropped any feature that fired on more than 2% of them. This explicitly removes generic surface features — including the kind that drove our v0.0.1 result. Roughly 1 250 features per layer were filtered out.
  4. Single-latent scoring. We didn't train a multi-feature linear probe. Following the paper, the "classifier" is just one feature's raw activation magnitude. Less powerful, more honest.
  5. Train-only feature selection. Top-100 candidate features were chosen by Cohen's d on the train split only, then their AUROC was measured on a disjoint test split. (A code sketch of steps 3–5 follows this list.)
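
For concreteness, here is roughly what steps 3–5 look like in code — a sketch under assumed array names and shapes, not the notebook verbatim:

import numpy as np
from sklearn.metrics import roc_auc_score

# Assumed shapes: acts_* are SAE activations at the entity's last token,
# (n_entities, n_features); y_* are labels (1 = unknown, 0 = known);
# pile_acts are activations on ~2,000 random Pile tokens.

# Step 3 — Pile noise filter: drop features firing on >2% of Pile tokens.
fire_rate = (pile_acts > 0).mean(axis=0)
keep = np.where(fire_rate <= 0.02)[0]

# Step 5 — rank by Cohen's d on the train split only.
a = acts_train[y_train == 1][:, keep]   # unknown entities
b = acts_train[y_train == 0][:, keep]   # known entities
pooled_sd = np.sqrt((a.std(axis=0) ** 2 + b.std(axis=0) ** 2) / 2) + 1e-9
d = (a.mean(axis=0) - b.mean(axis=0)) / pooled_sd
candidates = keep[np.argsort(-np.abs(d))[:100]]

# Step 4 — single-latent "classifier": one feature's raw activation,
# scored by AUROC on the disjoint test split (orientation-free).
aurocs = {}
for f in candidates:
    auc = roc_auc_score(y_test, acts_test[:, f])
    aurocs[f] = max(auc, 1 - auc)
best = max(aurocs, key=aurocs.get)
print(f"f{best}: AUROC = {aurocs[best]:.4f}")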

The final dataset, after Qwen3.6-27B did its own labelling on 1 000 candidates, was 134 known + 92 unknown entities (774 ambiguous / discarded). Smaller than we'd like, but enough to read.

The numbers

Layer (depth)   Best feature   Cross-type AUROC   Pile fire-rate of best
L11 (17%)       f61723         0.8379             0.50%
L31 (48%)       f54622         0.8085             —
L55 (86%)       f29703         0.7724             —

Reference: Ferrando 2024 reports AUROC ≈ 0.732 on Gemma-2-2B-IT at L13. Our 27B beats that across all three layers — the gap is largest at L11, where it is 0.106 above their number.

The headline activation distribution at the best layer is below: most known entities have near-zero activation on that feature, while unknowns spread to the right.

[Figure: L11/f61723 activation histogram, known vs. unknown. Single-latent AUROC = 0.8379; Pile-filtered (0.50% activation rate on random Pile tokens).]

Within-type sanity check

The class composition isn't balanced across entity types — 73 of the known entities are movies but only 32 are players, while unknowns are mostly players (53/92). To rule out a class-imbalance confound, we computed AUROC within each entity type separately:

Entity type   N (known + unknown)   Best within-type AUROC
player        32 + 53               0.9113 (L11)
movie         73 + 28               0.8461 (L31)
song          28 + 2                too few unknowns
city          1 + 9                 too few knowns

Both player and movie show strong separation independently, so the cross-type 0.84 isn't purely class imbalance. One caveat: within-type N is small, so we couldn't use a held-out test split — the within-type numbers are biased upward by feature selection on the same data they're evaluated on. The 0.8379 cross-type number is the only fully unbiased one.

The L11 finding

Ferrando reports the entity-recognition signal peaks around layer 9 of the 26-layer Gemma-2-2B-IT, then plateaus deeper. We see the strongest signal at L11 of the 64-layer Qwen3.6-27B — about 17% depth. By L31 (48%) the AUROC is 0.81 and by L55 (86%) it's 0.77.

Two hypotheses for why early-stack might be where the signal lives in this model:

  1. 27B + reasoning-tuning may push entity-token detection earlier — the model has “decided” whether it recognizes a name before it starts thinking.
  2. Architectural variance — Qwen3.6 dense transformer vs Gemma-2-2B might just route this kind of token-level lexical matching differently. Same task, different geometry.

We don't have enough data to pick between these. Running the same test on Llama-3.3-70B-Reasoning or Qwen3.6-35B-A3B (a different architecture) would adjudicate between them. That's the next experiment.

Caveats we're flagging up front

  • N is small. 134 + 92 = 226 entities is below the ≥500/class Ferrando recommends; CIs are wide, and the AUROC point estimate could move ±0.04 with a larger run (a bootstrap sketch follows this list).
  • Songs and cities under-represented. Our labelling pipeline was too strict on numeric attributes (population, elevation, year) — the model often paraphrases or rounds, and substring-match misses correct answers. A v0.0.3 should replace substring-match with a Claude-as-judge labeller, similar to Obalcells et al. 2025 (the Ferrando follow-up).
  • The top feature (f61723) was also our top feature in v0.0.1. The Pile filter and the corrected dataset distinguish a real signal from the confound, but the coincidence still suggests f61723 is partly an "uncommon-name" detector — not a clean semantic "I know this" circuit. A causal ablation experiment would tighten the interpretation.
  • Single-prompt activation. We capture the residual at one position in one prompt template per entity. Ferrando aggregates across attribute-specific templates. Our number is therefore noisier than theirs.
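
On the CI caveat above, the bootstrap we have in mind is simple — a sketch over the held-out test split, where acts_best (the best feature's activations) and y (labels) are assumed names:

import numpy as np
from sklearn.metrics import roc_auc_score

# Percentile-bootstrap 95% CI for the single-latent AUROC.
rng = np.random.default_rng(0)
boot = []
for _ in range(10_000):
    idx = rng.integers(0, len(y), size=len(y))
    if len(np.unique(y[idx])) < 2:   # resample must contain both classes
        continue
    boot.append(roc_auc_score(y[idx], acts_best[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"AUROC 95% CI ≈ [{lo:.3f}, {hi:.3f}]")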

Update — we tried the steering test (and it didn't work)

2026-04-25 follow-up: Same day as the original post. Causal evidence is the right next test, so we ran it. Result: predictive but not controllable. The feature's AUROC of 0.84 stands as a detection signal, but it isn't a steering knob.

We tested two interventions on L11/f61723 during generation, on 20 known + 20 unknown Wikidata entities (re-labelled with the v0.0.2 protocol):

  • Clamp ±5: force the feature to 0 (“treat as known”) or 5.0 (“treat as unknown”)
  • Additive ±2 (Anthropic biology-paper style): add ±2 × W_dec[f61723] to the residual at L11 — gentler, stays in-distribution (see the sketch after this list)
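
The additive variant, in sketch form — assuming a standard HF decoder-layer list at model.model.layers and an SAE object exposing W_dec; both are assumptions about the notebook's setup:

import torch

FEATURE, ALPHA = 61723, 2.0                 # f61723, additive strength ±2
direction = sae.W_dec[FEATURE]              # (d_model,) decoder direction

def steer_hook(module, inputs, output):
    # Add ±2 × W_dec[f61723] to the L11 residual stream during generation.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[11].register_forward_hook(steer_hook)
try:
    out = model.generate(**inputs, max_new_tokens=64)
finally:
    handle.remove()                         # always detach the hook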

Refusal rates across conditions:

Class            Baseline   Ablate          Amplify
Known (n=20)     0.0%       0.0%            0.0%
Unknown (n=20)   15.0%      20.0% / 25.0%   15.0% / 10.0%

Cells show additive ±2 / clamp ±5. Ablate on unknown was supposed to decrease refusal (treat as known); it increased instead. Amplify on known was supposed to increase refusal; nothing moved because the model never refused at baseline on these niche-but-real entities.

But here's the nuance worth keeping: the additive intervention changed the actual generation text in 60–65% of cases. The feature is causally active — it shifts what facts the model confabulates and how it phrases descriptions — just not on the binary refuse-vs-answer decision the AUROC promised.

Three readings of this, in increasing speculativeness:

  1. The feature is an "uncommon-token detector", not strictly an "I don't know" signal. Real causal effect on style and word choice, but it doesn't flip the model's binary commit-or-hedge behavior. AUROC was reading a correlate of the latent we wanted, not the latent itself.
  2. 27B reasoning-tuned models route calibration through circuits, not single features. Anthropic's Templeton et al. 2024 already flagged this on Claude 3 Sonnet — single-feature steering for high-level model behaviors is hit-or-miss.
  3. SAE feature decomposition is lossy. The 65k-feature dictionary may capture an aspect of the "I don't know" semantics without isolating the full causal pathway.

What this changes for use cases

  • Still valid: hallucination warning UI, RAG auto-trigger, production monitoring. Predictive use stands — AUROC 0.84 wasn't invalidated.
  • Invalidated: steering API ("amplify the feature to reduce confabulation"), RL reward shaping based on this feature, any "we control hallucination" story.
  • Open: circuit-level interventions (multiple features at once), attention-head steering, and constrained decoding remain to test. We didn't disprove that some intervention controls calibration — just that this single feature doesn't.

The HF artifact is at steering_v0_0_1.json; the notebook is 25_steering_f61723_calibration.ipynb.

Update 2 — multi-feature steering with proper controls

Resolved 2026-04-26. The first version of this section claimed a circuit-level effect and we walked it back the same day pending the random-feature ablation control. We've now run that control plus four others (notebook 27). The headline below replaces the walk-back caveat. The causal effect is real, but the direction reframes the whole story: our intervention induces hallucination, it does not improve calibration.

The result, in one paragraph

Ablating the top-200 entity-recognition features (ranked by Cohen's d on a held-out selection split, after the Pile noise filter) at L11 of Qwen3.6-27B shifts refusal rate on unknown entities by −16.7pp (top-|d| sweep) or −8.3pp (fires-on-unknown sweep). Both effects sit 4–8 standard deviations below the random-K null distribution (R=30 random draws of 200 features each). The anti-feature control (bottom-|d|) gives 0pp. So the ranking is selecting something causal — that part of the story stood up.

But the LLM-judge analysis on the resulting generations reveals what “reduced refusal” actually means: the model wasn't hedging because it knew it didn't know — it was hedging on top of a baseline that already confabulated 62-100% of the time when it spoke. After our intervention, the incorrect-answer rate on known entities went from 62% to 77%, and the correct-answer rate dropped from 8% to 0%. The “extra non-refusals” the intervention buys are uniformly extra hallucinations, not extra correct answers.

Sweep table

Ranking @ K=200                     Δ refusal (unknown)   vs random null      Verdict
top |d| (mixed)                     −16.7pp               ≈8σ below null      REAL
top neg d (fires-on-unknown)        −8.3pp                ≈4σ below null      REAL · direction predicted
top pos d (fires-on-known)          +8.3pp                p ≈ 0.93            within null
bottom |d| (anti-feature control)   0pp                   at null minimum     control passes ✓
random K=200 (R=30 draws)           +0.6pp ± 2.1pp        —                   null distribution

K∈{5, 20, 50} produced 0pp effect for both top-K and random-K — at our SAE width (65k latents), ablating 0.08% or fewer features is too sparse to disrupt the model in either selection regime. The signal lives at K=200 (~0.3%).
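
The σ numbers come from a comparison like this — a sketch with a hypothetical helper refusal_rate_with_ablation(features) that zeroes the given L11 SAE features during generation and returns the refusal rate on the unknown set; pool is the Pile-passing feature ids:

import numpy as np

K, R = 200, 30
rng = np.random.default_rng(0)

# Null: refusal-rate deltas for R random draws of K features from the
# Pile-passing pool.
null = np.array([
    refusal_rate_with_ablation(rng.choice(pool, size=K, replace=False))
    for _ in range(R)
]) - baseline_refusal

effect = refusal_rate_with_ablation(top_k_features) - baseline_refusal
z = (effect - null.mean()) / null.std()
print(f"Δ refusal = {effect * 100:+.1f}pp, {z:+.1f}σ vs random-K null")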

The judge finding — confabulation-vs-correct

We asked Claude Haiku to score every non-refusal generation as correct, incorrect, or unverifiable against Wikidata ground truth.

Condition                   n    Correct   Incorrect   Unverifiable
Baseline · KNOWN            13   8%        62%         31%
top_neg_d K=200 · KNOWN     13   0%        77%         23%
Baseline · UNKNOWN          7    0%        100%        0%
top_neg_d K=200 · UNKNOWN   8    0%        100%        0%

The intervention's “less hedging” effect is not a calibration improvement. Per-condition correctness either stays at zero or drops; per-condition wrong-answer rate goes up. We're not making the model more honest about what it knows — we're making it less honest about what it doesn't.
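
For reference, the judge call is roughly this — a sketch with the anthropic SDK; the prompt wording and one-word parsing are our paraphrase, not the notebook's exact prompt:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(entity, question, gold, generation):
    msg = client.messages.create(
        model="claude-3-haiku-20240307",   # any Haiku snapshot works
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                f"Ground truth — {question} for {entity}: {gold}\n"
                f"Model answer: {generation}\n"
                "Reply with exactly one word: correct, incorrect, or unverifiable."
            ),
        }],
    )
    return msg.content[0].text.strip().lower()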

How this connects to Ferrando 2024

Ferrando's steering experiments showed that amplifying "known" features on unknown entities causes the model to confabulate, and ablating them on known entities causes refusal. Our finding is the symmetric direction: ablating "unknown" features on unknown entities causes confabulation. Same mechanism, different end of the rope. Both directions are evidence for an entity-recognition circuit that gates calibrated refusal — and both directions show that perturbing the circuit doesn't produce honesty, it produces confidently wrong outputs.

Permutation-test caveat

One thing we want to flag honestly. We ran a label-permutation test on the Cohen's d ranking (1 000 shuffles of known/unknown labels, recompute the top-200, measure overlap with our real top-200). Random permutation gave a mean overlap of 36/200, vs 0.6 expected by chance over the Pile-passing feature pool. That's ~60× more overlap than chance — meaning a chunk of our "top entity-recognition features" is dominated by baseline feature popularity at L11, not by entity-recognition signal specifically. The causal effect is still real (random-K with the same K gives ~0pp, our top-K gives −8 to −16pp), but the "entity-recognition" interpretation of which features we're ablating is partly contaminated. A cleaner v0.0.3 ranking would residualize Cohen's d against feature firing rate first.
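
The permutation test itself, in sketch form, reusing the assumed arrays from the v0.0.2 sketch above (acts, y, keep):

import numpy as np

def top200(labels):
    # Cohen's d ranking over the Pile-passing pool, as in the real pipeline.
    a, b = acts[labels == 1][:, keep], acts[labels == 0][:, keep]
    pooled_sd = np.sqrt((a.std(axis=0) ** 2 + b.std(axis=0) ** 2) / 2) + 1e-9
    d = (a.mean(axis=0) - b.mean(axis=0)) / pooled_sd
    return set(keep[np.argsort(-np.abs(d))[:200]])

rng = np.random.default_rng(0)
real = top200(y)
overlaps = [len(real & top200(rng.permutation(y))) for _ in range(1000)]
# Chance overlap of two independent top-200 sets ≈ 200 * 200 / len(keep).
print(np.mean(overlaps), 200 * 200 / len(keep))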

Bottom line, after controls

Three things hold up after the controls:

  • Multi-feature SAE ablation has a real causal effect on entity-retrieval behavior at L11 of Qwen3.6-27B. Random-K, anti-feature, and direction-sorted controls all line up.
  • The effect direction is hallucination-induction, not calibration improvement. The model becomes more confidently wrong, not more honest. This is the mirror image of Ferrando 2024's known-side ablation result.
  • The L11 ranking is partly dominated by feature popularity, not pure entity-recognition signal. Causal finding survives this; mechanistic interpretation is more cautious.

Artifacts: multi_feature_steering_v0_0_2.json; notebook 27_multi_feature_steering_with_controls.ipynb.

Full story arc, after controls

  1. Predictive (AUROC 0.84): single SAE feature classifies known/unknown cleanly after Pile filter. Ship-able as detector for UI / RAG / monitoring.
  2. Single-feature steering: no effect on refusal. Clamping or perturbing f61723 alone changes ~60% of generations but doesn't shift refusal rate in the predicted direction.
  3. Multi-feature steering with controls: real causal effect, but hallucination-induction. The random-K control passes (top-K is 4–8σ outside the null at K=200; the anti-feature control gives 0pp), and the direction is as predicted (less refusal). But the judge analysis shows the "less refusal" is purely additional incorrect answers, not correct ones — the intervention makes the model more confidently wrong, not more honest.

What's next

  • Targeted-direction multi-feature. Top-K by |Cohen's d| mixes fires-on-known and fires-on-unknown features. Ablating only the fires-on-unknown ones (or only the fires-on-known) might give cleaner direction. Worth a notebook.
  • Larger held-out test. Re-run the pipeline at N ≥ 500/class, ideally with a Claude-as-judge labeller so we can use songs and cities. The 0.84 might tighten to 0.78 or stretch to 0.88 — either way, a real number.
  • Cross-model. Same test on Qwen3.6-35B-A3B (triple-hybrid arch) and Llama-3.3-70B (different family) to adjudicate the L11-early hypothesis.
  • Composite hallucination predictor. Combine the entity-recognition feature with our existing reasoning-quality probe (MCR Stage D, AUROC 0.78) to see whether they capture orthogonal signal.

Acknowledgments

The methodology is Ferrando et al. 2024; the dataset is theirs; the entire framing is theirs. We just ran it on a bigger model and reported what we found. Their javiferran/sae_entities repo is one of the cleanest research codebases we've worked with — most of the methodology details we needed were inline-commented. The successor paper Obalcells et al. 2025 (Ferrando is a co-author) extends this with a Claude-as-judge labeller and tests up to 70B; that's likely the right reference for a v0.1 follow-up.

Reproduce

# Notebook
git clone https://github.com/OpenInterpretability/notebooks
cd notebooks/notebooks
# Open 24b_hallucination_v002_ferrando_proper.ipynb on Colab
# Set HF_TOKEN in Secrets, "Runtime → Run all"

# Cost: ~$15 GPU + ~2h on RTX 6000 Pro 96GB
# Output: hallucination_v0_0_2.json + chart at the SAE HF repo

Comments, replications, push-back, “you missed control X” — open an issue on OpenInterpretability/notebooks or email hi@openinterp.org. Most useful would be: independent replication on a different 27B+ reasoning model.