Q2 2026

Compare

An N-way diff of reasoning. Put four models side by side on the same prompt, or four prompts on the same model, and see a heatmap of where their feature activations diverge. The killer mode for the reasoning-model era.

Why this matters

When Qwen gets a math question right and Gemma gets it wrong, the interesting question is not "which answer is correct?" It is "at what token did the reasoning diverge, and which feature flipped the outcome?" Compare mode answers this visually in 5 seconds.
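As a rough illustration of the question above, the sketch below finds the first token at which two runs diverge and the feature that differs most at that token. It is not Compare's implementation. It assumes both runs have already been aligned token by token and projected onto a shared feature dictionary, which is a nontrivial step when the models differ. The names `token_divergence` and `first_divergence`, the nested-list data layout, and the threshold are all hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: activations[t][f] is the activation of feature f at
# token position t. Both runs are assumed to share token positions and a
# common feature dictionary.
Activations = List[List[float]]


def token_divergence(a: Activations, b: Activations) -> List[float]:
    """Per-token L1 distance between two runs, over their shared length."""
    return [
        sum(abs(x - y) for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(a, b)
    ]


def first_divergence(
    a: Activations, b: Activations, threshold: float
) -> Optional[Tuple[int, int]]:
    """Return (token_index, feature_index) for the first token whose
    divergence exceeds the threshold, paired with the feature that differs
    most at that token. Return None if no token exceeds the threshold."""
    for t, score in enumerate(token_divergence(a, b)):
        if score > threshold:
            diffs = [abs(x - y) for x, y in zip(a[t], b[t])]
            return t, diffs.index(max(diffs))
    return None


# Toy activations: 3 tokens, 3 features per token (made-up numbers).
model_a = [[0.1, 0.0, 0.2], [0.1, 0.9, 0.2], [0.0, 0.8, 0.1]]
model_b = [[0.1, 0.0, 0.2], [0.1, 0.1, 0.3], [0.7, 0.0, 0.1]]

print(first_divergence(model_a, model_b, threshold=0.5))  # (1, 1)
```

The list returned by `token_divergence` is one row of the kind of heatmap described earlier; computing it for every pair of runs would give the N-way view.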

Thinking-trace mode

For reasoning models with visible thinking traces, Compare aligns the <think> blocks across models and highlights semantically equivalent steps. "Both models considered path A, only model B pursued path B after token 47": visual, immediate.
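To make the alignment idea concrete, here is a minimal sketch that extracts the steps inside a <think> block and greedily pairs steps across two traces. It is an illustration under stated assumptions, not how Compare works: each non-empty line is treated as one step, and word-set overlap (Jaccard) stands in for a real semantic-similarity model. The function names and the 0.5 threshold are hypothetical.

```python
import re
from typing import List, Optional, Tuple


def extract_steps(output: str) -> List[str]:
    """Return the non-empty lines inside the first <think>...</think> block."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        return []
    return [line.strip() for line in match.group(1).splitlines() if line.strip()]


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets. A crude stand-in for a real
    semantic-similarity model (assumption for this sketch)."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def align_steps(
    a: List[str], b: List[str], threshold: float = 0.5
) -> List[Tuple[int, Optional[int]]]:
    """Greedily pair each step in `a` with its most similar unused step in
    `b`. A step with no match at or above the threshold is paired with None."""
    used = set()
    pairs: List[Tuple[int, Optional[int]]] = []
    for i, step in enumerate(a):
        best_j, best_score = None, 0.0
        for j, other in enumerate(b):
            if j in used:
                continue
            score = similarity(step, other)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= threshold:
            used.add(best_j)
            pairs.append((i, best_j))
        else:
            pairs.append((i, None))
    return pairs


# Made-up traces for illustration.
out_a = "<think>\nTry factoring the quadratic\nCheck the discriminant\n</think>\nAnswer: 3"
out_b = "<think>\nTry factoring the quadratic first\nUse the quadratic formula\n</think>\nAnswer: 2"

steps_a, steps_b = extract_steps(out_a), extract_steps(out_b)
print(align_steps(steps_a, steps_b))  # [(0, 0), (1, None)]
```

Here both traces share the factoring step, while the second step of the first trace has no counterpart in the other; unmatched steps like this are the candidates for highlighting as diverging paths.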

Use cases unlocked

Red-teaming (why did one fine-tune regress on safety?), dataset debugging (which prompts in this eval expose divergent reasoning?), vendor benchmarking, and distillation target selection.

Request early access

We prioritize researchers, educators, and safety teams who will use it publicly. Tell us what you want to build; we'll reach out when the beta opens.