Upload your failure dataset. We run your prompts through your model with a sparse autoencoder (SAE) attached, compute per-token feature activations on every prompt, correlate them with your error labels, and return the top 50 features most predictive of failure, complete with semantic descriptions and suggested interventions.
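The upload itself is just paired prompts and binary error labels. A minimal sketch of preparing such a dataset as JSONL (the filename and JSONL format here are illustrative assumptions, not a documented requirement):

```python
import json

# Hypothetical failure dataset: each record pairs a prompt with a binary
# error label (1 = the model failed on this prompt). The file name and
# the JSONL layout are assumptions for illustration.
records = [
    {"prompt": "Summarize this contract in one sentence.", "label": 1},
    {"prompt": "What is 2 + 2?", "label": 0},
]
with open("failures.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```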
The holy grail of LLM debugging
Today, debugging an LLM regression is notebook archaeology: check logs, run ablations, read papers, get unlucky. Auto-Interp does the archaeology automatically and returns a ranked list you can act on. Works on any HuggingFace model we have an SAE for.
How it works
Given (prompts, labels), we compute per-token feature activations at L0/L31, fit a regularized logistic regression or compute a mutual-information (MI) score for each feature against the label, keep features with AUROC > 0.6, annotate each survivor with an LLM judge using its top-activating examples, and return the ranked list.
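The AUROC screen above can be sketched in a few lines. This is a simplified stand-in, not the production pipeline: it assumes per-prompt activations (e.g. max-pooled over tokens) are already in a matrix, and it folds anti-correlated features in by symmetrizing the AUROC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def rank_features(acts, labels, auroc_min=0.6, top_k=50):
    """Score each SAE feature against binary failure labels by AUROC.

    acts:   (n_prompts, n_features) activation matrix (assumed pooled
            over tokens per prompt -- a simplifying assumption here)
    labels: (n_prompts,) array of 0/1 failure labels
    Returns up to top_k (feature_index, auroc) pairs, best first.
    """
    scores = []
    for j in range(acts.shape[1]):
        col = acts[:, j]
        if col.std() == 0:  # constant feature: no signal, AUROC undefined
            continue
        auc = roc_auc_score(labels, col)
        auc = max(auc, 1 - auc)  # anti-correlated features are predictive too
        if auc > auroc_min:
            scores.append((j, auc))
    scores.sort(key=lambda t: -t[1])
    return scores[:top_k]

# Synthetic demo: feature 0 fires mostly on failures, the rest are noise.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 200)
acts = rng.random((200, 20))
acts[:, 0] += 2.0 * labels  # make feature 0 cleanly predictive
ranked = rank_features(acts, labels)
print(ranked[0])  # feature 0 should come out on top
```

The real pipeline adds the regularized logistic regression / MI scoring and the LLM-judge annotation on top of this screen.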
Output
A notebook-ready JSON report plus an interactive web dashboard. Every feature is linked to its Atlas entry, and Sandbox recipes are pre-generated for the top 5.
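To make "notebook-ready JSON" concrete, here is a hypothetical report entry. Every field name and value below is illustrative, not the product's actual schema; the Atlas URL is a placeholder.

```python
import json

# Hypothetical shape of one report entry -- field names are assumptions
# made for illustration, not the documented schema.
entry = {
    "feature_id": 17234,
    "layer": 31,
    "auroc": 0.81,
    "description": "hedging language in refusals",   # LLM-judge annotation
    "atlas_url": "https://example.com/atlas/17234",  # placeholder link
    "suggested_intervention": "clamp feature to 0 during generation",
}
report = {"model": "my-org/my-model", "features": [entry]}
print(json.dumps(report, indent=2))
```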