Stop wasting agent budget on traces that won't ship.
Mid-reasoning activation gate for LLM-based code agents on Qwen3.6-27B. Two probes (L43 capability + L55 thinking-intent), ~50 ms scoring, three modes (skip · escalate · proceed). No black-box LLM-judge tax. Detect-only — we did the steering experiments so you don't have to.
```shell
pip install --upgrade "openinterp[full]"
```

Requires Python ≥ 3.10 · the `[full]` extra adds torch + transformers + scikit-learn.
## The five-line integration
Wrap your agent rollout. Get a routing decision before the first tool call burns budget.
```python
from openinterp import AgentProbeGuard

guard = AgentProbeGuard.from_pretrained("Qwen/Qwen3.6-27B")
guard.attach(model, tok)
decision = guard.assess(messages, partial_response=current_thought)
# decision.action     ∈ {"skip", "escalate", "proceed"}
# decision.scores     = {"capability": 0.18, "thinking": 0.74}
# decision.thresholds = {"skip_below": 0.20, "escalate_below": 0.50}
```

## Three modes, no fights with the model.
The guard exposes a routing decision; your application decides what to do with it. Default thresholds were calibrated on N=54 SWE-bench Pro traces. Override per-deploy.
- **skip.** High failure risk. Abort the trace before tools fire. Reformulate the query or surface to the user.
- **escalate.** Moderate risk. Route to a stronger model (Claude Sonnet, GPT-5) or human review. Save the cheap call.
- **proceed.** Run the agent normally. ~0 overhead beyond the single forward pass that captured the residual.
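A minimal sketch of dispatching on the three modes. The `Decision` class below is a stand-in for the object returned by `guard.assess` (only `action`, `scores`, and `thresholds` are taken from the snippet above; the threshold logic shown is an assumption about how the defaults map to actions):

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    """Stand-in for the guard's routing decision (illustrative)."""
    scores: dict
    thresholds: dict = field(default_factory=lambda: {
        "skip_below": 0.20, "escalate_below": 0.50})

    @property
    def action(self) -> str:
        cap = self.scores["capability"]
        if cap < self.thresholds["skip_below"]:
            return "skip"        # abort before tools fire
        if cap < self.thresholds["escalate_below"]:
            return "escalate"    # hand off to a stronger model / human
        return "proceed"         # run the cheap agent as-is

print(Decision({"capability": 0.18, "thinking": 0.74}).action)  # skip
print(Decision({"capability": 0.42, "thinking": 0.30}).action)  # escalate
print(Decision({"capability": 0.81, "thinking": 0.10}).action)  # proceed
```

Per-deploy overrides then reduce to passing a different `thresholds` dict.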
## Two probes, two signals.
Both numbers are reported as top-K diff-of-means probes against a random-K-matched baseline, to separate real signal from over-parameterization. Paper-grade gap is ≥ +0.10 above random.
| probe | K | N | AUROC | random-K baseline | gap | source |
|---|---|---|---|---|---|---|
| L43 pre_tool — capability | 10 | 54 | 0.830 | 0.660 | +0.170 | Phase 6c methodology sweep |
| L55 pre_response — suppressed CoT intent | 5 | 240 | 0.848 | 0.701 | +0.147 | Phase 8 redux, random-K-matched |
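The random-K-matched check can be sketched in a few lines: fit a diff-of-means direction, score on the top-K dimensions, and compare against probes built from K random dimensions. This uses synthetic data with a planted signal (real runs use residual-stream activations; the dimensions and effect size here are arbitrary):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
D, N, K = 256, 200, 10
y = rng.integers(0, 2, N)
X = rng.normal(size=(N, D))
X[:, :K] += y[:, None] * 0.8          # plant signal in the first K dims

def topk_auroc(X, y, dims):
    """AUROC of a diff-of-means probe restricted to `dims`."""
    mu_diff = X[y == 1].mean(0) - X[y == 0].mean(0)
    scores = X[:, dims] @ mu_diff[dims]
    return roc_auc_score(y, scores)

# top-K dims by |diff of means| vs 50 random-K draws
top_k = np.argsort(-np.abs(X[y == 1].mean(0) - X[y == 0].mean(0)))[:K]
probe = topk_auroc(X, y, top_k)
random_baseline = np.mean([
    topk_auroc(X, y, rng.choice(D, K, replace=False)) for _ in range(50)])
print(f"probe={probe:.3f}  random-K={random_baseline:.3f}  "
      f"gap={probe - random_baseline:+.3f}")
```

When K is large relative to N, the top-K probe can look perfect while the random-K baseline is nearly as good — that collapse of the gap is the over-parameterization signature.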
## We tested causality. The probes don't lever — they detect.
Three intervention experiments converged on the same verdict: linear probe directions at L43 and L55 are epiphenomenal. They correlate with the outcome but do not participate in the causal pathway that produces it. agent-probe-guard ships without a “boost” mode because we have evidence that boosting wouldn't work — and we'd rather be honest than ship a feature that fails silently.
## Three sanity checks. Each one caught a confident-but-wrong claim.
The methodology is half the contribution. These belong in any future probe-causality paper.
- **Random-K-matched baseline.** Mandatory at N<100. Phase 5d's AUROC=1.000 (K=50, N=17) was over-parameterization, caught by the random-K-matched comparison; Phase 6c at K=10 found the real signal at a +0.17 gap.
- **Control-token subtraction.** Always report Δrel = Δ(target) − mean(Δ(controls)). Phase 7's naive readout reported Δlog-prob(finish) = +0.479 at α=+2 and looked causal; subtracting controls revealed Δrel = −0.046 — a uniform softmax shift.
- **High-α sweep.** When initial steering at α∈{±2, ±5} shows zero change, sweep up to α=200 (>‖residual‖) with both probe and random directions. Phase 8 redux: 12 generations were identical even at α=+200, confirming a structural lock (the decision lives in the input tokens, not the residual).
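The control-subtraction arithmetic is one line; the Phase 7 target number is from above, and the control-token deltas here are hypothetical values chosen to reproduce the reported Δrel = −0.046:

```python
def delta_rel(d_target: float, d_controls: list[float]) -> float:
    """Δrel = Δ(target) − mean(Δ(controls)).

    Subtracts the average shift on control tokens so a uniform
    softmax shift cancels out instead of masquerading as causality.
    """
    return d_target - sum(d_controls) / len(d_controls)

d_finish = 0.479                    # Δlog-prob(finish) at α=+2, looked causal
d_controls = [0.52, 0.53, 0.525]    # hypothetical control-token deltas
print(f"Δrel = {delta_rel(d_finish, d_controls):+.3f}")
```

If the controls move as much as the target, Δrel ≈ 0 and the "effect" was the whole distribution shifting, not the target token being levered.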
## Where we sit.
Most agent observability is post-hoc (logs & cost dashboards). agent-probe-guard reads the model's own residual stream before the next tool call.
| tool | scope | latency | mid-reasoning? | license |
|---|---|---|---|---|
| LangSmith Eval | post-hoc traces | — | ✗ | closed |
| Helicone Watcher | cost monitoring | — | ✗ | Apache-2.0 |
| OpenAI Logprobs API | self-confidence | free | ✗ | closed weights |
| Anthropic Probes (paper) | safety-relevant | unknown | ✗ | research-only |
| OpenInterp agent-probe-guard | mid-reasoning capability + CoT intent | ~50 ms | ✓ | Apache-2.0 |
## When not to use it.
- **Closed-weights agents.** The probe needs a forward hook on the residual stream. If you only have an API endpoint, this won't work. Use logprobs-based heuristics instead.
- **Non-Qwen3.6 bases.** Probe weights were trained on Qwen3.6-27B residuals. Cross-model transfer is paper-2 work in progress; for now, use the matching model.
- **Sub-5-turn agents.** Single-shot Q&A doesn't put enough budget at risk to justify a 50 ms gate. Below ~5 tool calls, post-hoc LLM-judge eval is fine.
- **You want behavior modification.** The probes don't lever (we proved it three ways). If you need to steer the agent, this isn't the tool. Detection-only is the honest claim.
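For the closed-weights case, a logprobs-based heuristic can stand in for the gate. A minimal sketch: mean token log-prob of the partial response as a crude confidence signal (the threshold here is illustrative, not calibrated, and this is not part of openinterp):

```python
def mean_logprob_gate(token_logprobs: list[float],
                      escalate_below: float = -1.5) -> str:
    """Crude confidence gate for API-only agents: escalate when the
    mean token log-prob of the partial response is low.
    Threshold is illustrative, not calibrated."""
    avg = sum(token_logprobs) / len(token_logprobs)
    return "escalate" if avg < escalate_below else "proceed"

print(mean_logprob_gate([-0.1, -0.3, -0.2]))   # proceed
print(mean_logprob_gate([-2.4, -1.9, -3.1]))   # escalate
```

It reads the model's self-confidence rather than its residual stream, so expect it to be noisier than an activation probe.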
## Ship agents that know when to stop.
Apache-2.0. PyPI. HF dataset. ~50 ms gate. Five lines of integration.
Manifesto · Research · Built on · FabricationGuard · ProbeBench