Stop wasting agent budget on traces that won't ship.
Mid-reasoning activation gate for LLM-based code agents on Qwen3.6-27B. Two probes (L43 capability + L55 thinking-intent), ~50 ms scoring, three modes (skip · escalate · proceed). No black-box LLM-judge tax. Detect-only — we did the steering experiments so you don't have to.
```shell
pip install --upgrade "openinterp[full]"
```

Requires Python ≥ 3.10 · the `[full]` extra adds torch + transformers + scikit-learn.
## The five-line integration
Wrap your agent rollout. Get a routing decision before the first tool call burns budget.
```python
from openinterp import AgentProbeGuard

guard = AgentProbeGuard.from_pretrained("Qwen/Qwen3.6-27B")
guard.attach(model, tok)
decision = guard.assess(messages, partial_response=current_thought)
# decision.action     ∈ {"skip", "escalate", "proceed"}
# decision.scores     = {"capability": 0.18, "thinking": 0.74}
# decision.thresholds = {"skip_below": 0.20, "escalate_below": 0.50}
```

## Three modes, no fights with the model.
The guard exposes a routing decision; your application decides what to do with it. Default thresholds were calibrated on N=54 SWE-bench Pro traces. Override per-deploy.
- **skip.** High failure risk. Abort the trace before tools fire. Reformulate the query or surface to the user.
- **escalate.** Moderate risk. Route to a stronger model (Claude Sonnet, GPT-5) or human review. Save the cheap call.
- **proceed.** Run the agent normally. ~0 overhead beyond the single forward pass that captured the residual.
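A minimal sketch of dispatching on the three modes. The `Decision` class below is a stand-in for the object returned by `guard.assess` (only `action`, `scores`, and `thresholds` are taken from the snippet above; the threshold logic shown is an assumption about how the defaults map to actions):

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    """Stand-in for the guard's routing decision (illustrative)."""
    scores: dict
    thresholds: dict = field(default_factory=lambda: {
        "skip_below": 0.20, "escalate_below": 0.50})

    @property
    def action(self) -> str:
        cap = self.scores["capability"]
        if cap < self.thresholds["skip_below"]:
            return "skip"        # abort before tools fire
        if cap < self.thresholds["escalate_below"]:
            return "escalate"    # hand off to a stronger model / human
        return "proceed"         # run the cheap agent as-is

print(Decision({"capability": 0.18, "thinking": 0.74}).action)  # skip
print(Decision({"capability": 0.42, "thinking": 0.30}).action)  # escalate
print(Decision({"capability": 0.81, "thinking": 0.10}).action)  # proceed
```

Per-deploy overrides then reduce to passing a different `thresholds` dict.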
## Two probes, two signals.
Both numbers are reported as top-K diff-of-means probes against a random-K-matched baseline, to separate real signal from over-parameterization. Paper-grade gap is ≥ +0.10 above random.
| probe | K | N | AUROC | random-K baseline | gap | source |
|---|---|---|---|---|---|---|
| L43 pre_tool — capability | 10 | 54 | 0.830 | 0.660 | +0.170 | Phase 6c methodology sweep |
| L55 pre_response — suppressed CoT intent | 5 | 240 | 0.848 | 0.701 | +0.147 | Phase 8 redux, random-K-matched |
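The random-K-matched check can be sketched in a few lines: fit a diff-of-means direction, score on the top-K dimensions, and compare against probes built from K random dimensions. This uses synthetic data with a planted signal (real runs use residual-stream activations; the dimensions and effect size here are arbitrary):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
D, N, K = 256, 200, 10
y = rng.integers(0, 2, N)
X = rng.normal(size=(N, D))
X[:, :K] += y[:, None] * 0.8          # plant signal in the first K dims

def topk_auroc(X, y, dims):
    """AUROC of a diff-of-means probe restricted to `dims`."""
    mu_diff = X[y == 1].mean(0) - X[y == 0].mean(0)
    scores = X[:, dims] @ mu_diff[dims]
    return roc_auc_score(y, scores)

# top-K dims by |diff of means| vs 50 random-K draws
top_k = np.argsort(-np.abs(X[y == 1].mean(0) - X[y == 0].mean(0)))[:K]
probe = topk_auroc(X, y, top_k)
random_baseline = np.mean([
    topk_auroc(X, y, rng.choice(D, K, replace=False)) for _ in range(50)])
print(f"probe={probe:.3f}  random-K={random_baseline:.3f}  "
      f"gap={probe - random_baseline:+.3f}")
```

When K is large relative to N, the top-K probe can look perfect while the random-K baseline is nearly as good — that collapse of the gap is the over-parameterization signature.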
## We tested causality. The probes don't lever — they detect.
Three intervention experiments converged on the same verdict: linear probe directions at L43 and L55 are epiphenomenal. They correlate with the outcome but do not participate in the causal pathway that produces it. agent-probe-guard ships without a “boost” mode because we have evidence that boosting wouldn't work — and we'd rather be honest than ship a feature that fails silently.
## Three sanity checks. Each one caught a confident-but-wrong claim.
The methodology is half the contribution. These belong in any future probe-causality paper.
- **Random-K-matched baseline.** Mandatory at N<100. Phase 5d's AUROC=1.000 (K=50, N=17) was over-parameterization, caught by the random-K-matched comparison; Phase 6c at K=10 found the real signal at a +0.17 gap.
- **Control-token subtraction.** Always report Δrel = Δ(target) − mean(Δ(controls)). Phase 7's naive readout reported Δlog-prob(finish) = +0.479 at α=+2 and looked causal; subtracting controls revealed Δrel = −0.046 — a uniform softmax shift.
- **High-α sweep.** When initial steering at α∈{±2, ±5} shows zero change, sweep up to α=200 (>‖residual‖) with both probe and random directions. Phase 8 redux: 12 generations were identical even at α=+200, confirming a structural lock (the decision lives in the input tokens, not the residual).
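The control-subtraction arithmetic is one line; the Phase 7 target number is from above, and the control-token deltas here are hypothetical values chosen to reproduce the reported Δrel = −0.046:

```python
def delta_rel(d_target: float, d_controls: list[float]) -> float:
    """Δrel = Δ(target) − mean(Δ(controls)).

    Subtracts the average shift on control tokens so a uniform
    softmax shift cancels out instead of masquerading as causality.
    """
    return d_target - sum(d_controls) / len(d_controls)

d_finish = 0.479                    # Δlog-prob(finish) at α=+2, looked causal
d_controls = [0.52, 0.53, 0.525]    # hypothetical control-token deltas
print(f"Δrel = {delta_rel(d_finish, d_controls):+.3f}")
```

If the controls move as much as the target, Δrel ≈ 0 and the "effect" was the whole distribution shifting, not the target token being levered.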
## Where we sit.
Most agent observability is post-hoc (logs & cost dashboards). agent-probe-guard reads the model's own residual stream before the next tool call.
| tool | scope | latency | mid-reasoning? | license |
|---|---|---|---|---|
| LangSmith Eval | post-hoc traces | — | ✗ | closed |
| Helicone Watcher | cost monitoring | — | ✗ | Apache-2.0 |
| OpenAI Logprobs API | self-confidence | free | ✗ | closed weights |
| Anthropic Probes (paper) | safety-relevant | unknown | ✗ | research-only |
| OpenInterp agent-probe-guard | mid-reasoning capability + CoT intent | ~50 ms | ✓ | Apache-2.0 |
## When not to use it.
- **Closed-weights agents.** The probe needs a forward hook on the residual stream. If you only have an API endpoint, this won't work. Use logprobs-based heuristics instead.
- **Non-Qwen3.6 bases.** Probe weights were trained on Qwen3.6-27B residuals. Cross-model transfer is paper-2 work in progress; for now, use the matching model.
- **Sub-5-turn agents.** Single-shot Q&A doesn't put enough budget at risk to justify a 50 ms gate. Below ~5 tool calls, post-hoc LLM-judge eval is fine.
- **You want behavior modification.** The probes don't lever (we proved it three ways). If you need to steer the agent, this isn't the tool. Detection-only is the honest claim.
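For the closed-weights case, a logprobs-based heuristic can stand in for the gate. A minimal sketch: mean token log-prob of the partial response as a crude confidence signal (the threshold here is illustrative, not calibrated, and this is not part of openinterp):

```python
def mean_logprob_gate(token_logprobs: list[float],
                      escalate_below: float = -1.5) -> str:
    """Crude confidence gate for API-only agents: escalate when the
    mean token log-prob of the partial response is low.
    Threshold is illustrative, not calibrated."""
    avg = sum(token_logprobs) / len(token_logprobs)
    return "escalate" if avg < escalate_below else "proceed"

print(mean_logprob_gate([-0.1, -0.3, -0.2]))   # proceed
print(mean_logprob_gate([-2.4, -1.9, -3.1]))   # escalate
```

It reads the model's self-confidence rather than its residual stream, so expect it to be noisier than an activation probe.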
## Ship agents that know when to stop.
Apache-2.0. PyPI. HF dataset. ~50 ms gate. Five lines of integration.
Manifesto · Research · Built on · FabricationGuard · ProbeBench