v0.3.0 live · Apache-2.0 · detect-only by design

Stop wasting agent budget on traces that won't ship.

Mid-reasoning activation gate for LLM-based code agents on Qwen3.6-27B. Two probes (L43 capability + L55 thinking-intent), ~50 ms scoring, three modes (skip · escalate · proceed). No black-box LLM-judge tax. Detect-only — we did the steering experiments so you don't have to.

$ pip install --upgrade "openinterp[full]"

requires Python ≥ 3.10 · adds torch + transformers + scikit-learn for `[full]` extras

Capability AUROC: 0.830 (L43 pre_tool · K=10 · N=54)
Thinking AUROC: 0.848 (L55 last-pos · K=5 · N=240)
Sanity checks: 3 (random-K · control-token · α-sweep)
Causal experiments: 3 (all confirm detect-only)

The five-line integration

Wrap your agent rollout. Get a routing decision before the first tool call burns budget.

from openinterp import AgentProbeGuard

guard = AgentProbeGuard.from_pretrained("Qwen/Qwen3.6-27B")
guard.attach(model, tok)

decision = guard.assess(messages, partial_response=current_thought)
# decision.action ∈ {"skip", "escalate", "proceed"}
# decision.scores = {"capability": 0.18, "thinking": 0.74}
# decision.thresholds = {"skip_below": 0.20, "escalate_below": 0.50}
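What `assess` returns is cheap to compute: each score is a linear probe read off a cached residual vector. A minimal numpy sketch of the scoring step (toy dimensions and random weights, not the shipped probes):

```python
import numpy as np

def probe_score(resid, w, b=0.0):
    """Linear probe: sigmoid of a dot product on one cached residual vector."""
    return float(1.0 / (1.0 + np.exp(-(resid @ w + b))))

# Hypothetical 8-dim residuals stand in for the layer-43 / layer-55
# activations the guard captures during its single forward pass.
rng = np.random.default_rng(0)
h_l43, h_l55 = rng.normal(size=8), rng.normal(size=8)
w_cap, w_think = rng.normal(size=8), rng.normal(size=8)

scores = {"capability": probe_score(h_l43, w_cap),
          "thinking": probe_score(h_l55, w_think)}
```

Scoring is two dot products and two sigmoids, which is why the gate costs ~50 ms rather than an extra LLM call.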

Three modes, no fights with the model.

The guard exposes a routing decision; your application decides what to do with it. Default thresholds were calibrated on SWE-bench Pro N=54 traces. Override per-deploy.

skip
capability < 0.20

High failure risk. Abort the trace before tools fire. Reformulate the query or surface to user.

escalate
0.20 ≤ capability < 0.50

Moderate risk. Route to a stronger model (Claude Sonnet, GPT-5) or human review. Save the cheap call.

proceed
capability ≥ 0.50

Run the agent normally. ~0 overhead beyond the single forward pass that captured the residual.
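Wired together, the mode table above is a few comparisons; a sketch of a dispatcher around the decision (the `route` helper is ours, not part of the library):

```python
def route(capability: float,
          skip_below: float = 0.20,
          escalate_below: float = 0.50) -> str:
    """Map a capability score onto the three actions using the default thresholds."""
    if capability < skip_below:
        return "skip"        # high failure risk: abort before tools fire
    if capability < escalate_below:
        return "escalate"    # moderate risk: stronger model or human review
    return "proceed"         # run normally; the gate adds ~0 further overhead
```

For example, `route(0.18)` returns `"skip"`; override the two thresholds per deploy.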

Two probes, two signals.

Both numbers are reported as top-K diff-of-means with a random-K-matched baseline, to separate real signal from over-parameterization. Paper-grade gap is ≥ +0.10 above random.

| probe | capacity | N | AUROC | random-K-matched | gap | source |
|---|---|---|---|---|---|---|
| L43 pre_tool — capability | K=10 | 54 | 0.830 | 0.660 | +0.17 | Phase 6c methodology sweep |
| L55 pre_response — suppressed CoT intent | K=5 | 240 | 0.848 | 0.701 | +0.147 | Phase 8 redux random-K-matched |
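The evaluation recipe is reproducible in a few lines; a sketch of top-K diff-of-means scoring against a random-K-matched baseline, on synthetic features (numpy only, planted toy signal — not the real residuals):

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC: probability a positive outranks a negative."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores)); ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def diff_of_means_auroc(X, y, dims):
    """Project onto the class-mean difference restricted to `dims`, then score."""
    d = X[y == 1].mean(0) - X[y == 0].mean(0)
    return auroc(X[:, dims] @ d[dims], y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(240, 64)), (rng.random(240) < 0.5).astype(int)
X[y == 1, :5] += 0.8                        # plant signal in 5 of 64 dims

d = X[y == 1].mean(0) - X[y == 0].mean(0)
top_k = np.argsort(-np.abs(d))[:5]          # K=5 largest-|diff| dims
probe = diff_of_means_auroc(X, y, top_k)
rand = np.mean([diff_of_means_auroc(X, y, rng.choice(64, 5, replace=False))
                for _ in range(20)])        # random-K-matched baseline
gap = probe - rand                          # paper-grade signal needs >= +0.10
```

With planted signal the probe beats the random-K baseline by a wide margin; without it, both hover near 0.5 — which is exactly the over-parameterization check that caught Phase 5d.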

We tested causality. The probes don't lever — they detect.

Three intervention experiments converged on the same verdict: linear probe directions at L43 and L55 are epiphenomenal. They correlate with the outcome but do not participate in the causal pathway that produces it. agent-probe-guard ships without a “boost” mode because we have evidence that boosting wouldn't work — and we'd rather be honest than ship a feature that fails silently.

Phase 7 (L43 pre_tool)
Log-prob proxy with control-token normalization: Δrel ≈ 0. Single-shot α=+5 behavioral: 4/4 fails select identical tool. Continuous α=+5: 3/4 keep tool, 1 degenerates without redirecting.
Phase 8 (L55 thinking)
Bidirectional steering α∈±5: zero behavioral change. Amplitude diagnostic up to α=+200 (perturbation 86% above residual norm): 12 generations identical char-by-char.
Phase 8 redux (top-5 retest)
Concentrated direction on the 5 paper-grade signal dims at α=+200: still no token flip. Rules out direction-dilution. Decision is in input tokens (chat template), not residual.
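The intervention itself is a one-line residual edit; a toy sketch of the perturbation and the amplitude diagnostic (random stand-in vectors here — in practice the edit is injected with a forward hook during generation):

```python
import numpy as np

def steer(h, direction, alpha):
    """Add alpha times the unit-norm probe direction to a residual vector."""
    d = direction / np.linalg.norm(direction)
    return h + alpha * d

rng = np.random.default_rng(0)
h = rng.normal(size=4096)   # stand-in residual-stream vector
d = rng.normal(size=4096)   # stand-in probe direction

h_steered = steer(h, d, alpha=200.0)
# amplitude diagnostic: how large is the edit relative to the residual itself?
rel = np.linalg.norm(h_steered - h) / np.linalg.norm(h)
```

At α=200 the perturbation exceeds the residual norm (`rel > 1`) — the regime in which the Phase 8 generations still came out identical char-by-char.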

Three sanity checks. Each one caught a confident-but-wrong claim.

The methodology is half the contribution. These belong in any future probe-causality paper.

Random-feature baseline + capacity sweep

Mandatory check at N < 100. Phase 5d's AUROC = 1.000 at K=50, N=17 was over-parameterization, caught by the random-K-matched comparison. Phase 6c at K=10 found the real signal at a +0.17 gap.

paper §eval v2/v3
Control-token normalization for steering

Always report Δrel = Δ(target) − mean(Δ(controls)). Phase 7 naive output reported Δlog-prob(finish) = +0.479 at α=+2, looked causal. Subtracting controls revealed Δrel = -0.046 — uniform softmax shift.

paper §eval v5
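The normalization is a two-line computation over log-probs; a sketch with illustrative token names:

```python
def delta_rel(before, after, target, controls):
    """Delta-rel = delta log-prob(target) minus mean delta log-prob(controls);
    cancels uniform softmax shifts that masquerade as causal effects."""
    delta = {t: after[t] - before[t] for t in (target, *controls)}
    return delta[target] - sum(delta[c] for c in controls) / len(controls)

# A uniform +0.5 shift looks causal on the target alone, but delta_rel is ~0.
before = {"finish": -3.0, "edit": -1.2, "search": -1.5}
after  = {"finish": -2.5, "edit": -0.7, "search": -1.0}
```

Here `delta_rel(before, after, "finish", ["edit", "search"])` is ≈ 0 despite the naive +0.5 on `finish` — the same pattern that unmasked Phase 7's apparent effect.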
Structural-rigidity α-sweep diagnostic

When initial steering at α∈{±2,±5} shows zero change, sweep up to α=200 (>‖residual‖) with both probe and random directions. Phase 8 redux: 12 generations identical even at α=+200, structural lock confirmed (decision is in input tokens, not residual).

paper §eval v6 §D.5

Where we sit.

Most agent observability is post-hoc (logs & cost dashboards). agent-probe-guard reads the model's own residual stream before the next tool call.

| tool | scope | latency | mid-reasoning? | open | license |
|---|---|---|---|---|---|
| LangSmith Eval | post-hoc traces | — | ✗ | ✗ | closed |
| Helicone Watcher | cost monitoring | — | ✗ | ✓ | Apache-2.0 |
| OpenAI Logprobs API | self-confidence | — | ✗ | free | closed weights |
| Anthropic Probes (paper) | safety-relevant | unknown | — | — | research-only |
| OpenInterp agent-probe-guard | mid-reasoning capability + CoT intent | ~50 ms | ✓ | ✓ | Apache-2.0 |

When not to use it.

  • Closed-weights agents. The probe needs a forward hook on the residual stream. If you only have an API endpoint, this won't work. Use logprobs-based heuristics instead.
  • Non-Qwen3.6 base. Probe weights were trained on Qwen3.6-27B residuals. Cross-model transfer is paper-2 work in progress; for now use the matching model.
  • Sub-5-turn agents. Single-shot Q&A doesn't have enough budget at risk to justify a 50 ms gate. Below ~5 tool calls, post-hoc LLM-judge eval is fine.
  • You want behavior modification. The probes don't lever (we proved it three ways). If you need to steer the agent, this isn't the tool. Detection-only is the honest claim.

Ship agents that know when to stop.

Apache-2.0. PyPI. HF dataset. ~50 ms gate. Five lines of integration.

Manifesto · Research · Built on · FabricationGuard · ProbeBench