
Why your agent needs a regression gate, not just a debugger

Refine AI Team · April 11, 2026 · 8 min read

It's a Tuesday afternoon. An engineer on your team ships a prompt update — nothing dramatic, just tightening the system prompt wording and swapping the retrieval model for one with a larger context window. The PR description says "minor quality improvements." The eval suite runs. Output quality: pass. Correctness on the golden dataset: pass. CI is green. The PR merges.

Two days later, a user files a support ticket. "The agent seems really slow." You dig in. Nothing looks obviously broken — responses are accurate. Then you pull the trace. Your agent is now taking 22 steps to complete tasks that previously took 6. Token cost has tripled. Latency is up 4x at P95. A tool that should only fire in specific conditions is being called on every single run.

Everything looked correct on the surface. Your evals said so. And yet: a structural regression slipped through, lived in production for two days, and was only found because a frustrated user happened to notice the slowness.

This is not a hypothetical. This is the normal failure mode for agent development in 2026 — and it's happening on teams that are doing everything else right.

The problem with debuggers alone

Debuggers are excellent tools. Refine AI ships one, and we mean it when we say that. The trace viewer, step-by-step replay, diff view between runs — these are genuinely useful for understanding what your agent did.

But a debugger is a reactive tool. You open it when you already know something is wrong. By that point, the regression is already in production. The damage — in user experience, in cost, in reliability — has already happened.

The mental model of "I'll use the debugger to catch problems" is like saying "I'll investigate the crash after the car hits the wall." It's better than nothing. But investigation after the fact is not prevention. A debugger answers the question: "What went wrong?" It cannot answer: "Will this PR make something go wrong?"

That's a fundamentally different question. And answering it requires a fundamentally different kind of tooling.

Why eval platforms don't close the gap

Braintrust, LangSmith, Weave, and similar platforms do solve a real problem. If you want to know whether your agent's outputs are correct, these tools are purpose-built for that. They're genuinely useful for evaluating answer quality, factual accuracy, rubric compliance, and format correctness.

But there's a precision problem: they answer the question "Is my agent correct?" — not "Did my agent's execution behavior change?" These are different questions. And the second question is the one that catches the failure mode described above.

There are three structural reasons why output eval platforms don't catch behavioral regressions:

1. They require ground truth. Eval platforms need golden datasets, rubrics, or an LLM-as-judge to score outputs. Building and maintaining these is significant overhead. Worse, for many agent tasks, correct output is not uniquely defined — multiple different execution paths can produce a correct final answer.
2. LLM-as-judge adds latency and cost. Using a model to evaluate another model's output adds at minimum one LLM call per test case. For teams running hundreds of test cases on every PR, this adds up fast — both in time and in API spend.
3. They're blind to structural change. An eval platform has no concept of "this agent used to take 6 steps and now takes 22." It sees the final output. If the final output is still correct by the rubric, the eval passes — even if the path to get there is completely different and three times more expensive.

The gap isn't about eval platforms being bad. It's about them answering the wrong question for this class of failure.

Behavioral regressions are invisible to output eval

Let's make this concrete with three real scenarios — the kind that happen on any team shipping agents at pace.

Scenario 1 — Model swap

Step count triples after upgrading the base model. Output is correct.

Team swaps gpt-4o-2024-08-06 for the latest checkpoint. On the golden dataset, output quality is equal or slightly better. Eval suite: green. What the eval doesn't see: the new model is more verbose in its reasoning chain. The agent now re-queries the knowledge base 4 times per task instead of 1.5. Step count goes from 8 to 24. Cost per run triples. Nothing in the output diff reveals this.

Scenario 2 — Integration side effect

Adding Salesforce integration causes the agent to call it on every request. Output unchanged.

A developer adds a Salesforce tool and updates the system prompt to describe it. The intent is for the tool to be used only when the user explicitly asks about CRM data. But a subtle wording choice makes the agent treat it as a default enrichment step. The final answer to every query is still correct — the CRM data is just appended silently and ignored. Tool call count doubles. Cost per run is now 40% higher. Your eval rubric doesn't check tool usage patterns. The PR merges.

Scenario 3 — Latency cliff

Latency goes from 3s to 14s after a prompt change. Evals pass.

A rewording of the system prompt causes the agent to consistently attempt a longer reasoning chain before invoking tools. Mean latency stays similar in fast cases. But at P99, response time goes from 3.2s to 14.8s. The eval suite measures output quality, not latency distribution. Users on slow queries start timing out. SLAs are breached. The eval suite: never noticed.

In all three cases: output quality evaluation passes. The regression is real. The regression is in production. And no eval platform was designed to catch it — because the thing that changed wasn't correctness, it was execution structure.

The CI gate mental model

The right mental model for behavioral regression gating isn't "better evals." It's closer to performance testing.

Think about how PageSpeed Insights works. It doesn't care whether your content is good. It doesn't evaluate the quality of your writing or the correctness of your data. It checks a set of structural metrics — First Contentful Paint, Cumulative Layout Shift, Total Blocking Time — and fails you if they cross a threshold. The invariant is: "These metrics must stay within acceptable bounds across changes."

That's exactly the right mental model for agent behavior. The question isn't "is the output correct?" — your eval suite handles that. The question is: "Did the execution profile change in a way that violates our invariants?"

If your agent ran in 6 steps on main, and now it runs in 22 steps on this branch — that's a signal worth gating on, regardless of whether the output looks good. If a new tool appears in the execution path that was never there before, that needs human review before it ships. If cost per run jumped 3x, the PR should be blocked until someone signs off.

These aren't quality metrics. They're structural invariants. And CI is the right place to enforce them — automatically, on every PR, before anything merges.
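The invariant idea is simple enough to sketch in a few lines. Here's a minimal illustration of a threshold-based gate — the function name and thresholds are placeholders for this post, not Refine AI's actual implementation:

```python
# Illustrative sketch of a structural invariant check: block the PR when a
# metric grows more than an allowed percentage over the main-branch baseline.
# This is a toy version of the idea, not a real agentdbg API.

def check_invariant(name: str, baseline: float, current: float,
                    max_increase_pct: float) -> bool:
    """Pass if `current` has not grown more than `max_increase_pct` over baseline."""
    if baseline == 0:
        return current == 0
    increase_pct = (current - baseline) / baseline * 100
    return increase_pct <= max_increase_pct

# Step count went from 6 on main to 22 on this branch: a +267% jump,
# far beyond a 50% allowance, so the check fails and the PR is blocked.
print(check_invariant("step_count", baseline=6, current=22, max_increase_pct=50))
```

Note that nothing here scores the output. The check is deterministic, runs in microseconds, and needs no golden dataset — only a baseline captured from main.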

What a regression gate checks

Refine AI runs 8 behavioral check categories on every PR. Each one is deterministic — no LLM calls, no probabilistic scoring, no ground truth required.

step_count

Total reasoning steps the agent took to complete the task. A meaningful increase often means a model change caused more verbose reasoning, or a prompt change caused the agent to second-guess itself.

tool_calls

Number of tool invocations across the run. Sudden increases indicate unintended tool use patterns. Decreases might indicate a regression where the agent stopped using a critical tool.

unexpected_tool_path

Flags any tool that appears in the new run's execution trace but was absent from the baseline. This catches "the agent started calling Salesforce on every request" before it ships.

loop_risk

Detects repeated identical tool calls within a single run — a strong signal that the agent is stuck in a loop. Without a gate, looping agents can run for minutes and cost dollars per request.
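One simple heuristic for this check: count identical (tool, arguments) pairs within a run and flag anything repeated beyond a budget. A sketch under that assumption — the pair encoding and the repeat budget are illustrative:

```python
# Loop-risk sketch: the same tool invoked with identical arguments more than
# `max_repeats` times in a single run is a strong loop signal.
from collections import Counter

def loop_risk(tool_calls: list[tuple[str, str]], max_repeats: int = 3) -> bool:
    """tool_calls is a list of (tool_name, serialized_args) pairs for one run."""
    counts = Counter(tool_calls)
    return any(n > max_repeats for n in counts.values())

calls = [("search_kb", '{"q": "pricing"}')] * 5 + [("summarize", "{}")]
print(loop_risk(calls))  # True: search_kb repeated 5 times with identical args
```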

guardrail_events

Counts how many times the agent triggered a safety or policy boundary. A spike here often means a prompt change inadvertently caused the agent to explore restricted territory more often.

cost_estimate

Estimated token cost per run, calculated from the trace. Threshold-based: if cost increases more than N% over baseline, the PR is blocked. No surprises on your API bill.

latency

Wall-clock time per run, compared against baseline. Reports both mean and P95/P99 so latency tail regressions don't hide behind a healthy mean.
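Why the tail matters is easy to show with numbers. A nearest-rank percentile over per-run latencies (the sample values below echo Scenario 3 and are illustrative):

```python
# Tail-latency sketch: a nearest-rank percentile exposes regressions that a
# healthy-looking mean hides.
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; samples need not be pre-sorted."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

latencies = [2.9, 3.0, 3.1, 3.2, 3.0, 2.8, 3.1, 14.8, 3.0, 2.9]
print(round(sum(latencies) / len(latencies), 2))  # 4.18, mean partly masks the outlier
print(percentile(latencies, 95))                  # 14.8, the tail tells the story
```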

stop_condition

Verifies the agent reached a clean stop in all test cases. If a code change causes the agent to hit max steps or fail to return a final answer, this check catches it.
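In sketch form, this reduces to two conditions per test case: a final answer exists, and the step budget was not exhausted. The run-record shape below is an assumption for illustration:

```python
# Stop-condition sketch: every run must end with a final answer and stay
# under the step budget. The dict shape of a run record is assumed here.

def clean_stop(run: dict, max_steps: int = 30) -> bool:
    return run.get("final_answer") is not None and run["steps"] < max_steps

runs = [
    {"steps": 6, "final_answer": "Done."},
    {"steps": 30, "final_answer": None},   # hit max steps, never answered
]
print([clean_stop(r) for r in runs])  # [True, False]
```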

How to add one to your CI in 5 minutes

Setup is three steps: instrument your agent, capture a baseline, add the GitHub Action.

Step 1 — Instrument with @trace

agent.py
from agentdbg import trace

@trace
def run_agent(user_input: str) -> str:
    # your existing agent code — no other changes needed
    result = agent.invoke({"input": user_input})
    return result["output"]
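If you're curious what a decorator like this does under the hood, the core idea is just wrapping the call and recording structured events. A toy version for intuition — this is not agentdbg's actual implementation, and the record fields are invented for the example:

```python
# Toy tracing decorator: wraps the agent entrypoint and appends a structured
# record per call. Illustrative only; the real library records far more
# (steps, tool calls, tokens) than this sketch does.
import functools
import time

TRACE_LOG: list[dict] = []

def trace(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE_LOG.append({
            "fn": fn.__name__,
            "latency_s": time.perf_counter() - start,
        })
        return result
    return wrapper

@trace
def run_agent(user_input: str) -> str:
    return user_input.upper()  # stand-in for real agent logic

run_agent("hello")
print(TRACE_LOG[0]["fn"])  # run_agent
```

The key property is that instrumentation is additive: your agent code runs unchanged, and the trace is a side channel the gate can compare against the baseline.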

Step 2 — Capture a baseline from main

terminal
$ agentdbg baseline capture --suite tests/agent_suite.py

Running 12 test cases against current branch...
  Baseline captured: main@a3f72c1
    step_count p50: 6    p95: 9
    tool_calls p50: 4    p95: 6
    cost_est   p50: $0.011
    latency    p50: 2.8s  p95: 4.1s
  Saved to agentdbg.dev/baselines/your-repo

Step 3 — Add the GitHub Action

.github/workflows/agentdbg.yml
name: Refine AI behavioral gate
on: [pull_request]

jobs:
  agentdbg:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: agentdbg/action@v1
        with:
          api-key: ${{ secrets.AGENTDBG_API_KEY }}
          suite: tests/agent_suite.py
          baseline: main

What a failed PR check looks like

Refine AI CI · PR #847 · feat/new-retrieval-model
$ agentdbg assert --baseline main --compare HEAD

  step_count       6 → 22   (+267%)  FAIL
  tool_calls       4 → 19   (+375%)  FAIL
  unexpected_tool  salesforce_api (never seen before)  FAIL
  cost_estimate    $0.012 → $0.041  (+242%)  FAIL
  loop_risk        none detected          PASS
  stop_condition   reached in all cases   PASS

❌  4 checks failed · PR blocked
   Full trace diff: agentdbg.com/runs/pr-847

The PR is blocked. The engineer sees exactly which checks failed and by how much. The trace diff link shows a step-by-step comparison between the baseline run and the PR run. They can investigate, fix, and re-push — with the gate enforcing the bar on every iteration.

The debugger and the gate work together

The point of this post isn't to say "stop using debuggers" or "stop using eval platforms." Both are useful. The debugger is indispensable for investigation. Eval platforms are the right tool for measuring output quality.

The point is that there's a third layer that most teams are missing: something that sits in CI, runs on every PR, and checks whether the behavioral structure of your agent has changed in a way that violates your invariants — before it ships.

The debugger becomes the investigation layer. The gate becomes the prevention layer. They're complementary, not competing. When the gate fires, you open the debugger to find out why. When the gate passes, you can merge with confidence knowing that behavioral structure is preserved.

You don't have to choose between "we move fast and hope nothing breaks" and "we run slow eval suites on every PR." There's a third path: deterministic behavioral checks in CI that run in seconds, catch structural regressions before they merge, and give engineers a clear signal when something needs investigation.

That's what a regression gate does. And it's something your agents need, whether or not you have one in place today.

Try Refine AI on your next PR

Free to start. No LLM calls in your critical path. 5-minute setup via GitHub Action. Catches step count explosions, unexpected tool calls, loop risk, cost spikes, and latency regressions — before they merge.
