Don't let broken
agent changes merge.

Refine AI gates your PRs with opinionated behavioral checks. When your agent regresses — more steps, unexpected tool calls, cost spikes — the PR fails. The built-in debugger shows you exactly what changed.

Refine AI CI · PR #847 · feat/new-retrieval-model
$ agentdbg assert --baseline main --compare HEAD
step_count      6 → 22 (+267%) FAIL
tool_calls      4 → 19 (+375%) FAIL
new_tool_path   salesforce_api (never seen before) FAIL
cost_estimate   $0.012 → $0.041 (+242%) FAIL
loop_risk       none detected PASS
stop_condition  reached in all test cases PASS
❌ 4 checks failed · PR blocked
Full trace diff: agentdbg.com/runs/pr-847

Built for teams shipping AI agents in production

0
LLM calls in your critical path
5 min
setup in any GitHub repo
8+
behavioral check categories
100%
deterministic — no ML scoring
The problem

Code changes break agents.
Nobody notices.

A prompt tweak, a model swap, an infra update — any change can alter how your agent behaves. Output looks correct. Tests pass. But your agent is now taking 22 steps instead of 6, calling Salesforce, and costing 3x more. You won't know until a user reports it.

Step count explodes

Your agent used to complete the task in 6-9 steps. A new model version finds a different reasoning path. Now it takes 22. Output is correct. Cost is not.

Unexpected tool calls appear

A developer adds a Salesforce integration and forgets to add guardrails. Your agent starts calling it mid-session. Nobody sees it until the audit.

Latency quietly doubles

P95 latency goes from 3s to 14s. Users churn before you correlate the deploy with the slowdown. It happens at companies of every size.

Eval tools ask: "Is my output correct?" — we ask: "Did behavior change?"

Braintrust, LangSmith, and Confident AI are great at evaluating output quality. Refine AI checks something different: structural behavioral change. No golden dataset. No LLM-as-judge. No rubrics. Just: did your agent do something different than before?
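The core idea is a structural diff, not a quality score. A toy sketch of what that comparison could look like; the trace shape and the `diff_traces` helper are illustrative, not agentdbg's actual API:

```python
# Toy illustration: compare two execution traces structurally, not by output quality.
# The trace format and function name here are hypothetical, not agentdbg's API.

def diff_traces(baseline: dict, current: dict) -> list[str]:
    """Return a list of structural changes between two agent traces."""
    changes = []
    if current["steps"] != baseline["steps"]:
        changes.append(f"step_count {baseline['steps']} -> {current['steps']}")
    # Any tool the baseline never called is an unexpected tool path.
    new_tools = set(current["tools"]) - set(baseline["tools"])
    for tool in sorted(new_tools):
        changes.append(f"new_tool_path {tool}")
    return changes

baseline = {"steps": 6, "tools": ["search", "calculator"]}
current = {"steps": 22, "tools": ["search", "calculator", "salesforce_api"]}
print(diff_traces(baseline, current))
# ['step_count 6 -> 22', 'new_tool_path salesforce_api']
```

No model call, no rubric: an empty diff means behavior is unchanged, and any entry is a concrete, reproducible regression signal.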

How it works

Gate every PR.
In 5 minutes.

Add one decorator. Capture a baseline. Gate every PR. No hosted infrastructure, no cloud required — your traces stay on your CI runner.

01

Instrument your agent

Add one decorator to your agent entry point. Works with LangChain, LlamaIndex, AutoGen, or any raw Python agent. Takes under 5 minutes.

from agentdbg import trace

@trace
def run_agent(input: str):
    # your existing code
    # nothing else changes
    ...
02

Capture a baseline

Run your agent on a fixed set of test inputs. Refine AI records the full execution trace — steps, tool calls, cost, latency. This becomes your regression baseline.

# On main branch:
agentdbg baseline capture \
  --suite ./tests/agent_scenarios/ \
  --save baseline.json

# Baseline stored. Ready to gate.
03

Gate every PR

Add the GitHub Action. On every PR, Refine AI replays the same test suite against the new branch and compares. Behavioral regression = PR fails.

# .github/workflows/agent-check.yml
- uses: agentdbg/action@v1
  with:
    baseline: baseline.json
    max-steps: 15
    max-tool-calls: 10
    no-loops: true
    max-cost: 0.05
PR gate flow:
PR opened (code change) → Refine AI (replay + diff) vs. baseline (main branch) → Pass / Fail (PR check)

No hosted infra. Runs on your GitHub Actions runner. Traces never leave your environment.

Behavioral checks

8 checks that block
the regressions that matter.

Every check is deterministic. No ML classifiers. No LLM scoring. Each check compares a measured property of the current run against the baseline — and fails the PR if it exceeds your threshold.
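A minimal sketch of what such a deterministic gate could look like. The check names and numbers mirror the PR mockup above; the `gate` helper and ratio thresholds are illustrative, not agentdbg's internals:

```python
# Sketch of a deterministic threshold gate in the spirit of `agentdbg assert`.
# Helper name and thresholds are hypothetical; only the principle matters:
# measured value vs. baseline, no model in the loop.

def gate(baseline: float, current: float, max_ratio: float) -> bool:
    """Pass if the current value hasn't grown beyond max_ratio x baseline."""
    return current <= baseline * max_ratio

checks = {
    "step_count": gate(6, 22, 1.5),            # 22 > 6 * 1.5 -> fail
    "cost_estimate": gate(0.012, 0.041, 2.0),  # 0.041 > 0.024 -> fail
    "latency_p95": gate(3.0, 2.8, 1.5),        # within budget -> pass
}
failed = [name for name, passed in checks.items() if not passed]
print("PR blocked" if failed else "PR passes", failed)
# PR blocked ['step_count', 'cost_estimate']
```

Because the comparison is pure arithmetic, the same trace always produces the same verdict: no flaky LLM judges in your merge queue.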

Step Count
e.g. Was 6-9 steps. Now 22.

Catches when a code change causes your agent to take a dramatically different number of reasoning steps to complete the same task.

Tool Call Count
e.g. Was 4 calls. Now 19.

Detects when total tool invocations spike across a test case, indicating a more expensive or less efficient execution path.

Unexpected Tool Path
e.g. salesforce_api (never seen).

Flags any tool call that did not appear in the baseline trace. A new Salesforce call, a new DB read — anything your agent was not doing before.

Loop Risk
e.g. Same subgraph hit 7× in a row.

Detects when the agent revisits the same reasoning subgraph repeatedly — a sign of an infinite loop that will burn tokens and never resolve.
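One simple way to approximate this, assuming each step carries a signature for the node it executed: flag a run of identical consecutive signatures. This is a deliberately naive sketch; agentdbg's subgraph detection may be more sophisticated.

```python
# Sketch: flag loop risk when the same step signature repeats N+ times in a row.
# The `loop_risk` helper and the threshold are hypothetical illustrations.

def loop_risk(step_signatures: list[str], threshold: int = 5) -> bool:
    """True if any signature repeats `threshold` or more times consecutively."""
    run = 1
    for prev, cur in zip(step_signatures, step_signatures[1:]):
        run = run + 1 if cur == prev else 1
        if run >= threshold:
            return True
    return False

print(loop_risk(["plan", "search", "search", "search", "search", "search"]))  # True
print(loop_risk(["plan", "search", "answer"]))                                # False
```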

Guardrail Events
e.g. Guardrail fired that never fired before.

If a guardrail that never triggered on the baseline now triggers on the new branch, Refine AI surfaces it — even if the agent continued.

Token Cost
e.g. Median spend up 3.4×.

Compares estimated token cost per test case against the baseline. A 3x cost increase on a code change is a regression, even if output looks correct.

Latency
e.g. P95 went from 3s to 14s.

Tracks wall-clock time per run. Latency regressions are often invisible in evals but immediately visible to users. Catch them before merge.
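For context, a P95 gate boils down to a one-liner over per-run durations. This sketch uses nearest-rank percentiles; agentdbg's exact percentile method is not specified here:

```python
# Sketch: compute P95 wall-clock latency from per-run durations (ms)
# and compare it to a ceiling, like the max-latency-p95 setting above.
# Nearest-rank percentile; illustrative, not agentdbg's implementation.

def p95(durations_ms: list[float]) -> float:
    ordered = sorted(durations_ms)
    rank = max(0, round(0.95 * len(ordered)) - 1)
    return ordered[rank]

runs = [2800, 3100, 2900, 3000, 14200]  # one slow run after a deploy
print(p95(runs))          # 14200 -- the outlier dominates P95
print(p95(runs) <= 5000)  # False -> latency check fails
```

This is exactly why a mean or median hides the regression your users feel: four runs were fine, and P95 still blows the budget.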

Stop Condition
e.g. Agent no longer reaches final state.

Verifies your agent still reaches its expected terminal state (task_complete, handoff, etc.) across all test cases. Regression = agent never finishes.
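In essence, the check is a membership test over final states. The terminal-state names come from the example above; the helper and the test-case names are hypothetical:

```python
# Sketch: verify every test case ends in an approved terminal state.
# Terminal-state names mirror the examples above (task_complete, handoff);
# the helper and test-case names are illustrative.

TERMINAL_STATES = {"task_complete", "handoff"}

def unfinished_cases(final_states: dict[str, str]) -> list[str]:
    """Return test cases whose final state is not an approved terminal state."""
    return [case for case, state in final_states.items()
            if state not in TERMINAL_STATES]

results = {"refund_flow": "task_complete", "escalation": "max_steps_hit"}
print(unfinished_cases(results))  # ['escalation'] -> check fails, PR blocked
```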

Refine AI vs the alternatives

They ask: "Is it good?"
We ask: "Did it change?"

Eval platforms evaluate output quality. Refine AI detects structural behavioral change. Different question. Different tool. Complementary, not competing.

Capability
Refine AI
Braintrust / LangSmith
Confident AI / DeepEval
Arize / Cascade
PR gate that blocks on regression
Zero LLM calls in check path
No golden dataset required
Behavioral invariant checks
Output quality evaluation
LLM-as-judge scoring
Custom rubrics / prompts
Step count & tool call tracking
CI integration (GitHub Actions)
Local-first, no cloud required

"–" means limited or configuration-dependent support. Different tools, different jobs — use both.

What teams are finding

Real signals from engineers using Refine AI in CI.

"A new model version tripled our step count on the summarization agent. Refine AI caught it on the PR. We would never have noticed from evals alone — the output quality didn't change at all."

Eng team lead
AI-first SaaS startup

"We added a new tool to the agent and forgot to test it. Refine AI flagged it as an unexpected tool path on the PR. Two lines of YAML later and it's explicitly approved in our policy."

Founding engineer
Series A fintech

"Our agent stopped reaching the handoff state after a prompt change. Evals still passed because the early steps were fine. Refine AI blocked the PR because the stop condition wasn't hit."

Platform engineer
Enterprise AI team
Pricing

Free to debug.
Pay to gate your PRs.

The local debugger is free, forever. You pay for CI assertions — the same model Codecov and Snyk use.

Free
$0 forever
  • Local debugging — unlimited runs
  • Full timeline viewer (agentdbg view)
  • Loop detection & guardrails
  • Step-through trace inspection
  • Community Slack
Install free
Most popular
Team
$29 / seat / month
  • Everything in Free
  • CI gate — agentdbg assert
  • GitHub Action (agentdbg/action@v1)
  • PR comments with trace diffs
  • Baseline management
  • All 8 behavioral check types
  • Custom thresholds per check
  • Slack / webhook alerts on fail
Start 14-day trial
Enterprise
Custom volume pricing
  • Everything in Team
  • Unlimited seats
  • SSO / SAML
  • On-prem / VPC deployment
  • SLA + dedicated support
  • Audit log export
  • Custom check authoring
  • Compliance packages
Talk to us

No credit card required for the free tier or the 14-day Team trial. Cancel any time.

Get started

In production
in under 5 minutes.

Three steps. No sign-up required to start.

Step 1 — Install & instrument
terminal
pip install agentdbg

# Add @trace to your agent entry point
from agentdbg import trace

@trace
def run_agent(input: str):
    ...
Step 2 — Add to GitHub Actions
.github/workflows/agent-check.yml
- name: Refine AI behavioral check
  uses: agentdbg/action@v1
  with:
    baseline: baseline.json
    max-steps: 15
    max-tool-calls: 10
    no-loops: true
    max-cost: 0.05
    max-latency-p95: 5000