Don't let broken
agent changes merge.

Refine AI gates your PRs with opinionated behavioral checks. When your agent regresses — more steps, unexpected tool calls, cost spikes — the PR fails. The built-in debugger shows you exactly what changed.

Refine AI CI · PR #847 · feat/new-retrieval-model
$ agentdbg assert --baseline main --compare HEAD
step_count      6 → 22 (+267%) FAIL
tool_calls      4 → 19 (+375%) FAIL
new_tool_path   salesforce_api (never seen before) FAIL
cost_estimate   $0.012 → $0.041 (+242%) FAIL
loop_risk       none detected PASS
stop_condition  reached in all test cases PASS
❌ 4 checks failed · PR blocked
Full trace diff: agentdbg.com/runs/pr-847

Built for teams shipping AI agents in production

0
LLM calls in your critical path
5 min
setup in any GitHub repo
8+
behavioral check categories
100%
deterministic — no ML scoring
The problem

Code changes break agents.
Nobody notices.

A prompt tweak, a model swap, an infra update — any change can alter how your agent behaves. Output looks correct. Tests pass. But your agent is now taking 22 steps instead of 6, calling Salesforce, and costing 3x more. You won't know until a user reports it.

Step count explodes

Your agent used to complete the task in 6-9 steps. A new model version finds a different reasoning path. Now it takes 22. Output is correct. Cost is not.

Unexpected tool calls appear

A developer adds a Salesforce integration and forgets to add guardrails. Your agent starts calling it mid-session. Nobody sees it until the audit.

Latency quietly doubles

P95 latency goes from 3s to 14s. Users churn before you correlate the deploy with the slowdown. It happens at companies of every size.

Eval tools ask: "Is my output correct?" — we ask: "Did behavior change?"

Braintrust, LangSmith, and Confident AI are great at evaluating output quality. Refine AI checks something different: structural behavioral change. No golden dataset. No LLM-as-judge. No rubrics. Just: did your agent do something different than before?
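The core idea is a structural diff, not a quality score. A toy sketch of what that comparison could look like; the trace shape and the `diff_traces` helper are illustrative, not agentdbg's actual API:

```python
# Toy illustration: compare two execution traces structurally, not by output quality.
# The trace format and function name here are hypothetical, not agentdbg's API.

def diff_traces(baseline: dict, current: dict) -> list[str]:
    """Return a list of structural changes between two agent traces."""
    changes = []
    if current["steps"] != baseline["steps"]:
        changes.append(f"step_count {baseline['steps']} -> {current['steps']}")
    # Any tool the baseline never called is an unexpected tool path.
    new_tools = set(current["tools"]) - set(baseline["tools"])
    for tool in sorted(new_tools):
        changes.append(f"new_tool_path {tool}")
    return changes

baseline = {"steps": 6, "tools": ["search", "calculator"]}
current = {"steps": 22, "tools": ["search", "calculator", "salesforce_api"]}
print(diff_traces(baseline, current))
# ['step_count 6 -> 22', 'new_tool_path salesforce_api']
```

No model call, no rubric: an empty diff means behavior is unchanged, and any entry is a concrete, reproducible regression signal.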

How it works

Gate every PR.
In 5 minutes.

Add one decorator. Capture a baseline. Gate every PR. No hosted infrastructure, no cloud required — your traces stay on your CI runner.

01

Instrument your agent

Add one decorator to your agent entry point. Works with LangChain, LlamaIndex, AutoGen, or any raw Python agent. Takes under 5 minutes.

from agentdbg import trace

@trace
def run_agent(input: str):
    # your existing code
    # nothing else changes
    ...
02

Capture a baseline

Run your agent on a fixed set of test inputs. Refine AI records the full execution trace — steps, tool calls, cost, latency. This becomes your regression baseline.

# On main branch:
agentdbg baseline capture \
  --suite ./tests/agent_scenarios/ \
  --save baseline.json

# Baseline stored. Ready to gate.
03

Gate every PR

Add the GitHub Action. On every PR, Refine AI replays the same test suite against the new branch and compares. Behavioral regression = PR fails.

# .github/workflows/agent-check.yml
- uses: agentdbg/action@v1
  with:
    baseline: baseline.json
    max-steps: 15
    max-tool-calls: 10
    no-loops: true
    max-cost: 0.05
PR gate flow:
PR opened (code change) → Refine AI (replay + diff) vs. baseline (main branch) → Pass / Fail (PR check)

No hosted infra. Runs on your GitHub Actions runner. Traces never leave your environment.

Behavioral checks

8 checks that block
the regressions that matter.

Every check is deterministic. No ML classifiers. No LLM scoring. Each check compares a measured property of the current run against the baseline — and fails the PR if it exceeds your threshold.
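A minimal sketch of what such a deterministic gate could look like. The check names and numbers mirror the PR mockup above; the `gate` helper and ratio thresholds are illustrative, not agentdbg's internals:

```python
# Sketch of a deterministic threshold gate in the spirit of `agentdbg assert`.
# Helper name and thresholds are hypothetical; only the principle matters:
# measured value vs. baseline, no model in the loop.

def gate(baseline: float, current: float, max_ratio: float) -> bool:
    """Pass if the current value hasn't grown beyond max_ratio x baseline."""
    return current <= baseline * max_ratio

checks = {
    "step_count": gate(6, 22, 1.5),            # 22 > 6 * 1.5 -> fail
    "cost_estimate": gate(0.012, 0.041, 2.0),  # 0.041 > 0.024 -> fail
    "latency_p95": gate(3.0, 2.8, 1.5),        # within budget -> pass
}
failed = [name for name, passed in checks.items() if not passed]
print("PR blocked" if failed else "PR passes", failed)
# PR blocked ['step_count', 'cost_estimate']
```

Because the comparison is pure arithmetic, the same trace always produces the same verdict: no flaky LLM judges in your merge queue.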

Step Count
e.g. Was 6-9 steps. Now 22.

Catches when a code change causes your agent to take a dramatically different number of reasoning steps to complete the same task.

Tool Call Count
e.g. Was 4 calls. Now 19.

Detects when total tool invocations spike across a test case, indicating a more expensive or less efficient execution path.

Unexpected Tool Path
e.g. salesforce_api (never seen).

Flags any tool call that did not appear in the baseline trace. A new Salesforce call, a new DB read — anything your agent was not doing before.

Loop Risk
e.g. Same subgraph hit 7× in a row.

Detects when the agent revisits the same reasoning subgraph repeatedly — a sign of an infinite loop that will burn tokens and never resolve.
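One simple way to approximate this, assuming each step carries a signature for the node it executed: flag a run of identical consecutive signatures. This is a deliberately naive sketch; agentdbg's subgraph detection may be more sophisticated.

```python
# Sketch: flag loop risk when the same step signature repeats N+ times in a row.
# The `loop_risk` helper and the threshold are hypothetical illustrations.

def loop_risk(step_signatures: list[str], threshold: int = 5) -> bool:
    """True if any signature repeats `threshold` or more times consecutively."""
    run = 1
    for prev, cur in zip(step_signatures, step_signatures[1:]):
        run = run + 1 if cur == prev else 1
        if run >= threshold:
            return True
    return False

print(loop_risk(["plan", "search", "search", "search", "search", "search"]))  # True
print(loop_risk(["plan", "search", "answer"]))                                # False
```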

Guardrail Events
e.g. Guardrail fired that never fired before.

If a guardrail that never triggered on the baseline now triggers on the new branch, Refine AI surfaces it — even if the agent continued.

Token Cost
e.g. Median spend up 3.4×.

Compares estimated token cost per test case against the baseline. A 3x cost increase on a code change is a regression, even if output looks correct.

Latency
e.g. P95 went from 3s to 14s.

Tracks wall-clock time per run. Latency regressions are often invisible in evals but immediately visible to users. Catch them before merge.
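For context, a P95 gate boils down to a one-liner over per-run durations. This sketch uses nearest-rank percentiles; agentdbg's exact percentile method is not specified here:

```python
# Sketch: compute P95 wall-clock latency from per-run durations (ms)
# and compare it to a ceiling, like the max-latency-p95 setting above.
# Nearest-rank percentile; illustrative, not agentdbg's implementation.

def p95(durations_ms: list[float]) -> float:
    ordered = sorted(durations_ms)
    rank = max(0, round(0.95 * len(ordered)) - 1)
    return ordered[rank]

runs = [2800, 3100, 2900, 3000, 14200]  # one slow run after a deploy
print(p95(runs))          # 14200 -- the outlier dominates P95
print(p95(runs) <= 5000)  # False -> latency check fails
```

This is exactly why a mean or median hides the regression your users feel: four runs were fine, and P95 still blows the budget.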

Stop Condition
e.g. Agent no longer reaches final state.

Verifies your agent still reaches its expected terminal state (task_complete, handoff, etc.) across all test cases. Regression = agent never finishes.
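In essence, the check is a membership test over final states. The terminal-state names come from the example above; the helper and the test-case names are hypothetical:

```python
# Sketch: verify every test case ends in an approved terminal state.
# Terminal-state names mirror the examples above (task_complete, handoff);
# the helper and test-case names are illustrative.

TERMINAL_STATES = {"task_complete", "handoff"}

def unfinished_cases(final_states: dict[str, str]) -> list[str]:
    """Return test cases whose final state is not an approved terminal state."""
    return [case for case, state in final_states.items()
            if state not in TERMINAL_STATES]

results = {"refund_flow": "task_complete", "escalation": "max_steps_hit"}
print(unfinished_cases(results))  # ['escalation'] -> check fails, PR blocked
```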

Refine AI vs the alternatives

They ask: "Is it good?"
We ask: "Did it change?"

Eval platforms evaluate output quality. Refine AI detects structural behavioral change. Different question. Different tool. Complementary, not competing.

Capability
Refine AI
Braintrust / LangSmith
Confident AI / DeepEval
Arize / Cascade
PR gate that blocks on regression
Zero LLM calls in check path
No golden dataset required
Behavioral invariant checks
Output quality evaluation
LLM-as-judge scoring
Custom rubrics / prompts
Step count & tool call tracking
CI integration (GitHub Actions)
Local-first, no cloud required

"–" means limited or configuration-dependent support. Different tools, different jobs — use both.

What teams are finding

Real signals from engineers using Refine AI in CI.

"A new model version tripled our step count on the summarization agent. Refine AI caught it on the PR. We would never have noticed from evals alone — the output quality didn't change at all."

Eng team lead
AI-first SaaS startup

"We added a new tool to the agent and forgot to test it. Refine AI flagged it as an unexpected tool path on the PR. Two lines of YAML later and it's explicitly approved in our policy."

Founding engineer
Series A fintech

"Our agent stopped reaching the handoff state after a prompt change. Evals still passed because the early steps were fine. Refine AI blocked the PR because the stop condition wasn't hit."

Platform engineer
Enterprise AI team
Pricing

Free to debug.
Pay to gate your PRs.

The local debugger is free, forever. You pay for CI assertions — the same model Codecov and Snyk use.

Free
$0 forever
  • Local debugging — unlimited runs
  • Full timeline viewer (agentdbg view)
  • Loop detection & guardrails
  • Step-through trace inspection
  • Community Slack
Install free
Most popular
Team
$29 / seat / month
  • Everything in Free
  • CI gate — agentdbg assert
  • GitHub Action (agentdbg/action@v1)
  • PR comments with trace diffs
  • Baseline management
  • All 8 behavioral check types
  • Custom thresholds per check
  • Slack / webhook alerts on fail
Start 14-day trial
Enterprise
Custom volume pricing
  • Everything in Team
  • Unlimited seats
  • SSO / SAML
  • On-prem / VPC deployment
  • SLA + dedicated support
  • Audit log export
  • Custom check authoring
  • Compliance packages
Talk to us

No credit card required for the free tier or the 14-day Team trial. Cancel any time.

Get started

In production
in under 5 minutes.

Three steps. No sign-up required to start.

Step 1 — Install & instrument
terminal
pip install agentdbg

# Add @trace to your agent entry point
from agentdbg import trace

@trace
def run_agent(input: str):
    ...
Step 2 — Add to GitHub Actions
.github/workflows/agent-check.yml
- name: Refine AI behavioral check
  uses: agentdbg/action@v1
  with:
    baseline: baseline.json
    max-steps: 15
    max-tool-calls: 10
    no-loops: true
    max-cost: 0.05
    max-latency-p95: 5000