Pricing

Free to debug.
Pay to gate your PRs.

The local debugger is free, forever. You pay for CI assertions — the same model Codecov and Snyk use. No credit card to start.

Free
$0 / forever

Local debugging — no cloud required.

  • Unlimited local runs
  • Timeline viewer (agentdbg view)
  • Loop & guardrail detection
  • Step-through trace inspection
  • LangChain, LlamaIndex, AutoGen, raw Python
  • Community Slack
Install free
Most popular
Team
$29 / seat / month

14-day free trial — no card required.

  • Everything in Free
  • CI gate (agentdbg assert)
  • GitHub Action (agentdbg/action@v1)
  • PR comments with behavioral trace diffs
  • Baseline management (capture, version, compare)
  • All 8 behavioral check types
  • Custom thresholds per check
  • Slack + webhook alerts on regression
  • Priority email support
Start 14-day trial
Enterprise
Custom

Per-seat or volume-based. Contact us to scope.

  • Everything in Team
  • Unlimited seats
  • SSO / SAML
  • On-prem / VPC deployment
  • Custom baseline retention policy
  • Dedicated Slack support + SLA
  • Audit log export
  • Custom check authoring assistance
  • Compliance packages (SOC 2, etc.)
Talk to us

No credit card for the Free tier, and none for the 14-day Team trial. Cancel any time.

How Refine AI compares

The proven devtools model,
applied to AI agents.

Codecov gates code coverage. Snyk gates vulnerabilities. Refine AI gates behavioral correctness. Same model — new surface.

Feature                  Codecov   Snyk   SonarQube   Refine AI
Free local tool             ✓        ✓        ✓           ✓
Paid CI gate                ✓        ✓        ✓           ✓
Per-seat pricing            ✓        ✓        ✓           ✓
Specific to AI agents       –        –        –           ✓
FAQ

Common questions

What counts as a "behavioral regression"?
Any of the following: a step count outside the configured threshold, new or removed tool calls, an elevated loop-risk score, cost above budget, a latency spike beyond the allowed delta, a guardrail firing, or a missed stop condition. You configure the thresholds — Refine AI measures deltas against your baseline.
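For illustration, per-check thresholds like these might live in a small config file. This is a hypothetical sketch: the file name and every key below are assumptions, not a documented schema (only the check types themselves come from this page).

```yaml
# agentdbg.yaml (hypothetical file name; all keys below are illustrative assumptions)
checks:
  step_count:
    max_delta: 2          # fail if the PR adds or removes more than 2 steps
  tool_calls:
    allow_new: false      # any new tool call vs. the baseline fails the check
  loop_risk:
    max_score: 0.4        # fail above this loop-risk score
  cost:
    max_usd: 0.10         # fail if a traced run exceeds this budget
  latency:
    max_delta_pct: 25     # fail on a >25% latency regression vs. baseline
```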
Do I need to define what "correct" looks like?
No. There are no golden datasets to curate, no rubrics to write, and no LLM judges to prompt. Refine AI compares the execution structure of your agent before and after the code change. If the structure changed outside your thresholds, the check fails.
How does baseline management work?
Run agentdbg baseline capture on your main branch. The baseline is stored as a JSON file and versioned in your repo. The GitHub Action compares every PR against it automatically. When you intentionally change behavior, run baseline capture again to update it.
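As a sketch, that workflow could be wired up roughly like this. Only agentdbg/action@v1 and the agentdbg baseline capture / agentdbg assert commands come from this page; the workflow file name, input names, and baseline path are assumptions.

```yaml
# .github/workflows/agentdbg.yml (illustrative; input names and paths are assumptions)
name: agentdbg
on: pull_request
jobs:
  behavioral-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Compares this PR's trace against the baseline committed on main
      # (captured earlier with `agentdbg baseline capture`)
      - uses: agentdbg/action@v1
        with:
          baseline: agentdbg.baseline.json   # hypothetical path to the versioned baseline
```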
Is there a usage limit on the Free tier?
No limits on local runs — ever. The Free tier is local-only: you can run as many agent traces as you like and use the full timeline viewer. CI gate features (agentdbg assert and the GitHub Action) require the Team plan.
Do traces leave my environment?
No. Refine AI runs entirely on your CI runner. Traces are generated, compared, and discarded within your GitHub Actions runner environment. Nothing is sent to our servers. Enterprise customers with on-prem deployments have full control over data residency.
Can I use Refine AI with TypeScript / Node?
The Python SDK is available today and supports LangChain, LlamaIndex, AutoGen, and raw Python agents. A TypeScript / Node.js SDK is in active development. Sign up for early access and we'll notify you when it ships.
How is this different from Braintrust or LangSmith?
Braintrust and LangSmith evaluate output quality using LLM-as-judge — they answer "was the answer good?" Refine AI checks whether the behavioral structure of execution changed — it answers "did this code change alter how the agent behaves?" There are no LLM calls in our check path. The questions are different, and the tools are complementary.
What's the Enterprise pricing model?
Per-seat or volume-based depending on team size and deployment model. We work with engineering teams to scope the right package. Reach out at founders@refinehq.ai and we'll get back to you within one business day.

Ready to gate
your first PR?

No credit card required.