Refine AI started as a local debugger. Talking to teams building AI agents in production taught us the real problem: nobody catches behavioral regressions before they merge. We're fixing that.
Every week, engineering teams ship AI agent changes that silently break behavioral properties. Step counts triple. Unexpected APIs get called. Latency doubles. Existing eval suites still pass. Users notice first.
We think every PR touching an AI agent should be gated on behavioral invariants — the same way every PR touching network code gets latency budgets. That's what Refine AI does.
No LLM-as-judge in our check path. Behavioral checks compare measured properties (step count, tool calls, latency) against baselines. No false positives from scoring variance.
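A minimal sketch of what a deterministic baseline comparison like this can look like. The names (`TraceMetrics`, `check_against_baseline`) and the thresholds are illustrative assumptions, not Refine AI's actual API; the point is that every comparison is plain arithmetic over measured properties, so the same pair of traces always yields the same verdict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceMetrics:
    """Measured properties of one agent run (hypothetical shape)."""
    step_count: int
    tool_calls: frozenset  # names of tools the agent invoked
    latency_ms: float


def check_against_baseline(current: TraceMetrics,
                           baseline: TraceMetrics,
                           max_step_growth: float = 1.5,
                           max_latency_growth: float = 2.0) -> list:
    """Return the list of violated invariants; an empty list passes the gate.

    No model in the loop: each check is a direct comparison against the
    baseline, so there is no scoring variance to produce false positives.
    """
    violations = []
    if current.step_count > baseline.step_count * max_step_growth:
        violations.append(
            f"step count {current.step_count} exceeds "
            f"{max_step_growth}x baseline ({baseline.step_count})")
    unexpected = current.tool_calls - baseline.tool_calls
    if unexpected:
        violations.append(f"unexpected tool calls: {sorted(unexpected)}")
    if current.latency_ms > baseline.latency_ms * max_latency_growth:
        violations.append(
            f"latency {current.latency_ms}ms exceeds "
            f"{max_latency_growth}x baseline ({baseline.latency_ms}ms)")
    return violations


# Example: a PR triples the step count and calls a tool the baseline never used.
baseline = TraceMetrics(step_count=6,
                        tool_calls=frozenset({"search", "summarize"}),
                        latency_ms=900.0)
current = TraceMetrics(step_count=19,
                       tool_calls=frozenset({"search", "summarize", "shell"}),
                       latency_ms=950.0)
for violation in check_against_baseline(current, baseline):
    print(violation)
```

In a CI gate, a non-empty violation list fails the check and blocks the merge; an identical trace trivially passes.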
The right place to catch a regression is the PR, not the incident. We gate before merge so you never have to chase a production regression.
Your traces never leave your environment. Refine AI runs on your CI runner. No telemetry without consent. Open-source core.
AI agents will be held to the same engineering standards as any production system.
Behavioral testing for agents is still manual and reactive — that's the problem we solve.
The devtools model is proven: free for individuals, paid for teams. Codecov and Snyk showed the way.
No golden dataset should be required to catch a behavioral regression.
We respond to every email from teams building AI agents in production.