An evals-first approach to shipping AI agents
The single highest-leverage decision in building a production agent isn't choosing a model. It's deciding what 'good' looks like, and measuring it from day one.
Most AI projects start with a model choice. We start with an eval suite. Here's why, and how that changes the engagement.
What 'evals-first' means
Before we write production code, we agree on the test set: a representative collection of inputs the agent will see in the wild, each paired with criteria for what 'good' looks like. This is the contract we hold ourselves to and the artifact you can use to grade us.
Why before? Because once code is being written, the temptation to retrofit the success criteria around what the agent happens to do well is overwhelming. Setting evals upfront forces specificity. It also surfaces disagreements about quality that would otherwise emerge mid-project, when they're expensive.
What an eval suite looks like
For a customer-service agent, our typical eval suite has three layers:
- Intent coverage: 200+ real conversations across the top 20 intent categories, with 'acceptable' and 'unacceptable' response examples
- Edge cases: 50-100 known-tricky scenarios (multi-turn, ambiguity, conflicting tickets, frustrated tone)
- Adversarial: 30-50 prompts designed to stress-test (prompt injection, irrelevant requests, requests for unauthorized actions)
Each interaction in the suite has a rubric scored 0-3 across dimensions like correctness, tone-fit, and escalation appropriateness. The aggregate score is what we deploy against.
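As a concrete sketch of what one item can look like, here's a minimal hand-rolled format in Python. The field names and rubric dimensions are illustrative, not a fixed schema:

```python
# A minimal sketch of one eval item. Field names, dimensions, and the
# example content are illustrative; use whatever schema fits your data.
from dataclasses import dataclass

@dataclass
class EvalItem:
    item_id: str
    conversation: list[dict]   # [{"role": "user", "content": ...}, ...]
    rubric: dict[str, str]     # dimension -> grading guidance
    max_score: int = 3         # each dimension scored 0-3

example = EvalItem(
    item_id="refund-ambiguous-007",
    conversation=[{"role": "user",
                   "content": "I want my money back but also keep the item?"}],
    rubric={
        "correctness": "Cites the actual refund policy; no invented terms.",
        "tone_fit": "Calm and direct; no canned apology spam.",
        "escalation": "Flags for human review; the request conflicts with policy.",
    },
)
```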
How the suite changes day-to-day work
Once the eval is wired up, every code change runs against it automatically. A prompt tweak, a retrieval threshold change, a tool addition: all get scored. If the score drops below baseline, the change doesn't merge.
This sounds bureaucratic. In practice it's the opposite. It's the thing that lets us move fast without breaking quality. Engineers can experiment freely because the suite is the safety net. Drift gets caught at PR time, not in production.
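Here's a minimal sketch of the merge gate itself, assuming the eval runner writes its aggregate score to a JSON file. The file names and the noise tolerance are our assumptions, not a fixed convention:

```python
# Minimal merge-gate sketch. Assumes the eval runner has already written
# {"aggregate": <mean rubric score>} to eval_results.json.
import json
import sys

with open("eval_baseline.json") as f:
    baseline = json.load(f)["aggregate"]
with open("eval_results.json") as f:
    current = json.load(f)["aggregate"]

print(f"baseline={baseline:.2f}  current={current:.2f}")
if current < baseline - 0.05:  # small tolerance for grader noise
    sys.exit("Eval score dropped below baseline; blocking merge.")
```

Run it as a required check on every PR; the failing exit code is what blocks the merge.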
Common objections we hear
'Our use case is too custom for evals'
Every use case is custom. The eval suite is also custom. That's the point. The work is constructing the suite from your actual data, not finding a generic benchmark to use.
'We don't have enough data yet'
You don't need a million examples. A 200-interaction eval, hand-graded, gets you 80% of the value. We've seen teams stall for months trying to build the perfect eval set when 'good enough' would have shipped weeks ago.
'It's too expensive to run continuously'
A 200-item suite against Claude Sonnet runs about $3-8 per pass. Run it twice a day and on every PR. You'll spend less than a single bad incident costs to investigate.
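The back-of-envelope math, assuming roughly 2,000 input and 500 output tokens per item and indicative per-million-token prices (substitute your model's actual rates and traffic):

```python
# Back-of-envelope cost per pass. Token counts and prices are assumptions;
# plug in your model's real rates and your suite's real sizes.
items = 200
in_tok, out_tok = 2_000, 500        # assumed tokens per item
in_price, out_price = 3.00, 15.00   # assumed $ per million tokens

cost = items * (in_tok * in_price + out_tok * out_price) / 1_000_000
print(f"${cost:.2f} per pass")      # ~$2.70 under these assumptions
```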
What it costs to skip
Without an eval suite, you're flying blind. Quality changes will happen. Model providers ship updates, your prompts evolve, retrieval data shifts. You won't know whether things are getting better or worse. The first sign will be a customer complaint, and by then the damage is done.
Heads up: If you're considering an AI vendor and they don't talk about evals on the first call, walk. The vendors who skip this layer are the ones whose work fails six months in.
How to start
If you have an agent in production already and no eval suite, start here:
- Pull 200 real conversations or queries from the last month
- Get one or two domain experts to grade each one as 'good' / 'okay' / 'bad' with a one-line reason
- Save them as a fixed test set in version control
- Wire up a script that re-runs the agent against the set and produces an aggregate grade (a minimal sketch follows this list)
- Run it on every change and at least daily
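A sketch of that runner under our assumptions: `call_agent()` and `grade()` are stand-ins for your agent entry point and your grading step, and the file names are illustrative:

```python
# Sketch of the re-run script: replay the fixed test set through the agent
# and emit an aggregate grade. Replace the two stand-in functions below.
import json

def call_agent(query: str) -> str:
    """Stand-in for your agent entry point."""
    raise NotImplementedError

def grade(item: dict, response: str) -> int:
    """Stand-in for grading: return 0-3 against the item's rubric
    (hand-graded at first, or a model grader once you trust it)."""
    raise NotImplementedError

def main() -> None:
    with open("eval_set.jsonl") as f:          # fixed test set, in version control
        items = [json.loads(line) for line in f]

    scores = [grade(it, call_agent(it["query"])) for it in items]
    aggregate = sum(scores) / len(scores)

    with open("eval_results.json", "w") as f:  # consumed by the merge gate above
        json.dump({"aggregate": aggregate, "n": len(items)}, f)
    print(f"aggregate={aggregate:.2f} over {len(items)} items")

if __name__ == "__main__":
    main()
```

Pair it with the merge gate from earlier: the runner writes the score, the gate compares it to baseline.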
That's it. You now have a real measurement of agent quality and a way to know when it changes. The fanciness comes later. The discipline comes first.