
Evals as a deliverable: how we ship eval harnesses with every agent.

Most agentic-AI projects die in the gap between a good demo and a production system that survives real users. The bridge is an eval harness — and it should be as much of a deliverable as the agent itself.

May 3, 2026 · 9 min read

Every agentic-AI project we've seen go sideways went sideways the same way. There's a great demo on a Tuesday. The team starts iterating on prompts, tools, and memory. Two weeks later, nobody can tell whether the agent is getting better or worse — only that some users are happy and some are filing tickets that make no sense.

That gap — between "the demo works" and "the agent works in production" — is where most of the pain in this industry lives. It's also where most agencies hand off and walk away. We treat it differently. Evals are not a phase. They are a deliverable. Every engagement we ship ends with a working eval harness, a published methodology page, and a green-or-red signal the team can run on every commit.

What an eval harness actually is

An eval harness, the way we use the term, is three things glued together:

  1. A golden set — a curated collection of inputs paired with what a correct response looks like, written by a human who understands the domain.
  2. A scorer — code that compares the agent's output to the expected behavior and produces a number, a boolean, or both.
  3. A threshold — a published bar the agent must clear before a change is allowed to ship.

That's it. No magic. Most teams have one of the three. Few have all three wired into CI. The teams that do are the ones that ship agents that don't embarrass them six months later.
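
In skeleton form, that really is all a harness is. Here is a minimal sketch (the names and types are illustrative, not a prescribed API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenCase:
    """One curated input plus the expected behavior, written by a domain expert."""
    input: str
    expected: str

def run_suite(
    cases: list[GoldenCase],
    agent: Callable[[str], str],
    score: Callable[[str, str], bool],  # the scorer: (output, expected) -> pass?
    threshold: float,                   # the published bar, e.g. 0.95
) -> bool:
    """Run the agent over the golden set; True means the change may ship."""
    passed = sum(score(agent(c.input), c.expected) for c in cases)
    rate = passed / len(cases)
    print(f"pass rate {rate:.1%} vs threshold {threshold:.0%}")
    return rate >= threshold
```

Everything below is elaboration on those three pieces.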

What an eval harness is not

It is not a Jupyter notebook with a few hand-picked examples that someone runs before a release. It is not vibes. It is not asking GPT-4 to grade GPT-4's output without a human ever seeing the rubric. And it is not a public benchmark like MMLU or HumanEval, which measures something adjacent to, but not the same as, your agent's job.

If a stranger can't reproduce your eval result by cloning the repo and running one command, you don't have an eval harness. You have an opinion.

The four-axis framework we use

Every agent we build is measured along the same four axes. The exact metrics differ; the structure does not.

1. Accuracy

Does the agent produce the right answer on the cases we already know the right answer for? On RecipeGuide, this is a 200-recipe golden set scored on title match, ingredient-set equality, and step count. Threshold: ≥ 95% pass.
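
A sketch of that scorer's shape (field names are illustrative, not RecipeGuide's actual schema):

```python
def score_recipe(output: dict, expected: dict) -> bool:
    """All three checks must hold; partial credit would hide regressions."""
    return (
        output["title"].strip().lower() == expected["title"].strip().lower()
        and set(output["ingredients"]) == set(expected["ingredients"])  # order-free
        and len(output["steps"]) == len(expected["steps"])
    )
```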

2. Routing / trajectory

For agentic systems, the right answer isn't enough — the right path matters. We build a smaller set (typically 50–80 examples) of user prompts paired with the expected sequence of tool calls. We score on whether the agent invoked the right tools in the right order, with a small allowance for permutations that don't change the outcome.

This axis catches a class of regressions that accuracy alone misses: the agent gets the right answer by accident, by skipping a tool, or by hallucinating a result it should have looked up. Trajectory eval forces it to do its job, not just guess well.
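
One way to encode that allowance (a sketch under our own conventions, not the only encoding): an expected trajectory is a list of steps, where a plain string is a single required tool call and a set marks calls whose relative order doesn't matter.

```python
def score_trajectory(actual: list[str], expected: list[set[str] | str]) -> bool:
    """Check tool calls against expected steps; a set marks order-free groups."""
    i = 0
    for step in expected:
        group = {step} if isinstance(step, str) else step
        if set(actual[i:i + len(group)]) != group:
            return False
        i += len(group)
    return i == len(actual)  # no extra, skipped, or hallucinated calls

# Hypothetical example: pantry lookup and dietary check may run in either
# order, but the recipe search must come after both.
assert score_trajectory(
    ["check_diet", "get_pantry", "search_recipes"],
    [{"get_pantry", "check_diet"}, "search_recipes"],
)
```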

3. Quality (subjective)

Some outputs aren't right or wrong — they're better or worse. Tone, brevity, ingredient substitutions that feel culturally aware, story prose for a 7-year-old reader. We score these with a small panel rating on a 1–5 Likert scale, with a published rubric. Threshold: overall median ≥ 4 / 5, with no individual axis below 3.

Yes, this is human-graded and slow. We run it once per release, not once per commit. The trick is keeping the panel size honest (we use 3–5 raters) and the rubric public.
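
The grading is human, but the gate itself stays mechanical. A sketch, assuming "no axis below 3" means no per-axis median below 3:

```python
from statistics import median

def quality_gate(ratings: dict[str, list[int]]) -> bool:
    """ratings maps each rubric axis (tone, brevity, ...) to the panel's 1-5 scores."""
    all_scores = [s for scores in ratings.values() for s in scores]
    return median(all_scores) >= 4 and all(
        median(scores) >= 3 for scores in ratings.values()
    )
```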

4. Robustness

How does the agent behave when the input is noisy, ambiguous, or adversarial? On voice agents, this is wake-word recognition under three noise profiles (silent, kitchen, dishwasher). On extraction pipelines, it's the same input rephrased five different ways. On on-device inference, it's a thermal-stress test.
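
For the rephrasing case, the harness can fan each golden input out over its hand-written variants and require every variant to pass (a sketch; the noise-profile and thermal tests need hardware in the loop and don't reduce to a few lines):

```python
from typing import Callable

def robustness_pass_rate(
    agent: Callable[[str], str],
    variants: dict[str, list[str]],   # case id -> hand-written rephrasings
    expected: dict[str, str],         # case id -> expected output
    score: Callable[[str, str], bool],
) -> float:
    """A case passes only if every rephrasing of it passes."""
    passed = sum(
        all(score(agent(v), expected[cid]) for v in vs)
        for cid, vs in variants.items()
    )
    return passed / len(variants)
```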

Anatomy of a golden set

A golden set isn't a benchmark. It's a curated, opinionated list that reflects your users and your failure modes. We grow ours from four sources:

  • Domain interviews. Before we build anything, we spend a day with whoever does this job today. Their hardest cases go straight into the golden set.
  • Real production traces. The first day the agent serves a real user, we start sampling sessions and converting the interesting ones — successes and failures — into eval entries.
  • Adversarial cases. Inputs designed to break the agent. Empty pantries. Recipes with no ingredients. Five-word meeting transcripts. At least 10% of the set is adversarial.
  • Regression entries. Every time a user reports a bug, the failing input goes into the golden set before we fix it. The agent has to pass that case forever.
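
Tagging every entry with its source keeps that mix auditable. A sketch of what a single entry might carry (all fields and values here are illustrative):

```python
# One golden-set entry; the source tag keeps the adversarial share auditable.
entry = {
    "id": "rg-0417",                 # hypothetical id
    "source": "regression",          # interview | trace | adversarial | regression
    "input": "Dinner for four, no oven, pantry is empty",
    "expected": {"behavior": "suggest_restock", "tools": ["get_pantry"]},
    "added": "2026-04-12",
    "ticket": "RG-893",              # the bug report this case came from
}
```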

How the harness runs

The fast axes (accuracy, routing, robustness) run on every PR via CI. Median wall time on RecipeGuide's harness is 90 seconds, because we run inference on-device on a Mac mini farm rather than against a hosted API. A change that drops any axis below its threshold blocks the merge.
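
The blocking mechanism itself is unremarkable: a script whose exit code CI already knows how to interpret. A sketch (only the accuracy threshold below is a bar published in this post; the others are illustrative):

```python
import sys

THRESHOLDS = {
    "accuracy": 0.95,    # published bar (see axis 1)
    "trajectory": 0.90,  # illustrative; pick and publish your own
    "robustness": 0.90,  # illustrative
}

def gate(pass_rates: dict[str, float]) -> int:
    """Return a process exit code: 0 lets the PR merge, 1 blocks it."""
    failures = {a: r for a, r in pass_rates.items() if r < THRESHOLDS[a]}
    for axis, rate in failures.items():
        print(f"FAIL {axis}: {rate:.1%} < {THRESHOLDS[axis]:.0%}", file=sys.stderr)
    return 1 if failures else 0

if __name__ == "__main__":
    # rates would come from the suite runners; hard-coded here for the sketch
    sys.exit(gate({"accuracy": 0.96, "trajectory": 0.92, "robustness": 0.88}))
```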

The slow axis (quality) runs nightly on the main branch and on every release candidate. Failing nightly evals page whoever made the most recent merge.

The methodology page

Every project we ship gets a public methodology page. It documents:

  • What each axis measures, and why we picked it
  • The exact thresholds, with their rationale
  • How the golden set was constructed and how it grows
  • Current pass rates, updated on every release
  • Known gaps — the things this harness doesn't measure

That last bullet is the one that matters most. An honest methodology page tells you what your evals can't tell you. Buyers and users can read it and decide whether the bar is high enough for their use case. We've never had a buyer hold this against us. We've had several thank us for it.

Why this is the right thing for buyers to demand

If you're hiring an agentic-AI agency in 2026, here is what to ask before you sign anything:

  1. Will I get a working eval harness as part of the deliverable?
  2. Will it run in my CI on every PR?
  3. Will the methodology be documented in a page I can read?
  4. What thresholds will the agent need to clear, and who picks them?
  5. What does the harness explicitly not measure?

If the answer to any of these is hand-wavy, the agency is selling you a demo. They will leave, and you will own a system you cannot safely change.

Why this is the right thing for us to do

Evals discipline our own work. They keep us honest about what we know works and what we're hoping works. They make hand-off real — when the engagement ends, the team that inherits the agent has a green light they can trust. They turn an art project into a system somebody else can keep alive.

And — selfishly — they make us faster. The harness is the thing that lets us iterate on prompts and tools at the speed we do. Without it, every change is a guess.

What to do with this

If you're building agents, build a harness alongside. Even a bad harness is better than no harness, because it forces the questions: what counts as right, what counts as wrong, what is the bar.

If you're hiring someone to build agents for you, demand an eval harness as a deliverable. Read the methodology page. Look at the golden set. Ask what it doesn't measure.

Either way: agents without evals are agents you can't change safely. And agents you can't change safely are agents that go stale the moment the world does.
