Diagram: An agent's answer graded by an LLM judge against the gold standard, producing a scorecard that feeds back into the next run.

A compliance agent can produce a fluent, well-cited answer that reads exactly like what a compliance officer would write — and still be wrong in a way that matters to a regulator. The hard part was never getting an answer out of the model. It’s knowing whether that answer would survive expert review.

That’s what an eval system is for. Without one, every change you make — a new model, a tweaked prompt, a different retrieval strategy — is a guess about whether things got better or worse.

We wrote earlier about why general-purpose LLMs fail at regulated compliance work, and what changed once we built the infrastructure to fix it. This is the engineering side of that story: how the evaluation actually works.

The framework runs across every compliance agent we ship, but it’s easier to explain through one. So this post follows a single agent end to end — the screening agent, which reviews documents and customer interactions for regulatory issues.

Building an evaluation flywheel

The whole system is a loop with five stages:

  1. Execute — run the agent against realistic scenarios, under a range of user behaviors.
  2. Measure — collect programmatic metrics and LLM judge scores.
  3. Analyze — enrich each result with a structured failure classification and land it in BigQuery.
  4. Improve — fix bugs, adjust configs, refine the agent.
  5. Re-measure — rerun the suite and quantify what moved.

The first rotation took months. The most recent took days. That speedup is the whole point: once the infrastructure exists, iterating is cheap. The part that stays expensive — and that no amount of tooling replaces — is the scenarios themselves, and the expert judgment that decides what “correct” even means inside them.

Diagram: The eval flywheel — a circular diagram of the five stages (Execute → Measure → Analyze → Improve → Re-measure) connected by directional arrows.

Growing the benchmark from synthetic to real

You can’t wait for real customer traffic to start measuring quality, so the benchmark cold-started entirely on synthetic scenarios — cases written by hand to look like the work the agent would eventually face. That gets you moving, but synthetic data has a ceiling: it reflects what you imagine users will do, not what they actually do.

So as real traffic came in, we started swapping it in. A production case that exposed something interesting — a messy document, an edge-case activity, a phrasing we hadn’t seen before — gets pulled into the benchmark, stripped of anything identifying, and turned into a permanent test case. Over time the suite drifts from mostly-synthetic toward mostly-real, and it keeps drifting as new traffic surfaces new behavior.

Both still earn their place. Synthetic lets you manufacture rare or dangerous conditions on demand, at volume. Real keeps the whole thing honest about what’s actually coming through the product.

Coverage is the other half of the job. There are 500+ compliance-monitored activities in the taxonomy, grouped into buckets — marketing review, customer-facing communications, transaction monitoring, KYC, and more. Each case targets a specific activity, so the benchmark reflects the real breadth of the work, not just the cases that are easy to assemble.
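One way to keep that breadth visible is a coverage report over the taxonomy. The sketch below is illustrative only: the bucket and activity names, the `activity` field on each case, and the `coverage_gaps` helper are all assumptions made for the example, not the actual schema.

```python
from collections import Counter

# Hypothetical slice of the taxonomy: bucket -> activities.
# The real taxonomy has 500+ activities; these names are placeholders.
TAXONOMY = {
    "marketing_review": ["social_media_ad", "email_campaign"],
    "kyc": ["identity_verification", "beneficial_owner_check"],
}


def coverage_gaps(taxonomy, cases):
    """Return, per bucket, the activities that no benchmark case targets."""
    covered = Counter(case["activity"] for case in cases)
    gaps = {}
    for bucket, activities in taxonomy.items():
        missing = [a for a in activities if covered[a] == 0]
        if missing:
            gaps[bucket] = missing
    return gaps


# Each case targets exactly one activity.
cases = [
    {"id": "case-001", "activity": "social_media_ad"},
    {"id": "case-002", "activity": "identity_verification"},
    {"id": "case-003", "activity": "identity_verification"},
]
gaps = coverage_gaps(TAXONOMY, cases)
```

Here `gaps` lists `email_campaign` and `beneficial_owner_check` as uncovered, which is the kind of output that tells you which scenarios to write next.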

How the judges score

Two layers of judges grade each run.

Correctness judges check the basics of the run — task-completion, groundedness, instruction-adherence. Did the agent finish what it was asked? Did it stay grounded in the evidence it actually had? Did it follow the instructions it was given?

A quality judge looks at the final screening output. It compares the agent’s findings against an expert-labeled gold set and sorts each finding into one of five buckets: MATCHES_GOLD, VALID_NOVEL, BORDERLINE, UNSUPPORTED, or DUPLICATE_OR_TOO_VAGUE.

A screening output is naturally rate-based — lots of findings, each one independently right or wrong — so the metrics come straight out of those buckets as gradient scores rather than a single pass/fail:

Metric | What it captures
--- | ---
recall_p1, recall_p2, recall_p3 | Per-priority recall across the gold findings. Each is COVERED (1.0), PARTIALLY_COVERED (0.5), or MISSED (0.0), so a miss on a high-priority issue can’t hide behind a clean sweep of minor ones.
hallucination_rate | unsupported_count / predicted_total
precision_w | 1 - (unsupported_count + 0.5 · duplicate_or_vague_count) / predicted_total
citation_score | Each citation scored on relevance, support strength, and utility, then weighted by whether the finding itself was valid, so a well-cited hallucination can’t inflate the number.
band_agreement | Agreement with the expert risk score on a 0–100 scale (bands at 20/40/60/80), with ±5 ramps around each boundary so a 78 vs. an 82 isn’t punished as a full band apart.
signed_delta | predicted_risk − gold_risk. Surfaces directional bias: whether the agent tends to over- or under-call.
novelty_yield | valid_novel_count / predicted_total. Credit for valid findings outside the gold set. It’s the counterweight that keeps the agent willing to flag real issues an expert missed.
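As a minimal sketch, the count-based metrics follow directly from the bucket labels. The formulas for hallucination_rate, precision_w, and novelty_yield are the ones given above; the function names, the input shapes, and the choice to average coverage values within each priority are assumptions made for illustration.

```python
from collections import Counter

# Coverage values per gold finding, as stated in the metric definitions.
COVERAGE_VALUE = {"COVERED": 1.0, "PARTIALLY_COVERED": 0.5, "MISSED": 0.0}


def rate_metrics(buckets):
    """Compute the rate metrics from one bucket label per predicted finding."""
    if not buckets:
        # Assumption: the rates are undefined when the agent predicted nothing.
        raise ValueError("need at least one predicted finding")
    counts = Counter(buckets)
    total = len(buckets)
    unsupported = counts["UNSUPPORTED"]
    vague = counts["DUPLICATE_OR_TOO_VAGUE"]
    return {
        "hallucination_rate": unsupported / total,
        "precision_w": 1 - (unsupported + 0.5 * vague) / total,
        "novelty_yield": counts["VALID_NOVEL"] / total,
    }


def recall_by_priority(gold_findings):
    """gold_findings: (priority, coverage_label) pairs with priority in {1, 2, 3}.

    Assumption: recall_pN is the mean coverage value over priority-N gold
    findings, and None when a case has no gold findings at that priority.
    """
    result = {}
    for priority in (1, 2, 3):
        values = [COVERAGE_VALUE[label] for p, label in gold_findings if p == priority]
        result[f"recall_p{priority}"] = sum(values) / len(values) if values else None
    return result


predicted = ["MATCHES_GOLD", "MATCHES_GOLD", "VALID_NOVEL",
             "UNSUPPORTED", "DUPLICATE_OR_TOO_VAGUE"]
rates = rate_metrics(predicted)
recalls = recall_by_priority([(1, "COVERED"), (1, "MISSED"), (2, "PARTIALLY_COVERED")])
```

With five predicted findings, one unsupported and one vague, the unsupported finding costs a full point in precision_w and the vague one costs half a point. Keeping recall separate per priority is what stops a missed priority-1 finding from being diluted by easy priority-3 hits.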

Enriching scores into diagnoses

A judge score sitting on a dashboard isn’t an improvement signal. A judge score that also tells you what failed and why is. That distinction is the entire reason the pipeline exists.

Programmatic metrics take the short path. Test execution writes them straight to the warehouse — turn counts, duration, task breakdowns, state transitions, outcome classification. Cheap, reproducible, and good enough to catch the gross failures.

Judge scores take the long path, because they need enrichment first. They land on a queue, and a background workflow picks each one up and tags it: a one-line summary, a primary failure category drawn from an 18-category taxonomy, and a concrete improvement hint. Only then does the enriched score reach the warehouse, where it sits alongside everything else for analysis.
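The long path can be sketched as a small queue consumer. This is a stand-in under stated assumptions, not the production workflow: the record fields, the `drain_and_enrich` helper, the in-memory `queue.Queue` and list standing in for the real queue and warehouse, and the two category names are all hypothetical. The classifier is passed in as a callable because the text does not say how tagging is implemented.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeScore:
    run_id: str
    metric: str
    value: float
    rationale: str  # the judge's free-text explanation


@dataclass(frozen=True)
class EnrichedScore:
    score: JudgeScore
    summary: str           # one-line summary
    failure_category: str  # primary category from the taxonomy
    improvement_hint: str  # concrete suggestion


def drain_and_enrich(pending, classify, warehouse):
    """Enrich every queued score and append it to the warehouse stand-in.

    classify(score) must return (summary, failure_category, improvement_hint).
    Returns the number of scores processed.
    """
    processed = 0
    while True:
        try:
            score = pending.get_nowait()
        except queue.Empty:
            return processed
        summary, category, hint = classify(score)
        warehouse.append(EnrichedScore(score, summary, category, hint))
        pending.task_done()
        processed += 1


def stub_classifier(score):
    """Placeholder tagger; category names here are invented for the example."""
    if "no supporting evidence" in score.rationale:
        return ("Finding lacks evidence", "UNGROUNDED_CLAIM", "Require a citation per finding")
    return ("No issue detected", "NONE", "No change suggested")


pending = queue.Queue()
pending.put(JudgeScore("run-17", "groundedness", 0.4, "Finding 2 has no supporting evidence"))
warehouse = []
processed = drain_and_enrich(pending, stub_classifier, warehouse)
```

The useful property is that nothing reaches the warehouse without all three tags attached, so every stored judge score is already queryable by failure category.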

Diagram: Data pipeline. Two paths flow from Test Execution. Programmatic metrics go straight to the Warehouse. Judge scores pass through a Queue and an Enrichment step that adds failure classification before reaching the Warehouse. Both feed Dashboards.

The enrichment is what makes a bad day investigable. Group the scores by failure category and you get a daily breakdown of where the agent is breaking — and, more usefully, which specific cases to open first.
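A rough illustration of that grouping, using plain Python in place of a warehouse query. The row fields (`day`, `failure_category`, `run_id`) and the category names are assumed for the example; in the warehouse the same breakdown would be a group-by over the enriched scores.

```python
from collections import defaultdict


def failure_breakdown(rows):
    """Group enriched scores by (day, failure_category).

    Returns a list of ((day, category), [run_ids]) ordered by day, then by
    descending count, so the largest failure cluster of each day comes first.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["day"], row["failure_category"])].append(row["run_id"])
    return sorted(
        grouped.items(),
        key=lambda item: (item[0][0], -len(item[1]), item[0][1]),
    )


rows = [
    {"day": "2024-05-01", "failure_category": "MISSED_DISCLOSURE", "run_id": "run-1"},
    {"day": "2024-05-01", "failure_category": "UNGROUNDED_CLAIM", "run_id": "run-2"},
    {"day": "2024-05-01", "failure_category": "UNGROUNDED_CLAIM", "run_id": "run-3"},
]
breakdown = failure_breakdown(rows)
```

The first entry of `breakdown` is the day's largest cluster together with the run IDs to open first.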

Calibrating judges against experts

The hard part of using an LLM as a judge isn’t writing the prompt. It’s pinning down what the judge should measure in a way that matches how an expert actually thinks. And the place that got genuinely hard was severity.

Every finding carries a risk score from 0 to 100, and the gold score the judge calibrates against comes from compliance officers. So there’s a question you have to answer before any of it means anything: hand the same case to two compliance officers, and do they even agree on the score?

Not always. And the disagreement wasn’t random noise — it was two different ways of thinking.

One officer worked deductively. Start from the triggers — a regulatory keyword, a jurisdiction, an exposure threshold — and let them set hard floors and ceilings. Risk as a checklist.

The other worked inductively. Break risk into dimensions — regulatory exposure, consumer impact, reputational harm, operational complexity — score each against the whole picture, then weigh them. Risk as a portrait.

Both are right. Both are how working compliance officers actually reason. So the disagreement wasn’t a problem to average away — it was telling us something about the domain. We folded both styles into a single layered procedure, and that’s what now generates every gold severity score in the benchmark:

  1. Anchor on external exposure — triggers set a floor. (Deductive.)
  2. Rate across dimensions — score each one independently. (Inductive.)
  3. Apply context — weigh those dimensions against the full narrative.
  4. Aggregate with primacy — regulatory and consumer exposure anchor the final number.
  5. Consistency check — the score has to agree with the written reasoning.
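To make the layering concrete, here is a stylized sketch of the five steps as arithmetic. Experts apply the procedure by judgment, so treat this purely as an illustration: the weights, the additive context adjustments, the use of the band edges for the consistency check, and every function name are assumptions, not the rubric itself.

```python
import bisect

DIMENSIONS = ("regulatory_exposure", "consumer_impact",
              "reputational_harm", "operational_complexity")

# Hypothetical weights: regulatory and consumer exposure get primacy (step 4).
WEIGHTS = {
    "regulatory_exposure": 0.35,
    "consumer_impact": 0.35,
    "reputational_harm": 0.15,
    "operational_complexity": 0.15,
}

BAND_EDGES = (20, 40, 60, 80)  # bands on the 0-100 risk scale


def clamp(value):
    return max(0.0, min(100.0, value))


def layered_severity(trigger_floor, dimension_scores, context_adjustments=None):
    """Combine the deductive floor with inductive dimension scores.

    trigger_floor: minimum score implied by external triggers (step 1).
    dimension_scores: independent 0-100 score per dimension (step 2).
    context_adjustments: optional per-dimension deltas from the narrative (step 3).
    """
    adjustments = context_adjustments or {}
    adjusted = {d: clamp(dimension_scores[d] + adjustments.get(d, 0.0)) for d in DIMENSIONS}
    weighted = sum(WEIGHTS[d] * adjusted[d] for d in DIMENSIONS)  # step 4
    return max(float(trigger_floor), weighted)


def is_consistent(score, stated_band):
    """Step 5: the numeric score must fall in the band the reasoning names (0-4)."""
    return bisect.bisect_right(BAND_EDGES, score) == stated_band


dims = {"regulatory_exposure": 90, "consumer_impact": 80,
        "reputational_harm": 50, "operational_complexity": 30}
score = layered_severity(trigger_floor=60, dimension_scores=dims)
```

The floor only matters when the dimension scores come out lower than the triggers allow, which mirrors how the deductive style constrains the inductive one rather than replacing it.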

That rubric is what band_agreement and signed_delta measure the agent against. Not one person’s gut, but a synthesis that captures how compliance officers actually reason about risk — and gives the experts and the judge a shared way to explain why a score is what it is.
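The two severity metrics can be sketched as follows. The band edges (20/40/60/80), the ±5 ramp, and the definition of signed_delta come from the metric descriptions; the specific interpolation, a "soft band position" that rises linearly across each ramp, is one plausible reading and an assumption on our part, as are the function names.

```python
BOUNDARIES = (20, 40, 60, 80)  # band edges on the 0-100 scale
RAMP = 5.0                     # half-width of the ramp around each edge


def soft_band(score):
    """Map a 0-100 score to a continuous band position in [0, 4].

    Each edge contributes 0 below (edge - RAMP), 1 above (edge + RAMP),
    and rises linearly in between, so scores near an edge sit between bands.
    """
    position = 0.0
    for edge in BOUNDARIES:
        position += min(1.0, max(0.0, (score - (edge - RAMP)) / (2 * RAMP)))
    return position


def band_agreement(predicted_risk, gold_risk):
    """1.0 for the same soft band position, falling to 0.0 one full band apart."""
    return max(0.0, 1.0 - abs(soft_band(predicted_risk) - soft_band(gold_risk)))


def signed_delta(predicted_risk, gold_risk):
    """Positive when the agent over-calls risk, negative when it under-calls."""
    return predicted_risk - gold_risk


near_miss = band_agreement(82, 78)   # straddles the 80 edge
full_miss = band_agreement(50, 70)   # centers of adjacent bands
```

Under this reading, 82 against a gold 78 scores about 0.6 rather than 0, while 50 against 70 still scores 0 because the two sit squarely in different bands.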

If there’s one lesson in here, it’s this: when your expert reviewers disagree on something that matters, don’t average them. Find out why. The disagreement is usually telling you something fundamental about the work.

Guarding against Goodhart

The eval system runs on two tiers by design. A small, expensive tier of expert-labeled cases defines what “correct” means. A much larger tier, scored by the LLM judge, applies that standard across the thousands of runs no expert team could grade by hand. So day to day, most scores come from the judge — and that scale is exactly the point.

It also carries a risk. The judge is only a stand-in for expert review, and once you start tuning the agent to raise its score, you can end up with an agent that gets better at pleasing the judge without getting any better at the actual work.

Three things keep that honest:

  1. Holdout scenarios. Some cases never enter the calibration set, so an improvement has to generalize — not just flatter the rubric the judge was tuned on.
  2. Expert spot-checks. Compliance officers re-score sampled traces, and those go head to head with the judge. If the judge passes something an expert would fail, that’s a bug in the judge, not an acceptable miss.
  3. Disagreement as a signal. When the judge and an expert diverge, it’s a trigger to investigate, not a number to smooth over. Either the agent regressed, the rubric needs work, or the judge prompt is off — and every one of those is worth knowing.
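Two of these guards are easy to sketch in code. Everything below is illustrative: the hash-based split, the 20% holdout fraction, the pass/fail verdict shape, and the helper names are assumptions, not a description of the actual setup.

```python
import hashlib


def is_holdout(case_id, holdout_fraction=0.2):
    """Deterministically assign a case to the holdout set by hashing its ID.

    Hashing keeps the assignment stable across runs, so a holdout case can
    never drift into the calibration set when the suite is regenerated.
    """
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:4], "big") % 10_000
    return slot < holdout_fraction * 10_000


def judge_expert_disagreements(verdicts):
    """Split sampled traces by how the judge and the expert disagree.

    verdicts: (trace_id, judge_passed, expert_passed) tuples.
    Returns (too_lenient, too_strict) lists of trace IDs. A trace the judge
    passed and the expert failed is treated as a judge bug to investigate.
    """
    too_lenient = [t for t, judge, expert in verdicts if judge and not expert]
    too_strict = [t for t, judge, expert in verdicts if expert and not judge]
    return too_lenient, too_strict


sampled = [
    ("trace-1", True, True),
    ("trace-2", True, False),   # judge passed what the expert failed
    ("trace-3", False, True),
]
too_lenient, too_strict = judge_expert_disagreements(sampled)
```

Both lists are worth reading: the lenient cases point at gaps in the judge, and the strict cases often point at a rubric that has drifted from how experts actually score.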

The system earns its keep precisely because it isn’t trusted blindly. It’s an instrument under constant validation, not an oracle.

One lesson, learned the hard way

If there’s one thing we’d do differently, it’s the thing that cost us the most time. We treated expert evaluation as the final check on a system we’d already built — simulation, pipeline, judges, and dashboards all in place, with expert annotations brought in at the end to validate it.

Those annotations changed everything: the scoring methodology, the judge design, even the agent’s architecture. Expert feedback is slow and expensive, which is exactly why it belongs first, not last. The small, rough annotations you collect in week one are what define “good” in your domain — and every scenario, judge, and metric downstream is built on that definition.

Start with the experts.