The Two Graders Inside Every LLM Eval

An LLM eval isn't a single thing. It's a pipeline with two distinct graders - one mechanical, one probabilistic - and knowing which to reach for changes what you can actually trust.

Most teams treat evals as a box to check before shipping a prompt change. Run the suite, watch the numbers, ship if nothing collapsed. That framing obscures what an eval actually is - and why a passing score can mean very different things depending on how the grader was built.

Here is what happens mechanically when a real eval runs.

The golden dataset comes first

Before any grader runs, you need inputs and their expected outputs. This is the golden dataset: a curated set of test cases that define what "correct" looks like for your specific application.

The hard part is sourcing it honestly. Spending two days labeling real sessions produces a dataset that reflects actual failure patterns, which is why your golden dataset should come from production, not from synthetic inputs.

This is the step most teams skip or rush. They generate synthetic examples, find that the model passes them all, and feel reassured. Then a real user finds a failure mode in the first week. Synthetic inputs tend to be too clean. Production logs are where the actual edge cases live.
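To make the shape of a golden dataset concrete, here is a minimal sketch of one stored as JSON Lines. The field names (case_id, input_text, expected_output, source) and the helper functions are assumptions chosen for illustration, not a standard schema; the source field exists only so production-derived cases can be told apart from synthetic ones.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One labeled test case. All field names are illustrative assumptions."""
    case_id: str
    input_text: str
    expected_output: str
    source: str  # where the case came from, e.g. "production" or "synthetic"


def dump_jsonl(cases):
    """Serialize cases to JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(case)) for case in cases)


def load_jsonl(text):
    """Parse JSON Lines back into GoldenCase objects, skipping blank lines."""
    return [GoldenCase(**json.loads(line)) for line in text.splitlines() if line.strip()]


cases = [
    GoldenCase(
        case_id="booking-001",
        input_text="can I see the apartment on July 4th, 2026?",
        expected_output='{"date": "2026-07-04"}',
        source="production",
    ),
]

# Round-trip through the on-disk format to confirm nothing is lost.
roundtrip = load_jsonl(dump_jsonl(cases))
production_only = [case for case in roundtrip if case.source == "production"]
```

Keeping one case per line makes the file easy to diff in review when someone adds or relabels a case.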

Grader one: deterministic assertions

Once you have a dataset, the first pass is code-based. You run the model against each input and check the output programmatically.

Apply heuristic checks first. Before running expensive LLM judges, use deterministic metrics - for example, Equals for exact matches, RegexMatch for format validation, and IsJson for structured output checks. These are fast, cheap, and reliable for structural requirements.
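As a rough illustration of what such checks do, here are plain-Python stand-ins for the three metrics named above. These are not the API of any particular eval framework; the function names and signatures are assumptions, written only to show how little logic each check needs.

```python
import json
import re


def equals(output: str, expected: str) -> bool:
    """Exact match, ignoring leading and trailing whitespace."""
    return output.strip() == expected.strip()


def regex_match(output: str, pattern: str) -> bool:
    """True if the pattern is found anywhere in the output."""
    return re.search(pattern, output) is not None


def is_json(output: str) -> bool:
    """True if the output parses as JSON."""
    try:
        json.loads(output)
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return False
    return True


ISO_DATE = r"^\d{4}-\d{2}-\d{2}$"
results = {
    "equals": equals(" 2026-07-04 ", "2026-07-04"),
    "regex_match": regex_match("2026-07-04", ISO_DATE),
    "is_json": is_json('{"date": "2026-07-04"}'),
}
```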

A concrete example: if your agent extracts a booking date from a user message, a user query like "can I see the apartment on July 4th, 2026?" has one - and only one - correct interpretation. The AI's job is to extract that date and put it into the right format for a downstream tool. This is a deterministic, objective task that's either right or wrong.

For tasks like this, a code-based eval is simply an assertion: assert output["date"] == "2026-07-04". The check itself runs in milliseconds and costs nothing beyond compute. Code-based evals are also cheaper to create and maintain than other kinds of evals: since there is an expected output, you only have to run the LLM to generate the answer and then apply a simple assertion. That makes them cheap enough to hook into your CI pipeline and run on every commit to catch regressions.
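A slightly fuller sketch of that assertion follows. The fake_agent function is a hypothetical stand-in for the real model call and simply returns a canned answer; check_booking_date is also an assumed helper, added so that a malformed or missing date fails cleanly instead of raising.

```python
from datetime import date


def check_booking_date(output: dict, expected_iso: str) -> bool:
    """Pass only if output['date'] is a valid ISO date equal to the expected one."""
    raw = output.get("date")
    if not isinstance(raw, str):
        return False
    try:
        date.fromisoformat(raw)  # rejects strings that are not valid dates
    except ValueError:
        return False
    return raw == expected_iso


def fake_agent(message: str) -> dict:
    """Hypothetical stand-in for the model under test; returns a canned result."""
    return {"date": "2026-07-04"}


output = fake_agent("can I see the apartment on July 4th, 2026?")

# The assertion from the text, plus the stricter helper.
assert output["date"] == "2026-07-04"
passed = check_booking_date(output, "2026-07-04")
```

In a CI setup, the same assertion would typically live in a test function that the test runner discovers and executes on each commit.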

The limitation is obvious: not everything has a single correct answer.

Grader two: the LLM judge

When a task involves judgment - tone, safety, relevance, whether a chatbot should escalate to a human - you cannot write a deterministic rule. Take the problem of knowing when to hand off a conversation to a human agent. If a user says "I'm confused," should the AI hand off immediately, or try to clarify things first? There's no single right answer; it's a judgment call based on product philosophy. A code-based test can't evaluate this kind of nuance.

Here is where LLM-as-judge enters. You write a rubric - a detailed prompt describing what a good response looks like for this criterion - and ask a second, usually stronger model to score the output against it. LLM judges are often prompted with detailed instructions and sometimes few-shot examples to guide their judging behavior. Some frameworks use chain-of-thought prompting, where the judge is asked to reason step-by-step about the output's quality before giving a score.

The judge produces a score, often 1-5, plus a short rationale. You then check that score against a threshold.
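The rubric-score-threshold loop can be sketched as follows. Everything here is an assumption made for illustration: the rubric wording, the JSON reply format, the threshold of 4, and the stub_judge function, which stands in for a real model call and returns a fixed reply so the example runs offline.

```python
import json
from typing import Callable, NamedTuple

# Illustrative rubric; doubled braces are literal braces under str.format.
RUBRIC = """You are grading a support assistant's reply.
Criterion: the reply should try to clarify before handing off to a human.
Score from 1 (fails the criterion) to 5 (fully meets it).

User message: {user_message}
Assistant reply: {response}

Answer with JSON only: {{"score": <1-5>, "rationale": "<one sentence>"}}"""


class Verdict(NamedTuple):
    score: int
    rationale: str
    passed: bool


def judge_output(call_judge: Callable[[str], str], user_message: str,
                 response: str, threshold: int = 4) -> Verdict:
    """Fill in the rubric, ask the judge, and compare its score to a threshold."""
    prompt = RUBRIC.format(user_message=user_message, response=response)
    data = json.loads(call_judge(prompt))
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return Verdict(score, str(data.get("rationale", "")), score >= threshold)


def stub_judge(prompt: str) -> str:
    """Hypothetical judge: a real one would send the prompt to a model."""
    return '{"score": 4, "rationale": "Asks a clarifying question first."}'


verdict = judge_output(stub_judge, "I'm confused",
                       "Happy to help. Which step is unclear?")
```

Validating the score range before comparing it to the threshold keeps a malformed judge reply from silently counting as a pass or a fail.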

The LLM judge is a second model reading the first model's output and asking: does this meet the rubric?

LLM-as-judge handles the quality dimensions that deterministic checks can't reach. Deterministic rules handle format validation, length constraints, and required elements quickly and cheaply. The two graders are complementary, not interchangeable.

Why you have to validate the judge

Here is the part that trips people up: the LLM judge itself can be wrong. Single LLM judges have known biases and vulnerabilities. A poorly prompted judge can systematically favor longer responses, or agree with whatever framing the rubric implies, regardless of actual quality.

Before trusting a judge at scale, validate it against your golden dataset by checking how often its scores agree with your human labels. The full sequence is: start from real failure modes in your traces, pick the right metric, write a clear rubric, and then measure agreement. Aim for 75-90% agreement with human labels before you scale the judge.
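Agreement here can be computed as the fraction of cases where the judge's pass/fail verdict matches the human label. The sketch below assumes both sets of labels are already reduced to booleans in the same case order; the eight labels are made up for illustration.

```python
def agreement_rate(human_labels, judge_labels):
    """Fraction of cases where the judge's verdict matches the human label."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled case")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


# Made-up labels: True means "acceptable output".
human = [True, True, False, True, False, True, True, False]
judge = [True, True, False, False, False, True, True, True]

rate = agreement_rate(human, judge)  # 6 of 8 match, so 0.75

# List the disagreements; these are the cases worth reading by hand.
disagreements = [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
```

Reading the disagreeing cases usually says more about how to fix the rubric than the headline number does.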

A 100% pass rate is a sign the eval isn't hard enough, not a sign everything is fine. If your judge approves every output, your rubric is probably too permissive.

How the two graders work together in practice

A well-structured eval run has a specific order. First, run all deterministic assertions - they're cheap and fast, and a failure there is unambiguous. If those pass, run the LLM judge only on the outputs that require semantic or subjective judgment. The LLM judge should not run by default. It only fires when the heuristics genuinely cannot decide.
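That ordering can be sketched as a single function. The names run_case and needs_judgment, the result dictionary, and the counting judge are all assumptions made for illustration; the point is only that the judge is never called when a deterministic check has already failed or when the case needs no judgment.

```python
from typing import Callable, Optional


def run_case(output: str,
             checks: list[Callable[[str], bool]],
             needs_judgment: bool,
             judge: Optional[Callable[[str], bool]] = None) -> dict:
    """Run cheap deterministic checks first; call the judge only if needed."""
    for check in checks:
        if not check(output):
            # Unambiguous failure: stop here without paying for a judge call.
            return {"passed": False, "stage": "deterministic", "judge_called": False}
    if not needs_judgment or judge is None:
        return {"passed": True, "stage": "deterministic", "judge_called": False}
    return {"passed": judge(output), "stage": "judge", "judge_called": True}


def non_empty(output: str) -> bool:
    """Deterministic check: was anything produced at all?"""
    return bool(output.strip())


judge_calls = []


def counting_judge(output: str) -> bool:
    """Hypothetical judge that records each call and approves the output."""
    judge_calls.append(output)
    return True


empty_result = run_case("", [non_empty], needs_judgment=True, judge=counting_judge)
summary_result = run_case("Decision: ship on Friday.", [non_empty],
                          needs_judgment=True, judge=counting_judge)
```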

This matters for cost at scale. Paying per evaluation call does not scale if the judge is your default. A tool like Beagle that generates summaries across dozens of Slack threads, for example, might use a regex check to confirm a summary was produced at all, and only invoke a judge to score whether the summary actually captured the decision buried in the thread.

The offline version of this pipeline runs before a release and gives you regression coverage. The online version samples live traffic and runs the same graders on real outputs after the fact, catching drift that only surfaces in production. Online evaluations act as a smoke detector for AI agent quality degradation: you can detect drift or emerging failure patterns soon after they appear rather than discovering problems through user complaints.
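One way to sample live traffic is to hash each trace ID into a bucket, so the same trace is always either in or out of the sample. This is a sketch of one possible approach, not a method prescribed by the text; the trace ID format and the 10% rate are assumptions.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces by hashing their ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical trace IDs; in practice these would come from production logs.
trace_ids = [f"trace-{n}" for n in range(1000)]
sampled = [tid for tid in trace_ids if should_sample(tid, 0.10)]
```

Because the decision depends only on the trace ID, rerunning the graders later picks out the same traces, which keeps online scores comparable from run to run.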

The instinct to reach for infrastructure first - a framework, a dashboard, a scoring pipeline - tends to produce evals that are technically running but measuring the wrong things. The discipline is in the sourcing: real production logs, manually labeled, rubric validated against human judgment. After that, the mechanics are just plumbing.