Imagine you've just changed the system prompt for a customer-support bot. The bot now sounds friendlier. Your gut says it's better. But is it? That instinct ("looks good to me") is what evals exist to replace.
An eval is, at its core, a test suite for LLM outputs. You feed in inputs, collect outputs, and score them. The hard part is that last step: scoring. For a traditional unit test, you check whether a function returns 42. For a language model, you're asking whether a paragraph is accurate, helpful, and not subtly wrong in a way that only an expert would catch. That gap is what makes evals genuinely interesting engineering.
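The feed, collect, score loop can be sketched in a few lines of Python. This is a minimal illustration, not a real framework: `fake_model`, `contains_grader`, and `run_eval` are hypothetical names, the canned answers are made up, and the model is a stub standing in for an actual LLM call.

```python
from typing import Callable, List, Tuple


def fake_model(prompt: str) -> str:
    """Stub standing in for a real LLM call (an assumption for illustration)."""
    canned = {
        "How do I reset my password?": "Click 'Forgot password' on the login page.",
        "What is your refund window?": "We offer refunds at any time.",
    }
    return canned.get(prompt, "I'm not sure.")


def contains_grader(output: str, expected: str) -> bool:
    """Simplest possible scorer: pass if the expected phrase appears in the output."""
    return expected.lower() in output.lower()


def run_eval(
    model: Callable[[str], str],
    cases: List[Tuple[str, str]],
    grader: Callable[[str, str], bool],
) -> float:
    """Feed each input to the model, collect the output, score it, return the pass rate."""
    results = [grader(model(prompt), expected) for prompt, expected in cases]
    return sum(results) / len(results)


# Hypothetical test cases: (input prompt, phrase the answer should contain).
cases = [
    ("How do I reset my password?", "forgot password"),
    ("What is your refund window?", "30 days"),
]

pass_rate = run_eval(fake_model, cases, contains_grader)
print(pass_rate)
```

A substring check like this is the closest analogue to asserting that a function returns 42. It says nothing about whether a paragraph is accurate or helpful, which is why scoring is the hard part.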
The anatomy of an eval run
Automated evaluation workflows have two parts. First, you need data: synthetic examples, curated test cases, or real logs from your LLM app. Second, you need a scoring method, which might return a pass/fail, a label, or a numerical score.
Those two things, data and a grader, are everything. The data is your test cases. The grader is the logic that decides whether each output is good or bad. Everything else is scaffolding.
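To make the three scoring shapes concrete, here is a sketch with one grader of each kind: pass/fail, label, and numerical score. All function names and rules are hypothetical and deliberately simplistic; a real grader might rely on reference answers, heuristics, or another model.

```python
from typing import Sequence


def pass_fail_grader(output: str, required_phrase: str) -> bool:
    """Pass/fail: does the output mention the required phrase?"""
    return required_phrase.lower() in output.lower()


def label_grader(output: str) -> str:
    """Label: a crude tone classifier based on surface cues (illustrative only)."""
    text = output.lower()
    if "sorry" in text or "apologize" in text:
        return "apologetic"
    if "!" in output:
        return "enthusiastic"
    return "neutral"


def numeric_grader(output: str, keywords: Sequence[str]) -> float:
    """Numerical score: fraction of expected keywords found in the output."""
    if not keywords:
        return 0.0
    text = output.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    return hits / len(keywords)


# A made-up bot reply to grade.
output = "Sorry about that! You can request a refund within 30 days."

passed = pass_fail_grader(output, "30 days")
label = label_grader(output)
score = numeric_grader(output, ["refund", "30 days", "receipt"])
print(passed, label, round(score, 2))
```

All three graders read the same output and differ only in what they return, so the surrounding scaffolding can stay the same when you swap one for another.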
A typical test case looks like this: