Steven's Knowledge

Automated Evaluation

LLM-as-judge, assertion-based scoring, rubric grading, and the tools that tie it together

You can't have humans review every LLM output in production. You also can't trust a simple string match to tell you if a paragraph-length response is good. Automated evaluation sits in between — using a mix of programmatic checks and model-based judging to score outputs at scale. The key insight: different evaluation strategies suit different quality dimensions, and the best systems layer multiple approaches.

Assertion-Based Evaluation

The simplest automated evaluation: programmatic checks that verify concrete properties of the output.

Exact match — the output must be exactly this string. Useful for classification tasks, single-value extraction, yes/no questions.

Contains / not-contains — the output must include (or exclude) specific strings. Good for checking that required information is present or that forbidden content is absent.

Regex match — the output must match a pattern. Useful for format validation: phone numbers, dates, email addresses, structured IDs.

JSON schema validation — the output must parse as valid JSON and conform to a schema. Essential for any system that produces structured output.

Length bounds — the output must be within a word/character/token count range. Catches runaway generation and truncated responses.

Custom functions — arbitrary code that takes the output and returns pass/fail with a reason. The escape hatch for domain-specific checks.

Assertion-based evaluation is deterministic, cheap, and fast. Use it for every dimension you can express as a programmatic rule. It's the foundation layer — always run assertions before reaching for model-based evaluation.
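
The checks above can be sketched as small composable functions. This is a minimal illustration using only the standard library; the names (`Result`, `contains`, `json_with_keys`, `run_assertions`) are hypothetical, and the JSON check only verifies required top-level keys rather than a full JSON Schema.

```python
import json
import re
from typing import Callable, List, NamedTuple, Set


class Result(NamedTuple):
    passed: bool
    reason: str


Check = Callable[[str], Result]


def exact_match(expected: str) -> Check:
    # Assumption: surrounding whitespace is ignored before comparing.
    def check(output: str) -> Result:
        ok = output.strip() == expected
        return Result(ok, "exact match" if ok else f"expected {expected!r}")
    return check


def contains(needle: str) -> Check:
    def check(output: str) -> Result:
        ok = needle in output
        return Result(ok, f"contains {needle!r}" if ok else f"missing {needle!r}")
    return check


def matches(pattern: str) -> Check:
    compiled = re.compile(pattern)

    def check(output: str) -> Result:
        ok = compiled.search(output) is not None
        return Result(ok, f"pattern {pattern!r} " + ("found" if ok else "not found"))
    return check


def json_with_keys(required: Set[str]) -> Check:
    # Simplified stand-in for schema validation: parse, then check top-level keys.
    def check(output: str) -> Result:
        try:
            data = json.loads(output)
        except json.JSONDecodeError as exc:
            return Result(False, f"invalid JSON: {exc.msg}")
        if not isinstance(data, dict):
            return Result(False, "top-level JSON value is not an object")
        missing = required - set(data)
        if missing:
            return Result(False, f"missing keys: {sorted(missing)}")
        return Result(True, "all required keys present")
    return check


def word_count_between(low: int, high: int) -> Check:
    def check(output: str) -> Result:
        count = len(output.split())
        return Result(low <= count <= high, f"{count} words (allowed {low}-{high})")
    return check


def run_assertions(output: str, checks: List[Check]) -> List[Result]:
    """Run every check and keep the reasons, so failures are easy to diagnose."""
    return [check(output) for check in checks]
```

Each check returns a reason alongside pass/fail, which keeps the "custom function" escape hatch uniform with the built-in checks.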

LLM-as-Judge

For dimensions that can't be captured by assertions — coherence, helpfulness, tone, factual accuracy of free text — use another LLM to evaluate the output.

The basic pattern:

  1. Give the judge model the original input, the system output, and evaluation criteria.
  2. Ask it to score the output on a scale (1-5) or assign a label (pass/fail, good/acceptable/poor).
  3. Require a brief justification before the score (chain-of-thought improves calibration).

Critical design choices:

  • Judge model selection — use a stronger model than the one being evaluated. If you're evaluating GPT-4o-mini outputs, judge with Claude Opus or GPT-4o. Judging with a weaker model leads to unreliable scores.
  • Criteria specificity — vague criteria ("is this response good?") produce inconsistent scores. Specific criteria ("does the response answer the user's question without introducing information not present in the source document?") produce reliable scores.
  • Score anchoring — provide examples of what each score level looks like. "A score of 5 means... A score of 3 means... A score of 1 means..."
  • Position bias mitigation — LLM judges often favor an option because of where it appears in a comparison, most commonly the first one shown. Randomize or swap the order and average across positions.
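
One way to wire the basic pattern together is sketched below. It assumes a hypothetical `call_judge` function that sends a prompt to whichever judge model you use and returns its text reply; the prompt wording, the `Score: N` output convention, and all function names are illustrative choices, not part of any particular library.

```python
import re
from typing import Callable, Dict, Optional, Tuple

JUDGE_TEMPLATE = """You are grading a response.

Criteria: {criteria}

Score anchors:
{anchors}

Input:
{user_input}

Response:
{output}

First write a brief justification. Then, on the final line, write
"Score: N" where N is an integer from 1 to 5."""


def build_judge_prompt(
    user_input: str, output: str, criteria: str, anchors: Dict[int, str]
) -> str:
    # List anchors from the highest score down so the scale reads top to bottom.
    anchor_text = "\n".join(
        f"{score}: {description}"
        for score, description in sorted(anchors.items(), reverse=True)
    )
    return JUDGE_TEMPLATE.format(
        criteria=criteria, anchors=anchor_text, user_input=user_input, output=output
    )


def parse_score(judge_reply: str) -> Optional[int]:
    # Take the last "Score: N" so numbers quoted in the justification are ignored.
    found = re.findall(r"Score:\s*([1-5])\b", judge_reply)
    return int(found[-1]) if found else None


def judge_output(
    call_judge: Callable[[str], str],  # hypothetical wrapper around your model API
    user_input: str,
    output: str,
    criteria: str,
    anchors: Dict[int, str],
) -> Tuple[Optional[int], str]:
    """Return (score, raw reply); score is None when the reply cannot be parsed."""
    reply = call_judge(build_judge_prompt(user_input, output, criteria, anchors))
    return parse_score(reply), reply
```

Returning `None` for an unparseable reply, rather than guessing a score, lets the caller retry or flag the sample for review.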

Rubric-Based Grading

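A rubric is a structured scoring guide that breaks evaluation into multiple dimensions, each with defined score levels. Among absolute-scoring approaches, this tends to be the most reliable LLM-as-judge pattern, because every score is tied to an explicit description.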

Example rubric for a customer support response:

| Dimension | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|---|---|---|---|
| Accuracy | Contains factual errors | Mostly correct, minor gaps | Fully accurate, well-sourced |
| Completeness | Misses the main question | Answers the main question | Addresses all parts including edge cases |
| Tone | Rude, dismissive, or robotic | Professional but generic | Warm, empathetic, brand-appropriate |
| Actionability | No clear next steps | Suggests a path forward | Specific, actionable steps the user can take |

Score each dimension independently, then aggregate (weighted average, minimum, or custom logic). Dimensional scoring gives you diagnostic power — you know not just that quality dropped, but which dimension dropped.
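
A small sketch of the aggregation step, assuming per-dimension scores have already been collected from the judge. The weights and the pass floor are made-up values for illustration, and the function names are hypothetical.

```python
from typing import Dict, Tuple


def weighted_average(scores: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted mean over the dimensions present in `scores`."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def weakest_dimension(scores: Dict[str, int]) -> Tuple[str, int]:
    """Return the (dimension, score) pair with the lowest score."""
    return min(scores.items(), key=lambda item: item[1])


def aggregate(scores: Dict[str, int], weights: Dict[str, float], floor: int = 3) -> dict:
    # Combine two aggregation styles: a weighted mean for trend tracking and a
    # minimum-score floor so one bad dimension cannot hide behind the others.
    name, lowest = weakest_dimension(scores)
    return {
        "weighted": round(weighted_average(scores, weights), 2),
        "weakest": name,
        "passed": lowest >= floor,
    }


# Illustrative values only.
scores = {"accuracy": 5, "completeness": 3, "tone": 4, "actionability": 2}
weights = {"accuracy": 0.4, "completeness": 0.3, "tone": 0.1, "actionability": 0.2}
summary = aggregate(scores, weights)
```

Reporting the weakest dimension next to the aggregate is what provides the diagnostic signal: here the overall score looks acceptable while actionability falls below the floor.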

Pairwise Comparison

Instead of scoring outputs on an absolute scale, ask the judge to compare two outputs and pick the better one. This is often more reliable than absolute scoring, because deciding which of two responses is better is an easier task than placing a single response on a calibrated scale.

The pattern:

  1. Present two outputs (A and B) for the same input.
  2. Ask the judge: "Which response is better according to [criteria]? Explain your reasoning, then answer A, B, or Tie."
  3. Swap the order (B, A) and judge again to control for position bias.
  4. If both orderings pick the same response, you have a confident result. If they disagree, the judge is probably reacting to position rather than quality, so record a tie.
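
The order-swap procedure can be sketched as follows. The `judge` callable is a hypothetical wrapper that is shown two responses and answers by position ("first", "second", or "tie"); mapping positions back to A and B happens outside the judge so that position bias shows up as a disagreement.

```python
from typing import Callable

# Hypothetical signature: (user_input, first_response, second_response) -> verdict,
# where the verdict is "first", "second", or "tie".
PositionalJudge = Callable[[str, str, str], str]


def pairwise_verdict(
    judge: PositionalJudge, user_input: str, response_a: str, response_b: str
) -> str:
    """Judge both orderings and return "A", "B", or "Tie"."""
    forward = judge(user_input, response_a, response_b)  # A shown first
    reverse = judge(user_input, response_b, response_a)  # B shown first

    # Translate positional verdicts back to the underlying responses.
    forward_pick = {"first": "A", "second": "B"}.get(forward, "Tie")
    reverse_pick = {"first": "B", "second": "A"}.get(reverse, "Tie")

    # Only a verdict that survives the swap counts as a win.
    return forward_pick if forward_pick == reverse_pick else "Tie"
```

A judge that always prefers whatever it sees first yields "Tie" for every pair under this scheme, which is the intended behavior: pure position bias never registers as a win.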

Use pairwise comparison for:

  • A/B testing prompts — which prompt variant produces better outputs?
  • Model comparison — does Model X outperform Model Y on your specific task?
  • Regression detection — is the new version's output better, worse, or equivalent to the old version's?

Calibrating Automated Evaluators

Your automated evaluator is only useful if it agrees with human judgment. Calibration is the process of verifying and improving that agreement.

Steps:

  1. Collect human judgments — have humans score 100-200 outputs using the same criteria as your automated evaluator.
  2. Compute agreement — measure correlation (Spearman/Kendall for ordinal scores) or Cohen's kappa (for categorical labels) between human and automated scores.
  3. Analyze disagreements — where does the evaluator diverge from humans? Common failure modes: over-penalizing style, ignoring factual errors, being too lenient on partial answers.
  4. Iterate on criteria and prompts — adjust the judge prompt to fix systematic biases.
  5. Re-calibrate periodically — as your product evolves, the evaluator can drift. Re-run calibration quarterly.

As a rule of thumb, aim for a Spearman correlation of 0.7 or higher with human scores. Below 0.5, the evaluator is not trustworthy enough to use as a gate.
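
For the agreement step, the two statistics can be computed without extra dependencies. The sketch below is a plain-Python illustration (the function names are hypothetical); it computes Spearman correlation as the Pearson correlation of tie-averaged ranks, and Cohen's kappa from observed versus chance agreement.

```python
from collections import Counter
from math import sqrt
from typing import Hashable, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread = sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    if spread == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / spread


def spearman(human: Sequence[float], automated: Sequence[float]) -> float:
    return pearson(average_ranks(human), average_ranks(automated))


def cohens_kappa(human: Sequence[Hashable], automated: Sequence[Hashable]) -> float:
    n = len(human)
    observed = sum(h == a for h, a in zip(human, automated)) / n
    human_counts, auto_counts = Counter(human), Counter(automated)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[label] * auto_counts[label] for label in human_counts) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

In practice, `scipy.stats.spearmanr` and `sklearn.metrics.cohen_kappa_score` compute the same statistics and are the more usual choice when those libraries are already installed.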

Tool Landscape

The ecosystem for automated LLM evaluation is maturing fast. Key tools:

  • Promptfoo — open-source eval framework. Config-driven, supports assertions, LLM-as-judge, and custom evaluators. Strong CI integration. Good for teams that want to own their eval pipeline.
  • Braintrust — managed evaluation platform. Logging, scoring, comparison, and analytics in a dashboard. Good for teams that want visibility without building infrastructure.
  • DeepEval — Python-native eval framework with built-in metrics (faithfulness, answer relevancy, hallucination detection). Good for teams already in Python.
  • Ragas — focused on RAG evaluation specifically. Measures context relevance, faithfulness, answer correctness for retrieval-augmented systems.
  • Custom scripts — for many teams, a simple Python script that runs assertions + one LLM-as-judge call is all they need to start. Don't over-engineer early.

Putting It Together

A layered evaluation stack:

  1. Layer 1: Assertions — deterministic checks on format, length, required content, schema compliance. Run on every eval. Cost: near zero.
  2. Layer 2: Embedding similarity — compare output to reference using cosine similarity. Fast, cheap, catches large semantic drifts. Run on every eval.
  3. Layer 3: LLM-as-judge (single) — score outputs on specific rubric dimensions. Run on prompt-changing PRs and nightly. Cost: moderate.
  4. Layer 4: LLM-as-judge (pairwise) — compare new outputs to baseline. Run on nightly and pre-release. Cost: higher (2x judge calls).
  5. Layer 5: Human review — sample flagged outputs for human evaluation. Run weekly or on significant changes.

Each layer catches failures the previous layer misses. Together, they give you confidence without bankrupting your eval budget.
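
The first three layers could be chained roughly as below. This is a sketch under several assumptions: `embed` and `judge` are hypothetical callables you supply (an embedding function and a judge that returns a 1-5 score), the thresholds are placeholder values, and stopping at the first failing layer is a cost-saving design choice rather than something the stack requires.

```python
from math import sqrt
from typing import Callable, List, Optional, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def evaluate(
    output: str,
    reference: str,
    assertions: List[Callable[[str], bool]],
    embed: Callable[[str], Sequence[float]],      # hypothetical embedding function
    judge: Optional[Callable[[str], int]] = None,  # hypothetical judge, returns 1-5
    min_similarity: float = 0.8,                   # placeholder threshold
    min_score: int = 4,                            # placeholder threshold
) -> dict:
    # Layer 1: deterministic assertions, near-zero cost.
    if not all(check(output) for check in assertions):
        return {"passed": False, "failed_layer": "assertions"}

    # Layer 2: embedding similarity against the reference output.
    similarity = cosine_similarity(embed(output), embed(reference))
    if similarity < min_similarity:
        return {"passed": False, "failed_layer": "similarity"}

    # Layer 3: the judge runs only when one is supplied (for example on nightly
    # runs) and only for outputs that survived the cheaper layers.
    if judge is not None and judge(output) < min_score:
        return {"passed": False, "failed_layer": "judge"}

    return {"passed": True, "failed_layer": None}
```

Passing `judge=None` gives the cheap per-eval configuration, while nightly or pre-release runs can supply a judge without changing the rest of the pipeline.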
