Steven's Knowledge

CI/CD for LLM Evaluation

Running LLM evals in your pipeline without blowing your budget or blocking every merge

Most teams bolt LLM evaluation onto CI as an afterthought — a single "vibe check" job that runs all prompts against a live API and either passes or blocks the merge for twenty minutes. That approach is expensive, slow, and flaky. A good LLM CI pipeline is designed around cost, latency, and signal quality from day one.

Why LLM Testing in CI Is Different

Traditional unit tests are fast, deterministic, and free. LLM evals are none of those things:

  • Non-deterministic — the same prompt can produce different outputs across runs, even at temperature 0, because floating-point rounding in batched GPU inference can flip the choice between near-tied tokens.
  • Expensive — running 500 eval cases through GPT-4-class models costs real money on every push.
  • Slow — a full eval suite can take minutes to hours depending on throughput.

The goal is not to run every eval on every commit. It's to run the right evals at the right time.

The Three-Tier Model

Structure your LLM CI into tiers based on cost and signal:

  1. Tier 1: On every PR — fast, cheap, high-signal checks.

    • Prompt template linting (schema validation, variable injection checks).
    • Deterministic assertion tests against cached/mocked outputs.
    • Diff detection: did any prompt template actually change? If not, skip LLM evals entirely.
  2. Tier 2: On prompt-touching PRs — moderate cost, real model calls.

    • Run a sampled subset (20-50 cases) from your golden set against a live model.
    • Check for regressions on the specific prompt that changed.
    • Use a smaller/cheaper model for fast signal, flag for Tier 3 if suspicious.
  3. Tier 3: Nightly / pre-release — full eval suite.

    • Run the complete golden set.
    • Run adversarial tests.
    • Run cross-model comparison if you support multiple providers.
    • Generate a report, not a pass/fail gate.
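
As an illustration of the Tier 1 "variable injection checks" above, here is a minimal sketch in Python. It assumes prompt templates use `str.format`-style `{name}` placeholders; the function names `template_variables` and `lint_template` are hypothetical, not part of any particular tool.

```python
from string import Formatter


def template_variables(template: str) -> set[str]:
    """Return the named placeholders used in a str.format-style template."""
    return {
        field_name
        for _, field_name, _, _ in Formatter().parse(template)
        if field_name  # skips literal-only segments (None) and bare "{}" ("")
    }


def lint_template(template: str, declared: set[str]) -> list[str]:
    """Compare the placeholders in a template with the variables the caller declares."""
    used = template_variables(template)
    problems = []
    for name in sorted(used - declared):
        problems.append(f"undeclared variable: {name}")
    for name in sorted(declared - used):
        problems.append(f"unused variable: {name}")
    return problems


if __name__ == "__main__":
    template = "Summarize the following {document} in {max_words} words."
    # The caller declares "tone" but the template never uses it,
    # and the template needs "max_words" which was never declared.
    print(lint_template(template, {"document", "tone"}))
    # prints ['undeclared variable: max_words', 'unused variable: tone']
```

A check like this needs no model call, so it fits the $0 budget of Tier 1 and can run on every PR.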

Prompt Regression Detection in PRs

The most valuable CI check for LLM apps: detect when a prompt change causes output drift.

The pattern:

  1. On PR open, identify which prompt templates were modified.
  2. Run those prompts against a fixed set of inputs.
  3. Compare outputs to a stored baseline (snapshot).
  4. Flag diffs for human review — don't auto-fail, because some drift is intentional.

This is analogous to visual snapshot testing. The engineer reviews the diff and either approves the new baseline or fixes the regression.
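
The comparison step can be sketched as follows. This is only one possible approach: it assumes baselines are stored as a JSON file mapping a case ID to the last approved output, and it uses character-level similarity from Python's standard `difflib` as a stand-in for whatever comparison you prefer (an embedding-based or rubric-based score would be a reasonable swap). The names `load_baseline` and `drift_report` and the 0.9 threshold are illustrative.

```python
import difflib
import json
from pathlib import Path


def load_baseline(path: Path) -> dict[str, str]:
    """Load {case_id: output} from a JSON snapshot; empty if none exists yet."""
    return json.loads(path.read_text()) if path.exists() else {}


def drift_report(
    baseline: dict[str, str],
    current: dict[str, str],
    threshold: float = 0.9,  # assumed cutoff; tune for your outputs
) -> list[dict]:
    """List cases whose output similarity to the baseline falls below threshold."""
    flagged = []
    for case_id, new_output in sorted(current.items()):
        old_output = baseline.get(case_id)
        if old_output is None:
            # New case with nothing to compare against: always surface it.
            flagged.append({"case": case_id, "similarity": None, "diff": "(no baseline)"})
            continue
        similarity = difflib.SequenceMatcher(None, old_output, new_output).ratio()
        if similarity < threshold:
            diff = "\n".join(
                difflib.unified_diff(
                    old_output.splitlines(),
                    new_output.splitlines(),
                    fromfile="baseline",
                    tofile="current",
                    lineterm="",
                )
            )
            flagged.append({"case": case_id, "similarity": round(similarity, 2), "diff": diff})
    return flagged
```

The function returns a list for a reviewer rather than raising, which matches the "flag, don't auto-fail" step: the CI job can post the diffs as a comment and leave the decision to the engineer.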

Sampling Strategies for Cost Control

You can't run 10,000 eval cases on every PR. Sampling strategies that work:

  • Stratified sampling — partition your golden set by category (topic, difficulty, edge case type) and sample proportionally, so every category is represented. A sample of 30-50 cases is often enough to catch most regressions.
  • Change-targeted sampling — if a prompt change affects the "summarization" path, only run summarization evals.
  • Progressive widening — run 20 cases immediately, expand to 200 if anything looks off.
  • Hash-based deterministic sampling — use the PR number as a seed so the sample is reproducible.
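
To make the stratified and seeded strategies concrete, here is a rough sketch in Python. It assumes each eval case is a dict with a `category` field and that the golden set is loaded in a stable order; `stratified_sample` is a hypothetical helper, and the PR number is passed in as the seed.

```python
import random
from collections import defaultdict


def stratified_sample(cases: list[dict], total: int, seed: int, key: str = "category") -> list[dict]:
    """Pick roughly `total` cases, proportionally per category, reproducibly for a given seed."""
    rng = random.Random(seed)  # local RNG: the same seed yields the same sample
    by_category = defaultdict(list)
    for case in cases:
        by_category[case[key]].append(case)

    sample = []
    for category in sorted(by_category):  # fixed iteration order across runs
        group = by_category[category]
        # Proportional share, but at least one case so rare categories are never dropped.
        quota = max(1, round(total * len(group) / len(cases)))
        sample.extend(rng.sample(group, min(quota, len(group))))
    return sample


if __name__ == "__main__":
    golden_set = (
        [{"id": f"sum-{i}", "category": "summarization"} for i in range(60)]
        + [{"id": f"qa-{i}", "category": "qa"} for i in range(30)]
        + [{"id": f"edge-{i}", "category": "edge"} for i in range(10)]
    )
    pr_number = 1234  # hypothetical; in CI this would come from the environment
    picked = stratified_sample(golden_set, total=20, seed=pr_number)
    print(len(picked))  # prints 20
```

Because of rounding and the one-case minimum, the sample size can differ slightly from `total` when categories are very uneven. Re-running the same PR draws the same cases, which keeps score changes attributable to the prompt rather than to the sample.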

GitHub Actions Example

A minimal workflow for Tier 2 evaluation:

name: LLM Eval
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/prompts/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
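      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history; the default shallow clone has no merge base with origin/main
      - name: Detect changed prompts
        id: changes
        run: |
          # Join file names with spaces: a multi-line value breaks the key=value format of $GITHUB_OUTPUT
          changed=$(git diff --name-only origin/main...HEAD -- prompts/ src/prompts/ | tr '\n' ' ')
          echo "changed_prompts=$changed" >> "$GITHUB_OUTPUT"
      - name: Run targeted eval
        if: steps.changes.outputs.changed_prompts != ''
        run: npx promptfoo eval --config eval/ci.yaml --output eval-results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Post results to PR
        # Run even if the eval step failed, but only when an eval was attempted
        if: always() && steps.changes.outputs.changed_prompts != ''
        # Confirm the report subcommand and its flags against your installed promptfoo version
        run: npx promptfoo report --format github-comment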

GitLab CI Example

The same pattern for .gitlab-ci.yml:

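llm-eval:
  stage: test
  image: node:20  # promptfoo is distributed as an npm package
  rules:
    - changes:
        - prompts/**/*  # recursive match; a trailing ** alone does not descend into subdirectories
  script:
    - npm install -g promptfoo
    - promptfoo eval --config eval/ci.yaml --output eval-results.json
    # Confirm the report subcommand and its flags against your installed promptfoo version
    - promptfoo report --format markdown > eval-report.md
  artifacts:
    paths:
      - eval-report.md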

Eval Suites as Merge Gates

Match the strictness of each gate to how deterministic the check is, and default to soft gates for anything scored by a model:

  • Hard gate — block merge if deterministic assertions fail (wrong JSON schema, missing required fields, safety filter violations).
  • Soft gate — post a warning comment if quality scores drop, but let the engineer decide.
  • Report-only — for nightly runs, generate a dashboard; don't block anything.

Hard-gating on LLM quality scores leads to flaky pipelines and frustrated engineers. Gate on structure; advise on quality.
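
One way to encode "gate on structure; advise on quality" is a small decision function that the CI job calls after the eval finishes. The sketch below is illustrative: the `EvalSummary` fields, the `gate` function, and the 0.05 tolerance are assumptions, not the output format of any specific eval tool.

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    structural_failures: int  # e.g. schema, required-field, or safety assertion failures
    quality_score: float      # mean score for this run, 0.0 to 1.0
    baseline_score: float     # mean score on the target branch


def gate(summary: EvalSummary, max_drop: float = 0.05) -> tuple[int, str]:
    """Return (exit_code, message); the exit code is non-zero only for structural failures."""
    if summary.structural_failures > 0:
        # Hard gate: deterministic checks are reliable enough to block the merge.
        return 1, f"BLOCK: {summary.structural_failures} deterministic assertion(s) failed"
    drop = summary.baseline_score - summary.quality_score
    if drop > max_drop:
        # Soft gate: warn in a PR comment, but exit 0 so the engineer decides.
        return 0, f"WARN: quality score dropped by {drop:.2f}; review before merging"
    return 0, "OK: no structural failures, quality within tolerance"


if __name__ == "__main__":
    code, message = gate(EvalSummary(structural_failures=0, quality_score=0.80, baseline_score=0.90))
    print(code, message)
```

In a pipeline, the message would go into the PR comment and the exit code would be passed to `sys.exit`, so only the hard-gate branch turns the job red.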

Cost Budgets

Set a cost budget for each tier. A reasonable starting point:

  • Tier 1: $0 (no model calls).
  • Tier 2: $0.50-2.00 per PR.
  • Tier 3: $20-50 per nightly run.

Track eval spend alongside your production inference costs. If CI evals cost more than production, something is wrong.
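
A back-of-the-envelope estimate helps check a planned sample size against these budgets before the job ever runs. The sketch below assumes token-based pricing quoted per million tokens; the token counts and prices in the example are placeholders, not any provider's actual rates, and `estimate_run_cost` and `within_budget` are hypothetical helpers.

```python
def estimate_run_cost(
    num_cases: int,
    input_tokens_per_case: int,
    output_tokens_per_case: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimated USD cost of one eval run, given average token counts per case."""
    per_case = (
        input_tokens_per_case * input_price_per_million
        + output_tokens_per_case * output_price_per_million
    ) / 1_000_000
    return num_cases * per_case


def within_budget(cost: float, budget: float) -> bool:
    """True if the estimated cost does not exceed the tier's budget."""
    return cost <= budget


if __name__ == "__main__":
    # Placeholder numbers: 50 cases, 1,500 input and 500 output tokens each,
    # at $5 and $15 per million tokens respectively.
    cost = estimate_run_cost(50, 1500, 500, 5.0, 15.0)
    print(f"${cost:.2f}")             # prints $0.75
    print(within_budget(cost, 2.00))  # prints True
```

With these placeholder numbers, a 50-case Tier 2 run lands at $0.75, inside the $0.50-2.00 range above. Running the same arithmetic with your real prices and token counts shows how many cases a tier can afford.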
