CI/CD for LLM Evaluation
Running LLM evals in your pipeline without blowing your budget or blocking every merge
Most teams bolt LLM evaluation onto CI as an afterthought — a single "vibe check" job that runs all prompts against a live API and either passes or blocks the merge for twenty minutes. That approach is expensive, slow, and flaky. A good LLM CI pipeline is designed around cost, latency, and signal quality from day one.
Why LLM Testing in CI Is Different
Traditional unit tests are fast, deterministic, and free. LLM evals are none of those things:
- Non-deterministic: the same prompt can produce different outputs across runs, even at temperature 0, because batching and floating-point rounding on the serving side can flip the choice between near-tied tokens.
- Expensive: running 500 eval cases through GPT-4-class models costs real money on every push.
- Slow: a full eval suite can take minutes to hours, depending on throughput.
The goal is not to run every eval on every commit. It's to run the right evals at the right time.
The Three-Tier Model
Structure your LLM CI into tiers based on cost and signal:
- Tier 1: On every PR. Fast, cheap, high-signal checks.
  - Prompt template linting (schema validation, variable injection checks).
  - Deterministic assertion tests against cached/mocked outputs.
  - Diff detection: did any prompt template actually change? If not, skip LLM evals entirely.
- Tier 2: On prompt-touching PRs. Moderate cost, real model calls.
  - Run a sampled subset (20-50 cases) from your golden set against a live model.
  - Check for regressions on the specific prompt that changed.
  - Use a smaller/cheaper model for fast signal, flag for Tier 3 if suspicious.
- Tier 3: Nightly / pre-release. Full eval suite.
  - Run the complete golden set.
  - Run adversarial tests.
  - Run cross-model comparison if you support multiple providers.
  - Generate a report, not a pass/fail gate.
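The routing between tiers, plus one Tier 1 lint, could be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: the path patterns, the trigger names, and the assumption that templates use Python `str.format`-style placeholders are all hypothetical.

```python
from fnmatch import fnmatch
from string import Formatter

# Hypothetical locations of prompt templates; adjust to your repository.
PROMPT_GLOBS = ("prompts/*", "src/prompts/*")


def select_tiers(changed_files, trigger):
    """Return the set of tiers to run for this pipeline invocation.

    trigger is "pull_request" or "schedule" (assumed names).
    """
    if trigger == "schedule":
        return {3}  # nightly / pre-release: full suite, report only
    tiers = {1}  # Tier 1 runs on every PR
    # fnmatch's "*" also matches "/", so "prompts/*" covers nested files.
    if any(fnmatch(f, g) for f in changed_files for g in PROMPT_GLOBS):
        tiers.add(2)  # real model calls only when a prompt changed
    return tiers


def undeclared_variables(template, allowed):
    """Tier 1 lint: placeholder names used in the template but not allowed."""
    used = {name for _, name, _, _ in Formatter().parse(template) if name}
    return used - set(allowed)
```

A CI entry point would call `select_tiers` with the list of files in the diff and skip every model call when the result is `{1}`.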
Prompt Regression Detection in PRs
The most valuable CI check for LLM apps: detect when a prompt change causes output drift.
The pattern:
- On PR open, identify which prompt templates were modified.
- Run those prompts against a fixed set of inputs.
- Compare outputs to a stored baseline (snapshot).
- Flag diffs for human review — don't auto-fail, because some drift is intentional.
This is analogous to visual snapshot testing. The engineer reviews the diff and either approves the new baseline or fixes the regression.
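One way to sketch the snapshot comparison, assuming the baseline is stored as a JSON file mapping a case id to the expected output text (the file layout and function names here are illustrative, not taken from any particular tool):

```python
import difflib
import json
from pathlib import Path


def diff_against_baseline(outputs, baseline_path):
    """Compare fresh outputs to a stored snapshot.

    outputs: dict mapping a case id to the model's output text.
    Returns a dict of case id -> unified diff for every case that drifted.
    """
    path = Path(baseline_path)
    baseline = json.loads(path.read_text()) if path.exists() else {}
    drift = {}
    for case_id, new_text in outputs.items():
        old_text = baseline.get(case_id, "")
        if new_text != old_text:
            diff = difflib.unified_diff(
                old_text.splitlines(),
                new_text.splitlines(),
                fromfile=f"baseline/{case_id}",
                tofile=f"candidate/{case_id}",
                lineterm="",
            )
            drift[case_id] = "\n".join(diff)
    return drift


def approve_baseline(outputs, baseline_path):
    """Overwrite the snapshot once a reviewer accepts the new outputs."""
    Path(baseline_path).write_text(json.dumps(outputs, indent=2, sort_keys=True))
```

The diffs returned by `diff_against_baseline` are meant to be posted for review rather than turned into an exit code. Exact string comparison is the simplest choice; with non-deterministic outputs a team would likely swap in a similarity threshold instead.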
Sampling Strategies for Cost Control
You can't run 10,000 eval cases on every PR. Sampling strategies that work:
- Stratified sampling: partition your golden set by category (topic, difficulty, edge case type) and sample proportionally. A sample of 30-50 cases is usually enough to surface large regressions; subtle quality shifts need the full set.
- Change-targeted sampling: if a prompt change affects the "summarization" path, only run summarization evals.
- Progressive widening: run 20 cases immediately, and expand to 200 if any case fails or scores drop.
- Hash-based deterministic sampling: hash the PR number into a seed so that reruns of the same PR draw the same sample.
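A rough sketch of how stratified, seeded sampling and progressive widening might fit together. The case format (dicts with `id` and `category` keys) and the `passes` callback are assumptions made for illustration.

```python
import hashlib
import random
from collections import defaultdict


def stratified_sample(cases, total, seed):
    """Pick about `total` cases, proportionally per category, reproducibly.

    cases: list of dicts with hypothetical keys "id" and "category".
    seed: any stable value, such as the PR number.
    """
    # Hash the seed so that reruns for the same PR draw the same cases.
    digest = hashlib.sha256(str(seed).encode()).hexdigest()
    rng = random.Random(int(digest, 16))

    by_category = defaultdict(list)
    for case in cases:
        by_category[case["category"]].append(case)

    sample = []
    for category in sorted(by_category):
        group = sorted(by_category[category], key=lambda c: c["id"])
        # Proportional share, but at least one case per category.
        k = max(1, round(total * len(group) / len(cases)))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample


def run_progressively(cases, passes, seed, first=20, wide=200):
    """Run `first` cases; widen to `wide` only if any of them fails.

    passes: hypothetical callable taking a case and returning True or False.
    Returns a dict of case id -> pass/fail for every case that was run.
    """
    results = {c["id"]: passes(c) for c in stratified_sample(cases, first, seed)}
    if all(results.values()):
        return results
    for case in stratified_sample(cases, wide, seed):
        if case["id"] not in results:
            results[case["id"]] = passes(case)
    return results
```

Because of the one-case-per-category floor and rounding, the sample size can differ slightly from `total` when there are many small categories.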
GitHub Actions Example
A minimal workflow for Tier 2 evaluation:
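name: LLM Eval
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/prompts/**'
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history, so origin/main exists for the diff below
      - name: Detect changed prompts
        id: changes
        run: |
          # Join file names onto one line; a multi-line value would break $GITHUB_OUTPUT.
          changed=$(git diff --name-only origin/main...HEAD -- prompts/ src/prompts/ | tr '\n' ' ')
          echo "changed_prompts=$changed" >> "$GITHUB_OUTPUT"
      - name: Run targeted eval
        if: steps.changes.outputs.changed_prompts != ''
        run: npx promptfoo eval --config eval/ci.yaml --output eval-results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Post results to PR
        # Only report when the eval step had something to run.
        if: always() && steps.changes.outputs.changed_prompts != ''
        # Check this subcommand against your promptfoo version's CLI reference.
        run: npx promptfoo report --format github-comment

GitLab CI Example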
The same pattern for .gitlab-ci.yml:
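llm-eval:
  stage: test
  image: node:20  # promptfoo is distributed as an npm package
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      changes:
        - prompts/**/*
  script:
    - npm install -g promptfoo
    - promptfoo eval --config eval/ci.yaml --output eval-results.json
    # Check this subcommand against your promptfoo version's CLI reference.
    - promptfoo report --format markdown > eval-report.md
  artifacts:
    paths:
      - eval-report.md

Eval Suites as Merge Gates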
Treat quality scores as soft gates, and reserve hard blockers for deterministic checks:
- Hard gate — block merge if deterministic assertions fail (wrong JSON schema, missing required fields, safety filter violations).
- Soft gate — post a warning comment if quality scores drop, but let the engineer decide.
- Report-only — for nightly runs, generate a dashboard; don't block anything.
Hard-gating on LLM quality scores leads to flaky pipelines and frustrated engineers. Gate on structure; advise on quality.
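The split between gating on structure and advising on quality might look something like this. The required fields and the score tolerance are placeholder assumptions; a real suite would substitute its own schema and thresholds.

```python
import json

REQUIRED_FIELDS = ("summary", "sources")  # hypothetical output schema
MAX_SCORE_DROP = 0.05  # assumed tolerance before a warning is posted


def structural_errors(raw_output):
    """Deterministic checks: a valid JSON object with the required fields."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    return [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in data]


def gate(raw_outputs, score, baseline_score):
    """Return (exit_code, messages). Only structural failures block the merge."""
    errors = [e for out in raw_outputs for e in structural_errors(out)]
    if errors:
        return 1, errors  # hard gate: non-zero exit fails the job
    drop = baseline_score - score
    if drop > MAX_SCORE_DROP:
        # Soft gate: exit 0, but surface the drop as a PR comment.
        return 0, [f"warning: quality score dropped by {drop:.2f}"]
    return 0, []
```

A CI script would pass the exit code to `sys.exit` and post the messages as a PR comment, so a quality drop is visible without failing the job.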
Cost Budgets
Set a per-PR cost budget. A reasonable starting point:
- Tier 1: $0 (no model calls).
- Tier 2: $0.50-2.00 per PR.
- Tier 3: $20-50 per nightly run.
Track eval spend alongside your production inference costs. If CI evals cost more than production, something is wrong.
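A back-of-the-envelope budget check can run before any model call is made. The sketch below assumes per-token pricing quoted in dollars per million tokens; the `Price` class and the token averages are illustrative, and actual rates have to come from your provider.

```python
from dataclasses import dataclass


@dataclass
class Price:
    """Dollars per million tokens. Fill in your provider's current rates."""
    input_per_mtok: float
    output_per_mtok: float


def estimate_cost(n_cases, avg_input_tokens, avg_output_tokens, price):
    """Estimated dollars for one eval run."""
    input_cost = n_cases * avg_input_tokens * price.input_per_mtok / 1_000_000
    output_cost = n_cases * avg_output_tokens * price.output_per_mtok / 1_000_000
    return input_cost + output_cost


def max_cases_within_budget(budget, avg_input_tokens, avg_output_tokens, price):
    """Largest sample size whose estimated cost stays within the budget."""
    per_case = estimate_cost(1, avg_input_tokens, avg_output_tokens, price)
    return int(budget // per_case) if per_case > 0 else 0
```

Feeding the result of `max_cases_within_budget` into the Tier 2 sample size keeps a PR inside its budget even when prompts grow longer.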