2026-06-16

LLM regression protocol: golden sets, deterministic matchers, and update policy

How to structure a golden set to catch regressions on prompt and model changes without an LLM-as-judge: file format, matcher tiers, CI policy, and update rules.

Who this is for

Teams that already run an LLM system in production (RAG, agent, classifier) and have hit: “I changed the prompt or upgraded the model and I don’t know what I broke.” This is not a beginner’s introduction to LLMs—it assumes you know what a prompt, an inference endpoint, and a deploy pipeline are.

The problem this solves

Prompt changes

You tweak wording, add context, or tighten output format. The model family may be the same, but behavior shifts often and in small steps. You need a deterministic signal that, on known cases, the system still satisfies the assertions you treat as correct (intent, JSON fields, forbidden phrases).

Model changes

You move between versions or snapshots. The new model can look “better” on generic benchmarks and still regress on your inputs. These events are rare but high blast radius: one deploy can change thousands of answers.

Why treat them separately

Prompt changes: high frequency, localized impact; you want fast CI feedback (pre-merge).
Model changes: low frequency, global impact; you want pre-deploy gates and often a wider review window.

The objection to “let another LLM score the output” is operational, not ideological: the score is not bit-repeatable, its run-to-run variance makes it a noisy CI gate, and when it fails it rarely tells you which assertion broke. This article replaces that pattern with explicit golden cases, matchers, and an update policy.

What a golden set is (and is not)

It is: a collection of (input, expected assertions) that pin acceptable behavior on representative examples. Each case is a small executable spec.

It is not: a fine-tuning dataset, an exhaustive long-tail test suite, or a vault of “perfect answers.” It does not replace production monitoring.

How to pick cases:

prioritize edge cases already seen failing in prod or staging;
cover the main input variants (not every combinatorial permutation);
if the system has multiple output shapes (e.g. tool JSON vs free text), include at least one case per shape.

Size: for a medium RAG or classifier, 20–50 cases is a healthy band. Below ~20 the signal is thin; above ~200 maintenance hurts without dedicated tooling (split by domain, clear ownership).

File format

YAML works well: readable in review, Git-diff-friendly, lintable.

Suggested repo layout

eval/
  golden/
    cases/
      refund-intent-001.yaml
      shipping-delay-002.yaml
    README.md

Single case (matchers only up to level 2)

id: refund-intent-001
description: "User asks for refund with explicit order ID"
created_at: "2026-05-01"
updated_at: "2026-05-01"
update_reason: "Added from ticket #4412"
deprecated: false
input:
  user_message: "I want a refund for order #1234"
  context: "Verified customer; last order shipped 10 days ago."
matchers:
  - type: json_field
    field: intent
    op: eq
    value: refund
  - type: json_field
    field: confidence
    op: gte
    value: 0.85
  - type: contains
    value: "1234"
  - type: not_contains
    value: "I cannot help you"
level: 2
tags:
  - refund
  - entity-extraction
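
A minimal sketch of how a runner might evaluate the tier 2 matchers in the case above. The function name run_matcher, the OPS table, and the sample output are illustrative assumptions, not part of any existing runner; the case is shown as an already-parsed dict so the sketch needs only the standard library.

```python
import json
import operator

# Hypothetical operator table; the keys mirror the `op` values used in the YAML.
OPS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gte": operator.ge,
    "lte": operator.le,
    "in": lambda actual, allowed: actual in allowed,
}

def run_matcher(matcher: dict, raw_output: str) -> bool:
    """Return True when one tier 2 matcher passes on the raw model output."""
    kind = matcher["type"]
    if kind == "contains":
        return matcher["value"] in raw_output
    if kind == "not_contains":
        return matcher["value"] not in raw_output
    if kind == "json_field":
        try:
            payload = json.loads(raw_output)
        except json.JSONDecodeError:
            return False  # unparseable output fails every json_field check
        if not isinstance(payload, dict) or matcher["field"] not in payload:
            return False
        return OPS[matcher["op"]](payload[matcher["field"]], matcher["value"])
    raise ValueError(f"unsupported matcher type: {kind}")

# The matchers of refund-intent-001, written as the dicts a YAML loader would produce.
matchers = [
    {"type": "json_field", "field": "intent", "op": "eq", "value": "refund"},
    {"type": "json_field", "field": "confidence", "op": "gte", "value": 0.85},
    {"type": "contains", "value": "1234"},
    {"type": "not_contains", "value": "I cannot help you"},
]
# Invented sample output for illustration only.
output = '{"intent": "refund", "confidence": 0.91, "order_id": "1234"}'
results = [run_matcher(m, output) for m in matchers]
```

Keeping one boolean per matcher, rather than a single pass/fail for the case, is what lets a report name the assertion that broke.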

Example with tier 3 (`json_subset` and extra keys across model versions)

The API (or tool) returns JSON with extra keys that differ by model version or snapshot (model_reasoning, debug_trace, experimental fields). You only care that a subset of key/value pairs holds: intent, operational status, etc. Tier 2 alone forces you either to assert every irrelevant new key away or to churn the golden constantly. Use json_subset: the parsed output must contain at least the expected object; everything else is ignored.

id: tool-output-contract-015
description: "Tracking tool: model adds extra keys; client contract is intent + status + confidence only"
created_at: "2026-05-12"
updated_at: "2026-05-12"
update_reason: "After model upgrade undocumented optional keys appear in payload"
deprecated: false
input:
  user_message: "Where is my package?"
matchers:
  - type: json_subset
    expected:
      intent: tracking
      status: ok
  - type: json_field
    field: confidence
    op: gte
    value: 0.72
  - type: normalized_text_equal
    field: summary
    expected: "The parcel is in transit to the operations hub."
level: 3
tags:
  - tool-json
  - tracking

json_subset pins the minimal structure; json_field on confidence keeps a local threshold check. normalized_text_equal (still tier 3 here) applies runner-defined, fixed normalization (e.g. NFC, lowercase, collapsed whitespace) before comparing the summary field. That is useful when the model varies casing or spacing but the wording must stay the same; punctuation differences pass only if the runner’s normalization rules also cover them. Case level is 3 because level records the highest tier among the case’s matchers, and here that is json_subset and normalized_text_equal (tier 3), neither of which compares the whole JSON byte for byte.
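
One possible reading of the two tier 3 matchers, as a sketch. The helper names (is_subset, normalize) and the sample payload are assumptions made for illustration. Lists are compared by plain equality here, which is one choice among several, and the normalization covers only NFC, lowercase, and whitespace.

```python
import json
import re
import unicodedata

def is_subset(expected, actual) -> bool:
    """True when every key/value under `expected` appears in `actual`, recursively.

    Extra keys in `actual` are ignored. Non-dict values (including lists)
    are compared with plain equality; that is an assumption of this sketch.
    """
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and is_subset(value, actual[key])
            for key, value in expected.items()
        )
    return expected == actual

def normalize(text: str) -> str:
    """Apply NFC, lowercase, then collapse whitespace runs to single spaces."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"\s+", " ", text).strip()

def normalized_text_equal(expected: str, actual: str) -> bool:
    return normalize(expected) == normalize(actual)

# Invented payload: extra key `debug_trace`, stray spacing in `summary`.
output = json.loads(
    '{"intent": "tracking", "status": "ok", "confidence": 0.8,'
    ' "debug_trace": "x",'
    ' "summary": "The parcel  is in transit to the operations hub. "}'
)
subset_ok = is_subset({"intent": "tracking", "status": "ok"}, output)
summary_ok = normalized_text_equal(
    "The parcel is in transit to the operations hub.", output["summary"]
)
```

Both checks pass on this payload even though it carries a key the golden case never mentions.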

Field reference

Field	Role
`id`	Stable key for the CI runner and reports. Do not rename casually—prefer `deprecated` plus a new file.
`description`	Human context for review: why this case exists.
`created_at` / `updated_at`	Audit trail; any contract change updates `updated_at` and `update_reason`.
`update_reason`	Required when assertions or inputs change: explains the decision.
`deprecated`	`true` when the case must no longer gate CI but stays in repo for history.
`input`	Payload the runner passes into the function or endpoint under test.
`matchers`	Ordered list of deterministic checks (see tier table).
`level`	Maximum matcher tier used in the case (`1` … `5`). Used to aggregate reports (“how many cases touch tier 4?”). Must equal the highest tier among the case’s matcher `type` values.
`tags`	Filters in dashboards or product ownership.
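
The rule that level must agree with the matcher types can be linted before any model call. The sketch below assumes a type-to-tier mapping taken from the tier descriptions in this article, and treats a case with no matchers as tier 5; both the mapping name and that convention are assumptions.

```python
# Assumed mapping from matcher type to tier, following this article's tier list.
TIER_BY_TYPE = {
    "exact_text": 1,
    "json_field": 2,
    "contains": 2,
    "not_contains": 2,
    "regex": 2,
    "json_subset": 3,
    "json_schema": 3,
    "normalized_text_equal": 3,
    "embedding_similarity": 4,
}

def expected_level(case: dict) -> int:
    """Highest tier among the case's matchers; 5 when there are none (assumption)."""
    tiers = [TIER_BY_TYPE[m["type"]] for m in case.get("matchers", [])]
    return max(tiers) if tiers else 5

def level_errors(case: dict) -> list:
    """Return a list of human-readable problems; empty when `level` is consistent."""
    declared = case.get("level")
    wanted = expected_level(case)
    if declared != wanted:
        return [f"{case['id']}: level is {declared}, matchers imply {wanted}"]
    return []

case = {
    "id": "tone-paraphrase-010",
    "level": 2,  # deliberately wrong for the demonstration
    "matchers": [{"type": "embedding_similarity", "threshold": 0.88}],
}
errors = level_errors(case)
```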

Example with tier 4 (embedding)

Use when answers are free-form but you still want an automated signal before human review:

id: tone-paraphrase-010
description: "Empathetic reply equivalent, not literal"
created_at: "2026-05-10"
updated_at: "2026-05-10"
update_reason: "Threshold tuned on 30 human-labeled pairs"
deprecated: false
input:
  user_message: "I am really upset about the delay."
matchers:
  - type: embedding_similarity
    golden_text: "I understand the frustration about the delay; here is the latest on your shipment."
    threshold: 0.88
    model: text-embedding-3-small
level: 4
tags:
  - tone
  - shipping

Here level is 4 because the top matcher is semantic; do not label such a case level: 2.

Matcher tiers (1 → 5)

Standing rule: pick the lowest tier that works. If most cases land at 4 or 5, your output contract is probably too open—not your testers’ fault.

Tier 1 — Exact equality

What: byte-for-byte string equality on the canonical serialized output (or strict rules you document as tier 1).
Example type: exact_text on a deterministic JSON serialization (for example sorted keys and fixed separators, so the runner always emits the same bytes for the same data).
When: rigid templates, temperature 0, no acceptable surface variation.
Cost: negligible.
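
A sketch of what “canonical serialized output” could mean in practice, assuming the runner canonicalizes with Python’s json module. The function names are illustrative; the specific choices (sorted keys, compact separators) are one convention, not a standard this article prescribes.

```python
import json

def canonical_json(value) -> str:
    """Serialize with sorted keys and fixed separators so equal data gives equal text."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

def exact_text(expected: str, actual: str) -> bool:
    """Tier 1: plain string equality, no normalization."""
    return expected == actual

golden = canonical_json({"intent": "refund", "order_id": "1234"})
# Same data, different key order and spacing in the raw model output.
actual = canonical_json(json.loads('{"order_id": "1234",  "intent": "refund"}'))
same = exact_text(golden, actual)
```

Without the canonicalization step, key order or whitespace alone would fail the comparison, which is usually not the regression you meant to catch.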

Tier 2 — Structure or substring predicates

What: json_field (eq, ne, gte, lte, in), contains, not_contains, regex.
When: structured tool output, intent + confidence, lexical bans.
Cost: linear in output length, all local.

Tier 3 — Structural contract or controlled normalization

What: json_subset (parsed output contains at least the nested pairs under expected), json_schema validation, normalized_text_equal (runner-fixed normalization rules, e.g. NFC + lowercase + collapsed whitespace, then equality on the named field).
When: extra keys are acceptable or vary by model version (see tool-output-contract-015 above); free text in one field where only the normalized surface must match.
Cost: JSON parse + optional schema validation; all local.

Tier 2 vs 3: at tier 2 you drive json_field for every field you care about and keep updating the golden whenever a new irrelevant key appears. At tier 3 json_subset pins the sub-contract that matters and ignores structural noise outside it.

Tier 4 — Embedding similarity

What: embedding_similarity between output and golden_text (or cached golden embedding vs actual), with a threshold calibrated on a human validation slice.
When: paraphrases are OK; you need a measurable semantic band.
Cost: one or two embedding calls per case (cache the golden side in CI when possible).
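
A minimal sketch of the tier 4 check, assuming the similarity is cosine similarity (the article does not name the metric). The embed parameter is a hypothetical callable wrapping whatever embedding API you use; here it is replaced by a lookup table of made-up vectors so the example runs offline.

```python
import math
from typing import Callable, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def embedding_similarity(
    actual_text: str,
    golden_text: str,
    threshold: float,
    embed: Callable[[str], Sequence[float]],  # hypothetical: text -> vector
) -> Tuple[bool, float]:
    """Return (passed, score) so the report can show the score on failure."""
    score = cosine_similarity(embed(actual_text), embed(golden_text))
    return score >= threshold, score

# Stand-in for a real embedding call: fixed, invented 2-d vectors.
FAKE_VECTORS = {"golden": [1.0, 0.0], "actual": [0.6, 0.8]}
passed, score = embedding_similarity(
    "actual", "golden", threshold=0.88, embed=FAKE_VECTORS.__getitem__
)
```

Returning the score alongside the boolean matters for review: a failure just under the threshold reads very differently from one far below it.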

Tier 5 — Human review only

What: the case documents inputs and qualitative criteria (brand, compliance, legal edge) you will not encode as a reliable binary matcher.
When: no generic automation should gate pass/fail; humans or a bespoke process decide.
Cost: human time; never treat as automatic green in CI.

CI integration contract

No mandate for Vitest, GitHub Actions, etc.—define a contract any runner can implement.

Inputs

Directory or manifest of golden YAML files.
Function or endpoint under test (prompt and model version pinned or identified by pipeline env vars).
Runner config: timeouts, whether retries are allowed (usually no for regression checks), secrets path for embeddings if tier 4 is used.

Outputs (machine + human)

Pass/fail counts overall and by max case level.
For each failure: which matcher broke, expected vs actual (truncated strings, JSON paths, similarity scores).
review_required: true on every tier 4 failure (evidence for review tied to a possible override) and on all tier 5 cases.
Optional: textual or pretty JSON diff—mask PII such as order IDs in logs.
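
The output contract above could be represented roughly as follows. CaseResult and summarize are hypothetical names; the sketch assumes the runner records the tier of each failed matcher, and it leaves tier 5 cases out of the pass/fail counts because they are never an automatic green.

```python
from dataclasses import dataclass, field

@dataclass
class CaseResult:
    case_id: str
    level: int  # maximum matcher tier declared by the case
    failed_tiers: list = field(default_factory=list)  # tier of each failed matcher

    @property
    def review_required(self) -> bool:
        # Every tier 4 failure and every tier 5 case goes to a human.
        return self.level == 5 or 4 in self.failed_tiers

def summarize(results: list) -> dict:
    """Aggregate pass/fail counts overall and by case level."""
    summary = {"passed": 0, "failed": 0, "by_level": {}, "review_required": []}
    for result in results:
        bucket = summary["by_level"].setdefault(result.level, {"passed": 0, "failed": 0})
        if result.review_required:
            summary["review_required"].append(result.case_id)
        if result.level == 5:
            continue  # tier 5 is never counted as an automatic pass or fail
        outcome = "failed" if result.failed_tiers else "passed"
        summary[outcome] += 1
        bucket[outcome] += 1
    return summary

# Invented results for illustration.
report = summarize([
    CaseResult("refund-intent-001", 2),
    CaseResult("tone-paraphrase-010", 4, [4]),
    CaseResult("brand-voice-020", 5),
    CaseResult("tool-output-contract-015", 3, [2]),
])
```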

When to run

Every prompt change touching the system: mandatory run pre-merge (or pre-push for tiny teams).
Every model change (API version, snapshot, production-default temperature): mandatory pre-deploy.
Periodic runs (e.g. weekly) with no code changes: providers ship silent updates—model drift happens even if your lockfile did not move.

Failure policy (recommended)

Tier involved in the failure	Pipeline behavior
1–3	Fail blocks merge/deploy.
4	Default under this protocol: same as 1–3 — fail blocks merge/deploy. Overrides are allowed only with explicit, traceable approval (linked ticket, PR label, or mandatory signed field in the CI report from an owner) after reading the report (similarity score, actual snippet). Do not clear red without a trail: tier 4 is inherently less binary than `eq` and would otherwise hollow out CI. Larger orgs may soften this only by replacing the default in written policy.
5	Never a final automatic pass/fail; human queue only.

Blocking tier 4 by default prevents merges from going through when the only failing signal was “embedding below threshold”. Similarity scores carry variance (embedding model, text, tokens), so such a failure is ambiguous, not ignorable. Review must be a deliberate step, not a hole in the pipeline.
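
The policy table can be read as a small decision function. The sketch below is one way to encode it; Decision, decide, and the override_ref parameter (a ticket or PR label identifier) are hypothetical names, and the way an override is verified is left to your pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Decision:
    blocked: bool       # True when merge/deploy must not proceed
    human_queue: bool   # True when something must be looked at by a person
    reason: str

def decide(
    failed_tiers: List[int],
    tier5_pending: bool = False,
    override_ref: Optional[str] = None,  # hypothetical: ticket ID or PR label
) -> Decision:
    """Apply the recommended failure policy to one pipeline run."""
    if any(tier <= 3 for tier in failed_tiers):
        return Decision(True, tier5_pending, "tier 1-3 failure blocks merge/deploy")
    if 4 in failed_tiers:
        if override_ref:
            return Decision(False, True, f"tier 4 failure overridden via {override_ref}")
        return Decision(True, True, "tier 4 failure blocks until an owner approves an override")
    return Decision(False, tier5_pending, "no automated failures")

blocked_run = decide([4])
approved_run = decide([4], override_ref="TICKET-123")  # invented reference
```

Note that an override never rescues a run that also failed at tiers 1 to 3: the first branch returns before the override is considered.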

Golden set update policy

A stale golden set is worse than none: it creates false confidence. This section is the social contract between engineering and product.

When to update

A new production input shape appears → add a case with created_at / update_reason.
A case fails but the new output is correct under an updated spec → fix the golden, document why the old expectation was wrong.
Intentional behavior change (refund policy, tone) → golden updates ship in the same change request as code or prompt.

How to update (operational rules)

No silent auto-updates from CI (“overwrite expected with actual”). Every change to matchers or input is a human decision with a trail.
Every edit bumps updated_at and update_reason (a single sentence in YAML is enough).
Do not hard-delete cases overnight: set deprecated: true, optionally deprecated_at / deprecated_reason, and keep at least one release cycle for grep and audit.

Deprecation example:

id: refund-intent-legacy-000
description: "Legacy wording pre April 2026 spec"
deprecated: true
deprecated_at: "2026-05-15"
deprecated_reason: "Replaced by refund-intent-001 after order-ID policy change"
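
The update rules lend themselves to a pre-merge lint that compares the old and new version of a case. This is a sketch under assumptions: the function names are invented, cases are plain dicts, and a same-day edit is allowed to keep the same updated_at as long as update_reason changes.

```python
def edit_trail_errors(old: dict, new: dict) -> list:
    """Compare two versions of one case and list missing audit-trail updates."""
    changed = any(old.get(key) != new.get(key) for key in ("input", "matchers"))
    if not changed:
        return []
    errors = []
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    if new.get("updated_at", "") < old.get("updated_at", ""):
        errors.append("updated_at is older than the previous version")
    reason = new.get("update_reason")
    if not reason or reason == old.get("update_reason"):
        errors.append("update_reason is missing or unchanged")
    return errors

def gating_cases(cases: list) -> list:
    """Deprecated cases stay in the repo but no longer gate CI."""
    return [case for case in cases if not case.get("deprecated", False)]

old = {
    "id": "refund-intent-001",
    "updated_at": "2026-05-01",
    "update_reason": "Added from ticket #4412",
    "matchers": [{"type": "contains", "value": "1234"}],
}
# Matchers changed, but the trail fields were left untouched.
new = dict(old, matchers=[{"type": "contains", "value": "5678"}])
errors = edit_trail_errors(old, new)
```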

Degraded set signal

If a single model upgrade forces you to rewrite assertions for more than ~30% of cases, the set is too coupled to surface form (verbatim punctuation) instead of semantic contract. Fix by pushing matchers down to tiers 2–3 (structured fields), tightening schemas, and reducing unnecessary free text.
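
The ~30% signal is simple arithmetic, sketched here so it can be computed from a diff of case IDs. The function names and the way rewritten IDs are collected are assumptions; the 0.30 default mirrors the rough figure above and is not a calibrated constant.

```python
def rewrite_ratio(rewritten_ids: set, gating_ids: set) -> float:
    """Share of gating cases whose assertions were rewritten for one model upgrade."""
    if not gating_ids:
        return 0.0
    return len(rewritten_ids & gating_ids) / len(gating_ids)

def is_degraded(rewritten_ids: set, gating_ids: set, threshold: float = 0.30) -> bool:
    return rewrite_ratio(rewritten_ids, gating_ids) > threshold

# Invented IDs: 40 gating cases, 13 of them rewritten after an upgrade.
gating = {f"case-{n:03d}" for n in range(40)}
rewritten = {f"case-{n:03d}" for n in range(13)}
ratio = rewrite_ratio(rewritten, gating)
degraded = is_degraded(rewritten, gating)
```

Here 13 of 40 cases is 32.5%, which crosses the line; 10 of 40 (25%) would not.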

Out of scope

LLM-as-judge evaluators as the primary gate: different pattern, not deterministic in the sense used here.
Live A/B traffic, fine-tuning evals, commercial frameworks (LangSmith, RAGAS, …): useful, but not required to implement the standard above with an internal script.

Closing

The takeaway question: “If I change the model tomorrow morning, how long until I know whether I broke my real cases?”

If the answer is “I don’t know” or “about a day,” you need an operational golden set. If it becomes “ten minutes after the pipeline finishes,” this article did its job.
