2026-06-16
LLM regression protocol: golden sets, deterministic matchers, and update policy
How to structure a golden set to catch regressions on prompt and model changes without an LLM-as-judge: file format, matcher tiers, CI policy, and update rules.
Who this is for
Teams that already run an LLM system in production (RAG, agent, classifier) and have hit: “I changed the prompt or upgraded the model and I don’t know what I broke.” This is not a beginner’s introduction to LLMs—it assumes you know what a prompt, an inference endpoint, and a deploy pipeline are.
The problem this solves
Prompt changes
You tweak wording, add context, or tighten output format. The model family may be the same, but behavior shifts often and in small steps. You need a deterministic signal that, on known cases, the system still satisfies the assertions you treat as correct (intent, JSON fields, forbidden phrases).
Model changes
You move between versions or snapshots. The new model can look “better” on generic benchmarks and still regress on your inputs. These events are rare but high blast radius: one deploy can change thousands of answers.
Why treat them separately
- Prompt changes: high frequency, localized impact; you want fast CI feedback (pre-merge).
- Model changes: low frequency, global impact; you want pre-deploy gates and often a wider review window.
The operational issue with “let another LLM score the output” is not ideology: it is not bit-repeatable, it does not scale cleanly in CI without variance, and when it fails it rarely tells you which assertion broke. This article replaces that pattern with explicit golden cases, matchers, and update policy.
What a golden set is (and is not)
It is: a collection of (input, expected assertions) that pin acceptable behavior on representative examples. Each case is a small executable spec.
It is not: a fine-tuning dataset, an exhaustive long-tail test suite, or a vault of “perfect answers.” It does not replace production monitoring.
How to pick cases:
- prioritize edge cases already seen failing in prod or staging;
- cover the main input variants (not every combinatorial permutation);
- if the system has multiple output shapes (e.g. tool JSON vs free text), include at least one case per shape.
Size: for a medium RAG or classifier, 20–50 cases is a healthy band. Below ~20 the signal is thin; above ~200 maintenance hurts without dedicated tooling (split by domain, clear ownership).
File format
YAML works well: readable in review, Git-diff-friendly, lintable.
Suggested repo layout
eval/
golden/
cases/
refund-intent-001.yaml
shipping-delay-002.yaml
README.md
Single case (matchers only up to level 2)
id: refund-intent-001
description: "User asks for refund with explicit order ID"
created_at: "2026-05-01"
updated_at: "2026-05-01"
update_reason: "Added from ticket #4412"
deprecated: false
input:
user_message: "I want a refund for order #1234"
context: "Verified customer; last order shipped 10 days ago."
matchers:
- type: json_field
field: intent
op: eq
value: refund
- type: json_field
field: confidence
op: gte
value: 0.85
- type: contains
value: "1234"
- type: not_contains
value: "I cannot help you"
level: 2
tags:
- refund
- entity-extraction
Example with tier 3 (json_subset and extra keys across model versions)
The API (or tool) returns JSON with extra keys that differ by model version or snapshot (model_reasoning, debug_trace, experimental fields). You only care that a subset of key/value pairs holds: intent, operational status, etc. Tier 2 alone forces you either to assert every irrelevant new key away or to churn the golden constantly. Use json_subset: the parsed output must contain at least the expected object; everything else is ignored.
id: tool-output-contract-015
description: "Tracking tool: model adds extra keys; client contract is intent + status + confidence only"
created_at: "2026-05-12"
updated_at: "2026-05-12"
update_reason: "After model upgrade undocumented optional keys appear in payload"
deprecated: false
input:
user_message: "Where is my package?"
matchers:
- type: json_subset
expected:
intent: tracking
status: ok
- type: json_field
field: confidence
op: gte
value: 0.72
- type: normalized_text_equal
field: summary
expected: "The parcel is in transit to the operations hub."
level: 3
tags:
- tool-json
- tracking
json_subset pins the minimal structure; json_field on confidence keeps a local threshold check. normalized_text_equal (still tier 3 here) applies runner-defined, fixed normalization (e.g. NFC, lowercase, collapsed whitespace) before comparing the summary field—useful when the model tweaks punctuation or spacing but meaning must stay equivalent. Case level is 3 because the weakest structural matcher (not byte-for-byte on the whole JSON) is subset/normalization.
Field reference
| Field | Role |
|---|---|
id | Stable key for the CI runner and reports. Do not rename casually—prefer deprecated plus a new file. |
description | Human context for review: why this case exists. |
created_at / updated_at | Audit trail; any contract change updates updated_at and update_reason. |
update_reason | Required when assertions or inputs change: explains the decision. |
deprecated | true when the case must no longer gate CI but stays in repo for history. |
input | Payload the runner passes into the function or endpoint under test. |
matchers | Ordered list of deterministic checks (see tier table). |
level | Maximum matcher tier used in the case (1 … 5). Used to aggregate reports (“how many cases touch tier 4?”). Must match the type list. |
tags | Filters in dashboards or product ownership. |
Example with tier 4 (embedding)
Use when answers are free-form but you still want an automated signal before human review:
id: tone-paraphrase-010
description: "Empathetic reply equivalent, not literal"
created_at: "2026-05-10"
updated_at: "2026-05-10"
update_reason: "Threshold tuned on 30 human-labeled pairs"
deprecated: false
input:
user_message: "I am really upset about the delay."
matchers:
- type: embedding_similarity
golden_text: "I understand the frustration about the delay; here is the latest on your shipment."
threshold: 0.88
model: text-embedding-3-small
level: 4
tags:
- tone
- shipping
Here level is 4 because the top matcher is semantic; do not label such a case level: 2.
Matcher tiers (1 → 5)
Standing rule: pick the lowest tier that works. If most cases land at 4 or 5, your output contract is probably too open—not your testers’ fault.
Tier 1 — Exact equality
- What: byte-for-byte string equality on the canonical serialized output (or strict rules you document as tier 1).
- Example type:
exact_texton deterministic JSON serialization (sorted keys if your runner always emits the same shape). - When: rigid templates, temperature 0, no acceptable surface variation.
- Cost: negligible.
Tier 2 — Structure or substring predicates
- What:
json_field(eq,ne,gte,lte,in),contains,not_contains,regex. - When: structured tool output, intent + confidence, lexical bans.
- Cost: linear in output length, all local.
Tier 3 — Structural contract or controlled normalization
- What:
json_subset(parsed output contains at least the nested pairs underexpected),json_schemavalidation,normalized_text_equal(runner-fixed normalization rules, e.g. NFC + lowercase + collapsed whitespace, then equality on the named field). - When: Extra keys are acceptable or vary by model version (see
tool-output-contract-015above); free text in one field where only the normalized surface must match. - Cost: JSON parse + optional schema validation; all local.
Tier 2 vs 3: at tier 2 you drive json_field for every field you care about and keep updating the golden whenever a new irrelevant key appears. At tier 3 json_subset pins the sub-contract that matters and ignores structural noise outside it.
Tier 4 — Embedding similarity
- What:
embedding_similaritybetween output andgolden_text(or cached golden embedding vs actual), with a threshold calibrated on a human validation slice. - When: paraphrases are OK; you need a measurable semantic band.
- Cost: one or two embedding calls per case (cache the golden side in CI when possible).
Tier 5 — Human review only
- What: the case documents inputs and qualitative criteria (brand, compliance, legal edge) you will not encode as a reliable binary matcher.
- When: no generic automation should gate pass/fail; humans or a bespoke process decide.
- Cost: human time; never treat as automatic green in CI.
CI integration contract
No mandate for Vitest, GitHub Actions, etc.—define a contract any runner can implement.
Inputs
- Directory or manifest of golden YAML files.
- Function or endpoint under test (prompt and model version pinned or identified by pipeline env vars).
- Runner config: timeouts, whether retries are allowed (usually no for regression checks), secrets path for embeddings if tier 4 is used.
Outputs (machine + human)
- Pass/fail counts overall and by max case
level. - For each failure: which
matcherbroke, expected vs actual (truncated strings, JSON paths, similarity scores). review_required: trueon every tier 4 failure (evidence for review tied to a possible override) and on all tier 5 cases.- Optional: textual or pretty JSON diff—mask PII such as order IDs in logs.
When to run
- Every prompt change touching the system: mandatory run pre-merge (or pre-push for tiny teams).
- Every model change (API version, snapshot, production-default temperature): mandatory pre-deploy.
- Periodic runs (e.g. weekly) with no code changes: providers ship silent updates—model drift happens even if your lockfile did not move.
Failure policy (recommended)
| Tier involved in the failure | Pipeline behavior |
|---|---|
| 1–3 | Fail blocks merge/deploy. |
| 4 | Default under this protocol: same as 1–3 — fail blocks merge/deploy. Overrides are allowed only with explicit, traceable approval (linked ticket, PR label, or mandatory signed field in the CI report from an owner) after reading the report (similarity score, actual snippet). Do not clear red without a trail: tier 4 is inherently less binary than eq and would otherwise hollow out CI. Larger orgs may soften this only by replacing the default in written policy. |
| 5 | Never a final automatic pass/fail; human queue only. |
Blocking tier 4 by default stops merges where the only signal was “embedding below threshold”: thresholds have variance (embedding model, text, tokens). Review must be a deliberate step, not a hole in the pipeline.
Golden set update policy
A stale golden set is worse than none: it creates false confidence. This section is the social contract between engineering and product.
When to update
- A new production input shape appears → add a case with
created_at/update_reason. - A case fails but the new output is correct under an updated spec → fix the golden, document why the old expectation was wrong.
- Intentional behavior change (refund policy, tone) → golden updates ship in the same change request as code or prompt.
How to update (operational rules)
- No silent auto-updates from CI (“overwrite expected with actual”). Every change to
matchersorinputis a human decision with a trail. - Every edit bumps
updated_atandupdate_reason(a single sentence in YAML is enough). - Do not hard-delete cases overnight: set
deprecated: true, optionallydeprecated_at/deprecated_reason, and keep at least one release cycle for grep and audit.
Deprecation example:
id: refund-intent-legacy-000
description: "Legacy wording pre April 2026 spec"
deprecated: true
deprecated_at: "2026-05-15"
deprecated_reason: "Replaced by refund-intent-001 after order-ID policy change"
Degraded set signal
If a single model upgrade forces you to rewrite assertions for more than ~30% of cases, the set is too coupled to surface form (verbatim punctuation) instead of semantic contract. Fix by pushing matchers down to tiers 2–3 (structured fields), tightening schemas, and reducing unnecessary free text.
Out of scope
- LLM-as-judge evaluators as the primary gate: different pattern, not deterministic in the sense used here.
- Live A/B traffic, fine-tuning evals, commercial frameworks (LangSmith, RAGAS, …): useful, but not required to implement the standard above with an internal script.
Closing
The takeaway question: “If I change the model tomorrow morning, how long until I know whether I broke my real cases?”
If the answer is I don’t know or about a day, you need an operational golden set. If it becomes ten minutes after the pipeline finishes, this article did its job.