05 - Evaluation & Validation: Earning 99.9%
Part of Agentic-Native SDLC for Regulated Medical Device Engineering.
Status: Reference baseline, May 2026. Thresholds shown are org-set placeholders pending QMS ratification.
Audience: engineering, quality/QA, regulatory affairs, validation (CSV/CSA) owners.
Cross-references: 01-requirements.md, 02-maturity-model.md, 03-reference-architecture.md, 04-model-strategy-and-finetuning.md, 06-agentic-workflows.md, 07-security-and-compliance.md, 08-token-and-gpu-economics.md, 09-adoption-roadmap.md.
This document is the core of the 99.9% argument. Everything else in the program — the model fleet, the harness, the maturity ladder — exists to feed and be governed by the evaluation and validation system described here. The thesis is Principle P1: 99.9% is a system property, not a model property. No open-weight model, fine-tuned or otherwise, will be claimed to be 99.9% correct. Instead we engineer a pipeline — Generate → Verify → Repair → Gate — whose aggregate release-gate correctness reaches the target, and we prove it with deterministic, recorded, reproducible evidence (P2, P4, P7).
1. Reframing the 99.9%: what number are we actually defending?
The single most common error in agentic-SDLC programs is conflating three different quantities under one "99.9%" banner. They are measured differently, owned differently, and validated differently.
| # | Quantity | Definition | Who is responsible | Do we engineer it? |
|---|---|---|---|---|
| (a) | Model accuracy | Probability a single model invocation produces a correct artifact on a representative task distribution | Model strategy (04) | No — informative, never gated on directly |
| (b) | System release-gate correctness | Probability that an artifact admitted past the release gate is correct, given all verifiers, repair, and HITL | Harness + Eval (this doc) | Yes — primary target ≥ 99.9% |
| (c) | Escape-rate | Probability that a defect reaches a regulated artifact (released code, DHF/DHR record, submission content) undetected | Whole SDLC + post-market | Yes — minimized; the safety-relevant number |
Stated plainly: we do not promise model accuracy (a). We engineer system correctness (b) and we drive down escape-rate (c). A model that is only 92% accurate but abstains or fails loudly on the other 8% can feed a 99.9% system, because the system never admits the bad 8%: it routes them to repair, escalation, or human control. Conversely, a 99% model with weak gating can have a worse escape-rate than a 90% model with strong gating. The model is a generator; the gate is the guarantee.
1.1 Metric definitions (precise)
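Let the unit of evaluation be a change-set (a PR-sized artifact) or an eval item (a single golden task). Define a "correct" outcome against the deterministic ground truth for that item. In the formulas below, a positive is an artifact the gate admits: TP is admitted and correct, FP is admitted but defective, TN is rejected and defective, and FN is rejected although correct.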
| Metric | Formula | Meaning / use |
|---|---|---|
| Precision (of the gate) | TP / (TP + FP) | Of artifacts the gate passes, fraction actually correct. This is (b). Drives release-gate correctness. |
| Recall (of the gate) | TP / (TP + FN) | Of correct artifacts, fraction the gate passes. A proxy for first-pass throughput; low recall = wasted generation, not a safety issue. |
| Escape-rate | (defects reaching a regulated artifact) / (total artifacts released) | (c). The post-gate failure probability. Target is an org-set ceiling, e.g. ≤ 1e-3, tracked per safety class. |
| Abstention-rate | (abstained or IDK items) / (total items) | How often the system declines and escalates rather than guessing. A yield lever, not a defect (§6.4). |
| First-Pass Yield (FPY) | (items passing all gates with zero repair iterations) / (total items) | Generation quality + harness efficiency. An economic metric (P6, ties to 08), not a safety gate. |
| Repair-adjusted yield | (items passing after ≤ K repair iterations) / (total items) | Throughput including the bounded repair loop. |
| False-pass rate (FPR_gate) | FP / (TP + FP), computed over admitted artifacts | The complement of gate precision (1 − precision): the escape-rate view at the gate boundary, and the number a CSA risk assessment cares about most. |
A critical conceptual rule, established here and enforced throughout: precision and escape-rate are the safety metrics; recall and FPY are the economic metrics. When the two trade off, safety wins — we prefer a system that abstains and escalates (lower FPY, higher cost) over one that admits a marginal artifact (higher FPY, higher escape-rate). This is a direct expression of risk-proportional autonomy (P3).
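The table's ratios can be computed directly from outcome counts. The sketch below is illustrative only: the `GateCounts` name, the choice to keep abstentions in their own bucket, and the example counts are assumptions rather than program definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateCounts:
    """Outcome counts for one eval run; "positive" means the gate admitted."""

    tp: int          # admitted and correct
    fp: int          # admitted but defective
    tn: int          # rejected and defective
    fn: int          # rejected although correct
    abstained: int   # declined and escalated (assumed to be a separate bucket)
    first_pass: int  # admitted with zero repair iterations (subset of tp + fp)

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.tn + self.fn + self.abstained

    def precision(self) -> float:
        # Safety metric: of admitted artifacts, the fraction that is correct.
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        # Economic metric: of correct artifacts, the fraction admitted.
        return self.tp / (self.tp + self.fn)

    def abstention_rate(self) -> float:
        return self.abstained / self.total

    def first_pass_yield(self) -> float:
        return self.first_pass / self.total


# Hypothetical counts for a 1000-item run (not measured data).
counts = GateCounts(tp=940, fp=1, tn=39, fn=10, abstained=10, first_pass=700)
print(f"precision={counts.precision():.5f} recall={counts.recall():.5f}")
print(f"abstention={counts.abstention_rate():.3f} FPY={counts.first_pass_yield():.3f}")
```

With these made-up counts a single false pass in 941 admitted items already puts precision below 99.9%, which is why the safety metrics are evaluated with confidence bounds rather than point estimates (§6.2).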
2. The assurance pipeline
Every artifact produced by an agent — code, test, spec, config, document fragment — flows through the same assurance pipeline before it can become evidence or be released. Determinism dominates the critical path (P2): probabilistic checks may inform and escalate, but they may never be the sole admit/reject decision for Class B or Class C work.
```mermaid
flowchart TD
    A[Intent / Spec<br/>requirement, Gherkin, ticket] --> B[Generate<br/>model fleet S/M/L/V/E<br/>constrained + spec-conditioned decoding]
    B --> C{Deterministic<br/>Verify Layer}
    C -- fail --> D[Repair Loop<br/>bounded budget K<br/>feed verifier output back]
    D --> B
    C -- pass --> E{Eval Gate<br/>golden suites + thresholds<br/>statistical acceptance}
    E -- below threshold --> D
    E -- secondary signal only --> F[LLM-as-judge / rubric<br/>NON-gating for B/C<br/>escalation trigger only]
    F --> G
    E -- pass --> G{HITL<br/>risk-proportional<br/>Class C = dual control}
    G -- changes requested --> D
    G -- abstain / IDK --> H[Escalate to human owner]
    G -- approved --> I[Release Record<br/>signed evidence bundle<br/>traceability + Part 11 audit]
    D -- budget exhausted --> H
    I --> J[(Production / DHF / DHR<br/>regulated artifact)]
    J -. production failures captured .-> K[Regression Capture<br/>new golden items -> §8]
    K -. closed loop L5 .-> E
```
| Stage | Role | Why it sits here |
|---|---|---|
| Generate | Produce candidate artifact(s); may emit N samples for self-consistency | Probabilistic by nature; structural constraints are the cheapest way to make a bad candidate fail loudly rather than silently |
| Deterministic Verify | Hard, reproducible accept/reject from compilers, tests, analyzers, schema/policy checks | The workhorse. Same input → same verdict, always (P7). This is what makes the gate auditable. |
| Repair loop | Feed verifier diagnostics back to the generator; retry within a bounded budget K | Converts a probabilistic generator into a system that converges; budget prevents infinite/cost-runaway loops |
| Eval Gate | Compare against versioned golden suites + statistical acceptance criteria by safety class | Decides admit/reject on evidence, not vibes; applies confidence-interval thresholds (§6) |
| HITL | Human review proportional to IEC 62304 class; Class C always dual control (P3) | Eval score never substitutes for required human judgment on high-risk artifacts |
| Release record | Emit a signed, immutable evidence bundle (Part 11) with full traceability | Everything is evidence (P4); the record is the validation artifact |
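The Generate, Verify, and Repair stages can be sketched as a bounded loop. This is a minimal illustration that assumes hypothetical `generate` and `verify` callables; the Eval Gate, HITL, and release-record stages are deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Verdict:
    passed: bool
    diagnostics: str = ""


def run_pipeline(
    generate: Callable[[str, str], str],
    verify: Callable[[str], Verdict],
    spec: str,
    repair_budget: int,
) -> Optional[str]:
    """Return a candidate that passed verification, or None to escalate.

    generate(spec, feedback) produces a candidate; feedback carries the
    previous verifier diagnostics ("" on the first attempt).
    """
    feedback = ""
    # One initial attempt plus at most `repair_budget` repair iterations.
    for _attempt in range(repair_budget + 1):
        candidate = generate(spec, feedback)
        verdict = verify(candidate)
        if verdict.passed:
            return candidate
        feedback = verdict.diagnostics
    return None  # budget exhausted: abstain and escalate, never force-pass


# Toy stand-ins for the model and the verifier suite (hypothetical).
def toy_generate(spec: str, feedback: str) -> str:
    return "return a + b" if feedback else "return a - b"


def toy_verify(candidate: str) -> Verdict:
    ok = candidate == "return a + b"
    return Verdict(ok, "" if ok else "expected add(2, 3) == 5")


print(run_pipeline(toy_generate, toy_verify, "add two ints", repair_budget=2))
print(run_pipeline(toy_generate, toy_verify, "add two ints", repair_budget=0))
```

The second call shows the budget-exhausted path: with no repair iterations allowed, the loop returns `None` and the item would be escalated to a human owner.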
3. Deterministic verification layer (the workhorse)
This layer is why the program works. Principle P5: the harness is the product. Each verifier is a deterministic function verify(artifact, context) → {PASS, FAIL(diagnostics)} that is reproducible (P7), versioned, and recorded. They compose: an artifact is admitted only if it survives the conjunction of all applicable verifiers for its class.
| Verifier class | What it guarantees | Critical-path position | Determinism |
|---|---|---|---|
| Build / compile | Artifact is syntactically valid and integrates into the target | Gate 0 — runs first; cheap, eliminates gross failures | Fully deterministic |
| Type systems / static typing | Type-level contracts hold; whole classes of interface defects excluded | Gate 0/1 — pre-test | Fully deterministic |
| Unit tests | Specified behaviors hold on chosen inputs | Gate 1 — core correctness | Deterministic given seeded fixtures |
| Property-based tests | Invariants hold across generated input distributions, not just examples | Gate 1/2 — depth beyond unit | Deterministic with fixed seed; record seed in evidence |
| Mutation testing | The test suite itself is adequate (kills injected faults) — guards against vacuous green | Gate 2 — meta-check on tests | Deterministic; expensive, sampled or scheduled |
| Differential / metamorphic testing | New implementation matches a reference oracle, or obeys metamorphic relations when no oracle exists | Gate 2 — oracle problem mitigation | Deterministic |
| Fuzzing | No crashes/UB/memory-safety violations on adversarial inputs | Gate 2/3 — robustness | Deterministic per corpus + seed; time-boxed |
| SAST / DAST / SCA | No known-pattern vulnerabilities, runtime exposures, or vulnerable dependencies | Gate 2 — security (ties to 07) | Deterministic given ruleset + version pinning |
| Schema / contract validation | Interfaces, data, and API contracts conform; structured outputs are well-formed | Gate 0/1 — fast structural guard | Fully deterministic |
| Formal methods / specification checks | Critical properties provably hold (model checking, SMT, refinement) — reserved for Class C safety functions | Gate 3 — strongest, narrowest | Deterministic (proof or counterexample) |
| Policy-as-code | Org/regulatory rules enforced mechanically (licensing, banned APIs, segregation of duties, sign-off rules) | Gate at every stage — governance | Fully deterministic |
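A minimal sketch of the `verify(artifact, context)` contract and its conjunction follows. The `Result` type, the two toy verifiers, and the `banned_call` context key are assumptions for illustration; real verifiers would wrap compilers, test runners, and analyzers.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass(frozen=True)
class Result:
    verifier: str
    passed: bool
    diagnostics: str = ""


# A verifier is a pure function of (artifact, context).
Verifier = Callable[[str, Mapping[str, str]], Result]


def verify_all(
    artifact: str,
    context: Mapping[str, str],
    verifiers: Sequence[Verifier],
) -> list[Result]:
    """Run verifiers in the given order (cheapest first); stop at the first FAIL."""
    results: list[Result] = []
    for verifier in verifiers:
        result = verifier(artifact, context)
        results.append(result)
        if not result.passed:
            break
    return results


def admitted(results: Sequence[Result], expected: int) -> bool:
    """Admit only if every applicable verifier ran and passed."""
    return len(results) == expected and all(r.passed for r in results)


def compiles(artifact: str, context: Mapping[str, str]) -> Result:
    # Gate 0 stand-in: Python's built-in compile() as the "build" step.
    try:
        compile(artifact, "<candidate>", "exec")
        return Result("build", True)
    except SyntaxError as exc:
        return Result("build", False, str(exc))


def no_banned_calls(artifact: str, context: Mapping[str, str]) -> Result:
    # Policy-as-code stand-in: a plain substring rule (hypothetical policy).
    banned = context.get("banned_call", "eval(")
    ok = banned not in artifact
    return Result("policy", ok, "" if ok else f"banned call {banned!r}")


suite = [compiles, no_banned_calls]
for source in ["x = 1\n", "x = (", "x = eval('1')\n"]:
    outcome = verify_all(source, {}, suite)
    print(repr(source), admitted(outcome, len(suite)), [r.verifier for r in outcome])
```

Stopping at the first failure keeps expensive gates from running on candidates that a cheap gate already rejected, and the recorded `Result` list is what a repair loop would feed back to the generator.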
3.1 How composition drives escape-rate down
Verifiers are arranged as a defense-in-depth conjunction, ordered cheap-to-expensive so that most bad artifacts are killed early at low cost (P6). If verifier i has independent miss-probability mᵢ (it lets a defect through), and a defect must evade all of them to escape, the residual escape probability is bounded by the product ∏ mᵢ — provided the verifiers are sufficiently independent (catch different defect families). Independence is the design objective: type checkers catch interface defects, property tests catch invariant violations, mutation testing catches weak tests, SAST catches vulnerability patterns. Correlated verifiers (e.g., two SAST tools sharing a ruleset) do not multiply their protection, and we do not claim they do (§6.1, §11). The Eval Owner (§8) maintains a documented defect-family-to-verifier coverage matrix so that the independence claim underlying the math is auditable rather than assumed.
4. Generation-time techniques that raise first-pass yield
These raise FPY and cut repair cost (an economic win, P6/08). They are not a substitute for the gate — they make the generator produce gate-passable artifacts more often.
| Technique | Mechanism | Yield effect | Caveat |
|---|---|---|---|
| Constrained / structured decoding | Force output to a grammar/JSON schema/AST shape during generation | Eliminates malformed-output failures outright | Constrains form, not correctness; output must still pass the verifiers |
| Spec-conditioned generation (BDD/Gherkin) | Condition the model on executable acceptance criteria | Aligns generation to the exact gate it must pass | Spec quality bounds outcome quality |
| Retrieval grounding | Inject relevant internal code, standards, prior decisions into context | Reduces hallucinated APIs and policy violations | Retrieved context must itself be governed (07) |
| N-sample self-consistency + verifier selection | Generate N candidates; select by deterministic verifier outcome, not by model vote | Raises the probability that at least one candidate passes; selection stays deterministic | Cost ∝ N; budget per safety class (08) |
| Test-first generation | Agent writes a failing test from the spec, then code to satisfy it | Forces an explicit, checkable success criterion | Test must be reviewed so the agent doesn't write a weak/vacuous test (mutation testing guards this) |
| Bounded repair loops | Feed verifier diagnostics back; retry up to budget K | Recovers near-misses cheaply | Budget exhaustion → abstain/escalate, never force-pass |
Selection rule (load-bearing): when using N-sample self-consistency, the selector is the deterministic verifier suite, not an LLM majority vote. Self-consistency raises the chance a good candidate exists; determinism decides which one is admitted. This keeps P2 intact even while exploiting probabilistic breadth.
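The selection rule can be illustrated with a toy selector. Everything here is hypothetical (the candidate strings, the `passes_unit_tests` check); the point is only that the verifier outcome, not a model vote, decides which candidate moves forward.

```python
from typing import Callable, Optional, Sequence


def select_candidate(
    candidates: Sequence[str],
    passes_all_verifiers: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate the deterministic suite admits.

    Candidates are tried in generation order; no model vote is involved.
    """
    for candidate in candidates:
        if passes_all_verifiers(candidate):
            return candidate
    return None  # nothing passed: hand over to repair or escalation


def passes_unit_tests(source: str) -> bool:
    """Toy verifier: execute the candidate and check two fixed cases."""
    namespace: dict = {}
    try:
        # Illustration only; real harnesses run candidates in a sandbox.
        exec(source, namespace)
        return namespace["add"](2, 3) == 5 and namespace["add"](-1, 1) == 0
    except Exception:
        return False


candidates = [
    "def add(a, b): return a - b",
    "def add(a, b): return a * b",
    "def add(a, b): return a + b",
    "def add(a, b) return a + b",  # does not even parse
]
print(select_candidate(candidates, passes_unit_tests))
```

Three of the four samples are wrong here, so a majority vote over samples would have nothing trustworthy to agree on; the deterministic check still finds the one candidate that satisfies the spec.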
5. Probabilistic evaluation (secondary, never sole gate)
Probabilistic evaluation has real value for coverage of qualities deterministic checks cannot express (clarity of a rationale, plausibility of a design narrative, reasonableness of a trajectory). It is admitted only under strict containment.
| Probabilistic signal | What it assesses | Permitted role | Prohibited role |
|---|---|---|---|
| Golden datasets + rubrics | Behavior vs curated expected outcomes | Trend/regression metric; gate only where ground truth is deterministic | — |
| LLM-as-judge | Subjective quality, rationale adequacy | Secondary signal; escalation trigger; advisory on Class A only with human confirmation | Sole gate for any Class B or Class C artifact |
| Trajectory evaluation | Did the agent take valid steps / legal tool calls in a sane order (via OpenTelemetry traces) | Detects unsafe process even when output passes; flags for review | Cannot admit on its own |
| Behavioral drift detection | Distribution shift in outputs/scores vs a baseline | Monitoring + revalidation trigger (§9) | Cannot gate a single release |
Gating rules (binding):
- For Class B and Class C, every admit decision must rest on a deterministic verdict. Probabilistic signals may only block (raise concern → escalate) or inform; they may never admit (P2).
- LLM-as-judge is never the sole gate for B/C, full stop. Where used, the judge model, prompt, rubric, and version are recorded as part of the evidence and are themselves subject to drift monitoring.
- An LLM-as-judge disagreement with a deterministic verdict resolves in favor of the deterministic verdict, and the disagreement is logged for Eval Owner review.
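One possible reading of these rules in code: the deterministic verdict is always the verdict, a judge concern on a passing artifact only raises an escalation flag, and any disagreement is logged. The `GateOutcome` type and `combine` function are hypothetical names for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GateOutcome:
    verdict_pass: bool      # always equals the deterministic verdict
    escalate: bool          # judge raised a concern on a passing artifact
    log_disagreement: bool  # judge and deterministic verdict differ


def combine(deterministic_pass: bool, judge_pass: Optional[bool]) -> GateOutcome:
    """Merge a deterministic verdict with an optional LLM-judge signal.

    judge_pass is None when no judge ran. The judge can block (escalate)
    or inform, but it can never turn a deterministic FAIL into a pass.
    """
    if judge_pass is None:
        return GateOutcome(deterministic_pass, False, False)
    return GateOutcome(
        verdict_pass=deterministic_pass,
        escalate=deterministic_pass and not judge_pass,
        log_disagreement=judge_pass != deterministic_pass,
    )


for det, judge in [(True, True), (True, False), (False, True), (False, None)]:
    print(det, judge, combine(det, judge))
```

Note that the `(False, True)` case stays a failure: the judge's approval is recorded as a disagreement for Eval Owner review and changes nothing else.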
6. The math of 99.9%
6.1 Composing imperfect independent checks
Suppose a generated artifact is defective with prior probability p_def. It must pass n independent verifiers to be admitted; verifier i misses a defect with probability mᵢ. The probability a defect escapes the gate is bounded by:
P(escape) ≤ p_def · ∏(i=1..n) mᵢ
With three genuinely independent checks each missing 10% of defects (mᵢ = 0.1) and a 20% defect prior, P(escape) ≤ 0.2 · 0.1³ = 0.2 · 0.001 = 2e-4, already below a 1e-3 ceiling. The bounded repair loop does not open a side door: every repaired candidate is re-subjected to the full conjunction, so nothing is admitted without passing every verifier. Each retry is also a fresh draw against the same mᵢ, which is one reason the budget K stays bounded. The headline number, release-gate correctness ≥ 99.9%, is therefore (b): precision achieved by composition, not by any single model or check.
The independence caveat is the whole ballgame. The product rule holds only to the degree verifiers catch different defect families. We do not claim multiplicative protection for correlated checks; the Eval Owner's coverage matrix (§3.1) documents which mᵢ are credibly independent. Treating correlated checks as independent is the central statistical lie of "eval theater" (§11) and is explicitly disallowed.
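The arithmetic in this subsection can be checked with a few lines. The sketch also converts the joint escape bound into the share of admitted artifacts that are defective (the complement of gate precision); the `good_pass_rate` value of 0.95 is an illustrative assumption, not a program figure.

```python
from math import prod


def escape_bound(p_def: float, miss_probs: list[float]) -> float:
    """Upper bound on P(defective AND admitted), assuming independent checks."""
    return p_def * prod(miss_probs)


def admitted_defect_share(
    p_def: float,
    miss_probs: list[float],
    good_pass_rate: float,
) -> float:
    """P(defective | admitted), i.e. 1 - gate precision.

    good_pass_rate is the probability that a correct artifact passes
    every verifier (an assumed input for this illustration).
    """
    bad = escape_bound(p_def, miss_probs)
    good = (1.0 - p_def) * good_pass_rate
    return bad / (bad + good)


independent = escape_bound(0.2, [0.1, 0.1, 0.1])
share = admitted_defect_share(0.2, [0.1, 0.1, 0.1], good_pass_rate=0.95)
# If two of the three checks are perfectly correlated they act as one,
# so only two miss-probabilities may be credited.
correlated = escape_bound(0.2, [0.1, 0.1])

print(f"independent bound: {independent:.1e}")
print(f"defective share of admitted: {share:.2e}")
print(f"bound with two correlated checks: {correlated:.1e}")
```

Under these assumptions the independent case stays below the 1e-3 ceiling on both measures, while crediting only two effective checks gives 2e-3, which breaches it. That is the practical cost of mistaking correlated verifiers for independent ones.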
6.2 Statistical acceptance by safety class
We never assert 99.9% from a point estimate. We require a lower confidence bound from a sample of sufficient size, scaled by safety class (risk-based, per CSA/GAMP 5).
| Safety class (IEC 62304) | Acceptance criterion (placeholder) | Sampling rigor | Human sign-off |
|---|---|---|---|
| Class A (no injury) | Wilson/Clopper–Pearson lower 95% bound on gate precision ≥ 99.0% | Standard golden suite | Optional / sampled |
| Class B (non-serious injury) | Lower 95% bound ≥ 99.9%; escape-rate ≤ 1e-3 | Expanded suite + property/mutation depth | Required single reviewer |
| Class C (death/serious injury) | Lower 95–99% bound ≥ 99.9%; formal checks on safety functions; escape-rate ceiling set per risk file (14971) | Maximum rigor; formal methods where feasible | Required dual human control (P3) — non-negotiable |
To demonstrate a true rate ≥ 99.9% with a 95% lower confidence bound and zero observed failures, the rule-of-three approximation requires roughly n ≈ 3 / (1 − 0.999) ≈ 3000 independent passing items; any observed failure raises the required n substantially. Sample-size targets per suite are recorded with the eval suite version (§8). These are gate thresholds; they do not replace required human review.
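The sample-size reasoning can be reproduced with the standard library. This is a sketch under stated assumptions: the one-sided z value of 1.645 is one reading of "lower 95% bound", and the Wilson score formula is used as the worked interval; the program's ratified method may differ.

```python
import math


def rule_of_three_n(target: float) -> float:
    """Approximate n for a 95% bound with zero failures: 3 / (1 - target)."""
    return 3.0 / (1.0 - target)


def exact_n_zero_failures(target: float, confidence: float = 0.95) -> int:
    """Smallest n with target**n <= 1 - confidence (zero observed failures)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(target))


def wilson_lower_bound(passes: int, n: int, z: float = 1.645) -> float:
    """Lower end of the Wilson score interval for a pass proportion.

    z = 1.645 corresponds to a one-sided 95% bound (an assumption here).
    """
    p_hat = passes / n
    denom = 1.0 + z * z / n
    centre = p_hat + z * z / (2.0 * n)
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    return (centre - margin) / denom


print(round(rule_of_three_n(0.999)))      # about 3000
print(exact_n_zero_failures(0.999))       # 2995
print(wilson_lower_bound(3000, 3000) >= 0.999)
print(wilson_lower_bound(2999, 3000) >= 0.999)
```

With 3000 items and zero failures the lower bound clears 99.9%; a single failure in the same 3000 drops it below the threshold, which is the "any observed failure raises the required n" effect in concrete terms.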
6.3 Why Class C still needs human sign-off regardless of score
A 99.9% gate is a statistical statement about a population. A single Class C artifact may be the 1-in-1000 that the population statistics tolerate but a patient cannot. IEC 62304, ISO 14971, and CSA intended-use logic all require human judgment proportional to harm. Therefore dual human control on Class C is independent of the eval score — even a hypothetical 100% historical pass-rate does not waive it (P3). The eval score informs the reviewers; it never replaces them.
6.4 Abstention as a yield lever
A model that can say "I don't know" and escalate (per the fleet design in 04) converts a potential false pass into an honest escalation. Mathematically, abstention removes the hardest, least-certain items from the admitted population, which raises precision (b) and lowers escape-rate (c) at the cost of throughput (FPY) and human effort. We therefore treat a calibrated abstention-rate as a feature to tune, not a failure to suppress. Suppressing abstention to inflate FPY is an anti-pattern (§11).
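A toy calculation shows the trade. It assumes abstention is driven by a confidence floor, which is only one possible mechanism, and the six items are invented for illustration.

```python
from typing import Sequence


def gate_stats(
    items: Sequence[tuple[float, bool]],
    min_confidence: float,
) -> tuple[float, float]:
    """Return (precision, abstention_rate) for a given confidence floor.

    Each item is (confidence, is_correct). Items below the floor are
    abstained on and escalated instead of being admitted.
    """
    admitted = [ok for conf, ok in items if conf >= min_confidence]
    abstained = len(items) - len(admitted)
    precision = sum(admitted) / len(admitted) if admitted else float("nan")
    return precision, abstained / len(items)


# Invented data: one defective item and one correct item sit at low confidence.
items = [
    (0.99, True), (0.97, True), (0.95, True),
    (0.90, True), (0.60, False), (0.55, True),
]
print(gate_stats(items, min_confidence=0.0))
print(gate_stats(items, min_confidence=0.8))
```

Raising the floor removes the defective item from the admitted population, but it also sends one correct item to a human. That escalated correct item is the throughput and effort cost the section describes.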
7. Validation under FDA CSA / GAMP 5 / IEC 62304
The agent and its harness are production/quality-system software that must be validated — not merely a developer convenience. We apply FDA Computer Software Assurance (risk-based assurance, intended use, recorded evidence), GAMP 5 (2nd ed) risk-based CSV, IEC 62304 V&V activities, and ISO/IEC 42001 for the AI management system.
| CSV/CSA element | How it is satisfied here |
|---|---|
| Intended-use statement | Each agent/harness component has a documented intended use, operational boundaries, and the safety classes it may act on (drives risk-based rigor) |
| Risk-based test rigor | Test depth scales by IEC 62304 class (§6.2); CSA "unscripted vs scripted" assurance applied proportional to harm |
| IQ analog | Harness, model fleet, verifier toolchain deployed to K8s with pinned versions; installation verified and recorded (ties to 03) |
| OQ analog | Each verifier and gate exercised against golden suites; deterministic verdicts reproduced; thresholds demonstrated |
| PQ analog | End-to-end performance on representative change-sets at target precision/escape-rate; continuous in production (§9) |
| Documented/recorded evidence | Every gate emits a signed evidence bundle; nothing is asserted without a record (P4, Part 11) |
| Traceability | requirement → spec → code → test → eval → release, linked bidirectionally and stored immutably (01, 03) |
7.1 Revalidation triggers
Validation is a state, not an event. Any of the following triggers risk-assessed revalidation, scoped by impact:
- Model change — new fine-tune, base-weight update, quantization, or decoding-config change (ties to 04).
- Harness change — new/updated verifier, gate-threshold change, repair-budget change, orchestration change (06).
- Eval-dataset change — new golden items, rubric change, contamination remediation (§8).
- Drift — behavioral/score drift past a control limit in production (§9).
- Regulatory/process change — change in intended use, standard revision, or risk-file update (14971).
Each trigger, the affected scope, and the revalidation outcome are recorded; the traceability graph identifies the blast radius automatically.
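Blast-radius identification can be pictured as a reachability query over the traceability graph. The sketch below assumes a simple adjacency mapping with made-up node identifiers; the real graph schema is defined elsewhere in the program (01, 03).

```python
from collections import deque


def blast_radius(edges: dict[str, list[str]], changed: str) -> set[str]:
    """Return every node reachable downstream of `changed` (excluding itself)."""
    seen: set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt != changed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical dependency edges: "X -> things whose validation depends on X".
edges = {
    "model:M@v7": ["gate:class-b"],
    "verifier:sast@2.1": ["gate:class-b", "gate:class-c"],
    "gate:class-b": ["suite:formatting@1.4"],
    "gate:class-c": [],
}
print(sorted(blast_radius(edges, "verifier:sast@2.1")))
print(sorted(blast_radius(edges, "gate:class-c")))
```

A verifier update touches both gates and the suite behind one of them, while a change at a leaf node has an empty radius; the revalidation scope can then be risk-assessed per affected node.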
8. Eval suites as controlled artifacts
Eval suites are controlled, validated artifacts with the same rigor as the software they judge — because under CSA they are quality-system software.
- Versioned & signed — each suite has a semantic version, content hash, and a Part 11 signature; gate runs record exactly which suite version produced the verdict (P7).
- Owned — an explicit Eval Owner role is accountable for suite integrity, the defect-family coverage matrix (§3.1), sample-size adequacy (§6.2), and contamination control. The Eval Owner is segregated from the model-training function (segregation of duties, policy-as-code enforced — §3, 07).
- Contamination control vs fine-tuning data — golden eval items are held out of, and continuously diffed against, fine-tuning corpora (ties to 04). Any leakage invalidates the affected results and triggers re-curation. Provenance and hashes of eval vs train sets are recorded so contamination is detectable, not merely asserted.
- Regression capture — every production escape becomes a new golden item (the closed loop in §2). This is the mechanism by which the system learns from failures and is the L5 "self-optimizing" behavior in 02 — but the new item enters the controlled suite through the same versioning/sign-off, never silently.
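Content hashing and train/eval diffing from the list above can be sketched with `hashlib`. This is an exact-match illustration only: the item fields are hypothetical, and paraphrased or near-duplicate leakage would need fuzzy matching that is not shown.

```python
import hashlib
import json


def item_digest(item: dict) -> str:
    """SHA-256 over a canonical JSON encoding of one item."""
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def suite_digest(items: list[dict]) -> str:
    """Order-independent content hash for a whole suite version."""
    joined = "\n".join(sorted(item_digest(i) for i in items))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def contaminated(eval_items: list[dict], train_items: list[dict]) -> set[str]:
    """Digests of eval items that also appear verbatim in the training set."""
    train = {item_digest(i) for i in train_items}
    return {item_digest(i) for i in eval_items} & train


# Hypothetical items; field names are assumptions for this sketch.
eval_items = [
    {"id": "g-1", "prompt": "format date", "expected": "2026-05-01"},
    {"id": "g-2", "prompt": "format time", "expected": "13:00"},
]
train_items = [
    {"expected": "2026-05-01", "id": "g-1", "prompt": "format date"},  # leaked
    {"id": "t-9", "prompt": "format currency", "expected": "1.00"},
]
print(suite_digest(eval_items))
print(len(contaminated(eval_items, train_items)))
```

Canonical JSON encoding makes the digest independent of key order, so the leaked item is caught even though its fields were serialized differently in the training corpus.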
9. Continuous evaluation & monitoring in production
A gate that is only run pre-merge cannot detect post-deployment drift. Continuous evaluation closes the loop (PQ analog, §7).
| Capability | What it does | Ties to |
|---|---|---|
| Live eval | Periodically re-run golden suites against the current deployed fleet+harness; alert on regression | §6.2 thresholds |
| Drift detection | Track score/behavior distributions against control limits; flag shift | §5, §7.1 revalidation |
| Auto-regression | Re-run the captured-failure golden items on every model/harness change to prevent recurrence | §8 |
| MTTR | Track mean-time-to-remediate gate regressions and production escapes; an operational KPI | 09 |
| Eval-cost budgeting | Cost-aware scheduling of expensive checks (mutation, fuzzing, N-sample, formal); spend governed per cost-per-green-PR (P6) | 08 |
Expensive deterministic checks are scheduled risk-proportionally: always-on for Class C critical paths, sampled or nightly for lower-risk surfaces, so assurance scales without unbounded GPU/compute cost (08).
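A control-limit check for drift detection might look like the following. It assumes a simple mean ± 3σ band over baseline pass rates, which is one common control-chart convention and not necessarily the limit this program will adopt; the baseline values are invented.

```python
from statistics import mean, stdev


def control_limits(
    baseline: list[float],
    sigmas: float = 3.0,
) -> tuple[float, float]:
    """Return (lower, upper) limits as mean +/- sigmas * sample std dev."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - sigmas * spread, centre + sigmas * spread


def drifted(baseline: list[float], observed: float, sigmas: float = 3.0) -> bool:
    """True if the observed score falls outside the control limits."""
    low, high = control_limits(baseline, sigmas)
    return not (low <= observed <= high)


# Invented baseline of golden-suite pass rates from earlier live-eval runs.
baseline = [0.9990, 0.9992, 0.9991, 0.9989, 0.9993]
print(control_limits(baseline))
print(drifted(baseline, 0.9991))
print(drifted(baseline, 0.9980))
```

A flagged observation would be recorded as a drift trigger and routed into the risk-assessed revalidation path in §7.1 rather than acted on automatically.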
10. Worked example: a Class B code change through every gate
A change-set modifies a Class B data-formatting routine. It is dispatched to a mid-tier (M) model with a constrained, spec-conditioned prompt.
| Step | Action | Evidence produced |
|---|---|---|
| 0. Intent | Requirement + Gherkin acceptance criteria retrieved and linked | Trace edge: requirement → spec |
| 1. Generate | N=4 candidates via structured decoding; agent first writes a failing test | 4 candidate diffs + 1 test, OTel trajectory trace |
| 2. Build/type/schema | Gate 0 run on all candidates; 1 fails compile, dropped | Compile + type-check logs (deterministic) |
| 3. Unit + property | Surviving candidates run against unit + property suites; verifier selects the passing candidate | Test results, property seeds recorded |
| 4. Mutation | Mutation run confirms the test suite kills injected faults (not vacuously green) | Mutation score report |
| 5. SAST/SCA | Security scan clean; dependencies unchanged | SAST/SCA report (07) |
| 6. Eval Gate | Golden suite for this surface meets Class B criterion (lower 95% bound ≥ 99.9%) | Statistical acceptance record (§6.2) |
| 7. LLM-as-judge | Secondary rationale check — advisory only; agrees, logged, does not gate | Judge model+prompt+version, score (non-gating) |
| 8. HITL | Single qualified reviewer (Class B) approves with comments | Signed human review record (Part 11) |
| 9. Release record | Signed evidence bundle assembled; full traceability sealed | requirement→spec→code→test→eval→release graph |
Had any deterministic gate failed, the diagnostics would feed the bounded repair loop; on budget exhaustion the item would abstain and escalate, never force-pass. Had this been Class C, step 8 would require dual human control and step 6 would add formal/specification checks regardless of the eval score (§6.3).
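The evidence column above could be collected into a tamper-evident bundle along these lines. This is a sketch only: an HMAC over canonical JSON stands in for a real Part 11 electronic signature, and the record fields and key handling are hypothetical.

```python
import hashlib
import hmac
import json


def _payload(records: list[dict]) -> bytes:
    # Canonical encoding so the same records always hash the same way.
    return json.dumps(records, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal_bundle(records: list[dict], key: bytes) -> dict:
    """Return the records plus a SHA-256 digest and an HMAC tag over them."""
    payload = _payload(records)
    return {
        "records": records,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "hmac_sha256": hmac.new(key, payload, hashlib.sha256).hexdigest(),
    }


def verify_bundle(bundle: dict, key: bytes) -> bool:
    """Recompute the tag over the stored records and compare in constant time."""
    expected = hmac.new(key, _payload(bundle["records"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, bundle["hmac_sha256"])


# Hypothetical evidence records for three of the steps in the table.
records = [
    {"step": 2, "gate": "build/type/schema", "verdict": "PASS"},
    {"step": 6, "gate": "eval", "verdict": "PASS", "suite_version": "1.4.0"},
    {"step": 8, "gate": "hitl", "verdict": "APPROVED", "reviewers": 1},
]
key = b"demo-key-not-for-production"
bundle = seal_bundle(records, key)
print(verify_bundle(bundle, key))

bundle["records"][1]["verdict"] = "FAIL"  # simulate tampering after sealing
print(verify_bundle(bundle, key))
```

Any edit to a sealed record changes the recomputed tag, so the second check fails; a production system would replace the shared key with attributable electronic signatures and immutable storage.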
11. Anti-patterns
| Anti-pattern | Why it is dangerous | Countermeasure |
|---|---|---|
| Eval theater | Impressive dashboards over weak/correlated checks; the 99.9% is unbacked | Coverage matrix + independence audit (§3.1); mutation testing on the eval suite itself |
| LLM-as-sole-gate | A probabilistic judge admitting B/C artifacts violates P2; no reproducible verdict | Hard rule: deterministic admit for B/C; judge is secondary/escalation only (§5) |
| Training on the eval set | Contamination inflates scores and hides true escape-rate | Held-out signed suites, train/eval diffing, provenance hashes (§8, 04) |
| Gaming first-pass yield by hiding repairs | FPY treated as a target → incentive to suppress/mask repair iterations and abstentions | FPY is an economic metric, never a gate; repairs and abstentions are recorded and audited (§1.1, §6.4) |
| Treating correlated checks as independent | Overstates the composition math; real escape-rate higher than claimed | Document independence; only credit independent miss-probabilities (§6.1) |
| Waiving Class C human control on a high score | Statistics about a population do not protect the individual patient | Dual control on C is independent of eval score (P3, §6.3) |
Summary
We do not sell a 99.9% model; we engineer a 99.9% system. Deterministic verifiers on the critical path, composed for independence and proven with statistical acceptance, produce the headline release-gate correctness and the low escape-rate. Probabilistic evaluation stays strictly secondary, abstention is a deliberate yield lever, human control scales with IEC 62304 risk, and every verdict is signed, traceable, reproducible evidence under CSA / GAMP 5 / Part 11. The eval system itself is validated, owned, and version-controlled — because in a regulated medical-device SDLC, the harness is the product (P5), and its evidence is the validation.