05 — Evaluation & Validation: Earning 99.9%

Figure B — The 99.9% Release-Gate Assurance Pipeline

Part of Agentic-Native SDLC for Regulated Medical Device Engineering.

Status: Reference baseline, May 2026. Thresholds shown are org-set placeholders pending QMS ratification.

Audience: engineering, quality/QA, regulatory affairs, validation (CSV/CSA) owners.

Cross-references: 01-requirements.md, 02-maturity-model.md, 03-reference-architecture.md, 04-model-strategy-and-finetuning.md, 06-agentic-workflows.md, 07-security-and-compliance.md, 08-token-and-gpu-economics.md, 09-adoption-roadmap.md.

This document is the core of the 99.9% argument. Everything else in the program — the model fleet, the harness, the maturity ladder — exists to feed and be governed by the evaluation and validation system described here. The thesis is Principle P1: 99.9% is a system property, not a model property. No open-weight model, fine-tuned or otherwise, will be claimed to be 99.9% correct. Instead we engineer a pipeline — Generate → Verify → Repair → Gate — whose aggregate release-gate correctness reaches the target, and we prove it with deterministic, recorded, reproducible evidence (P2, P4, P7).

1. Reframing the 99.9%: what number are we actually defending?

The single most common error in agentic-SDLC programs is conflating three different quantities under one "99.9%" banner. They are measured differently, owned differently, and validated differently.

#	Quantity	Definition	Who is responsible	Do we engineer it?
(a)	Model accuracy	Probability a single model invocation produces a correct artifact on a representative task distribution	Model strategy (04)	No — informative, never gated on directly
(b)	System release-gate correctness	Probability that an artifact admitted past the release gate is correct, given all verifiers, repair, and HITL	Harness + Eval (this doc)	Yes — primary target ≥ 99.9%
(c)	Escape-rate	Probability that a defect reaches a regulated artifact (released code, DHF/DHR record, submission content) undetected	Whole SDLC + post-market	Yes — minimized; the safety-relevant number

Stated plainly: we do not promise model accuracy (a). We engineer system correctness (b) and we drive down escape-rate (c). A model that is only 92% accurate but abstains or fails loudly on the other 8% can feed a 99.9% system, because the system does not admit the bad 8%; it routes those items to repair, escalation, or human control. Conversely, a 99% model with weak gating can have a worse escape-rate than a 90% model with strong gating. The model is a generator; the gate is the guarantee.

1.1 Metric definitions (precise)

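Let the unit of evaluation be a change-set (a PR-sized artifact) or an eval item (a single golden task). Define a "correct" outcome against the deterministic ground truth for that item. Treating "the gate passes the artifact" as the positive prediction: TP is a correct artifact the gate passes, FP is a defective artifact the gate passes, FN is a correct artifact the gate rejects, and TN is a defective artifact the gate rejects.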

Metric	Formula	Meaning / use
Precision (of the gate)	TP / (TP + FP)	Of artifacts the gate passes, fraction actually correct. This is (b). Drives release-gate correctness.
Recall (of the gate)	TP / (TP + FN)	Of correct artifacts, fraction the gate passes. A proxy for first-pass throughput; low recall = wasted generation, not a safety issue.
Escape-rate	(defects reaching a regulated artifact) / (total artifacts released)	(c). The post-gate failure probability. Target is an org-set ceiling, e.g. ≤ 1e-3, tracked per safety class.
Abstention-rate	(abstained or `IDK` items) / (total items)	How often the system declines and escalates rather than guessing. A yield lever, not a defect (§6.4).
First-Pass Yield (FPY)	(items passing all gates with zero repair iterations) / (total items)	Generation quality + harness efficiency. An economic metric (P6, ties to 08), not a safety gate.
Repair-adjusted yield	(items passing after ≤ K repair iterations) / (total items)	Throughput including the bounded repair loop.
False-pass rate (FPR_gate)	FP / (TP + FP)	Of artifacts the gate admits, fraction that are defective, i.e. 1 − precision. The gate-boundary counterpart of escape-rate; the number a CSA risk assessment cares about most.

A critical conceptual rule, established here and enforced throughout: precision and escape-rate are the safety metrics; recall and FPY are the economic metrics. When the two trade off, safety wins — we prefer a system that abstains and escalates (lower FPY, higher cost) over one that admits a marginal artifact (higher FPY, higher escape-rate). This is a direct expression of risk-proportional autonomy (P3).
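As a minimal sketch, the gate metrics above can be computed from confusion counts taken at the gate boundary. The `GateCounts` container and the example counts are illustrative assumptions, not figures from the program. Admitted artifacts are the positives: correct and admitted is TP, defective and admitted is FP, correct and rejected is FN, defective and rejected is TN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateCounts:
    """Hypothetical container for outcomes at the gate boundary."""
    tp: int  # correct artifact, admitted
    fp: int  # defective artifact, admitted (a false pass)
    fn: int  # correct artifact, rejected
    tn: int  # defective artifact, rejected


def precision(c: GateCounts) -> float:
    """Share of admitted artifacts that are correct: quantity (b)."""
    return c.tp / (c.tp + c.fp)


def recall(c: GateCounts) -> float:
    """Share of correct artifacts the gate admits (an economic metric)."""
    return c.tp / (c.tp + c.fn)


def false_pass_share(c: GateCounts) -> float:
    """Share of admitted artifacts that are defective: 1 - precision."""
    return c.fp / (c.tp + c.fp)


# Illustrative counts only.
counts = GateCounts(tp=9000, fp=5, fn=700, tn=295)
print(f"precision={precision(counts):.5f}")
print(f"recall={recall(counts):.5f}")
print(f"false-pass share={false_pass_share(counts):.5f}")
```

In this toy data the gate rejects 700 correct artifacts, which costs throughput, while only 5 defective artifacts are admitted. That is the trade the rule above prefers: lower recall in exchange for higher precision.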

2. The assurance pipeline

Every artifact produced by an agent — code, test, spec, config, document fragment — flows through the same assurance pipeline before it can become evidence or be released. Determinism dominates the critical path (P2): probabilistic checks may inform and escalate, but they may never be the sole admit/reject decision for Class B or Class C work.

flowchart TD
    A[Intent / Spec<br/>requirement, Gherkin, ticket] --> B[Generate<br/>model fleet S/M/L/V/E<br/>constrained + spec-conditioned decoding]
    B --> C{Deterministic<br/>Verify Layer}
    C -- fail --> D[Repair Loop<br/>bounded budget K<br/>feed verifier output back]
    D --> B
    C -- pass --> E{Eval Gate<br/>golden suites + thresholds<br/>statistical acceptance}
    E -- below threshold --> D
    E -- secondary signal only --> F[LLM-as-judge / rubric<br/>NON-gating for B/C<br/>escalation trigger only]
    F --> G
    E -- pass --> G{HITL<br/>risk-proportional<br/>Class C = dual control}
    G -- changes requested --> D
    G -- abstain / IDK --> H[Escalate to human owner]
    G -- approved --> I[Release Record<br/>signed evidence bundle<br/>traceability + Part 11 audit]
    D -- budget exhausted --> H
    I --> J[(Production / DHF / DHR<br/>regulated artifact)]
    J -. production failures captured .-> K[Regression Capture<br/>new golden items -> §8]
    K -. closed loop L5 .-> E

Stage	Role	Why it sits here
Generate	Produce candidate artifact(s); may emit N samples for self-consistency	Probabilistic by nature; structure constraints are the cheapest way to make wrong output fail loudly
Deterministic Verify	Hard, reproducible accept/reject from compilers, tests, analyzers, schema/policy checks	The workhorse. Same input → same verdict, always (P7). This is what makes the gate auditable.
Repair loop	Feed verifier diagnostics back to the generator; retry within a bounded budget K	Converts a probabilistic generator into a system that converges; budget prevents infinite/cost-runaway loops
Eval Gate	Compare against versioned golden suites + statistical acceptance criteria by safety class	Decides admit/reject on evidence, not vibes; applies confidence-interval thresholds (§6)
HITL	Human review proportional to IEC 62304 class; Class C always dual control (P3)	Eval score never substitutes for required human judgment on high-risk artifacts
Release record	Emit a signed, immutable evidence bundle (Part 11) with full traceability	Everything is evidence (P4); the record is the validation artifact
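The Generate, Verify, and Repair stages can be sketched as a bounded loop. This is an illustration under stated assumptions, not the harness implementation: `generate` and `verify` are hypothetical callables standing in for the model fleet and the deterministic verifier suite, and the Eval Gate, HITL, and release-record stages are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Verdict:
    passed: bool
    diagnostics: str = ""


def generate_verify_repair(
    spec: str,
    generate: Callable[[str, str], str],
    verify: Callable[[str], Verdict],
    repair_budget: int,
) -> Optional[str]:
    """Return a candidate that passed verification, or None to escalate."""
    feedback = ""
    for _ in range(repair_budget + 1):  # first attempt plus K repairs
        candidate = generate(spec, feedback)
        verdict = verify(candidate)
        if verdict.passed:
            return candidate  # moves on to the Eval Gate and HITL
        feedback = verdict.diagnostics  # verifier output drives the repair
    return None  # budget exhausted: abstain and escalate, never force-pass


# Toy stand-ins: the generator only fixes the defect after seeing diagnostics.
def toy_generate(spec: str, feedback: str) -> str:
    return "return a + b" if feedback else "return a - b"


def toy_verify(candidate: str) -> Verdict:
    ok = candidate == "return a + b"
    return Verdict(ok, "" if ok else "test_add failed: expected 3, got -1")
```

With a budget of one repair the toy generator converges on its second attempt; with a budget of zero the function returns `None`, which corresponds to the "budget exhausted" edge into escalation.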

3. Deterministic verification layer (the workhorse)

This layer is why the program works. Principle P5: the harness is the product. Each verifier is a deterministic function verify(artifact, context) → {PASS, FAIL(diagnostics)} that is reproducible (P7), versioned, and recorded. They compose: an artifact is admitted only if it survives the conjunction of all applicable verifiers for its class.

Verifier class	What it guarantees	Critical-path position	Determinism
Build / compile	Artifact is syntactically valid and integrates into the target	Gate 0 — runs first; cheap, eliminates gross failures	Fully deterministic
Type systems / static typing	Type-level contracts hold; whole classes of interface defects excluded	Gate 0/1 — pre-test	Fully deterministic
Unit tests	Specified behaviors hold on chosen inputs	Gate 1 — core correctness	Deterministic given seeded fixtures
Property-based tests	Invariants hold across generated input distributions, not just examples	Gate 1/2 — depth beyond unit	Deterministic with fixed seed; record seed in evidence
Mutation testing	The test suite itself is adequate (kills injected faults) — guards against vacuous green	Gate 2 — meta-check on tests	Deterministic; expensive, sampled or scheduled
Differential / metamorphic testing	New implementation matches a reference oracle, or obeys metamorphic relations when no oracle exists	Gate 2 — oracle problem mitigation	Deterministic
Fuzzing	No crashes/UB/memory-safety violations on adversarial inputs	Gate 2/3 — robustness	Deterministic per corpus + seed; time-boxed
SAST / DAST / SCA	No known-pattern vulnerabilities, runtime exposures, or vulnerable dependencies	Gate 2 — security (ties to 07)	Deterministic given ruleset + version pinning
Schema / contract validation	Interfaces, data, and API contracts conform; structured outputs are well-formed	Gate 0/1 — fast structural guard	Fully deterministic
Formal methods / specification checks	Critical properties provably hold (model checking, SMT, refinement) — reserved for Class C safety functions	Gate 3 — strongest, narrowest	Deterministic (proof or counterexample)
Policy-as-code	Org/regulatory rules enforced mechanically (licensing, banned APIs, segregation of duties, sign-off rules)	Gate at every stage — governance	Fully deterministic

3.1 How composition drives escape-rate down

Verifiers are arranged as a defense-in-depth conjunction, ordered cheap-to-expensive so that most bad artifacts are killed early at low cost (P6). If verifier i has independent miss-probability mᵢ (it lets a defect through), and a defect must evade all of them to escape, the residual escape probability is bounded by the product ∏ mᵢ — provided the verifiers are sufficiently independent (catch different defect families). Independence is the design objective: type checkers catch interface defects, property tests catch invariant violations, mutation testing catches weak tests, SAST catches vulnerability patterns. Correlated verifiers (e.g., two SAST tools sharing a ruleset) do not multiply their protection, and we do not claim they do (§6.1, §11). The Eval Owner (§8) maintains a documented defect-family-to-verifier coverage matrix so that the independence claim underlying the math is auditable rather than assumed.
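A minimal sketch of the cheap-to-expensive conjunction follows. The `Verifier` tuple, the `relative_cost` rank, and the three lambda checks are hypothetical stand-ins for real toolchain invocations; the point is only the ordering and the short-circuit.

```python
from typing import Callable, List, NamedTuple, Optional


class Verifier(NamedTuple):
    name: str
    relative_cost: int            # hypothetical cost rank; lower runs first
    check: Callable[[str], bool]  # deterministic: same input, same verdict


def run_conjunction(artifact: str, verifiers: List[Verifier]) -> Optional[str]:
    """Return None if every verifier passes, else the first failing name."""
    for v in sorted(verifiers, key=lambda v: v.relative_cost):
        if not v.check(artifact):
            return v.name  # short-circuit: skip the more expensive checks
    return None


# Toy checks, deliberately listed out of cost order.
suite = [
    Verifier("property-tests", 3, lambda a: "overflow" not in a),
    Verifier("compile", 1, lambda a: a.strip() != ""),
    Verifier("schema", 2, lambda a: a.startswith("def ")),
]

print(run_conjunction("def f(): pass", suite))
print(run_conjunction("x = overflow", suite))
```

The second call fails at the schema check and never reaches the property tests, which is how ordering keeps the cost of rejecting bad artifacts low.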

4. Generation-time techniques that raise first-pass yield

These raise FPY and cut repair cost (an economic win, P6/08). They are not a substitute for the gate — they make the generator produce gate-passable artifacts more often.

Technique	Mechanism	Yield effect	Caveat
Constrained / structured decoding	Force output to a grammar/JSON schema/AST shape during generation	Eliminates malformed-output failures outright	Constrain form, not correctness — still must pass verifiers
Spec-conditioned generation (BDD/Gherkin)	Condition the model on executable acceptance criteria	Aligns generation to the exact gate it must pass	Spec quality bounds outcome quality
Retrieval grounding	Inject relevant internal code, standards, prior decisions into context	Reduces hallucinated APIs and policy violations	Retrieved context must itself be governed (07)
N-sample self-consistency + verifier selection	Generate N candidates; select by deterministic verifier outcome, not by model vote	Raises probability ≥1 candidate passes; selection stays deterministic	Cost ∝ N — budget per safety class (08)
Test-first generation	Agent writes a failing test from the spec, then code to satisfy it	Forces an explicit, checkable success criterion	Test must be reviewed so the agent doesn't write a weak/vacuous test (mutation testing guards this)
Bounded repair loops	Feed verifier diagnostics back; retry up to budget K	Recovers near-misses cheaply	Budget exhaustion → abstain/escalate, never force-pass

Selection rule (load-bearing): when using N-sample self-consistency, the selector is the deterministic verifier suite, not an LLM majority vote. Self-consistency raises the chance a good candidate exists; determinism decides which one is admitted. This keeps P2 intact even while exploiting probabilistic breadth.
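The selection rule can be illustrated with a toy case in which a majority of samples agree on a wrong answer. The function name and the sample candidates are assumptions made for the example; a real selector would run the full verifier suite for the artifact's safety class.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_by_verifier(
    candidates: Sequence[T],
    verifiers: Sequence[Callable[[T], bool]],
) -> Optional[T]:
    """Return the first candidate, in generation order, passing every verifier."""
    for candidate in candidates:
        if all(check(candidate) for check in verifiers):
            return candidate
    return None  # no candidate survived: go to repair or escalate


# Spec: double the input. Three of the four samples share the same defect.
samples = [
    lambda x: x * 2 + 1,
    lambda x: x * 2 + 1,
    lambda x: x * 2,
    lambda x: x * 2 + 1,
]
checks = [lambda f: f(3) == 6, lambda f: f(0) == 0]

chosen = select_by_verifier(samples, checks)
```

A majority vote over these samples would pick the defective variant. Verifier selection picks the third sample because it is the only one that passes, and taking the first passing candidate in generation order keeps the choice reproducible.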

5. Probabilistic evaluation (secondary, never sole gate)

Probabilistic evaluation has real value for coverage of qualities deterministic checks cannot express (clarity of a rationale, plausibility of a design narrative, reasonableness of a trajectory). It is admitted only under strict containment.

Probabilistic signal	What it assesses	Permitted role	Prohibited role
Golden datasets + rubrics	Behavior vs curated expected outcomes	Trend/regression metric; gate only where ground truth is deterministic	—
LLM-as-judge	Subjective quality, rationale adequacy	Secondary signal; escalation trigger; advisory on Class A only with human confirmation	Sole gate for any Class B or Class C artifact
Trajectory evaluation	Did the agent take valid steps / legal tool calls in a sane order (via OpenTelemetry traces)	Detects unsafe process even when output passes; flags for review	Cannot admit on its own
Behavioral drift detection	Distribution shift in outputs/scores vs a baseline	Monitoring + revalidation trigger (§9)	Cannot gate a single release

Gating rules (binding):

For Class B and Class C, every admit decision must rest on a deterministic verdict. Probabilistic signals may only block (raise concern → escalate) or inform; they may never admit (P2).
LLM-as-judge is never the sole gate for B/C, full stop. Where used, the judge model, prompt, rubric, and version are recorded as part of the evidence and are themselves subject to drift monitoring.
An LLM-as-judge disagreement with a deterministic verdict resolves in favor of the deterministic verdict, and the disagreement is logged for Eval Owner review.
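One possible reading of these rules for Class B and Class C, sketched as a pure function. The `GateOutcome` fields and the function name are hypothetical, and treating a judge concern on a deterministically passing artifact as an escalation is an assumption about how "block or inform" is applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateOutcome:
    admit_eligible: bool    # set only by the deterministic verdict
    escalate: bool          # a probabilistic signal raised a concern
    log_disagreement: bool  # judge and deterministic verdict differ


def combine_signals(deterministic_pass: bool, judge_pass: bool) -> GateOutcome:
    """Combine verdicts for Class B/C: the judge may block or inform, never admit."""
    return GateOutcome(
        admit_eligible=deterministic_pass,
        escalate=deterministic_pass and not judge_pass,
        log_disagreement=deterministic_pass != judge_pass,
    )


for det, judge in [(True, True), (True, False), (False, True), (False, False)]:
    print(det, judge, combine_signals(det, judge))
```

The case to notice is a deterministic failure with a judge pass: the artifact stays ineligible, and the disagreement is logged for the Eval Owner rather than used to overturn the verdict.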

6. The math of 99.9%

6.1 Composing imperfect independent checks

Suppose a generated artifact is defective with prior probability p_def. It must pass n independent verifiers to be admitted; verifier i misses a defect with probability mᵢ. The probability a defect escapes the gate is bounded by:

P(escape) ≤ p_def · ∏(i=1..n) mᵢ

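With three genuinely independent checks each missing 10% of defects (mᵢ = 0.1) and a 20% defect prior, P(escape) ≤ 0.2 · 0.1³ = 0.2 · 0.001 = 2e-4, already below a 1e-3 ceiling. Strictly, this is the probability that a generated artifact is both defective and admitted; the defective share of admitted artifacts (1 − precision) is that value divided by the overall admit rate, which is about 2.5e-4 here if every correct artifact passes. Add the bounded repair loop: every repaired artifact is re-subjected to the full conjunction, so repair never opens an easier path to admission. Each retry is also one more chance for a false pass, which is one reason the budget K is bounded. The headline number, release-gate correctness ≥ 99.9%, is therefore (b) precision achieved by composition, not by any single model or check.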

The independence caveat is the whole ballgame. The product rule holds only to the degree verifiers catch different defect families. We do not claim multiplicative protection for correlated checks; the Eval Owner's coverage matrix (§3.1) documents which mᵢ are credibly independent. Treating correlated checks as independent is the central statistical lie of "eval theater" (§11) and is explicitly disallowed.
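The arithmetic in this section can be checked with a short sketch. The function names and the `good_pass_rate` parameter (the share of correct artifacts the verifiers admit) are assumptions introduced for the example, and the product only holds under the independence condition just described.

```python
import math
from typing import Sequence


def escape_bound(p_def: float, miss_probs: Sequence[float]) -> float:
    """P(defective and admitted) under independent verifiers."""
    return p_def * math.prod(miss_probs)


def gate_precision(
    p_def: float,
    miss_probs: Sequence[float],
    good_pass_rate: float = 1.0,
) -> float:
    """Share of admitted artifacts that are correct."""
    bad_admitted = escape_bound(p_def, miss_probs)
    good_admitted = (1.0 - p_def) * good_pass_rate
    return good_admitted / (good_admitted + bad_admitted)


# The example from the text: 20% defect prior, three checks missing 10% each.
independent = escape_bound(0.2, [0.1, 0.1, 0.1])

# If two of the three checks are fully correlated, only one of them counts.
correlated = escape_bound(0.2, [0.1, 0.1])

print(f"independent: {independent:.1e}, precision {gate_precision(0.2, [0.1] * 3):.5f}")
print(f"correlated:  {correlated:.1e}, precision {gate_precision(0.2, [0.1] * 2):.5f}")
```

In the correlated case the escape bound rises from 2e-4 to 2e-3, which is above a 1e-3 ceiling. That is why crediting a correlated check as independent overstates the protection.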

6.2 Statistical acceptance by safety class

We never assert 99.9% from a point estimate. We require a lower confidence bound from a sample of sufficient size, scaled by safety class (risk-based, per CSA/GAMP 5).

Safety class (IEC 62304)	Acceptance criterion (placeholder)	Sampling rigor	Human sign-off
Class A (no injury)	Wilson/Clopper–Pearson lower 95% bound on gate precision ≥ 99.0%	Standard golden suite	Optional / sampled
Class B (non-serious injury)	Lower 95% bound ≥ 99.9%; escape-rate ≤ 1e-3	Expanded suite + property/mutation depth	Required single reviewer
Class C (death/serious injury)	Lower bound at 95–99% confidence ≥ 99.9%; formal checks on safety functions; escape-rate ceiling set per risk file (14971)	Maximum rigor; formal methods where feasible	Required dual human control (P3); non-negotiable

To demonstrate a true rate ≥ 99.9% with a 95% lower confidence bound and zero observed failures, the rule-of-three approximation requires roughly n ≈ 3 / (1 − 0.999) ≈ 3000 independent passing items; any observed failure raises the required n substantially. Sample-size targets per suite are recorded with the eval suite version (§8). These are gate thresholds; they do not replace required human review.
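The sample-size reasoning can be sketched in a few lines. With zero observed failures in n items, the exact one-sided lower bound on the pass rate at confidence 1 − α is α^(1/n), so the smallest adequate n is ln(α) / ln(target) rounded up. The Wilson score bound below shows the effect of an observed failure. Function names are hypothetical, and the one-sided interpretation of the 95% bound is an assumption.

```python
import math
from statistics import NormalDist


def zero_failure_sample_size(target: float, confidence: float = 0.95) -> int:
    """Smallest n of all-passing items whose exact lower bound reaches target."""
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log(target))


def wilson_lower_bound(passes: int, n: int, confidence: float = 0.95) -> float:
    """One-sided Wilson score lower bound on the true pass rate."""
    z = NormalDist().inv_cdf(confidence)
    p_hat = passes / n
    centre = p_hat + z * z / (2 * n)
    spread = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre - spread) / (1 + z * z / n)


n_required = zero_failure_sample_size(0.999)
print(n_required)                      # close to the rule-of-three figure
print(wilson_lower_bound(3000, 3000))  # zero failures in 3000 items
print(wilson_lower_bound(2999, 3000))  # one failure in 3000 items
```

Under these assumptions a single failure in 3000 items pulls the lower bound below 99.9%, so the suite would need to be considerably larger to meet the same criterion.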

6.3 Why Class C still needs human sign-off regardless of score

A 99.9% gate is a statistical statement about a population. A single Class C artifact may be the 1-in-1000 that the population statistics tolerate but a patient cannot. IEC 62304, ISO 14971, and CSA intended-use logic all require human judgment proportional to harm. Therefore dual human control on Class C is independent of the eval score — even a hypothetical 100% historical pass-rate does not waive it (P3). The eval score informs the reviewers; it never replaces them.

6.4 Abstention as a yield lever

A model that can say "I don't know" and escalate (per the fleet design in 04) converts a potential false pass into an honest escalation. Mathematically, abstention removes the hardest, least-certain items from the admitted population, which raises precision (b) and lowers escape-rate (c) at the cost of throughput (FPY) and human effort. We therefore treat a calibrated abstention-rate as a feature to tune, not a failure to suppress. Suppressing abstention to inflate FPY is an anti-pattern (§11).
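A toy illustration of the trade-off, assuming each item carries a calibrated confidence score and the system abstains below a threshold. The scores, the threshold values, and the function name are invented for the example.

```python
from typing import List, Tuple


def precision_and_abstention(
    items: List[Tuple[float, bool]],
    threshold: float,
) -> Tuple[float, float]:
    """Return (precision among answered items, abstention-rate)."""
    answered = [correct for confidence, correct in items if confidence >= threshold]
    abstention_rate = 1 - len(answered) / len(items)
    precision = sum(answered) / len(answered) if answered else float("nan")
    return precision, abstention_rate


# (confidence, was the output actually correct?) for six hypothetical items.
items = [
    (0.99, True), (0.95, True), (0.90, True),
    (0.60, False), (0.55, True), (0.40, False),
]

print(precision_and_abstention(items, threshold=0.0))  # never abstain
print(precision_and_abstention(items, threshold=0.8))  # abstain when unsure
```

Raising the threshold removes both defective items from the answered set, but it also escalates one item that would have been correct. That lost item is the throughput and human-effort cost described above.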

7. Validation under FDA CSA / GAMP 5 / IEC 62304

The agent and its harness are production/quality-system software that must be validated — not merely a developer convenience. We apply FDA Computer Software Assurance (risk-based assurance, intended use, recorded evidence), GAMP 5 (2nd ed) risk-based CSV, IEC 62304 V&V activities, and ISO/IEC 42001 for the AI management system.

CSV/CSA element	How it is satisfied here
Intended-use statement	Each agent/harness component has a documented intended use, operational boundaries, and the safety classes it may act on (drives risk-based rigor)
Risk-based test rigor	Test depth scales by IEC 62304 class (§6.2); CSA "unscripted vs scripted" assurance applied proportional to harm
IQ analog	Harness, model fleet, verifier toolchain deployed to K8s with pinned versions; installation verified and recorded (ties to 03)
OQ analog	Each verifier and gate exercised against golden suites; deterministic verdicts reproduced; thresholds demonstrated
PQ analog	End-to-end performance on representative change-sets at target precision/escape-rate; continuous in production (§9)
Documented/recorded evidence	Every gate emits a signed evidence bundle; nothing is asserted without a record (P4, Part 11)
Traceability	requirement → spec → code → test → eval → release, linked bidirectionally and stored immutably (01, 03)

7.1 Revalidation triggers

Validation is a state, not an event. Any of the following triggers risk-assessed revalidation, scoped by impact:

Model change — new fine-tune, base-weight update, quantization, or decoding-config change (ties to 04).
Harness change — new/updated verifier, gate-threshold change, repair-budget change, orchestration change (06).
Eval-dataset change — new golden items, rubric change, contamination remediation (§8).
Drift — behavioral/score drift past a control limit in production (§9).
Regulatory/process change — change in intended use, standard revision, or risk-file update (14971).

Each trigger, the affected scope, and the revalidation outcome are recorded; the traceability graph identifies the blast radius automatically.
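Identifying the blast radius amounts to a reachability query over the traceability graph. The sketch below assumes the graph is available as an adjacency mapping from each node to the nodes that depend on it; the node identifiers are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def blast_radius(graph: Dict[str, List[str]], changed: str) -> Set[str]:
    """Return every node downstream of the changed node (breadth-first)."""
    seen: Set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for downstream in graph.get(node, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return seen


# Hypothetical edges: "X depends on the validated state of Y" stored as Y -> X.
trace = {
    "model:M-v3": ["gate-run:101", "gate-run:102"],
    "verifier:sast-v7": ["gate-run:102"],
    "gate-run:101": ["release:fmt-routine"],
    "gate-run:102": ["release:dose-calc"],
}

print(sorted(blast_radius(trace, "verifier:sast-v7")))
print(sorted(blast_radius(trace, "model:M-v3")))
```

A verifier update touches one gate run and one release here, while a model change reaches both, which is the kind of scoping a risk-assessed revalidation would start from.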

8. Eval suites as controlled artifacts

Eval suites are controlled, validated artifacts with the same rigor as the software they judge — because under CSA they are quality-system software.

Versioned & signed — each suite has a semantic version, content hash, and a Part 11 signature; gate runs record exactly which suite version produced the verdict (P7).
Owned — an explicit Eval Owner role is accountable for suite integrity, the defect-family coverage matrix (§3.1), sample-size adequacy (§6.2), and contamination control. The Eval Owner is segregated from the model-training function (segregation of duties, policy-as-code enforced — §3, 07).
Contamination control vs fine-tuning data — golden eval items are held out of, and continuously diffed against, fine-tuning corpora (ties to 04). Any leakage invalidates the affected results and triggers re-curation. Provenance and hashes of eval vs train sets are recorded so contamination is detectable, not merely asserted.
Regression capture — every production escape becomes a new golden item (the closed loop in §2). This is the mechanism by which the system learns from failures and is the L5 "self-optimizing" behavior in 02 — but the new item enters the controlled suite through the same versioning/sign-off, never silently.
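A minimal sketch of hash-based contamination detection and suite fingerprinting, using only `hashlib` from the standard library. The normalization rule and function names are assumptions. Exact-match hashing only catches verbatim leakage after normalization, so a real control would likely add near-duplicate detection, and Part 11 signing is out of scope here.

```python
import hashlib
from typing import Iterable, Set


def item_hash(text: str) -> str:
    """SHA-256 of the item after whitespace and case normalization."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def leaked_items(eval_items: Iterable[str], train_items: Iterable[str]) -> Set[str]:
    """Eval items whose normalized hash also appears in the training corpus."""
    train_hashes = {item_hash(t) for t in train_items}
    return {e for e in eval_items if item_hash(e) in train_hashes}


def suite_content_hash(eval_items: Iterable[str]) -> str:
    """Order-independent fingerprint to record alongside the suite version."""
    digest = hashlib.sha256()
    for h in sorted(item_hash(e) for e in eval_items):
        digest.update(h.encode("ascii"))
    return digest.hexdigest()


golden = ["Format dose as mg with one decimal", "Reject negative flow rates"]
training = ["format dose as  MG with one decimal", "Log pump start events"]

print(leaked_items(golden, training))
print(suite_content_hash(golden))
```

The first golden item is flagged because it differs from a training item only in case and spacing. Recording the suite hash with each gate run makes a later change to the suite detectable.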

9. Continuous evaluation & monitoring in production

A gate that is only run pre-merge cannot detect post-deployment drift. Continuous evaluation closes the loop (PQ analog, §7).

Capability	What it does	Ties to
Live eval	Periodically re-run golden suites against the current deployed fleet+harness; alert on regression	§6.2 thresholds
Drift detection	Track score/behavior distributions against control limits; flag shift	§5, §7.1 revalidation
Auto-regression	Re-run the captured-failure golden items on every model/harness change to prevent recurrence	§8
MTTR	Track mean-time-to-remediate gate regressions and production escapes; an operational KPI	09
Eval-cost budgeting	Cost-aware scheduling of expensive checks (mutation, fuzzing, N-sample, formal); spend governed per cost-per-green-PR (P6)	08

Expensive deterministic checks are scheduled risk-proportionally: always-on for Class C critical paths, sampled or nightly for lower-risk surfaces, so assurance scales without unbounded GPU/compute cost (08).
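Drift detection against control limits can be sketched with a simple mean ± 3σ rule over a baseline of live-eval scores. The baseline values, the three-sigma choice, and the function names are illustrative assumptions; the program's actual control-limit method is not specified here.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def control_limits(baseline: Sequence[float], sigmas: float = 3.0) -> Tuple[float, float]:
    """Lower and upper control limits from a baseline of historical scores."""
    centre = mean(baseline)
    spread = sigmas * stdev(baseline)
    return centre - spread, centre + spread


def is_drift(score: float, limits: Tuple[float, float]) -> bool:
    """True when a new score falls outside the control limits."""
    lower, upper = limits
    return not (lower <= score <= upper)


# Hypothetical nightly golden-suite pass rates for the validated baseline.
baseline = [0.990, 0.992, 0.991, 0.989, 0.993, 0.990, 0.991]
limits = control_limits(baseline)

print(limits)
print(is_drift(0.991, limits))  # within normal variation
print(is_drift(0.975, limits))  # would trigger a revalidation review
```

A score outside the limits does not gate a single release on its own. It raises the drift trigger, and the risk-assessed revalidation decides what happens next.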

10. Worked example: a Class B code change through every gate

A change-set modifies a Class B data-formatting routine. It is dispatched to a mid-tier (M) model with a constrained, spec-conditioned prompt.

Step	Action	Evidence produced
0. Intent	Requirement + Gherkin acceptance criteria retrieved and linked	Trace edge: requirement → spec
1. Generate	N=4 candidates via structured decoding; agent first writes a failing test	4 candidate diffs + 1 test, OTel trajectory trace
2. Build/type/schema	Gate 0 run on all candidates; 1 fails compile, dropped	Compile + type-check logs (deterministic)
3. Unit + property	Surviving candidates run against unit + property suites; verifier selects the passing candidate	Test results, property seeds recorded
4. Mutation	Mutation run confirms the test suite kills injected faults (not vacuously green)	Mutation score report
5. SAST/SCA	Security scan clean; dependencies unchanged	SAST/SCA report (07)
6. Eval Gate	Golden suite for this surface meets Class B criterion (lower 95% bound ≥ 99.9%)	Statistical acceptance record (§6.2)
7. LLM-as-judge	Secondary rationale check — advisory only; agrees, logged, does not gate	Judge model+prompt+version, score (non-gating)
8. HITL	Single qualified reviewer (Class B) approves with comments	Signed human review record (Part 11)
9. Release record	Signed evidence bundle assembled; full traceability sealed	requirement→spec→code→test→eval→release graph

Had any deterministic gate failed, the diagnostics would feed the bounded repair loop; on budget exhaustion the item would abstain and escalate, never force-pass. Had this been Class C, step 8 would require dual human control and step 6 would add formal/specification checks regardless of the eval score (§6.3).

11. Anti-patterns

Anti-pattern	Why it is dangerous	Countermeasure
Eval theater	Impressive dashboards over weak/correlated checks; the 99.9% is unbacked	Coverage matrix + independence audit (§3.1); mutation testing on the eval suite itself
LLM-as-sole-gate	A probabilistic judge admitting B/C artifacts violates P2; no reproducible verdict	Hard rule: deterministic admit for B/C; judge is secondary/escalation only (§5)
Training on the eval set	Contamination inflates scores and hides true escape-rate	Held-out signed suites, train/eval diffing, provenance hashes (§8, 04)
Gaming first-pass yield by hiding repairs	FPY treated as a target → incentive to suppress/mask repair iterations and abstentions	FPY is an economic metric, never a gate; repairs and abstentions are recorded and audited (§1.1, §6.4)
Treating correlated checks as independent	Overstates the composition math; real escape-rate higher than claimed	Document independence; only credit independent miss-probabilities (§6.1)
Waiving Class C human control on a high score	Statistics about a population do not protect the individual patient	Dual control on C is independent of eval score (P3, §6.3)

Summary

We do not sell a 99.9% model; we engineer a 99.9% system. Deterministic verifiers on the critical path, composed for independence and proven with statistical acceptance, produce the headline release-gate correctness and the low escape-rate. Probabilistic evaluation stays strictly secondary, abstention is a deliberate yield lever, human control scales with IEC 62304 risk, and every verdict is signed, traceable, reproducible evidence under CSA / GAMP 5 / Part 11. The eval system itself is validated, owned, and version-controlled — because in a regulated medical-device SDLC, the harness is the product (P5), and its evidence is the validation.