04 - Model Strategy & Fine-Tuning
Document set: Agentic-Native SDLC for Regulated Medical Device Engineering
Status: Controlled engineering reference · Revision date: May 2026
Owning function: ML Platform & Quality Engineering
Cross-references: 01-requirements · 02-maturity-model · 03-reference-architecture · 05-evaluation-and-validation · 06-agentic-workflows · 07-security-and-compliance · 08-token-and-gpu-economics · 09-adoption-roadmap
Regulatory anchors: IEC 62304 · ISO 13485 / QMSR · ISO 14971 · FDA CSA · GAMP 5 · 21 CFR Part 11 · ISO/IEC 42001 · FDA AI-enabled device guidance + PCCP
Note on thresholds: All numeric thresholds in this document (e.g., ≥99.9%, abstention rates, eval gates) are placeholders pending calibration in 05. They denote intent and governance structure, not yet-ratified acceptance criteria.
1. Strategy rationale: why a tiered, multi-model fleet of open-weight models
The toolchain is the regulated artifact, not the device. Per Principle P5 (the harness is the product), the models are components inside a verification harness whose system-level correctness target is ≥99.9% at the release gate (P1). No single model — however large — is the source of that guarantee; the guarantee emerges from Generate → Verify → Repair → Gate loops where deterministic checks (P2: determinism wraps probabilism) bound probabilistic generation.
Given that, the model strategy optimizes for fitness-per-task at the lowest defensible cost rather than for the raw capability of a single frontier model.
1.1 Why tiered multi-model instead of one big model
| Driver | One big model | Tiered fleet (chosen) |
|---|---|---|
| Cost (P6: cost-per-green-PR) | Every autocomplete keystroke pays for 70B+ inference, which is economically fatal at 1000+ developers. | Tier-S (Reflex, 1–8B) handles the high-frequency 80% of requests; Tier-L (Reasoner) is reserved for rare hard plans. See 08. |
| Latency | 70B autocomplete = unusable IDE latency. | Tier-S sub-100ms-class on small GPUs; interactive paths never touch large models. |
| Specialization | Generalist regresses on niche tasks (embedded C, DICOM, regulatory drafting). | Per-domain LoRA adapters (§5) sharpen narrow tasks without retraining a monolith. |
| Abstention & calibration (§9) | Monolith over-commits; one calibration curve for all tasks. | Per-tier/per-task calibration; cheap router/classifier abstains early and escalates. |
| Blast radius / governed evolution (§8) | One promotion changes everything; PCCP scope is the whole org. | Adapter-scoped change control; canary one adapter without re-validating the fleet. |
| GPU packing | Coarse, wasteful allocation. | Multi-LoRA hot-swap on shared base weights; bin-pack tiers across the cluster. |
| Determinism (P2) | Harder to wrap one opaque large model in deterministic checks. | Small, fast verifiers and routers are themselves deterministic-friendly. |
The fleet is a portfolio: route each task to the smallest capable model (§10), escalate on low confidence, and let deterministic verifiers — not model scale — own the correctness guarantee.
1.2 Why open-weight + self-hosted (P7: self-hosted, sovereign, reproducible)
| Requirement | Why SaaS LLM APIs are disqualified | What open-weight self-host gives us |
|---|---|---|
| Reproducibility (P7, critical) | Vendor silently updates the model; yesterday's evidence is not reproducible. | We pin exact weights + tokenizer + config by digest, so a fine-tune can be re-derived from its locked inputs (§7). |
| CSA / Part 11 evidence (P4) | No control over model lineage; cannot sign the chain. | Full signed lineage dataset→base→adapter→eval→registry (§7) becomes defensible audit evidence. |
| Data sovereignty (PHI/PII/IP) | Source code, design history, and PHI-adjacent data leave the boundary. | All training and inference stay inside the K8s/VPC trust boundary (see 07). |
| Determinism & control | No control over sampling, decode, or version. | We fix seeds, decode params, and serving stack (vLLM/Triton). |
| Cost (P6) | Per-token vendor pricing scales with org size; opaque. | Amortized GPU cost, first-class and measurable (08). |
| Longevity | Models deprecated by vendor on vendor timeline. | We retain weights indefinitely for re-validation and regulatory defense. |
HARD CONSTRAINT (restated): Self-hosted, fine-tuned, open-weight models only. No SaaS LLM APIs anywhere in the SDLC toolchain.
2. The fleet specification
Five tiers. All tiers are served from the shared stack: K8s + NVIDIA GPU Operator, vLLM / Triton + TensorRT-LLM / KServe with multi-LoRA hot-swap. See 03 for serving topology.
| Tier | Name | Params | Example base models (open-weight) | Context window | Serving HW (per replica) | Quantization | Typical tasks | Why this tier |
|---|---|---|---|---|---|---|---|---|
| Tier-S | Reflex | 1–8B | Qwen2.5-Coder-1.5B / 7B, Llama-3.2-3B | 8–32k | 1× L4 / A10 (or MIG slice) | FP8 / INT8; AWQ/GPTQ-4bit for 1.5B | Autocomplete, classify, route, redact/PII-scrub, abstain-or-escalate gate | High-frequency, low-latency, cost-dominant path. Must be cheap (P6) and fast. |
| Tier-M | Worker | 14–34B | Qwen2.5-Coder-32B, StarCoder2-15B, DeepSeek-Coder-V2-Lite | 16–128k | 1–2× A100/H100 80GB | FP8; AWQ-4bit option | Test generation, refactor, code review, doc drafting, structured edits | Workhorse for bounded automation (ASMM-Med L2/L3). Strong code with feasible cost. |
| Tier-L | Reasoner | 70B+ / MoE | Llama-3.3-70B, Qwen2.5-72B, DeepSeek-V3 / R1-distill, Mixtral-8x22B | 32–128k | 2–8× H100 80GB (TP/PP) | FP8; INT4 for offline | Architecture, multi-step planning, hard root-cause, spec decomposition | Rare, high-value reasoning. Reserved; never on interactive hot paths. |
| Tier-V | Multimodal | 7–90B | Qwen2.5-VL, Llama-3.2-Vision, InternVL2, Pixtral | 8–128k | 1–4× A100/H100 80GB | FP8 / AWQ | Parse design PDFs, schematics, UI screenshots, imaging artifacts, diagram→spec | Inputs in this domain are visual (design history files, DICOM-adjacent). |
| Tier-E | Embedding / Rerank | 0.1–1.5B | bge, gte, jina-code, nomic-embed | 512–8k | 1× L4 / A10 (CPU fallback) | FP16 / INT8 | RAG retrieval, code search, dedup, contamination detection, rerank | Retrieval substrate for every agentic workflow (06). |
Routing summary. A Tier-S classifier/router triages every request: trivial requests are answered directly; ambiguous or high-risk requests escalate to Tier-M/L; visual inputs go to Tier-V; retrieval goes to Tier-E first. Routing policy is governed by the IEC 62304 software safety class of the affected artifact (P3: autonomy by class A/B/C), and each routing decision is recorded as evidence (P4).
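The routing summary above can be sketched as a small deterministic policy function. This is an illustration only, not the production router: the `Request` and `Decision` types, the `kind` labels, the per-class threshold values, and the rule that class C escalates directly to Tier-L are all assumptions made for the example. Real thresholds are calibrated in 05.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(str, Enum):
    S = "Tier-S"
    M = "Tier-M"
    L = "Tier-L"
    V = "Tier-V"
    E = "Tier-E"


@dataclass(frozen=True)
class Request:
    kind: str           # hypothetical labels: "text", "visual", "retrieval"
    safety_class: str   # IEC 62304 class of the affected artifact: "A", "B" or "C"
    confidence: float   # Tier-S router confidence that the task is trivial


@dataclass(frozen=True)
class Decision:
    tier: Tier
    needs_human: bool
    reason: str         # stored with the request as routing evidence (P4)


# Placeholder thresholds (assumed values): stricter for higher safety classes.
TRIVIAL_THRESHOLD = {"A": 0.90, "B": 0.95, "C": 0.99}


def route(req: Request) -> Decision:
    # Class C always requires human confirmation, whichever tier answers.
    needs_human = req.safety_class == "C"
    if req.kind == "retrieval":
        return Decision(Tier.E, needs_human, "retrieval goes to Tier-E first")
    if req.kind == "visual":
        return Decision(Tier.V, needs_human, "visual input")
    if req.confidence >= TRIVIAL_THRESHOLD[req.safety_class]:
        return Decision(Tier.S, needs_human, "trivial: answered by Tier-S")
    # Ambiguous or high-risk: escalate. Sending class C straight to Tier-L
    # is an assumption of this sketch, not a rule stated in the document.
    tier = Tier.L if req.safety_class == "C" else Tier.M
    return Decision(tier, needs_human, "below threshold: escalated")
```

Keeping the policy a pure function of the request makes it easy to wrap in deterministic tests (P2) and to log the returned `reason` verbatim.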
3. Base-model selection criteria & governance
A base model is a supplier-provided component under ISO 13485 / QMSR supplier controls and GAMP 5 categorization. No base model enters the fleet without passing the gate below and landing in the Approved Base-Model Registry.
3.1 Selection criteria
| Criterion | Requirement | Evidence captured |
|---|---|---|
| License compatibility | License must permit commercial + regulated use, self-hosting, fine-tuning, and redistribution of derivatives internally. Legal sign-off mandatory (see §3.2). | License text, SPDX id, legal approval record. |
| Provenance | Weights obtained from the authoritative publisher; digest verified. No re-uploads of unknown origin. | Source URL, publisher identity, SHA-256 of weights + tokenizer. |
| Security scan of weights | Scan serialized weights for unsafe deserialization (require safetensors; reject pickle-based formats), embedded code, and known-bad artifacts. Quarantine until clean. | Scan report, scanner version, verdict. |
| Model card | Documented training data summary, intended use, known limitations, eval baselines, and bias notes. Missing card → not approved. | Stored model card + internal addendum. |
| Capability baseline | Passes minimum task-suite scores in 05 before any fine-tuning. | Eval run id, scores. |
| Maintainability | Supported by serving stack (vLLM/TensorRT-LLM), tokenizer stable, reasonable VRAM footprint. | Compatibility matrix entry. |
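The provenance and format checks in the table can be illustrated with a short standard-library sketch. It covers only two of the criteria (safetensors-only admission and SHA-256 digest verification) and does not replace the security scan. The function names, the `sha256:` prefix convention, and the in-memory stand-in for a weight shard are assumptions for illustration.

```python
import hashlib
import io

# Assumption: only safetensors files are admitted; every other format is rejected.
ALLOWED_SUFFIXES = (".safetensors",)


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large weight files are never fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def admit_weight_file(filename, stream, expected_digest):
    """Return (ok, reason). The format check runs first, then the digest check."""
    if not filename.endswith(ALLOWED_SUFFIXES):
        return False, "rejected: not safetensors"
    if sha256_of_stream(stream) != expected_digest:
        return False, "rejected: digest mismatch"
    return True, "digest verified"


# Usage with an in-memory stand-in for a weight shard (not real weights).
payload = b"not real weights"
published = "sha256:" + hashlib.sha256(payload).hexdigest()
print(admit_weight_file("model-00001.safetensors", io.BytesIO(payload), published))
print(admit_weight_file("model.bin", io.BytesIO(payload), published))
```

The expected digest should come from the authoritative publisher over a separate channel, otherwise the check only proves the download matches itself.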
3.2 License review (open-weight ≠ unrestricted)
"Open-weight" describes weight availability, not unrestricted rights. Each license is reviewed individually; the table below is engineering guidance, not legal advice — Legal sign-off per model is mandatory and recorded in the registry.
| License family | Typical examples | Commercial / regulated use | Watch-outs (review per version) |
|---|---|---|---|
| Apache-2.0 / MIT | Qwen2.5 (most sizes), StarCoder2, many bge/gte | Generally permissive | Confirm the specific checkpoint's license; some variants differ. |
| Llama Community License | Llama-3.x family | Permitted with conditions | Acceptable-use policy, attribution/naming requirements, large-MAU clause. |
| Model-specific bespoke | DeepSeek, some VL models | Case-by-case | Field-of-use, output/derivative terms, redistribution limits. |
| Non-commercial / research-only | Some checkpoints | Disqualified | Never admitted to the production fleet. |
3.3 Approved Base-Model Registry
Maintained in the MLflow registry with signed entries. Schema:
| Field | Example |
|---|---|
| model_uid | base/qwen2.5-coder-32b |
| weights_digest | sha256:… (safetensors) |
| tokenizer_digest | sha256:… |
| license_spdx / legal_approval_ref | Apache-2.0 / LGL-2026-0142 |
| provenance_url / publisher | authoritative source |
| security_scan_ref / verdict | SCAN-2026-0331 / clean |
| model_card_ref | stored card + addendum |
| tier / serving_compat | Tier-M / vLLM,TRT-LLM |
| approval_state | approved \| … |
| cosign_signature | Sigstore/cosign over the manifest |
Only approved bases may be parents of a fine-tune. Revocation propagates to all derived adapters (§8).
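The two registry rules just stated (only approved bases may parent a fine-tune, and revocation propagates to derived adapters) can be sketched with an in-memory model. This is not the MLflow API: the `Registry`, `BaseEntry`, and `AdapterEntry` classes are hypothetical, and the state names `revoked` and `deprecated` are assumptions, since the schema above does not list the full set of approval states.

```python
from dataclasses import dataclass, field


@dataclass
class BaseEntry:
    model_uid: str
    weights_digest: str
    approval_state: str = "approved"


@dataclass
class AdapterEntry:
    adapter_uid: str
    parent_uid: str
    state: str = "candidate"


@dataclass
class Registry:
    bases: dict = field(default_factory=dict)
    adapters: dict = field(default_factory=dict)

    def register_adapter(self, adapter_uid, parent_uid):
        """Refuse any fine-tune whose parent is missing or not approved."""
        parent = self.bases.get(parent_uid)
        if parent is None or parent.approval_state != "approved":
            raise PermissionError(f"{parent_uid} is not an approved base")
        self.adapters[adapter_uid] = AdapterEntry(adapter_uid, parent_uid)

    def revoke_base(self, model_uid):
        """Revoke a base and deprecate every adapter derived from it."""
        self.bases[model_uid].approval_state = "revoked"  # assumed state name
        affected = [a for a in self.adapters.values() if a.parent_uid == model_uid]
        for adapter in affected:
            adapter.state = "deprecated"
        return [a.adapter_uid for a in affected]
```

Returning the list of affected adapters gives the caller something concrete to record in the revocation evidence.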
4. The fine-tuning pipeline
```mermaid
flowchart TD
    subgraph SRC["Sourced & governed inputs (§6)"]
        A1[Internal sanitized corpus<br/>code · docs · tickets]
        A2[Task / instruction datasets]
        A3[Preference pairs<br/>chosen / rejected]
        A4[Verifier-filtered synthetic data]
    end
    B[("Approved Base-Model<br/>Registry (§3)")] --> S1
    A1 --> S1["Stage 1: DAPT<br/>Domain-Adaptive Continued Pretraining<br/>(usually full FT or large LoRA)"]
    S1 --> S2["Stage 2: SFT<br/>Instruction / task tuning<br/>(LoRA or full FT)"]
    A2 --> S2
    A4 --> S2
    S2 --> S3["Stage 3: Preference alignment<br/>DPO / ORPO (TRL)"]
    A3 --> S3
    S3 --> S4["Stage 4: Task/Domain LoRA adapters<br/>(PEFT/QLoRA), one per specialization (§5)"]
    S4 --> E["Eval gate (05)<br/>deterministic suites + abstention checks"]
    E -->|pass| R[("MLflow registry<br/>signed adapter + lineage (§7)")]
    E -->|fail| X[Repair / re-tune / reject]
    R --> SERVE["Multi-LoRA serving<br/>vLLM/Triton hot-swap on shared base"]
    classDef gate fill:#eef,stroke:#446;
    class E gate;
```
4.1 When to use each stage
| Stage | Purpose | Method | Use when | Skip when |
|---|---|---|---|---|
| 1 — DAPT (Domain-Adaptive Continued Pretraining) | Inject domain distribution (embedded C idioms, regulatory register, internal APIs) | Continued pretraining on the sanitized internal corpus; full FT or large-rank LoRA; DeepSpeed/FSDP via Ray+Kueue | Base is unfamiliar with the domain vocabulary/style at the token level | Base already strong in-domain; only behavior shaping needed |
| 2 — SFT (Supervised Fine-Tuning) | Teach task format & instruction following (test-gen schema, review rubric, doc templates) | TRL SFT; LoRA/QLoRA usually sufficient; full FT only if LoRA underfits | Almost always — this is the primary lever for task behavior | Task is purely retrieval/format-trivial |
| 3 — Preference alignment | Shape preferences: prefer abstention over guessing, prefer compiling code, prefer cited regulatory claims | DPO / ORPO (TRL) on chosen/rejected pairs | Need to suppress over-confidence, hallucination, or unsafe patterns (§9) | No reliable preference signal yet |
| 4 — Task/Domain LoRA | Narrow, swappable specialization | PEFT LoRA/QLoRA adapters on the aligned base | Per-domain (§5) capability needed without forking the base | One general adapter already meets the eval gate |
4.2 LoRA vs full fine-tuning: decision rule
| Use LoRA / QLoRA when… | Use full FT when… |
|---|---|
| Behavior/format adaptation on a capable base (most SFT, all per-domain adapters) | DAPT requires moving the base distribution substantially |
| You need many swappable specializations on shared weights (multi-LoRA serving) | Tokenizer/vocab must change |
| GPU/cost budget is tight (P6); QLoRA fits on fewer GPUs | LoRA repeatedly underfits the eval target after rank/data tuning |
| Fast iteration and small, signable artifacts are required | A new long-lived base derivative is justified and will itself enter the registry |
Default posture: prefer LoRA. Full FT is the exception and requires a documented justification plus its own registry entry as a derived base.
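A short worked calculation shows why LoRA yields the small, signable artifacts the decision rule relies on. For a frozen weight matrix of shape d_out × d_in, LoRA trains two low-rank factors of shapes d_out × r and r × d_in, so the trainable count is r · (d_in + d_out) instead of d_in · d_out. The 4096 × 4096 projection and rank 16 below are illustrative values, not a recommended configuration.

```python
def full_params(d_in, d_out):
    """Parameters in one dense weight matrix updated by full fine-tuning."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters in the two LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


# Illustrative values only: one 4096 x 4096 projection adapted at rank 16.
d, r = 4096, 16
dense = full_params(d, d)               # 16,777,216
lora = lora_trainable_params(d, d, r)   # 131,072
print(f"LoRA trains {lora / dense:.2%} of this matrix")  # 0.78%
```

The same ratio applies per adapted matrix, which is why one shared base can carry many per-domain adapters at modest storage and signing cost.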
5. Domain specialization
Each domain ships as a named LoRA adapter over an approved (optionally DAPT'd) base, independently versioned, eval-gated, and signed. Multi-LoRA serving hot-swaps the right adapter per request.
| Domain adapter | Tier(s) | Specialized capability | Example tasks |
|---|---|---|---|
| embedded-fw-c | M / L | Embedded/firmware C, MISRA-style constraints, ISRs, fixed-point, no-malloc patterns | Generate/refactor firmware, flag undefined behavior, MISRA review |
| imaging-pipeline | M / V | Image-processing pipelines, array/tensor ops, numerical stability | Pipeline code-gen, perf refactor, artifact reasoning |
| dicom-adjacent | M / V | DICOM-adjacent metadata, header semantics, de-ID conventions | Parse/validate metadata, generate handling code |
| reg-doc-drafting | M / L | Regulatory register; IEC 62304 / ISO 14971 / Part 11 phrasing; traceable claims | Draft design history items, risk entries, SOUP rationale |
| test-generation | M | Coverage-oriented unit/integration test synthesis with assertions | Generate tests to push coverage and mutation score |
| code-review | M | Project-specific review rubric, severity classification | Structured review with cited rule ids |
| router-classify | S | Triage, risk/class tagging, escalation, redaction | Route + abstain-or-escalate gate |
5.1 Multimodal angle (Tier-V)
Design inputs in this domain are inherently visual. Tier-V adapters target:
- Design PDFs / design history files → extract structured requirements/specs (feeds 06).
- Schematics / block diagrams → derive interfaces, signal lists, architecture facts.
- UI screenshots → verify UI against spec; detect drift.
- Imaging artifacts → describe/triage visual anomalies (advisory only; never a clinical claim).
All Tier-V outputs are advisory inputs to deterministic verifiers, not autonomous decisions; class-C-affecting outputs always require human confirmation (P3).
6. Data strategy for fine-tuning
Training data is a controlled, versioned, signed artifact. The dataset is as much a regulated input as the model.
| Concern | Control |
|---|---|
| Sourcing | Internal code, design docs, tickets/issues, review history — pulled via governed connectors with access controls (07). |
| PII / PHI scrubbing | Mandatory de-identification before any data crosses the source boundary into training. Multi-pass: pattern matching + Tier-S redaction model + human spot-audit. No PHI in training sets, ever. |
| IP / license hygiene | Exclude third-party code with incompatible licenses; track provenance per record; quarantine unknown-origin snippets. |
| Dataset versioning & signing | Immutable, content-addressed dataset snapshots (dataset_uid + digest), registered in MLflow, cosign-signed. |
| Train/test contamination control | Use Tier-E embeddings to detect near-duplicates between training data and held-out eval sets (05); fail the build on overlap above threshold. Eval sets are sealed and never enter training. |
| Synthetic data | Generated by Tier-M/L, then verifier-filtered: only synthetic examples whose outputs pass deterministic checks (compiles, tests pass, schema-valid) are retained. Unverifiable synthetic data is discarded. |
| Provenance labels | Every record tagged internal / synthetic-verified / public-permissive for auditability and ablation. |
Contamination is a correctness-and-evidence risk, not a metric nuisance. A model trained on its own eval set produces inflated scores that cannot support a defensible ≥99.9% claim (P1/P4). Contamination control is therefore a release gate (see anti-pattern A2 in §11).
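The embedding-based contamination check can be sketched as a brute-force cosine-similarity scan between training and eval embeddings. This is a minimal illustration: the function names, the 0.95 threshold, and the zero-tolerance rule (any hit fails the build) are assumptions, and the vectors would in practice come from a Tier-E model and an approximate nearest-neighbour index rather than a nested loop.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contamination_hits(train_vecs, eval_vecs, threshold):
    """Return (train_id, eval_id, similarity) for every pair at or above threshold."""
    hits = []
    for t_id, t_vec in train_vecs.items():
        for e_id, e_vec in eval_vecs.items():
            sim = cosine(t_vec, e_vec)
            if sim >= threshold:
                hits.append((t_id, e_id, sim))
    return hits


def contamination_gate(train_vecs, eval_vecs, threshold=0.95):
    """Fail the build if any training record is a near-duplicate of an eval record."""
    hits = contamination_hits(train_vecs, eval_vecs, threshold)
    if hits:
        raise RuntimeError(f"contamination gate failed: {len(hits)} near-duplicate pair(s)")
    return True
```

Returning the offending pairs, not just a boolean, lets the gate report exactly which records must be quarantined.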
7. Reproducibility & validation as first-class (P7)
A fine-tune must be reproducible and defensible as CSA evidence. We lock every input and sign the full lineage so an auditor can re-derive the artifact.
7.1 What is locked
| Locked input | Mechanism |
|---|---|
| Dataset | Content-addressed snapshot (dataset_uid + digest), signed (§6). |
| Base model | weights_digest + tokenizer_digest from the Approved Registry (§3). |
| Config | Hyperparameters, stage sequence, LoRA rank/targets, decode params — versioned YAML (Axolotl/Llama-Factory/torchtune), digested. |
| Seeds | All RNG seeds (data shuffling, init, dropout) pinned. |
| Environment | Container image digest, CUDA/driver, library versions (PEFT/TRL/DeepSpeed) recorded. |
| Eval | Eval suite version + sealed test-set digest (05). |
7.2 Signed lineage chain
```mermaid
flowchart LR
    D["dataset_uid<br/>(signed digest)"] --> A
    BM["base model_uid<br/>(registry digest)"] --> A
    CFG["config + seeds + env<br/>(digest)"] --> A
    A["adapter / model artifact<br/>(SHA-256)"] --> EV
    EV["eval run<br/>(suite ver + scores)"] --> REG
    REG[("MLflow registry entry<br/>cosign-signed, SLSA provenance")]
```
Each edge is a cosign attestation; the registry entry carries SLSA provenance. The chain answers the auditor's question — "show me exactly how this model was produced and that nothing changed" — and is the artifact-level realization of Part 11 (P4) and CSA.
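The content that such a chain binds together can be sketched as a canonical, digest-addressed record. This example only computes the digests that would later be signed; it does not call cosign or produce SLSA provenance. The record layout, field names, and the `build_lineage` helper are assumptions for illustration.

```python
import hashlib
import json


def canonical_digest(obj):
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def build_lineage(dataset_digest, base_digest, config, seeds, env_digest,
                  artifact_digest, eval_suite_version, scores):
    """Assemble a lineage record whose digest changes if any locked input changes."""
    inputs = {
        "dataset": dataset_digest,
        "base": base_digest,
        "config": canonical_digest(config),
        "seeds": seeds,
        "env": env_digest,
    }
    record = {
        "inputs": inputs,
        "inputs_digest": canonical_digest(inputs),
        "artifact": artifact_digest,
        "eval": {"suite_version": eval_suite_version, "scores": scores},
    }
    # The digest covers the record as it stands before this key is added.
    record["lineage_digest"] = canonical_digest(record)
    return record
```

Canonical encoding matters here: without sorted keys and fixed separators, two semantically identical manifests could hash differently and break the chain.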
7.3 How this satisfies P7 and D2-L4
- P7 (reproducible): Any registered fine-tune is reproducible from its locked inputs. Where the training run is fully deterministic, re-running the pipeline yields the same artifact digest; where it is not (for example, non-deterministic GPU kernels), the nondeterminism is documented, bounded, and recorded.
- D2-L4 (02): Validated Autonomous Agents require validated models. The signed lineage + eval-gated promotion (§8) is the model-side evidence package that lets an agent operate autonomously within its IEC 62304 class envelope.
8. Model lifecycle & governed evolution
Toolchain models evolve under PCCP-style predetermined change control. This reuses the FDA PCCP concept for the toolchain; the device itself is governed separately. Promotion is eval-gated (05), and no promotion bypasses the gate.
```mermaid
flowchart LR
    C["candidate<br/>(new adapter / base)"] --> SH["shadow<br/>(mirror traffic, no effect)"]
    SH --> CN["canary<br/>(small % real, guarded)"]
    CN --> PR["promote<br/>(default for tier/domain)"]
    PR --> DP["deprecate<br/>(retain weights + lineage)"]
    SH -->|fail gate| RJ[reject]
    CN -->|regression| RB[rollback]
```
| Phase | Gate / exit criteria | Evidence |
|---|---|---|
| Candidate | Lineage signed (§7); passes offline eval suite + abstention/calibration checks (§9) | Registry entry, eval run id |
| Shadow | Mirrored traffic; no regression vs incumbent on live distribution; no safety violations | Shadow comparison report |
| Canary | Bounded % of real traffic by IEC 62304 class (lower class first); cost-per-green-PR within budget (P6) | Canary metrics, guardrail logs |
| Promote | Meets/exceeds incumbent on all gated metrics; sign-off recorded | Promotion record, signatures |
| Deprecate | Successor promoted; weights and lineage retained for re-validation/defense | Retention record |
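The lifecycle can be sketched as a small state machine that refuses illegal jumps and refuses forward moves without a passed gate. The transition table mirrors the lifecycle flowchart; the `advance` function, the treatment of `reject` and `rollback` as terminal states, and the single boolean `gate_passed` flag are simplifying assumptions.

```python
# Transitions taken from the lifecycle flowchart; terminal states have no exits.
TRANSITIONS = {
    "candidate": {"shadow"},
    "shadow": {"canary", "reject"},
    "canary": {"promote", "rollback"},
    "promote": {"deprecate"},
    "deprecate": set(),
    "reject": set(),
    "rollback": set(),
}

# Failure exits are taken because a gate failed, so they need no passed gate.
FAILURE_EXITS = {"reject", "rollback"}


def advance(state, target, gate_passed):
    """Return the new state, or raise if the move is illegal or ungated."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition: {state} -> {target}")
    if target not in FAILURE_EXITS and not gate_passed:
        raise PermissionError(f"gate not passed: cannot move {state} -> {target}")
    return target
```

For example, `advance("candidate", "promote", True)` raises `ValueError` because the shadow and canary phases cannot be skipped.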
8.1 PCCP-style predetermined change control (toolchain models)
The Predetermined Change Control Plan for the model fleet specifies, in advance: the allowed change types (e.g., new domain adapter, refreshed SFT data), the fixed eval protocol that gates them, the rollback triggers, and the autonomy class affected. Changes inside the envelope flow through the lifecycle without re-opening the whole validation; changes outside it require plan revision. This ties to D1-L5 (02) (self-optimizing under governance) and the evaluation regime in 05. Base-model revocation (§3) forces immediate deprecation of all derived adapters.
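The envelope test described above can be sketched as a pure predicate over a declared plan. The `ChangePlan` fields, the change-type strings, and the idea of expressing the affected autonomy class as a maximum IEC 62304 class are assumptions for illustration; the actual plan format is not defined in this document.

```python
from dataclasses import dataclass

CLASS_ORDER = {"A": 0, "B": 1, "C": 2}


@dataclass(frozen=True)
class ChangePlan:
    allowed_change_types: frozenset
    eval_protocol_version: str   # the fixed eval protocol that gates changes
    max_safety_class: str        # assumed: highest IEC 62304 class the plan covers


def within_envelope(plan, change_type, eval_protocol_version, safety_class):
    """True if the change can flow through the lifecycle without revising the plan."""
    return (
        change_type in plan.allowed_change_types
        and eval_protocol_version == plan.eval_protocol_version
        and CLASS_ORDER[safety_class] <= CLASS_ORDER[plan.max_safety_class]
    )


# Hypothetical plan: two allowed change types, gated by protocol "eval-v1", up to class B.
plan = ChangePlan(frozenset({"new_domain_adapter", "refreshed_sft_data"}), "eval-v1", "B")
print(within_envelope(plan, "new_domain_adapter", "eval-v1", "A"))
print(within_envelope(plan, "tokenizer_change", "eval-v1", "A"))
```

A change that returns `False` here is not rejected outright; it signals that the plan itself must be revised before the change can proceed.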
9. Abstention & calibration
The ≥99.9% system property (P1) depends on models that decline rather than guess on out-of-distribution or high-risk inputs, escalating to a larger tier or a human. Abstention is a trained and served behavior, validated in 05.
| Mechanism | Where | Effect |
|---|---|---|
| Preference training for abstention | Stage 3 DPO/ORPO (§4) | Prefer "insufficient evidence → escalate" over a confident wrong answer |
| Calibrated confidence | Tier-S router + per-task heads | Confidence thresholds tuned so high-confidence ≈ high-accuracy |
| Abstain-or-escalate gate | Serving (router) | Below threshold → escalate tier or hand to human; never silently proceed |
| Selective prediction metrics | Eval (05) | Track coverage vs risk; gate on risk at fixed coverage, not raw accuracy |
| Class-aware strictness (P3) | Routing policy | IEC 62304 class C → conservative thresholds, mandatory human confirmation |
Calibration target: in the operating region, a high-confidence answer is correct at least 99.9% of the time; everything else abstains and routes to verification or a human. A miscalibrated-but-accurate model is not acceptable, because abstention behavior is itself an eval gate.
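The "risk at fixed coverage" metric from the table can be made concrete with a few lines of Python. In this sketch, coverage is the fraction of requests the model answers (the rest abstain), and risk is the error rate among the answered ones. The function names and the toy data are illustrative assumptions; the real gate and its thresholds are defined in 05.

```python
import math


def risk_at_coverage(confidences, correct, coverage):
    """Error rate on the `coverage` fraction of items with the highest confidence."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    # Small epsilon guards against float round-up in coverage * n.
    k = max(1, math.ceil(coverage * len(ranked) - 1e-9))
    accepted = ranked[:k]
    errors = sum(1 for _, ok in accepted if not ok)
    return errors / k


def coverage_at_threshold(confidences, threshold):
    """Fraction of items answered when abstaining below `threshold`."""
    return sum(c >= threshold for c in confidences) / len(confidences)


# Toy data: four answers with router confidence and verified correctness.
conf = [0.99, 0.90, 0.60, 0.40]
ok = [True, True, False, True]
print(risk_at_coverage(conf, ok, 0.5))   # answers only the two most confident
print(risk_at_coverage(conf, ok, 1.0))   # answers everything
```

On the toy data, answering everything gives 75% raw accuracy, while abstaining on the less confident half removes the one error. This is why the gate is expressed as risk at a fixed coverage rather than raw accuracy.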
10. Right-sizing & cost linkage
Smallest-capable-model principle (P6): route every task to the smallest model that passes the gate; escalate only on abstention.
| Lever | Action | Cost effect (→ 08) |
|---|---|---|
| Tiered routing | Tier-S handles the high-frequency majority; Tier-L is rare | Largest single cost lever; collapses per-token spend |
| Distillation | Capture Tier-L behavior (traces, preferences) → train Tier-M/S adapters | Moves capability down a tier at a fraction of inference cost |
| Quantization | FP8/AWQ/GPTQ per tier (§2) | More replicas per GPU; lower latency |
| Multi-LoRA hot-swap | Many domain adapters on one shared base | Eliminates per-domain base replicas; high GPU packing |
| Abstention budgeting | Escalate only when justified (§9) | Prevents needless large-model calls |
| Right-sized context | Use the minimum context window that passes eval | Lower KV-cache cost |
Distillation note. The verifier-filtered synthetic pipeline (§6) is the distillation substrate: only Tier-L outputs that pass deterministic verification become training data for smaller adapters, so distillation transfers verified behavior, not hallucinations. The primary metric for right-sizing decisions is cost-per-green-PR (P6), owned in 08.
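The verifier-filtering step behind this distillation note can be sketched as a pipeline of deterministic checks where an example is kept only if every check passes. The example below uses Python syntax validity as a stand-in for "compiles"; the record layout (`prompt`, `output`), the verifier functions, and the choice of checks are assumptions for illustration. A real pipeline would also run tests and schema validation in a sandbox.

```python
import ast


def has_required_fields(example):
    """Schema check: prompt and output must both be non-empty strings."""
    return all(isinstance(example.get(k), str) and example[k] for k in ("prompt", "output"))


def compiles(example):
    """Stand-in for a compile check: the output must parse as Python source."""
    try:
        ast.parse(example["output"])
        return True
    except (SyntaxError, ValueError):
        return False


# Order matters: the schema check runs first so later checks can rely on the fields.
VERIFIERS = (has_required_fields, compiles)


def filter_verified(examples, verifiers=VERIFIERS):
    """Keep only examples that pass every verifier; tag survivors for provenance."""
    kept = []
    for example in examples:
        if all(check(example) for check in verifiers):
            kept.append({**example, "provenance": "synthetic-verified"})
    return kept
```

Examples that fail any check are dropped rather than repaired, which matches the rule in §6 that unverifiable synthetic data is discarded.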
11. Anti-patterns
| # | Anti-pattern | Why it fails here | Required control |
|---|---|---|---|
| A1 | Unversioned models | No reproducibility, no defensible evidence; violates P7/P4 | Every model is a signed registry entry with full lineage (§7) |
| A2 | Training on the eval set | Inflated scores cannot support ≥99.9% (P1); fraudulent evidence | Embedding-based contamination gate; sealed eval sets (§6, 05) |
| A3 | Unscanned weights | Unsafe deserialization / supply-chain compromise | Mandatory weight scan + safetensors before approval (§3) |
| A4 | License violation | Legal and regulatory exposure; non-commercial weights in production | Per-model legal sign-off in the registry (§3.2) |
| A5 | SaaS LLM API "just for this one thing" | Breaks sovereignty, reproducibility, Part 11 chain (P7) | Hard constraint: open-weight self-host only (§1.2) |
| A6 | One big model for everything | Cost-fatal, latency-fatal, coarse governance (§1.1) | Tiered fleet + smallest-capable routing (§2, §10) |
| A7 | Over-confident models (no abstention) | Confident wrong answers break the 99.9% property | Abstention training + calibration gates (§9) |
| A8 | Promotion without eval gate | Unvalidated change reaches users; violates D2-L4 | Eval-gated candidate→shadow→canary→promote (§8) |
| A9 | PHI/IP in training data | Privacy/IP breach; non-compliant corpus | Mandatory scrubbing + provenance labels (§6, 07) |
| A10 | Unverified synthetic data | Trains models on hallucinations; degrades correctness | Verifier-filtering only; discard unverifiable (§6) |
Summary
The model strategy is a portfolio of right-sized, specialized, signed, open-weight models governed as regulated components: tiered for cost, latency, and specialization; fine-tuned through a locked-and-signed pipeline (DAPT → SFT → DPO/ORPO → LoRA); specialized per domain via hot-swappable adapters; and promoted only through eval-gated, PCCP-style change control. The correctness guarantee lives in the harness (P5) and its deterministic verification (P2), not in model scale. The models contribute calibrated capability and disciplined abstention (P1, §9), and reproducible, signed lineage (P7, §7) is what makes every fine-tune defensible CSA evidence. Evaluation detail lives in 05 and economics in 08.