04 - Model Strategy & Fine-Tuning

Figure E: Tiered Model-Fleet Routing

Document set: Agentic-Native SDLC for Regulated Medical Device Engineering
Status: Controlled engineering reference
Revision date: May 2026
Owning function: ML Platform & Quality Engineering
Cross-references: 01-requirements · 02-maturity-model · 03-reference-architecture · 05-evaluation-and-validation · 06-agentic-workflows · 07-security-and-compliance · 08-token-and-gpu-economics · 09-adoption-roadmap

Regulatory anchors: IEC 62304 · ISO 13485 / QMSR · ISO 14971 · FDA CSA · GAMP 5 · 21 CFR Part 11 · ISO/IEC 42001 · FDA AI-enabled device guidance + PCCP

Note on thresholds: All numeric thresholds in this document (e.g., ≥99.9%, abstention rates, eval gates) are placeholders pending calibration in 05. They denote intent and governance structure; they are not ratified acceptance criteria.

1. Strategy rationale: why a tiered, multi-model fleet of open-weight models

The toolchain is the regulated artifact, not the device. Per Principle P5 (the harness is the product), the models are components inside a verification harness whose system-level correctness target is ≥99.9% at the release gate (P1). No single model — however large — is the source of that guarantee; the guarantee emerges from Generate → Verify → Repair → Gate loops where deterministic checks (P2: determinism wraps probabilism) bound probabilistic generation.

Given that, model strategy optimizes for fitness-per-task at lowest defensible cost, not for a single frontier model.

1.1 Why tiered multi-model instead of one big model

Driver	One big model	Tiered fleet (chosen)
Cost (P6: cost-per-green-PR)	Every autocomplete keystroke pays 70B+ inference. Economically fatal at 1000+ devs.	Reflex-tier 1–8B handles the high-frequency majority of requests (the 80% share is a placeholder, per the note on thresholds); Reasoner-tier is reserved for rare hard plans. See 08.
Latency	70B autocomplete = unusable IDE latency.	Tier-S sub-100ms-class on small GPUs; interactive paths never touch large models.
Specialization	Generalist regresses on niche tasks (embedded C, DICOM, regulatory drafting).	Per-domain LoRA adapters (§5) sharpen narrow tasks without retraining a monolith.
Abstention & calibration (§9)	Monolith over-commits; one calibration curve for all tasks.	Per-tier/per-task calibration; cheap router/classifier abstains early and escalates.
Blast radius / governed evolution (§8)	One promotion changes everything; PCCP scope is the whole org.	Adapter-scoped change control; canary one adapter without re-validating the fleet.
GPU packing	Coarse, wasteful allocation.	Multi-LoRA hot-swap on shared base weights; bin-pack tiers across the cluster.
Determinism (P2)	Harder to wrap one opaque large model in deterministic checks.	Small, fast verifiers and routers are themselves deterministic-friendly.

The fleet is a portfolio: route each task to the smallest capable model (§10), escalate on low confidence, and let deterministic verifiers — not model scale — own the correctness guarantee.

1.2 Why open-weight + self-hosted (P7: self-hosted, sovereign, reproducible)

Requirement	Why SaaS LLM APIs are disqualified	What open-weight self-host gives us
Reproducibility (P7, critical)	Vendor silently updates the model; yesterday's evidence is not reproducible.	We pin exact weights + tokenizer + config by digest, so a fine-tune can be re-derived from locked inputs (§7).
CSA / Part 11 evidence (P4)	No control over model lineage; cannot sign the chain.	Full signed lineage dataset→base→adapter→eval→registry (§7) becomes defensible audit evidence.
Data sovereignty (PHI/PII/IP)	Source code, design history, and PHI-adjacent data leave the boundary.	All training and inference stay inside the K8s/VPC trust boundary (see 07).
Determinism & control	No control over sampling, decode, or version.	We fix seeds, decode params, and serving stack (vLLM/Triton).
Cost (P6)	Per-token vendor pricing scales with org size; opaque.	Amortized GPU cost, first-class and measurable (08).
Longevity	Models deprecated by vendor on vendor timeline.	We retain weights indefinitely for re-validation and regulatory defense.

HARD CONSTRAINT (restated): Self-hosted, fine-tuned, open-weight models only. No SaaS LLM APIs anywhere in the SDLC toolchain.

2. The fleet specification

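Five tiers. All tiers are served from the shared stack: K8s + NVIDIA GPU Operator, vLLM / Triton + TensorRT-LLM / KServe with multi-LoRA hot-swap. See 03 for serving topology. FP8 entries in the table assume GPUs with native FP8 support (L4, H100); on A10 or A100, use the listed INT8 or 4-bit option instead.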

Tier	Name	Params	Example base models (open-weight)	Context window	Serving HW (per replica)	Quantization	Typical tasks	Why this tier
Tier-S	Reflex	1–8B	Qwen2.5-Coder-1.5B / 7B, Llama-3.2-3B	8–32k	1× L4 / A10 (or a MIG slice of an A100/H100; L4 and A10 do not support MIG)	FP8 / INT8; AWQ/GPTQ-4bit for 1.5B	Autocomplete, classify, route, redact/PII-scrub, abstain-or-escalate gate	High-frequency, low-latency, cost-dominant path. Must be cheap (P6) and fast.
Tier-M	Worker	14–34B	Qwen2.5-Coder-32B, StarCoder2-15B, DeepSeek-Coder-V2-Lite	16–128k	1–2× A100/H100 80GB	FP8; AWQ-4bit option	Test generation, refactor, code review, doc drafting, structured edits	Workhorse for bounded automation (ASMM-Med L2/L3). Strong code with feasible cost.
Tier-L	Reasoner	70B+ / MoE	Llama-3.3-70B, Qwen2.5-72B, DeepSeek-V3 / R1-distill, Mixtral-8x22B	32–128k	2–8× H100 80GB (TP/PP)	FP8; INT4 for offline	Architecture, multi-step planning, hard root-cause, spec decomposition	Rare, high-value reasoning. Reserved; never on interactive hot paths.
Tier-V	Multimodal	7–90B	Qwen2.5-VL, Llama-3.2-Vision, InternVL2, Pixtral	8–128k	1–4× A100/H100 80GB	FP8 / AWQ	Parse design PDFs, schematics, UI screenshots, imaging artifacts, diagram→spec	Inputs in this domain are visual (design history files, DICOM-adjacent).
Tier-E	Embedding / Rerank	0.1–1.5B	bge, gte, jina-code, nomic-embed	512–8k	1× L4 / A10 (CPU fallback)	FP16 / INT8	RAG retrieval, code search, dedup, contamination detection, rerank	Retrieval substrate for every agentic workflow (06).

Routing summary. A Tier-S classifier/router triages every request: trivial → answer; ambiguous/high-risk → escalate to Tier-M/L; visual → Tier-V; retrieval → Tier-E first. Routing policy is governed by IEC 62304 class of the affected artifact (P3: autonomy by class A/B/C) and recorded as evidence (P4).
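
The routing summary above can be sketched as a small deterministic policy function. This is a minimal illustration, not the production router: the threshold values, the `Request` and `Decision` types, and the rule that a low-confidence text request goes first to Tier-M are assumptions introduced here. Only the tier names, the class-based strictness, and the mandatory human confirmation for class C come from this document.

```python
from dataclasses import dataclass

# Cheapest to most capable; tier names follow the fleet table.
ESCALATION_ORDER = ["Tier-S", "Tier-M", "Tier-L"]

# Hypothetical confidence thresholds per IEC 62304 class (placeholders).
CLASS_THRESHOLD = {"A": 0.80, "B": 0.90, "C": 0.97}


@dataclass(frozen=True)
class Request:
    kind: str            # "text", "visual", or "retrieval"
    iec62304_class: str  # "A", "B", or "C"
    confidence: float    # Tier-S router's calibrated confidence in [0, 1]


@dataclass(frozen=True)
class Decision:
    tier: str
    needs_human: bool
    reason: str  # stored with the decision so routing can be audited


def route(req: Request) -> Decision:
    """Return the smallest tier the policy allows for this request."""
    needs_human = req.iec62304_class == "C"  # class C always needs confirmation
    if req.kind == "visual":
        return Decision("Tier-V", needs_human, "visual input")
    if req.kind == "retrieval":
        return Decision("Tier-E", needs_human, "retrieval first")
    if req.confidence >= CLASS_THRESHOLD[req.iec62304_class]:
        return Decision("Tier-S", needs_human, "confident at reflex tier")
    return Decision("Tier-M", needs_human, "below class threshold, escalated")


def escalate(tier: str) -> str:
    """Next tier up after an abstention; past Tier-L the task goes to a human."""
    position = ESCALATION_ORDER.index(tier)
    if position + 1 < len(ESCALATION_ORDER):
        return ESCALATION_ORDER[position + 1]
    return "human"
```

Because `route` is a pure function of the request, the same input always yields the same `Decision`, which is what allows the routing record to serve as evidence (P4).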

3. Base-model selection criteria & governance

A base model is a supplier-provided component under ISO 13485 / QMSR supplier controls and GAMP 5 categorization. No base model enters the fleet without passing the gate below and landing in the Approved Base-Model Registry.

3.1 Selection criteria

Criterion	Requirement	Evidence captured
License compatibility	License must permit commercial + regulated use, self-hosting, fine-tuning, and redistribution of derivatives internally. Legal sign-off mandatory (see §3.2).	License text, SPDX id, legal approval record.
Provenance	Weights obtained from the authoritative publisher; digest verified. No re-uploads of unknown origin.	Source URL, publisher identity, SHA-256 of weights + tokenizer.
Security scan of weights	Scan serialized weights for unsafe deserialization, embedded code, and known-bad artifacts. Require `safetensors` and reject pickle-based checkpoint formats. Quarantine until clean.	Scan report, scanner version, verdict.
Model card	Documented training data summary, intended use, known limitations, eval baselines, and bias notes. Missing card → not approved.	Stored model card + internal addendum.
Capability baseline	Passes minimum task-suite scores in 05 before any fine-tuning.	Eval run id, scores.
Maintainability	Supported by serving stack (vLLM/TensorRT-LLM), tokenizer stable, reasonable VRAM footprint.	Compatibility matrix entry.
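
The provenance and format checks in the table can be illustrated with a short intake script. This is a sketch under stated assumptions: the allow-list of file suffixes and the `expected` manifest layout (relative path mapped to a `sha256:` digest) are hypothetical, and the script covers only digest verification and format rejection. It is not a substitute for the security scanner named in the criteria.

```python
import hashlib
from pathlib import Path
from typing import Dict, List

# Hypothetical allow-list: tensor data as safetensors, metadata as JSON.
ALLOWED_SUFFIXES = {".safetensors", ".json"}


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large weight shards fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()


def verify_checkpoint(directory: Path, expected: Dict[str, str]) -> List[str]:
    """Return a list of findings; an empty list means the checkpoint may proceed."""
    findings = []
    for path in sorted(p for p in directory.rglob("*") if p.is_file()):
        rel = path.relative_to(directory).as_posix()
        if path.suffix not in ALLOWED_SUFFIXES:
            # Pickle-based formats such as .bin or .pt land here.
            findings.append(f"disallowed format: {rel}")
            continue
        want = expected.get(rel)
        if want is None:
            findings.append(f"unlisted file: {rel}")
        elif sha256_file(path) != want:
            findings.append(f"digest mismatch: {rel}")
    for rel in sorted(expected):
        if not (directory / rel).is_file():
            findings.append(f"missing file: {rel}")
    return findings
```

Any non-empty result would keep the checkpoint in quarantine; the findings list itself is a natural attachment to the scan report.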

3.2 License review (open-weight ≠ unrestricted)

"Open-weight" describes weight availability, not unrestricted rights. Each license is reviewed individually; the table below is engineering guidance, not legal advice — Legal sign-off per model is mandatory and recorded in the registry.

License family	Typical examples	Commercial / regulated use	Watch-outs (review per version)
Apache-2.0 / MIT	Qwen2.5 (most sizes), many bge/gte	Generally permissive	Confirm the specific checkpoint's license; some variants differ.
Llama Community License	Llama-3.x family	Permitted with conditions	Acceptable-use policy, attribution/naming requirements, large-MAU clause.
Model-specific bespoke	StarCoder2 (BigCode OpenRAIL-M), DeepSeek, some VL models	Case-by-case	Use restrictions, field-of-use, output/derivative terms, redistribution limits.
Non-commercial / research-only	Some checkpoints	Disqualified	Never admitted to the production fleet.

3.3 Approved Base-Model Registry

Maintained in the MLflow registry with signed entries. Schema:

Field	Example
`model_uid`	`base/qwen2.5-coder-32b`
`weights_digest`	`sha256:…` (safetensors)
`tokenizer_digest`	`sha256:…`
`license_spdx` / `legal_approval_ref`	`Apache-2.0` / `LGL-2026-0142`
`provenance_url` / `publisher`	authoritative source
`security_scan_ref` / `verdict`	`SCAN-2026-0331` / `clean`
`model_card_ref`	stored card + addendum
`tier` / `serving_compat`	`Tier-M` / `vLLM,TRT-LLM`
`approval_state`	`approved` \
`cosign_signature`	Sigstore/cosign over the manifest

Only approved bases may be parents of a fine-tune. Revocation propagates to all derived adapters (§8).
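
The two registry rules above (approved parents only, and revocation that propagates to derived adapters) can be sketched as follows. This is an in-memory illustration, not the MLflow-backed registry: the `Registry` class, the `parent_uid` field, the `revoked` state name, and the adapter uid used in the usage note are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    model_uid: str
    approval_state: str               # "approved" or "revoked" in this sketch
    parent_uid: Optional[str] = None  # None for a base model


class Registry:
    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def add_base(self, model_uid: str) -> None:
        self._entries[model_uid] = Entry(model_uid, "approved")

    def add_adapter(self, model_uid: str, parent_uid: str) -> None:
        """Register a fine-tune; refuse any parent that is not approved."""
        parent = self._entries.get(parent_uid)
        if parent is None or parent.approval_state != "approved":
            raise ValueError(f"{parent_uid} is not an approved parent")
        self._entries[model_uid] = Entry(model_uid, "approved", parent_uid)

    def state(self, model_uid: str) -> str:
        return self._entries[model_uid].approval_state

    def revoke(self, model_uid: str) -> List[str]:
        """Revoke an entry and everything derived from it; return affected uids."""
        affected = []
        pending = [model_uid]
        while pending:
            uid = pending.pop()
            self._entries[uid].approval_state = "revoked"
            affected.append(uid)
            pending.extend(
                e.model_uid for e in self._entries.values()
                if e.parent_uid == uid and e.approval_state != "revoked"
            )
        return sorted(affected)
```

For example, after `add_base("base/qwen2.5-coder-32b")` and `add_adapter("adapter/embedded-fw-c", "base/qwen2.5-coder-32b")`, revoking the base returns both uids, and any later attempt to register an adapter on that base raises `ValueError`.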

4. The fine-tuning pipeline

```mermaid
flowchart TD
    subgraph SRC["Sourced & governed inputs (§6)"]
        A1[Internal sanitized corpus<br/>code · docs · tickets]
        A2[Task / instruction datasets]
        A3[Preference pairs<br/>chosen / rejected]
        A4[Verifier-filtered synthetic data]
    end

    B[("Approved Base-Model<br/>Registry (§3)")] --> S1

    A1 --> S1["Stage 1: DAPT<br/>Domain-Adaptive Continued Pretraining<br/>(usually full FT or large LoRA)"]
    S1 --> S2["Stage 2: SFT<br/>Instruction / task tuning<br/>(LoRA or full FT)"]
    A2 --> S2
    A4 --> S2
    S2 --> S3["Stage 3: Preference alignment<br/>DPO / ORPO (TRL)"]
    A3 --> S3
    S3 --> S4["Stage 4: Task/Domain LoRA adapters<br/>(PEFT/QLoRA), one per specialization (§5)"]

    S4 --> E["Eval gate (05)<br/>deterministic suites + abstention checks"]
    E -->|pass| R[("MLflow registry<br/>signed adapter + lineage (§7)")]
    E -->|fail| X[Repair / re-tune / reject]

    R --> SERVE["Multi-LoRA serving<br/>vLLM/Triton hot-swap on shared base"]

    classDef gate fill:#eef,stroke:#446;
    class E gate;
```

4.1 When to use each stage

Stage	Purpose	Method	Use when	Skip when
1 — DAPT (Domain-Adaptive Continued Pretraining)	Inject domain distribution (embedded C idioms, regulatory register, internal APIs)	Continued pretraining on the sanitized internal corpus; full FT or large-rank LoRA; DeepSpeed/FSDP via Ray+Kueue	Base is unfamiliar with the domain vocabulary/style at the token level	Base already strong in-domain; only behavior shaping needed
2 — SFT (Supervised Fine-Tuning)	Teach task format & instruction following (test-gen schema, review rubric, doc templates)	TRL SFT; LoRA/QLoRA usually sufficient; full FT only if LoRA underfits	Almost always — this is the primary lever for task behavior	Task is purely retrieval/format-trivial
3 — Preference alignment	Shape preferences: prefer abstention over guessing, prefer compiling code, prefer cited regulatory claims	DPO / ORPO (TRL) on chosen/rejected pairs	Need to suppress over-confidence, hallucination, or unsafe patterns (§9)	No reliable preference signal yet
4 — Task/Domain LoRA	Narrow, swappable specialization	PEFT LoRA/QLoRA adapters on the aligned base	Per-domain (§5) capability needed without forking the base	One general adapter already meets the eval gate
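
To make Stage 3 concrete, the sketch below computes the standard DPO objective for a single chosen/rejected pair from sequence log-probabilities. It illustrates the loss only and is not TRL's API; in practice the trainer computes this internally. The function name and the default `beta` value are assumptions for the example. For an abstention pair, "chosen" would be the escalate response and "rejected" the confident wrong answer.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin is how much more the policy prefers the chosen response
    over the rejected one, relative to the frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = -beta * margin
    # Numerically stable softplus: log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy shifts probability toward the chosen response and rises when it shifts toward the rejected one.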

4.2 LoRA vs full fine-tuning: decision rule

Use LoRA / QLoRA when…	Use full FT when…
Behavior/format adaptation on a capable base (most SFT, all per-domain adapters)	DAPT requires moving the base distribution substantially
You need many swappable specializations on shared weights (multi-LoRA serving)	Tokenizer/vocab must change
GPU/cost budget is tight (P6); QLoRA fits on fewer GPUs	LoRA repeatedly underfits the eval target after rank/data tuning
Fast iteration and small, signable artifacts are required	A new long-lived base derivative is justified and will itself enter the registry

Default posture: prefer LoRA. Full FT is the exception and requires a documented justification plus its own registry entry as a derived base.
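
The size argument behind this default can be checked with simple arithmetic. LoRA replaces the update to a `d_out × d_in` weight matrix with two low-rank factors, so it trains `rank × (d_in + d_out)` parameters for that matrix instead of `d_in × d_out`. The layer count and shapes in the usage note are hypothetical and chosen only to make the numbers easy to follow.

```python
from typing import Iterable, Tuple


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one weight matrix (factors A and B)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the same matrix is fully fine-tuned."""
    return d_in * d_out


def adapter_params(layers: int, shapes: Iterable[Tuple[int, int]], rank: int) -> int:
    """Total LoRA parameters when every listed matrix is targeted in each layer."""
    per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in shapes)
    return layers * per_layer
```

For a hypothetical 4096 × 4096 projection at rank 16, `lora_params` gives 131,072 trainable parameters against 16,777,216 for full fine-tuning, a ratio of 1/128. Targeting four such matrices in each of 32 layers gives an adapter of about 16.8M parameters, which is what keeps adapters small, signable, and cheap to hot-swap.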

5. Domain specialization

Each domain ships as a named LoRA adapter over an approved (optionally DAPT'd) base, independently versioned, eval-gated, and signed. Multi-LoRA serving hot-swaps the right adapter per request.

Domain adapter	Tier(s)	Specialized capability	Example tasks
`embedded-fw-c`	M / L	Embedded/firmware C, MISRA-style constraints, ISRs, fixed-point, no-malloc patterns	Generate/refactor firmware, flag undefined behavior, MISRA review
`imaging-pipeline`	M / V	Image-processing pipelines, array/tensor ops, numerical stability	Pipeline code-gen, perf refactor, artifact reasoning
`dicom-adjacent`	M / V	DICOM-adjacent metadata, header semantics, de-ID conventions	Parse/validate metadata, generate handling code
`reg-doc-drafting`	M / L	Regulatory register; IEC 62304 / ISO 14971 / Part 11 phrasing; traceable claims	Draft design history items, risk entries, SOUP rationale
`test-generation`	M	Coverage-oriented unit/integration test synthesis with assertions	Generate tests to push coverage and mutation score
`code-review`	M	Project-specific review rubric, severity classification	Structured review with cited rule ids
`router-classify`	S	Triage, risk/class tagging, escalation, redaction	Route + abstain-or-escalate gate

5.1 Multimodal angle (Tier-V)

Design inputs in this domain are inherently visual. Tier-V adapters target:

Design PDFs / design history files → extract structured requirements/specs (feeds 06).
Schematics / block diagrams → derive interfaces, signal lists, architecture facts.
UI screenshots → verify UI against spec; detect drift.
Imaging artifacts → describe/triage visual anomalies (advisory only; never a clinical claim).

All Tier-V outputs are advisory inputs to deterministic verifiers, not autonomous decisions; class-C-affecting outputs always require human confirmation (P3).

6. Data strategy for fine-tuning

Training data is a controlled, versioned, signed artifact. The dataset is as much a regulated input as the model.

Concern	Control
Sourcing	Internal code, design docs, tickets/issues, review history — pulled via governed connectors with access controls (07).
PII / PHI scrubbing	Mandatory de-identification before any data leaves the source boundary into training. Multi-pass: pattern + Tier-S redaction model + human spot-audit. No PHI in training sets, ever.
IP / license hygiene	Exclude third-party code with incompatible licenses; track provenance per record; quarantine unknown-origin snippets.
Dataset versioning & signing	Immutable, content-addressed dataset snapshots (`dataset_uid` + digest), registered in MLflow, cosign-signed.
Train/test contamination control	Use Tier-E embeddings to detect near-duplicates between training data and held-out eval sets (05); fail the build on overlap above threshold. Eval sets are sealed and never enter training.
Synthetic data	Generated by Tier-M/L, then verifier-filtered: only synthetic examples whose outputs pass deterministic checks (compiles, tests pass, schema-valid) are retained. Unverifiable synthetic data is discarded.
Provenance labels	Every record tagged `internal` / `synthetic-verified` / `public-permissive` for auditability and ablation.

Contamination is a correctness-and-evidence risk, not a metric nuisance. A model trained on its own eval set produces inflated scores that cannot support a defensible ≥99.9% claim (P1/P4). Contamination control is therefore a release gate (see anti-pattern A2 in §11).
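
The embedding-based contamination check described in the table can be sketched as a brute-force cosine-similarity scan. This assumes embeddings have already been produced by a Tier-E model and are passed in as plain float lists; the function names, the similarity threshold, and the zero-tolerance default are illustrative. A production version would likely use an approximate nearest-neighbour index rather than the quadratic loop shown here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_overlaps(train: Dict[str, Sequence[float]],
                  eval_set: Dict[str, Sequence[float]],
                  threshold: float) -> List[Tuple[str, str, float]]:
    """List (train_id, eval_id, similarity) for every near-duplicate pair."""
    hits = []
    for t_id, t_vec in train.items():
        for e_id, e_vec in eval_set.items():
            sim = cosine(t_vec, e_vec)
            if sim >= threshold:
                hits.append((t_id, e_id, sim))
    return hits


def contamination_gate(train: Dict[str, Sequence[float]],
                       eval_set: Dict[str, Sequence[float]],
                       threshold: float,
                       max_overlap_fraction: float = 0.0) -> bool:
    """True if the fraction of eval items with a near-duplicate is acceptable."""
    contaminated = {e_id for _, e_id, _ in find_overlaps(train, eval_set, threshold)}
    return len(contaminated) / len(eval_set) <= max_overlap_fraction
```

A `False` result would fail the build; the list returned by `find_overlaps` identifies which training records to quarantine before the snapshot is re-cut.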

7. Reproducibility & validation as first-class concerns (P7)

A fine-tune must be reproducible and defensible as CSA evidence. We lock every input and sign the full lineage so an auditor can re-derive the artifact.

7.1 What is locked

Locked input	Mechanism
Dataset	Content-addressed snapshot (`dataset_uid` + digest), signed (§6).
Base model	`weights_digest` + `tokenizer_digest` from the Approved Registry (§3).
Config	Hyperparameters, stage sequence, LoRA rank/targets, decode params — versioned YAML (Axolotl/Llama-Factory/torchtune), digested.
Seeds	All RNG seeds (data shuffling, init, dropout) pinned.
Environment	Container image digest, CUDA/driver, library versions (PEFT/TRL/DeepSpeed) recorded.
Eval	Eval suite version + sealed test-set digest (05).

7.2 Signed lineage chain

```mermaid
flowchart LR
    D["dataset_uid<br/>(signed digest)"] --> A
    BM["base model_uid<br/>(registry digest)"] --> A
    CFG["config + seeds + env<br/>(digest)"] --> A
    A["adapter / model artifact<br/>(SHA-256)"] --> EV
    EV["eval run<br/>(suite ver + scores)"] --> REG
    REG[("MLflow registry entry<br/>cosign-signed, SLSA provenance")]
```

Each edge is a cosign attestation; the registry entry carries SLSA provenance. The chain answers the auditor's question — "show me exactly how this model was produced and that nothing changed" — and is the artifact-level realization of Part 11 (P4) and CSA.
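
The content-addressing that underlies this chain can be sketched without any signing infrastructure. In the example below each record embeds the digest of the record before it, so changing any locked input changes every downstream digest. This shows the hash linkage only; the cosign attestations and SLSA provenance are not modelled, and the record field names are assumptions for illustration.

```python
import hashlib
import json
from typing import Any, Dict


def digest(record: Dict[str, Any]) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_chain(dataset: Dict[str, Any], base: Dict[str, Any],
                config: Dict[str, Any], artifact_sha256: str,
                scores: Dict[str, float]) -> Dict[str, Dict[str, Any]]:
    """Link dataset, base, and config to the artifact, then to eval and registry."""
    training = {
        "dataset": digest(dataset),
        "base": digest(base),
        "config": digest(config),
        "artifact": artifact_sha256,
    }
    evaluation = {"training": digest(training), "scores": scores}
    registry = {"evaluation": digest(evaluation)}
    return {"training": training, "evaluation": evaluation, "registry": registry}


def verify_chain(chain: Dict[str, Dict[str, Any]], dataset: Dict[str, Any],
                 base: Dict[str, Any], config: Dict[str, Any]) -> bool:
    """Recompute every link from the locked inputs and compare."""
    training = chain["training"]
    return (training["dataset"] == digest(dataset)
            and training["base"] == digest(base)
            and training["config"] == digest(config)
            and chain["evaluation"]["training"] == digest(training)
            and chain["registry"]["evaluation"] == digest(chain["evaluation"]))
```

Canonical encoding matters here: without sorted keys and fixed separators, two semantically identical configs could hash differently and produce a false mismatch during an audit.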

7.3 How this satisfies P7 and D2-L4

P7 (reproducible): Any registered fine-tune is byte-reproducible from locked inputs; re-running the pipeline yields the same artifact digest (modulo documented nondeterminism, which is itself bounded and recorded).
D2-L4 (02): Validated Autonomous Agents require validated models. The signed lineage + eval-gated promotion (§8) is the model-side evidence package that lets an agent operate autonomously within its IEC 62304 class envelope.

8. Model lifecycle & governed evolution

Models evolve under a PCCP-style predetermined change control applied to the toolchain models (reusing the FDA PCCP concept; the device itself is governed separately). Promotion is eval-gated (05); no promotion bypasses the gate.

```mermaid
flowchart LR
    C["candidate<br/>(new adapter / base)"] --> SH["shadow<br/>(mirror traffic, no effect)"]
    SH --> CN["canary<br/>(small % real, guarded)"]
    CN --> PR["promote<br/>(default for tier/domain)"]
    PR --> DP["deprecate<br/>(retain weights + lineage)"]
    SH -->|fail gate| RJ[reject]
    CN -->|regression| RB[rollback]
```

Phase	Gate / exit criteria	Evidence
Candidate	Lineage signed (§7); passes offline eval suite + abstention/calibration checks (§9)	Registry entry, eval run id
Shadow	Mirrored traffic; no regression vs incumbent on live distribution; no safety violations	Shadow comparison report
Canary	Bounded % of real traffic, rolled out by IEC 62304 class (lowest-risk class first: A, then B, then C); cost-per-green-PR within budget (P6)	Canary metrics, guardrail logs
Promote	Meets/exceeds incumbent on all gated metrics; sign-off recorded	Promotion record, signatures
Deprecate	Successor promoted; weights and lineage retained for re-validation/defense	Retention record
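
The lifecycle above amounts to a small state machine in which forward moves require a passed gate and the failure exits do not. The sketch below encodes the transitions from the flowchart; the function name and the choice of exception types are assumptions for the example.

```python
# Transitions taken directly from the lifecycle flowchart.
ALLOWED = {
    ("candidate", "shadow"),
    ("shadow", "canary"),
    ("shadow", "reject"),
    ("canary", "promote"),
    ("canary", "rollback"),
    ("promote", "deprecate"),
}

# Entering these phases requires that the current phase's gate has passed.
GATED = {"shadow", "canary", "promote"}


def advance(current: str, target: str, gate_passed: bool) -> str:
    """Return the new phase, or raise if the move is illegal or ungated."""
    if (current, target) not in ALLOWED:
        raise ValueError(f"illegal transition: {current} -> {target}")
    if target in GATED and not gate_passed:
        raise PermissionError(f"gate not passed for {current} -> {target}")
    return target
```

There is deliberately no edge from `candidate` to `promote`, so skipping shadow and canary is rejected as an illegal transition rather than left to reviewer discipline.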

8.1 PCCP-style predetermined change control (toolchain models)

The Predetermined Change Control Plan for the model fleet specifies, in advance: the allowed change types (e.g., new domain adapter, refreshed SFT data), the fixed eval protocol that gates them, the rollback triggers, and the autonomy class affected. Changes inside the envelope flow through the lifecycle without re-opening the whole validation; changes outside it require plan revision. This ties to D1-L5 (02) (self-optimizing under governance) and the evaluation regime in 05. Base-model revocation (§3) forces immediate deprecation of all derived adapters.

9. Abstention & calibration

The ≥99.9% system property (P1) depends on models that decline rather than guess on out-of-distribution or high-risk inputs, escalating to a larger tier or a human. Abstention is a trained and served behavior, validated in 05.

Mechanism	Where	Effect
Preference training for abstention	Stage 3 DPO/ORPO (§4)	Prefer "insufficient evidence → escalate" over a confident wrong answer
Calibrated confidence	Tier-S router + per-task heads	Confidence thresholds tuned so high-confidence ≈ high-accuracy
Abstain-or-escalate gate	Serving (router)	Below threshold → escalate tier or hand to human; never silently proceed
Selective prediction metrics	Eval (05)	Track coverage vs risk; gate on risk at fixed coverage, not raw accuracy
Class-aware strictness (P3)	Routing policy	IEC 62304 class C → conservative thresholds, mandatory human confirmation

Calibration target: in the operating region, a high-confidence answer is correct ≥99.9% of the time; everything else abstains and routes to verification or a human. A model that is accurate but miscalibrated is not acceptable: abstention behavior is itself an eval gate.
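
The "risk at fixed coverage" metric from the table can be computed as follows. This is a minimal sketch assuming per-example confidences and correctness labels are available from an eval run; the function names are illustrative and the actual gate definitions live in 05.

```python
from typing import Sequence


def risk_at_coverage(confidences: Sequence[float],
                     correct: Sequence[bool],
                     coverage: float) -> float:
    """Error rate on the most-confident `coverage` fraction of examples."""
    ranked = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    k = max(1, round(coverage * len(ranked)))
    answered = ranked[:k]
    errors = sum(1 for _, ok in answered if not ok)
    return errors / k


def coverage_at_threshold(confidences: Sequence[float], threshold: float) -> float:
    """Fraction of examples the model would answer rather than abstain on."""
    return sum(1 for c in confidences if c >= threshold) / len(confidences)
```

With confidences `[0.99, 0.95, 0.9, 0.6, 0.4]` and correctness `[True, True, True, False, True]`, the risk at 60% coverage is 0.0 while raw accuracy over all five examples is only 0.8. The gap between those two numbers is the value that abstention adds, and it is why the gate is expressed as risk at a fixed coverage.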

10. Right-sizing & cost linkage

Smallest-capable-model principle (P6): route every task to the smallest model that passes the gate; escalate only on abstention or when the class-based routing policy (§2) requires it.

Lever	Action	Cost effect (→ 08)
Tiered routing	Tier-S handles the high-frequency majority; Tier-L is rare	Largest single cost lever; collapses per-token spend
Distillation	Capture Tier-L behavior (traces, preferences) → train Tier-M/S adapters	Moves capability down a tier at a fraction of inference cost
Quantization	FP8/AWQ/GPTQ per tier (§2)	More replicas per GPU; lower latency
Multi-LoRA hot-swap	Many domain adapters on one shared base	Eliminates per-domain base replicas; high GPU packing
Abstention budgeting	Escalate only when justified (§9)	Prevents needless large-model calls
Right-sized context	Use the minimum context window that passes eval	Lower KV-cache cost
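
The effect of the tiered-routing lever can be shown with a weighted-average calculation. All numbers in the usage note are hypothetical relative costs, not measurements, and the model ignores the cost of a failed lower-tier attempt before escalation; 08 owns the real accounting.

```python
from typing import Dict


def expected_cost_per_request(traffic_share: Dict[str, float],
                              cost_per_request: Dict[str, float]) -> float:
    """Traffic-weighted average cost; shares must sum to 1."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost_per_request[tier]
               for tier, share in traffic_share.items())


def cost_per_green_pr(total_cost: float, green_prs: int) -> float:
    """Total spend divided by the number of PRs that passed every gate."""
    if green_prs <= 0:
        raise ValueError("need at least one green PR")
    return total_cost / green_prs
```

With hypothetical shares of 0.80 / 0.15 / 0.05 for Tier-S / Tier-M / Tier-L and relative costs of 1 / 10 / 100, the expected cost is 7.3 units per request, compared with 100 if every request went to Tier-L.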

Distillation note. The verifier-filtered synthetic pipeline (§6) is the distillation substrate: only Tier-L outputs that pass deterministic verification become training data for smaller adapters, so distillation transfers verified behavior, not hallucinations. The primary metric for right-sizing decisions is cost-per-green-PR (P6), owned in 08.
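
A verifier filter of the kind described in the distillation note can be sketched in a few lines. Here Python's own `compile` and `exec` stand in for the deterministic checks (compiles, tests pass); this is an assumption made to keep the example self-contained. A real pipeline would run the project's compilers and test runners inside a sandbox and would never execute untrusted generated code in-process.

```python
from typing import Dict, List


def passes_checks(sample: Dict[str, str]) -> bool:
    """True only if the candidate code compiles and its paired test passes."""
    namespace: Dict[str, object] = {}
    try:
        exec(compile(sample["code"], "<candidate>", "exec"), namespace)
        exec(compile(sample["test"], "<test>", "exec"), namespace)
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all reject the sample.
        return False
    return True


def verifier_filter(samples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep verified samples and tag them with the provenance label from §6."""
    kept = []
    for sample in samples:
        if passes_checks(sample):
            kept.append({**sample, "provenance": "synthetic-verified"})
    return kept
```

Samples that fail are dropped rather than repaired, which matches the rule that unverifiable synthetic data is discarded.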

11. Anti-patterns

#	Anti-pattern	Why it fails here	Required control
A1	Unversioned models	No reproducibility, no defensible evidence; violates P7/P4	Every model is a signed registry entry with full lineage (§7)
A2	Training on the eval set	Inflated scores cannot support ≥99.9% (P1); fraudulent evidence	Embedding-based contamination gate; sealed eval sets (§6, 05)
A3	Unscanned weights	Unsafe deserialization / supply-chain compromise	Mandatory weight scan + `safetensors` before approval (§3)
A4	License violation	Legal and regulatory exposure; non-commercial weights in production	Per-model legal sign-off in the registry (§3.2)
A5	SaaS LLM API "just for this one thing"	Breaks sovereignty and reproducibility (P7) and the Part 11 evidence chain (P4)	Hard constraint: open-weight self-host only (§1.2)
A6	One big model for everything	Cost-fatal, latency-fatal, coarse governance (§1.1)	Tiered fleet + smallest-capable routing (§2, §10)
A7	Over-confident models (no abstention)	Confident wrong answers break the ≥99.9% property (P1)	Abstention training + calibration gates (§9)
A8	Promotion without eval gate	Unvalidated change reaches users; violates D2-L4	Eval-gated candidate→shadow→canary→promote (§8)
A9	PHI/IP in training data	Privacy/IP breach; non-compliant corpus	Mandatory scrubbing + provenance labels (§6, 07)
A10	Unverified synthetic data	Trains models on hallucinations; degrades correctness	Verifier-filtering only; discard unverifiable (§6)

Summary

The model strategy is a portfolio of right-sized, specialized, signed, open-weight models governed as regulated components: tiered for cost/latency/specialization, fine-tuned through a locked-and-signed pipeline (DAPT → SFT → DPO/ORPO → LoRA), specialized per domain via hot-swappable adapters, and promoted only through eval-gated, PCCP-style change control. The correctness guarantee lives in the harness (P5) and its deterministic verification (P2), not in model scale; the models contribute calibrated capability and disciplined abstention (P1, §9), and reproducible, signed lineage (P7, §7) is what makes every fine-tune defensible CSA evidence. Implementation detail for evaluation lives in 05 and for economics in 08.