
04: Model Strategy & Fine-Tuning

Figure E: Tiered Model-Fleet Routing (smallest-capable-model first). A dev request plus context enters the Routing Gateway (Tier-S classifier) and is dispatched to Tier-S (autocomplete, classify; ≈ lowest $), Tier-M (test, review, docs; low $), Tier-L (architecture, planning; high $, sparse), Tier-V (diagrams, imaging, PDF; multimodal), or Tier-E (retrieval, rerank; tiny $).

Document set: Agentic-Native SDLC for Regulated Medical Device Engineering

Status: Controlled engineering reference · Revision date: May 2026

Owning function: ML Platform & Quality Engineering

Cross-references: 01-requirements · 02-maturity-model · 03-reference-architecture · 05-evaluation-and-validation · 06-agentic-workflows · 07-security-and-compliance · 08-token-and-gpu-economics · 09-adoption-roadmap

Regulatory anchors: IEC 62304 · ISO 13485 / QMSR · ISO 14971 · FDA CSA · GAMP 5 · 21 CFR Part 11 · ISO/IEC 42001 · FDA AI-enabled device guidance + PCCP

Note on thresholds: All numeric thresholds in this document (e.g., ≥99.9%, abstention rates, eval gates) are placeholders pending calibration in 05. They denote intent and governance structure, not yet-ratified acceptance criteria.


1. Strategy rationale: why a tiered, multi-model fleet of open-weight models

The toolchain is the regulated artifact, not the device. Per Principle P5 (the harness is the product), the models are components inside a verification harness whose system-level correctness target is ≥99.9% at the release gate (P1). No single model — however large — is the source of that guarantee; the guarantee emerges from Generate → Verify → Repair → Gate loops where deterministic checks (P2: determinism wraps probabilism) bound probabilistic generation.

Given that, model strategy optimizes for fitness-per-task at lowest defensible cost, not for a single frontier model.

1.1 Why tiered multi-model instead of one big model

| Driver | One big model | Tiered fleet (chosen) |
|---|---|---|
| Cost (P6: cost-per-green-PR) | Every autocomplete keystroke pays 70B+ inference. Economically fatal at 1000+ devs. | Reflex-tier 1–8B handles the high-frequency 80%; Reasoner-tier reserved for rare hard plans. Tie to 08. |
| Latency | 70B autocomplete = unusable IDE latency. | Tier-S sub-100ms-class on small GPUs; interactive paths never touch large models. |
| Specialization | Generalist regresses on niche tasks (embedded C, DICOM, regulatory drafting). | Per-domain LoRA adapters (§5) sharpen narrow tasks without retraining a monolith. |
| Abstention & calibration (§9) | Monolith over-commits; one calibration curve for all tasks. | Per-tier/per-task calibration; cheap router/classifier abstains early and escalates. |
| Blast radius / governed evolution (§8) | One promotion changes everything; PCCP scope is the whole org. | Adapter-scoped change control; canary one adapter without re-validating the fleet. |
| GPU packing | Coarse, wasteful allocation. | Multi-LoRA hot-swap on shared base weights; bin-pack tiers across the cluster. |
| Determinism (P2) | Harder to wrap one opaque large model in deterministic checks. | Small, fast verifiers and routers are themselves deterministic-friendly. |

The fleet is a portfolio: route each task to the smallest capable model (§10), escalate on low confidence, and let deterministic verifiers — not model scale — own the correctness guarantee.

1.2 Why open-weight + self-hosted (P7: self-hosted, sovereign, reproducible)

| Requirement | Why SaaS LLM APIs are disqualified | What open-weight self-host gives us |
|---|---|---|
| Reproducibility (P7, critical) | Vendor silently updates the model; yesterday's evidence is not reproducible. | We pin exact weights + tokenizer + config by digest; a fine-tune is reproducible byte-for-byte. |
| CSA / Part 11 evidence (P4) | No control over model lineage; cannot sign the chain. | Full signed lineage dataset→base→adapter→eval→registry (§7) becomes defensible audit evidence. |
| Data sovereignty (PHI/PII/IP) | Source code, design history, and PHI-adjacent data leave the boundary. | All training and inference stay inside the K8s/VPC trust boundary (see 07). |
| Determinism & control | No control over sampling, decode, or version. | We fix seeds, decode params, and serving stack (vLLM/Triton). |
| Cost (P6) | Per-token vendor pricing scales with org size; opaque. | Amortized GPU cost, first-class and measurable (08). |
| Longevity | Models deprecated by vendor on vendor timeline. | We retain weights indefinitely for re-validation and regulatory defense. |

HARD CONSTRAINT (restated): Self-hosted, fine-tuned, open-weight models only. No SaaS LLM APIs anywhere in the SDLC toolchain.


2. The fleet specification

Five tiers. All tiers are served from the shared stack: K8s + NVIDIA GPU Operator, vLLM / Triton + TensorRT-LLM / KServe with multi-LoRA hot-swap. See 03 for serving topology.

| Tier | Name | Params | Example base models (open-weight) | Context window | Serving HW (per replica) | Quantization | Typical tasks | Why this tier |
|---|---|---|---|---|---|---|---|---|
| Tier-S | Reflex | 1–8B | Qwen2.5-Coder-1.5B / 7B, Llama-3.2-3B | 8–32k | 1× L4 / A10 (or MIG slice) | FP8 / INT8; AWQ/GPTQ-4bit for 1.5B | Autocomplete, classify, route, redact/PII-scrub, abstain-or-escalate gate | High-frequency, low-latency, cost-dominant path. Must be cheap (P6) and fast. |
| Tier-M | Worker | 14–34B | Qwen2.5-Coder-32B, StarCoder2-15B, DeepSeek-Coder-V2-Lite | 16–128k | 1–2× A100/H100 80GB | FP8; AWQ-4bit option | Test generation, refactor, code review, doc drafting, structured edits | Workhorse for bounded automation (ASMM-Med L2/L3). Strong code with feasible cost. |
| Tier-L | Reasoner | 70B+ / MoE | Llama-3.3-70B, Qwen2.5-72B, DeepSeek-V3 / R1-distill, Mixtral-8x22B | 32–128k | 2–8× H100 80GB (TP/PP) | FP8; INT4 for offline | Architecture, multi-step planning, hard root-cause, spec decomposition | Rare, high-value reasoning. Reserved; never on interactive hot paths. |
| Tier-V | Multimodal | 7–90B | Qwen2.5-VL, Llama-3.2-Vision, InternVL2, Pixtral | 8–128k | 1–4× A100/H100 80GB | FP8 / AWQ | Parse design PDFs, schematics, UI screenshots, imaging artifacts, diagram→spec | Inputs in this domain are visual (design history files, DICOM-adjacent). |
| Tier-E | Embedding / Rerank | 0.1–1.5B | bge, gte, jina-code, nomic-embed | 512–8k | 1× L4 / A10 (CPU fallback) | FP16 / INT8 | RAG retrieval, code search, dedup, contamination detection, rerank | Retrieval substrate for every agentic workflow (06). |

Routing summary. A Tier-S classifier/router triages every request: trivial → answer; ambiguous/high-risk → escalate to Tier-M/L; visual → Tier-V; retrieval → Tier-E first. Routing policy is governed by IEC 62304 class of the affected artifact (P3: autonomy by class A/B/C) and recorded as evidence (P4).
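
The routing summary above can be sketched as a small deterministic policy function. This is a minimal illustration under stated assumptions, not the gateway's actual implementation: the `Request` fields, the `route` function, and the per-class confidence floors are hypothetical, the floor values are placeholders in the sense of the note on thresholds, and the choice between Tier-M and Tier-L on escalation is left out for brevity.

```python
from dataclasses import dataclass

# Hypothetical placeholder floors per IEC 62304 class; real values would be
# calibrated in 05. Higher class means a stricter floor (P3).
CONFIDENCE_FLOOR = {"A": 0.90, "B": 0.95, "C": 0.99}


@dataclass(frozen=True)
class Request:
    kind: str          # "text", "visual", or "retrieval"
    iec_class: str     # IEC 62304 class of the affected artifact: "A", "B", "C"
    confidence: float  # Tier-S classifier confidence that the task is trivial


def route(req: Request) -> dict:
    """Return a routing decision record that can be logged as evidence (P4)."""
    if req.kind == "visual":
        tier = "Tier-V"
    elif req.kind == "retrieval":
        tier = "Tier-E"
    elif req.confidence >= CONFIDENCE_FLOOR[req.iec_class]:
        tier = "Tier-S"  # trivial: answer on the reflex tier
    else:
        tier = "Tier-M"  # ambiguous or high-risk: escalate (Tier-L not modeled)
    return {
        "tier": tier,
        "iec_class": req.iec_class,
        "confidence": req.confidence,
        # Class C outputs always require human confirmation (P3).
        "human_confirmation": req.iec_class == "C",
    }
```

For example, `route(Request("text", "C", 0.95))` escalates to Tier-M because 0.95 is below the class-C floor, while the same confidence on a class-A artifact stays on Tier-S. Keeping the policy a pure function of the request makes each decision replayable from its logged record.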


3. Base-model selection criteria & governance

A base model is a supplier-provided component under ISO 13485 / QMSR supplier controls and GAMP 5 categorization. No base model enters the fleet without passing the gate below and landing in the Approved Base-Model Registry.

3.1 Selection criteria

| Criterion | Requirement | Evidence captured |
|---|---|---|
| License compatibility | License must permit commercial + regulated use, self-hosting, fine-tuning, and redistribution of derivatives internally. Legal sign-off mandatory (see §3.2). | License text, SPDX id, legal approval record. |
| Provenance | Weights obtained from the authoritative publisher; digest verified. No re-uploads of unknown origin. | Source URL, publisher identity, SHA-256 of weights + tokenizer. |
| Security scan of weights | Scan serialized weights for unsafe deserialization (reject pickle where possible; require safetensors), embedded code, and known-bad artifacts. Quarantine until clean. | Scan report, scanner version, verdict. |
| Model card | Documented training data summary, intended use, known limitations, eval baselines, and bias notes. Missing card → not approved. | Stored model card + internal addendum. |
| Capability baseline | Passes minimum task-suite scores in 05 before any fine-tuning. | Eval run id, scores. |
| Maintainability | Supported by serving stack (vLLM/TensorRT-LLM), tokenizer stable, reasonable VRAM footprint. | Compatibility matrix entry. |

3.2 License review (open-weight ≠ unrestricted)

"Open-weight" describes weight availability, not unrestricted rights. Each license is reviewed individually; the table below is engineering guidance, not legal advice — Legal sign-off per model is mandatory and recorded in the registry.

| License family | Typical examples | Commercial / regulated use | Watch-outs (review per version) |
|---|---|---|---|
| Apache-2.0 / MIT | Qwen2.5 (most sizes), many bge/gte | Generally permissive | Confirm the specific checkpoint's license; some variants differ. |
| Llama Community License | Llama-3.x family | Permitted with conditions | Acceptable-use policy, attribution/naming requirements, large-MAU clause. |
| Model-specific bespoke | DeepSeek, StarCoder2 (BigCode OpenRAIL-M), some VL models | Case-by-case | Field-of-use, output/derivative terms, redistribution limits. |
| Non-commercial / research-only | Some checkpoints | Disqualified | Never admitted to the production fleet. |

3.3 Approved Base-Model Registry

Maintained in the MLflow registry with signed entries. Schema:

| Field | Example |
|---|---|
| model_uid | base/qwen2.5-coder-32b |
| weights_digest | sha256:… (safetensors) |
| tokenizer_digest | sha256:… |
| license_spdx / legal_approval_ref | Apache-2.0 / LGL-2026-0142 |
| provenance_url / publisher | authoritative source |
| security_scan_ref / verdict | SCAN-2026-0331 / clean |
| model_card_ref | stored card + addendum |
| tier / serving_compat | Tier-M / vLLM,TRT-LLM |
| approval_state | approved \ |
| cosign_signature | Sigstore/cosign over the manifest |

Only approved bases may be parents of a fine-tune. Revocation propagates to all derived adapters (§8).
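
The "only approved bases may be parents" rule can be expressed as a check run before any fine-tune job starts. The sketch below is illustrative only: `BaseModelEntry` mirrors a subset of the schema above, while `sha256_digest`, `may_parent_finetune`, and the `"revoked"` state used in the usage note are hypothetical names. Signature verification with cosign is assumed to happen separately and is not shown.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseModelEntry:
    """Subset of the registry schema; field names adapted for illustration."""
    model_uid: str
    weights_digest: str         # "sha256:<hex>" pinned at approval time
    license_spdx: str
    legal_approval_ref: str
    security_scan_verdict: str  # e.g. "clean"
    approval_state: str         # e.g. "approved"


def sha256_digest(data: bytes) -> str:
    """Digest in the same 'sha256:<hex>' form the registry stores."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def may_parent_finetune(entry: BaseModelEntry, weights: bytes) -> bool:
    """True only if the entry is approved, scanned clean, legally signed off,
    and the weights on disk match the pinned digest."""
    return (
        entry.approval_state == "approved"
        and entry.security_scan_verdict == "clean"
        and bool(entry.legal_approval_ref)
        and sha256_digest(weights) == entry.weights_digest
    )
```

With this shape, a revoked entry (for example `approval_state="revoked"`) or a single changed byte in the weights makes the check fail, which is the behavior revocation propagation in §8 relies on.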


4. The fine-tuning pipeline

flowchart TD
    subgraph SRC["Sourced & governed inputs (§6)"]
        A1[Internal sanitized corpus<br/>code · docs · tickets]
        A2[Task / instruction datasets]
        A3[Preference pairs<br/>chosen / rejected]
        A4[Verifier-filtered synthetic data]
    end

    B[("Approved Base-Model<br/>Registry (§3)")] --> S1

    A1 --> S1["Stage 1: DAPT<br/>Domain-Adaptive Continued Pretraining<br/>(usually full FT or large LoRA)"]
    S1 --> S2["Stage 2: SFT<br/>Instruction / task tuning<br/>(LoRA or full FT)"]
    A2 --> S2
    A4 --> S2
    S2 --> S3["Stage 3: Preference alignment<br/>DPO / ORPO (TRL)"]
    A3 --> S3
    S3 --> S4["Stage 4: Task/Domain LoRA adapters<br/>(PEFT/QLoRA) — one per specialization (§5)"]

    S4 --> E["Eval gate (05)<br/>deterministic suites + abstention checks"]
    E -->|pass| R[("MLflow registry<br/>signed adapter + lineage (§7)")]
    E -->|fail| X[Repair / re-tune / reject]

    R --> SERVE["Multi-LoRA serving<br/>vLLM/Triton hot-swap on shared base"]

    classDef gate fill:#eef,stroke:#446;
    class E gate;

4.1 When to use each stage

| Stage | Purpose | Method | Use when | Skip when |
|---|---|---|---|---|
| 1: DAPT (Domain-Adaptive Continued Pretraining) | Inject domain distribution (embedded C idioms, regulatory register, internal APIs) | Continued pretraining on the sanitized internal corpus; full FT or large-rank LoRA; DeepSpeed/FSDP via Ray+Kueue | Base is unfamiliar with the domain vocabulary/style at the token level | Base already strong in-domain; only behavior shaping needed |
| 2: SFT (Supervised Fine-Tuning) | Teach task format & instruction following (test-gen schema, review rubric, doc templates) | TRL SFT; LoRA/QLoRA usually sufficient; full FT only if LoRA underfits | Almost always; this is the primary lever for task behavior | Task is purely retrieval/format-trivial |
| 3: Preference alignment | Shape preferences: prefer abstention over guessing, prefer compiling code, prefer cited regulatory claims | DPO / ORPO (TRL) on chosen/rejected pairs | Need to suppress over-confidence, hallucination, or unsafe patterns (§9) | No reliable preference signal yet |
| 4: Task/Domain LoRA | Narrow, swappable specialization | PEFT LoRA/QLoRA adapters on the aligned base | Per-domain (§5) capability needed without forking the base | One general adapter already meets the eval gate |

4.2 LoRA vs full fine-tuning: decision rule

| Use LoRA / QLoRA when… | Use full FT when… |
|---|---|
| Behavior/format adaptation on a capable base (most SFT, all per-domain adapters) | DAPT requires moving the base distribution substantially |
| You need many swappable specializations on shared weights (multi-LoRA serving) | Tokenizer/vocab must change |
| GPU/cost budget is tight (P6); QLoRA fits on fewer GPUs | LoRA repeatedly underfits the eval target after rank/data tuning |
| Fast iteration and small, signable artifacts are required | A new long-lived base derivative is justified and will itself enter the registry |

Default posture: prefer LoRA. Full FT is the exception and requires a documented justification plus its own registry entry as a derived base.
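
The decision rule and default posture above can be encoded so that every full-FT choice carries its reasons into the documented justification. This is a sketch under assumptions: `TuningContext`, its field names, and `choose_method` are hypothetical, and the four flags simply restate the right-hand column of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningContext:
    needs_distribution_shift: bool  # DAPT must move the base distribution substantially
    tokenizer_change: bool          # tokenizer or vocabulary must change
    lora_underfits: bool            # LoRA keeps missing the eval target after rank/data tuning
    new_base_justified: bool        # a long-lived derived base is justified


def choose_method(ctx: TuningContext) -> tuple[str, list[str]]:
    """Return ("lora", []) by default, or ("full-ft", reasons) when any
    full-FT trigger from the decision table applies."""
    reasons = []
    if ctx.needs_distribution_shift:
        reasons.append("DAPT requires a substantial distribution shift")
    if ctx.tokenizer_change:
        reasons.append("tokenizer/vocab must change")
    if ctx.lora_underfits:
        reasons.append("LoRA repeatedly underfits the eval target")
    if ctx.new_base_justified:
        reasons.append("new long-lived base derivative is justified")
    if reasons:
        # Full FT is the exception: the reasons feed the documented
        # justification and the derived base gets its own registry entry.
        return "full-ft", reasons
    return "lora", []
```

Returning the reasons alongside the method keeps the exception path auditable: an empty list can never accompany a full-FT decision.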


5. Domain specialization

Each domain ships as a named LoRA adapter over an approved (optionally DAPT'd) base, independently versioned, eval-gated, and signed. Multi-LoRA serving hot-swaps the right adapter per request.

| Domain adapter | Tier(s) | Specialized capability | Example tasks |
|---|---|---|---|
| embedded-fw-c | M / L | Embedded/firmware C, MISRA-style constraints, ISRs, fixed-point, no-malloc patterns | Generate/refactor firmware, flag undefined behavior, MISRA review |
| imaging-pipeline | M / V | Image-processing pipelines, array/tensor ops, numerical stability | Pipeline code-gen, perf refactor, artifact reasoning |
| dicom-adjacent | M / V | DICOM-adjacent metadata, header semantics, de-ID conventions | Parse/validate metadata, generate handling code |
| reg-doc-drafting | M / L | Regulatory register; IEC 62304 / ISO 14971 / Part 11 phrasing; traceable claims | Draft design history items, risk entries, SOUP rationale |
| test-generation | M | Coverage-oriented unit/integration test synthesis with assertions | Generate tests to push coverage and mutation score |
| code-review | M | Project-specific review rubric, severity classification | Structured review with cited rule ids |
| router-classify | S | Triage, risk/class tagging, escalation, redaction | Route + abstain-or-escalate gate |

5.1 Multimodal angle (Tier-V)

Design inputs in this domain are inherently visual. Tier-V adapters target:

  • Design PDFs / design history files → extract structured requirements/specs (feeds 06).
  • Schematics / block diagrams → derive interfaces, signal lists, architecture facts.
  • UI screenshots → verify UI against spec; detect drift.
  • Imaging artifacts → describe/triage visual anomalies (advisory only; never a clinical claim).

All Tier-V outputs are advisory inputs to deterministic verifiers, not autonomous decisions; class-C-affecting outputs always require human confirmation (P3).


6. Data strategy for fine-tuning

Training data is a controlled, versioned, signed artifact. The dataset is as much a regulated input as the model.

| Concern | Control |
|---|---|
| Sourcing | Internal code, design docs, tickets/issues, review history, pulled via governed connectors with access controls (07). |
| PII / PHI scrubbing | Mandatory de-identification before any data leaves the source boundary into training. Multi-pass: pattern + Tier-S redaction model + human spot-audit. No PHI in training sets, ever. |
| IP / license hygiene | Exclude third-party code with incompatible licenses; track provenance per record; quarantine unknown-origin snippets. |
| Dataset versioning & signing | Immutable, content-addressed dataset snapshots (dataset_uid + digest), registered in MLflow, cosign-signed. |
| Train/test contamination control | Use Tier-E embeddings to detect near-duplicates between training data and held-out eval sets (05); fail the build on overlap above threshold. Eval sets are sealed and never enter training. |
| Synthetic data | Generated by Tier-M/L, then verifier-filtered: only synthetic examples whose outputs pass deterministic checks (compiles, tests pass, schema-valid) are retained. Unverifiable synthetic data is discarded. |
| Provenance labels | Every record tagged internal / synthetic-verified / public-permissive for auditability and ablation. |

Contamination is a correctness-and-evidence risk, not a metric nuisance. A model trained on its own eval set produces inflated scores that cannot support a defensible ≥99.9% claim (P1/P4). Contamination control is therefore a release gate (see anti-pattern A2 in §11).
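
The embedding-based contamination control described in this section can be sketched as a pairwise cosine-similarity gate. The example is a toy under stated assumptions: embeddings are plain lists of floats standing in for Tier-E output, the function names are hypothetical, the 0.95 threshold is a placeholder, and "overlap above threshold" is interpreted here as "any train/eval pair at or above the similarity threshold". A production gate would use an approximate nearest-neighbor index rather than the quadratic loop shown.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def contaminated_pairs(train, held_out, threshold=0.95):
    """Return (train_id, eval_id, similarity) for every near-duplicate pair.

    `train` and `held_out` map record ids to embedding vectors.
    """
    hits = []
    for train_id, train_vec in train.items():
        for eval_id, eval_vec in held_out.items():
            sim = cosine(train_vec, eval_vec)
            if sim >= threshold:
                hits.append((train_id, eval_id, sim))
    return hits


def contamination_gate(train, held_out, threshold=0.95) -> bool:
    """True means the build may proceed; False means fail the build."""
    return not contaminated_pairs(train, held_out, threshold)
```

Returning the offending pairs, not just a boolean, gives the reviewer the record ids needed to quarantine the leaked training examples.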


7. Reproducibility & validation as first-class (P7)

A fine-tune must be reproducible and defensible as CSA evidence. We lock every input and sign the full lineage so an auditor can re-derive the artifact.

7.1 What is locked

| Locked input | Mechanism |
|---|---|
| Dataset | Content-addressed snapshot (dataset_uid + digest), signed (§6). |
| Base model | weights_digest + tokenizer_digest from the Approved Registry (§3). |
| Config | Hyperparameters, stage sequence, LoRA rank/targets, decode params, kept as versioned YAML (Axolotl/Llama-Factory/torchtune), digested. |
| Seeds | All RNG seeds (data shuffling, init, dropout) pinned. |
| Environment | Container image digest, CUDA/driver, library versions (PEFT/TRL/DeepSpeed) recorded. |
| Eval | Eval suite version + sealed test-set digest (05). |

7.2 Signed lineage chain

flowchart LR
    D["dataset_uid<br/>(signed digest)"] --> A
    BM["base model_uid<br/>(registry digest)"] --> A
    CFG["config + seeds + env<br/>(digest)"] --> A
    A["adapter / model artifact<br/>(SHA-256)"] --> EV
    EV["eval run<br/>(suite ver + scores)"] --> REG
    REG[("MLflow registry entry<br/>cosign-signed, SLSA provenance")]

Each edge is a cosign attestation; the registry entry carries SLSA provenance. The chain answers the auditor's question — "show me exactly how this model was produced and that nothing changed" — and is the artifact-level realization of Part 11 (P4) and CSA.
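
To make the chain concrete, the sketch below binds the locked inputs and the resulting artifact into one lineage record whose digest changes if any link changes. It covers only the digest side: signing each edge with cosign and attaching SLSA provenance are outside its scope. The function names and record keys are hypothetical, and canonical JSON is an assumed serialization choice.

```python
import hashlib
import json


def digest(obj) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()


def lineage_record(dataset_digest, base_digest, config, artifact_digest, eval_run):
    """Bind dataset, base model, config, artifact, and eval run into one record."""
    record = {
        "dataset": dataset_digest,
        "base_model": base_digest,
        "config": digest(config),      # hyperparameters + seeds + environment
        "artifact": artifact_digest,
        "eval": digest(eval_run),      # suite version + scores
    }
    record["lineage_digest"] = digest(record)
    return record


def verify(record) -> bool:
    """Recompute the lineage digest and compare it to the stored one."""
    body = {k: v for k, v in record.items() if k != "lineage_digest"}
    return digest(body) == record.get("lineage_digest")
```

Because the encoding is canonical, two configs with the same keys and values produce the same digest regardless of key order, while changing any single field (for example swapping the artifact digest) makes `verify` return `False`.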

7.3 How this satisfies P7 and D2-L4

  • P7 (reproducible): Any registered fine-tune is byte-reproducible from locked inputs; re-running the pipeline yields the same artifact digest (modulo documented nondeterminism, which is itself bounded and recorded).
  • D2-L4 (02): Validated Autonomous Agents require validated models. The signed lineage + eval-gated promotion (§8) is the model-side evidence package that lets an agent operate autonomously within its IEC 62304 class envelope.
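
The reproducibility property in the first bullet can be illustrated with a toy stand-in for a training run: with the seed pinned, two runs over the same inputs yield the same digest. This is only a sketch of the check's shape. `build_artifact` is a hypothetical placeholder (a seeded shuffle plus hashing), and real GPU training has additional sources of nondeterminism that the bullet's "documented nondeterminism" clause covers.

```python
import hashlib
import json
import random


def build_artifact(records, seed: int) -> str:
    """Stand-in for a training run: seeded shuffle, then hash the result."""
    rng = random.Random(seed)  # isolated RNG, so global random state is untouched
    order = list(records)
    rng.shuffle(order)
    payload = json.dumps({"seed": seed, "order": order}, sort_keys=True).encode("utf-8")
    return "sha256:" + hashlib.sha256(payload).hexdigest()


def is_reproducible(records, seed: int) -> bool:
    """Re-run the pipeline and confirm the artifact digest is identical."""
    return build_artifact(records, seed) == build_artifact(records, seed)
```

A re-validation job built on this pattern would compare the fresh digest against the one stored in the registry entry rather than against a second local run.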

8. Model lifecycle & governed evolution

Models evolve under a PCCP-style predetermined change control applied to the toolchain models (reusing the FDA PCCP concept; the device itself is governed separately). Promotion is eval-gated (05); no promotion bypasses the gate.

flowchart LR
    C["candidate<br/>(new adapter / base)"] --> SH["shadow<br/>(mirror traffic, no effect)"]
    SH --> CN["canary<br/>(small % real, guarded)"]
    CN --> PR["promote<br/>(default for tier/domain)"]
    PR --> DP["deprecate<br/>(retain weights + lineage)"]
    SH -->|fail gate| RJ[reject]
    CN -->|regression| RB[rollback]

| Phase | Gate / exit criteria | Evidence |
|---|---|---|
| Candidate | Lineage signed (§7); passes offline eval suite + abstention/calibration checks (§9) | Registry entry, eval run id |
| Shadow | Mirrored traffic; no regression vs incumbent on live distribution; no safety violations | Shadow comparison report |
| Canary | Bounded % of real traffic by IEC 62304 class (lower class first); cost-per-green-PR within budget (P6) | Canary metrics, guardrail logs |
| Promote | Meets/exceeds incumbent on all gated metrics; sign-off recorded | Promotion record, signatures |
| Deprecate | Successor promoted; weights and lineage retained for re-validation/defense | Retention record |

8.1 PCCP-style predetermined change control (toolchain models)

The Predetermined Change Control Plan for the model fleet specifies, in advance: the allowed change types (e.g., new domain adapter, refreshed SFT data), the fixed eval protocol that gates them, the rollback triggers, and the autonomy class affected. Changes inside the envelope flow through the lifecycle without re-opening the whole validation; changes outside it require plan revision. This ties to D1-L5 (02) (self-optimizing under governance) and the evaluation regime in 05. Base-model revocation (§3) forces immediate deprecation of all derived adapters.
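
The lifecycle diagram and the change envelope can be sketched as an explicit transition table plus an allow-list, so that any move not drawn in the diagram is rejected rather than silently accepted. Event names such as `gate_pass` and `successor_promoted`, the function names, and the two change-type identifiers are hypothetical; the change types themselves are the two examples named in the paragraph above.

```python
# Allowed lifecycle moves, keyed by (current phase, event). Only the edges
# drawn in the lifecycle diagram are present.
TRANSITIONS = {
    ("candidate", "gate_pass"): "shadow",
    ("shadow", "gate_pass"): "canary",
    ("shadow", "gate_fail"): "reject",
    ("canary", "gate_pass"): "promote",
    ("canary", "regression"): "rollback",
    ("promote", "successor_promoted"): "deprecate",
}

# Change types pre-authorized by the plan (illustrative subset).
ALLOWED_CHANGE_TYPES = {"new_domain_adapter", "refreshed_sft_data"}


def advance(phase: str, event: str) -> str:
    """Return the next phase, or raise if the move is not in the plan."""
    try:
        return TRANSITIONS[(phase, event)]
    except KeyError:
        raise ValueError(f"transition not allowed: {phase!r} on {event!r}") from None


def within_envelope(change_type: str) -> bool:
    """True if the change flows through the lifecycle without plan revision."""
    return change_type in ALLOWED_CHANGE_TYPES
```

A candidate therefore cannot jump straight to `promote`: `advance("candidate", "gate_pass")` yields `"shadow"`, and there is no key that maps `candidate` to anything else.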


9. Abstention & calibration

The ≥99.9% system property (P1) depends on models that decline rather than guess on out-of-distribution or high-risk inputs, escalating to a larger tier or a human. Abstention is a trained and served behavior, validated in 05.

| Mechanism | Where | Effect |
|---|---|---|
| Preference training for abstention | Stage 3 DPO/ORPO (§4) | Prefer "insufficient evidence → escalate" over a confident wrong answer |
| Calibrated confidence | Tier-S router + per-task heads | Confidence thresholds tuned so high-confidence ≈ high-accuracy |
| Abstain-or-escalate gate | Serving (router) | Below threshold → escalate tier or hand to human; never silently proceed |
| Selective prediction metrics | Eval (05) | Track coverage vs risk; gate on risk at fixed coverage, not raw accuracy |
| Class-aware strictness (P3) | Routing policy | IEC 62304 class C → conservative thresholds, mandatory human confirmation |

Calibration target: in the operating region, a high-confidence answer is correct ≥99.9% of the time; everything else abstains and routes to verification or a human. A miscalibrated-but-accurate model is not acceptable: abstention behavior is itself an eval gate.
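
"Risk at fixed coverage" can be computed directly from scored predictions: answer only the most confident fraction of inputs, then measure the error rate on that answered subset. The sketch below assumes each prediction is a `(confidence, correct)` pair; the function name and the input shape are illustrative, not the metric definition owned by 05.

```python
import math


def risk_at_coverage(preds, coverage: float):
    """Return (confidence_threshold, risk) when answering only the top
    `coverage` fraction of predictions ranked by confidence.

    `preds` is a list of (confidence, correct) pairs; risk is the error
    rate among answered items, and everything else counts as abstained.
    """
    if not preds or not 0 < coverage <= 1:
        raise ValueError("need predictions and a coverage in (0, 1]")
    ranked = sorted(preds, key=lambda p: p[0], reverse=True)
    answered_count = max(1, math.floor(coverage * len(ranked)))
    answered = ranked[:answered_count]
    errors = sum(1 for _, correct in answered if not correct)
    threshold = answered[-1][0]  # lowest confidence that still gets answered
    return threshold, errors / answered_count
```

Given `[(0.99, True), (0.95, True), (0.80, False), (0.60, True)]`, coverage 0.5 answers the top two items with zero errors, while coverage 1.0 answers all four and incurs a risk of 0.25. That gap is why the gate is stated on risk at a fixed coverage instead of raw accuracy.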


10. Right-sizing & cost linkage

Smallest-capable-model principle (P6): route every task to the smallest model that passes the gate; escalate only on abstention.

| Lever | Action | Cost effect (→ 08) |
|---|---|---|
| Tiered routing | Tier-S handles the high-frequency majority; Tier-L is rare | Largest single cost lever; collapses per-token spend |
| Distillation | Capture Tier-L behavior (traces, preferences) → train Tier-M/S adapters | Moves capability down a tier at a fraction of inference cost |
| Quantization | FP8/AWQ/GPTQ per tier (§2) | More replicas per GPU; lower latency |
| Multi-LoRA hot-swap | Many domain adapters on one shared base | Eliminates per-domain base replicas; high GPU packing |
| Abstention budgeting | Escalate only when justified (§9) | Prevents needless large-model calls |
| Right-sized context | Use the minimum context window that passes eval | Lower KV-cache cost |

Distillation note. The verifier-filtered synthetic pipeline (§6) is the distillation substrate: only Tier-L outputs that pass deterministic verification become training data for smaller adapters, so distillation transfers verified behavior, not hallucinations. The primary metric for right-sizing decisions is cost-per-green-PR (P6), owned in 08.
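
Verifier filtering, as used in the distillation note, keeps a synthetic example only when every deterministic check passes on its output. The sketch below uses a single check (does the output parse as Python) as a stand-in for the "compiles, tests pass, schema-valid" family. The `compiles` and `filter_synthetic` names and the example record shape (`prompt`, `output`) are hypothetical; the `synthetic-verified` label is the provenance tag from §6.

```python
import ast


def compiles(source: str) -> bool:
    """Deterministic check: does the text parse as valid Python?"""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def filter_synthetic(examples, verifiers):
    """Keep only examples whose output passes every verifier.

    Retained records are tagged with the provenance label from §6;
    anything that fails a check is discarded, never repaired here.
    """
    kept = []
    for example in examples:
        if all(check(example["output"]) for check in verifiers):
            kept.append({**example, "provenance": "synthetic-verified"})
    return kept
```

Passing verifiers as a list lets further deterministic checks (for example a test runner or a schema validator) be added without changing the filter itself.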


11. Anti-patterns

| # | Anti-pattern | Why it fails here | Required control |
|---|---|---|---|
| A1 | Unversioned models | No reproducibility, no defensible evidence; violates P7/P4 | Every model is a signed registry entry with full lineage (§7) |
| A2 | Training on the eval set | Inflated scores cannot support ≥99.9% (P1); fraudulent evidence | Embedding-based contamination gate; sealed eval sets (§6, 05) |
| A3 | Unscanned weights | Unsafe deserialization / supply-chain compromise | Mandatory weight scan + safetensors before approval (§3) |
| A4 | License violation | Legal and regulatory exposure; non-commercial weights in production | Per-model legal sign-off in the registry (§3.2) |
| A5 | SaaS LLM API "just for this one thing" | Breaks sovereignty, reproducibility, Part 11 chain (P7) | Hard constraint: open-weight self-host only (§1.2) |
| A6 | One big model for everything | Cost-fatal, latency-fatal, coarse governance (§1.1) | Tiered fleet + smallest-capable routing (§2, §10) |
| A7 | Over-confident models (no abstention) | Confident wrong answers break the ≥99.9% property | Abstention training + calibration gates (§9) |
| A8 | Promotion without eval gate | Unvalidated change reaches users; violates D2-L4 | Eval-gated candidate→shadow→canary→promote (§8) |
| A9 | PHI/IP in training data | Privacy/IP breach; non-compliant corpus | Mandatory scrubbing + provenance labels (§6, 07) |
| A10 | Unverified synthetic data | Trains models on hallucinations; degrades correctness | Verifier-filtering only; discard unverifiable (§6) |

Summary

The model strategy is a portfolio of small, specialized, signed, open-weight models governed as regulated components: tiered for cost/latency/specialization, fine-tuned through a locked-and-signed pipeline (DAPT → SFT → DPO/ORPO → LoRA), specialized per domain via hot-swappable adapters, and promoted only through eval-gated, PCCP-style change control. The correctness guarantee lives in the harness (P5) and its deterministic verification (P2), not in model scale; the models contribute calibrated capability and disciplined abstention (P1, §9), and reproducible, signed lineage (P7, §7) is what makes every fine-tune defensible CSA evidence. Implementation detail for evaluation lives in 05 and for economics in 08.