Agentic-Native SDLC for Regulated Medical Device Engineering#

Figure A — Reference Architecture (seven planes) · open SVG

A reference framework for transitioning a 1000+ developer, Kubernetes-native medical device organization (GE HealthCare / Siemens Healthineers class) from human-authored, AI-assisted development to a validated, agent-native software development lifecycle — under strict FDA / IEC 62304 obligations, with self-hosted fine-tuned models only, deterministic evaluation, and disciplined GPU/token economics.

This is not a vibe-coding playbook. The thesis throughout: generation is cheap; correctness, validation, traceability, and cost control are the engineering. Agents propose; deterministic verifiers and qualified humans dispose.

The problem in one paragraph#

The organization wants the velocity of agentic development but operates under constraints that rule out the default industry playbook: regulated software demands ≥99.9% release-gate correctness and full auditability; SaaS LLM APIs (Claude/OpenAI/Gemini) are excluded on cost and data-sovereignty grounds, so all inference and training are self-hosted, fine-tuned, open-weight, multi-model; agentic loops are GPU-expensive, so cost must be engineered down to cost-per-verified-task; and everything must map cleanly onto IEC 62304, ISO 13485 / FDA QMSR, ISO 14971, FDA CSA, GAMP 5, and 21 CFR Part 11.

The answer in one paragraph#

A six-level maturity model (ASMM-Med) moves the org from ungoverned shadow AI → governed assistance → spec-driven bounded automation → orchestrated agentic workflows → validated autonomous agents → a self-optimizing agentic enterprise. Capability is gated by assurance: autonomy can never outrun governance, evaluation, and security. A K8s-native reference architecture serves a tiered fleet of self-hosted fine-tuned models behind a routing gateway, wraps every probabilistic generation in deterministic verifiers + HITL to earn the 99.9% gate, runs agents in zero-trust sandboxes under a policy server, and meters GPU/token cost as a first-class SLO.

Document map#

#	Document	Read it for
00	this `README.md`	Executive overview, navigation
01	Requirements	Functional, non-functional, regulatory, data, model, and cost requirements (the "shall" statements)
02	Maturity Model (ASMM-Med)	The centerpiece: 6 levels × 8 dimensions, gate rules, scoring, KPIs, anti-patterns
03	Reference Architecture	K8s-native platform: serving, orchestration, data/RAG, control planes, topology
04	Model Strategy & Fine-Tuning	The multi-model fleet, continued-pretrain → SFT → preference → LoRA, reproducibility
05	Evaluation & Validation	How 99.9% is earned: deterministic verifiers, eval suites, the assurance argument
06	Agentic Workflows	Concrete agent patterns mapped to the SDLC and IEC 62304 activities
07	Security & Compliance	Zero-trust, supply chain, prompt-injection defense, CSA/Part 11, autonomy authorization
08	Token & GPU Economics	FinOps: routing, caching, quantization, cost-per-green-PR, build-vs-buy math
09	Adoption Roadmap	Phased plan, owners, exit criteria, org design, risks

Suggested reading order: 02 (frame) → 01 (obligations) → 03/04 (build) → 05 (assurance) → 06 (operation) → 07 (control) → 08 (cost) → 09 (sequence).

Seven invariant principles (carried across every document)#

**99.9% is a system property, not a model property** — earned at the gate via Generate → Verify → Repair → Gate, not assumed at generation.
Determinism wraps probabilism — every check that can be deterministic must be, and on the critical path to merge.
Risk-proportional autonomy — IEC 62304 safety class (A/B/C) sets the leash; Class C is always dual human control.
Everything an agent does is evidence — immutable, attributable, replayable (21 CFR Part 11 grade).
The harness is the product — Agent = Model + Harness; ~90% of behavior and ~100% of assurance live in the harness.
Cost is per verified task — the governing metric is cost-per-green-PR, not cost-per-token.
Self-hosted, sovereign, reproducible — all models/datasets/training versioned, signed, and regenerable for audit.

The 99.9% question, answered up front#

No self-hosted open-weight model deterministically produces 99.9%-correct regulated code. We do not try to make it. Instead:

Figure B — The 99.9% Release-Gate Assurance Pipeline · open SVG

The model is the least trusted component. Trust is manufactured by everything around it. Full treatment in 05-evaluation-and-validation.

The cost question, answered up front#

Self-hosting trades API OpEx for a GPU fleet (CapEx) + operations (OpEx). We make it pay by:

Tiered routing — a 1–8B "reflex" model handles the majority of low-complexity calls; 70B+/MoE "reasoners" are invoked sparingly (see 04, 08).
Caching — KV-cache reuse, prompt/semantic caching, retrieval caching.
Efficiency — quantization (FP8/INT8/AWQ), speculative decoding, continuous batching, MIG partitioning, scale-to-zero for spiky workloads.
Budget guardrails in-loop — hard token/GPU stops per task; eval-cost budgeting; reasoning-effort caps.
The right metric — optimize cost-per-green-PR, because an expensive change that passes all gates beats a cheap one that escapes a defect into a regulated product.

Scope & assumptions (challenge these)#

In scope: AI agents that build/test/document/maintain regulated software (the production/quality-system tooling track).
Adjacent (enabled, not detailed): AI shipped inside the device (SaMD) — a separate submission track that reuses the same eval/reproducibility/PCCP muscles (07 §"Two regulated tracks").
Platform assumption: existing Kubernetes estate with GPU capacity (on-prem and/or sovereign VPC), service mesh, and a mature CI/CD + QMS.
Model assumption: open-weight bases (Qwen / Llama / DeepSeek / Mistral / StarCoder families + a vision-language tier), fine-tuned in-house; no external inference.
Regulatory context: US FDA-centric with EU MDR / AI Act awareness; dates as of May–June 2026 (QMSR in effect; FDA CSA final; FDA AI-lifecycle + PCCP guidance available).

Authored as an internal engineering/quality reference. Every quantitative threshold (e.g., specific coverage %, GPU counts, SLOs) is a placeholder to be set by the organization's risk and capacity analysis, not a vendor claim.