Agentic-Native SDLC for Regulated Medical Device Engineering
A reference framework for transitioning a 1000+ developer, Kubernetes-native medical device organization (GE HealthCare / Siemens Healthineers class) from human-authored, AI-assisted development to a validated, agent-native software development lifecycle — under strict FDA / IEC 62304 obligations, with self-hosted fine-tuned models only, deterministic evaluation, and disciplined GPU/token economics.
This is not a vibe-coding playbook. The thesis throughout: generation is cheap; correctness, validation, traceability, and cost control are the engineering. Agents propose; deterministic verifiers and qualified humans dispose.
The problem in one paragraph
The organization wants the velocity of agentic development but operates under constraints that rule out the default industry playbook: regulated software demands ≥99.9% release-gate correctness and full auditability; SaaS LLM APIs (Claude/OpenAI/Gemini) are excluded on cost and data-sovereignty grounds, so all inference and training are self-hosted, fine-tuned, open-weight, multi-model; agentic loops are GPU-expensive, so cost must be engineered down to cost-per-verified-task; and everything must map cleanly onto IEC 62304, ISO 13485 / FDA QMSR, ISO 14971, FDA CSA, GAMP 5, and 21 CFR Part 11.
The answer in one paragraph
A six-level maturity model (ASMM-Med) moves the org from ungoverned shadow AI → governed assistance → spec-driven bounded automation → orchestrated agentic workflows → validated autonomous agents → a self-optimizing agentic enterprise. Capability is gated by assurance: autonomy can never outrun governance, evaluation, and security. A K8s-native reference architecture serves a tiered fleet of self-hosted fine-tuned models behind a routing gateway, wraps every probabilistic generation in deterministic verifiers + HITL to earn the 99.9% gate, runs agents in zero-trust sandboxes under a policy server, and meters GPU/token cost as a first-class SLO.
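The rule that autonomy can never outrun governance, evaluation, and security can be read as a simple cap: the autonomy an organization may exercise is limited by its weakest assurance dimension. The sketch below is illustrative only. It assumes levels are numbered 0 to 5 and scores just the three dimensions named above; the function name `permitted_autonomy_level` and the dictionary layout are hypothetical, and the full model in 02 defines eight dimensions and its own gate rules.

```python
# Assumption: the three assurance dimensions named in the overview.
# The full maturity model (document 02) scores eight dimensions.
ASSURANCE_DIMENSIONS = ("governance", "evaluation", "security")


def permitted_autonomy_level(requested_level: int, assurance_scores: dict[str, int]) -> int:
    """Cap a requested autonomy level at the weakest assurance score.

    Levels are assumed to run from 0 (ungoverned) to 5 (self-optimizing).
    """
    missing = [d for d in ASSURANCE_DIMENSIONS if d not in assurance_scores]
    if missing:
        # An unscored dimension is treated as a hard stop, not as a pass.
        raise ValueError(f"unscored assurance dimensions: {missing}")
    ceiling = min(assurance_scores[d] for d in ASSURANCE_DIMENSIONS)
    return min(requested_level, ceiling)


# Example: a team asks for level 4, but evaluation maturity is only level 2.
scores = {"governance": 3, "evaluation": 2, "security": 4}
level = permitted_autonomy_level(4, scores)
print(level)
```

Taking the minimum rather than an average is the design point: a strong security score cannot compensate for a weak evaluation score.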
Document map
| # | Document | Read it for |
|---|---|---|
| 00 | this README.md | Executive overview, navigation |
| 01 | Requirements | Functional, non-functional, regulatory, data, model, and cost requirements (the "shall" statements) |
| 02 | Maturity Model (ASMM-Med) | The centerpiece: 6 levels × 8 dimensions, gate rules, scoring, KPIs, anti-patterns |
| 03 | Reference Architecture | K8s-native platform: serving, orchestration, data/RAG, control planes, topology |
| 04 | Model Strategy & Fine-Tuning | The multi-model fleet, continued-pretrain → SFT → preference → LoRA, reproducibility |
| 05 | Evaluation & Validation | How 99.9% is earned: deterministic verifiers, eval suites, the assurance argument |
| 06 | Agentic Workflows | Concrete agent patterns mapped to the SDLC and IEC 62304 activities |
| 07 | Security & Compliance | Zero-trust, supply chain, prompt-injection defense, CSA/Part 11, autonomy authorization |
| 08 | Token & GPU Economics | FinOps: routing, caching, quantization, cost-per-green-PR, build-vs-buy math |
| 09 | Adoption Roadmap | Phased plan, owners, exit criteria, org design, risks |
Suggested reading order: 02 (frame) → 01 (obligations) → 03/04 (build) → 05 (assurance) → 06 (operation) → 07 (control) → 08 (cost) → 09 (sequence).
Seven invariant principles (carried across every document)
- **99.9% is a system property, not a model property**: earned at the gate via Generate → Verify → Repair → Gate, not assumed at generation.
- **Determinism wraps probabilism**: every check that can be deterministic must be deterministic, and must sit on the critical path to merge.
- **Risk-proportional autonomy**: IEC 62304 safety class (A/B/C) sets the leash; Class C always requires dual human control.
- **Everything an agent does is evidence**: immutable, attributable, replayable (21 CFR Part 11 grade).
- **The harness is the product**: Agent = Model + Harness; ~90% of behavior and ~100% of assurance live in the harness.
- **Cost is per verified task**: the governing metric is cost-per-green-PR, not cost-per-token.
- **Self-hosted, sovereign, reproducible**: all models, datasets, and training runs are versioned, signed, and regenerable for audit.
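The risk-proportional autonomy principle can be sketched as a merge rule keyed on IEC 62304 safety class. This is a minimal illustration, not the framework's policy: only the Class C requirement (two distinct humans) comes from the principle above. The approval counts for Class A and Class B, the `may_merge` function, and the approver-ID format are placeholder assumptions that the organization's risk analysis would replace.

```python
from enum import Enum


class SafetyClass(Enum):
    A = "A"
    B = "B"
    C = "C"


# Only the Class C value (dual human control) is taken from the principles.
# The Class A and Class B values are placeholders for illustration.
REQUIRED_HUMAN_APPROVALS = {
    SafetyClass.A: 1,
    SafetyClass.B: 1,
    SafetyClass.C: 2,
}


def may_merge(safety_class: SafetyClass, verifiers_passed: bool, human_approvers: list[str]) -> bool:
    """Return True only if deterministic checks passed and enough distinct humans approved."""
    if not verifiers_passed:
        # Human approval never overrides a failed deterministic verifier.
        return False
    # Normalize IDs so the same person cannot count twice.
    distinct = {a.strip().lower() for a in human_approvers if a.strip()}
    return len(distinct) >= REQUIRED_HUMAN_APPROVALS[safety_class]


# Example: a Class C change approved twice by the same person is still blocked.
blocked = may_merge(SafetyClass.C, True, ["alice", "Alice"])
allowed = may_merge(SafetyClass.C, True, ["alice", "bob"])
print(blocked, allowed)
```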
The 99.9% question, answered up front
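No self-hosted open-weight model deterministically produces 99.9%-correct regulated code, and we do not try to make one do so. Instead, the 99.9% is earned at the release gate by the Generate → Verify → Repair → Gate loop, with deterministic verifiers and qualified humans deciding what merges.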
The model is the least trusted component. Trust is manufactured by everything around it. Full treatment in 05-evaluation-and-validation.
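A minimal sketch of the Generate → Verify → Repair → Gate loop follows, to make concrete how trust comes from the surrounding harness rather than from the model. Everything named here is hypothetical: `generate`, `repair`, and the verifiers stand in for a model call and for real deterministic checks (compilers, tests, static analysis), and the repair limit of two rounds is an arbitrary placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable

# A verifier returns a list of failure messages; an empty list means pass.
Verifier = Callable[[str], list[str]]


@dataclass
class GateResult:
    passed: bool
    candidate: str
    repair_rounds: int
    evidence: list[str] = field(default_factory=list)


def generate_verify_repair_gate(
    task: str,
    generate: Callable[[str], str],
    repair: Callable[[str, list[str]], str],
    verifiers: list[Verifier],
    max_repairs: int = 2,
) -> GateResult:
    """Run the loop; the gate opens only when every verifier reports no failures."""
    candidate = generate(task)
    evidence: list[str] = []
    for round_no in range(max_repairs + 1):
        failures = [msg for verify in verifiers for msg in verify(candidate)]
        evidence.append(f"round {round_no}: {len(failures)} failure(s)")
        if not failures:
            return GateResult(True, candidate, round_no, evidence)
        if round_no < max_repairs:
            candidate = repair(candidate, failures)
    # Repair budget exhausted: the gate stays closed and a human takes over.
    return GateResult(False, candidate, max_repairs, evidence)


# Toy stand-ins for the model and a deterministic check.
def must_define_y(code: str) -> list[str]:
    return [] if "y =" in code else ["y is not defined"]


result = generate_verify_repair_gate(
    task="define x and y",
    generate=lambda task: "x = 1",
    repair=lambda code, failures: code + "\ny = 2",
    verifiers=[must_define_y],
)
print(result.passed, result.repair_rounds, result.evidence)
```

The evidence list is kept on both the pass and fail paths, in line with the principle that everything an agent does is evidence.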
The cost question, answered up front
Self-hosting trades API OpEx for a GPU fleet (CapEx) + operations (OpEx). We make it pay by:
- Tiered routing — a 1–8B "reflex" model handles the majority of low-complexity calls; 70B+/MoE "reasoners" are invoked sparingly (see 04, 08).
- Caching — KV-cache reuse, prompt/semantic caching, retrieval caching.
- Efficiency — quantization (FP8/INT8/AWQ), speculative decoding, continuous batching, MIG partitioning, scale-to-zero for spiky workloads.
- Budget guardrails in-loop — hard token/GPU stops per task; eval-cost budgeting; reasoning-effort caps.
- The right metric — optimize cost-per-green-PR, because an expensive change that passes all gates beats a cheap one that escapes a defect into a regulated product.
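The cost-per-green-PR metric can be made concrete with a small calculation. The sketch below assumes a hypothetical `TaskRun` record and uses invented numbers purely for illustration; it is not a cost model from this framework. The point it shows is that failed attempts are charged to the changes that actually pass the gate, so a cheaper-per-call configuration can still lose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRun:
    gpu_cost: float   # inference cost for the task, in any consistent currency unit
    eval_cost: float  # cost of running verifiers and eval suites for the task
    green: bool       # True if the resulting PR passed every gate


def cost_per_green_pr(runs: list[TaskRun]) -> float:
    """Total spend across all runs divided by the number of runs that went green."""
    total = sum(r.gpu_cost + r.eval_cost for r in runs)
    green = sum(1 for r in runs if r.green)
    if green == 0:
        # Spend with nothing verified: report infinite cost rather than dividing by zero.
        return float("inf")
    return total / green


# Invented illustration: ten tasks under each configuration.
# "small" is cheap per task but only 3 of 10 PRs pass the gates.
small = [TaskRun(1.0, 0.5, green=(i < 3)) for i in range(10)]
# "large" costs more per task but 8 of 10 PRs pass.
large = [TaskRun(3.0, 0.5, green=(i < 8)) for i in range(10)]

small_cost = cost_per_green_pr(small)  # 15.0 total / 3 green
large_cost = cost_per_green_pr(large)  # 35.0 total / 8 green
print(small_cost, large_cost)
```

With these made-up figures the configuration that costs more per task is the cheaper one per verified change, which is the comparison the metric is meant to surface.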
Scope & assumptions (challenge these)
- In scope: AI agents that build/test/document/maintain regulated software (the production/quality-system tooling track).
- Adjacent (enabled, not detailed): AI shipped inside the device (SaMD) — a separate submission track that reuses the same eval/reproducibility/PCCP muscles (07 §"Two regulated tracks").
- Platform assumption: existing Kubernetes estate with GPU capacity (on-prem and/or sovereign VPC), service mesh, and a mature CI/CD + QMS.
- Model assumption: open-weight bases (Qwen / Llama / DeepSeek / Mistral / StarCoder families + a vision-language tier), fine-tuned in-house; no external inference.
- Regulatory context: US FDA-centric with EU MDR / AI Act awareness; dates as of May–June 2026 (QMSR in effect; FDA CSA final; FDA AI-lifecycle + PCCP guidance available).
Authored as an internal engineering/quality reference. Every quantitative threshold (e.g., specific coverage %, GPU counts, SLOs) is a placeholder to be set by the organization's risk and capacity analysis, not a vendor claim.