
Agentic-Native SDLC for Regulated Medical Device Engineering

Figure A: Reference Architecture, seven planes (Kubernetes-native, self-hosted)

  • Developer Interface: IDE plugin, CLI / terminal agent, MCP client, Review UI
  • Agent & Orchestration: Planner, Coder, Test, Review, Integrator · A2A, Argo Workflows
  • Harness & Control: Routing Gateway, Policy Server, Sandbox Mgr, Hooks, Sessions / Memory
  • Model Serving (self-hosted): Tier-S, Tier-M, Tier-L, Tier-V, Tier-E, vLLM · Triton · KServe
  • Data & Knowledge: Code Graph, Vector Store, Reg. Corpus, MLflow Registry, Context Store
  • Platform & Infrastructure: Kubernetes, GPU Operator, Istio + SPIRE, Vault, KEDA · Kueue
  • Governance & Audit: 21 CFR Part 11, WORM evidence

A reference framework for transitioning a 1000+ developer, Kubernetes-native medical device organization (GE HealthCare / Siemens Healthineers class) from human-authored, AI-assisted development to a validated, agent-native software development lifecycle — under strict FDA / IEC 62304 obligations, with self-hosted fine-tuned models only, deterministic evaluation, and disciplined GPU/token economics.

This is not a vibe-coding playbook. The thesis throughout: generation is cheap; correctness, validation, traceability, and cost control are the engineering. Agents propose; deterministic verifiers and qualified humans dispose.


The problem in one paragraph

The organization wants the velocity of agentic development but operates under constraints that rule out the default industry playbook: regulated software demands ≥99.9% release-gate correctness and full auditability; SaaS LLM APIs (Claude/OpenAI/Gemini) are excluded on cost and data-sovereignty grounds, so all inference and training are self-hosted, fine-tuned, open-weight, multi-model; agentic loops are GPU-expensive, so cost must be engineered down to cost-per-verified-task; and everything must map cleanly onto IEC 62304, ISO 13485 / FDA QMSR, ISO 14971, FDA CSA, GAMP 5, and 21 CFR Part 11.

The answer in one paragraph

A six-level maturity model (ASMM-Med) moves the org from ungoverned shadow AI → governed assistance → spec-driven bounded automation → orchestrated agentic workflows → validated autonomous agents → a self-optimizing agentic enterprise. Capability is gated by assurance: autonomy can never outrun governance, evaluation, and security. A K8s-native reference architecture serves a tiered fleet of self-hosted fine-tuned models behind a routing gateway, wraps every probabilistic generation in deterministic verifiers + HITL to earn the 99.9% gate, runs agents in zero-trust sandboxes under a policy server, and meters GPU/token cost as a first-class SLO.
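One way to read "autonomy can never outrun governance, evaluation, and security" is as a ceiling rule: the autonomy an organization may exercise is capped by its weakest assurance dimension. The sketch below illustrates that reading only. The 0 to 5 numbering, the `Level` names, and the `permitted_autonomy` helper are assumptions made for illustration, and only the three dimensions named in this paragraph are modeled; the actual gate rules and the full eight dimensions are defined in 02.

```python
from enum import IntEnum


class Level(IntEnum):
    # Six levels as listed in the paragraph above; the 0-5 numbering is assumed.
    SHADOW_AI = 0
    GOVERNED_ASSISTANCE = 1
    SPEC_DRIVEN_AUTOMATION = 2
    ORCHESTRATED_WORKFLOWS = 3
    VALIDATED_AUTONOMY = 4
    SELF_OPTIMIZING = 5


# Only the three assurance dimensions named in this README are modeled here.
ASSURANCE_DIMENSIONS = ("governance", "evaluation", "security")


def permitted_autonomy(requested: Level, assurance: dict) -> Level:
    """Cap the requested autonomy level at the weakest assurance score."""
    missing = [d for d in ASSURANCE_DIMENSIONS if d not in assurance]
    if missing:
        # An unscored dimension is treated as a hard error, not as a pass.
        raise ValueError(f"unscored assurance dimensions: {missing}")
    ceiling = min(assurance[d] for d in ASSURANCE_DIMENSIONS)
    return Level(min(requested, ceiling))


scores = {
    "governance": Level.ORCHESTRATED_WORKFLOWS,
    "evaluation": Level.SPEC_DRIVEN_AUTOMATION,
    "security": Level.VALIDATED_AUTONOMY,
}
granted = permitted_autonomy(Level.VALIDATED_AUTONOMY, scores)
print(granted.name)
```

In this example the team asks for validated autonomy, but evaluation maturity is the weakest dimension, so the granted level is held at spec-driven bounded automation until evaluation catches up.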


Document map

| # | Document | Read it for |
|---|---|---|
| 00 | this README.md | Executive overview, navigation |
| 01 | Requirements | Functional, non-functional, regulatory, data, model, and cost requirements (the "shall" statements) |
| 02 | Maturity Model (ASMM-Med) | The centerpiece: 6 levels × 8 dimensions, gate rules, scoring, KPIs, anti-patterns |
| 03 | Reference Architecture | K8s-native platform: serving, orchestration, data/RAG, control planes, topology |
| 04 | Model Strategy & Fine-Tuning | The multi-model fleet, continued-pretrain → SFT → preference → LoRA, reproducibility |
| 05 | Evaluation & Validation | How 99.9% is earned: deterministic verifiers, eval suites, the assurance argument |
| 06 | Agentic Workflows | Concrete agent patterns mapped to the SDLC and IEC 62304 activities |
| 07 | Security & Compliance | Zero-trust, supply chain, prompt-injection defense, CSA/Part 11, autonomy authorization |
| 08 | Token & GPU Economics | FinOps: routing, caching, quantization, cost-per-green-PR, build-vs-buy math |
| 09 | Adoption Roadmap | Phased plan, owners, exit criteria, org design, risks |

Suggested reading order: 02 (frame) → 01 (obligations) → 03/04 (build) → 05 (assurance) → 06 (operation) → 07 (control) → 08 (cost) → 09 (sequence).


Seven invariant principles (carried across every document)

  1. 99.9% is a system property, not a model property: it is earned at the gate via Generate → Verify → Repair → Gate, not assumed at generation.
  2. Determinism wraps probabilism: every check that can be deterministic must be deterministic, and must sit on the critical path to merge.
  3. Risk-proportional autonomy: the IEC 62304 safety class (A/B/C) sets the leash; Class C always requires dual human control.
  4. Everything an agent does is evidence: immutable, attributable, replayable (21 CFR Part 11 grade).
  5. The harness is the product: Agent = Model + Harness; ~90% of behavior and ~100% of assurance live in the harness.
  6. Cost is per verified task: the governing metric is cost-per-green-PR, not cost-per-token.
  7. Self-hosted, sovereign, reproducible: all models, datasets, and training runs are versioned, signed, and regenerable for audit.
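Principle 4 asks for evidence that is immutable and attributable. A common way to make a log tamper-evident is to chain each entry to the hash of the previous one, so that any later edit breaks verification. The sketch below shows that idea using only the Python standard library. The `EvidenceLog` class, its field names, and the example actors are hypothetical; it is an illustration of hash chaining, not a 21 CFR Part 11 implementation, and it omits timestamps, electronic signatures, and WORM storage.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always hashes to the same value.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


@dataclass
class EvidenceLog:
    """Append-only, hash-chained record of agent and human actions (sketch)."""
    entries: list = field(default_factory=list)

    def append(self, actor: str, action: str, detail: dict) -> str:
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        payload = {"actor": actor, "action": action, "detail": detail}
        entry = {"prev": prev, "payload": payload, "hash": _digest(prev, payload)}
        self.entries.append(entry)
        return entry["hash"]

    def verify(self) -> bool:
        # Recompute every hash from the start; any edited entry breaks the chain.
        prev = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["payload"]):
                return False
            prev = entry["hash"]
        return True


log = EvidenceLog()
log.append("coder-agent", "propose_patch", {"task": "T-1", "model_tier": "M"})
log.append("reviewer@example.org", "approve", {"task": "T-1"})
print(log.verify())
```

Attribution here is just the `actor` string; a real deployment would bind it to an authenticated identity and store the chain where entries cannot be rewritten.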

The 99.9% question, answered up front

No self-hosted open-weight model deterministically produces 99.9%-correct regulated code, and we do not try to make one do so. Instead:

Figure B: The 99.9% Release-Gate Assurance Pipeline (Generate → Verify → Repair → Gate → Sign)

  1. Task + Spec (BDD / Gherkin)
  2. Model Fleet (S / M / L / V)
  3. Deterministic Verifiers: build, type-check, unit, property, mutation, differential, fuzz, SAST/DAST, schema, formal, policy
  4. Eval Gate (≥ 99.9%)
  5. HITL by safety class (Class C → dual control)
  6. PR + Audit record

  • fail → repair loop (bounded budget)
  • abstain → escalate

The model is the least trusted component. Trust is manufactured by everything around it. Full treatment in 05-evaluation-and-validation.
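The control flow of the pipeline can be sketched in a few lines: generate a candidate, run every deterministic verifier, feed failures back for repair within a bounded budget, and escalate on abstention or exhaustion. Everything named below (`run_gate`, `GateResult`, the verifier and generator signatures, the default of three attempts) is hypothetical. Only "Class C → dual control" comes from this document; the approval counts for Classes A and B are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional

# A verifier returns None on pass, or a failure message to feed the repair step.
Verifier = Callable[[str], Optional[str]]
# A generator receives the task plus prior failure messages; None means "abstain".
Generator = Callable[[str, List[str]], Optional[str]]

# Only Class C = dual control is stated in the text; A and B are placeholders.
APPROVALS_BY_CLASS = {"A": 1, "B": 1, "C": 2}


class Outcome(Enum):
    GATED = "gated"          # all verifiers passed; ready for human sign-off
    ESCALATED = "escalated"  # abstained or budget exhausted; hand to a human


@dataclass
class GateResult:
    outcome: Outcome
    candidate: Optional[str]
    attempts: int
    approvals_required: int


def run_gate(task: str, safety_class: str, generate: Generator,
             verifiers: List[Verifier], max_attempts: int = 3) -> GateResult:
    approvals = APPROVALS_BY_CLASS[safety_class]
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(task, feedback)
        if candidate is None:  # abstain -> escalate
            return GateResult(Outcome.ESCALATED, None, attempt, approvals)
        failures = []
        for verify in verifiers:
            message = verify(candidate)
            if message is not None:
                failures.append(message)
        if not failures:  # every deterministic check passed
            return GateResult(Outcome.GATED, candidate, attempt, approvals)
        feedback = failures  # fail -> repair loop with verifier output
    # Bounded budget exhausted without a passing candidate.
    return GateResult(Outcome.ESCALATED, None, max_attempts, approvals)


def toy_generator(task: str, feedback: List[str]) -> Optional[str]:
    # Stand-in for a model call: wrong first, corrected once it sees feedback.
    return "return a + b" if feedback else "return a - b"


def toy_verifier(candidate: str) -> Optional[str]:
    return None if "+" in candidate else "expected addition"


result = run_gate("implement add(a, b)", "C", toy_generator, [toy_verifier])
print(result.outcome.value, result.attempts, result.approvals_required)
```

The model never decides whether its own output is acceptable: the verdict comes from the verifiers, and a passing candidate still waits for the number of human approvals its safety class requires.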


The cost question, answered up front

Self-hosting trades API OpEx for a GPU fleet (CapEx) + operations (OpEx). We make it pay by:

  • Tiered routing — a 1–8B "reflex" model handles the majority of low-complexity calls; 70B+/MoE "reasoners" are invoked sparingly (see 04, 08).
  • Caching — KV-cache reuse, prompt/semantic caching, retrieval caching.
  • Efficiency — quantization (FP8/INT8/AWQ), speculative decoding, continuous batching, MIG partitioning, scale-to-zero for spiky workloads.
  • Budget guardrails in-loop — hard token/GPU stops per task; eval-cost budgeting; reasoning-effort caps.
  • The right metric — optimize cost-per-green-PR, because an expensive change that passes all gates beats a cheap one that escapes a defect into a regulated product.
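The last two bullets can be made concrete with a small calculation. The sketch below defines cost-per-green-PR as total spend across all task runs (including failed ones) divided by the number of PRs that passed every gate, and adds a hard per-task token stop. That definition, the `TaskRun` and `TokenBudget` names, and all numbers are illustrative assumptions; the document's own treatment is in 08.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TaskRun:
    cost: float   # total spend across all attempts, in any consistent unit
    green: bool   # True only if the resulting PR passed every gate


def cost_per_green_pr(runs: Iterable[TaskRun]) -> float:
    """Total spend, failures included, divided by the number of green PRs."""
    runs = list(runs)
    green = sum(1 for r in runs if r.green)
    if green == 0:
        return float("inf")  # money spent, nothing verified
    return sum(r.cost for r in runs) / green


class BudgetExceeded(RuntimeError):
    pass


class TokenBudget:
    """Hard per-task stop: refuse any charge that would cross the limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def charge(self, tokens: int) -> None:
        if self.used + tokens > self.limit:
            raise BudgetExceeded(f"{self.used + tokens} > {self.limit} tokens")
        self.used += tokens


# Illustrative numbers only: a cheap configuration that rarely passes the gates
# versus a pricier one that usually does.
cheap_only = [TaskRun(1.0, i < 4) for i in range(10)]  # 4 of 10 green
tiered = [TaskRun(2.0, i < 9) for i in range(10)]      # 9 of 10 green

print(cost_per_green_pr(cheap_only))
print(round(cost_per_green_pr(tiered), 2))
```

With these made-up figures the tiered configuration costs twice as much per task but less per green PR (about 2.22 versus 2.5), which is the comparison the metric is meant to surface.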

Scope & assumptions (challenge these)

  • In scope: AI agents that build/test/document/maintain regulated software (the production/quality-system tooling track).
  • Adjacent (enabled, not detailed): AI shipped as or inside the device (SaMD / SiMD), a separate submission track that reuses the same eval/reproducibility/PCCP muscles (07 §"Two regulated tracks").
  • Platform assumption: existing Kubernetes estate with GPU capacity (on-prem and/or sovereign VPC), service mesh, and a mature CI/CD + QMS.
  • Model assumption: open-weight bases (Qwen / Llama / DeepSeek / Mistral / StarCoder families + a vision-language tier), fine-tuned in-house; no external inference.
  • Regulatory context: US FDA-centric with EU MDR / AI Act awareness; dates as of May–June 2026 (QMSR in effect; FDA CSA final; FDA AI-lifecycle + PCCP guidance available).

Authored as an internal engineering/quality reference. Every quantitative threshold (e.g., specific coverage %, GPU counts, SLOs) is a placeholder to be set by the organization's risk and capacity analysis, not a vendor claim.