ASMM-Med — Agentic SDLC Maturity Model for Regulated Medical Device Engineering

Figure D — ASMM-Med Maturity Staircase. Capability rises with assurance: L0 Ad-hoc (suggestions only), L1 Governed (inline completion), L2 Spec-Driven (single-step, full review), L3 Orchestrated (multi-step, HITL gates), L4 Validated Autonomous (autonomous within validated bounds), L5 Self-Optimizing (closed-loop, PCCP-governed). Governing level = min(D1 Governance, D4 Evaluation, D6 Security). Autonomy never outruns assurance. Class C is always dual human control.

Audience: Engineering leadership, Quality/Regulatory (QA/RA), MLOps/Platform, and Security at a 1000+ developer medical-device organization (GE HealthCare / Siemens Healthineers class).

Scope: AI agents used to build, test, document, and maintain regulated software (the tooling side of the SDLC), running on a Kubernetes cloud-native platform with self-hosted, fine-tuned open-weight models only (no Claude/OpenAI/Gemini SaaS APIs).

Companion docs: 01-requirements · 03-reference-architecture · 04-model-strategy-and-finetuning · 05-evaluation-and-validation · 06-agentic-workflows · 07-security-and-compliance · 08-token-and-gpu-economics · 09-adoption-roadmap


1. Purpose

This maturity model gives a regulated medical-device engineering organization a defensible, auditable path from ad-hoc AI assistance to a validated, agent-native software development lifecycle. It is explicitly not a vibe-coding model. Every level raises both capability and assurance, because in this domain unverified velocity is a liability, not an asset.

The model answers four questions leadership repeatedly asks:

  1. How do we get to ≥99.9% release-gate correctness when the underlying models are probabilistic?
  2. How do we stay inside FDA / IEC 62304 / ISO 13485 obligations while letting agents touch regulated code?
  3. How do we control GPU/token cost when agentic loops are expensive and we self-host?
  4. What does "good" look like at each step, so we can fund, audit, and de-risk the transition?

2. Foundational design principles

These principles are invariant across all levels and bind the rest of the documentation set.

P1 — 99.9% is a system property, not a model property

No single open-weight model will deterministically hit 99.9% functional correctness on regulated code. The target is met by the system: a probabilistic generator wrapped in deterministic verifiers and human checkpoints.

Figure F — How Released Correctness Is Earned. Generate (model fleet) → Verify, deterministic (compiler, tests, SAST, schema, formal) → Gate (eval + HITL) → release. A failure routes to the repair loop.

The model's job is to propose; the harness's job is to dispose. Correctness is earned at the gate, not assumed at generation. See 05-evaluation-and-validation for the full assurance argument.
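The propose/dispose split can be sketched as a small loop. This is a minimal illustration, not the harness described in 05-evaluation-and-validation: `toy_generator` stands in for a model call, the two verifiers stand in for the real compiler/test/SAST stack, and every name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class VerifierResult:
    name: str
    passed: bool
    detail: str = ""


Verifier = Callable[[str], VerifierResult]
Generator = Callable[[List[VerifierResult]], str]


def syntax_check(source: str) -> VerifierResult:
    """Deterministic verifier: does the candidate compile?"""
    try:
        compile(source, "<candidate>", "exec")
        return VerifierResult("syntax", True)
    except SyntaxError as exc:
        return VerifierResult("syntax", False, str(exc))


def unit_test_check(source: str) -> VerifierResult:
    """Deterministic verifier: does add(2, 3) return 5?

    Assumption: a real harness would run this inside a sandbox,
    never in the harness's own process as done here.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        ok = namespace["add"](2, 3) == 5
        return VerifierResult("unit", ok, "" if ok else "add(2, 3) != 5")
    except Exception as exc:
        return VerifierResult("unit", False, repr(exc))


def generate_verify_repair(
    generate: Generator, verifiers: List[Verifier], max_repairs: int = 2
) -> Optional[str]:
    """Return a candidate that passed every verifier, or None to abstain."""
    feedback: List[VerifierResult] = []
    for _ in range(max_repairs + 1):
        candidate = generate(feedback)
        failures = [r for r in (v(candidate) for v in verifiers) if not r.passed]
        if not failures:
            return candidate  # eligible for the gate, not yet released
        feedback = failures  # failed checks drive the next repair attempt
    return None  # repair budget exhausted: escalate to a human


def toy_generator(feedback: List[VerifierResult]) -> str:
    """Hypothetical stand-in for a model: wrong first, fixed after feedback."""
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


accepted = generate_verify_repair(toy_generator, [syntax_check, unit_test_check])
```

Returning `None` rather than the last failing candidate is deliberate in this sketch: when the repair budget runs out, the loop abstains instead of passing unverified output downstream.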

P2 — Determinism wraps probabilism

Wherever a check can be deterministic (type systems, unit/property/mutation tests, static analysis, schema validation, policy-as-code, formal/specification checks), it must be, and it sits on the critical path to merge. Probabilistic judgment (LLM-as-judge) is permitted only as a secondary, escalating signal, never as a sole gate for risk-classified changes.
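One way to encode "secondary, escalating signal" is a gate in which the judge score can only add scrutiny and can never override a failed deterministic check. The sketch below assumes a hypothetical `judge_score` in [0, 1] and an illustrative threshold of 0.8; neither value comes from this document set.

```python
from enum import Enum
from typing import Dict, Optional


class Decision(Enum):
    BLOCK = "block"
    ESCALATE = "escalate_to_human"
    PASS = "pass_to_next_gate"


def merge_gate(
    deterministic_results: Dict[str, bool],
    judge_score: Optional[float] = None,
    judge_threshold: float = 0.8,  # illustrative assumption
) -> Decision:
    """Deterministic checks decide; the LLM judge may only escalate."""
    # No checks, or any failed check, blocks regardless of the judge's opinion.
    if not deterministic_results or not all(deterministic_results.values()):
        return Decision.BLOCK
    # A low judge score adds human scrutiny on top of green checks.
    if judge_score is not None and judge_score < judge_threshold:
        return Decision.ESCALATE
    return Decision.PASS
```

Note the asymmetry: a high judge score never rescues a red check, while a low judge score can still hold back a change whose deterministic checks are all green.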

P3 — Risk-proportional autonomy (IEC 62304 safety class drives the leash length)

Agent autonomy is a function of the safety class of the artifact being modified:

  • Class A (no injury possible): high autonomy permitted at higher maturity.
  • Class B (non-serious injury): bounded autonomy, mandatory human review.
  • Class C (death / serious injury): agent may propose and evidence, but a qualified human always authors the merge decision; dual control required.
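The class-to-autonomy rule above can be written as a small lookup. This is a sketch under stated assumptions: the authoritative autonomy table lives in 07-security-and-compliance, and the choice of L4 as the point where Class A changes may merge without a human approval is an illustrative guess, not a rule from this model.

```python
from dataclasses import dataclass
from enum import Enum


class SafetyClass(Enum):
    A = "A"  # no injury possible
    B = "B"  # non-serious injury
    C = "C"  # death / serious injury


@dataclass(frozen=True)
class AutonomyPolicy:
    agent_may_merge: bool
    human_approvals: int


def autonomy_for(safety_class: SafetyClass, governing_level: int) -> AutonomyPolicy:
    """Map (safety class, governing level) to a hypothetical merge policy."""
    if not 0 <= governing_level <= 5:
        raise ValueError("governing_level must be between 0 and 5")
    if safety_class is SafetyClass.C:
        # Dual human control at every maturity level.
        return AutonomyPolicy(agent_may_merge=False, human_approvals=2)
    if safety_class is SafetyClass.B:
        # Bounded autonomy, mandatory human review.
        return AutonomyPolicy(agent_may_merge=False, human_approvals=1)
    # Class A. Assumption: high autonomy starts at L4.
    if governing_level >= 4:
        return AutonomyPolicy(agent_may_merge=True, human_approvals=0)
    return AutonomyPolicy(agent_may_merge=False, human_approvals=1)
```

For example, `autonomy_for(SafetyClass.C, 5)` still requires two human approvals, which mirrors the rule that maturity never removes dual control from Class C work.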

P4 — Everything an agent does is evidence

Every prompt, context bundle, model+adapter version, tool call, verifier result, and human decision is captured as an immutable, attributable, replayable record (21 CFR Part 11-grade). If it isn't logged, it didn't happen, and it can't ship in a regulated product.
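A hash chain is one common way to make an append-only log tamper-evident. The sketch below shows only that one property; it is not a Part 11 implementation (which also needs signatures, access control, retention, and more), and the record fields are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder hash for the first record


def _digest(prev_hash: str, event: Dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record body."""
    body = json.dumps({"prev_hash": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(log: List[Dict], event: Dict) -> Dict:
    """Append an event, linking it to the hash of the previous record."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    record = {"prev_hash": prev_hash, "event": event, "hash": _digest(prev_hash, event)}
    log.append(record)
    return record


def verify_chain(log: List[Dict]) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for record in log:
        if record["prev_hash"] != prev_hash:
            return False
        if record["hash"] != _digest(record["prev_hash"], record["event"]):
            return False
        prev_hash = record["hash"]
    return True


# Hypothetical events: field names are illustrative only.
evidence_log: List[Dict] = []
append_record(evidence_log, {"actor": "agent-7", "action": "tool_call", "tool": "run_tests"})
append_record(evidence_log, {"actor": "j.doe", "action": "human_decision", "decision": "approve"})
```

Editing any earlier event after the fact makes `verify_chain` return `False`, so silent rewrites of history become detectable at audit time.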

P5 — The harness is the product (Agent = Model + Harness)

~90% of behavior and ~100% of assurance come from the harness (instructions, tools, sandboxes, policies, evals, observability), not the raw model. Investment, validation, and change control concentrate on the harness. Models are hot-swappable inputs; the harness is the controlled system.

P6 — Cost is measured per resolved, verified task, not per token

Self-hosting converts OpEx (API calls) into CapEx+OpEx (GPU fleet). The governing metric is cost-per-green-PR (a change that passes all deterministic gates and human review), not cost-per-token. Routing, caching, and tiering exist to minimize that metric. See 08-token-and-gpu-economics.
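A short calculation shows why the two metrics can disagree. The numbers below are invented for illustration, and the sketch assumes the simplest possible definition: all spend on a route, including failed attempts, divided by the changes that ended up green.

```python
def cost_per_green_pr(total_spend: float, green_prs: int) -> float:
    """Total spend (failed attempts included) per change that passed every gate."""
    if green_prs <= 0:
        raise ValueError("no green PRs: cost per green PR is undefined")
    return total_spend / green_prs


# Illustrative figures only, not measurements.
# Route 1: cheap attempts, but only 40 of 100 end up green.
cheap_route = cost_per_green_pr(total_spend=100 * 0.50, green_prs=40)
# Route 2: attempts cost twice as much, but 90 of 100 end up green.
strong_route = cost_per_green_pr(total_spend=100 * 1.00, green_prs=90)
```

Under these assumed figures the route with double the per-attempt cost is cheaper per green PR (about 1.11 versus 1.25), which is the effect a cost-per-token view would hide.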

P7 — Self-hosted, sovereign, reproducible

All inference and training run inside the organization's trust boundary (on-prem or sovereign VPC). Models, datasets, and training runs are versioned, signed, and reproducible so that any artifact an agent produced can be regenerated and defended in an audit or a recall investigation.


3. The six maturity levels

| Level | Name | One-line essence | Dominant operating mode | Autonomy ceiling |
|---|---|---|---|---|
| L0 | Ad-hoc Assistance | Ungoverned, shadow AI | Individual, in-editor | Suggestions only |
| L1 | Governed Assistance | Sanctioned self-hosted assist | Conductor (human types) | Inline completion |
| L2 | Spec-Driven Bounded Automation | Specs + single agents on reviewable tasks | Conductor + bounded tasks | Single-step, full review |
| L3 | Orchestrated Agentic Workflows | Sandboxed multi-step agents open PRs | Orchestrator | Multi-step, HITL gates |
| L4 | Validated Autonomous Agents | Agents validated as CSA software tools | Orchestrator + validated autonomy | Autonomous within validated bounds |
| L5 | Self-Optimizing Agentic Enterprise | Closed-loop, eval-driven, cost-optimal | Fleet governance | Self-improving under PCCP-style control |

L0 — Ad-hoc Assistance (the starting reality, not a goal)

Developers use whatever in-IDE autocomplete they can reach. No central policy, no logging, no model governance, and likely shadow use of public SaaS endpoints — an IP-leakage and compliance incident waiting to happen. There is no audit trail tying generated code to a model version. This level is non-compliant by default and the transformation's first job is to extinguish it.

L1 — Governed Assistance

A sanctioned, self-hosted code-assistance model (Tier-S/Tier-M, see 04) is offered through the IDE behind SSO, with prompt/response logging, an Acceptable-Use Policy, and DLP/PII guards. Humans still author 100% of production code; the AI accelerates typing and lookup. The win is eliminating shadow AI and establishing the logging substrate that all later assurance depends on.

L2 — Spec-Driven Bounded Automation

The organization adopts Spec-Driven Development: specs/, AGENTS.md/rule files, and BDD/Gherkin acceptance criteria live in the repo as the source of truth. Single-step agents perform bounded, individually reviewable tasks — test generation, documentation, mechanical refactors, boilerplate — and a deterministic evaluation harness runs in CI. Every agent output is a normal PR under full human review. This is the first level where agents write code that can ship, and the first level where the eval harness exists.

L3 — Orchestrated Agentic Workflows

Agents run multi-step workflows in ephemeral, egress-restricted sandboxes, call tools through an MCP tool plane, and are routed across a fleet of fine-tuned models by task complexity. A policy server gates every tool call (structural + semantic). Agents open PRs autonomously, but human-in-the-loop checkpoints are mandatory at risk-defined boundaries. Trajectory and output evaluations join the deterministic gates. The org now ships agent-produced changes at scale with managed risk.
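A policy gate with two layers might look like the sketch below. It reads "structural" as "is this role allowed to call this tool at all" and "semantic" as "is what this particular call would do acceptable"; that reading, the role names, the tool names, and the `/workspace` root are all assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Dict, Set

# Hypothetical allow-list: which tools each agent role may call.
ALLOWED_TOOLS: Dict[str, Set[str]] = {
    "coder": {"read_file", "write_file", "run_tests"},
    "reviewer": {"read_file"},
}
WORKSPACE = PurePosixPath("/workspace")  # assumed sandbox root


@dataclass(frozen=True)
class ToolCall:
    agent_role: str
    tool: str
    args: dict


def structural_check(call: ToolCall) -> bool:
    """Is this tool on the allow-list for this role?"""
    return call.tool in ALLOWED_TOOLS.get(call.agent_role, set())


def semantic_check(call: ToolCall) -> bool:
    """Would the call stay inside the sandbox workspace?"""
    raw_path = call.args.get("path")
    if raw_path is None:
        return True  # nothing path-related to inspect in this sketch
    path = PurePosixPath(raw_path)
    if ".." in path.parts:
        return False  # reject traversal instead of trying to normalize it
    return path == WORKSPACE or WORKSPACE in path.parents


def authorize(call: ToolCall) -> bool:
    """Both layers must agree; deny is the default."""
    return structural_check(call) and semantic_check(call)
```

A call such as `ToolCall("coder", "read_file", {"path": "/workspace/../etc/passwd"})` passes the structural layer but is rejected by the semantic one, which is why the two checks are kept separate.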

L4 — Validated Autonomous Agents (the regulated inflection point)

Each production agent is treated as a software tool used in production of a medical device and is validated under FDA Computer Software Assurance (CSA): documented intended use, risk-based test evidence, and recorded validation. Agents operate autonomously within validated bounds for their safety-class envelope, coordinate via A2A multi-agent patterns, and are continuously gated by ≥99.9% eval thresholds with full IEC 62304 traceability (requirement → spec → code → test → eval → release). Class C work still requires dual human control. This is where "agentic" becomes "regulated-grade."

L5 — Self-Optimizing Agentic Enterprise

A closed loop connects production telemetry → eval failures → curated fine-tuning data → candidate adapters → automated eval-driven promotion — all under a Predetermined Change Control Plan (PCCP)-style governance so model/harness evolution is pre-authorized and auditable. The fleet self-tunes for quality and cost-per-green-PR, observability is enterprise-wide, and agents participate in their own improvement under human governance. Maturity here is operational discipline at scale, not raw autonomy.
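The promotion step of that loop can be sketched as a predicate. This is a simplified illustration: the 0.999 default echoes the release-gate figure used elsewhere in this document, while the "no regression against the incumbent" rule and the `change_is_preauthorized` flag (standing in for a PCCP-style scope check) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def may_promote(
    candidate: EvalReport,
    incumbent: EvalReport,
    release_gate: float = 0.999,
    change_is_preauthorized: bool = False,
) -> bool:
    """Promote only pre-authorized candidates that clear the gate without regressing."""
    if not change_is_preauthorized:
        return False  # outside the pre-authorized scope: route to change control
    if candidate.total == 0 or incumbent.total == 0:
        return False  # no evidence, no promotion
    return (
        candidate.pass_rate >= release_gate
        and candidate.pass_rate >= incumbent.pass_rate
    )
```

Defaulting `change_is_preauthorized` to `False` keeps the sketch fail-closed: a caller has to assert that the change is in scope before evaluation results are even considered.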


4. The eight capability dimensions

Maturity is assessed independently along eight axes. An organization is rarely uniform; the floor across safety-relevant dimensions (D1, D4, D6) governs what autonomy is actually permitted, regardless of how advanced D2/D5 are.

| # | Dimension | What it measures |
|---|---|---|
| D1 | Governance, Quality & Regulatory Compliance | QMS integration, IEC 62304/ISO 13485/QMSR alignment, CSA validation of tools, traceability, change control |
| D2 | Model Infrastructure & MLOps | Self-hosted serving, fine-tuning pipeline, model registry, reproducibility, multi-LoRA, GPU platform |
| D3 | Context & Knowledge Engineering | Specs, rule files, RAG over code/docs/regulatory corpus, memory, context hygiene |
| D4 | Evaluation, Validation & Assurance | Deterministic verifiers, eval suites, trajectory eval, acceptance criteria, the 99.9% gate, abstention |
| D5 | Agentic Orchestration & Tooling | Single→multi-agent, MCP/A2A, sandboxing, HITL design, workflow engine |
| D6 | Security & Zero-Trust | Identity, egress control, prompt-injection defense, supply chain, secrets, IEC 62443, audit immutability |
| D7 | Observability & FinOps | Tracing, eval dashboards, token/GPU metering, routing economics, budget guardrails |
| D8 | People, Skills & Operating Model | Roles (conductor/orchestrator), review culture, training, approval-fatigue controls |

5. The maturity matrix (core artifact)

Each cell is the exit criterion for that dimension at that level. A dimension reaches a level only when it meets that level's descriptor and every lower one; the overall governing level is then the minimum across D1, D4, and D6 (see §6).

D1 — Governance, Quality & Regulatory Compliance

| L | Descriptor |
|---|---|
| L0 | No policy; shadow AI; no link between generated code and a model version. Non-compliant. |
| L1 | AUP published; AI-assist logged; QMS acknowledges AI tooling; data-handling/IP policy enforced (no external endpoints). |
| L2 | AI-tool use captured in the DHF/quality records; SDD specs are controlled documents; SOP for "AI-assisted change" exists; risk assessment (ISO 14971) covers AI tooling. |
| L3 | Risk-based tool classification per GAMP 5; agent actions mapped to IEC 62304 activities; change-control board reviews harness changes; ISO/IEC 42001 AI-management controls adopted. |
| L4 | Each agent validated under FDA CSA with documented intended use + evidence; full requirement-to-release traceability; agents recognized in ISO 13485/QMSR as validated production tooling; periodic revalidation triggers defined. |
| L5 | PCCP-style predetermined change control governs model/harness evolution; continuous compliance monitoring; automated audit-evidence generation; regulatory-grade reproducibility of any historical agent action. |

D2 — Model Infrastructure & MLOps

| L | Descriptor |
|---|---|
| L0 | None / external SaaS. |
| L1 | Single self-hosted model served on K8s GPU (vLLM/Triton) behind an internal gateway; basic autoscaling. |
| L2 | Model registry (MLflow) with versioned, signed models; first fine-tunes (LoRA/SFT) on internal corpus; reproducible serving images. |
| L3 | Tiered model fleet (S/M/L/V/E) with multi-LoRA hot-swap; routing gateway; quantization (FP8/INT8/AWQ); distributed training (Ray/Kueue); model cards mandatory. |
| L4 | Reproducible, validated training pipelines (locked datasets/seeds, signed lineage) suitable as validation evidence; canary + shadow deployment; rollback guarantees; per-domain adapters. |
| L5 | Continuous fine-tuning from production signal; automated candidate training; eval-gated promotion; fleet-level capacity/cost optimization; data flywheel under governance. |

D3 — Context & Knowledge Engineering

| L | Descriptor |
|---|---|
| L0 | Whatever is in the editor buffer. |
| L1 | Curated system prompts / org coding standards injected; no retrieval. |
| L2 | specs/ + AGENTS.md + BDD criteria in-repo; static rule files; basic code RAG. |
| L3 | Governed RAG over code, design docs, SOPs, and the regulatory corpus; static-vs-dynamic context split; Agent Skills with progressive disclosure; PII/PHI context hygiene middleware. |
| L4 | Validated knowledge sources (controlled, versioned, access-scoped); retrieval provenance recorded as evidence; graph-native code understanding for large legacy estates. |
| L5 | Self-curating knowledge base; context quality measured and optimized; memory governed and access-audited fleet-wide. |

D4 — Evaluation, Validation & Assurance

| L | Descriptor |
|---|---|
| L0 | "Looks right." None. |
| L1 | Lint + existing CI on human-authored code; no AI-specific eval. |
| L2 | Deterministic eval harness in CI (build, unit/property tests, type-check, SAST); per-task acceptance criteria defined; AI changes can't merge red. |
| L3 | Output + trajectory evals; golden datasets with rubrics; mutation testing; LLM-as-judge as secondary signal; abstention/escalation wired in; escape-rate tracked. |
| L4 | ≥99.9% release-gate enforced via generate→verify→repair + N-sample self-consistency + verifier voting; eval suites are controlled validation artifacts; statistical acceptance (CI bounds) per safety class; HITL mandatory for Class B/C. |
| L5 | Continuous eval against live failure modes; auto-regression capture; eval-driven model promotion; drift detection closes the loop. |
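To give a feel for what "statistical acceptance (CI bounds)" implies at the 99.9% gate, the sketch below computes how many consecutive passing eval cases are needed before a pass rate of at least the target can be claimed at a given one-sided confidence. It assumes independent, representative eval cases and zero observed failures; that zero-failure design is an illustrative choice, not the acceptance procedure defined in 05-evaluation-and-validation.

```python
import math


def min_clean_samples(target_pass_rate: float, confidence: float) -> int:
    """Smallest n with target_pass_rate ** n <= 1 - confidence.

    If the true pass rate were below the target, n passes in a row
    would occur with probability at most 1 - confidence.
    """
    if not 0.0 < target_pass_rate < 1.0:
        raise ValueError("target_pass_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(target_pass_rate))


samples_for_999_at_95 = min_clean_samples(0.999, 0.95)
```

Under these assumptions the answer at 95% confidence is 2995 clean cases, close to the familiar "rule of three" estimate of 3 / 0.001 = 3000. A single observed failure means the zero-failure argument no longer applies and a larger sample or a different bound is needed.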

D5 — Agentic Orchestration & Tooling

| L | Descriptor |
|---|---|
| L0 | None. |
| L1 | Inline completion only; no tool use. |
| L2 | Single-step agents, bounded tasks, no autonomous multi-file changes; tools read-only or PR-only. |
| L3 | Multi-step agents in sandboxes (gVisor/Kata, ephemeral, egress-deny); MCP tool plane; policy server gates every call; HITL checkpoints designed in. |
| L4 | Multi-agent (A2A) with role decomposition (planner/coder/test/review); validated tool catalog; deterministic hooks at lifecycle points; durable sessions/memory. |
| L5 | Self-orchestrating workflows with governed dynamic planning; sub-agent fleets; automated decomposition; coordination cost-optimized. |

D6 — Security & Zero-Trust

| L | Descriptor |
|---|---|
| L0 | Uncontrolled; IP egress risk. |
| L1 | SSO, RBAC, TLS, prompt/response logging, DLP/secret-scanning on inputs. |
| L2 | Per-repo scoping; signed images; secret management (Vault); no agent write access to prod. |
| L3 | Zero-trust mesh (mTLS, SPIFFE/SPIRE); OPA/Gatekeeper; sandboxed exec with egress allow-lists; prompt-injection & context-poisoning defenses; PII/PHI masking; SBOM. |
| L4 | Supply-chain assurance (SLSA, Sigstore/cosign signed models+datasets+artifacts); semantic policy gating; IEC 62443 + FDA premarket cybersecurity (§524B) alignment; WORM immutable audit; dual-control for high-risk tool calls. |
| L5 | Continuous adversarial testing (red-team agents); automated anomaly/drift response; tamper-evident, fleet-wide, self-defending posture. |

D7 — Observability & FinOps

| L | Descriptor |
|---|---|
| L0 | None. |
| L1 | Basic request logging + GPU utilization metrics. |
| L2 | Per-team usage dashboards; cost attribution; eval pass/fail visibility. |
| L3 | End-to-end tracing (OpenTelemetry) of agent trajectories; token/GPU cost per task; routing telemetry; budget alerts. |
| L4 | Cost-per-green-PR as a board metric; per-model/per-tier economics; SLOs on latency/quality/cost; capacity forecasting; budget guardrails enforced in-loop (hard stops). |
| L5 | Closed-loop cost optimization; automated routing/quantization decisions; ROI attribution; predictive scaling. |

D8 — People, Skills & Operating Model

| L | Descriptor |
|---|---|
| L0 | Individual experimentation. |
| L1 | Training on sanctioned tool + AUP; champions identified. |
| L2 | Spec-writing & review skills built; "conductor" mode normalized; review checklists for AI output. |
| L3 | Orchestrator role emerges; ownership split (e.g., API vs UX) to reduce merge conflict; approval-fatigue controls (digital quiet hours, batched review). |
| L4 | Formal roles: Agent Steward, Eval Owner, Harness Engineer; no-blame integration culture; reviewers trained on AI failure modes (hallucinated deps, plausible-wrong logic). |
| L5 | Hiring/skills reframed around judgment over implementation; org-wide agentic literacy; continuous capability development; humans focus on architecture, risk, and verification. |

6. Level-gate rules (how you actually "are" at a level)

  1. Floor, not average. Your level on any dimension is the highest level whose descriptor, and every descriptor below it, is satisfied. Your overall operating level is the minimum across D1, D4, and D6 (the safety/assurance/security triad). You may invest ahead in D2/D5, but you may not grant autonomy beyond what D1/D4/D6 support. (This is the single most important rule: it prevents capability from outrunning assurance.)
  2. Evidence-based promotion. Advancing a level requires a documented assessment (§7) signed by Engineering + QA/RA + Security. No self-attestation.
  3. Safety-class gating. Even at L4/L5, autonomy is capped per IEC 62304 class (P3). L5 does not mean "agents merge Class C code unattended" — it never does.
  4. Reversibility. Every level must support rollback to the prior level's controls if eval escape-rate or incident metrics breach threshold.

7. Assessment & scoring method

  • Cadence: quarterly self-assessment; annual independent (internal audit) assessment; event-driven re-assessment after any Sev-1 AI-attributable defect or recall-relevant escape.
  • Instrument: 8 dimensions × 6 levels rubric (this document), scored 0–5 each, with required evidence artifacts per cell (logs, validation records, eval reports, signed model lineage, policy configs).
  • Scoring outputs:
    • Dimension scores (radar chart) → reveals imbalance.
    • Governing level = min(D1, D4, D6).
    • Capability level = mean(all) (informational only).
    • Autonomy authorization = derived table mapping (governing level × safety class) → permitted agent actions (lives in 07-security-and-compliance).
  • Gate review: QA/RA holds veto. Security holds veto. Promotion requires both plus Engineering sign-off.
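The two headline scoring outputs reduce to a few lines. The sketch below follows the formulas above (governing level = min(D1, D4, D6), capability level = mean of all eight); the function name, the input shape, and the example scores are hypothetical.

```python
from typing import Dict

SAFETY_TRIAD = ("D1", "D4", "D6")
ALL_DIMENSIONS = tuple(f"D{i}" for i in range(1, 9))


def assess(scores: Dict[str, int]) -> Dict[str, float]:
    """Compute governing and capability levels from per-dimension scores (0-5)."""
    if set(scores) != set(ALL_DIMENSIONS):
        raise ValueError("scores must cover exactly D1..D8")
    if any(not 0 <= s <= 5 for s in scores.values()):
        raise ValueError("each dimension score must be between 0 and 5")
    return {
        # The floor across the safety/assurance/security triad sets autonomy.
        "governing_level": min(scores[d] for d in SAFETY_TRIAD),
        # The mean is informational only and never authorizes autonomy.
        "capability_level": sum(scores.values()) / len(scores),
    }


# Invented example: strong infrastructure and tooling, weaker governance and security.
example = assess(
    {"D1": 2, "D2": 4, "D3": 3, "D4": 3, "D5": 4, "D6": 2, "D7": 3, "D8": 3}
)
```

In this invented example the mean is 3.0 while the governing level is 2, so the organization would operate under L2 autonomy limits despite L4 investments in D2 and D5.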

Illustrative target trajectory (informational)

| Quarter (from program start) | Target governing level |
|---|---|
| Q1–Q2 | L1 (kill shadow AI, stand up serving + logging) |
| Q3–Q4 | L2 (SDD + deterministic eval harness) |
| Q5–Q7 | L3 (sandboxed orchestration + fleet + policy server) |
| Q8–Q11 | L4 (CSA validation of agents, 99.9% gates) |
| Q12+ | L5 (closed-loop, cost-optimized) |

See 09-adoption-roadmap for the detailed phased plan, owners, and exit criteria.


8. Per-level KPIs

| Level | Capability KPIs | Assurance / cost KPIs |
|---|---|---|
| L1 | % devs on sanctioned tool; shadow-AI incidents → 0 | 100% prompts logged; IP-egress events = 0 |
| L2 | % repos with specs/+AGENTS.md; agent-PR acceptance rate | CI deterministic-gate coverage ≥ X%; zero red merges |
| L3 | % tasks via orchestrated agents; HITL checkpoint adherence | trajectory-eval pass rate; escape-rate; cost-per-task |
| L4 | autonomous-task throughput within validated bounds | release-gate ≥99.9%; validation-evidence completeness; revalidation on time |
| L5 | fine-tune cycle time; auto-promotion rate | cost-per-green-PR trend ↓; drift MTTR; audit-evidence automation % |

9. Anti-patterns (explicitly disallowed)

  • Autonomy ahead of assurance — granting L3+ autonomy while D4/D6 sit at L1/L2. Forbidden by §6.1.
  • LLM-as-sole-gate — using a model to approve a model's output on Class B/C code. Violates P2.
  • Unversioned models — serving a model you cannot reproduce or sign. Violates P7 and CSA.
  • Eval theater — a passing demo presented as a passing eval. Evals need rubrics + statistical acceptance (D4-L4).
  • Context dumping — stuffing 100k-token repos into prompts; burns GPU and degrades accuracy. See 08.
  • Approval-fatigue reflex-clicking — unbatched micro-approvals leading reviewers to rubber-stamp. Controlled at D8-L3.
  • Shadow SaaS — any call to external LLM endpoints. Hard-blocked at network layer (D6).

10. Regulatory mapping (orientation)

| Standard / regulation | How this model engages it |
|---|---|
| IEC 62304 (medical device software lifecycle) | Agent actions mapped to lifecycle activities; safety class drives autonomy (P3); traceability at D1-L4 |
| ISO 13485 / FDA QMSR (21 CFR 820, effective Feb 2026) | AI agents recognized as validated production/quality tooling; SOPs + DHF integration |
| ISO 14971 (risk management) | AI-tooling failure modes in the risk file; mitigations = deterministic gates + HITL |
| FDA Computer Software Assurance (CSA) | Risk-based validation of each agent as production/QS software (D1-L4) |
| GAMP 5 (2nd ed.) | Risk-based, critical-thinking validation approach for the tool category |
| 21 CFR Part 11 | Immutable, attributable, signed e-records of all agent actions (P4, D6-L4) |
| ISO/IEC 42001 (AI management system) | Org-level AI governance controls (D1-L3+) |
| IEC 62443 / FDA premarket cybersecurity (§524B) | Zero-trust + supply-chain assurance for the agent platform (D6-L4) |
| FDA AI-enabled device guidance + PCCP | Pattern reused at D1-L5 for governed model evolution of the dev toolchain |

Critical distinction: this model governs AI that builds the device (production/quality-system software). AI shipped inside the device (SaMD/AI-enabled function) is a separate submission-bearing track — but the same assurance muscles (eval rigor, reproducibility, PCCP) directly enable it. See 07-security-and-compliance §"Two regulated tracks."


11. How to read the rest of the set