ASMM-Med — Agentic SDLC Maturity Model for Regulated Medical Device Engineering

Figure D — ASMM-Med Maturity Staircase

Audience: Engineering leadership, Quality/Regulatory (QA/RA), MLOps/Platform, and Security at a 1,000+ developer medical-device organization (GE HealthCare / Siemens Healthineers class).

Scope: AI agents used to build, test, document, and maintain regulated software (the tooling side of the SDLC), running on a Kubernetes cloud-native platform with self-hosted, fine-tuned open-weight models only (no Claude/OpenAI/Gemini SaaS APIs).

Companion docs: 01-requirements · 03-reference-architecture · 04-model-strategy-and-finetuning · 05-evaluation-and-validation · 06-agentic-workflows · 07-security-and-compliance · 08-token-and-gpu-economics · 09-adoption-roadmap

1. Purpose

This maturity model gives a regulated medical-device engineering organization a defensible, auditable path from ad-hoc AI assistance to a validated, agent-native software development lifecycle. It is explicitly not a vibe-coding model. Every level raises both capability and assurance, because in this domain unverified velocity is a liability, not an asset.

The model answers four questions leadership repeatedly asks:

How do we get to ≥99.9% release-gate correctness when the underlying models are probabilistic?
How do we stay inside FDA / IEC 62304 / ISO 13485 obligations while letting agents touch regulated code?
How do we control GPU/token cost when agentic loops are expensive and we self-host?
What does "good" look like at each step, so we can fund, audit, and de-risk the transition?

2. Foundational design principles

These principles are invariant across all levels and bind the rest of the documentation set.

P1 — 99.9% is a system property, not a model property

No single open-weight model will deterministically hit 99.9% functional correctness on regulated code. The target is met by the system: a probabilistic generator wrapped in deterministic verifiers and human checkpoints.

Figure F — How Released Correctness Is Earned

The model's job is to propose; the harness's job is to dispose. Correctness is earned at the gate, not assumed at generation. See 05-evaluation-and-validation for the full assurance argument.
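The propose/dispose split can be sketched as a small generate→verify→repair loop. This is a minimal illustration, not the harness described in 05-evaluation-and-validation: `generate` is a hypothetical stand-in for a model call, `Verifier` and `GateResult` are names invented here, and the attempt budget of 3 is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# A verifier is any deterministic check. It returns None on success or a
# failure message that can be fed back into the next generation attempt.
Verifier = Callable[[str], Optional[str]]


@dataclass
class GateResult:
    accepted: bool
    candidate: Optional[str]
    attempts: int
    failures: List[str]


def generate_verify_repair(
    generate: Callable[[str, List[str]], str],  # hypothetical model call
    verifiers: List[Verifier],
    task: str,
    max_attempts: int = 3,  # illustrative budget, not a recommended value
) -> GateResult:
    """Propose with the model, dispose with deterministic verifiers."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(task, feedback)
        failures = [msg for v in verifiers if (msg := v(candidate)) is not None]
        if not failures:
            return GateResult(True, candidate, attempt, [])
        feedback = failures  # the repair round sees the latest failures
    # Budget exhausted: abstain and escalate to a human rather than ship.
    return GateResult(False, None, max_attempts, feedback)
```

The loop never returns a candidate that failed a verifier; when the budget runs out it abstains, which is the behavior the gate relies on.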

P2 — Determinism wraps probabilism

Wherever a check can be deterministic (type systems, unit/property/mutation tests, static analysis, schema validation, policy-as-code, formal/specification checks), it must be, and it sits on the critical path to merge. Probabilistic judgment (LLM-as-judge) is permitted only as a secondary, escalating signal, never as a sole gate for risk-classified changes.
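One way to encode "secondary, escalating signal" is a gate in which the judge can only make the outcome stricter. The sketch below assumes a hypothetical judge score in [0, 1] and an illustrative floor of 0.8; neither value comes from this documentation set.

```python
from enum import Enum
from typing import Dict, Optional


class Decision(Enum):
    BLOCK = "block"
    ESCALATE = "escalate_to_human"
    PASS = "pass"


def merge_gate(
    deterministic_results: Dict[str, bool],  # check name -> passed?
    judge_score: Optional[float] = None,     # hypothetical LLM-as-judge score
    judge_floor: float = 0.8,                # illustrative threshold (assumption)
) -> Decision:
    # No evidence, or any red deterministic check, blocks the change.
    # The judge is never consulted to override a deterministic failure.
    if not deterministic_results or not all(deterministic_results.values()):
        return Decision.BLOCK
    # With all deterministic checks green, a low judge score can only
    # escalate to a human. It can never approve on its own.
    if judge_score is not None and judge_score < judge_floor:
        return Decision.ESCALATE
    return Decision.PASS
```

Note that a high judge score changes nothing: removing the judge entirely leaves every `BLOCK` decision intact, which is what keeps it off the critical path.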

P3 — Risk-proportional autonomy (IEC 62304 safety class drives the leash length)

Agent autonomy is a function of the safety class of the artifact being modified:

Class A (no injury possible): high autonomy permitted at higher maturity.
Class B (non-serious injury): bounded autonomy, mandatory human review.
Class C (death / serious injury): agent may propose and evidence, but a qualified human always authors the merge decision; dual control required.
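The class-to-autonomy mapping above can be expressed as a small lookup. This is a sketch under stated assumptions: `AutonomyPolicy` and `autonomy_for` are hypothetical names, and treating governing level 4 as the point where Class A agents may merge without review is an assumption made for illustration. The authoritative table lives in 07-security-and-compliance.

```python
from dataclasses import dataclass
from enum import Enum


class SafetyClass(Enum):
    A = "A"  # no injury possible
    B = "B"  # non-serious injury
    C = "C"  # death / serious injury


@dataclass(frozen=True)
class AutonomyPolicy:
    agent_may_merge: bool
    human_approvals_required: int


def autonomy_for(safety_class: SafetyClass, governing_level: int) -> AutonomyPolicy:
    if safety_class is SafetyClass.C:
        # A qualified human always authors the merge decision; dual control.
        return AutonomyPolicy(agent_may_merge=False, human_approvals_required=2)
    if safety_class is SafetyClass.B:
        # Bounded autonomy with mandatory human review at every level.
        return AutonomyPolicy(agent_may_merge=False, human_approvals_required=1)
    # Class A: high autonomy only at higher maturity (threshold is an assumption).
    if governing_level >= 4:
        return AutonomyPolicy(agent_may_merge=True, human_approvals_required=0)
    return AutonomyPolicy(agent_may_merge=False, human_approvals_required=1)
```

The Class C branch ignores `governing_level` on purpose: no maturity level unlocks unattended merges for that class.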

P4 — Everything an agent does is evidence

Every prompt, context bundle, model+adapter version, tool call, verifier result, and human decision is captured as an immutable, attributable, replayable record (21 CFR Part 11-grade). If it isn't logged, it didn't happen, and it can't ship in a regulated product.
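A hash chain is one common way to make such a log tamper-evident. The sketch below is illustrative only: the record fields are hypothetical, and a hash chain on its own does not satisfy 21 CFR Part 11, which also calls for controls such as electronic signatures, access control, and retention that are out of scope here.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder hash for the first record (assumption)


def _digest(prev_hash: str, payload: Dict[str, Any]) -> str:
    # Canonical JSON so the same payload always hashes to the same value.
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(chain: List[Dict[str, Any]], payload: Dict[str, Any]) -> Dict[str, Any]:
    """Append a record whose hash covers its payload and the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    record = {"prev": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)}
    chain.append(record)
    return record


def verify_chain(chain: List[Dict[str, Any]]) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for record in chain:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["payload"]):
            return False
        prev_hash = record["hash"]
    return True
```

In use, each agent step would append one record (for example a tool call with its model and adapter version), and an auditor would run `verify_chain` before trusting the trail.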

P5 — The harness is the product (Agent = Model + Harness)

Roughly 90% of an agent's behavior and essentially all of its assurance come from the harness (instructions, tools, sandboxes, policies, evals, observability), not the raw model; treat these figures as a rule of thumb rather than a measurement. Investment, validation, and change control therefore concentrate on the harness. Models are hot-swappable inputs; the harness is the controlled system.

P6 — Cost is measured per resolved, verified task, not per token

Self-hosting converts OpEx (API calls) into CapEx+OpEx (GPU fleet). The governing metric is cost-per-green-PR, where a green PR is a change that passes all deterministic gates and human review, not cost-per-token. Routing, caching, and tiering exist to minimize cost-per-green-PR. See 08-token-and-gpu-economics.
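The arithmetic behind the metric is simple, and a sketch makes the key point visible: failed attempts still consume GPU time, so their cost lands on the PRs that do go green. `TaskRun` and the flat per-second GPU rate are simplifying assumptions for illustration; 08-token-and-gpu-economics may define the cost model differently.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TaskRun:
    gpu_seconds: float
    gates_passed: bool    # all deterministic gates green
    human_approved: bool  # review completed and approved


def cost_per_green_pr(runs: Iterable[TaskRun], gpu_cost_per_second: float) -> float:
    """Total spend across ALL runs divided by the number of green PRs."""
    runs = list(runs)
    total_cost = sum(r.gpu_seconds for r in runs) * gpu_cost_per_second
    green = sum(1 for r in runs if r.gates_passed and r.human_approved)
    if green == 0:
        return float("inf")  # money spent, nothing shippable produced
    return total_cost / green
```

For example, two green runs of 100 and 200 GPU-seconds plus one failed run of 300 GPU-seconds at 0.01 per second cost 6.00 in total, or 3.00 per green PR, even though the green runs alone consumed only 3.00.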

P7 — Self-hosted, sovereign, reproducible

All inference and training run inside the organization's trust boundary (on-prem or sovereign VPC). Models, datasets, and training runs are versioned, signed, and reproducible so that any artifact an agent produced can be regenerated and defended in an audit or a recall investigation.

3. The six maturity levels

Level	Name	One-line essence	Dominant operating mode	Autonomy ceiling
L0	Ad-hoc Assistance	Ungoverned, shadow AI	Individual, in-editor	Suggestions only
L1	Governed Assistance	Sanctioned self-hosted assist	Conductor (human types)	Inline completion
L2	Spec-Driven Bounded Automation	Specs + single agents on reviewable tasks	Conductor + bounded tasks	Single-step, full review
L3	Orchestrated Agentic Workflows	Sandboxed multi-step agents open PRs	Orchestrator	Multi-step, HITL gates
L4	Validated Autonomous Agents	Agents validated as CSA software tools	Orchestrator + validated autonomy	Autonomous within validated bounds
L5	Self-Optimizing Agentic Enterprise	Closed-loop, eval-driven, cost-optimal	Fleet governance	Self-improving under PCCP-style control

L0 — Ad-hoc Assistance (the starting reality, not a goal)

Developers use whatever in-IDE autocomplete they can reach. No central policy, no logging, no model governance, and likely shadow use of public SaaS endpoints — an IP-leakage and compliance incident waiting to happen. There is no audit trail tying generated code to a model version. This level is non-compliant by default and the transformation's first job is to extinguish it.

L1 — Governed Assistance

A sanctioned, self-hosted code-assistance model (Tier-S/Tier-M, see 04) is offered through the IDE behind SSO, with prompt/response logging, an Acceptable-Use Policy, and DLP/PII guards. Humans still author 100% of production code; the AI accelerates typing and lookup. The win is eliminating shadow AI and establishing the logging substrate that all later assurance depends on.

L2 — Spec-Driven Bounded Automation

The organization adopts Spec-Driven Development: specs/, AGENTS.md/rule files, and BDD/Gherkin acceptance criteria live in the repo as the source of truth. Single-step agents perform bounded, individually reviewable tasks — test generation, documentation, mechanical refactors, boilerplate — and a deterministic evaluation harness runs in CI. Every agent output is a normal PR under full human review. This is the first level where agents write code that can ship, and the first level where the eval harness exists.

L3 — Orchestrated Agentic Workflows

Agents run multi-step tasks in ephemeral, egress-restricted sandboxes, call tools through an MCP tool plane, and are routed across a fleet of fine-tuned models by task complexity. A policy server gates every tool call with both structural and semantic checks. Agents open PRs autonomously, but human-in-the-loop checkpoints are mandatory at risk-defined boundaries. Trajectory and output evaluations join the deterministic gates. The org now ships agent-produced changes at scale with managed risk.
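The structural half of tool-call gating can be sketched as a pure function over the call and a policy. Everything here is hypothetical (`ToolCall`, `ToolPolicy`, the tool names, the three-valued result); semantic gating and the actual policy engine are not shown.

```python
import posixpath
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class ToolCall:
    agent_id: str
    tool: str
    target_path: str  # repo-relative path the tool wants to touch


@dataclass(frozen=True)
class ToolPolicy:
    allowed_tools: FrozenSet[str]
    allowed_prefixes: Tuple[str, ...]  # e.g. ("src/", "tests/")
    checkpoint_tools: FrozenSet[str]   # tools that require a human checkpoint


def gate_tool_call(call: ToolCall, policy: ToolPolicy) -> str:
    """Return "deny", "needs_human", or "allow" for a single tool call."""
    if call.tool not in policy.allowed_tools:
        return "deny"
    # Normalize first so "src/../secrets" cannot slip past the prefix check.
    path = posixpath.normpath(call.target_path)
    if path.startswith(("/", "..")) or not path.startswith(policy.allowed_prefixes):
        return "deny"
    if call.tool in policy.checkpoint_tools:
        return "needs_human"
    return "allow"
```

The default outcome for anything not explicitly allowed is `"deny"`, which matches the deny-by-default posture the sandbox description implies.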

L4 — Validated Autonomous Agents (the regulated inflection point)

Each production agent is treated as a software tool used in production of a medical device and is validated under FDA Computer Software Assurance (CSA): documented intended use, risk-based test evidence, and recorded validation. Agents operate autonomously within validated bounds for their safety-class envelope, coordinate via A2A multi-agent patterns, and are continuously gated by ≥99.9% eval thresholds with full IEC 62304 traceability (requirement → spec → code → test → eval → release). Class C work still requires dual human control. This is where "agentic" becomes "regulated-grade."

L5 — Self-Optimizing Agentic Enterprise

A closed loop connects production telemetry → eval failures → curated fine-tuning data → candidate adapters → automated eval-driven promotion — all under a Predetermined Change Control Plan (PCCP)-style governance so model/harness evolution is pre-authorized and auditable. The fleet self-tunes for quality and cost-per-green-PR, observability is enterprise-wide, and agents participate in their own improvement under human governance. Maturity here is operational discipline at scale, not raw autonomy.
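The "automated eval-driven promotion" step can be pictured as a predicate fixed in advance of any candidate being trained. This is a sketch only: `EvalReport`, the 0.999 pass-rate floor applied to an eval suite, and the 5% cost tolerance are illustrative assumptions, and a real PCCP-style plan would pre-authorize its own criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    pass_rate: float          # fraction of the controlled eval suite passed
    cost_per_green_pr: float  # same units for candidate and incumbent


def should_promote(
    candidate: EvalReport,
    incumbent: EvalReport,
    min_pass_rate: float = 0.999,       # illustrative floor (assumption)
    max_cost_regression: float = 0.05,  # tolerate up to 5% higher cost (assumption)
) -> bool:
    """Promote only if quality clears the floor, does not regress, and cost holds."""
    if candidate.pass_rate < min_pass_rate:
        return False
    if candidate.pass_rate < incumbent.pass_rate:
        return False
    cost_ceiling = incumbent.cost_per_green_pr * (1.0 + max_cost_regression)
    return candidate.cost_per_green_pr <= cost_ceiling
```

Because the criteria are plain data, the thresholds themselves can be placed under change control and recorded alongside each promotion decision.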

4. The eight capability dimensions

Maturity is assessed independently along eight axes. An organization is rarely uniform; the floor across safety-relevant dimensions (D1, D4, D6) governs what autonomy is actually permitted, regardless of how advanced D2/D5 are.

#	Dimension	What it measures
D1	Governance, Quality & Regulatory Compliance	QMS integration, IEC 62304/ISO 13485/QMSR alignment, CSA validation of tools, traceability, change control
D2	Model Infrastructure & MLOps	Self-hosted serving, fine-tuning pipeline, model registry, reproducibility, multi-LoRA, GPU platform
D3	Context & Knowledge Engineering	Specs, rule files, RAG over code/docs/regulatory corpus, memory, context hygiene
D4	Evaluation, Validation & Assurance	Deterministic verifiers, eval suites, trajectory eval, acceptance criteria, the 99.9% gate, abstention
D5	Agentic Orchestration & Tooling	Single→multi-agent, MCP/A2A, sandboxing, HITL design, workflow engine
D6	Security & Zero-Trust	Identity, egress control, prompt-injection defense, supply chain, secrets, IEC 62443, audit immutability
D7	Observability & FinOps	Tracing, eval dashboards, token/GPU metering, routing economics, budget guardrails
D8	People, Skills & Operating Model	Roles (conductor/orchestrator), review culture, training, approval-fatigue controls

5. The maturity matrix (core artifact)

Each cell is the exit criterion for that dimension at that level. A dimension reaches a level only when it meets that level's descriptor and every descriptor below it; how the dimension levels combine into an overall operating level is defined in §6.

D1 — Governance, Quality & Regulatory Compliance

L	Descriptor
L0	No policy; shadow AI; no link between generated code and a model version. Non-compliant.
L1	AUP published; AI-assist logged; QMS acknowledges AI tooling; data-handling/IP policy enforced (no external endpoints).
L2	AI-tool use captured in the DHF/quality records; SDD specs are controlled documents; SOP for "AI-assisted change" exists; risk assessment (ISO 14971) covers AI tooling.
L3	Risk-based tool classification per GAMP 5; agent actions mapped to IEC 62304 activities; change-control board reviews harness changes; ISO/IEC 42001 AI-management controls adopted.
L4	Each agent validated under FDA CSA with documented intended use + evidence; full requirement-to-release traceability; agents recognized in ISO 13485/QMSR as validated production tooling; periodic revalidation triggers defined.
L5	PCCP-style predetermined change control governs model/harness evolution; continuous compliance monitoring; automated audit-evidence generation; regulatory-grade reproducibility of any historical agent action.

D2 — Model Infrastructure & MLOps

L	Descriptor
L0	None / external SaaS.
L1	Single self-hosted model served on K8s GPU (vLLM/Triton) behind an internal gateway; basic autoscaling.
L2	Model registry (MLflow) with versioned, signed models; first fine-tunes (LoRA/SFT) on internal corpus; reproducible serving images.
L3	Tiered model fleet (S/M/L/V/E) with multi-LoRA hot-swap; routing gateway; quantization (FP8/INT8/AWQ); distributed training (Ray/Kueue); model cards mandatory.
L4	Reproducible, validated training pipelines (locked datasets/seeds, signed lineage) suitable as validation evidence; canary + shadow deployment; rollback guarantees; per-domain adapters.
L5	Continuous fine-tuning from production signal; automated candidate training; eval-gated promotion; fleet-level capacity/cost optimization; data flywheel under governance.

D3 — Context & Knowledge Engineering

L	Descriptor
L0	Whatever is in the editor buffer.
L1	Curated system prompts / org coding standards injected; no retrieval.
L2	`specs/` + `AGENTS.md` + BDD criteria in-repo; static rule files; basic code RAG.
L3	Governed RAG over code, design docs, SOPs, and the regulatory corpus; static-vs-dynamic context split; Agent Skills with progressive disclosure; PII/PHI context hygiene middleware.
L4	Validated knowledge sources (controlled, versioned, access-scoped); retrieval provenance recorded as evidence; graph-native code understanding for large legacy estates.
L5	Self-curating knowledge base; context quality measured and optimized; memory governed and access-audited fleet-wide.

D4 — Evaluation, Validation & Assurance

L	Descriptor
L0	"Looks right." None.
L1	Lint + existing CI on human-authored code; no AI-specific eval.
L2	Deterministic eval harness in CI (build, unit/property tests, type-check, SAST); per-task acceptance criteria defined; AI changes can't merge red.
L3	Output + trajectory evals; golden datasets with rubrics; mutation testing; LLM-as-judge as secondary signal; abstention/escalation wired in; escape-rate tracked.
L4	≥99.9% release-gate enforced via generate→verify→repair + N-sample self-consistency + verifier voting; eval suites are controlled validation artifacts; statistical acceptance (confidence-interval bounds) per safety class; HITL mandatory for Class B/C.
L5	Continuous eval against live failure modes; auto-regression capture; eval-driven model promotion; drift detection closes the loop.
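To give a feel for what statistical acceptance at the D4-L4 gate implies, the sketch below computes how many consecutive passing eval cases are needed to support a pass-rate claim. It assumes independent, identically distributed eval cases, zero observed failures, and an exact one-sided binomial bound; the source does not prescribe a method, so this is one plausible choice rather than the documented one.

```python
import math


def min_samples_zero_failures(target_pass_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that n passes out of n trials support the target rate.

    With zero failures, the one-sided lower confidence bound on the true
    pass rate p is alpha ** (1 / n), where alpha = 1 - confidence.
    Solving alpha ** (1 / n) >= target gives n >= ln(alpha) / ln(target).
    """
    if not 0.0 < target_pass_rate < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("target_pass_rate and confidence must be in (0, 1)")
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log(target_pass_rate))


def lower_bound_zero_failures(n: int, confidence: float = 0.95) -> float:
    """One-sided lower bound on the pass rate after n passes and 0 failures."""
    if n <= 0:
        raise ValueError("n must be positive")
    return (1.0 - confidence) ** (1.0 / n)
```

Under these assumptions, `min_samples_zero_failures(0.999)` is 2995: roughly three thousand clean, representative eval cases are needed before a 99.9% claim holds at 95% confidence, and a single failure pushes the requirement higher. A passing demo on a few dozen cases cannot carry that claim.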

D5 — Agentic Orchestration & Tooling

L	Descriptor
L0	None.
L1	Inline completion only; no tool use.
L2	Single-step agents, bounded tasks, no autonomous multi-file changes; tools read-only or PR-only.
L3	Multi-step agents in sandboxes (gVisor/Kata, ephemeral, egress-deny); MCP tool plane; policy server gates every call; HITL checkpoints designed in.
L4	Multi-agent (A2A) with role decomposition (planner/coder/test/review); validated tool catalog; deterministic hooks at lifecycle points; durable sessions/memory.
L5	Self-orchestrating workflows with governed dynamic planning; sub-agent fleets; automated decomposition; coordination cost-optimized.

D6 — Security & Zero-Trust

L	Descriptor
L0	Uncontrolled; IP egress risk.
L1	SSO, RBAC, TLS, prompt/response logging, DLP/secret-scanning on inputs.
L2	Per-repo scoping; signed images; secret management (Vault); no agent write access to prod.
L3	Zero-trust mesh (mTLS, SPIFFE/SPIRE); OPA/Gatekeeper; sandboxed exec with egress allow-lists; prompt-injection & context-poisoning defenses; PII/PHI masking; SBOM.
L4	Supply-chain assurance (SLSA, Sigstore/cosign signed models+datasets+artifacts); semantic policy gating; IEC 62443 + FDA premarket cybersecurity (§524B) alignment; WORM immutable audit; dual-control for high-risk tool calls.
L5	Continuous adversarial testing (red-team agents); automated anomaly/drift response; tamper-evident, fleet-wide, self-defending posture.

D7 — Observability & FinOps

L	Descriptor
L0	None.
L1	Basic request logging + GPU utilization metrics.
L2	Per-team usage dashboards; cost attribution; eval pass/fail visibility.
L3	End-to-end tracing (OpenTelemetry) of agent trajectories; token/GPU cost per task; routing telemetry; budget alerts.
L4	Cost-per-green-PR as a board metric; per-model/per-tier economics; SLOs on latency/quality/cost; capacity forecasting; budget guardrails enforced in-loop (hard stops).
L5	Closed-loop cost optimization; automated routing/quantization decisions; ROI attribution; predictive scaling.
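The "budget guardrails enforced in-loop (hard stops)" descriptor at L4 can be illustrated with a per-task budget that refuses a charge instead of logging an overrun after the fact. `TokenBudget` and `BudgetExceeded` are hypothetical names, and metering in tokens rather than GPU-seconds is an arbitrary choice for the sketch.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push a task past its budget."""


class TokenBudget:
    def __init__(self, max_tokens: int) -> None:
        if max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> int:
        """Reserve tokens before a model call; return the remaining budget."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used + tokens > self.max_tokens:
            # Hard stop: the call is refused and nothing is recorded as spent.
            raise BudgetExceeded(
                f"charge of {tokens} exceeds remaining {self.max_tokens - self.used}"
            )
        self.used += tokens
        return self.max_tokens - self.used
```

An agent loop would call `charge` before each model invocation and treat `BudgetExceeded` as a signal to stop and escalate, so a runaway loop cannot spend past its allocation.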

D8 — People, Skills & Operating Model

L	Descriptor
L0	Individual experimentation.
L1	Training on sanctioned tool + AUP; champions identified.
L2	Spec-writing & review skills built; "conductor" mode normalized; review checklists for AI output.
L3	Orchestrator role emerges; ownership split (e.g., API vs UX) to reduce merge conflict; approval-fatigue controls (digital quiet hours, batched review).
L4	Formal roles: Agent Steward, Eval Owner, Harness Engineer; no-blame integration culture; reviewers trained on AI failure modes (hallucinated deps, plausible-wrong logic).
L5	Hiring/skills reframed around judgment over implementation; org-wide agentic literacy; continuous capability development; humans focus on architecture, risk, and verification.

6. Level-gate rules (how you actually "are" at a level)

1. Floor, not average. Your level on any dimension is the highest level whose descriptor, and every descriptor below it, is fully satisfied. Your overall operating level is the minimum across D1, D4, and D6 (the safety/assurance/security triad). You may invest ahead in D2/D5, but you may not grant autonomy beyond what D1/D4/D6 support. This is the single most important rule: it prevents capability from outrunning assurance.
2. Evidence-based promotion. Advancing a level requires a documented assessment (§7) signed by Engineering, QA/RA, and Security. No self-attestation.
3. Safety-class gating. Even at L4/L5, autonomy is capped per IEC 62304 class (P3). L5 never means "agents merge Class C code unattended."
4. Reversibility. Every level must support rollback to the prior level's controls if the eval escape-rate or incident metrics breach their thresholds.

7. Assessment & scoring method

Cadence: quarterly self-assessment; annual independent (internal audit) assessment; event-driven re-assessment after any Sev-1 AI-attributable defect or recall-relevant escape.
Instrument: 8 dimensions × 6 levels rubric (this document), scored 0–5 each, with required evidence artifacts per cell (logs, validation records, eval reports, signed model lineage, policy configs).
Scoring outputs:
- Dimension scores (radar chart) → reveals imbalance.
- Governing level = min(D1, D4, D6).
- Capability level = mean(all) (informational only).
- Autonomy authorization = derived table mapping (governing level × safety class) → permitted agent actions (lives in 07-security-and-compliance).
Gate review: QA/RA holds veto. Security holds veto. Promotion requires both plus Engineering sign-off.
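The two scoring outputs defined above, governing level and capability level, reduce to a few lines. The function names and the dictionary shape are assumptions for illustration; the formulas are the ones stated in this section.

```python
from statistics import mean
from typing import Dict

DIMENSIONS = tuple(f"D{i}" for i in range(1, 9))  # D1..D8
GOVERNING_DIMENSIONS = ("D1", "D4", "D6")         # safety/assurance/security triad


def _validate(scores: Dict[str, int]) -> None:
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    for dim in DIMENSIONS:
        if not 0 <= scores[dim] <= 5:
            raise ValueError(f"{dim} must be scored 0-5, got {scores[dim]}")


def governing_level(scores: Dict[str, int]) -> int:
    """Governing level = min(D1, D4, D6)."""
    _validate(scores)
    return min(scores[d] for d in GOVERNING_DIMENSIONS)


def capability_level(scores: Dict[str, int]) -> float:
    """Capability level = mean of all eight dimensions (informational only)."""
    _validate(scores)
    return float(mean(scores[d] for d in DIMENSIONS))
```

For an organization scored D1=2, D2=4, D3=3, D4=3, D5=4, D6=2, D7=3, D8=3, the capability level is 3.0 but the governing level is 2, so autonomy is authorized at L2 regardless of the stronger D2 and D5 scores.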

Illustrative target trajectory (informational)

Quarter (from program start)	Target governing level
Q1–Q2	L1 (kill shadow AI, stand up serving + logging)
Q3–Q4	L2 (SDD + deterministic eval harness)
Q5–Q7	L3 (sandboxed orchestration + fleet + policy server)
Q8–Q11	L4 (CSA validation of agents, 99.9% gates)
Q12+	L5 (closed-loop, cost-optimized)

See 09-adoption-roadmap for the detailed phased plan, owners, and exit criteria.

8. Per-level KPIs

Level	Capability KPIs	Assurance / cost KPIs
L1	% devs on sanctioned tool; shadow-AI incidents → 0	100% prompts logged; IP-egress events = 0
L2	% repos with `specs/`+`AGENTS.md`; agent-PR acceptance rate	CI deterministic-gate coverage ≥ X%; zero red merges
L3	% tasks via orchestrated agents; HITL checkpoint adherence	trajectory-eval pass rate; escape-rate; cost-per-task
L4	autonomous-task throughput within validated bounds	release-gate ≥99.9%; validation-evidence completeness; revalidation on time
L5	fine-tune cycle time; auto-promotion rate	cost-per-green-PR trend ↓; drift MTTR; audit-evidence automation %

9. Anti-patterns (explicitly disallowed)

Autonomy ahead of assurance — granting L3+ autonomy while D4/D6 sit at L1/L2. Forbidden by §6.1.
LLM-as-sole-gate — using a model to approve a model's output on Class B/C code. Violates P2.
Unversioned models — serving a model you cannot reproduce or sign. Violates P7 and CSA.
Eval theater — a passing demo presented as a passing eval. Evals need rubrics + statistical acceptance (D4-L4).
Context dumping — stuffing 100k-token repos into prompts; burns GPU and degrades accuracy. See 08.
Approval-fatigue reflex-clicking — unbatched micro-approvals leading reviewers to rubber-stamp. Controlled at D8-L3.
Shadow SaaS — any call to external LLM endpoints. Hard-blocked at network layer (D6).

10. Regulatory mapping (orientation)

Standard / regulation	How this model engages it
IEC 62304 (medical device software lifecycle)	Agent actions mapped to lifecycle activities; safety class drives autonomy (P3); traceability at D1-L4
ISO 13485 / FDA QMSR (21 CFR 820, effective Feb 2026)	AI agents recognized as validated production/quality tooling; SOPs + DHF integration
ISO 14971 (risk management)	AI-tooling failure modes in the risk file; mitigations = deterministic gates + HITL
FDA Computer Software Assurance (CSA)	Risk-based validation of each agent as production/QS software (D1-L4)
GAMP 5 (2nd ed.)	Risk-based, critical-thinking validation approach for the tool category
21 CFR Part 11	Immutable, attributable, signed e-records of all agent actions (P4, D6-L4)
ISO/IEC 42001 (AI management system)	Org-level AI governance controls (D1-L3+)
IEC 62443 / FDA premarket cybersecurity (§524B)	Zero-trust + supply-chain assurance for the agent platform (D6-L4)
FDA AI-enabled device guidance + PCCP	Pattern reused at D1-L5 for governed model evolution of the dev toolchain

Critical distinction: this model governs AI that builds the device (production/quality-system software). AI shipped inside the device (SaMD/AI-enabled function) is a separate submission-bearing track — but the same assurance muscles (eval rigor, reproducibility, PCCP) directly enable it. See 07-security-and-compliance §"Two regulated tracks."

11. How to read the rest of the set

What we must satisfy → 01-requirements
What we build it on → 03-reference-architecture
What models, and how we make them ours → 04-model-strategy-and-finetuning
How we earn 99.9% → 05-evaluation-and-validation
How agents actually work day-to-day → 06-agentic-workflows
How we keep it safe & compliant → 07-security-and-compliance
How we afford it → 08-token-and-gpu-economics
How we get there → 09-adoption-roadmap