ASMM-Med — Agentic SDLC Maturity Model for Regulated Medical Device Engineering

Audience: Engineering leadership, Quality/Regulatory (QA/RA), MLOps/Platform, and Security at a 1000+ developer medical-device organization (GE HealthCare / Siemens Healthineers class).

Scope: AI agents used to build, test, document, and maintain regulated software (the tooling side of the SDLC), running on a Kubernetes cloud-native platform with self-hosted, fine-tuned open-weight models only (no Claude/OpenAI/Gemini SaaS APIs).

Companion docs: 01-requirements · 03-reference-architecture · 04-model-strategy-and-finetuning · 05-evaluation-and-validation · 06-agentic-workflows · 07-security-and-compliance · 08-token-and-gpu-economics · 09-adoption-roadmap
1. Purpose
This maturity model gives a regulated medical-device engineering organization a defensible, auditable path from ad-hoc AI assistance to a validated, agent-native software development lifecycle. It is explicitly not a vibe-coding model. Every level raises both capability and assurance, because in this domain unverified velocity is a liability, not an asset.
The model answers four questions leadership repeatedly asks:
- How do we get to ≥99.9% release-gate correctness when the underlying models are probabilistic?
- How do we stay inside FDA / IEC 62304 / ISO 13485 obligations while letting agents touch regulated code?
- How do we control GPU/token cost when agentic loops are expensive and we self-host?
- What does "good" look like at each step, so we can fund, audit, and de-risk the transition?
2. Foundational design principles
These principles are invariant across all levels and bind the rest of the documentation set.
P1 — 99.9% is a system property, not a model property
No single open-weight model will deterministically hit 99.9% functional correctness on regulated code. The target is met by the system: a probabilistic generator wrapped in deterministic verifiers and human checkpoints.
The model's job is to propose; the harness's job is to dispose. Correctness is earned at the gate, not assumed at generation. See 05-evaluation-and-validation for the full assurance argument.
P2 — Determinism wraps probabilism
Wherever a check can be deterministic (type systems, unit/property/mutation tests, static analysis, schema validation, policy-as-code, formal/specification checks), it must be, and it sits on the critical path to merge. Probabilistic judgment (LLM-as-judge) is permitted only as a secondary, escalating signal, never as a sole gate for risk-classified changes.
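A minimal sketch of how P1 and P2 could combine at a merge gate. All names here (`Verdict`, `gate`, the `judge_floor` default) are hypothetical illustrations, not part of the documentation set. Deterministic verifiers alone decide pass or fail; an optional LLM-judge score can only trigger escalation to a human.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Verifier = Tuple[str, Callable[[str], bool]]  # (name, deterministic check)

@dataclass(frozen=True)
class Verdict:
    deterministic_pass: bool   # True only if every deterministic verifier passed
    escalate_to_human: bool    # True if a verifier failed or the judge raised a concern
    failed: Tuple[str, ...]    # names of the verifiers that failed

def gate(change: str,
         verifiers: Sequence[Verifier],
         judge_score: Optional[float] = None,
         judge_floor: float = 0.8) -> Verdict:
    """Run deterministic checks; treat the judge score as a secondary signal."""
    failed = tuple(name for name, check in verifiers if not check(change))
    deterministic_pass = not failed
    # The judge can add an escalation but can never turn a failure into a pass.
    judge_concern = judge_score is not None and judge_score < judge_floor
    return Verdict(
        deterministic_pass=deterministic_pass,
        escalate_to_human=judge_concern or not deterministic_pass,
        failed=failed,
    )
```

In this sketch a perfect judge score cannot rescue a change that fails a verifier, and a low judge score on an otherwise passing change only routes it to a human.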
P3 — Risk-proportional autonomy (IEC 62304 safety class drives the leash length)
Agent autonomy is a function of the safety class of the artifact being modified:
- Class A (no injury possible): high autonomy permitted at higher maturity.
- Class B (non-serious injury): bounded autonomy, mandatory human review.
- Class C (death / serious injury): agent may propose and evidence, but a qualified human always authors the merge decision; dual control required.
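The class rules above can be sketched as a lookup from safety class to the number of distinct human approvals a merge needs. This is an illustration under stated assumptions: the function names are hypothetical, and the choice that Class A drops to zero approvals only at governing level 4 or higher is an assumption read from the "Autonomy ceiling" column in §3. The authoritative autonomy table lives in 07-security-and-compliance.

```python
from enum import Enum

class SafetyClass(Enum):
    A = "A"  # no injury possible
    B = "B"  # non-serious injury
    C = "C"  # death / serious injury

def required_human_approvals(safety_class: SafetyClass, governing_level: int) -> int:
    """Minimum distinct human approvers before a merge (illustrative values)."""
    if safety_class is SafetyClass.C:
        return 2  # dual control, at every maturity level
    if safety_class is SafetyClass.B:
        return 1  # mandatory human review
    # Assumption: unattended Class A merges only once the org is at L4 or above.
    return 0 if governing_level >= 4 else 1

def merge_allowed(safety_class: SafetyClass,
                  governing_level: int,
                  human_approvers: set) -> bool:
    """A set is used so the same person approving twice counts once."""
    return len(human_approvers) >= required_human_approvals(safety_class, governing_level)
```

For example, `merge_allowed(SafetyClass.C, 5, {"alice"})` is `False` even at L5, because Class C always needs two distinct approvers.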
P4 — Everything an agent does is evidence
Every prompt, context bundle, model+adapter version, tool call, verifier result, and human decision is captured as an immutable, attributable, replayable record (21 CFR Part 11-grade). If it isn't logged, it didn't happen, and it can't ship in a regulated product.
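One common way to make such a record tamper-evident is a hash chain, sketched below with the standard library. The record fields and function names are hypothetical. This shows tamper evidence only; it does not by itself provide electronic signatures, WORM storage, or the other controls that Part 11-grade records need.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record

def _digest(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always hashes to the same value.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def append_record(chain: list, payload: dict) -> dict:
    """Append a record whose hash covers its payload and the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    record = {"prev": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)}
    chain.append(record)
    return record

def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edited, removed, or reordered record breaks the chain."""
    prev_hash = GENESIS
    for record in chain:
        if record["prev"] != prev_hash or record["hash"] != _digest(prev_hash, record["payload"]):
            return False
        prev_hash = record["hash"]
    return True
```

Because each hash covers the previous one, changing any earlier payload invalidates that record and everything after it.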
P5 — The harness is the product (Agent = Model + Harness)
~90% of behavior and ~100% of assurance come from the harness (instructions, tools, sandboxes, policies, evals, observability), not the raw model. Investment, validation, and change control concentrate on the harness. Models are hot-swappable inputs; the harness is the controlled system.
P6 — Cost is measured per resolved, verified task, not per token
Self-hosting converts OpEx (API calls) into CapEx+OpEx (GPU fleet). The governing metric is cost-per-green-PR (a change that passes all deterministic gates and human review), not cost-per-token. Routing, caching, and tiering exist to minimize that metric. See 08-token-and-gpu-economics.
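A minimal sketch of the metric, assuming a simple accounting model: total spend over a period divided by the number of green PRs in that period. The parameter names and the optional overhead term are hypothetical; 08-token-and-gpu-economics defines the real accounting.

```python
def cost_per_green_pr(gpu_hours: float,
                      cost_per_gpu_hour: float,
                      green_prs: int,
                      fixed_overhead: float = 0.0) -> float:
    """Total spend for the period divided by PRs that passed every gate.

    GPU time spent on failed or abandoned attempts stays in the numerator,
    which is what distinguishes this metric from cost-per-token.
    """
    if green_prs <= 0:
        raise ValueError("no green PRs in the period; the metric is undefined")
    return (gpu_hours * cost_per_gpu_hour + fixed_overhead) / green_prs
```

With illustrative inputs, `cost_per_green_pr(100, 2.0, 40)` returns `5.0`. A cheaper model that halves the per-hour cost but yields only a quarter of the green PRs would score worse on this metric.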
P7 — Self-hosted, sovereign, reproducible
All inference and training run inside the organization's trust boundary (on-prem or sovereign VPC). Models, datasets, and training runs are versioned, signed, and reproducible so that any artifact an agent produced can be regenerated and defended in an audit or a recall investigation.
3. The six maturity levels
| Level | Name | One-line essence | Dominant operating mode | Autonomy ceiling |
|---|---|---|---|---|
| L0 | Ad-hoc Assistance | Ungoverned, shadow AI | Individual, in-editor | Suggestions only |
| L1 | Governed Assistance | Sanctioned self-hosted assist | Conductor (human types) | Inline completion |
| L2 | Spec-Driven Bounded Automation | Specs + single agents on reviewable tasks | Conductor + bounded tasks | Single-step, full review |
| L3 | Orchestrated Agentic Workflows | Sandboxed multi-step agents open PRs | Orchestrator | Multi-step, HITL gates |
| L4 | Validated Autonomous Agents | Agents validated as CSA software tools | Orchestrator + validated autonomy | Autonomous within validated bounds |
| L5 | Self-Optimizing Agentic Enterprise | Closed-loop, eval-driven, cost-optimal | Fleet governance | Self-improving under PCCP-style control |
L0 — Ad-hoc Assistance (the starting reality, not a goal)
Developers use whatever in-IDE autocomplete they can reach. No central policy, no logging, no model governance, and likely shadow use of public SaaS endpoints — an IP-leakage and compliance incident waiting to happen. There is no audit trail tying generated code to a model version. This level is non-compliant by default and the transformation's first job is to extinguish it.
L1 — Governed Assistance
A sanctioned, self-hosted code-assistance model (Tier-S/Tier-M, see 04) is offered through the IDE behind SSO, with prompt/response logging, an Acceptable-Use Policy, and DLP/PII guards. Humans still author 100% of production code; the AI accelerates typing and lookup. The win is eliminating shadow AI and establishing the logging substrate that all later assurance depends on.
L2 — Spec-Driven Bounded Automation
The organization adopts Spec-Driven Development: specs/, AGENTS.md/rule files, and BDD/Gherkin acceptance criteria live in the repo as the source of truth. Single-step agents perform bounded, individually reviewable tasks — test generation, documentation, mechanical refactors, boilerplate — and a deterministic evaluation harness runs in CI. Every agent output is a normal PR under full human review. This is the first level where agents write code that can ship, and the first level where the eval harness exists.
L3 — Orchestrated Agentic Workflows
Agents run multi-step tasks in ephemeral, egress-restricted sandboxes, call tools through an MCP tool plane, and are routed across a fleet of fine-tuned models by task complexity. A policy server gates every tool call with both structural and semantic checks. Agents open PRs autonomously, but human-in-the-loop checkpoints are mandatory at risk-defined boundaries. Trajectory and output evaluations join the deterministic gates. The org now ships agent-produced changes at scale with managed risk.
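The structural half of such a policy gate could look like the deny-by-default sketch below. The tool names, host names, and the `allow_tool_call` function are hypothetical placeholders; semantic gating (judging what a call intends to do) is not shown.

```python
from urllib.parse import urlparse

# Hypothetical allow-lists; a real deployment would load these from policy-as-code.
ALLOWED_TOOLS = {"read_file", "run_tests", "open_pr"}
ALLOWED_HOSTS = {"git.internal.example", "registry.internal.example"}

def allow_tool_call(tool: str, args: dict) -> bool:
    """Structural check: known tool, and any URL argument targets an allowed host."""
    if tool not in ALLOWED_TOOLS:
        return False
    url = args.get("url")
    if url is not None:
        # hostname is None for malformed or scheme-less URLs, which are denied too.
        if urlparse(url).hostname not in ALLOWED_HOSTS:
            return False
    return True
```

Anything not explicitly listed is refused, so a new tool or endpoint needs a policy change before an agent can use it.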
L4 — Validated Autonomous Agents (the regulated inflection point)
Each production agent is treated as a software tool used in production of a medical device and is validated under FDA Computer Software Assurance (CSA): documented intended use, risk-based test evidence, and recorded validation. Agents operate autonomously within validated bounds for their safety-class envelope, coordinate via A2A multi-agent patterns, and are continuously gated by ≥99.9% eval thresholds with full IEC 62304 traceability (requirement → spec → code → test → eval → release). Class C work still requires dual human control. This is where "agentic" becomes "regulated-grade."
L5 — Self-Optimizing Agentic Enterprise
A closed loop connects production telemetry → eval failures → curated fine-tuning data → candidate adapters → automated eval-driven promotion — all under a Predetermined Change Control Plan (PCCP)-style governance so model/harness evolution is pre-authorized and auditable. The fleet self-tunes for quality and cost-per-green-PR, observability is enterprise-wide, and agents participate in their own improvement under human governance. Maturity here is operational discipline at scale, not raw autonomy.
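The promotion step of that loop can be sketched as a pure function over eval results. The rule used here (every suite meets the gate and no suite regresses against the incumbent) is an assumption for illustration, as are the names; the actual promotion criteria would be fixed in the change-control plan.

```python
def should_promote(candidate: dict, incumbent: dict, gate: float = 0.999) -> bool:
    """Decide promotion from per-suite pass rates in [0, 1].

    candidate / incumbent map eval-suite name -> pass rate.
    """
    if not candidate:
        return False  # no evidence, no promotion
    if set(candidate) != set(incumbent):
        return False  # both must be evaluated on exactly the same suites
    return all(
        candidate[suite] >= gate and candidate[suite] >= incumbent[suite]
        for suite in candidate
    )
```

Keeping the decision a deterministic function of recorded eval results means the same inputs can be replayed later to show why an adapter was or was not promoted.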
4. The eight capability dimensions
Maturity is assessed independently along eight axes. An organization is rarely uniform; the floor across safety-relevant dimensions (D1, D4, D6) governs what autonomy is actually permitted, regardless of how advanced D2/D5 are.
| # | Dimension | What it measures |
|---|---|---|
| D1 | Governance, Quality & Regulatory Compliance | QMS integration, IEC 62304/ISO 13485/QMSR alignment, CSA validation of tools, traceability, change control |
| D2 | Model Infrastructure & MLOps | Self-hosted serving, fine-tuning pipeline, model registry, reproducibility, multi-LoRA, GPU platform |
| D3 | Context & Knowledge Engineering | Specs, rule files, RAG over code/docs/regulatory corpus, memory, context hygiene |
| D4 | Evaluation, Validation & Assurance | Deterministic verifiers, eval suites, trajectory eval, acceptance criteria, the 99.9% gate, abstention |
| D5 | Agentic Orchestration & Tooling | Single→multi-agent, MCP/A2A, sandboxing, HITL design, workflow engine |
| D6 | Security & Zero-Trust | Identity, egress control, prompt-injection defense, supply chain, secrets, IEC 62443, audit immutability |
| D7 | Observability & FinOps | Tracing, eval dashboards, token/GPU metering, routing economics, budget guardrails |
| D8 | People, Skills & Operating Model | Roles (conductor/orchestrator), review culture, training, approval-fatigue controls |
5. The maturity matrix (core artifact)
Each cell is the exit criterion for that dimension at that level. A dimension reaches a level only when it meets that level's descriptor and every descriptor below it; §6 defines how dimension levels combine into the overall operating level.
D1 — Governance, Quality & Regulatory Compliance
| L | Descriptor |
|---|---|
| L0 | No policy; shadow AI; no link between generated code and a model version. Non-compliant. |
| L1 | AUP published; AI-assist logged; QMS acknowledges AI tooling; data-handling/IP policy enforced (no external endpoints). |
| L2 | AI-tool use captured in the DHF/quality records; SDD specs are controlled documents; SOP for "AI-assisted change" exists; risk assessment (ISO 14971) covers AI tooling. |
| L3 | Risk-based tool classification per GAMP 5; agent actions mapped to IEC 62304 activities; change-control board reviews harness changes; ISO/IEC 42001 AI-management controls adopted. |
| L4 | Each agent validated under FDA CSA with documented intended use + evidence; full requirement-to-release traceability; agents recognized in ISO 13485/QMSR as validated production tooling; periodic revalidation triggers defined. |
| L5 | PCCP-style predetermined change control governs model/harness evolution; continuous compliance monitoring; automated audit-evidence generation; regulatory-grade reproducibility of any historical agent action. |
D2 — Model Infrastructure & MLOps
| L | Descriptor |
|---|---|
| L0 | None / external SaaS. |
| L1 | Single self-hosted model served on K8s GPU (vLLM/Triton) behind an internal gateway; basic autoscaling. |
| L2 | Model registry (MLflow) with versioned, signed models; first fine-tunes (LoRA/SFT) on internal corpus; reproducible serving images. |
| L3 | Tiered model fleet (S/M/L/V/E) with multi-LoRA hot-swap; routing gateway; quantization (FP8/INT8/AWQ); distributed training (Ray/Kueue); model cards mandatory. |
| L4 | Reproducible, validated training pipelines (locked datasets/seeds, signed lineage) suitable as validation evidence; canary + shadow deployment; rollback guarantees; per-domain adapters. |
| L5 | Continuous fine-tuning from production signal; automated candidate training; eval-gated promotion; fleet-level capacity/cost optimization; data flywheel under governance. |
D3 — Context & Knowledge Engineering
| L | Descriptor |
|---|---|
| L0 | Whatever is in the editor buffer. |
| L1 | Curated system prompts / org coding standards injected; no retrieval. |
| L2 | specs/ + AGENTS.md + BDD criteria in-repo; static rule files; basic code RAG. |
| L3 | Governed RAG over code, design docs, SOPs, and the regulatory corpus; static-vs-dynamic context split; Agent Skills with progressive disclosure; PII/PHI context hygiene middleware. |
| L4 | Validated knowledge sources (controlled, versioned, access-scoped); retrieval provenance recorded as evidence; graph-native code understanding for large legacy estates. |
| L5 | Self-curating knowledge base; context quality measured and optimized; memory governed and access-audited fleet-wide. |
D4 — Evaluation, Validation & Assurance
| L | Descriptor |
|---|---|
| L0 | "Looks right." None. |
| L1 | Lint + existing CI on human-authored code; no AI-specific eval. |
| L2 | Deterministic eval harness in CI (build, unit/property tests, type-check, SAST); per-task acceptance criteria defined; AI changes can't merge red. |
| L3 | Output + trajectory evals; golden datasets with rubrics; mutation testing; LLM-as-judge as secondary signal; abstention/escalation wired in; escape-rate tracked. |
| L4 | ≥99.9% release-gate enforced via generate→verify→repair + N-sample self-consistency + verifier voting; eval suites are controlled validation artifacts; statistical acceptance (CI bounds) per safety class; HITL mandatory for Class B/C. |
| L5 | Continuous eval against live failure modes; auto-regression capture; eval-driven model promotion; drift detection closes the loop. |
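To give a feel for what "statistical acceptance (CI bounds)" implies at D4-L4, the sketch below computes how many consecutive passing eval cases are needed before a one-sided lower confidence bound on the pass rate reaches a target. It assumes independent, identically distributed cases and zero observed failures, which real eval suites only approximate. The function names are hypothetical; the math is the exact binomial bound for the all-pass case, where the bound equals `alpha ** (1 / n)`.

```python
import math

def lower_bound_zero_failures(n: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) lower confidence bound on the pass rate
    after n independent cases with zero failures."""
    return alpha ** (1.0 / n)

def min_clean_samples(target: float, alpha: float = 0.05) -> int:
    """Smallest n for which n passes and zero failures put the
    (1 - alpha) lower bound at or above target."""
    return math.ceil(math.log(alpha) / math.log(target))
```

Under these assumptions, `min_clean_samples(0.999)` is `2995`: roughly three thousand failure-free cases are needed to claim a 99.9% pass rate at 95% confidence, and any observed failure raises that number.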
D5 — Agentic Orchestration & Tooling
| L | Descriptor |
|---|---|
| L0 | None. |
| L1 | Inline completion only; no tool use. |
| L2 | Single-step agents, bounded tasks, no autonomous multi-file changes; tools read-only or PR-only. |
| L3 | Multi-step agents in sandboxes (gVisor/Kata, ephemeral, egress-deny); MCP tool plane; policy server gates every call; HITL checkpoints designed in. |
| L4 | Multi-agent (A2A) with role decomposition (planner/coder/test/review); validated tool catalog; deterministic hooks at lifecycle points; durable sessions/memory. |
| L5 | Self-orchestrating workflows with governed dynamic planning; sub-agent fleets; automated decomposition; coordination cost-optimized. |
D6 — Security & Zero-Trust
| L | Descriptor |
|---|---|
| L0 | Uncontrolled; IP egress risk. |
| L1 | SSO, RBAC, TLS, prompt/response logging, DLP/secret-scanning on inputs. |
| L2 | Per-repo scoping; signed images; secret management (Vault); no agent write access to prod. |
| L3 | Zero-trust mesh (mTLS, SPIFFE/SPIRE); OPA/Gatekeeper; sandboxed exec with egress allow-lists; prompt-injection & context-poisoning defenses; PII/PHI masking; SBOM. |
| L4 | Supply-chain assurance (SLSA, Sigstore/cosign signed models+datasets+artifacts); semantic policy gating; IEC 62443 + FDA premarket cybersecurity (§524B) alignment; WORM immutable audit; dual-control for high-risk tool calls. |
| L5 | Continuous adversarial testing (red-team agents); automated anomaly/drift response; tamper-evident, fleet-wide, self-defending posture. |
D7 — Observability & FinOps
| L | Descriptor |
|---|---|
| L0 | None. |
| L1 | Basic request logging + GPU utilization metrics. |
| L2 | Per-team usage dashboards; cost attribution; eval pass/fail visibility. |
| L3 | End-to-end tracing (OpenTelemetry) of agent trajectories; token/GPU cost per task; routing telemetry; budget alerts. |
| L4 | Cost-per-green-PR as a board metric; per-model/per-tier economics; SLOs on latency/quality/cost; capacity forecasting; budget guardrails enforced in-loop (hard stops). |
| L5 | Closed-loop cost optimization; automated routing/quantization decisions; ROI attribution; predictive scaling. |
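An in-loop hard stop of the kind named at D7-L4 can be sketched as a per-task token budget that refuses any charge that would exceed its limit. `TaskBudget` and `BudgetExceeded` are hypothetical names, and tokens stand in for whatever unit (tokens, GPU-seconds, currency) the platform actually meters.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push a task past its budget."""

class TaskBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> int:
        """Record usage and return the remaining budget.

        The check happens before the usage is recorded, so a refused
        charge leaves the counter unchanged and the agent loop can stop.
        """
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"charge of {tokens} exceeds remaining {self.max_tokens - self.used}"
            )
        self.used += tokens
        return self.max_tokens - self.used
```

An agent loop would call `charge` before each model invocation and treat `BudgetExceeded` as a signal to abstain or escalate instead of retrying.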
D8 — People, Skills & Operating Model
| L | Descriptor |
|---|---|
| L0 | Individual experimentation. |
| L1 | Training on sanctioned tool + AUP; champions identified. |
| L2 | Spec-writing & review skills built; "conductor" mode normalized; review checklists for AI output. |
| L3 | Orchestrator role emerges; ownership split (e.g., API vs UX) to reduce merge conflict; approval-fatigue controls (digital quiet hours, batched review). |
| L4 | Formal roles: Agent Steward, Eval Owner, Harness Engineer; no-blame integration culture; reviewers trained on AI failure modes (hallucinated deps, plausible-wrong logic). |
| L5 | Hiring/skills reframed around judgment over implementation; org-wide agentic literacy; continuous capability development; humans focus on architecture, risk, and verification. |
6. Level-gate rules (how you actually "are" at a level)
- Floor, not average. Your level on any dimension is the highest level whose descriptor, together with every descriptor below it, is satisfied. Your overall operating level is the minimum across D1, D4, and D6 (the safety/assurance/security triad). You may invest ahead in D2/D5, but you may not grant autonomy beyond what D1/D4/D6 support. (This is the single most important rule: it prevents capability from outrunning assurance.)
- Evidence-based promotion. Advancing a level requires a documented assessment (§7) signed by Engineering + QA/RA + Security. No self-attestation.
- Safety-class gating. Even at L4/L5, autonomy is capped per IEC 62304 class (P3). L5 does not mean "agents merge Class C code unattended" — it never does.
- Reversibility. Every level must support rollback to the prior level's controls if eval escape-rate or incident metrics breach threshold.
7. Assessment & scoring method
- Cadence: quarterly self-assessment; annual independent (internal audit) assessment; event-driven re-assessment after any Sev-1 AI-attributable defect or recall-relevant escape.
- Instrument: 8 dimensions × 6 levels rubric (this document), scored 0–5 each, with required evidence artifacts per cell (logs, validation records, eval reports, signed model lineage, policy configs).
- Scoring outputs:
  - Dimension scores (radar chart) → reveals imbalance.
  - Governing level = `min(D1, D4, D6)`.
  - Capability level = `mean(all)` (informational only).
  - Autonomy authorization = derived table mapping (governing level × safety class) → permitted agent actions (lives in 07-security-and-compliance).
- Gate review: QA/RA holds veto. Security holds veto. Promotion requires both plus Engineering sign-off.
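The two scoring formulas above can be sketched directly. The function name and the dictionary shape are hypothetical; the formulas are the ones stated in this section.

```python
from statistics import mean

GOVERNING_DIMS = ("D1", "D4", "D6")            # safety / assurance / security triad
ALL_DIMS = tuple(f"D{i}" for i in range(1, 9))  # D1 .. D8

def score(dimension_levels: dict) -> dict:
    """Compute governing and capability levels from per-dimension scores (0-5)."""
    if set(dimension_levels) != set(ALL_DIMS):
        raise ValueError("expected exactly the dimensions D1..D8")
    if any(not 0 <= level <= 5 for level in dimension_levels.values()):
        raise ValueError("each dimension is scored 0-5")
    return {
        # The weakest of D1/D4/D6 caps the autonomy that may be granted.
        "governing_level": min(dimension_levels[d] for d in GOVERNING_DIMS),
        # Informational only; never used to authorize autonomy.
        "capability_level": mean(dimension_levels.values()),
    }
```

For an organization scored `{"D1": 2, "D2": 4, "D3": 3, "D4": 3, "D5": 4, "D6": 2, "D7": 3, "D8": 3}`, the governing level is 2 even though the capability level is 3, which is exactly the imbalance the floor rule is meant to expose.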
Illustrative target trajectory (informational)
| Quarter (from program start) | Target governing level |
|---|---|
| Q1–Q2 | L1 (kill shadow AI, stand up serving + logging) |
| Q3–Q4 | L2 (SDD + deterministic eval harness) |
| Q5–Q7 | L3 (sandboxed orchestration + fleet + policy server) |
| Q8–Q11 | L4 (CSA validation of agents, 99.9% gates) |
| Q12+ | L5 (closed-loop, cost-optimized) |
See 09-adoption-roadmap for the detailed phased plan, owners, and exit criteria.
8. Per-level KPIs
| Level | Capability KPIs | Assurance / cost KPIs |
|---|---|---|
| L1 | % devs on sanctioned tool; shadow-AI incidents → 0 | 100% prompts logged; IP-egress events = 0 |
| L2 | % repos with specs/+AGENTS.md; agent-PR acceptance rate | CI deterministic-gate coverage ≥ X%; zero red merges |
| L3 | % tasks via orchestrated agents; HITL checkpoint adherence | trajectory-eval pass rate; escape-rate; cost-per-task |
| L4 | autonomous-task throughput within validated bounds | release-gate ≥99.9%; validation-evidence completeness; revalidation on time |
| L5 | fine-tune cycle time; auto-promotion rate | cost-per-green-PR trend ↓; drift MTTR; audit-evidence automation % |
9. Anti-patterns (explicitly disallowed)
- Autonomy ahead of assurance — granting L3+ autonomy while D4/D6 sit at L1/L2. Forbidden by §6.1.
- LLM-as-sole-gate — using a model to approve a model's output on Class B/C code. Violates P2.
- Unversioned models — serving a model you cannot reproduce or sign. Violates P7 and CSA.
- Eval theater — a passing demo presented as a passing eval. Evals need rubrics + statistical acceptance (D4-L4).
- Context dumping — stuffing 100k-token repos into prompts; burns GPU and degrades accuracy. See 08.
- Approval-fatigue reflex-clicking — unbatched micro-approvals leading reviewers to rubber-stamp. Controlled at D8-L3.
- Shadow SaaS — any call to external LLM endpoints. Hard-blocked at network layer (D6).
10. Regulatory mapping (orientation)
| Standard / regulation | How this model engages it |
|---|---|
| IEC 62304 (medical device software lifecycle) | Agent actions mapped to lifecycle activities; safety class drives autonomy (P3); traceability at D1-L4 |
| ISO 13485 / FDA QMSR (21 CFR 820, effective Feb 2026) | AI agents recognized as validated production/quality tooling; SOPs + DHF integration |
| ISO 14971 (risk management) | AI-tooling failure modes in the risk file; mitigations = deterministic gates + HITL |
| FDA Computer Software Assurance (CSA) | Risk-based validation of each agent as production/QS software (D1-L4) |
| GAMP 5 (2nd ed.) | Risk-based, critical-thinking validation approach for the tool category |
| 21 CFR Part 11 | Immutable, attributable, signed e-records of all agent actions (P4, D6-L4) |
| ISO/IEC 42001 (AI management system) | Org-level AI governance controls (D1-L3+) |
| IEC 62443 / FDA premarket cybersecurity (§524B) | Zero-trust + supply-chain assurance for the agent platform (D6-L4) |
| FDA AI-enabled device guidance + PCCP | Pattern reused at D1-L5 for governed model evolution of the dev toolchain |
Critical distinction: this model governs AI that builds the device (production/quality-system software). AI shipped inside the device (SaMD/AI-enabled function) is a separate submission-bearing track — but the same assurance muscles (eval rigor, reproducibility, PCCP) directly enable it. See 07-security-and-compliance §"Two regulated tracks."
11. How to read the rest of the set
- What we must satisfy → 01-requirements
- What we build it on → 03-reference-architecture
- What models, and how we make them ours → 04-model-strategy-and-finetuning
- How we earn 99.9% → 05-evaluation-and-validation
- How agents actually work day-to-day → 06-agentic-workflows
- How we keep it safe & compliant → 07-security-and-compliance
- How we afford it → 08-token-and-gpu-economics
- How we get there → 09-adoption-roadmap