09: Adoption Roadmap
Program: Agentic-Native SDLC for Regulated Medical Device Engineering

Audience: Executive sponsors, PMO, Engineering / QA-RA / Security leadership, AI Governance Board

Status: Planning baseline (May 2026). All dates expressed as relative quarters from program start (Q1 = first full quarter after charter approval). All thresholds, ranges, and timelines are planning placeholders subject to gate-review revision.

Cross-references: 01-requirements.md, 02-maturity-model.md, 03-reference-architecture.md, 04-model-strategy-and-finetuning.md, 05-evaluation-and-validation.md, 06-agentic-workflows.md, 07-security-and-compliance.md, 08-token-and-gpu-economics.md.
1. Roadmap Philosophy
This roadmap is assurance-gated, not capability-gated. We do not deploy the most capable agent we can build; we deploy the most capable agent we can validate, secure, and govern — and no further. The maturity model (ASMM-Med, see 02) is the spine; this document is the sequencing, resourcing, and decision layer that moves the organization L0 → L5 without ever letting autonomy outrun control.
Five non-negotiable design rules:
| # | Rule | Consequence for sequencing |
|---|---|---|
| R1 | Governing level = min(D1, D4, D6) | Capability dimensions (D2, D3, D5) may sprint ahead in build, but the enabled autonomy level is clamped by the weakest of Governance (D1), Eval/Assurance (D4), Security (D6). |
| R2 | Don't grant autonomy you can't yet validate (P3 + P1) | Each phase ships its eval/validation evidence before the corresponding agent privilege is unlocked in production. The harness leads; the agent follows. |
| R3 | Reversible by construction | Every promotion has a defined rollback (feature-flag, model-version pin, privilege revocation, scope reduction). No one-way doors. |
| R4 | Pilot before scale | Capabilities are proven on low-safety-class, low-blast-radius repos before any Class B/C or fleet-wide exposure. |
| R5 | Value per phase | Every transition must unlock a defensible, measurable engineering or quality value — not just platform plumbing — or the phase is reconsidered. |
These rules operationalize the Seven Principles (02): P1 (99.9% release-gate correctness as a system property), P2 (determinism wraps probabilism), P3 (risk-proportional autonomy), P4 (Part 11 evidence), P5 (the harness is the product), P6 (cost-per-green-PR), P7 (self-hosted reproducibility). Promotion across any level requires tri-signature: Engineering + QA/RA + Security. Safety-class gating per IEC 62304 is absolute — Class C software changes are always under dual human control regardless of maturity level.
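Rule R1 and the Class C override can be sketched in a few lines. This is a minimal illustration, assuming each dimension is scored as an integer level from 0 to 5; the `Scores` type and both function names are hypothetical and do not come from 02.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """ASMM-Med control-dimension scores, each an integer level 0..5 (assumed scale)."""
    d1_governance: int
    d4_eval_assurance: int
    d6_security: int


def governing_level(s: Scores) -> int:
    """R1: the enabled autonomy level is clamped by the weakest control dimension."""
    return min(s.d1_governance, s.d4_eval_assurance, s.d6_security)


def requires_dual_human(safety_class: str) -> bool:
    """IEC 62304 Class C changes stay under dual human control at every level."""
    return safety_class.upper() == "C"


scores = Scores(d1_governance=3, d4_eval_assurance=2, d6_security=4)
print(governing_level(scores))   # 2: capped by D4 even though D6 is at 4
print(requires_dual_human("C"))  # True, regardless of the governing level
```

Capability dimensions (D2, D3, D5) are deliberately absent from the calculation: they can be built ahead, but they never raise the enabled level.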
2. Phase Plan
The program is five sequenced transitions. Build work for level N+1 may begin while operating at level N, but the new level is declared only after its exit gate passes. Illustrative trajectory (from 02): Q1–Q2 → L1; Q3–Q4 → L2; Q5–Q7 → L3; Q8–Q11 → L4; Q12+ → L5.
```mermaid
gantt
    title Agentic-Native SDLC: ASMM-Med Adoption Roadmap (relative quarters)
    dateFormat YYYY-MM-DD
    axisFormat Q%q
    section L0→L1 Governed Assistance
    Kill shadow AI / policy baseline          :a1, 2026-01-01, 60d
    Self-hosted serving + central logging     :a2, 2026-01-15, 75d
    L1 exit gate (Eng+QA/RA+Sec)              :milestone, m1, 2026-03-31, 0d
    section L1→L2 Spec-Driven Bounded Automation
    SDD + AGENTS.md + specs/ adoption         :b1, 2026-04-01, 80d
    Deterministic eval harness in CI          :b2, 2026-04-15, 90d
    L2 exit gate                              :milestone, m2, 2026-09-30, 0d
    section L2→L3 Orchestrated Agentic Workflows
    Sandboxed multi-step agents + MCP plane   :c1, 2026-10-01, 110d
    Model fleet + routing + policy server     :c2, 2026-10-15, 120d
    HITL workflow rollout                     :c3, 2026-12-01, 90d
    L3 exit gate                              :milestone, m3, 2027-06-30, 0d
    section L3→L4 Validated Autonomous Agents
    CSA agent validation + 62304 traceability :d1, 2027-07-01, 150d
    99.9% gate hardening + A2A                :d2, 2027-09-01, 160d
    L4 exit gate                              :milestone, m4, 2028-06-30, 0d
    section L4→L5 Self-Optimizing Enterprise
    Closed-loop fine-tuning + eval promotion  :e1, 2028-07-01, 150d
    Cost-optimal routing + PCCP change ctrl   :e2, 2028-09-01, 160d
    L5 steady state                           :milestone, m5, 2029-03-31, 0d
```
2.1 Phase A (L0 → L1): Governed Assistance (≈ Q1–Q2)
Objective. Eliminate ungoverned ("shadow") AI use; stand up self-hosted serving with complete prompt/response logging so that every AI interaction is observable, attributable, and policy-bound. This phase buys legitimacy, not autonomy.
Workstreams by dimension.
| Dim | Workstream |
|---|---|
| D1 | AI Use Policy ratified; AI Governance Board chartered; ISO/IEC 42001 AIMS scaffolding initiated; shadow-AI amnesty + cutover. |
| D2 | Stand up vLLM/Triton on K8s + GPU Operator; serve baseline open-weight S/M models; KServe endpoints; no fine-tuning yet. |
| D3 | Inventory knowledge sources; begin curated repo/context indexing; no agent retrieval yet. |
| D4 | Define eval taxonomy and baseline manual review checklist; no automated harness yet. |
| D5 | IDE assist only (completion/chat); no tools, no agency. |
| D6 | Egress controls to block external LLM SaaS; Vault for secrets; SPIFFE/SPIRE identity bootstrap; Istio mTLS baseline. |
| D7 | OpenTelemetry tracing of all inference; central prompt/response log (Part 11-style retention); first FinOps GPU dashboard. |
| D8 | Org-wide AI literacy training; name first Model Steward and Eval Owner; communicate the "structure scales, vibes don't" thesis. |
Deliverables / artifacts. AI Use Policy v1; serving platform runbook; inference audit log schema (Part 11-aligned); shadow-AI decommission report; GPU baseline capacity plan; Governance Board charter + RACI.
Exit criteria (who signs off).
- 100% of sanctioned AI traffic routed through self-hosted endpoints; zero known external LLM SaaS egress (Security verified).
- Complete, immutable, attributable logging of all prompts/responses (QA/RA verified for Part 11 retention).
- Governance Board operational with documented decision cadence (Engineering + QA/RA + Security tri-sign).
- Governing level confirmed: min(D1,D4,D6) ≥ L1.
Primary risks + mitigations. Shadow AI persists covertly → egress blocking + amnesty + monitoring + leader modeling. Serving instability erodes trust → conservative SLOs, gradual cutover, fallback. Logging seen as surveillance → no-blame framing, transparency on purpose (quality/regulatory, not individual performance).
Value unlocked. A defensible, auditable AI footprint; the substrate (serving + logging + identity) on which everything else is validated.
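One way to make the prompt/response log attributable and tamper-evident is to hash-chain its records. The sketch below is illustrative only: the field names are hypothetical and not the inference audit log schema named in the deliverables, and hash chaining by itself does not establish Part 11 compliance (retention, access control, and write-once storage are assumed to be handled elsewhere).

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # prev_hash of the first record


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_record(log: list[dict], user: str, model: str, prompt: str, response: str) -> dict:
    """Append an attributable record that commits to the previous record's hash."""
    body = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,        # who made the request (attribution)
        "model": model,      # which pinned model version answered
        "prompt": prompt,
        "response": response,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    record = {**body, "hash": _digest(body)}
    log.append(record)
    return record


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev_hash"] != prev_hash or record["hash"] != _digest(body):
            return False
        prev_hash = record["hash"]
    return True
```

A verifier that runs `verify_chain` on a schedule gives QA/RA a cheap integrity signal; removal of the most recent records is not detectable by the chain alone and would need an external anchor.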
2.2 Phase B (L1 → L2): Spec-Driven Bounded Automation (≈ Q3–Q4)
Objective. Make specification-driven development (SDD) and a deterministic eval harness in CI the default. Move from "AI helps me type" to "AI executes against a spec, and a deterministic gate judges the result." This is where the harness becomes the product (P5).
Workstreams by dimension.
| Dim | Workstream |
|---|---|
| D1 | Map SDD artifacts to design controls (DHF inputs); align AI-assisted change records with the QMSR (effective Feb 2026) and ISO 13485. |
| D2 | MLflow registry; first task-specific LoRA fine-tunes (S/M tier) on curated internal code/spec corpora; reproducible build pipeline. |
| D3 | AGENTS.md repo conventions; specs/ directory standard; curated retrieval over approved knowledge; context provenance. |
| D4 | Deterministic eval harness in CI (Argo / pipeline-triggered); golden datasets; pass/fail gates; seedable, reproducible scoring (P2, P7). |
| D5 | Bounded single-step automation: scaffold, test-gen, doc-gen — no multi-step autonomy, no tool plane yet. |
| D6 | Sigstore/cosign signing of model + harness artifacts; SLSA provenance + SBOM for the AI toolchain. |
| D7 | Cost-per-green-PR (P6) instrumented; eval run cost tracking; token budgets per pipeline. |
| D8 | First Harness Engineers chartered; "tests/evals before code" engineering norm; review-every-line discipline. |
Deliverables / artifacts. AGENTS.md + specs/ standard; deterministic eval harness (versioned, signed); golden eval datasets; LoRA fine-tune cards; SBOM + SLSA attestations; cost-per-green-PR baseline.
Exit criteria (who signs off).
- Deterministic eval harness gates CI on pilot repos with reproducible (seed-stable) results across runs (Engineering + Eval Owner).
- SDD artifacts traceable into design-control records (QA/RA).
- All models/harness artifacts signed with provenance (Security).
- Governing level: min(D1,D4,D6) ≥ L2.
Primary risks + mitigations. Eval flakiness/non-determinism → strict seeding, hermetic environments, quarantine of flaky cases. Spec quality varies → spec templates, peer review, "spec-as-design-input" training. Fine-tune overfit → held-out eval sets, eval-driven acceptance only.
Value unlocked. Trustworthy automated quality gates; measurable cost-per-green-PR; the first hard evidence that AI output can be validated before merge.
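The "seed-stable across runs" exit criterion can be checked mechanically by rerunning the harness with a fixed seed and comparing result digests. The sketch below assumes a toy scorer; `score_case`, `run_eval`, and `is_seed_stable` are hypothetical names standing in for the real harness.

```python
import hashlib
import json
import random


def score_case(case: str, rng: random.Random) -> float:
    """Hypothetical scorer: stands in for a metric that samples internally."""
    return round(len(case) / 10 + rng.random() * 0.01, 6)


def run_eval(cases: list[str], seed: int) -> dict[str, float]:
    """One harness run. All randomness flows from a local, seeded RNG."""
    rng = random.Random(seed)  # never the global RNG
    return {case: score_case(case, rng) for case in sorted(cases)}  # fixed case order


def digest(results: dict[str, float]) -> str:
    """Stable fingerprint of a run, suitable for storing next to the CI record."""
    return hashlib.sha256(json.dumps(results, sort_keys=True).encode("utf-8")).hexdigest()


def is_seed_stable(cases: list[str], seed: int, reruns: int = 3) -> bool:
    """True if every rerun with the same seed yields a byte-identical result digest."""
    return len({digest(run_eval(cases, seed)) for _ in range(reruns)}) == 1
```

In a real pipeline the same comparison would also need hermetic inputs (pinned model version, pinned dataset, fixed decoding settings); the seed alone is not sufficient.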
2.3 Phase C (L2 → L3): Orchestrated Agentic Workflows (≈ Q5–Q7)
Objective. Introduce sandboxed multi-step agents with a governed MCP tool plane, a model fleet with routing, a policy server, and human-in-the-loop (HITL) controls. Agents now plan and act across steps — inside sandboxes, under policy, with a human approving consequential actions.
Workstreams by dimension.
| Dim | Workstream |
|---|---|
| D1 | Policy-as-code (OPA/Gatekeeper + policy server) encodes who/what/where agents may act; risk-proportional autonomy matrix by safety class. |
| D2 | Model fleet tiers S/M/L/V/E operational; Ray/Kueue scheduling; multi-LoRA serving; routing by task/cost/quality. |
| D3 | Agent-grade retrieval + MCP-exposed knowledge resources; context windows scoped per task and per safety class. |
| D4 | Eval extended to trajectory and tool-use evaluation; HITL decision logging feeds eval; assurance cases per workflow. |
| D5 | Argo Workflows orchestration; MCP tool plane; gVisor/Kata sandboxing; Agent Stewards own each workflow; HITL checkpoints. |
| D6 | Zero-trust per agent identity (SPIFFE/SPIRE); least-privilege tool scopes; IEC 62443 alignment; egress-controlled sandboxes. |
| D7 | KEDA autoscaling on agent load; per-workflow FinOps; trajectory observability; token/GPU attribution per agent run. |
| D8 | Agent Steward + Harness Engineer roles scaled; HITL reviewer training; approval-fatigue controls designed (see §4). |
Deliverables / artifacts. MCP tool registry + scopes; policy server rulesets; agent sandbox runbook; per-workflow assurance case; routing policy; HITL approval logs; agent identity inventory.
Exit criteria (who signs off).
- Multi-step agents run only in sandboxes with enforced least-privilege tool scopes (Security).
- Policy server denies out-of-scope actions by default; all consequential actions have HITL approval with audit trail (QA/RA + Security).
- Trajectory-level eval coverage meets threshold on pilot workflows (Eval Owner).
- Governing level: min(D1,D4,D6) ≥ L3. Class C remains dual-human.
Primary risks + mitigations. Agent escapes sandbox / scope creep → default-deny policy, runtime sandbox, continuous policy tests. Tool-plane supply-chain risk → signed MCP servers, scoped credentials, Vault brokering. HITL becomes rubber-stamp → batched-but-meaningful approvals, sampling audits, no-blame escalation.
Value unlocked. Real end-to-end task automation (multi-file changes, investigation, refactors) with human consequence-gating — the first order-of-magnitude productivity step, safely bounded.
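The default-deny and HITL exit criteria reduce to a small decision function. The workstreams name OPA/Gatekeeper for the real policy server; the Python below only illustrates the decision logic, and the SPIFFE IDs, tool names, and grant table are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    DENY = "deny"
    ALLOW = "allow"
    NEEDS_HITL = "needs_hitl"


# Hypothetical grants: (agent identity, tool) -> True if the action is consequential.
GRANTS: dict[tuple[str, str], bool] = {
    ("spiffe://example.org/agent/refactor", "repo.read"): False,
    ("spiffe://example.org/agent/refactor", "repo.open_pr"): True,
}


def decide(agent_id: str, tool: str, grants: dict[tuple[str, str], bool]) -> Decision:
    """Default-deny: anything without an explicit grant is refused."""
    consequential = grants.get((agent_id, tool))
    if consequential is None:
        return Decision.DENY
    # Granted, but consequential actions still wait for a human approval.
    return Decision.NEEDS_HITL if consequential else Decision.ALLOW


print(decide("spiffe://example.org/agent/refactor", "repo.read", GRANTS))     # Decision.ALLOW
print(decide("spiffe://example.org/agent/refactor", "repo.open_pr", GRANTS))  # Decision.NEEDS_HITL
print(decide("spiffe://example.org/agent/refactor", "release.tag", GRANTS))   # Decision.DENY
```

Keeping the grant table keyed by agent identity means the per-pilot kill switch (revoking the identity) removes every grant at once.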
2.4 Phase D (L3 → L4): Validated Autonomous Agents (≈ Q8–Q11)
Objective. CSA-validate agents as part of the QMS so that defined agent workflows can act autonomously (within safety class) at ≥99.9% release-gate correctness, with full IEC 62304 traceability and A2A (agent-to-agent) coordination. This is the regulated leap: agents become validated tools.
Workstreams by dimension.
| Dim | Workstream |
|---|---|
| D1 | CSA validation packages per agent; ISO 14971 risk analysis for agent failure modes; QMSR/13485 integration; FD&C Act §524B and related cybersecurity documentation. |
| D2 | Locked, signed model+LoRA versions per validated workflow; reproducible serving; change control on model versions. |
| D3 | Validated knowledge sources; controlled context; provenance required for any retrieval feeding a Class B/C change. |
| D4 | ≥99.9% system-property gate demonstrated and continuously monitored; deterministic eval as validation evidence; assurance cases signed. |
| D5 | A2A coordination among validated agents; autonomy scoped strictly by safety class; Class C always dual human control. |
| D6 | Full IEC 62443 posture; cryptographic attestation of every agent action; tamper-evident audit. |
| D7 | Continuous gate-correctness monitoring; cost-per-green-PR optimized; drift + regression alarms. |
| D8 | QA/RA + Security embedded in agent lifecycle; Eval Owner owns validation evidence; operating model matured (§4). |
Deliverables / artifacts. Per-agent CSA validation report; IEC 62304 traceability matrix (requirement → design → agent action → test/eval → evidence); ISO 14971 agent FMEA; 99.9% gate-correctness monitoring dashboard; A2A protocol spec; signed assurance cases.
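The traceability chain (requirement → design → agent action → test/eval → evidence) lends itself to an automated completeness check. The sketch below assumes each autonomous action is stored as a flat record; the key names and example IDs are hypothetical.

```python
# Links every autonomous action must carry, in chain order (assumed record keys).
REQUIRED_LINKS = ("requirement", "design", "agent_action", "test_eval", "evidence")


def missing_links(record: dict) -> list[str]:
    """Return the chain links that are absent or empty in one record."""
    return [link for link in REQUIRED_LINKS if not record.get(link)]


def untraceable(records: list[dict]) -> dict[str, list[str]]:
    """Map record id -> missing links, for every record with a broken chain."""
    gaps = {}
    for record in records:
        missing = missing_links(record)
        if missing:
            gaps[record["id"]] = missing
    return gaps


records = [
    {"id": "T-1", "requirement": "REQ-12", "design": "DES-4", "agent_action": "ACT-90",
     "test_eval": "EVAL-7", "evidence": "EVID-33"},
    {"id": "T-2", "requirement": "REQ-13", "design": "DES-5", "agent_action": "ACT-91",
     "test_eval": "", "evidence": None},
]
print(untraceable(records))  # {'T-2': ['test_eval', 'evidence']}
```

An empty result from `untraceable` is a necessary condition for the "end-to-end traceability for every autonomous action" exit criterion, not proof that the linked artifacts are themselves adequate.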
Exit criteria (who signs off).
- Validated agents demonstrate ≥99.9% release-gate correctness as a sustained system property (Eval Owner + QA/RA).
- End-to-end IEC 62304 traceability for every autonomous action (QA/RA).
- CSA validation accepted into the QMS; reversibility + version pinning enforced (Engineering + QA/RA + Security).
- Class C dual-human control verified intact (Security + QA/RA).
- Governing level: min(D1,D4,D6) ≥ L4.
Primary risks + mitigations. Regulator non-acceptance of agent validation approach → early FDA/CSA engagement, conservative assurance cases, pilot scope. 99.9% not met → no promotion; remain L3; harden harness. Drift erodes validated state → continuous monitoring + automatic rollback to pinned version.
Value unlocked. Bounded autonomous engineering for lower-risk classes with regulatory-grade evidence — sustained throughput gains without sacrificing the audit trail.
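To give a sense of how much evidence "≥99.9% as a sustained system property" implies, the sketch below computes how many consecutive correct gate decisions are needed before a one-sided confidence bound clears the threshold. It assumes independent, binary gate outcomes with zero observed failures and a 95% confidence level; none of these assumptions is specified in this roadmap, and 05 may define the measurement differently.

```python
import math


def lower_bound_zero_failures(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided lower bound on the success rate after n trials with 0 failures."""
    return (1.0 - confidence) ** (1.0 / n)


def min_clean_trials(target: float, confidence: float = 0.95) -> int:
    """Smallest n for which n failure-free trials put the lower bound at or above target."""
    return math.ceil(math.log(1.0 - confidence) / math.log(target))


n = min_clean_trials(0.999)
print(n)  # 2995
print(lower_bound_zero_failures(n) >= 0.999)      # True
print(lower_bound_zero_failures(n - 1) >= 0.999)  # False
```

Under these assumptions roughly three thousand failure-free gate decisions are needed per claim, and any observed failure raises that number, which is one reason pilot volume and eval dataset size matter before an L4 gate review.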
2.5 Phase E (L4 → L5): Self-Optimizing Agentic Enterprise (≈ Q12+)
Objective. Close the loop: eval-driven, cost-optimal fine-tuning and promotion under PCCP-style change control. The system improves itself within pre-authorized bounds, with every change gated by the deterministic harness and governed change control.
Workstreams by dimension.
| Dim | Workstream |
|---|---|
| D1 | FDA AI/PCCP-style predetermined change-control protocol authored and approved; ISO/IEC 42001 AIMS at full maturity. |
| D2 | Closed-loop fine-tuning pipeline; candidate models auto-trained from production signal; promotion only via eval gate. |
| D3 | Self-curating knowledge with provenance + freshness controls; feedback-curated eval datasets. |
| D4 | Eval-driven promotion: a model/agent is promoted only if it beats the incumbent on the deterministic harness while holding release-gate correctness at ≥99.9% (P1, P5). |
| D5 | Autonomous fleet self-optimization (routing, LoRA selection) within PCCP envelope. |
| D6 | Continuous attestation of self-modifying components; change provenance; rollback always available. |
| D7 | Cost-optimal routing (P6) closed-loop with FinOps; auto-rightsizing GPU; token economics steered to target. |
| D8 | Operating model steady-state; CoE → embedded; continuous enablement; no-blame, evidence-first culture institutionalized. |
Deliverables / artifacts. PCCP change-control protocol; closed-loop fine-tune pipeline; eval-driven promotion policy; cost-optimization control loop; AIMS conformance evidence.
Exit criteria (steady state, who signs off).
- Every self-initiated model/agent change passes deterministic eval gate ≥99.9% before promotion, within PCCP envelope (Eval Owner + QA/RA).
- Cost-per-green-PR trending to target under FinOps governance (Engineering + PMO).
- All changes attested, reversible, and within pre-authorized change-control bounds (Security + QA/RA).
- Governing level: min(D1,D4,D6) ≥ L5.
Primary risks + mitigations. Self-optimization drifts outside intended behavior → PCCP envelope as hard boundary; eval-gated promotion; rollback. Cost optimization degrades quality → quality is the gate, cost is the objective subject to the gate. Change control too slow → predetermined protocol pre-authorizes the space of changes.
Value unlocked. A continuously improving, cost-optimal, self-hosted agentic SDLC where quality is provably non-decreasing and change is governed — the north star.
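The promotion rule for Phase E can be written as an ordered set of checks: envelope first, gate second, comparison with the incumbent last. This is a sketch under assumptions; `EvalResult`, its fields, and the single aggregate `harness_score` are hypothetical simplifications of whatever the eval-driven promotion policy ends up specifying.

```python
from dataclasses import dataclass

GATE_THRESHOLD = 0.999  # placeholder threshold from this roadmap


@dataclass(frozen=True)
class EvalResult:
    version: str
    harness_score: float     # aggregate deterministic-harness score, higher is better
    gate_correctness: float  # measured release-gate correctness
    within_pccp: bool        # change lies inside the pre-authorized envelope


def should_promote(candidate: EvalResult, incumbent: EvalResult) -> tuple[bool, str]:
    """Quality is the gate; a candidate must clear every check to replace the incumbent."""
    if not candidate.within_pccp:
        return False, "outside PCCP envelope"
    if candidate.gate_correctness < GATE_THRESHOLD:
        return False, "gate correctness below threshold"
    if candidate.harness_score <= incumbent.harness_score:
        return False, "does not beat incumbent"
    return True, "promote"


incumbent = EvalResult("lora-v7", harness_score=0.912, gate_correctness=0.9993, within_pccp=True)
candidate = EvalResult("lora-v8", harness_score=0.921, gate_correctness=0.9995, within_pccp=True)
print(should_promote(candidate, incumbent))  # (True, 'promote')
```

Cost does not appear in the function: consistent with the risk note above, cost is optimized only among candidates that already pass these checks.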
3. Pilot Strategy
Principle: prove on the safe edge, then graduate inward.
Team / repo selection (in priority order):
- IEC 62304 Class A software first — non-safety internal tools, build tooling, test utilities, internal web apps. No patient-impact path.
- High test coverage + mature CI (the harness needs something to gate against).
- Volunteer teams with engaged tech leads (cultural readiness over raw size).
- Repos with clean, current specifications or willingness to write them.
- Explicitly excluded from early pilots: any Class B/C, regulated firmware, anything in a device's safety path.
Success criteria for a pilot.
| Metric | Target (placeholder) |
|---|---|
| Deterministic eval gate reproducibility | 100% seed-stable across reruns |
| Defect-escape rate vs. baseline | ≤ baseline (no regression) |
| Cost-per-green-PR | Measured + trending down |
| Reviewer trust (survey) | ≥ 70% "would expand scope" |
| Rollback events causing incident | 0 |
Blast-radius containment. Sandboxed execution (gVisor/Kata); least-privilege tool scopes; feature-flagged rollout; no production/clinical data; no write access to release branches without HITL; per-pilot kill switch (revoke agent identity via SPIFFE/SPIRE); model versions pinned and signed.
Graduation path. Pilot → cohort (3–5 teams, same safety class) → broader Class A → cautious Class B only after the corresponding ASMM-Med level + eval evidence exist → Class C only with validated agents (L4) and always dual human control. Every graduation is a documented decision checkpoint (§8) with tri-signature. Learnings (harness components, specs, eval datasets, runbooks) are promoted to shared org assets owned by the CoE — the harness is the product (P5).
4. Organization & Operating Model Evolution
New / changed roles.
| Role | Mandate | Introduced |
|---|---|---|
| Harness Engineer | Builds/owns the deterministic eval harness, golden datasets, CI gates. Treats harness as a product. | L2 |
| Eval Owner | Owns validation evidence, eval coverage, the 99.9% system property, promotion eval gates. | L1 (named) → L2 (active) |
| Model Steward | Owns model fleet lifecycle, fine-tunes, versioning, signing, registry, reproducibility. | L1 |
| Agent Steward | Owns a specific agent workflow: scope, policy, sandbox, HITL design, assurance case. | L3 |
| AI Governance Board | Tri-functional (Eng + QA/RA + Security) authority over policy, promotions, gate reviews, stop/rollback. | L1 |
| QA/RA Integration | Embeds regulatory/quality into the AI lifecycle: CSA validation, 62304 traceability, design controls. | Throughout, deepening L2→L4 |
| Security Integration | Zero-trust agent identity, supply-chain, sandboxing, attestation, IEC 62443. | Throughout |
RACI for promotion decisions (level N → N+1).
| Activity | Eng Lead | Eval Owner | QA/RA | Security | Gov. Board | PMO |
|---|---|---|---|---|---|---|
| Produce eval/validation evidence | C | R | C | C | I | I |
| Verify regulatory traceability | I | C | R | C | I | I |
| Verify security posture | I | I | C | R | I | I |
| Promotion decision (tri-sign) | A | C | A | A | R | C |
| Stop / rollback trigger | A | C | A | A | R | I |
| Resource / schedule | C | I | I | I | C | R/A |
(R=Responsible, A=Accountable, C=Consulted, I=Informed. Promotion requires the three A signatures: Eng + QA/RA + Security.)
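A gate-review tool could enforce the tri-signature rule before a promotion is recorded. The sketch below is a hypothetical helper, not an existing system: the signer labels and the `promotion_blockers` function are assumptions made for illustration.

```python
from typing import Optional

# The three Accountable functions from the RACI table (labels are assumed).
REQUIRED_SIGNERS = frozenset({"engineering", "qa_ra", "security"})


def promotion_blockers(
    signatures: set[str],
    exit_criteria_met: bool,
    rollback_plan: Optional[str],
) -> list[str]:
    """List everything that still blocks a level N -> N+1 promotion."""
    blockers = []
    missing = sorted(REQUIRED_SIGNERS - signatures)
    if missing:
        blockers.append("missing signatures: " + ", ".join(missing))
    if not exit_criteria_met:
        blockers.append("exit criteria not met")
    if not rollback_plan:
        blockers.append("no reversibility plan recorded")  # R3: no one-way doors
    return blockers


print(promotion_blockers({"engineering", "security"}, True, "pin to prior signed model version"))
# ['missing signatures: qa_ra']
```

Returning the full list of blockers, instead of a single boolean, gives the Governance Board a record of why a promotion was held.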
CoE vs. embedded. Start with a Center of Excellence (CoE) at L1–L2: a small central team owns the harness, serving, policy, and standards. Transition to an embedded model at L3+: the CoE retains shared assets, standards, and the Governance Board, while Harness Engineers, Agent Stewards, and Model Stewards embed in product teams. By L5, the CoE is a thin standards-and-platform org; capability lives in teams.
Scaling enablement to 1000+ devs. Train-the-trainer cohorts; AGENTS.md/specs/ as self-serve standards; golden-path templates; internal certification for HITL reviewers and Agent Stewards; office hours + internal community; documentation as code.
Approval-fatigue & no-blame controls. Risk-proportional HITL (only consequential actions gated); batch low-risk approvals with audit sampling; clear escalation paths; rotation of reviewers; no-blame culture — logging is for quality/regulatory evidence, never individual performance; psychological safety to halt or roll back without penalty; "review every shipped line" framed as engineering craft, not blame.
5. Investment & Resourcing per Phase
Illustrative qualitative ranges (planning placeholders; cost mechanics per 08).
| Phase | GPU capacity | Platform/MLOps headcount | Fine-tuning effort | Eval engineering | Training/enablement |
|---|---|---|---|---|---|
| L0→L1 | Small (serving S/M, inference only) | 3–6 | None | Manual/baseline | Org-wide literacy (high reach, low depth) |
| L1→L2 | Small–Med (+ LoRA fine-tune jobs) | 6–10 | Moderate (task LoRAs) | Heavy (harness is the product) | Harness Engineer cohort; SDD training |
| L2→L3 | Med–Large (fleet S/M/L/V/E, routing) | 10–18 | Moderate–High | High (trajectory/tool eval) | Agent Steward + HITL reviewer training |
| L3→L4 | Large (validated serving + monitoring) | 15–25 | High (validated tunes) | Very high (99.9% assurance + CSA) | QA/RA + Security deep embed |
| L4→L5 | Large, cost-optimized (auto-rightsized) | 12–20 (efficiency gains) | Continuous (closed-loop) | Continuous (eval-driven promotion) | Steady-state continuous enablement |
Cost framing (ties to 08). GPU/token cost is a first-class constraint (P6). Early phases over-provision for trust; from L4→L5, FinOps + cost-optimal routing drive cost-per-green-PR down while quality (the gate) is held constant. Eval engineering is the largest sustained investment — the harness is the product, and validation evidence is the moat. Headcount shifts from central platform build (L1–L2) toward embedded stewardship + efficiency (L4–L5).
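This section does not define cost-per-green-PR; the authoritative definition belongs to 08. As a working assumption for illustration, the sketch below treats it as total AI spend in a window divided by the number of PRs that merged after passing every gate. The cost categories and the function name are hypothetical.

```python
def cost_per_green_pr(gpu_cost: float, token_cost: float, eval_cost: float, green_prs: int) -> float:
    """Assumed definition: all AI-related spend in a window / PRs merged green in that window."""
    if green_prs <= 0:
        raise ValueError("no green PRs in the window; the metric is undefined")
    return (gpu_cost + token_cost + eval_cost) / green_prs


# Illustrative figures only, not program data.
print(cost_per_green_pr(gpu_cost=1200.0, token_cost=300.0, eval_cost=500.0, green_prs=400))  # 5.0
```

Because eval cost sits in the numerator, a heavier harness raises the metric in the short term; the denominator counts only gated merges, so lowering quality cannot improve it.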
6. Consolidated Milestone & KPI Table per Phase
Capability KPIs and assurance/cost KPIs (drawn from 02 §8). Targets are placeholders.
| Phase | Capability KPIs | Assurance KPIs | Cost KPIs | Key milestone |
|---|---|---|---|---|
| L0→L1 | % AI traffic on self-hosted endpoints (→100%); AI literacy completion | 100% prompt/response logged & attributable (Part 11) | GPU baseline $/inference established | Shadow AI killed; serving + logging live |
| L1→L2 | % pilot repos with SDD + AGENTS.md/specs/; automated change throughput | Deterministic eval gate reproducibility (→100%); eval coverage % | Cost-per-green-PR baseline | Deterministic harness gates CI |
| L2→L3 | # sandboxed agent workflows; multi-step task completion rate | Trajectory/tool-use eval coverage; HITL audit-trail completeness 100% | $/agent-run; routing cost efficiency | MCP plane + policy server + HITL live |
| L3→L4 | # validated autonomous workflows; autonomous PR throughput (by safety class) | ≥99.9% gate correctness (system property); 100% IEC 62304 traceability | Cost-per-green-PR optimized vs. L3 | CSA-validated agents in QMS; A2A |
| L4→L5 | Closed-loop promotion frequency; fleet self-optimization rate | Eval-driven promotion pass-rate ≥99.9%; PCCP-conformant changes 100% | Cost-per-green-PR at target; GPU utilization | Self-optimizing, PCCP-governed steady state |
7. Program Risk Register
| ID | Risk | Likelihood | Impact | Mitigation | Owner |
|---|---|---|---|---|---|
| PR-1 | Capability outruns assurance (autonomy enabled before validation) | Med | Critical | min(D1,D4,D6) clamp; no promotion without tri-sign + eval evidence; "don't grant autonomy you can't validate" | AI Governance Board |
| PR-2 | Regulatory non-acceptance of agent validation / PCCP approach | Med | High | Early FDA/CSA engagement; conservative assurance cases; CSA + IEC 62304 grounding; pilot scope | QA/RA |
| PR-3 | Cost overrun (GPU/token) | Med | High | FinOps from L1; cost-per-green-PR KPI; cost-optimal routing; rightsizing; tier S/M/L/V/E discipline | PMO + Eng (FinOps) |
| PR-4 | Talent gap (Harness/Agent/Model Stewards, eval engineers) | High | High | Train-the-trainer; CoE seeding; certification; phased role introduction; embedded model | D8 lead / People |
| PR-5 | Shadow AI persists | Med | High | Egress blocking; amnesty; monitoring; leader modeling; no-blame culture; make sanctioned path better | Security |
| PR-6 | Model supply-chain compromise | Low | Critical | Self-hosted open-weight only; Sigstore/cosign + SLSA + SBOM; signed LoRAs; Vault-brokered creds; attestation | Security + Model Steward |
| PR-7 | Change-management resistance | Med | Med | No-blame culture; value-per-phase wins; reviewer rotation; approval-fatigue controls; transparent comms | Eng leadership |
| PR-8 | Eval non-determinism / flakiness | Med | High | Hermetic envs; strict seeding; flaky-case quarantine; harness-as-product investment | Harness Engineer / Eval Owner |
| PR-9 | Drift erodes validated state (post-L4) | Med | High | Continuous gate-correctness monitoring; auto-rollback to pinned version; PCCP envelope | Eval Owner |
8. Decision Checkpoints & Governance Cadence
| Cadence | Forum | Purpose |
|---|---|---|
| Quarterly | ASMM-Med Assessment | Score all 8 dimensions; recompute governing level = min(D1,D4,D6); revise roadmap/thresholds |
| Per transition | Gate Review (tri-sign) | Verify exit criteria; Eng + QA/RA + Security promotion sign-off; record reversibility plan |
| Per pilot graduation | Checkpoint | Approve scope expansion / next cohort with evidence |
| Continuous | Monitoring + alarms | 99.9% gate correctness, drift, cost, security posture |
Stop / rollback triggers (any one triggers halt + Governance Board review):
- Release-gate correctness drops below the level's threshold (e.g., <99.9% at L4).
- Any agent action outside policy/scope, or sandbox escape.
- Loss of attributable audit trail / Part 11 integrity.
- Cost-per-green-PR breaches FinOps ceiling without quality justification.
- Regulatory or QA/RA finding against a deployed capability.
- Model supply-chain or attestation failure.
Rollback mechanics (always available, R3): feature-flag disable; pin to prior signed model/LoRA version; revoke agent identity (SPIFFE/SPIRE); reduce autonomy scope one ASMM-Med level; revert to HITL or dual-human control. No one-way doors.
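The stop/rollback triggers above can be evaluated against a monitoring snapshot so that any single trip halts the workflow. The sketch below assumes a flat snapshot structure; the `Snapshot` fields and the `tripped_triggers` function are hypothetical, and the threshold and cost ceiling are supplied by the caller as placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    gate_correctness: float
    out_of_scope_actions: int
    sandbox_escapes: int
    audit_trail_intact: bool
    cost_per_green_pr: float
    cost_breach_justified: bool
    open_regulatory_findings: int
    attestation_ok: bool


def tripped_triggers(s: Snapshot, gate_threshold: float, cost_ceiling: float) -> list[str]:
    """Return every stop/rollback trigger that fired; a non-empty list means halt."""
    checks = [
        ("gate correctness below threshold", s.gate_correctness < gate_threshold),
        ("out-of-scope action or sandbox escape", s.out_of_scope_actions > 0 or s.sandbox_escapes > 0),
        ("audit trail integrity lost", not s.audit_trail_intact),
        ("cost ceiling breached without justification",
         s.cost_per_green_pr > cost_ceiling and not s.cost_breach_justified),
        ("open regulatory or QA/RA finding", s.open_regulatory_findings > 0),
        ("supply-chain or attestation failure", not s.attestation_ok),
    ]
    return [name for name, fired in checks if fired]


healthy = Snapshot(0.9995, 0, 0, True, 4.2, False, 0, True)
print(tripped_triggers(healthy, gate_threshold=0.999, cost_ceiling=6.0))  # []
```

The function only detects; the response (feature-flag disable, version pin, identity revocation, scope reduction) stays with the rollback mechanics listed above and the Governance Board review.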
9. "Start Monday" Quick Wins
Regulated adaptation of the source-paper spirit — structure, not vibes.
For individual developers:
- Add an AGENTS.md to your repo: conventions, build/test commands, guardrails, what agents may and may not do.
- Create a specs/ directory; write the spec (the design input) before the code.
- Write tests and evals before code. The eval is the contract.
- Review every shipped line — AI-authored or not. Authorship is yours; craft is intent + validation.
- Route all AI use through sanctioned self-hosted endpoints. Kill your shadow AI today.
For engineering leaders:
- Stand up (or adopt) the deterministic eval harness in CI for one repo this week.
- Treat the harness, golden datasets, specs, and runbooks as shared assets, not local hacks.
- Pick a Class A pilot repo with good coverage and a willing team.
- Model no-blame behavior: reward halting and rollback, not heroics.
For the organization:
- Charter the AI Governance Board (Eng + QA/RA + Security).
- Name the first Eval Owner and Model Steward.
- Publish the AI Use Policy and the shadow-AI cutover plan.
- Begin org-wide AI literacy with the thesis up front: structure scales, vibes don't.
10. North-Star Vision Recap
The craft is changing, not disappearing. Intent and validation are the new engineering craft: a developer's value moves from typing implementation to specifying intent precisely and proving correctness rigorously. The harness is the product (P5); the eval is the contract; the 99.9% release gate is a system property, not a hope (P1).
Structure scales; vibes don't. Spec-driven development, deterministic evaluation wrapping probabilistic generation (P2), risk-proportional autonomy (P3), and Part 11-grade evidence (P4) are what let a 1000+ engineer regulated organization adopt agentic SDLC without trading away the audit trail, the safety case, or patient trust.
AI here is an amplifier of engineering and quality culture — never a substitute for it. Applied to a mature, evidence-first, no-blame culture, it compounds quality and throughput. Applied to a weak one, it compounds risk. This roadmap's discipline — assurance-gated, reversible, governed by min(D1,D4,D6), pilot-before-scale — is precisely how we ensure it amplifies the right thing. Don't grant autonomy you can't yet validate. Earn each level. Then the structure carries you to the next.