
09 — Adoption Roadmap

Program: Agentic-Native SDLC for Regulated Medical Device Engineering

Audience: Executive sponsors, PMO, Engineering / QA-RA / Security leadership, AI Governance Board

Status: Planning baseline (May 2026). All dates expressed as relative quarters from program start (Q1 = first full quarter after charter approval). All thresholds, ranges, and timelines are planning placeholders subject to gate-review revision.

Cross-references: 01-requirements.md, 02-maturity-model.md, 03-reference-architecture.md, 04-model-strategy-and-finetuning.md, 05-evaluation-and-validation.md, 06-agentic-workflows.md, 07-security-and-compliance.md, 08-token-and-gpu-economics.md.


1. Roadmap Philosophy

This roadmap is assurance-gated, not capability-gated. We do not deploy the most capable agent we can build; we deploy the most capable agent we can validate, secure, and govern, and no further. The maturity model (ASMM-Med, see 02) is the spine; this document is the sequencing, resourcing, and decision layer that moves the organization from L0 to L5 without ever letting autonomy outrun control.

Five non-negotiable design rules:

| # | Rule | Consequence for sequencing |
| --- | --- | --- |
| R1 | Governing level = min(D1, D4, D6) | Capability dimensions (D2, D3, D5) may sprint ahead in build, but the enabled autonomy level is clamped by the weakest of Governance (D1), Eval/Assurance (D4), Security (D6). |
| R2 | Don't grant autonomy you can't yet validate (P3 + P1) | Each phase ships its eval/validation evidence before the corresponding agent privilege is unlocked in production. The harness leads; the agent follows. |
| R3 | Reversible by construction | Every promotion has a defined rollback (feature-flag, model-version pin, privilege revocation, scope reduction). No one-way doors. |
| R4 | Pilot before scale | Capabilities are proven on low-safety-class, low-blast-radius repos before any Class B/C or fleet-wide exposure. |
| R5 | Value per phase | Every transition must unlock a defensible, measurable engineering or quality value (not just platform plumbing), or the phase is reconsidered. |

These rules operationalize the Seven Principles (02): P1 (99.9% release-gate correctness as a system property), P2 (determinism wraps probabilism), P3 (risk-proportional autonomy), P4 (Part 11 evidence), P5 (the harness is the product), P6 (cost-per-green-PR), P7 (self-hosted reproducibility). Promotion to any level requires a tri-signature: Engineering + QA/RA + Security. Safety-class gating per IEC 62304 is absolute: Class C software changes are always under dual human control, regardless of maturity level.
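
Rule R1 and the tri-signature requirement can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ASMM-Med scoring tool: the function names, the dictionary layout, and the signer labels are hypothetical, and only the min(D1, D4, D6) clamp and the three required signatures come from the text above.

```python
# Hypothetical names for illustration; only the min(D1, D4, D6) clamp and the
# Engineering + QA/RA + Security tri-signature come from the roadmap text.
GATING_DIMENSIONS = ("D1", "D4", "D6")  # Governance, Eval/Assurance, Security
REQUIRED_SIGNERS = {"Engineering", "QA/RA", "Security"}


def governing_level(scores: dict[str, int]) -> int:
    """R1: the governing level is the weakest of D1, D4, and D6."""
    return min(scores[dim] for dim in GATING_DIMENSIONS)


def may_promote(scores: dict[str, int], target_level: int, signatures: set[str]) -> bool:
    """Promotion needs the clamp to reach the target level and all three signatures."""
    return governing_level(scores) >= target_level and REQUIRED_SIGNERS <= signatures


if __name__ == "__main__":
    # Capability dimensions (D2, D3, D5) are ahead, but D4 holds the level at 1.
    scores = {"D1": 2, "D2": 4, "D3": 3, "D4": 1, "D5": 3, "D6": 2, "D7": 2, "D8": 2}
    print(governing_level(scores))
    print(may_promote(scores, 2, {"Engineering", "QA/RA", "Security"}))
```

In this example the build-side dimensions score well, yet the enabled level stays at 1 because Eval/Assurance (D4) is the weakest gating dimension.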


2. Phase Plan

The program is five sequenced transitions. Build work for level N+1 may begin while operating at level N, but the new level is declared only after its exit gate passes. Illustrative trajectory (from 02): Q1–Q2 → L1; Q3–Q4 → L2; Q5–Q7 → L3; Q8–Q11 → L4; Q12+ → L5.

```mermaid
gantt
    title Agentic-Native SDLC — ASMM-Med Adoption Roadmap (relative quarters)
    dateFormat  YYYY-MM-DD
    axisFormat  Q%q

    section L0→L1 Governed Assistance
    Kill shadow AI / policy baseline        :a1, 2026-01-01, 60d
    Self-hosted serving + central logging   :a2, 2026-01-15, 75d
    L1 exit gate (Eng+QA/RA+Sec)            :milestone, m1, 2026-03-31, 0d

    section L1→L2 Spec-Driven Bounded Automation
    SDD + AGENTS.md + specs/ adoption        :b1, 2026-04-01, 80d
    Deterministic eval harness in CI         :b2, 2026-04-15, 90d
    L2 exit gate                             :milestone, m2, 2026-09-30, 0d

    section L2→L3 Orchestrated Agentic Workflows
    Sandboxed multi-step agents + MCP plane  :c1, 2026-10-01, 110d
    Model fleet + routing + policy server    :c2, 2026-10-15, 120d
    HITL workflow rollout                    :c3, 2026-12-01, 90d
    L3 exit gate                             :milestone, m3, 2027-06-30, 0d

    section L3→L4 Validated Autonomous Agents
    CSA agent validation + 62304 traceability:d1, 2027-07-01, 150d
    99.9% gate hardening + A2A               :d2, 2027-09-01, 160d
    L4 exit gate                             :milestone, m4, 2028-06-30, 0d

    section L4→L5 Self-Optimizing Enterprise
    Closed-loop fine-tuning + eval promotion :e1, 2028-07-01, 150d
    Cost-optimal routing + PCCP change ctrl  :e2, 2028-09-01, 160d
    L5 steady state                          :milestone, m5, 2029-03-31, 0d
```

2.1 Phase A — L0 → L1: Governed Assistance (≈ Q1–Q2)

Objective. Eliminate ungoverned ("shadow") AI use; stand up self-hosted serving with complete prompt/response logging so that every AI interaction is observable, attributable, and policy-bound. This phase buys legitimacy, not autonomy.

Workstreams by dimension.

| Dim | Workstream |
| --- | --- |
| D1 | AI Use Policy ratified; AI Governance Board chartered; ISO/IEC 42001 AIMS scaffolding initiated; shadow-AI amnesty + cutover. |
| D2 | Stand up vLLM/Triton on K8s + GPU Operator; serve baseline open-weight S/M models; KServe endpoints; no fine-tuning yet. |
| D3 | Inventory knowledge sources; begin curated repo/context indexing; no agent retrieval yet. |
| D4 | Define eval taxonomy and baseline manual review checklist; no automated harness yet. |
| D5 | IDE assist only (completion/chat); no tools, no agency. |
| D6 | Egress controls to block external LLM SaaS; Vault for secrets; SPIFFE/SPIRE identity bootstrap; Istio mTLS baseline. |
| D7 | OpenTelemetry tracing of all inference; central prompt/response log (Part 11-style retention); first FinOps GPU dashboard. |
| D8 | Org-wide AI literacy training; name first Model Steward and Eval Owner; communicate the "structure scales, vibes don't" thesis. |

Deliverables / artifacts. AI Use Policy v1; serving platform runbook; inference audit log schema (Part 11-aligned); shadow-AI decommission report; GPU baseline capacity plan; Governance Board charter + RACI.

Exit criteria (who signs off).

  • 100% of sanctioned AI traffic routed through self-hosted endpoints; zero known external LLM SaaS egress (Security verified).
  • Complete, immutable, attributable logging of all prompts/responses (QA/RA verified for Part 11 retention).
  • Governance Board operational with documented decision cadence (Engineering + QA/RA + Security tri-sign).
  • Governing level confirmed: min(D1,D4,D6) ≥ L1.
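
One way to make the "complete, immutable, attributable" logging criterion testable is a hash-chained log, where each record commits to the one before it. The sketch below is an assumption-laden illustration: the record fields and function names are hypothetical (the inference audit log schema is a Phase A deliverable, not defined here), and hash chaining provides tamper evidence only; it is not by itself Part 11 compliance.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first record (assumption)


def _digest(body: dict) -> str:
    """Stable SHA-256 over the record body (keys sorted for determinism)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_record(log: list, user: str, model: str, prompt: str,
                  response: str, timestamp: str) -> dict:
    """Append an attributable record that commits to the previous record's hash."""
    body = {
        "user": user, "model": model, "prompt": prompt,
        "response": response, "timestamp": timestamp,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    record = {**body, "hash": _digest(body)}
    log.append(record)
    return record


def verify_chain(log: list) -> bool:
    """Return False if any record was edited, removed, or reordered."""
    prev_hash = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if body["prev_hash"] != prev_hash or record["hash"] != _digest(body):
            return False
        prev_hash = record["hash"]
    return True


if __name__ == "__main__":
    log = []
    append_record(log, "dev-1", "model-s", "prompt text", "response text", "2026-01-01T00:00:00Z")
    print(verify_chain(log))
```

Editing any stored prompt or response after the fact changes that record's digest, so `verify_chain` fails from that point onward.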

Primary risks + mitigations. Shadow AI persists covertly → egress blocking + amnesty + monitoring + leader modeling. Serving instability erodes trust → conservative SLOs, gradual cutover, fallback. Logging seen as surveillance → no-blame framing, transparency on purpose (quality/regulatory, not individual performance).

Value unlocked. A defensible, auditable AI footprint; the substrate (serving + logging + identity) on which everything else is validated.


2.2 Phase B — L1 → L2: Spec-Driven Bounded Automation (≈ Q3–Q4)

Objective. Make specification-driven development (SDD) and a deterministic eval harness in CI the default. Move from "AI helps me type" to "AI executes against a spec, and a deterministic gate judges the result." This is where the harness becomes the product (P5).

Workstreams by dimension.

| Dim | Workstream |
| --- | --- |
| D1 | Map SDD artifacts to design controls (DHF inputs); QMSR/ISO 13485 (Feb 2026) alignment of AI-assisted change records. |
| D2 | MLflow registry; first task-specific LoRA fine-tunes (S/M tier) on curated internal code/spec corpora; reproducible build pipeline. |
| D3 | AGENTS.md repo conventions; specs/ directory standard; curated retrieval over approved knowledge; context provenance. |
| D4 | Deterministic eval harness in CI (Argo / pipeline-triggered); golden datasets; pass/fail gates; seedable, reproducible scoring (P2, P7). |
| D5 | Bounded single-step automation: scaffold, test-gen, doc-gen; no multi-step autonomy, no tool plane yet. |
| D6 | Sigstore/cosign signing of model + harness artifacts; SLSA provenance + SBOM for the AI toolchain. |
| D7 | Cost-per-green-PR (P6) instrumented; eval run cost tracking; token budgets per pipeline. |
| D8 | First Harness Engineers chartered; "tests/evals before code" engineering norm; review-every-line discipline. |

Deliverables / artifacts. AGENTS.md + specs/ standard; deterministic eval harness (versioned, signed); golden eval datasets; LoRA fine-tune cards; SBOM + SLSA attestations; cost-per-green-PR baseline.

Exit criteria (who signs off).

  • Deterministic eval harness gates CI on pilot repos with reproducible (seed-stable) results across runs (Engineering + Eval Owner).
  • SDD artifacts traceable into design-control records (QA/RA).
  • All models/harness artifacts signed with provenance (Security).
  • Governing level: min(D1,D4,D6) ≥ L2.
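
The "seed-stable across runs" criterion can be checked mechanically: rerun the same eval with the same seed and require identical results. The following is a minimal sketch, assuming pre-recorded (output, expected) pairs stand in for real agent outputs; `run_eval` and `is_seed_stable` are hypothetical names, not part of any harness described in this program.

```python
import hashlib
import random


def run_eval(cases: list, seed: int, sample_size: int) -> dict:
    """Score a seeded sample of (output, expected) pairs.

    A local random.Random(seed) is used so no global RNG state leaks in,
    which is what makes reruns comparable.
    """
    rng = random.Random(seed)
    sampled = rng.sample(cases, k=min(sample_size, len(cases)))
    passed = sum(1 for output, expected in sampled if output == expected)
    # Fingerprint of the exact sampled cases, so reruns can be compared.
    fingerprint = hashlib.sha256(repr(sampled).encode("utf-8")).hexdigest()
    return {"passed": passed, "total": len(sampled), "fingerprint": fingerprint}


def is_seed_stable(cases: list, seed: int, sample_size: int, reruns: int = 3) -> bool:
    """True if every rerun with the same seed yields an identical result."""
    results = [run_eval(cases, seed, sample_size) for _ in range(reruns)]
    return all(result == results[0] for result in results)


if __name__ == "__main__":
    cases = [("a", "a"), ("b", "x"), ("c", "c"), ("d", "d")]
    print(run_eval(cases, seed=7, sample_size=4))
    print(is_seed_stable(cases, seed=7, sample_size=3))
```

A real harness would also need hermetic environments and pinned model versions, as the risk mitigations below note; seeding alone does not remove non-determinism in model serving.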

Primary risks + mitigations. Eval flakiness/non-determinism → strict seeding, hermetic environments, quarantine of flaky cases. Spec quality varies → spec templates, peer review, "spec-as-design-input" training. Fine-tune overfit → held-out eval sets, eval-driven acceptance only.

Value unlocked. Trustworthy automated quality gates; measurable cost-per-green-PR; the first hard evidence that AI output can be validated before merge.


2.3 Phase C — L2 → L3: Orchestrated Agentic Workflows (≈ Q5–Q7)

Objective. Introduce sandboxed multi-step agents with a governed MCP tool plane, a model fleet with routing, a policy server, and human-in-the-loop (HITL) controls. Agents now plan and act across steps — inside sandboxes, under policy, with a human approving consequential actions.

Workstreams by dimension.

| Dim | Workstream |
| --- | --- |
| D1 | Policy-as-code (OPA/Gatekeeper + policy server) encodes who/what/where agents may act; risk-proportional autonomy matrix by safety class. |
| D2 | Model fleet tiers S/M/L/V/E operational; Ray/Kueue scheduling; multi-LoRA serving; routing by task/cost/quality. |
| D3 | Agent-grade retrieval + MCP-exposed knowledge resources; context windows scoped per task and per safety class. |
| D4 | Eval extended to trajectory and tool-use evaluation; HITL decision logging feeds eval; assurance cases per workflow. |
| D5 | Argo Workflows orchestration; MCP tool plane; gVisor/Kata sandboxing; Agent Stewards own each workflow; HITL checkpoints. |
| D6 | Zero-trust per agent identity (SPIFFE/SPIRE); least-privilege tool scopes; IEC 62443 alignment; egress-controlled sandboxes. |
| D7 | KEDA autoscaling on agent load; per-workflow FinOps; trajectory observability; token/GPU attribution per agent run. |
| D8 | Agent Steward + Harness Engineer roles scaled; HITL reviewer training; approval-fatigue controls designed (see §4). |

Deliverables / artifacts. MCP tool registry + scopes; policy server rulesets; agent sandbox runbook; per-workflow assurance case; routing policy; HITL approval logs; agent identity inventory.

Exit criteria (who signs off).

  • Multi-step agents run only in sandboxes with enforced least-privilege tool scopes (Security).
  • Policy server denies out-of-scope actions by default; all consequential actions have HITL approval with audit trail (QA/RA + Security).
  • Trajectory-level eval coverage meets threshold on pilot workflows (Eval Owner).
  • Governing level: min(D1,D4,D6) ≥ L3. Class C remains dual-human.
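
The exit criteria above describe a default-deny decision with human gating on consequential actions. A minimal Python sketch of that decision logic follows. It is illustrative only: the production control is described as policy-as-code on OPA/Gatekeeper plus a policy server, and the tool names, `ToolGrant` type, and `decide` function here are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    DENY = "deny"
    ALLOW = "allow"
    NEEDS_HUMAN = "needs_human_approval"
    NEEDS_DUAL_HUMAN = "needs_dual_human_approval"


@dataclass(frozen=True)
class ToolGrant:
    """One least-privilege grant for one agent identity (hypothetical shape)."""
    tool: str
    consequential: bool  # True if the action changes state that matters


def decide(grants: dict, tool: str, safety_class: str) -> Decision:
    """Default-deny: anything without an explicit grant is refused."""
    grant = grants.get(tool)
    if grant is None:
        return Decision.DENY
    if grant.consequential and safety_class == "C":
        # Class C changes stay under dual human control at every level.
        return Decision.NEEDS_DUAL_HUMAN
    if grant.consequential:
        return Decision.NEEDS_HUMAN
    return Decision.ALLOW


if __name__ == "__main__":
    grants = {
        "repo.read": ToolGrant("repo.read", consequential=False),
        "repo.open_pr": ToolGrant("repo.open_pr", consequential=True),
    }
    for tool in ("repo.read", "repo.open_pr", "shell.exec"):
        print(tool, decide(grants, tool, safety_class="A").value)
```

The ordering matters: the missing-grant check comes first, so an unknown tool is denied before safety class or consequence is even considered.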

Primary risks + mitigations. Agent escapes sandbox / scope creep → default-deny policy, runtime sandbox, continuous policy tests. Tool-plane supply-chain risk → signed MCP servers, scoped credentials, Vault brokering. HITL becomes rubber-stamp → batched-but-meaningful approvals, sampling audits, no-blame escalation.

Value unlocked. Real end-to-end task automation (multi-file changes, investigation, refactors) with human consequence-gating — the first order-of-magnitude productivity step, safely bounded.


2.4 Phase D — L3 → L4: Validated Autonomous Agents (≈ Q8–Q11)

Objective. Validate agents under Computer Software Assurance (CSA) as part of the QMS so that defined agent workflows can act autonomously (within safety class) at ≥99.9% release-gate correctness, with full IEC 62304 traceability and agent-to-agent (A2A) coordination. This is the regulated leap: agents become validated tools.

Workstreams by dimension.

| Dim | Workstream |
| --- | --- |
| D1 | CSA validation packages per agent; ISO 14971 risk analysis for agent failure modes; QMSR/13485 integration; §524B + cybersecurity documentation. |
| D2 | Locked, signed model+LoRA versions per validated workflow; reproducible serving; change control on model versions. |
| D3 | Validated knowledge sources; controlled context; provenance required for any retrieval feeding a Class B/C change. |
| D4 | ≥99.9% system-property gate demonstrated and continuously monitored; deterministic eval as validation evidence; assurance cases signed. |
| D5 | A2A coordination among validated agents; autonomy scoped strictly by safety class; Class C always dual human control. |
| D6 | Full IEC 62443 posture; cryptographic attestation of every agent action; tamper-evident audit. |
| D7 | Continuous gate-correctness monitoring; cost-per-green-PR optimized; drift + regression alarms. |
| D8 | QA/RA + Security embedded in agent lifecycle; Eval Owner owns validation evidence; operating model matured (§4). |

Deliverables / artifacts. Per-agent CSA validation report; IEC 62304 traceability matrix (requirement → design → agent action → test/eval → evidence); ISO 14971 agent FMEA; 99.9% gate-correctness monitoring dashboard; A2A protocol spec; signed assurance cases.

Exit criteria (who signs off).

  • Validated agents demonstrate ≥99.9% release-gate correctness as a sustained system property (Eval Owner + QA/RA).
  • End-to-end IEC 62304 traceability for every autonomous action (QA/RA).
  • CSA validation accepted into the QMS; reversibility + version pinning enforced (Engineering + QA/RA + Security).
  • Class C dual-human control verified intact (Security + QA/RA).
  • Governing level: min(D1,D4,D6) ≥ L4.
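
Demonstrating ≥99.9% as a sustained property takes a large evidence base, and a short calculation shows roughly how large. The sketch below uses a standard zero-failure binomial bound: if n independent gate decisions are all correct, the claim "correctness ≥ target" is supported at a given confidence once target^n ≤ 1 − confidence. This is one common statistical framing, offered as an assumption; the roadmap does not specify how the 99.9% property is measured, and the function name is hypothetical.

```python
import math


def min_clean_decisions(target: float = 0.999, confidence: float = 0.95) -> int:
    """Smallest n with target**n <= 1 - confidence.

    Interpretation (assumes independent decisions and zero observed errors):
    after n consecutive correct gate decisions, a true correctness below
    `target` would have produced at least one error with probability
    >= `confidence`.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(target))


if __name__ == "__main__":
    print(min_clean_decisions(0.999, 0.95))  # 2995
    print(min_clean_decisions(0.999, 0.99))  # 4603
```

So on the order of three thousand error-free gate decisions are needed before a 99.9% claim is supportable at 95% confidence under these assumptions, and any observed error raises the requirement. This is why the evidence has to accumulate at L3 before the L4 gate review rather than during it.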

Primary risks + mitigations. Regulator non-acceptance of agent validation approach → early FDA/CSA engagement, conservative assurance cases, pilot scope. 99.9% not met → no promotion; remain L3; harden harness. Drift erodes validated state → continuous monitoring + automatic rollback to pinned version.

Value unlocked. Bounded autonomous engineering for lower-risk classes with regulatory-grade evidence — sustained throughput gains without sacrificing the audit trail.


2.5 Phase E — L4 → L5: Self-Optimizing Agentic Enterprise (≈ Q12+)

Objective. Close the loop: eval-driven, cost-optimal fine-tuning and promotion under change control modeled on a Predetermined Change Control Plan (PCCP). The system improves itself within pre-authorized bounds, with every change gated by the deterministic harness and governed change control.

Workstreams by dimension.

| Dim | Workstream |
| --- | --- |
| D1 | FDA AI/PCCP-style predetermined change-control protocol authored and approved; ISO/IEC 42001 AIMS at full maturity. |
| D2 | Closed-loop fine-tuning pipeline; candidate models auto-trained from production signal; promotion only via eval gate. |
| D3 | Self-curating knowledge with provenance + freshness controls; feedback-curated eval datasets. |
| D4 | Eval-driven promotion: a model/agent is promoted only if it beats incumbent on the deterministic harness at ≥99.9% (P1, P5). |
| D5 | Autonomous fleet self-optimization (routing, LoRA selection) within PCCP envelope. |
| D6 | Continuous attestation of self-modifying components; change provenance; rollback always available. |
| D7 | Cost-optimal routing (P6) closed-loop with FinOps; auto-rightsizing GPU; token economics steered to target. |
| D8 | Operating model steady-state; CoE → embedded; continuous enablement; no-blame, evidence-first culture institutionalized. |

Deliverables / artifacts. PCCP change-control protocol; closed-loop fine-tune pipeline; eval-driven promotion policy; cost-optimization control loop; AIMS conformance evidence.

Exit criteria (steady state, who signs off).

  • Every self-initiated model/agent change passes deterministic eval gate ≥99.9% before promotion, within PCCP envelope (Eval Owner + QA/RA).
  • Cost-per-green-PR trending to target under FinOps governance (Engineering + PMO).
  • All changes attested, reversible, and within pre-authorized change-control bounds (Security + QA/RA).
  • Governing level: min(D1,D4,D6) ≥ L5.
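
The promotion rule in this phase (beat the incumbent on the same deterministic harness, at ≥99.9%, inside the PCCP envelope) reduces to a small predicate. The sketch below is a hedged illustration: `EvalResult`, its fields, and `should_promote` are hypothetical names, and the boolean `within_pccp_envelope` stands in for whatever envelope check the approved protocol defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Outcome of one model on one harness version (hypothetical shape)."""
    model_id: str
    harness_version: str
    passed: int
    total: int


def meets_gate(result: EvalResult) -> bool:
    """>= 99.9% pass rate, compared in integers to avoid float rounding."""
    return result.total > 0 and result.passed * 1000 >= result.total * 999


def should_promote(candidate: EvalResult, incumbent: EvalResult,
                   within_pccp_envelope: bool) -> bool:
    """Promote only if all conditions hold; otherwise the incumbent stays."""
    same_harness = candidate.harness_version == incumbent.harness_version
    # Cross-multiplied pass-rate comparison: candidate strictly better.
    beats_incumbent = candidate.passed * incumbent.total > incumbent.passed * candidate.total
    return within_pccp_envelope and same_harness and meets_gate(candidate) and beats_incumbent


if __name__ == "__main__":
    incumbent = EvalResult("model-v1", "harness-1", 9990, 10000)
    candidate = EvalResult("model-v2", "harness-1", 9995, 10000)
    print(should_promote(candidate, incumbent, within_pccp_envelope=True))
```

Requiring the same harness version keeps the comparison meaningful: a candidate scored on a different eval set has not been shown to beat anything.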

Primary risks + mitigations. Self-optimization drifts outside intended behavior → PCCP envelope as hard boundary; eval-gated promotion; rollback. Cost optimization degrades quality → quality is the gate, cost is the objective subject to the gate. Change control too slow → predetermined protocol pre-authorizes the space of changes.

Value unlocked. A continuously improving, cost-optimal, self-hosted agentic SDLC where quality is provably non-decreasing and change is governed — the north star.


3. Pilot Strategy

Principle: prove on the safe edge, then graduate inward.

Team / repo selection (in priority order):

  1. IEC 62304 Class A software first — non-safety internal tools, build tooling, test utilities, internal web apps. No patient-impact path.
  2. High test coverage + mature CI (the harness needs something to gate against).
  3. Volunteer teams with engaged tech leads (cultural readiness over raw size).
  4. Repos with clean, current specifications or willingness to write them.
  5. Explicitly excluded from early pilots: any Class B/C, regulated firmware, anything in a device's safety path.

Success criteria for a pilot.

| Metric | Target (placeholder) |
| --- | --- |
| Deterministic eval gate reproducibility | 100% seed-stable across reruns |
| Defect-escape rate vs. baseline | ≤ baseline (no regression) |
| Cost-per-green-PR | Measured + trending down |
| Reviewer trust (survey) | ≥ 70% "would expand scope" |
| Rollback events causing incident | 0 |

Blast-radius containment. Sandboxed execution (gVisor/Kata); least-privilege tool scopes; feature-flagged rollout; no production/clinical data; no write access to release branches without HITL; per-pilot kill switch (revoke agent identity via SPIFFE/SPIRE); model versions pinned and signed.

Graduation path. Pilot → cohort (3–5 teams, same safety class) → broader Class A → cautious Class B only after the corresponding ASMM-Med level + eval evidence exist → Class C only with validated agents (L4) and always dual human control. Every graduation is a documented decision checkpoint (§8) with tri-signature. Learnings (harness components, specs, eval datasets, runbooks) are promoted to shared org assets owned by the CoE — the harness is the product (P5).


4. Organization & Operating Model Evolution

New / changed roles.

| Role | Mandate | Introduced |
| --- | --- | --- |
| Harness Engineer | Builds/owns the deterministic eval harness, golden datasets, CI gates. Treats harness as a product. | L2 |
| Eval Owner | Owns validation evidence, eval coverage, the 99.9% system property, promotion eval gates. | L1 (named) → L2 (active) |
| Model Steward | Owns model fleet lifecycle, fine-tunes, versioning, signing, registry, reproducibility. | L1 |
| Agent Steward | Owns a specific agent workflow: scope, policy, sandbox, HITL design, assurance case. | L3 |
| AI Governance Board | Tri-functional (Eng + QA/RA + Security) authority over policy, promotions, gate reviews, stop/rollback. | L1 |
| QA/RA Integration | Embeds regulatory/quality into the AI lifecycle: CSA validation, 62304 traceability, design controls. | Throughout, deepening L2→L4 |
| Security Integration | Zero-trust agent identity, supply-chain, sandboxing, attestation, IEC 62443. | Throughout |

RACI for promotion decisions (level N → N+1).

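| Activity | Eng Lead | Eval Owner | QA/RA | Security | Gov. Board | PMO |
| --- | --- | --- | --- | --- | --- | --- |
| Produce eval/validation evidence | C | R | C | C | I | I |
| Verify regulatory traceability | I | C | R | C | I | I |
| Verify security posture | I | I | C | R | I | I |
| Promotion decision (tri-sign) | A | C | A | A | R | C |
| Stop / rollback trigger | A | C | A | A | R | I |
| Resource / schedule | C | I | I | I | C | R/A |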

(R=Responsible, A=Accountable, C=Consulted, I=Informed. Promotion requires the three A signatures: Eng + QA/RA + Security.)

CoE vs. embedded. Start with a Center of Excellence (CoE) at L1–L2: a small central team owns the harness, serving, policy, and standards. Transition to an embedded model at L3+: the CoE retains shared assets, standards, and the Governance Board, while Harness Engineers and Agent/Model Stewards embed in product teams. By L5, the CoE is a thin standards-and-platform org; capability lives in teams.

Scaling enablement to 1000+ devs. Train-the-trainer cohorts; AGENTS.md/specs/ as self-serve standards; golden-path templates; internal certification for HITL reviewers and Agent Stewards; office hours + internal community; documentation as code.

Approval-fatigue & no-blame controls. Risk-proportional HITL (only consequential actions gated); batch low-risk approvals with audit sampling; clear escalation paths; rotation of reviewers; no-blame culture — logging is for quality/regulatory evidence, never individual performance; psychological safety to halt or roll back without penalty; "review every shipped line" framed as engineering craft, not blame.


5. Investment & Resourcing per Phase

Illustrative qualitative ranges (planning placeholders; cost mechanics per 08).

| Phase | GPU capacity | Platform/MLOps headcount | Fine-tuning effort | Eval engineering | Training/enablement |
| --- | --- | --- | --- | --- | --- |
| L0→L1 | Small (serving S/M, inference only) | 3–6 | None | Manual/baseline | Org-wide literacy (high reach, low depth) |
| L1→L2 | Small–Med (+ LoRA fine-tune jobs) | 6–10 | Moderate (task LoRAs) | Heavy (harness is the product) | Harness Engineer cohort; SDD training |
| L2→L3 | Med–Large (fleet S/M/L/V/E, routing) | 10–18 | Moderate–High | High (trajectory/tool eval) | Agent Steward + HITL reviewer training |
| L3→L4 | Large (validated serving + monitoring) | 15–25 | High (validated tunes) | Very high (99.9% assurance + CSA) | QA/RA + Security deep embed |
| L4→L5 | Large, cost-optimized (auto-rightsized) | 12–20 (efficiency gains) | Continuous (closed-loop) | Continuous (eval-driven promotion) | Steady-state continuous enablement |

Cost framing (ties to 08). GPU/token cost is a first-class constraint (P6). Early phases over-provision for trust; from L4→L5, FinOps + cost-optimal routing drive cost-per-green-PR down while quality (the gate) is held constant. Eval engineering is the largest sustained investment — the harness is the product, and validation evidence is the moat. Headcount shifts from central platform build (L1–L2) toward embedded stewardship + efficiency (L4–L5).
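
As a rough illustration of the headline cost metric, cost-per-green-PR can be read as total AI spend in a period divided by the number of pull requests that passed the deterministic gate in that period. That reading is an assumption made for this sketch; the authoritative definition and cost mechanics live in 08, and the cost categories and function name below are hypothetical.

```python
def cost_per_green_pr(gpu_cost: float, token_cost: float,
                      eval_cost: float, green_prs: int) -> float:
    """Total AI spend for a period divided by PRs that passed the gate.

    Assumed definition for illustration; see 08 for the real cost model.
    Failed and abandoned agent runs still count toward cost, which is the
    point of the metric: only green PRs count toward the denominator.
    """
    if green_prs <= 0:
        raise ValueError("no green PRs in the period; metric is undefined")
    return (gpu_cost + token_cost + eval_cost) / green_prs


if __name__ == "__main__":
    # Placeholder figures, not program data.
    print(cost_per_green_pr(gpu_cost=1200.0, token_cost=300.0,
                            eval_cost=100.0, green_prs=40))  # 40.0
```

Raising on a zero denominator is deliberate: a period with spend but no green PRs should surface as an exception for review rather than as an infinite or zero cost on a dashboard.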


6. Consolidated Milestone & KPI Table per Phase

Capability KPIs and assurance/cost KPIs (drawn from 02 §8). Targets are placeholders.

| Phase | Capability KPIs | Assurance KPIs | Cost KPIs | Key milestone |
| --- | --- | --- | --- | --- |
| L0→L1 | % AI traffic on self-hosted endpoints (→100%); AI literacy completion | 100% prompt/response logged & attributable (Part 11) | GPU baseline $/inference established | Shadow AI killed; serving + logging live |
| L1→L2 | % pilot repos with SDD + AGENTS.md/specs/; automated change throughput | Deterministic eval gate reproducibility (→100%); eval coverage % | Cost-per-green-PR baseline | Deterministic harness gates CI |
| L2→L3 | # sandboxed agent workflows; multi-step task completion rate | Trajectory/tool-use eval coverage; HITL audit-trail completeness 100% | $/agent-run; routing cost efficiency | MCP plane + policy server + HITL live |
| L3→L4 | # validated autonomous workflows; autonomous PR throughput (by safety class) | ≥99.9% gate correctness (system property); 100% IEC 62304 traceability | Cost-per-green-PR optimized vs. L3 | CSA-validated agents in QMS; A2A |
| L4→L5 | Closed-loop promotion frequency; fleet self-optimization rate | Eval-driven promotion pass-rate ≥99.9%; PCCP-conformant changes 100% | Cost-per-green-PR at target; GPU utilization | Self-optimizing, PCCP-governed steady state |

7. Program Risk Register

| ID | Risk | Likelihood | Impact | Mitigation | Owner |
| --- | --- | --- | --- | --- | --- |
| PR-1 | Capability outruns assurance (autonomy enabled before validation) | Med | Critical | min(D1,D4,D6) clamp; no promotion without tri-sign + eval evidence; "don't grant autonomy you can't validate" | AI Governance Board |
| PR-2 | Regulatory non-acceptance of agent validation / PCCP approach | Med | High | Early FDA/CSA engagement; conservative assurance cases; CSA + IEC 62304 grounding; pilot scope | QA/RA |
| PR-3 | Cost overrun (GPU/token) | Med | High | FinOps from L1; cost-per-green-PR KPI; cost-optimal routing; rightsizing; tier S/M/L/V/E discipline | PMO + Eng (FinOps) |
| PR-4 | Talent gap (Harness/Agent/Model Stewards, eval engineers) | High | High | Train-the-trainer; CoE seeding; certification; phased role introduction; embedded model | D8 lead / People |
| PR-5 | Shadow AI persists | Med | High | Egress blocking; amnesty; monitoring; leader modeling; no-blame culture; make sanctioned path better | Security |
| PR-6 | Model supply-chain compromise | Low | Critical | Self-hosted open-weight only; Sigstore/cosign + SLSA + SBOM; signed LoRAs; Vault-brokered creds; attestation | Security + Model Steward |
| PR-7 | Change-management resistance | Med | Med | No-blame culture; value-per-phase wins; reviewer rotation; approval-fatigue controls; transparent comms | Eng leadership |
| PR-8 | Eval non-determinism / flakiness | Med | High | Hermetic envs; strict seeding; flaky-case quarantine; harness-as-product investment | Harness Engineer / Eval Owner |
| PR-9 | Drift erodes validated state (post-L4) | Med | High | Continuous gate-correctness monitoring; auto-rollback to pinned version; PCCP envelope | Eval Owner |

8. Decision Checkpoints & Governance Cadence

| Cadence | Forum | Purpose |
| --- | --- | --- |
| Quarterly | ASMM-Med Assessment | Score all 8 dimensions; recompute governing level = min(D1,D4,D6); revise roadmap/thresholds |
| Per transition | Gate Review (tri-sign) | Verify exit criteria; Eng + QA/RA + Security promotion sign-off; record reversibility plan |
| Per pilot graduation | Checkpoint | Approve scope expansion / next cohort with evidence |
| Continuous | Monitoring + alarms | 99.9% gate correctness, drift, cost, security posture |

Stop / rollback triggers (any one triggers halt + Governance Board review):

  • Release-gate correctness drops below the level's threshold (e.g., <99.9% at L4).
  • Any agent action outside policy/scope, or sandbox escape.
  • Loss of attributable audit trail / Part 11 integrity.
  • Cost-per-green-PR breaches FinOps ceiling without quality justification.
  • Regulatory or QA/RA finding against a deployed capability.
  • Model supply-chain or attestation failure.

Rollback mechanics (always available, R3): feature-flag disable; pin to prior signed model/LoRA version; revoke agent identity (SPIFFE/SPIRE); reduce autonomy scope one ASMM-Med level; revert to HITL or dual-human control. No one-way doors.
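
The stop/rollback triggers listed above lend themselves to a mechanical check over monitoring data, where any single tripped trigger halts the capability pending Governance Board review. A minimal sketch follows, with the caveat that the `Telemetry` fields and the `tripped_triggers` function are hypothetical and the real signals would come from the monitoring stack described in earlier phases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Telemetry:
    """Snapshot of monitored signals (hypothetical field names)."""
    gate_correctness: float        # observed release-gate correctness
    gate_threshold: float          # the current level's threshold, e.g. 0.999 at L4
    out_of_scope_actions: int      # includes sandbox escapes
    audit_trail_intact: bool
    cost_per_green_pr: float
    finops_ceiling: float
    quality_justification: bool    # approved justification for a cost breach
    open_findings: int             # regulatory or QA/RA findings
    attestation_ok: bool           # supply-chain / attestation checks pass


def tripped_triggers(t: Telemetry) -> list:
    """Return the names of all tripped triggers; any one means halt."""
    checks = {
        "gate_correctness_below_threshold": t.gate_correctness < t.gate_threshold,
        "out_of_scope_action_or_sandbox_escape": t.out_of_scope_actions > 0,
        "audit_trail_integrity_lost": not t.audit_trail_intact,
        "cost_ceiling_breached": (t.cost_per_green_pr > t.finops_ceiling
                                  and not t.quality_justification),
        "regulatory_or_qa_finding": t.open_findings > 0,
        "supply_chain_or_attestation_failure": not t.attestation_ok,
    }
    return [name for name, tripped in checks.items() if tripped]


if __name__ == "__main__":
    snapshot = Telemetry(0.9985, 0.999, 0, True, 38.0, 50.0, False, 0, True)
    tripped = tripped_triggers(snapshot)
    print("HALT" if tripped else "OK", tripped)
```

Returning the full list rather than a single boolean keeps the evidence for the Governance Board review: the halt decision needs only one trigger, but the review needs to know all of them.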


9. "Start Monday" Quick Wins

Regulated adaptation of the source-paper spirit — structure, not vibes.

For individual developers:

  • Add an AGENTS.md to your repo: conventions, build/test commands, guardrails, what agents may and may not do.
  • Create a specs/ directory; write the spec (the design input) before the code.
  • Write tests and evals before code. The eval is the contract.
  • Review every shipped line — AI-authored or not. Authorship is yours; craft is intent + validation.
  • Route all AI use through sanctioned self-hosted endpoints. Kill your shadow AI today.

For engineering leaders:

  • Stand up (or adopt) the deterministic eval harness in CI for one repo this week.
  • Treat the harness, golden datasets, specs, and runbooks as shared assets, not local hacks.
  • Pick a Class A pilot repo with good coverage and a willing team.
  • Model no-blame behavior: reward halting and rollback, not heroics.

For the organization:

  • Charter the AI Governance Board (Eng + QA/RA + Security).
  • Name the first Eval Owner and Model Steward.
  • Publish the AI Use Policy and the shadow-AI cutover plan.
  • Begin org-wide AI literacy with the thesis up front: structure scales, vibes don't.

10. North-Star Vision Recap

The craft is changing, not disappearing. Intent and validation are the new engineering craft: a developer's value moves from typing implementation to specifying intent precisely and proving correctness rigorously. The harness is the product (P5); the eval is the contract; the 99.9% release gate is a system property, not a hope (P1).

Structure scales; vibes don't. Spec-driven development, deterministic evaluation wrapping probabilistic generation (P2), risk-proportional autonomy (P3), and Part 11-grade evidence (P4) are what let a 1000+ engineer regulated organization adopt agentic SDLC without trading away the audit trail, the safety case, or patient trust.

AI here is an amplifier of engineering and quality culture — never a substitute for it. Applied to a mature, evidence-first, no-blame culture, it compounds quality and throughput. Applied to a weak one, it compounds risk. This roadmap's discipline — assurance-gated, reversible, governed by min(D1,D4,D6), pilot-before-scale — is precisely how we ensure it amplifies the right thing. Don't grant autonomy you can't yet validate. Earn each level. Then the structure carries you to the next.