
08 — Token & GPU Economics (FinOps for Self-Hosted Agentic Dev)

Part of Agentic-Native SDLC for Regulated Medical Device Engineering. Status: Reference baseline · Date: May 2026 · Audience: CFO/Finance, Platform Engineering, Quality/Regulatory. Cross-refs: 01-requirements · 02-maturity-model · 03-reference-architecture · 04-model-strategy-and-finetuning · 05-evaluation-and-validation · 06-agentic-workflows · 07-security-and-compliance · 09-adoption-roadmap

This document is the financial control plane for the program. Every formula, rate, and ratio below is illustrative; the parameters are org-set and owned by the FinOps practice. Where a number appears, treat it as a placeholder to be replaced by measured values from your own fleet telemetry.


1. The Economic Reframing

Self-hosting fine-tuned open-weight models is not primarily a performance decision for this program — it is an economic and sovereignty decision, and it changes the shape of the cost, not just the magnitude.

A SaaS LLM API is pure, uncapped OpEx: a per-token meter that scales linearly and forever with usage, with no asset on the balance sheet and no floor on marginal cost. A 1000+ developer org running agentic workflows generates enormous token volume — most of it on repair loops, retrieval, and evaluation, not the final answer the human sees. At that volume, the per-token meter becomes the dominant line item and is structurally unbounded.

Self-hosting converts that into two different cost categories, amortized GPU CapEx and operational OpEx:

| Dimension | SaaS API (rejected) | Self-Hosted Fleet (chosen) |
|---|---|---|
| Cost class | OpEx, per-token, uncapped | GPU CapEx (amortized) + Ops OpEx |
| Marginal cost of one more token | Vendor rate (fixed, never zero) | ≈ marginal electricity once GPU is owned |
| Cost behavior at scale | Linear, unbounded | Step-function (buy capacity) + high utilization wins |
| Data path | Source/spec leaves the boundary | Stays inside the regulated boundary |
| Reproducibility | Vendor-controlled model drift | Pinned weights, P7 reproducible |
| Negotiating position | Vendor pricing power | Internal control of the curve |

Why the org chose this (decision is non-optional, see §6):

  1. Cost at scale. Above a break-even volume, owned-and-amortized GPU beats per-token billing, and our volume is far above it.
  2. Data sovereignty. Regulated source, specifications, defect data, and patient-adjacent context must not transit a third-party LLM (see 07-security-and-compliance).
  3. No per-token vendor billing / no model drift. Validation under IEC 62304 requires that a release gate run today reproduces tomorrow; a silently-updated vendor model breaks that (P7, 05-evaluation-and-validation).

The governing metric: COST-PER-GREEN-PR

Per Principle P6, cost is measured per verified task, not per token. The unit of economic value this program produces is a Green PR: a change that passes all deterministic gates plus human review (see 06-agentic-workflows). Tokens that do not contribute to a Green PR are waste, regardless of how cheap each one was.

$$\text{COST-PER-GREEN-PR} = \frac{\text{Total compute \$ (inference + eval + retrieval + idle + amortized CapEx + ops)}}{\text{Number of PRs that passed all gates + review}}$$

Why optimizing cost-per-token alone is a trap. Cost-per-token is a component, not the objective. The classic failure: a team cuts token price 40% by quantizing aggressively or routing everything to a tiny model, first-pass yield collapses, agents loop and re-attempt, escape-rate to human reviewers rises — and cost-per-Green-PR goes UP even as cost-per-token went down. Cheap-but-wrong is the most expensive mode in a regulated SDLC, because rework, re-validation, and reviewer time dwarf raw inference. The denominator is the lever; the numerator is the temptation.
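
The trap can be made concrete with a minimal sketch. The function follows the metric definition above; the monthly figures are hypothetical inputs chosen only to show the direction of the effect, not measured values.

```python
def cost_per_green_pr(total_compute_usd: float, green_prs: int) -> float:
    """Total compute spend divided by PRs that passed all gates plus review."""
    if green_prs <= 0:
        raise ValueError("no Green PRs: the metric is undefined")
    return total_compute_usd / green_prs


# Hypothetical month before and after a 40% cut in token price.
# The cheaper model loops more, so spend falls by less than the price cut,
# and fewer changes make it through the gates.
before = cost_per_green_pr(total_compute_usd=10_000.0, green_prs=2_000)
after = cost_per_green_pr(total_compute_usd=9_000.0, green_prs=1_200)

print(f"before: ${before:.2f}  after: ${after:.2f}")
```

In this sketch total spend falls while cost-per-Green-PR rises, because the denominator shrank faster than the numerator.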


2. Cost Taxonomy

All program compute spend decomposes into ten drivers. Each is metered separately (OpenTelemetry → cost, §7) so it can be attributed and optimized independently.

| # | Cost driver | What moves it | Primary controls |
|---|---|---|---|
| 1 | Inference: model size | Params served, tier (S/M/L), MoE active params | Smallest-capable-model, tiered routing (§4), distillation |
| 2 | Inference: context length | Prompt tokens, retrieved context, history | Context economy (§4), prefix/prompt cache, retrieval over stuffing |
| 3 | Inference: output length | Generated tokens, reasoning-effort, loop count | Reasoning-effort caps, loop-count limits, structured output |
| 4 | Inference: batch efficiency | Continuous batching, concurrency, queue depth | vLLM continuous batching, PagedAttention, request shaping |
| 5 | Inference: GPU type/utilization | GPU SKU $/hr, MIG slicing, idle fraction | Right-sizing, MIG partitioning, KEDA scale-to-zero |
| 6 | Training / fine-tuning | LoRA vs full FT, dataset size, epochs, runs | Multi-LoRA, spot/preemptible + Kueue, distillation runs |
| 7 | Evaluation / validation | Gate suites, reproducibility reruns, 99.9% sampling | Eval caching, deterministic seeds, eval-tier routing |
| 8 | Retrieval / embedding | Embedding calls, rerank, index refresh, vector store | Tier-E batching, retrieval cache, incremental indexing |
| 9 | Idle / overprovisioning | Reserved-but-unused GPU, warm pools, headroom | Scale-to-zero, autoscaling, batch/interactive split |
| 10 | Ops / people | Platform SRE, FinOps, MLOps, on-call, eval engineering | Automation, self-service, maturity (L1→L5) |

Do not forget evaluation cost. The ≥99.9% release-gate correctness target (P1) is enforced by running a great deal of inference — large eval suites, adversarial probes, statistical sampling, and reproducibility reruns that re-execute gates on pinned weights for the regulatory record. For mature agentic repos, eval + reproducibility compute is frequently 25–45% of total inference spend and must be a first-class budget line, not an afterthought (see 05-evaluation-and-validation).


3. An Illustrative Cost Model

ILLUSTRATIVE — all rates are placeholders. Parameters ($/GPU-hr, throughput, yields) are org-set and replaced by measured fleet telemetry. The structure is the deliverable, not the digits.

3.1 GPU-hour → cost per 1k tokens, by tier

The per-token cost of a served model is the GPU rental cost divided by how many tokens that GPU produces per hour.

$$\text{Cost\_per\_1k\_tokens} = \frac{(\text{GPU\_count} \times \text{\$/GPU-hr}) \div \text{Utilization}}{\text{Effective\_throughput\_tok\_per\_hr}} \times 1000$$

Effective_throughput already bakes in quantization, continuous batching, and speculative decoding (§4). The table quotes it in tok/s; multiply by 3,600 to get the tok/hr used in the formula. Illustrative steady-state rates:

| Tier | Model class | GPU footprint (illus.) | Eff. throughput (tok/s, batched) | $/GPU-hr (illus.) | $ / 1k tok (illus.) |
|---|---|---|---|---|---|
| Tier-S "Reflex" | 1–8B, FP8/INT8 | MIG slice / 1× GPU | 6,000 | $2.50 | $0.00012 |
| Tier-M "Worker" | 14–34B, AWQ/GPTQ | 1–2× GPU | 2,200 | $2.50 | $0.00063 |
| Tier-L "Reasoner" | 70B+/MoE | 4–8× GPU | 700 | $2.50 | $0.0079 |
| Tier-V Multimodal | VLM | 1–2× GPU | 1,000 | $2.50 | $0.0028 |
| Tier-E Embed/Rerank | embedding | MIG slice | 40,000 (items) | $2.50 | $0.0000175 |

The ~65× spread between Tier-S and Tier-L is the entire economic argument for tiered routing: a call needlessly sent to Tier-L costs as much as ~65 correct Tier-S calls.
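
The per-tier rates can be reproduced with a short sketch of the formula above. The GPU counts used here (1, 2, 8, and 1, the upper end of each footprint) and a utilization of 1.0 are inferred because they reproduce the table's figures; the table does not state them. Tier-V is left out because its table figure does not follow from those same assumptions.

```python
def cost_per_1k_tokens(gpu_count: int, usd_per_gpu_hr: float,
                       tok_per_s: float, utilization: float = 1.0) -> float:
    """GPU cost per hour, adjusted for utilization, spread over tokens per hour."""
    tok_per_hr = tok_per_s * 3600
    return (gpu_count * usd_per_gpu_hr) / utilization / tok_per_hr * 1000


# GPU counts are assumptions chosen to match the illustrative table.
tier_s = cost_per_1k_tokens(1, 2.50, 6_000)
tier_m = cost_per_1k_tokens(2, 2.50, 2_200)
tier_l = cost_per_1k_tokens(8, 2.50, 700)
tier_e = cost_per_1k_tokens(1, 2.50, 40_000)  # per 1k items, not tokens

for name, rate in [("S", tier_s), ("M", tier_m), ("L", tier_l), ("E", tier_e)]:
    print(f"Tier-{name}: ${rate:.7f} per 1k")
```

Because the table rounds each rate, the ratio of the unrounded Tier-L and Tier-S values comes out slightly higher than the ~65× quoted from the rounded figures.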

3.2 Cost per agent task

A single agent task is rarely one model call. It is a sequence of calls across tiers, plus the verifier/sandbox compute that makes the work verifiable, plus its share of evaluation — all divided by first-pass yield (FPY) to account for repair loops.

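$$\text{Cost\_per\_task} = \frac{\sum_{\text{tiers}} \left(\text{tokens}_{\text{tier}} \times \text{rate}_{\text{tier}}\right) + \text{Verifier\_sandbox\_compute} + \text{Eval\_amortized}}{\text{First\_Pass\_Yield}}, \qquad 0 < \text{FPY} \le 1$$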

Verifier_sandbox_compute = build/test/static-analysis/sandbox-exec cost to check the candidate (the harness is the product, P5). Eval_amortized = task's share of gate + reproducibility runs. FPY is the multiplier that ties quality to cost: every failed attempt re-spends the numerator.

The repair-loop explosion, holding raw token cost constant, as FPY falls:

| First-Pass Yield | Effective cost multiplier (1 ÷ FPY) | Interpretation |
|---|---|---|
| 0.90 | 1.11× | Healthy; small rework tax |
| 0.70 | 1.43× | Noticeable loop spend |
| 0.50 | 2.00× | Half of all work is redone |
| 0.30 | 3.33× | Loop-dominated; cheap model is a false economy |
| 0.15 | 6.67× | Pathological; escape-rate to humans spikes |

This table is the quantified form of the §1 trap: driving down per-token cost while letting FPY fall is a net loss.
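
One way to see where the 1 ÷ FPY multiplier comes from is the following sketch. It assumes each attempt passes independently with probability p = FPY and that the agent retries until it passes; both are simplifying assumptions, and the guardrails in §4.8 truncate the tail in practice.

```latex
\mathbb{E}[\text{attempts}]
  = \sum_{k=1}^{\infty} k \, p \, (1-p)^{k-1}
  = \frac{1}{p},
\qquad\text{so}\qquad
\text{Cost\_per\_task} = \frac{\text{cost per attempt}}{\text{FPY}} .
```

At FPY = 0.70 this gives 1/0.70 ≈ 1.43 expected attempts, matching the second row of the table.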

3.3 Worked numeric example (labeled placeholders)

ILLUSTRATIVE. "Implement a bounded requirement-to-code change with passing unit tests."

| Input (placeholder) | Symbol | Value |
|---|---|---|
| Tier-S router/classify + lint tokens | t_S | 8,000 tok @ $0.00012/1k |
| Tier-M implementation tokens | t_M | 40,000 tok @ $0.00063/1k |
| Tier-L escalation (10% of tasks need it) | t_L | 6,000 tok @ $0.0079/1k × 0.10 |
| Embedding/retrieval | t_E | 20,000 items @ $0.0000175/1k |
| Verifier/sandbox compute (build+test+sast) | C_v | $0.018 |
| Eval/repro amortized share | C_e | $0.012 |
| First-pass yield | FPY | 0.70 |

Token + retrieval cost:
  Tier-S : 8,000/1000  × $0.00012 = $0.00096
  Tier-M : 40,000/1000 × $0.00063 = $0.02520
  Tier-L : 6,000/1000  × $0.0079 × 0.10 = $0.00474
  Tier-E : 20,000/1000 × $0.0000175 = $0.00035
  Subtotal tokens                       = $0.03125

Numerator = $0.03125 + C_v($0.018) + C_e($0.012)  = $0.06125
Cost_per_task = $0.06125 ÷ FPY(0.70)              = $0.0875   ✅

Sensitivity — same task, FPY collapses to 0.30 (e.g., over-aggressive quantization or routing too small): $0.06125 ÷ 0.30 = $0.2042 — a 2.3× cost increase with zero change to per-token rates. If that low FPY also raises human escape-rate, the true cost-per-Green-PR rises further still (reviewer minutes are the most expensive tokens in the system).
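
The worked example and its sensitivity case can be checked with a few lines of Python. This is a sketch of the §3.2 formula using the placeholder inputs from the table above; the function and variable names are illustrative.

```python
def cost_per_task(token_costs, verifier_usd: float, eval_usd: float, fpy: float) -> float:
    """Per-attempt cost (tokens + verifier + eval share) divided by first-pass yield."""
    if not 0 < fpy <= 1:
        raise ValueError("FPY must be in (0, 1]")
    return (sum(token_costs) + verifier_usd + eval_usd) / fpy


token_costs = [
    8_000 / 1000 * 0.00012,          # Tier-S router/classify + lint
    40_000 / 1000 * 0.00063,         # Tier-M implementation
    6_000 / 1000 * 0.0079 * 0.10,    # Tier-L escalation, 10% of tasks
    20_000 / 1000 * 0.0000175,       # Tier-E embedding/retrieval
]

baseline = cost_per_task(token_costs, verifier_usd=0.018, eval_usd=0.012, fpy=0.70)
degraded = cost_per_task(token_costs, verifier_usd=0.018, eval_usd=0.012, fpy=0.30)

print(f"FPY 0.70: ${baseline:.4f}")
print(f"FPY 0.30: ${degraded:.4f} ({degraded / baseline:.2f}x)")
```

Only the `fpy` argument differs between the two calls, which is the point of the sensitivity case: per-token rates are untouched.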


4. The Optimization Levers

Each lever lists mechanism → expected impact (illustrative) → tradeoff. They compound; they also interact (over-using one can sink FPY and undo another), so they are tuned against cost-per-Green-PR, never in isolation.

4.1 Tiered model routing

  • Mechanism. A lightweight classifier/router running on Tier-S scores incoming task complexity and dispatches to the smallest capable tier; escalate to Tier-M/Tier-L only on confidence/complexity thresholds or verifier failure. Smallest-capable-model principle.
  • Impact. If the majority of low-complexity calls resolve on Tier-S (~65× cheaper than Tier-L), blended $/token can drop 40–70% vs. always-on Tier-L.
  • Tradeoff. Router error is double-edged: under-routing tanks FPY (loops); the router itself must be validated and is an eval surface. Mis-tuned thresholds look cheap per-token while raising cost-per-Green-PR.
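
The dispatch rule described above could be sketched as follows. The complexity score, the two thresholds, and the one-tier-per-failure escalation policy are hypothetical; in practice the thresholds would be tuned against cost-per-Green-PR and the router itself validated.

```python
TIERS = ["S", "M", "L"]


def route(complexity: float, prior_failures: int = 0,
          medium_at: float = 0.4, high_at: float = 0.8) -> str:
    """Pick the smallest tier for the score, then escalate one tier per verifier failure."""
    if complexity >= high_at:
        base = 2
    elif complexity >= medium_at:
        base = 1
    else:
        base = 0
    # Escalation is capped at the largest tier.
    return TIERS[min(base + prior_failures, len(TIERS) - 1)]


print(route(0.1))                    # low complexity, first attempt
print(route(0.1, prior_failures=1))  # same task after one verifier failure
```

Starting at the smallest capable tier and escalating only on failure keeps Tier-L usage tied to observed need rather than to a pessimistic default.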

4.2 Caching (KV/prefix, prompt, semantic, retrieval)

  • Mechanism. PagedAttention/KV-cache reuse + prefix/prompt caching skip recompute of shared system prompts, specs, and skill preambles; semantic caching returns prior answers for near-duplicate requests; retrieval caching avoids re-embedding/re-fetching stable context.
  • Impact. Prompt/prefix cache can cut prefill compute 30–80% on repetitive agentic prompts (large shared spec/skill prefixes); retrieval cache cuts Tier-E load materially.
  • Tradeoff. Semantic cache must be conservative in regulated paths — a stale or near-miss hit that flips a gate decision is a correctness defect. Cache keys must include weight/version/spec hashes for reproducibility.
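
A minimal sketch of a reproducibility-safe cache key, assuming the cache is an exact-match store keyed by a hash. The field names are hypothetical; the point is that weights, adapter version, and spec hash are all part of the key, so a change to any of them misses the cache instead of returning a stale answer.

```python
import hashlib
import json


def cache_key(prompt: str, weights_hash: str, adapter_version: str, spec_hash: str) -> str:
    """Derive a deterministic key from the prompt and everything that pins model behavior."""
    payload = json.dumps(
        {
            "prompt": prompt,
            "weights": weights_hash,
            "adapter": adapter_version,
            "spec": spec_hash,
        },
        sort_keys=True,  # stable field order so equal inputs always hash the same
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


k1 = cache_key("lint module X", "w-abc", "lora-1", "spec-9")
k2 = cache_key("lint module X", "w-abd", "lora-1", "spec-9")  # different weights
print(k1 != k2)
```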

4.3 Quantization + speculative decoding + continuous batching

  • Mechanism. FP8/INT8/AWQ/GPTQ shrink memory/raise throughput; speculative decoding uses a small draft model to propose tokens a larger model verifies; continuous batching (vLLM) keeps GPUs saturated across concurrent requests.
  • Impact. Quantization commonly yields 1.5–3× throughput per dollar; speculative decoding yields a 1.5–2.5× latency or throughput improvement on accept-heavy workloads; continuous batching lifts utilization from ~30% to 70–90%.
  • Tradeoff. Quantization can degrade accuracy — every quantized model must re-pass deterministic gates (05) before serving. Quantize, then measure FPY, never assume.

4.4 Multi-LoRA adapter amortization

  • Mechanism. Serve one base model with many LoRA adapters (per-domain/per-task behaviors) hot-swapped per request, instead of standing up a full fine-tuned model per behavior.
  • Impact. Collapses N dedicated deployments into ~1 base footprint — large reduction in idle/overprovisioning and CapEx; new specialized behaviors become near-zero marginal serving cost.
  • Tradeoff. Adapter routing/versioning complexity; per-adapter eval still required; a bad base upgrade invalidates all adapters at once (manage via 04).

4.5 Context economy

  • Mechanism. Lean specs; dynamic context / skills loaded on demand; retrieval over context-stuffing; prune history; structured rather than verbose I/O. Avoid "context dumping" entire repos/specs into every prompt.
  • Impact. Context length drives prefill cost super-linearly via attention; trimming 50% of tokens often cuts prefill cost >50% and improves FPY (less distraction).
  • Tradeoff. Under-supplying context tanks FPY too — economy means right context, not minimal context. Tune against yield.
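
The ">50%" claim can be motivated with a simple cost model. As a sketch, assume prefill cost for a prompt of n tokens has a linear part (per-token dense layers) and a quadratic part (self-attention), with hypothetical positive constants a and b:

```latex
C(n) \approx a\,n + b\,n^{2},
\qquad
\frac{C(n/2)}{C(n)} = \frac{a/2 + b\,n/4}{a + b\,n}
\in \left(\tfrac{1}{4},\, \tfrac{1}{2}\right)
\quad \text{for } a, b > 0 .
```

Under this model, halving the prompt removes between 50% and 75% of prefill cost, with the saving approaching 75% as the quadratic term dominates at long context.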

4.6 Autoscaling, partitioning, queueing

  • Mechanism. KEDA scale-to-zero for spiky/interactive services; MIG partitioning to pack small models onto GPU slices; spot/preemptible for training/eval batch; Kueue for queue + quota fairness.
  • Impact. Scale-to-zero eliminates overnight idle on bursty endpoints; MIG raises packing density; spot cuts training $ 60–90%.
  • Tradeoff. Cold-start latency on scale-from-zero (mitigate with warm minimums for interactive tiers); spot preemption requires checkpointing. Never put latency-critical interactive gates on pure scale-to-zero without a warm floor.

4.7 Distillation (big → small)

  • Mechanism. Distill Tier-L behavior into Tier-S/M adapters; the expensive Reasoner generates training signal once, the cheap model serves it forever.
  • Impact. Shifts steady-state load down a tier — recurring 40–65% serving-cost reduction on distilled task families; reduces Tier-L invocation frequency.
  • Tradeoff. Up-front distillation + eval CapEx; distilled model can lag base capability on edge cases — gated re-validation required before it replaces escalation paths.

4.8 In-loop budget guardrails

  • Mechanism. Hard token/compute caps per task, reasoning-effort caps, and loop-count limits enforced by the orchestrator (06); on breach, fail-closed to human triage rather than burning unbounded compute.
  • Impact. Bounds the worst-case tail — caps the cost of pathological low-FPY tasks that would otherwise loop indefinitely; protects the monthly budget from a single runaway agent.
  • Tradeoff. Caps set too tight truncate legitimately hard tasks (raising escape-rate); caps are themselves tuned against cost-per-Green-PR.

flowchart TD
    A[Incoming agent task] --> B{Cache hit?<br/>prompt / semantic / retrieval}
    B -- yes --> Z[Return cached / cheap path]
    B -- no --> C[Tier-S router/classifier]
    C -->|low complexity| D[Tier-S Reflex<br/>+ multi-LoRA adapter]
    C -->|medium| E[Tier-M Worker]
    C -->|high / escalated| F[Tier-L Reasoner<br/>sparingly]
    D --> V{Verifier / deterministic gates}
    E --> V
    F --> V
    V -- pass --> G[GREEN PR candidate]
    V -- fail --> H{Budget guardrail:<br/>tokens / loops / effort left?}
    H -- yes --> C
    H -- no --> I[Fail-closed → human triage]
    G --> M[OpenTelemetry cost metering →<br/>cost-per-Green-PR]
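
The verifier and guardrail loop in the flowchart can be sketched in a few lines. The `Budget` type, the `attempt` callable, and the returned status strings are hypothetical names for illustration; a real orchestrator (06) would also cap reasoning effort and re-route through the tier router on each retry.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Budget:
    max_tokens: int
    max_loops: int


def run_with_guardrails(attempt: Callable[[int], Tuple[bool, int]], budget: Budget) -> str:
    """Retry until the verifier passes or a cap is hit; on breach, fail closed."""
    tokens_used = 0
    for loop in range(budget.max_loops):
        passed, tokens = attempt(loop)  # returns (verifier passed?, tokens spent)
        tokens_used += tokens
        if passed:
            return "green_pr_candidate"
        if tokens_used >= budget.max_tokens:
            return "human_triage"  # token cap breached
    return "human_triage"  # loop cap breached


# A task that passes on its third attempt, well within budget.
print(run_with_guardrails(lambda i: (i == 2, 1_000), Budget(max_tokens=10_000, max_loops=5)))
```

Both exits on breach return the same fail-closed status, so a runaway task costs at most the configured cap before a human sees it.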

5. Capacity Planning for 1000+ Developers

Sizing the fleet is a queueing problem, not a headcount multiplication. The goal is enough capacity to hold interactive latency SLOs at peak while keeping steady-state utilization high.

Estimating concurrent load (illustrative).

Active_devs           = 1000 × engagement_factor(0.6)        = 600
Req_per_active_dev_hr = 30  (agentic calls incl. loops/retrieval/eval)
Average_RPS           = 600 × 30 / 3600                       ≈ 5 RPS sustained
Peak_RPS              = Average_RPS × peakiness(3.0)          ≈ 15 RPS
GPU_needed_at_peak    = Peak_RPS ÷ per-GPU_throughput_at_SLO  (per tier)

| Planning dimension | Approach |
|---|---|
| Peak vs. average | Size interactive tiers for peak RPS at the latency SLO; size batch (eval/training) for average throughput with queueing. Peakiness factor measured per region/timezone. |
| GPU fleet sizing | Per-tier: ceil(Peak_RPS ÷ throughput_at_SLO) + headroom. Bottom-heavy fleet (mostly Tier-S/M, few Tier-L) mirrors the routing distribution. |
| Batch vs. interactive separation | Dedicated pools. Interactive = warm, latency-bounded, KEDA with warm floor. Batch = Kueue-queued, spot-backed, scale-to-zero, latency-tolerant. Never let a training job preempt an interactive gate. |
| Multi-tenancy fairness | Kueue quotas per team/repo so no tenant starves others; borrowing from idle quotas allowed, reclaimable on demand. |
| Headroom for eval/training | Reserve explicit capacity (illus. 15–25%) for gate suites, reproducibility reruns, and fine-tuning; these are non-optional regulatory load, not discretionary. |
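
The load estimate and per-tier sizing rule can be combined into one small function. This is a sketch: the per-GPU request rate of 2.0 RPS at SLO is a hypothetical placeholder, and applying headroom as a multiplier on the base GPU count is one possible reading of "+ headroom".

```python
import math


def size_tier(devs: int, engagement: float, req_per_dev_hr: float,
              peakiness: float, rps_per_gpu_at_slo: float, headroom: float = 0.0):
    """Return (average RPS, peak RPS, GPUs needed) for one interactive tier."""
    average_rps = devs * engagement * req_per_dev_hr / 3600
    peak_rps = average_rps * peakiness
    base_gpus = math.ceil(peak_rps / rps_per_gpu_at_slo)
    return average_rps, peak_rps, math.ceil(base_gpus * (1 + headroom))


# Inputs from the estimate above; rps_per_gpu_at_slo and headroom are assumptions.
avg_rps, peak_rps, gpus = size_tier(1000, 0.6, 30, 3.0, rps_per_gpu_at_slo=2.0, headroom=0.20)
print(f"avg {avg_rps:.1f} RPS, peak {peak_rps:.1f} RPS, {gpus} GPUs")
```

In a real plan this would be evaluated once per tier, with each tier's share of the request mix and its own measured throughput at SLO.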

6. Build-vs-Buy / Self-Host Math

ILLUSTRATIVE structured comparison. Numbers are placeholders to frame the reasoning, not a quote.

| Factor | Hypothetical SaaS API at this scale | Self-Hosted Fleet |
|---|---|---|
| Annual token volume (illus.) | ~30B billable tok/yr (incl. loops, eval, retrieval) | same workload, owned compute |
| Unit basis | blended vendor $/1k tok | amortized $/GPU-hr + ops |
| Annual run cost (illus.) | 30M × $X_blended_per_1k → large, uncapped, linear | GPU_CapEx ÷ amort_yrs + Ops_OpEx + power |
| Marginal next-token cost | vendor rate (never zero) | ≈ marginal power on owned GPU |
| Cost trajectory at growth | scales with usage forever | flattens as utilization rises |

Break-even reasoning. Self-host carries up-front CapEx (GPUs, networking) + steady Ops OpEx (SRE, FinOps, MLOps, power, eval engineering). API carries zero fixed cost but a per-token meter. There is a crossover volume above which amortized self-host is cheaper:

Self-host is cheaper when:  (CapEx ÷ amort_years) + Ops_OpEx_annual + Power_annual
                             <  Annual_token_volume × Blended_API_rate_per_token
(break-even is the volume at which the two sides are equal)

→ Self-host wins decisively once volume × API_rate exceeds fixed+ops cost.
  At 1000+ devs with loop/eval/retrieval amplification, our volume is FAR above crossover.
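
The crossover volume follows directly from the inequality above. The sketch below solves it for the break-even token volume; the inputs are toy numbers in arbitrary currency units, chosen only to exercise the formula and not to represent any real fleet or vendor rate.

```python
def self_host_annual_cost(capex: float, amort_years: float,
                          ops_opex: float, power: float) -> float:
    """Amortized CapEx plus annual ops and power."""
    return capex / amort_years + ops_opex + power


def break_even_tokens(capex: float, amort_years: float, ops_opex: float,
                      power: float, api_rate_per_token: float) -> float:
    """Annual token volume at which API spend equals self-host annual cost."""
    return self_host_annual_cost(capex, amort_years, ops_opex, power) / api_rate_per_token


# Toy inputs (arbitrary units), for illustration only.
annual = self_host_annual_cost(capex=400, amort_years=4, ops_opex=50, power=50)
crossover = break_even_tokens(400, 4, 50, 50, api_rate_per_token=0.01)
print(annual, crossover)
```

Above the crossover every additional token widens the gap in favor of self-hosting, since the fixed side does not grow until the next capacity step.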

Non-cost drivers that make self-host non-optional here (these hold even if the math were neutral):

  • Sovereignty. Regulated source, specs, and defect data must not leave the boundary (07).
  • IP control. Proprietary device engineering knowledge stays in-house; no third-party training on our data.
  • Regulatory control. Pinned, reproducible weights for IEC 62304 validation; no vendor-driven model drift mid-gate (P7, 05).

The economics make self-host attractive; the regulatory and sovereignty constraints make it mandatory.


7. FinOps Operating Model

Ties to D7 in 02-maturity-model. FinOps is a standing practice, not a quarterly cleanup.

| Capability | Implementation |
|---|---|
| Cost attribution | Every inference/eval/training call tagged with team, repo, agent, tier, adapter, task-id, cache-status via OpenTelemetry spans → cost pipeline. Attributable to cost-per-Green-PR per repo. |
| Dashboards | OTel → cost warehouse → dashboards: $/Green-PR, FPY, tier mix, cache hit-rate, GPU utilization, eval/repro share, idle %. |
| Budgets & alerts | Per-team monthly budgets; alerts at 70/90/100%; automatic in-loop guardrails (§4.8) enforce hard caps independent of dashboards. |
| Showback / chargeback | Showback by default (visibility, behavior change); chargeback for high-volume teams to internalize cost. |
| Cost SLOs | Explicit objectives, e.g. $/Green-PR ≤ target, eval share ≤ 40%, GPU util ≥ 70%, cache hit ≥ 50%, idle ≤ 10%. Breach triggers review. |
| Quarterly optimization review | Re-tune routing thresholds, quantization choices, cache policy, fleet mix, distillation candidates against measured FPY and $/Green-PR. Feeds the L5 closed loop (§8). |
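
The attribution step can be illustrated with a small rollup. This is a sketch that assumes span data has already been exported as plain records; the `repo` and `cost_usd` keys are hypothetical stand-ins for whatever attribute names the telemetry pipeline actually uses.

```python
from collections import defaultdict


def cost_per_green_pr_by_repo(spans, green_prs):
    """Sum tagged span cost per repo and divide by that repo's Green PR count."""
    spend = defaultdict(float)
    for span in spans:
        spend[span["repo"]] += span["cost_usd"]
    # Repos with no Green PRs are skipped: the metric is undefined for them.
    return {repo: spend[repo] / count for repo, count in green_prs.items() if count > 0}


spans = [
    {"repo": "pump-fw", "tier": "M", "cost_usd": 1.0},
    {"repo": "pump-fw", "tier": "S", "cost_usd": 2.0},
    {"repo": "ui-app", "tier": "L", "cost_usd": 4.0},
]
result = cost_per_green_pr_by_repo(spans, {"pump-fw": 3, "ui-app": 2, "docs": 0})
print(result)
```

The same grouping works for any other tag (team, agent, tier, adapter), which is why tagging every call at the source matters more than the rollup logic.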

FinOps register (illustrative)

| ID | Metric / control | Illustrative target | Owner | Cadence |
|---|---|---|---|---|
| FIN-01 | Cost-per-Green-PR | ≤ $T per repo class | FinOps + Repo lead | Weekly |
| FIN-02 | First-Pass Yield | ≥ 0.70 | Platform + Eval | Weekly |
| FIN-03 | GPU utilization | ≥ 70% | Platform SRE | Daily |
| FIN-04 | Idle / overprovision | ≤ 10% | Platform SRE | Daily |
| FIN-05 | Cache hit-rate (prompt+semantic) | ≥ 50% | Platform | Weekly |
| FIN-06 | Tier-L invocation share | ≤ 10% of calls | Routing owner | Weekly |
| FIN-07 | Eval + reproducibility share | ≤ 40% of inference $ | Quality | Monthly |
| FIN-08 | Spot usage on training | ≥ 80% of training GPU-hr | MLOps | Monthly |
| FIN-09 | Budget breach incidents | 0 unbounded-loop runaways | FinOps | Monthly |
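
A subset of the register can be expressed as data and checked mechanically. The sketch below encodes the ratio-based targets (FIN-02 through FIN-07) with the illustrative thresholds from the table; the metric key names are hypothetical, and metrics missing from a measurement are skipped rather than treated as breaches.

```python
import operator

# (metric key, comparison that must hold, illustrative target)
REGISTER = {
    "FIN-02": ("first_pass_yield", operator.ge, 0.70),
    "FIN-03": ("gpu_utilization", operator.ge, 0.70),
    "FIN-04": ("idle_fraction", operator.le, 0.10),
    "FIN-05": ("cache_hit_rate", operator.ge, 0.50),
    "FIN-06": ("tier_l_share", operator.le, 0.10),
    "FIN-07": ("eval_share", operator.le, 0.40),
}


def breaches(measured: dict) -> list:
    """Return the IDs of register entries whose target is not met."""
    return sorted(
        fin_id
        for fin_id, (name, holds, target) in REGISTER.items()
        if name in measured and not holds(measured[name], target)
    )


found = breaches({"first_pass_yield": 0.65, "gpu_utilization": 0.80, "idle_fraction": 0.15})
print(found)
```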

8. Cost Across the Maturity Levels

Unit economics improve as the org climbs ASMM-Med — not because tokens get cheaper, but because the denominator (Green PRs) grows and waste shrinks.

| Level | Cost profile | Unit economics & control | $/Green-PR trend |
|---|---|---|---|
| L0 Ad-hoc | Untracked, sporadic; no fleet metering | No attribution; cost-per-token invisible | Unknown / uncontrolled |
| L1 Governed Assistance | Mostly interactive assist; basic metering begins | Cost-per-token visible; FPY undefined; little caching | High, noisy |
| L2 Spec-Driven Bounded Automation | Bounded tasks; routing + caching introduced | $/Green-PR first measured; guardrails appear | Declining, variable |
| L3 Orchestrated Agentic Workflows | Multi-step agents; eval load rises sharply | Full tier routing, multi-LoRA, budget caps; eval share managed | Stabilizing |
| L4 Validated Autonomous Agents | High-volume autonomous, heavy validation/repro | Strong FPY; distillation steady-state; tight cost SLOs | Low, predictable |
| L5 Self-Optimizing Agentic Enterprise | Closed-loop cost optimization | System auto-tunes routing/quantization/cache/fleet against $/Green-PR; data-driven distillation pipeline | Minimized, self-correcting |

L5 is the target end state: routing thresholds, quantization choices, cache policy, and distillation candidates are selected by the system from telemetry, continuously, against cost-per-Green-PR — with every change still passing deterministic gates (05).


9. Anti-Patterns

| Anti-pattern | Why it costs | Counter |
|---|---|---|
| Always-on big models | Tier-L (~65× Tier-S) serving routine calls; idle Reasoner GPUs | Tiered routing + smallest-capable-model + scale-to-zero (§4.1, §4.6) |
| Context dumping | Whole repos/specs into every prompt; super-linear prefill cost; lowers FPY | Context economy, retrieval over stuffing, prompt cache (§4.5, §4.2) |
| Unbounded agentic loops | Pathological tasks loop forever, blow the budget on one runaway | Hard token/loop/effort caps, fail-closed to human (§4.8) |
| Optimizing tokens, ignoring escape-rate/rework | Cheap-but-wrong: per-token down, FPY down, $/Green-PR up (the §1/§3 trap) | Govern by cost-per-Green-PR; measure FPY before/after every change |
| Idle GPU sprawl | Reserved-but-unused GPUs, warm pools nobody uses, no batch/interactive split | Scale-to-zero, MIG packing, Kueue quotas, idle SLO (FIN-04) |
| No caching | Recompute identical prefixes/embeddings every call | KV/prefix + prompt + semantic + retrieval cache (§4.2) |
| Forgetting eval cost | 99.9% gates + reproducibility runs un-budgeted; surprise overruns | Treat eval/repro as first-class line (FIN-07), reserve headroom (§5) |
| Quantize-and-pray | Throughput up, accuracy silently down, gates start failing | Re-pass deterministic gates after every quantization (§4.3) |

Bottom line. This program does not minimize cost-per-token; it minimizes cost-per-Green-PR while holding ≥99.9% gate correctness. Self-hosting gives us the cost curve and the sovereignty; the levers in §4, the capacity discipline in §5, and the FinOps loop in §7 keep that curve flat as the org scales L1→L5. Quality is the dominant cost lever — a higher first-pass yield is cheaper than any cheaper token.