08 - Token & GPU Economics (FinOps for Self-Hosted Agentic Dev)
Part of Agentic-Native SDLC for Regulated Medical Device Engineering. Status: Reference baseline · Date: May 2026 · Audience: CFO/Finance, Platform Engineering, Quality/Regulatory. Cross-refs: 01-requirements · 02-maturity-model · 03-reference-architecture · 04-model-strategy-and-finetuning · 05-evaluation-and-validation · 06-agentic-workflows · 07-security-and-compliance · 09-adoption-roadmap
This document is the financial control plane for the program. Every formula, rate, and ratio below is illustrative; the parameters are org-set and owned by the FinOps practice. Where a number appears, treat it as a placeholder to be replaced by measured values from your own fleet telemetry.
1. The Economic Reframing
Self-hosting fine-tuned open-weight models is not primarily a performance decision for this program — it is an economic and sovereignty decision, and it changes the shape of the cost, not just the magnitude.
A SaaS LLM API is pure, uncapped OpEx: a per-token meter that scales linearly with usage indefinitely, with no asset on the balance sheet and a marginal cost that never drops below the vendor rate. An org of 1000+ developers running agentic workflows generates enormous token volume, most of it on repair loops, retrieval, and evaluation rather than the final answer the human sees. At that volume, the per-token meter becomes the dominant line item and is structurally unbounded.
Self-hosting converts that into two different categories:
| Dimension | SaaS API (rejected) | Self-Hosted Fleet (chosen) |
|---|---|---|
| Cost class | OpEx, per-token, uncapped | GPU CapEx (amortized) + Ops OpEx |
| Marginal cost of one more token | Vendor rate (fixed, never zero) | ≈ marginal electricity once GPU is owned |
| Cost behavior at scale | Linear, unbounded | Step-function (buy capacity) + high utilization wins |
| Data path | Source/spec leaves the boundary | Stays inside the regulated boundary |
| Reproducibility | Vendor-controlled model drift | Pinned weights, P7 reproducible |
| Negotiating position | Vendor pricing power | Internal control of the curve |
Why the org chose this (decision is non-optional, see §6):
- Cost at scale. Above a break-even volume, owned-and-amortized GPU beats per-token billing, and our volume is far above it.
- Data sovereignty. Regulated source, specifications, defect data, and patient-adjacent context must not transit a third-party LLM (see 07-security-and-compliance).
- No per-token vendor billing / no model drift. Validation under IEC 62304 requires that a release gate run today reproduces tomorrow; a silently-updated vendor model breaks that (P7, 05-evaluation-and-validation).
The governing metric: COST-PER-GREEN-PR
Per Principle P6, cost is measured per verified task, not per token. The unit of economic value this program produces is a Green PR: a change that passes all deterministic gates plus human review (see 06-agentic-workflows). Tokens that do not contribute to a Green PR are waste, regardless of how cheap each one was.
Why optimizing cost-per-token alone is a trap. Cost-per-token is a component, not the objective. The classic failure: a team cuts token price 40% by quantizing aggressively or routing everything to a tiny model; first-pass yield collapses, agents loop and re-attempt, and the escape rate to human reviewers rises. Cost-per-Green-PR goes up even as cost-per-token goes down. Cheap-but-wrong is the most expensive mode in a regulated SDLC, because rework, re-validation, and reviewer time dwarf raw inference. The denominator (Green PRs delivered) is the lever; the numerator (spend) is the temptation.
2. Cost Taxonomy
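Program spend decomposes into seven numbered compute drivers plus three cross-cutting lines (retrieval/embedding, idle capacity, and ops/people). Each compute driver is metered separately (OpenTelemetry → cost, §7) so it can be attributed and optimized independently.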
| # | Cost driver | What moves it | Primary controls |
|---|---|---|---|
| 1 | Inference — model size | Params served, tier (S/M/L), MoE active params | Smallest-capable-model, tiered routing (§4), distillation |
| 2 | Inference — context length | Prompt tokens, retrieved context, history | Context economy (§4), prefix/prompt cache, retrieval over stuffing |
| 3 | Inference — output length | Generated tokens, reasoning-effort, loop count | Reasoning-effort caps, loop-count limits, structured output |
| 4 | Inference — batch efficiency | Continuous batching, concurrency, queue depth | vLLM continuous batching, PagedAttention, request shaping |
| 5 | Inference — GPU type/utilization | GPU SKU $/hr, MIG slicing, idle fraction | Right-sizing, MIG partitioning, KEDA scale-to-zero |
| 6 | Training / fine-tuning | LoRA vs full FT, dataset size, epochs, runs | Multi-LoRA, spot/preemptible + Kueue, distillation runs |
| 7 | Evaluation / validation | Gate suites, reproducibility reruns, 99.9% sampling | Eval caching, deterministic seeds, eval-tier routing |
| — | Retrieval / embedding | Embedding calls, rerank, index refresh, vector store | Tier-E batching, retrieval cache, incremental indexing |
| — | Idle / overprovisioning | Reserved-but-unused GPU, warm pools, headroom | Scale-to-zero, autoscaling, batch/interactive split |
| — | Ops / people | Platform SRE, FinOps, MLOps, on-call, eval engineering | Automation, self-service, maturity (L1→L5) |
Do not forget evaluation cost. The ≥99.9% release-gate correctness target (P1) is enforced by running a great deal of inference — large eval suites, adversarial probes, statistical sampling, and reproducibility reruns that re-execute gates on pinned weights for the regulatory record. For mature agentic repos, eval + reproducibility compute is frequently 25–45% of total inference spend and must be a first-class budget line, not an afterthought (see 05-evaluation-and-validation).
3. An Illustrative Cost Model
ILLUSTRATIVE: all rates are placeholders. Parameters ($/GPU-hr, throughput, yields) are org-set and replaced by measured fleet telemetry. The structure is the deliverable, not the digits.
3.1 GPU-hour → cost per 1k tokens, by tier
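The per-1k-token cost of a served model is the hourly rental cost of its GPU footprint divided by the tokens that footprint produces per hour, scaled to 1,000 tokens: $/1k tok = (GPU_count × $/GPU-hr) ÷ (Effective_throughput × 3600) × 1000, with Effective_throughput in tok/s for the whole footprint.
Effective_throughput already bakes in quantization, continuous batching, and speculative decoding (§4). Illustrative steady-state rates: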
| Tier | Model class | GPU footprint (illus.) | Eff. throughput (tok/s, batched) | $/GPU-hr (illus.) | $ / 1k tok (illus.) |
|---|---|---|---|---|---|
| Tier-S "Reflex" | 1–8B, FP8/INT8 | MIG slice / 1× GPU | 6,000 | $2.50 | $0.00012 |
| Tier-M "Worker" | 14–34B, AWQ/GPTQ | 1–2× GPU | 2,200 | $2.50 | $0.00063 |
| Tier-L "Reasoner" | 70B+/MoE | 4–8× GPU | 700 | $2.50 | $0.0079 |
| Tier-V Multimodal | VLM | 1–2× GPU | 1,000 | $2.50 | $0.0014 |
| Tier-E Embed/Rerank | embedding | MIG slice | 40,000 (items) | $2.50 | $0.0000175 |
The ~65× spread between Tier-S and Tier-L is the entire economic argument for tiered routing: a call needlessly sent to Tier-L costs as much as ~65 correct Tier-S calls of the same token count.
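The per-tier rates can be reproduced with a few lines of arithmetic. The sketch below assumes the cost is the hourly price of the whole footprint divided by the tokens it emits per hour; the function name `cost_per_1k_tokens` and the GPU counts (1, 2, and 8, the upper end of each footprint range) are assumptions chosen because they reproduce the table's Tier-S, Tier-M, and Tier-L figures.

```python
def cost_per_1k_tokens(gpu_count: int, usd_per_gpu_hr: float, tokens_per_sec: float) -> float:
    """Hourly cost of the serving footprint divided by tokens per hour, scaled to 1k tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_count * usd_per_gpu_hr / tokens_per_hour * 1000

# GPU counts are assumptions (upper end of each illustrative footprint).
tier_s = cost_per_1k_tokens(1, 2.50, 6_000)
tier_m = cost_per_1k_tokens(2, 2.50, 2_200)
tier_l = cost_per_1k_tokens(8, 2.50, 700)

print(f"Tier-S ${tier_s:.5f}  Tier-M ${tier_m:.5f}  Tier-L ${tier_l:.4f} per 1k tok")
print(f"Tier-L / Tier-S spread: {tier_l / tier_s:.1f}x")
```

Computed from unrounded rates the spread comes out slightly higher than the ~65× quoted from the rounded table values; either way the order of magnitude is what drives routing.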
3.2 Cost per agent task
A single agent task is rarely one model call. It is a sequence of calls across tiers, plus the verifier/sandbox compute that makes the work verifiable, plus its share of evaluation, all divided by first-pass yield (FPY) to account for repair loops: Cost_per_task = (Σ over tiers of tokens × $/token + Verifier_sandbox_compute + Eval_amortized) ÷ FPY.
Verifier_sandbox_compute = build/test/static-analysis/sandbox-exec cost to check the candidate (the harness is the product, P5). Eval_amortized = the task's share of gate + reproducibility runs. FPY ties quality to cost: it divides the whole numerator, so every failed attempt re-spends it and the effective multiplier is 1 ÷ FPY.
The repair-loop explosion, holding raw token cost constant, as FPY falls:
| First-Pass Yield | Effective cost multiplier (1 ÷ FPY) | Interpretation |
|---|---|---|
| 0.90 | 1.11× | Healthy; small rework tax |
| 0.70 | 1.43× | Noticeable loop spend |
| 0.50 | 2.00× | Half of all work is redone |
| 0.30 | 3.33× | Loop-dominated; cheap model is a false economy |
| 0.15 | 6.67× | Pathological; escape-rate to humans spikes |
This table is the quantified form of the §1 trap: driving down per-token cost while letting FPY fall is a net loss.
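The 1 ÷ FPY column follows from a simple retry model. The sketch below assumes every attempt costs the same and passes independently with probability FPY, so the number of attempts is geometrically distributed with mean 1 ÷ FPY; real repair loops may get better or worse with each retry, and `expected_attempts` is a hypothetical helper name.

```python
def expected_attempts(fpy: float) -> float:
    """Mean attempts until the first pass when each attempt succeeds with probability fpy."""
    if not 0 < fpy <= 1:
        raise ValueError("FPY must be in (0, 1]")
    return 1.0 / fpy

for fpy in (0.90, 0.70, 0.50, 0.30, 0.15):
    print(f"FPY {fpy:.2f} -> cost multiplier {expected_attempts(fpy):.2f}x")
```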
3.3 Worked numeric example (labeled placeholders)
ILLUSTRATIVE. "Implement a bounded requirement-to-code change with passing unit tests."
| Input (placeholder) | Symbol | Value |
|---|---|---|
| Tier-S router/classify + lint tokens | t_S | 8,000 tok @ $0.00012/1k |
| Tier-M implementation tokens | t_M | 40,000 tok @ $0.00063/1k |
| Tier-L escalation (10% of tasks need it) | t_L | 6,000 tok @ $0.0079/1k × 0.10 |
| Embedding/retrieval | t_E | 20,000 items @ $0.0000175/1k |
| Verifier/sandbox compute (build+test+sast) | C_v | $0.018 |
| Eval/repro amortized share | C_e | $0.012 |
| First-pass yield | FPY | 0.70 |
Token + retrieval cost:
Tier-S : 8,000/1000 × $0.00012 = $0.00096
Tier-M : 40,000/1000 × $0.00063 = $0.02520
Tier-L : 6,000/1000 × $0.0079 × 0.10 = $0.00474
Tier-E : 20,000/1000 × $0.0000175 = $0.00035
Subtotal tokens = $0.03125
Numerator = $0.03125 + C_v($0.018) + C_e($0.012) = $0.06125
Cost_per_task = $0.06125 ÷ FPY(0.70) = $0.0875 ✅

Sensitivity: same task, FPY collapses to 0.30 (e.g., over-aggressive quantization or routing too small): $0.06125 ÷ 0.30 = $0.2042, a 2.3× cost increase with zero change to per-token rates. If that low FPY also raises human escape-rate, the true cost-per-Green-PR rises further still (reviewer minutes are the most expensive tokens in the system).
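The same arithmetic as a small script, which makes it easy to swap in measured values. This is a sketch using only the placeholder inputs from the table; the names `CALLS`, `VERIFIER_COST`, `EVAL_COST`, and `cost_per_task` are illustrative.

```python
# Placeholder inputs from the table: (tokens or items, $ per 1k, share of tasks that use the tier)
CALLS = {
    "tier_s": (8_000, 0.00012, 1.0),
    "tier_m": (40_000, 0.00063, 1.0),
    "tier_l": (6_000, 0.0079, 0.10),   # only 10% of tasks escalate
    "tier_e": (20_000, 0.0000175, 1.0),
}
VERIFIER_COST = 0.018  # C_v
EVAL_COST = 0.012      # C_e

def cost_per_task(fpy: float) -> float:
    """(token cost + verifier + amortized eval) divided by first-pass yield."""
    token_cost = sum(units / 1000 * rate * share for units, rate, share in CALLS.values())
    return (token_cost + VERIFIER_COST + EVAL_COST) / fpy

print(f"FPY 0.70: ${cost_per_task(0.70):.4f}")
print(f"FPY 0.30: ${cost_per_task(0.30):.4f}")
```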
4. The Optimization Levers
Each lever lists mechanism → expected impact (illustrative) → tradeoff. They compound; they also interact (over-using one can sink FPY and undo another), so they are tuned against cost-per-Green-PR, never in isolation.
4.1 Tiered model routing
- Mechanism. A lightweight classifier/router running on Tier-S scores incoming task complexity and dispatches to the smallest capable tier; escalate to Tier-M/Tier-L only on confidence/complexity thresholds or verifier failure. Smallest-capable-model principle.
- Impact. If the majority of low-complexity calls resolve on Tier-S (~65× cheaper than Tier-L), blended $/token can drop 40–70% vs. always-on Tier-L.
- Tradeoff. Router error is double-edged: under-routing tanks FPY (loops), while over-routing spends Tier-L rates on work a smaller tier could have done. The router itself must be validated and is an eval surface. Mis-tuned thresholds look cheap per-token while raising cost-per-Green-PR.
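A minimal sketch of threshold routing with escalation, plus the blended-rate arithmetic behind the impact estimate. The complexity score, the thresholds `s_max` and `m_max`, and the example tier mix are all hypothetical; the rates are the illustrative §3.1 values.

```python
from typing import Dict, Optional

RATE_PER_1K = {"S": 0.00012, "M": 0.00063, "L": 0.0079}  # illustrative rates from §3.1
TIER_ORDER = ["S", "M", "L"]

def route(complexity: float, s_max: float = 0.3, m_max: float = 0.7) -> str:
    """Pick the smallest tier whose (hypothetical) complexity ceiling covers the task."""
    if complexity <= s_max:
        return "S"
    if complexity <= m_max:
        return "M"
    return "L"

def escalate(tier: str) -> Optional[str]:
    """Next tier up after a verifier failure; None means there is nothing larger to try."""
    idx = TIER_ORDER.index(tier)
    return TIER_ORDER[idx + 1] if idx + 1 < len(TIER_ORDER) else None

def blended_rate(mix: Dict[str, float]) -> float:
    """Token-weighted $/1k for a tier mix whose shares sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1")
    return sum(share * RATE_PER_1K[tier] for tier, share in mix.items())

mix = {"S": 0.5, "M": 0.2, "L": 0.3}  # hypothetical token-share mix
saving = 1 - blended_rate(mix) / RATE_PER_1K["L"]
print(f"blended ${blended_rate(mix):.6f}/1k, {saving:.0%} below always-on Tier-L")
```

With this made-up mix the blended rate lands roughly two-thirds below always-on Tier-L, inside the 40–70% range quoted above; the number says nothing about FPY, which still has to be measured.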
4.2 Caching (KV/prefix, prompt, semantic, retrieval)
- Mechanism. PagedAttention/KV-cache reuse + prefix/prompt caching skip recompute of shared system prompts, specs, and skill preambles; semantic caching returns prior answers for near-duplicate requests; retrieval caching avoids re-embedding/re-fetching stable context.
- Impact. Prompt/prefix cache can cut prefill compute 30–80% on repetitive agentic prompts (large shared spec/skill prefixes); retrieval cache cuts Tier-E load materially.
- Tradeoff. Semantic cache must be conservative in regulated paths — a stale or near-miss hit that flips a gate decision is a correctness defect. Cache keys must include weight/version/spec hashes for reproducibility.
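One way to make a response cache reproducibility-safe is to fold every pinned artifact into the key, so a change to weights, adapter, or spec can never return an answer produced under the old ones. The sketch below is an exact-match cache key, not a semantic cache; the field names and the `cache_key` helper are assumptions.

```python
import hashlib
import json

def cache_key(prompt: str, weights_sha: str, adapter_version: str, spec_sha: str) -> str:
    """Derive a key that changes whenever the prompt or any pinned artifact changes."""
    payload = json.dumps(
        {"prompt": prompt, "weights": weights_sha, "adapter": adapter_version, "spec": spec_sha},
        sort_keys=True,  # stable field order so identical inputs always hash identically
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key_old = cache_key("review module X", "w-abc", "lora-1", "spec-111")
key_new = cache_key("review module X", "w-def", "lora-1", "spec-111")  # weights changed
print(key_old != key_new)
```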
4.3 Quantization + speculative decoding + continuous batching
- Mechanism. FP8/INT8/AWQ/GPTQ shrink memory/raise throughput; speculative decoding uses a small draft model to propose tokens a larger model verifies; continuous batching (vLLM) keeps GPUs saturated across concurrent requests.
- Impact. Quantization commonly yields 1.5–3× throughput/$; speculative decoding 1.5–2.5× latency/throughput on accept-heavy workloads; continuous batching lifts utilization from ~30% to 70–90%.
- Tradeoff. Quantization can degrade accuracy — every quantized model must re-pass deterministic gates (05) before serving. Quantize, then measure FPY, never assume.
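"Quantize, then measure FPY" can be turned into a concrete acceptance threshold. The sketch below reuses the §3.3 placeholders (token cost $0.03125, verifier plus eval $0.030, FPY 0.70) and assumes, purely for illustration, that quantization doubles throughput and therefore halves token cost; the helper names are hypothetical.

```python
def cost_per_task(token_cost: float, fixed_cost: float, fpy: float) -> float:
    """Per-task cost: (token cost + verifier/eval cost) divided by first-pass yield."""
    return (token_cost + fixed_cost) / fpy

def min_fpy_to_break_even(baseline_cost: float, new_token_cost: float, fixed_cost: float) -> float:
    """Lowest FPY at which the cheaper-token variant still matches the baseline cost per task."""
    return (new_token_cost + fixed_cost) / baseline_cost

FIXED = 0.018 + 0.012                      # C_v + C_e from §3.3
BASELINE = cost_per_task(0.03125, FIXED, 0.70)
QUANT_TOKEN_COST = 0.03125 / 2             # assumed 2x throughput gain

floor = min_fpy_to_break_even(BASELINE, QUANT_TOKEN_COST, FIXED)
print(f"baseline ${BASELINE:.4f}; quantized variant needs FPY >= {floor:.2f}")
print(f"at FPY 0.60: ${cost_per_task(QUANT_TOKEN_COST, FIXED, 0.60):.4f}")
print(f"at FPY 0.30: ${cost_per_task(QUANT_TOKEN_COST, FIXED, 0.30):.4f}")
```

Because verifier and eval cost do not shrink with quantization, halving token cost buys much less than a halving of task cost, so the tolerable FPY drop is smaller than intuition suggests.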
4.4 Multi-LoRA adapter amortization
- Mechanism. Serve one base model with many LoRA adapters (per-domain/per-task behaviors) hot-swapped per request, instead of standing up a full fine-tuned model per behavior.
- Impact. Collapses N dedicated deployments into ~1 base footprint — large reduction in idle/overprovisioning and CapEx; new specialized behaviors become near-zero marginal serving cost.
- Tradeoff. Adapter routing/versioning complexity; per-adapter eval still required; a bad base upgrade invalidates all adapters at once (manage via 04).
4.5 Context economy
- Mechanism. Lean specs; dynamic context / skills loaded on demand; retrieval over context-stuffing; prune history; structured rather than verbose I/O. Avoid "context dumping" entire repos/specs into every prompt.
- Impact. Context length drives prefill cost super-linearly via attention; trimming 50% of tokens often cuts prefill cost >50% and improves FPY (less distraction).
- Tradeoff. Under-supplying context tanks FPY too — economy means right context, not minimal context. Tune against yield.
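The "more than 50%" claim follows from any cost curve with a quadratic term. The toy model below assumes prefill cost is a linear part (projections and MLP) plus a quadratic part (attention over the prompt); both coefficients are made up, and at short contexts the linear term dominates so the effect is much weaker.

```python
def prefill_cost(n_tokens: int, linear_coef: float = 1.0, attn_coef: float = 1e-4) -> float:
    """Toy prefill cost in arbitrary units: linear term plus quadratic attention term."""
    return linear_coef * n_tokens + attn_coef * n_tokens ** 2

full = prefill_cost(32_000)
trimmed = prefill_cost(16_000)  # same prompt with half the context removed
print(f"trimmed/full cost ratio: {trimmed / full:.2f}")
```

Whenever the quadratic coefficient is positive, halving the token count leaves less than half the cost; how much less depends on measured coefficients for the actual model and kernel stack.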
4.6 Autoscaling, partitioning, queueing
- Mechanism. KEDA scale-to-zero for spiky/interactive services; MIG partitioning to pack small models onto GPU slices; spot/preemptible for training/eval batch; Kueue for queue + quota fairness.
- Impact. Scale-to-zero eliminates overnight idle on bursty endpoints; MIG raises packing density; spot cuts training $ 60–90%.
- Tradeoff. Cold-start latency on scale-from-zero (mitigate with warm minimums for interactive tiers); spot preemption requires checkpointing. Never put latency-critical interactive gates on pure scale-to-zero without a warm floor.
4.7 Distillation (big → small)
- Mechanism. Distill Tier-L behavior into Tier-S/M adapters; the expensive Reasoner generates training signal once, the cheap model serves it forever.
- Impact. Shifts steady-state load down a tier — recurring 40–65% serving-cost reduction on distilled task families; reduces Tier-L invocation frequency.
- Tradeoff. Up-front distillation + eval CapEx; distilled model can lag base capability on edge cases — gated re-validation required before it replaces escalation paths.
4.8 In-loop budget guardrails
- Mechanism. Hard token/compute caps per task, reasoning-effort caps, and loop-count limits enforced by the orchestrator (06); on breach, fail-closed to human triage rather than burning unbounded compute.
- Impact. Bounds the worst-case tail — caps the cost of pathological low-FPY tasks that would otherwise loop indefinitely; protects the monthly budget from a single runaway agent.
- Tradeoff. Caps set too tight truncate legitimately hard tasks (raising escape-rate); caps are themselves tuned against cost-per-Green-PR.
flowchart TD
A[Incoming agent task] --> B{Cache hit?<br/>prompt / semantic / retrieval}
B -- yes --> Z[Return cached / cheap path]
B -- no --> C[Tier-S router/classifier]
C -->|low complexity| D[Tier-S Reflex<br/>+ multi-LoRA adapter]
C -->|medium| E[Tier-M Worker]
C -->|high / escalated| F[Tier-L Reasoner<br/>sparingly]
D --> V{Verifier / deterministic gates}
E --> V
F --> V
V -- pass --> G[GREEN PR candidate]
V -- fail --> H{Budget guardrail:<br/>tokens / loops / effort left?}
H -- yes --> C
H -- no --> I[Fail-closed → human triage]
G --> M[OpenTelemetry cost metering →<br/>cost-per-Green-PR]
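The verify, check-budget, retry, and fail-closed path in the flowchart can be sketched as a small loop. This is an illustration under assumptions, not the orchestrator's actual interface: `TaskBudget`, `run_task`, and the `attempt`/`verify` callables are hypothetical names, and routing and caching are left out.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class TaskBudget:
    max_tokens: int
    max_loops: int
    tokens_used: int = 0
    loops_used: int = 0

    def charge(self, tokens: int) -> None:
        """Record one attempt and the tokens it consumed."""
        self.tokens_used += tokens
        self.loops_used += 1

    def exhausted(self) -> bool:
        return self.tokens_used >= self.max_tokens or self.loops_used >= self.max_loops

def run_task(attempt: Callable[[], Tuple[str, int]],
             verify: Callable[[str], bool],
             budget: TaskBudget) -> str:
    """Retry until the verifier passes or the budget is spent, then fail closed."""
    while not budget.exhausted():
        candidate, tokens = attempt()   # one model attempt: (candidate change, tokens spent)
        budget.charge(tokens)
        if verify(candidate):           # deterministic gates
            return "green_pr_candidate"
    return "human_triage"               # fail-closed: never loop without a budget

budget = TaskBudget(max_tokens=10_000, max_loops=3)
outcome = run_task(lambda: ("patch", 1_000), lambda c: False, budget)
print(outcome, budget.loops_used)
```

The budget is checked before each attempt, so a single attempt can overshoot the token cap; a stricter variant would also pass the remaining allowance into the attempt itself.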
5. Capacity Planning for 1000+ Developers
Sizing the fleet is a queueing problem, not a headcount multiplication. The goal is enough capacity to hold interactive latency SLOs at peak while keeping steady-state utilization high.
Estimating concurrent load (illustrative).
Active_devs = 1000 × engagement_factor(0.6) = 600
Req_per_active_dev_hr = 30 (agentic calls incl. loops/retrieval/eval)
Average_RPS = 600 × 30 / 3600 ≈ 5 RPS sustained
Peak_RPS = Average_RPS × peakiness(3.0) ≈ 15 RPS
GPU_needed_at_peak = Peak_RPS ÷ per-GPU_throughput_at_SLO (per tier)

| Planning dimension | Approach |
|---|---|
| Peak vs. average | Size interactive tiers for peak RPS at the latency SLO; size batch (eval/training) for average throughput with queueing. Peakiness factor measured per region/timezone. |
| GPU fleet sizing | Per-tier: ceil(Peak_RPS ÷ throughput_at_SLO) + headroom. Bottom-heavy fleet (mostly Tier-S/M, few Tier-L) mirrors the routing distribution. |
| Batch vs. interactive separation | Dedicated pools. Interactive = warm, latency-bounded, KEDA with warm floor. Batch = Kueue-queued, spot-backed, scale-to-zero, latency-tolerant. Never let a training job preempt an interactive gate. |
| Multi-tenancy fairness | Kueue quotas per team/repo so no tenant starves others; borrowing from idle quotas allowed, reclaimable on demand. |
| Headroom for eval/training | Reserve explicit capacity (illus. 15–25%) for gate suites, reproducibility reruns, and fine-tuning — these are non-optional regulatory load, not discretionary. |
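The load estimate and per-tier sizing rule can be sketched as two small functions. The inputs are the illustrative values above; treating headroom as a fraction of the base GPU count, and the per-GPU request rate in the example, are assumptions made for the sketch.

```python
import math

def peak_rps(devs: int, engagement: float, req_per_dev_hr: float, peakiness: float) -> float:
    """Sustained requests per second across active developers, scaled by the peak factor."""
    average_rps = devs * engagement * req_per_dev_hr / 3600
    return average_rps * peakiness

def gpus_for_tier(tier_peak_rps: float, rps_per_gpu_at_slo: float, headroom: float = 0.25) -> int:
    """ceil(peak / per-GPU throughput at SLO), plus headroom as a fraction of that base."""
    base = math.ceil(tier_peak_rps / rps_per_gpu_at_slo)
    return base + math.ceil(base * headroom)

fleet_peak = peak_rps(devs=1000, engagement=0.6, req_per_dev_hr=30, peakiness=3.0)
print(f"peak load: {fleet_peak:.1f} RPS")
# Hypothetical tier: 8 RPS of the peak lands here and one GPU holds 2 RPS at the latency SLO.
print(f"GPUs for that tier: {gpus_for_tier(8.0, 2.0)}")
```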
6. Build-vs-Buy / Self-Host Math
ILLUSTRATIVE structured comparison. Numbers are placeholders to frame the reasoning, not a quote.
| Factor | Hypothetical SaaS API at this scale | Self-Hosted Fleet |
|---|---|---|
| Annual token volume (illus.) | ~30B billable tok/yr (incl. loops, eval, retrieval) | same workload, owned compute |
| Unit basis | blended vendor $/1k tok | amortized $/GPU-hr + ops |
| Annual run cost (illus.) | 30M × $X_blended_per_1k → large, uncapped, linear | GPU_CapEx ÷ amort_yrs + Ops_OpEx + power |
| Marginal next-token cost | vendor rate (never zero) | ≈ marginal power on owned GPU |
| Cost trajectory at growth | scales with usage forever | flattens as utilization rises |
Break-even reasoning. Self-host carries up-front CapEx (GPUs, networking) plus steady annual costs: Ops OpEx (SRE, FinOps, MLOps, eval engineering) and power. An API carries zero fixed cost but a per-token meter. There is a crossover volume above which amortized self-host is cheaper:
Self-host is cheaper when: (CapEx ÷ amort_years) + Ops_OpEx_annual + Power_annual
< Annual_token_volume × Blended_API_rate_per_token
→ Self-host wins decisively once volume × API_rate exceeds fixed+ops cost.
At 1000+ devs with loop/eval/retrieval amplification, our volume is FAR above crossover.

Non-cost drivers that make self-host non-optional here (these hold even if the math were neutral):
- Sovereignty. Regulated source, specs, and defect data must not leave the boundary (07).
- IP control. Proprietary device engineering knowledge stays in-house; no third-party training on our data.
- Regulatory control. Pinned, reproducible weights for IEC 62304 validation; no vendor-driven model drift mid-gate (P7, 05).
The economics make self-host attractive; the regulatory and sovereignty constraints make it mandatory.
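Since the vendor rate is left as a placeholder, one way to use the break-even inequality is to solve it for the blended API rate at which the two options cost the same. In the sketch below the CapEx, amortization period, ops, and power figures are made-up inputs; only the ~30B tokens/yr volume comes from the table above.

```python
def self_host_annual(capex: float, amort_years: float, ops_opex: float, power: float) -> float:
    """Annualized self-host cost: amortized CapEx plus ops plus power."""
    return capex / amort_years + ops_opex + power

def breakeven_api_rate_per_1k(annual_cost: float, annual_tokens: float) -> float:
    """Blended API $/1k tokens at which API spend equals the self-host annual cost."""
    return annual_cost / (annual_tokens / 1000)

# Hypothetical inputs: $3.0M CapEx over 3 years, $0.8M ops, $0.2M power.
annual = self_host_annual(capex=3_000_000, amort_years=3, ops_opex=800_000, power=200_000)
rate = breakeven_api_rate_per_1k(annual, annual_tokens=30e9)
print(f"self-host ${annual:,.0f}/yr; API must be below ${rate:.4f}/1k tok to compete")
```

Any quoted vendor rate above the computed threshold puts the workload on the self-host side of the crossover; the regulatory drivers listed above apply regardless of where it falls.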
7. FinOps Operating Model
Ties to D7 in 02-maturity-model. FinOps is a standing practice, not a quarterly cleanup.
| Capability | Implementation |
|---|---|
| Cost attribution | Every inference/eval/training call tagged with team, repo, agent, tier, adapter, task-id, cache-status via OpenTelemetry spans → cost pipeline. Attributable to cost-per-Green-PR per repo. |
| Dashboards | OTel → cost warehouse → dashboards: $/Green-PR, FPY, tier mix, cache hit-rate, GPU utilization, eval/repro share, idle %. |
| Budgets & alerts | Per-team monthly budgets; alerts at 70/90/100%; automatic in-loop guardrails (§4.8) enforce hard caps independent of dashboards. |
| Showback / chargeback | Showback by default (visibility, behavior change); chargeback for high-volume teams to internalize cost. |
| Cost SLOs | Explicit objectives, e.g. $/Green-PR ≤ target, eval share ≤ 40%, GPU util ≥ 70%, cache hit ≥ 50%, idle ≤ 10%. Breach triggers review. |
| Quarterly optimization review | Re-tune routing thresholds, quantization choices, cache policy, fleet mix, distillation candidates against measured FPY and $/Green-PR. Feeds the L5 closed loop (§8). |
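A sketch of how tagged cost records roll up into cost-per-Green-PR per repo, and how the 70/90/100% alerts could be evaluated. The record shape, repo names, and costs are hypothetical; a real pipeline would read the tags from OpenTelemetry span attributes rather than a literal list.

```python
from collections import defaultdict

# Hypothetical cost records, one per metered call.
SPANS = [
    {"repo": "repo-a", "task_id": "t1", "tier": "M", "cost_usd": 0.25},
    {"repo": "repo-a", "task_id": "t1", "tier": "S", "cost_usd": 0.25},
    {"repo": "repo-a", "task_id": "t2", "tier": "L", "cost_usd": 0.50},  # task never went green
    {"repo": "repo-b", "task_id": "t3", "tier": "M", "cost_usd": 0.50},
]
GREEN_TASKS = {"t1", "t3"}  # tasks that passed all gates plus human review

def cost_per_green_pr(spans, green_tasks):
    """All spend per repo (failed tasks included) divided by that repo's Green PR count."""
    spend = defaultdict(float)
    green = defaultdict(set)
    for span in spans:
        spend[span["repo"]] += span["cost_usd"]
        if span["task_id"] in green_tasks:
            green[span["repo"]].add(span["task_id"])
    return {repo: (spend[repo] / len(green[repo]) if green[repo] else None) for repo in spend}

def budget_alert(spend_usd, budget_usd, thresholds_pct=(70, 90, 100)):
    """Highest alert threshold (in percent) the spend has reached, or None."""
    reached = [t for t in thresholds_pct if spend_usd * 100 >= t * budget_usd]
    return max(reached) if reached else None

print(cost_per_green_pr(SPANS, GREEN_TASKS))
print(budget_alert(950, 1000))
```

Dividing total spend, not just the spend of successful tasks, by the Green PR count is what makes failed attempts show up in the metric.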
FinOps register (illustrative)
| ID | Metric / control | Illustrative target | Owner | Cadence |
|---|---|---|---|---|
| FIN-01 | Cost-per-Green-PR | ≤ $T per repo class | FinOps + Repo lead | Weekly |
| FIN-02 | First-Pass Yield | ≥ 0.70 | Platform + Eval | Weekly |
| FIN-03 | GPU utilization | ≥ 70% | Platform SRE | Daily |
| FIN-04 | Idle / overprovision | ≤ 10% | Platform SRE | Daily |
| FIN-05 | Cache hit-rate (prompt+semantic) | ≥ 50% | Platform | Weekly |
| FIN-06 | Tier-L invocation share | ≤ 10% of calls | Routing owner | Weekly |
| FIN-07 | Eval + reproducibility share | ≤ 40% of inference $ | Quality | Monthly |
| FIN-08 | Spot usage on training | ≥ 80% of training GPU-hr | MLOps | Monthly |
| FIN-09 | Budget breach incidents | 0 unbounded-loop runaways | FinOps | Monthly |
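The ratio-based register entries lend themselves to an automated breach check. The sketch below encodes the illustrative targets for FIN-02 through FIN-08 as (comparison, threshold) pairs; the `TARGETS` structure, the `breaches` helper, and the measured values are assumptions.

```python
import operator

# (comparison the measured value must satisfy, illustrative target) per register ID
TARGETS = {
    "FIN-02": (operator.ge, 0.70),  # first-pass yield
    "FIN-03": (operator.ge, 0.70),  # GPU utilization
    "FIN-04": (operator.le, 0.10),  # idle / overprovision
    "FIN-05": (operator.ge, 0.50),  # cache hit-rate
    "FIN-06": (operator.le, 0.10),  # Tier-L invocation share
    "FIN-07": (operator.le, 0.40),  # eval + reproducibility share
    "FIN-08": (operator.ge, 0.80),  # spot usage on training
}

def breaches(measured: dict) -> list:
    """Register IDs whose measured value misses its target, sorted for stable reporting."""
    return sorted(
        metric_id
        for metric_id, value in measured.items()
        if metric_id in TARGETS and not TARGETS[metric_id][0](value, TARGETS[metric_id][1])
    )

# Hypothetical weekly snapshot
print(breaches({"FIN-02": 0.65, "FIN-03": 0.82, "FIN-04": 0.12, "FIN-06": 0.08}))
```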
8. Cost Across the Maturity Levels
Unit economics improve as the org climbs ASMM-Med — not because tokens get cheaper, but because the denominator (Green PRs) grows and waste shrinks.
| Level | Cost profile | Unit economics & control | $/Green-PR trend |
|---|---|---|---|
| L0 Ad-hoc | Untracked, sporadic; no fleet metering | No attribution; cost-per-token invisible | Unknown / uncontrolled |
| L1 Governed Assistance | Mostly interactive assist; basic metering begins | Cost-per-token visible; FPY undefined; little caching | High, noisy |
| L2 Spec-Driven Bounded Automation | Bounded tasks; routing + caching introduced | $/Green-PR first measured; guardrails appear | Declining, variable |
| L3 Orchestrated Agentic Workflows | Multi-step agents; eval load rises sharply | Full tier routing, multi-LoRA, budget caps; eval share managed | Stabilizing |
| L4 Validated Autonomous Agents | High-volume autonomous, heavy validation/repro | Strong FPY; distillation steady-state; tight cost SLOs | Low, predictable |
| L5 Self-Optimizing Agentic Enterprise | Closed-loop cost optimization | System auto-tunes routing/quantization/cache/fleet against $/Green-PR; data-driven distillation pipeline | Minimized, self-correcting |
L5 is the target end state: routing thresholds, quantization choices, cache policy, and distillation candidates are selected by the system from telemetry, continuously, against cost-per-Green-PR — with every change still passing deterministic gates (05).
9. Anti-Patterns
| Anti-pattern | Why it costs | Counter |
|---|---|---|
| Always-on big models | Tier-L (~65× Tier-S) serving routine calls; idle Reasoner GPUs | Tiered routing + smallest-capable-model + scale-to-zero (§4.1, §4.6) |
| Context dumping | Whole repos/specs into every prompt; super-linear prefill cost; lowers FPY | Context economy, retrieval over stuffing, prompt cache (§4.5, §4.2) |
| Unbounded agentic loops | Pathological tasks loop forever, blow the budget on one runaway | Hard token/loop/effort caps, fail-closed to human (§4.8) |
| Optimizing tokens, ignoring escape-rate/rework | Cheap-but-wrong: per-token down, FPY down, $/Green-PR up — the §1/§3 trap | Govern by cost-per-Green-PR; measure FPY before/after every change |
| Idle GPU sprawl | Reserved-but-unused GPUs, warm pools nobody uses, no batch/interactive split | Scale-to-zero, MIG packing, Kueue quotas, idle SLO (FIN-04) |
| No caching | Recompute identical prefixes/embeddings every call | KV/prefix + prompt + semantic + retrieval cache (§4.2) |
| Forgetting eval cost | 99.9% gates + reproducibility runs un-budgeted; surprise overruns | Treat eval/repro as first-class line (FIN-07), reserve headroom (§5) |
| Quantize-and-pray | Throughput up, accuracy silently down, gates start failing | Re-pass deterministic gates after every quantization (§4.3) |
Bottom line. This program does not minimize cost-per-token; it minimizes cost-per-Green-PR while holding ≥99.9% gate correctness. Self-hosting gives us the cost curve and the sovereignty; the levers in §4, the capacity discipline in §5, and the FinOps loop in §7 keep that curve flat as the org scales L1→L5. Quality is the dominant cost lever — a higher first-pass yield is cheaper than any cheaper token.