You design and govern systems for a living. You know what an API, a container, and a database are — but terms like gradient descent and backpropagation get hand-waved in vendor decks. This book closes that gap. Part I gives you an accurate working model of how AI learns, in plain language and pictures; the rest shows how to turn that understanding into sound architecture decisions for on-premise, regulated, and edge deployments. No mathematics is required — only the diagrams.
How machines learn
A correct, jargon-free mental model of training, gradient descent, and backpropagation — the foundation for every decision that follows.
1 What a model actually is
Strip away the mystique and a neural network is a very large mathematical function with millions or billions of adjustable numbers, called weights (or parameters). You feed something in — text, an image — it flows through layers of simple operations, and a result comes out (Figure 1).
The weights are the knobs. Set them randomly and the output is gibberish. Set them well and the same function translates languages, drafts code, or extracts fields from a document. Training is the process of finding good knob settings. Inference is using the trained function — one pass through the layers with the knobs held fixed. Almost everything your users experience is inference; training happened earlier, usually on someone else’s cluster.
A “model” is a function plus a giant table of numbers. “Training” changes the numbers; “inference” reads them. Knowing which one you are doing — and whether you need to do it at all — is the central architectural question of this book.
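To make the function-plus-numbers picture concrete, here is a minimal Python sketch. The two-weight `model` function and both weight tuples are illustrative assumptions, not taken from any real network; the point is only that the function stays the same while the numbers decide whether the output is useful.

```python
def model(weights, x):
    """The fixed function: one multiply and one add."""
    scale, offset = weights
    return scale * x + offset

# Hypothetical weight tables for the same function.
untrained = (0.5, -7.0)   # arbitrary knob settings: the output means nothing
trained = (1.8, 32.0)     # good knob settings: Celsius to Fahrenheit

# Inference: run the function with the knobs held fixed.
print(model(untrained, 100.0))
print(model(trained, 100.0))
```

Calling `model` never modifies the tuple it is given, which is the sense in which inference is read-only. Training, by contrast, would be any procedure that produces a new tuple.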
Inference is cheap, repeatable, and stateless; it scales like any read-heavy service. Training is expensive, stateful, and risky. When a vendor says “we’ll fine-tune a model for you,” they are proposing to run the expensive, risky activity — possibly on your data, possibly repeatedly. Always ask which activity is being proposed, where it runs, and how it is reverted.
2 Gradient descent: learning as rolling downhill
How does training find good knob settings among astronomically many possibilities? It does not search them all. It uses feedback to take small, smart steps — an algorithm called gradient descent (Figure 2).
Imagine the model’s total error as a landscape of hills and valleys. Every possible combination of weights is a point on the ground; the height at that point is how wrong the model is. Good weight settings sit in the valleys (low error); bad ones sit on the peaks. Training drops a ball onto this landscape and rolls it downhill: at each point it measures the slope (the gradient, which points uphill) and takes one small step the opposite way, in the most steeply downhill direction. Repeat millions of times and the ball settles near a valley floor: a good set of weights.
Three terms you will hear, now demystified. The learning rate is the size of each step (too big and the ball overshoots the valley; too small and training crawls). Convergence means the ball has stopped meaningfully moving because it has found a valley. An epoch is one pass over the training data; models take many.
Gradient descent is just “measure which way is downhill, step a little, repeat.” The cleverness is in computing “downhill” efficiently for billions of knobs at once — which is exactly what backpropagation does.
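The downhill picture can be sketched in a few lines of Python. This is a toy, assuming a single weight and a hand-picked bowl-shaped error curve (`loss`, with its valley at 3.0); real training does the same thing across billions of weights at once. The three learning rates are arbitrary values chosen to show a sensible step, a crawl, and an overshoot.

```python
def loss(w):
    """Height of the landscape: how wrong the model is at weight w."""
    return (w - 3.0) ** 2

def gradient(w):
    """Slope of the loss at w. It points uphill, so we step against it."""
    return 2.0 * (w - 3.0)

def descend(w, learning_rate, steps):
    """Measure the slope, step a little the other way, repeat."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

good = descend(0.0, learning_rate=0.1, steps=50)    # settles near the valley
slow = descend(0.0, learning_rate=0.001, steps=50)  # barely leaves the start
wild = descend(0.0, learning_rate=1.1, steps=50)    # overshoots further each step

for name, w in [("good", good), ("slow", slow), ("wild", wild)]:
    print(name, "final loss:", loss(w))
```

With the sensible rate the loss ends close to zero; with the tiny rate it is still high after the same number of steps; with the oversized rate every step lands further up the opposite slope than the last.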
3 Backpropagation: assigning the blame
Gradient descent needs to know, for every single weight, “which way is downhill?” With billions of weights, computing that naively would be hopeless. Backpropagation is the efficient trick that makes it possible — and it is the engine behind essentially every modern AI model.
It works in two passes. First the forward pass: the input flows left-to-right through the layers and the model makes a prediction. Then the model compares that prediction to the known correct answer and computes an error. Now the backward pass: that error is pushed back through the network from right to left, and at each layer it is divided up — each weight receives a share of the blame proportional to how much it contributed to the mistake. Every weight learns the direction it should nudge to reduce the error. Take that nudge (that is the gradient-descent step), and repeat (Figure 3).
That is the whole idea: forward to predict, backward to assign blame, nudge, repeat. Run it across a large dataset and the weights organize themselves into something useful. Backpropagation is extraordinarily effective — it built the models everyone is talking about — but, as we will see, it has properties that make it an awkward fit for on-device adaptation.
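The forward, backward, nudge cycle can be written out by hand for a network small enough to read. The sketch below assumes a two-weight network (one `tanh` hidden unit feeding one output weight), a squared-error loss, and a learning rate of 0.1; all of these are illustrative choices rather than anything prescribed by the text.

```python
import math

def forward(w1, w2, x):
    """Forward pass: the input flows through two layers to a prediction."""
    h = math.tanh(w1 * x)   # hidden layer
    y = w2 * h              # output layer
    return h, y

def loss(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2

def backward(w1, w2, x, target):
    """Backward pass: push the error back and give each weight its share."""
    h, y = forward(w1, w2, x)
    error = y - target                          # slope of the loss at the output
    grad_w2 = error * h                         # blame for the output weight
    error_at_h = error * w2                     # error pushed back one layer
    grad_w1 = error_at_h * (1.0 - h * h) * x    # blame for the hidden weight
    return grad_w1, grad_w2

w1, w2, x, target = 0.5, 0.3, 1.0, 0.8
start_loss = loss(w1, w2, x, target)
for _ in range(500):
    g1, g2 = backward(w1, w2, x, target)
    w1 -= 0.1 * g1   # the gradient-descent nudge
    w2 -= 0.1 * g2
end_loss = loss(w1, w2, x, target)
print(start_loss, end_loss)
```

Note that `backward` computes both gradients from a single shared error signal, which is the efficiency the chapter describes: one backward sweep yields a direction for every weight.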
Backpropagation needs a global error signal threaded backward through the entire network, and it changes all the weights at once. On the edge — small data, no rollback, live systems — that global, all-at-once update is precisely what causes drift and forgetting. Hold this thought; Part II builds on it.
4 Pretraining, fine-tuning, and inference
Three words get used loosely. Separating them cleanly removes most of the confusion in enterprise AI conversations.
| Activity | What happens | Cost & risk |
|---|---|---|
| Pretraining | The foundation model is trained from scratch on enormous general data (the lab does this). | Millions of dollars; not something you do. |
| Fine-tuning | You continue training a pretrained model on your data, changing its weights. | Moderate cost; real risk — drift, forgetting, hard to reverse (Figure 4). |
| Inference | You run the (pre/fine-)trained model to get answers. Weights are read-only. | Cheap, safe, repeatable. |
The industry reflex is: “generic model underperforms → fine-tune it.” This book argues that on the edge there is a fourth option that is usually better: keep the weights frozen and change the model’s inputs and surroundings instead. That option — in-context learning plus external memory — gets most of the benefit of fine-tuning with almost none of the risk. Parts III–IV make it concrete.
Why “just fine-tune it” fails on the edge
Four hard constraints, the science that quietly supports freezing the base, and how to tell a capacity problem from an adaptation problem.
5 The four constraints of the edge
In a cloud lab, fine-tuning is routine. On an edge appliance — a box in a hospital, a factory, a retail back office — four constraints turn the same operation into a liability.
- Reversibility. A weight update is baked into billions of numbers. Proving you can undo it, instantly and completely, is hard — and regulators ask.
- Forgetting & drift. Training on a narrow corpus quietly degrades unrelated abilities (next chapter).
- Capacity & data. A small on-device model plus a small in-house dataset is the regime where fine-tuning pays least and risks most.
- Cadence. Retraining is an event you schedule. Domains change continuously. Adaptation needs to be a habit, not a quarterly project.
Map each constraint to a control you already understand: reversibility → versioning & rollback; drift → regression testing; capacity → right-sizing; cadence → CI/CD. The frozen-base design in Part III gives you all four as properties of the architecture rather than processes you must bolt on.
6 Catastrophic forgetting and drift
The most underappreciated risk of fine-tuning has a dramatic name: catastrophic forgetting (Figure 4).
Because backpropagation changes all the weights to fit the new task, it overwrites some of what made the model good at everything else. Teach it your billing terminology and it may get subtly worse at general reasoning, other languages, or last quarter’s task — with no error message. You only find out if you have a comprehensive test suite covering the old abilities, which most teams do not.
Drift is the slow-motion cousin: re-fine-tune every few weeks and the model wanders away from its tested behavior, one update at a time. In a regulated setting, “we are not sure exactly how today’s model differs from the one we validated” is an unacceptable sentence. Freezing the base eliminates this class of risk by construction.
7 What the brain-alignment research hints at
There is a striking line of computational-neuroscience research suggesting the foundation we freeze is doing more of the work than we assume — and that heavy retraining can even be counterproductive.
Researchers compared several ways of training small vision networks — including standard backpropagation, two “local” biologically-inspired rules, and an untrained network with random weights — and measured how closely each matched real brain activity. Three findings rhyme with everything in this book:
- The untrained network often matched or beat the backprop-trained one at early processing stages. The structure of the network carried most of the signal; aggressive training sometimes moved it the wrong way.
- Local, lightweight learning preserved structure that global backprop eroded. Gentle, targeted updates beat sweeping ones.
- Higher-level abstraction needed more capacity and data, not a cleverer update rule. When the task genuinely required more, only a bigger model helped — the learning rule was irrelevant.
These are small vision experiments scored against brain data, with deliberately hedged results — not proof about language models. We use them as intuition, not evidence. The engineering case stands on its own measurements. But the rhyme is useful: freeze the structure, adapt gently, and spend capacity only when the problem truly demands it.
8 Capacity vs. adaptation: which problem do you have?
Before choosing a tool, diagnose the failure. Almost every “the model is not good enough” complaint is one of two very different problems.
| Symptom | It is a… | Right tool |
|---|---|---|
| Model knows the domain but is sloppy, inconsistent, wrong format, ignores your rules | Adaptation problem | In-context learning, memory, the self-learning loop (Parts III–IV) — no weight changes. |
| Model fundamentally cannot represent the concept — reasoning is beyond its size, even with perfect context | Capacity problem | A larger model, or distillation into the small one (Part VI). Weight changes justified. |
The expensive mistake is treating an adaptation problem as a capacity problem — fine-tuning (or buying a bigger model) to fix what was really a prompt-and-grounding issue. The discipline in Part IV is precisely how you tell them apart with data instead of opinion.
The frozen-base doctrine
Keep the weights fixed and move every form of adaptation into external, reversible state: knowledge, an in-context skill, and lightweight controllers.
9 Freeze the base, adapt the surroundings
The doctrine in one line: the model’s weights never change; everything that makes it yours lives outside the weights, in state you can read, diff, and revert.
“Self-learning” does not have to mean “weight-updating.” We define it as swappable external state that measurably improves a held-out score over time. It lives in four layers — only the last touches weights, and that one is held in reserve (Figure 5):
| Layer | What it holds | Weights? |
|---|---|---|
| L0 · Knowledge | A living knowledge graph + search index of your corpus. | No |
| L1 · Skill | A plain-language instruction the model follows — the unit we optimize. | No |
| L2 · Behavior | Tiny per-request controllers that steer the frozen model on hard cases. | No |
| L3 · Weights (parked) | A detachable, never-merged adapter — last resort only. | Yes (revertible) |
Every change is an artifact — a graph delta, a versioned text file, a small controller blob — not an opaque shift in billions of parameters. Auditability and one-step rollback are built in, not bolted on.
10 In-context learning: teaching without retraining
The single most useful idea for enterprise teams: you can change a model’s behavior dramatically without touching its weights, simply by changing what you put in front of it. This is in-context learning (Figure 6).
The system prompt is not decoration — it is the procedure the model executes, and it is the thing we treat as the unit of learning. We call a well-developed one a skill: a precise, failure-aware specification of how to do the domain task. Because it is text, it is human-readable, version-controlled, reviewable, and portable — the skill that scored highest last night is exactly what production loads today.
Think of the skill the way you think of configuration or a stored procedure: a versioned artifact in your repository, promoted through environments, rolled back on regression. “Training the model” becomes “promoting a new skill version” — a change management process you already operate.
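One way to picture "promoting a new skill version" is a small version store with commit and rollback. The `SkillStore` class below is a hypothetical sketch, not the system's actual storage layer; in practice a Git repository or an artifact registry would play this role. The skill texts are invented examples.

```python
from dataclasses import dataclass, field
from hashlib import sha256

@dataclass
class SkillStore:
    """Ordered history of skill versions; the last entry is the live one."""
    versions: list = field(default_factory=list)

    def commit(self, text: str) -> str:
        """Promote a new skill version and return its fingerprint."""
        self.versions.append(text)
        return self.fingerprint()

    def current(self) -> str:
        return self.versions[-1]

    def rollback(self) -> str:
        """Drop the newest version and return the one that is now live."""
        if len(self.versions) < 2:
            raise ValueError("nothing to roll back to")
        self.versions.pop()
        return self.current()

    def fingerprint(self) -> str:
        """Short content hash of the live skill, handy for audit logs."""
        return sha256(self.current().encode("utf-8")).hexdigest()[:12]

store = SkillStore()
v1 = store.commit("Extract vendor and severity.")
v2 = store.commit("Extract vendor, cell and severity. Use ISO dates.")
store.rollback()                 # regression found: revert in one step
print(store.current(), store.fingerprint() == v1)
```

Because the artifact is text, the rollback is exact: the fingerprint after reverting matches the fingerprint recorded when that version was first committed.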
11 Memory and retrieval: grounding answers in your data
A frozen model knows the world broadly but knows nothing of your records. Rather than burn your data into its weights, we keep the data in an external memory and hand the model the relevant pieces at request time. This is retrieval-augmented generation (RAG).
A continuous knowledge layer turns your corpus into typed, queryable structure (a graph) plus a search index. When a question arrives, the system retrieves the relevant records and composes them into the prompt (Figure 7), so the answer is grounded in source records and can cite them. Crucially, the knowledge can change — new documents, corrected facts — without retraining anything. The same memory also supplies the score the self-learning loop optimizes against (Part IV), coupling “what the system knows” and “how well it performs” through one metric.
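The retrieve-then-compose step can be sketched as follows. This is a deliberately naive stand-in: plain word overlap replaces the knowledge graph and search index, and the records, ids, and instruction text are all made up for illustration. The function names `retrieve` and `compose_prompt` are hypothetical.

```python
def tokenize(text):
    return set(text.lower().split())

def retrieve(question, records, k=2):
    """Rank records by word overlap with the question (stand-in for a real index)."""
    q = tokenize(question)
    ranked = sorted(records,
                    key=lambda r: len(q & tokenize(r["text"])),
                    reverse=True)
    return ranked[:k]

def compose_prompt(skill, question, hits):
    """Put the retrieved records in front of the frozen model, with ids it can cite."""
    context = "\n".join(f"[{r['id']}] {r['text']}" for r in hits)
    return f"{skill}\n\nRecords:\n{context}\n\nQuestion: {question}"

# Invented records; a real deployment would read these from the memory layer.
records = [
    {"id": "INC-1", "text": "cell 4412 outage severity high vendor acme"},
    {"id": "INC-2", "text": "billing portal login slow"},
    {"id": "INC-3", "text": "cell 4412 restored after vendor patch"},
]
question = "what happened to cell 4412"
hits = retrieve(question, records)
prompt = compose_prompt("Answer only from the records and cite ids.", question, hits)
print(prompt)
```

Adding or correcting a record changes what `retrieve` returns on the next request, with no change to the model, which is the property the chapter relies on.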
12 Runtime controllers: lightweight local steering
The lightest form of “learning” in the stack is a set of tiny runtime controllers — kilobyte-sized adjustments, fitted from already-verified examples, that nudge the frozen model on hard or ambiguous cases.
They are the on-device echo of the research finding from Chapter 7 — local, structure-preserving adaptation beats global retraining. A controller (the L2 layer of Figure 5) attaches and detaches like a feature flag, never alters the base, and is discarded the moment it stops helping. For most deployments L0–L1 (knowledge + skill) carry the load; controllers are the optional third gear for the genuinely hard cases.
Making it improve itself
A closed loop that improves the in-context skill automatically, every night, with a program as the judge — no labels, no humans, no data leaving the box.
13 The self-learning loop
In-context adaptation only compounds if something improves the context automatically. That engine is a nightly loop with a programmatic verifier at its heart (Figure 8).
One cycle: the model tries the domain tasks (sampling several candidate answers each); a program verifies every attempt against ground truth and scores it; a larger “teacher” model reflects on the failures and proposes a small text patch to the skill; and a gate keeps the patch only if it beats a frozen test set the optimizer never sees. Accept → new skill version; reject → revert and log. It runs unattended and entirely on the box.
    # one nightly cycle
    champion = skill.current()
    scored = [verify(t, a)
              for t in tasks
              for a in rollout(model, champion, t, n=N)]
    baseline = score_on_frozen_holdout(champion)
    patch = teacher.reflect(champion, failures(scored))  # text edits, not scoring
    candidate = apply(champion, patch)
    if score_on_frozen_holdout(candidate) - baseline >= MIN_LIFT:
        skill.commit(candidate)   # new champion
    else:
        keep(champion)            # auto-revert
14 The verifier is the reward
The make-or-break component is the judge. We insist it be a graded, deterministic program — not a human, and not another AI grading the first.
For an extraction task, the verifier is simply: what fraction of the expected fields did the model get right, after normalization? — a number between 0 and 1. Because the reward is code, the loop runs fully unattended; because it is graded (not pass/fail), a near-miss is distinguishable from a disaster, giving the teacher a gradient to climb. The teacher model writes patches but never scores — generation and evaluation stay strictly separate (step 3 in Figure 8), so the optimizer cannot grade its own homework.
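A graded field-level verifier of the kind described above might look like this minimal sketch. The normalization rule (lower-casing and collapsing whitespace) is an assumption; a real scorer would encode whatever normalization the domain needs. The field names follow the telecom example used elsewhere in the book, and the values are invented.

```python
def normalize(value):
    """Case-fold and collapse whitespace so trivial differences do not count."""
    return " ".join(str(value).lower().split())

def verify(expected: dict, predicted: dict) -> float:
    """Fraction of expected fields the model got right, between 0 and 1."""
    if not expected:
        raise ValueError("expected must contain at least one field")
    correct = sum(
        1 for name, want in expected.items()
        if name in predicted and normalize(predicted[name]) == normalize(want)
    )
    return correct / len(expected)

expected = {"vendor": "Acme", "cell": "4412",
            "severity": "High", "resolution": "Patched"}
predicted = {"vendor": "  acme ", "cell": "4412", "severity": "low"}
score = verify(expected, predicted)   # two of four fields correct
print(score)
```

A missing field and a wrong field both cost the same here, and a near-miss still earns partial credit, which is what gives the loop something to climb.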
Two failure modes to design against. Goodharting: optimize exactly what you measure and the model finds technically valid garbage; defend with layered checks (format, correctness, and grounding). Sparsity: a yes/no reward starves the loop; always grade on a 0–1 scale.
15 Evaluation discipline: never fool yourself
The most transferable rule in all of this: a change is only “better” relative to the unadapted model, measured on data the optimizer never saw. The neuroscience papers state it bluntly — always include an untrained baseline — and the loop bakes it in.
- Frozen, fingerprinted test set. Content-hashed and verified before every run; if a byte drifts, the run aborts. The optimizer can never see or touch it.
- One metric, end to end. The same scorer grades practice and the final gate — you optimize exactly what you measure.
- A noise floor. A minimum-improvement threshold rejects sub-noise “wins.” A good gate says no far more often than yes.
- Self-limiting. On a task already near its ceiling, a long string of rejections holding the line is the correct outcome — a system that cannot quietly make itself worse.
This is a regression-test gate for AI, expressed in language you already enforce for code: a protected test set, a single source-of-truth metric, and a merge that is blocked unless it provably improves things. Audit logs fall out for free — every accept/reject is recorded with its score delta.
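The fingerprint check, the noise floor, and the audit record fit together in a few lines. The sketch below is one possible shape, assuming the test set is a JSON-serializable list and that hashing its canonical JSON form is an acceptable fingerprint; the `min_lift` value of 0.01 is a hypothetical threshold, and the score pairs echo figures reported in Chapter 16.

```python
import hashlib
import json

def fingerprint(holdout):
    """Content hash of the test set; any changed byte changes the digest."""
    blob = json.dumps(holdout, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def gate(holdout, pinned_digest, baseline, candidate, min_lift, log):
    """Accept the candidate only if it beats the baseline by at least min_lift."""
    if fingerprint(holdout) != pinned_digest:
        raise RuntimeError("holdout changed since it was pinned; aborting run")
    delta = candidate - baseline
    accepted = delta >= min_lift
    # Every decision is recorded with its score delta, accepted or not.
    log.append({"baseline": baseline, "candidate": candidate,
                "delta": round(delta, 4), "accepted": accepted})
    return accepted

holdout = [{"prompt": "p1", "expected": {"vendor": "acme"}}]  # invented example
pinned = fingerprint(holdout)     # stored once, when the test set is frozen
log = []
first = gate(holdout, pinned, 0.69, 0.75, 0.01, log)    # clears the noise floor
second = gate(holdout, pinned, 0.96, 0.93, 0.01, log)   # a regression: rejected
print(first, second, log)
```

The design choice worth noting is that the hash check runs inside the gate itself, so a drifted test set stops the run before any score is compared.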
16 What the results taught us
The loop is not a thought experiment. We ran it unattended, night after night, against three real-world corpora — telecom network incidents, pharmaceutical recall records and medical-device files — and watched a frozen ~4-billion-parameter model teach itself with no human labels and no data leaving the box.
Telecom — large headroom, captured automatically. Starting from a short seed instruction that scored 0.69 on a frozen 60-prompt test set, the skill climbed to 0.75, then 0.90, then 0.98 over three accepted nights (Figure 9) — a +42% relative jump to near-perfect extraction of vendor, cell, severity and resolution from messy incident logs. Then it plateaued and the gate stopped accepting changes. No labels were written; the only human input was the scorer.
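For readers who want to check the telecom figures, the absolute and relative lifts follow directly from the two endpoint scores quoted above (0.69 and 0.98):

```latex
\text{absolute lift} = 0.98 - 0.69 = 0.29,
\qquad
\text{relative lift} = \frac{0.98 - 0.69}{0.69} = \frac{0.29}{0.69} \approx 0.42
```

The relative figure is measured against the seed score, which is why a 0.29 absolute gain reads as roughly +42%.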
Pharma — the discipline of knowing when to stop. On a frozen 30-prompt holdout of real public recall data the seed skill already scored 0.96. Over the following nights the loop kept proposing changes and the gate rejected every one — including a candidate that would have dropped the score to 0.93 — holding the line at 0.96. That is not the loop failing; it is the loop working correctly on a task already near its ceiling.
Medical devices — the middle case. A third, harder corpus of medical-device records confirmed the pattern from the other direction: over sixteen nights the loop accepted exactly one improvement (0.77 → 0.78 on a 100-prompt holdout) and rejected the other fifteen. Three domains, three honest outcomes, one unchanged loop.
The headline (telecom 0.69 → 0.98) and the anticlimax (pharma held at 0.96) come from the same machinery. A system that compounds gains where there is room and refuses changes where there is none is exactly what you want running unattended in production.
Five things the testing made concrete:
- Real domains carry large recoverable headroom, and a programmatic verifier captures it without labels (telecom: +0.29 absolute, fully autonomously).
- The same loop knows when to stop. On a near-ceiling domain the correct behaviour is a long run of rejections (pharma) — direct evidence the gate is doing its job (Figure 8).
- It transfers cheaply. Two very different corpora ran on the same hardware and the same loop; only a corpus and a scorer changed — no model retraining, no new architecture.
- Honest, boring results are a feature. “+0.01 and then nothing for two weeks” is the signature of a system that cannot quietly make itself worse — the property regulated buyers actually pay for.
- The discipline is what makes the numbers trustworthy. Every value above is measured on a frozen, fingerprinted test set the optimizer never saw; without that control (Chapter 15) the gains would be unfalsifiable.
Read these as a regression-test track record, not a benchmark leaderboard. The valuable artifact is not a single high score — it is the audit log: a per-night record of what was proposed, accepted or rejected, and by how much it beat the frozen baseline. That log is your evidence trail for change control and compliance.
These corpora are domain-representative but curated, and the test sets are small (30–100 prompts). The point is not the absolute scores — it is the shape: autonomous, label-free improvement where headroom exists, and disciplined refusal where it does not. Re-validate on your own corpus before quoting figures.
Behind the scores sit real stores. In the multi-domain deployment the telecom knowledge graph holds 8.3 MB of typed graph across 41 indexed chunks, 49 entities and 63 edge types (service memory-telecom, port 9003); the medical-device graph is a leaner 1.4 MB (memory-medical, port 9013). Serving runs the executor at roughly 120 tokens/sec on the NVFP4 fast path versus ~30 on bf16, and the whole stack — executor, nightly teacher and memory — occupies about 50 of the 128 GB available.
The hardware reality
What “edge” means in the Blackwell era, the model built to match it, quantization without tears, why throughput sets the pace of learning, and the develop-then-deploy workflow.
17 Edge hardware in the Blackwell era: Thor, DGX Spark & RTX Spark
“Edge” no longer means underpowered. A new class of device puts datacenter-grade AI silicon in a box you can put on a desk, in a rack, or on a robot — and run entirely offline.
The common thread across NVIDIA’s current devices is a large pool of unified memory (128 GB shared coherently between CPU and GPU, ~273 GB/s) and native 4-bit (NVFP4) compute. That combination lets the serving model, a larger teacher used only at night, and the memory layer all stay resident at once — no swapping, no second machine.
Two devices anchor this book, at opposite ends of the spectrum (Figure 10): the desktop-class DGX Spark developer box and the embedded Jetson Thor edge module. A third, the consumer RTX Spark, brings the same substrate to Windows PCs.
| Spec | Jetson AGX Thor | DGX Spark | RTX Spark |
|---|---|---|---|
| Role | Edge · physical AI | Desktop · develop & fine-tune | Windows PC · agents |
| FP4 compute | ~2,070 TFLOPS (~2 PFLOPS) | ~1,000 TFLOPS (~1 PFLOPS) | ~1,000 TFLOPS (~1 PFLOPS; 6,144 RTX cores) |
| CPU | 14-core Arm Neoverse-V3AE | 20-core Arm (GB10 Grace) | 20-core Arm (MediaTek) |
| Memory | 128 GB unified LPDDR5X · ~273 GB/s | 128 GB unified LPDDR5X · ~273 GB/s | 128 GB unified LPDDR5X · ~273 GB/s |
| Power | 40–130 W | desktop | laptop / mini-desktop |
| Notable | Holoscan + Isaac sensor fusion, MIG | fine-tune ~70B; cluster ×2 (ConnectX-7) → ~405B | 120B LLM @ up to 1M-token context |
Three form factors, one substrate — complementary, not competing. The workflow follows the shape of the table: do the heavy, occasional work on the desktop and deploy the result at the edge (Figure 11).
18 Gemma 4 E4B: a model built for the edge
Edge-class hardware is only half the story. It is paired with a model designed, from its name down, to be edge-first: Gemma 4 E4B, a compact instruction-tuned open model whose architecture trades raw size for footprint efficiency at every turn.
The name itself decodes the key ideas:
- E = “Effective.” About 8B weights are stored, but only ~4B activate for any given token — large-model behaviour at a ~4B compute and memory cost.
- PLE — Per-Layer Embeddings. The mechanism behind “effective”: a large block of embedding parameters (roughly 40% of the model) feeds each decoder layer from a cheaper memory tier instead of crowding the accelerator’s hot path. It is also the quantization-fragile block the next chapter protects.
- E2B nested (MatFormer). A smaller ~2B model lives inside the ~4B one, usable for cheap drafts or speculative decoding.
- 128K context, hybrid attention. Long inputs are handled with a mix of sliding-window and global attention — efficient on long sequences.
Why this is edge-first. Every one of these choices buys footprint efficiency rather than raw scale (Figure 12). They are not generic tricks; they are specifically what lets the whole stack — the executor model, a larger nightly teacher, and the memory store — fit inside 128 GB of unified memory and run at interactive speed on a 40–130 W board. Pairing E4B with a Blackwell device is a co-design, not a compromise.
“Effective” is the whole game: the model stores more parameters than it activates, so it punches above its weight class on reasoning while keeping a small model’s memory and speed footprint.
Because E4B is small and its embeddings can sit in a cheaper tier, the executor, the teacher, and the memory layer can all stay resident on one box at once. That residency is exactly what makes the nightly self-learning loop (Part IV) feasible on a single edge appliance — you are not paging models in and out to make room.
PLE is fragile. A serving runtime that silently drops it produces degraded output with no crash to warn you. The rule (detailed next chapter): keep PLE in bf16, quantize only the compute weights to NVFP4, and fail the build if the conversion touches a PLE tensor.
19 Quantization without tears
The lever that makes big models fit and run fast on the edge is quantization: storing each weight in fewer bits (Figure 13).
A weight stored in 16-bit precision (bf16) is accurate but heavy; the same weight in 4-bit (NVFP4) is a quarter of the size and runs roughly four times faster, with carefully managed accuracy loss. The catch is that not all parts of the model tolerate it equally. Some compact models keep a fragile block of per-layer embeddings (PLE) that must stay high-precision — quantize them and quality collapses silently, with no crash to warn you.
Quantize only the compute weights (attention and feed-forward) to NVFP4; keep the per-layer embedding tables in bf16. Bake a safety check into the build so it fails if a fragile tensor gets quantized, rather than shipping a quietly worse model. And never serve such a model on a runtime that silently drops PLE.
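The "fail the build" rule can be expressed as a small check over a precision plan. This is a sketch under stated assumptions: the tensor names and the name fragments in `COMPUTE_MARKERS` and `FRAGILE_MARKERS` are hypothetical, and no real conversion tool's API is being shown. `weight_bytes` gives raw storage only and ignores any scaling metadata a real 4-bit format carries.

```python
COMPUTE_MARKERS = ("attn", "mlp")        # hypothetical name fragments
FRAGILE_MARKERS = ("per_layer_embed",)   # hypothetical name fragment

def is_fragile(name):
    return any(marker in name for marker in FRAGILE_MARKERS)

def plan_precision(tensor_names):
    """Quantize only compute weights; everything else stays in bf16."""
    plan = {}
    for name in tensor_names:
        is_compute = any(marker in name for marker in COMPUTE_MARKERS)
        plan[name] = "nvfp4" if is_compute and not is_fragile(name) else "bf16"
    return plan

def check_build(plan):
    """Fail loudly if a fragile tensor ended up in a quantized format."""
    offenders = [name for name, fmt in plan.items()
                 if is_fragile(name) and fmt != "bf16"]
    if offenders:
        raise RuntimeError(f"fragile tensors quantized: {offenders}")

def weight_bytes(n_params, bits):
    """Raw storage for n_params weights, ignoring any scaling metadata."""
    return n_params * bits // 8

names = ["layers.0.attn.q_proj", "layers.0.mlp.up_proj",
         "layers.0.per_layer_embed", "final_norm"]
plan = plan_precision(names)
check_build(plan)   # passes: the fragile tensor stayed in bf16
print(plan)
print(weight_bytes(1_000_000, 16), weight_bytes(1_000_000, 4))
```

Using an allow-list for what gets quantized, rather than a deny-list for what does not, means an unrecognized tensor defaults to the safe high-precision format.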
20 Throughput is the learning rate
Here is the non-obvious connection that ties hardware to outcomes: on an edge device, how fast the model runs sets how fast it can learn.
The self-learning loop improves by running many graded attempts per night — try, score, reflect, repeat. The more attempts you can run in the nightly window, the more the skill improves. Because 4-bit compute is ~4× faster than 16-bit (Figure 13), it buys ~4× more turns of the self-learning loop (Figure 8) per night on the same box. Throughput is not a vanity metric here; it is the literal pace of improvement. This is why the quantization and memory choices in the previous chapters are first-order, not housekeeping.
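The arithmetic behind that claim is simple enough to write down. In the sketch below, 120 and 30 tokens/sec are the serving figures quoted in Chapter 16; the six-hour nightly window and the 500 tokens per attempt are made-up values used only to show the ratio.

```python
def attempts_per_night(tokens_per_sec, window_hours, tokens_per_attempt):
    """How many graded attempts fit in the nightly window at a given speed."""
    total_tokens = tokens_per_sec * window_hours * 3600
    return int(total_tokens // tokens_per_attempt)

# Hypothetical window and attempt size; throughput figures from Chapter 16.
fast = attempts_per_night(tokens_per_sec=120, window_hours=6, tokens_per_attempt=500)
slow = attempts_per_night(tokens_per_sec=30, window_hours=6, tokens_per_attempt=500)
print(fast, slow, fast / slow)
```

Whatever window and attempt size you plug in, the ratio between the two paths stays at the ratio of their throughputs, which is the sense in which throughput sets the pace of learning.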
21 Develop on the desktop, deploy at the edge
Because the same substrate appears in a desktop developer box and an embedded edge module, a clean workflow falls out.
Do the heavy, occasional work (building the corpus, any optional fine-tuning, proving out the skill) on the desktop-class DGX Spark, which can even cluster two units for large jobs. Then deploy the frozen base plus the proven skill artifact, unchanged, to Jetson Thor at the edge for real-time use, and to RTX Spark PCs for individual users. Nothing is re-architected between them; the artifact is portable because the substrate is shared (Figure 11). This mirrors the system’s own split: the nightly teacher loop is desktop-class work; the on-device serving is the edge deployment.
Enterprise architecture & economics
A reference architecture, scaling to many domains, the rare case for touching weights, and the ROI and governance story for decision-makers.
22 A reference architecture
Production systems built on these ideas share a recognizable shape: an OpenAI-compatible orchestrator behind a secure edge, with model serving and a retrieval layer, fed by an offline ingestion pipeline.
The request path is familiar to any architect: a client calls a single authenticated entry point; identity and routing are enforced at the edge; an orchestration layer selects an agent and tools; the agent reasons with the served model and grounds its answer via the retrieval layer. A separate, scheduled ingestion pipeline keeps the retrieval index fresh out-of-band. The self-learning loop plugs in as a nightly job against the same serving and retrieval components — it is not a separate system.
Everything here maps to controls you already run: an API gateway, OIDC/JWT identity, a service mesh, stateless inference workers, a vector/graph store, and scheduled jobs. The AI-specific parts are the skill artifact, the verifier, and the frozen-test gate — three small additions to an otherwise conventional platform.
23 Scaling to many domains
The frozen-base design scales sideways cheaply because the expensive parts are shared and the domain-specific parts are small and isolated (Figure 14).
The larger teacher model and the embedding encoder are domain-agnostic and idle most of the day, so one instance serves every domain. Each domain keeps its own executor, its own memory collection (no cross-contamination), and its own skill chain and frozen test set — an independent, auditable proof trail. Onboarding a new domain means supplying a corpus and a scorer, declared in one config — no model retraining. Three to four domains fit comfortably on a single 128 GB box; the per-domain marginal cost is a slice of one power-efficient appliance.
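A per-domain declaration could be as small as the following sketch. The `DomainConfig` fields, the file paths, and the scorer names are hypothetical; only the two memory ports (9003 and 9013) come from the deployment described in Chapter 16. The `validate` helper illustrates the isolation rule: no two domains may share a name, a memory port, or a frozen test set.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DomainConfig:
    name: str
    corpus_path: str     # where the domain corpus lives (hypothetical path)
    scorer: str          # dotted path to the verifier (hypothetical name)
    memory_port: int     # the domain's own memory service
    holdout_path: str    # the domain's own frozen test set

DOMAINS = [
    DomainConfig("telecom", "corpora/telecom", "scorers.telecom.verify",
                 9003, "holdouts/telecom.jsonl"),
    DomainConfig("medical", "corpora/medical", "scorers.medical.verify",
                 9013, "holdouts/medical.jsonl"),
]

def validate(domains):
    """Reject configs where two domains would share a name, port, or test set."""
    for attr in ("name", "memory_port", "holdout_path"):
        values = [getattr(d, attr) for d in domains]
        if len(values) != len(set(values)):
            raise ValueError(f"domains must not share {attr}")
    return True

print(validate(DOMAINS))
```

Onboarding a third domain would then mean appending one more `DomainConfig` entry; the shared teacher and encoder do not appear in the per-domain config at all.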
24 When to actually touch the weights
The doctrine is “weights last,” not “weights never.” Chapter 8 told you when last arrives: when the limit is genuinely capacity, not adaptation.
Operationalize it with an objective trigger: if held-out improvement stalls below a small threshold for several consecutive nights while the lighter levers are still active, the reversible options have demonstrably flattened — the task may now be capacity-bound. Only then consider a weight update, under strict conditions:
- Detachable, never merged — keep any adapter as a separate, hot-swappable file so the frozen base is always recoverable.
- Distill from the verified corpus — the teacher generates and the program filters; you are amortizing an already-proven corpus, not inventing capability beyond the small model’s ceiling.
- Forgetting gate — promote the adapter only if new-domain gain clears a bar and no prior-domain regression exceeds a tight bound, measured on frozen test sets. The same discipline as Chapter 15, now guarding against forgetting.
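Both the stall trigger and the forgetting gate reduce to short, testable predicates. The sketch below is one way to write them; the thresholds, the number of nights, and the "after" scores are hypothetical, while the "before" scores reuse the final figures reported in Chapter 16.

```python
def plateaued(nightly_lifts, threshold, nights):
    """True if the last `nights` held-out lifts all fell below `threshold`."""
    if len(nightly_lifts) < nights:
        return False
    return all(lift < threshold for lift in nightly_lifts[-nights:])

def forgetting_gate(before, after, new_domain, min_gain, max_regression):
    """Promote an adapter only if the new domain gains enough
    and no prior domain regresses by more than max_regression."""
    if after[new_domain] - before[new_domain] < min_gain:
        return False
    for domain, old_score in before.items():
        if domain != new_domain and old_score - after[domain] > max_regression:
            return False
    return True

# Hypothetical nightly lifts: one early win, then three flat nights.
stalled = plateaued([0.06, 0.0, 0.005, 0.0], threshold=0.01, nights=3)

before = {"telecom": 0.98, "pharma": 0.96, "medical": 0.78}
safe = {"telecom": 0.98, "pharma": 0.96, "medical": 0.85}     # invented scores
harmful = {"telecom": 0.98, "pharma": 0.93, "medical": 0.85}  # invented scores
print(stalled,
      forgetting_gate(before, safe, "medical", 0.02, 0.01),
      forgetting_gate(before, harmful, "medical", 0.02, 0.01))
```

All scores passed to `forgetting_gate` are assumed to come from each domain's frozen test set, so the same evidence that justified earlier skill promotions also guards the adapter.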
25 ROI, governance, and risk
For decision-makers, the frozen-base approach changes the shape of both the cost curve and the risk profile.
| Dimension | Cloud fine-tune / API | Frozen-base on the edge |
|---|---|---|
| Cost shape | Opex that rises with usage; repricing risk | Capex once; marginal query cost ≈ electricity |
| Data | Leaves the building | Never leaves; compliance unlock |
| Improvement | Event-driven; vendor-paced | Compounds nightly on your data |
| Reversibility | Hard (merged weights) | One step (detach the artifact) |
| Audit | Opaque weight diff | Versioned artifacts + scored accept/reject log |
The through-line of the whole book is a single inversion of the default. The cloud reflex — “if it underperforms, train the weights” — is the wrong primitive on the edge, where reversibility, drift, capacity, and cadence all argue the other way. Freeze the substrate, adapt in context, verify against a frozen baseline, and let throughput compound the gains nightly. Spend capacity — deliberately, and last — only on the problems that truly demand it.
Reference
A plain-language glossary of every term introduced, and the sources behind the science.
A Glossary
- Weight / parameter: One of the millions or billions of adjustable numbers inside a model. Training sets them; inference reads them.
- Training: The process of adjusting weights so the model performs better, using gradient descent + backpropagation.
- Inference: Running a trained model to get an output. Weights are read-only; cheap and repeatable.
- Gradient descent: The learning algorithm: repeatedly step the weights a little in the most error-reducing direction (“roll downhill”).
- Backpropagation: The efficient method that computes, for every weight at once, which way reduces error, by sending the error backward through the network.
- Learning rate: The size of each gradient-descent step. Too big overshoots; too small crawls.
- Fine-tuning: Continuing to train a pretrained model on your data. It changes the weights, with the risks this book describes.
- Catastrophic forgetting: When training on a new task silently degrades previously-learned abilities.
- In-context learning: Changing a model’s behavior via what you put in the prompt, without changing weights.
- Skill: A versioned, optimized in-context instruction; the unit of learning in the frozen-base design.
- RAG (retrieval-augmented generation): Fetching relevant records from a knowledge store and giving them to the model as context, so answers are grounded.
- Verifier: A deterministic program that scores a model’s answer 0–1 against ground truth; the reward the loop optimizes.
- Frozen holdout: A protected test set the optimizer never sees, used to decide whether a change is genuinely better.
- Quantization: Storing weights in fewer bits to shrink and speed up the model (e.g. 16-bit → 4-bit NVFP4).
- PLE (per-layer embeddings): A quantization-fragile part of some compact models that must stay high-precision.
- Unified memory: Memory shared coherently between CPU and GPU; lets a whole stack stay resident on one device.
- Distillation: Training a small model to imitate a larger one; a capacity tool, used only when adaptation has plateaued.
B References & further reading
- Leutenegger, N. (2026). Untrained CNNs Match Backpropagation at V1: A Systematic RSA Comparison of Four Learning Rules Against Human fMRI. arXiv:2604.16875. Frozen/untrained network matches trained at early stages; local rules beat global; “always include an untrained baseline.”
- Leutenegger, N. (2026). Cross-Species RSA Reveals Conserved Early Visual Alignment but Divergent Higher-Area Rankings. arXiv:2605.22401. Higher-level alignment scales with model capacity and data, not the learning rule.
- Rumelhart, Hinton & Williams (1986). Learning representations by back-propagating errors. Nature. The original backpropagation paper.
- Brown, T. et al. (2020). Language Models are Few-Shot Learners. NeurIPS. In-context learning at scale.
- Hu, E. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685. Detachable adapters.
- Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314.
- NVIDIA (2026). DGX Spark, Jetson Thor, and RTX Spark product documentation. Blackwell, 128 GB unified memory, NVFP4.
- Unovie.AI (2026). Training Without Retraining: A Frozen-Base Doctrine for Custom Models on the Edge (companion whitepaper).