
Reiner Pope – The math behind how LLMs are trained and served

Dwarkesh Podcast · Dwarkesh Patel — Reiner Pope · April 29, 2026

Most important takeaway

The economics, latency, and feature set of every frontier LLM API are dictated by a small set of hardware ratios (compute vs. memory bandwidth, scale-up vs. scale-out, HBM capacity vs. context length) and one optimization: batching. Once you understand the roofline math, things like why “fast mode” costs 6x, why context above 200K is priced higher, why decode is ~5x more expensive than prefill, and why context lengths have stalled around 1M all become predictable. The biggest current bottleneck for both bigger models and longer context is the memory wall (HBM bandwidth and capacity), not compute.

Summary

This is a blackboard-style lecture on the math of training and serving large language models, with Reiner Pope (CEO of the chip startup MatX; formerly worked on TPU architecture at Google).

Key themes and insights:

  • Batch size dominates everything. The single biggest lever in inference economics is how many user requests you batch together. Without batching, costs can be ~1,000x worse. The optimal micro-batch size is roughly 300 × sparsity_ratio (~2,000 tokens per forward pass for typical MoE models), set by the ratio of chip FLOPs to memory bandwidth, which has stayed near 300 FLOPs/byte across GPU generations.
  • Why “fast mode” exists (and why slow mode could too). Paying more for faster output buys you a smaller batch (less amortization of weight loads); paying less could buy you a slot in a bigger, slower batch. The latency lower bound on given hardware is just total parameter bytes / aggregate memory bandwidth — typically ~20ms per forward pass on modern HBM.
  • Sparsity is essentially “free quality.” Empirical scaling laws (e.g., the routed-language-models paper) show that multiplying total params ~64x at fixed active params buys roughly the quality of a dense model 4x bigger. Combined with the batching math, this is pure win until you exhaust user demand or HBM capacity.
  • Rack architecture explains model architecture. Mixture-of-experts uses all-to-all communication and fits perfectly inside a single NVLink scale-up domain (NVL72 = 72 GPUs). Going to two racks bottlenecks on the ~8x slower scale-out network. Pipeline parallelism works across racks and solves model-weight capacity but not KV-cache capacity (the per-rack KV term doesn’t shrink with pipeline depth).
  • Why frontier models stayed near GPT-4 size for ~3 years. The industry was waiting for scale-up domains big enough to hold a multi-trillion-parameter MoE plus KV caches. Hopper (8 GPUs, 640GB) → Blackwell (72 GPUs, 10–20TB) finally enabled 5T-parameter models. Google’s TPU pods had bigger scale-up earlier, which partly explains the Gemini lead.
  • Compute equalization heuristic for over-training. Pre-training cost ≈ RL cost ≈ inference cost over a model’s two-month deployment. Working it out implies frontier models are trained on ~100x more tokens than Chinchilla-optimal. From public tokens-per-second numbers, you can roughly back out pre-training data sizes (150–200T tokens for a frontier model).
  • API pricing leaks architecture. The 50% jump in Gemini 3 pricing above 200K context implies a KV cache of roughly 2KB/token. The ~5x premium of output over input tokens shows decode is memory-bandwidth bound, whereas prefill is compute bound. Cache-hit pricing tiers (5-min vs. 1-hour) likely correspond to flash vs. spinning disk, based on drain-time math (yes, hyperscalers still use spinning disk for KV cache).
  • The memory wall caps context length. Sparse attention helps (square-root scaling) but isn’t infinite. There’s no clear path to 100M-token context without an HBM breakthrough — bad news for “in-context learning replaces continual learning” theses for AGI.

Career advice

  • Understand the hardware before the model. Reiner repeatedly demonstrates that knowing the roofline math (compute throughput, memory bandwidth, scale-up topology) lets you predict why models, prices, and product features look the way they do. ML engineers who can read API price sheets and reason backward to architecture decisions are valuable.
  • Approximate aggressively. “Set A equal to B and figure it out” — back-of-envelope models that ignore second-order terms are powerful and a good interview/work habit.
  • Watch what’s being deployed in racks, not just papers. Architectural choices (MoE expert count, KV-cache sharing, sparse attention) are gated by physical realities (cable density, rack power, HBM bandwidth) more than by ML theory.

Stocks / investments mentioned

  • NVIDIA — the entire NVL72/Blackwell/Rubin scale-up roadmap is central to the discussion; Blackwell unlocked frontier model deployment and Rubin (~500-GPU scale-up) is positioned as the next jump.
  • Google (TPUs) — large scale-up domains long preceded NVIDIA’s, framed as a structural advantage behind Gemini’s success.
  • MatX — Reiner’s chip startup; Dwarkesh discloses he is an angel investor. Not publicly investable but flagged as a competitor in the scale-up/inference-chip space.
  • HBM and memory suppliers (implied) — Dylan Patel is cited saying hyperscalers spend ~50% of capex on memory; smartphone memory volumes will fall ~30% due to HBM crowd-out. This points to HBM-makers (SK Hynix, Samsung, Micron) as the constrained bottleneck of the AI build-out.
  • Anthropic / OpenAI / Cursor — referenced via “fast mode” pricing and Cursor Composer; not investment recommendations but signals about where margin and product differentiation live in the API stack.

Actionable insights

  • For builders: choose batch sizes, context windows, and caching tiers (5-min vs. 1-hour) deliberately based on the roofline math — there’s real money in matching workload to memory tier.
  • For prompt designers: caching pays off ~10x; prefill-heavy workflows are ~5x cheaper per token than decode-heavy ones; favor architectures that reuse cached prefixes.
  • For investors: bet on the memory wall — HBM capacity and bandwidth are the binding constraint on both bigger models and longer context, more so than compute.
  • For researchers/operators: sparse attention is an under-deployed lever (DeepSeek’s published work is a roadmap); large scale-up domains are a moat for whoever has them.

Chapter Summaries

Setup and motivation (FastMode, SlowMode). The lecture is framed by the puzzle of API “fast mode” pricing (6x cost for 2.5x speed). The answer is batch size — the single biggest variable in inference economics.

Roofline analysis: time = max(compute, memory). Per-step inference time is the larger of a compute term (2 × batch × active params / FLOPs) and a memory term (weight bytes / bandwidth, plus a KV-cache term). KV cache fetches scale linearly with both batch size and context length.
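
A minimal sketch of that roofline in Python (the variable names and the 2-bytes-per-weight bf16 assumption are mine, not from the talk):

```python
def decode_step_time_s(batch, active_params, total_params,
                       context_len, kv_bytes_per_token,
                       flops_per_s, hbm_bytes_per_s):
    """Lower bound on one decode step: max(compute time, memory time)."""
    # Compute term: ~2 FLOPs per active parameter per generated token.
    compute_t = 2 * batch * active_params / flops_per_s
    # Memory term: stream every weight once (2 bytes/param in bf16),
    # plus each sequence's KV cache, which grows with batch and context.
    memory_t = (2 * total_params
                + batch * context_len * kv_bytes_per_token) / hbm_bytes_per_s
    return max(compute_t, memory_t)
```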

Latency vs. batch curves. A latency lower bound emerges at ~20ms (HBM drain time). Cost-per-token plummets as batch grows because weight fetches amortize, then flatlines when compute dominates.
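
The ~20ms floor is just the time to stream the weights through HBM once. A worked example with assumed numbers (a 5T-parameter model in bf16 on an NVL72-class rack at roughly 8 TB/s of HBM bandwidth per GPU):

```python
total_params = 5e12                    # assumed 5T-parameter MoE
weight_bytes = 2 * total_params        # bf16: 2 bytes per parameter
aggregate_bw = 72 * 8e12               # 72 GPUs x ~8 TB/s HBM each (assumed)
latency_floor_s = weight_bytes / aggregate_bw   # ~0.017 s, i.e. ~20 ms
```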

Optimal batch size derivation. Solving compute_time = memory_time gives batch ≥ 300 × sparsity — typically ~2,000 tokens, hardware-stable across GPU generations. With ~20ms cycles, this implies a system serving ~128K tokens/sec (~1/1000 of Gemini’s reported global throughput).
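
Setting the compute term equal to the weights-only memory term, the 2 FLOPs/param and 2 bytes/param cancel, leaving a pure hardware ratio times the sparsity. Illustrative H100-class numbers (assumed, not from the talk):

```python
flops_per_s = 1e15                       # ~1 PFLOP/s dense bf16 (assumed)
hbm_bytes_per_s = 3.35e12                # ~3.35 TB/s HBM (assumed)
ridge = flops_per_s / hbm_bytes_per_s    # ~300 FLOPs/byte, stable across gens
sparsity = 7                             # total_params / active_params (assumed)
min_batch_tokens = ridge * sparsity      # ~2,000 tokens per forward pass
```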

Sparsity is a free quality lever. Citing the routed-LM scaling-laws paper: holding active params constant, increasing total params (more experts) keeps improving quality. ~4x active ≈ 64x total at equivalent dense quality.

MoE on a rack: all-to-all and expert parallelism. Mixture-of-experts maps naturally onto a Blackwell NVL72 rack via expert parallelism. The all-to-all traffic pattern is well-served by full NVLink connectivity within a rack but bottlenecks on rack-to-rack scale-out (8x slower).

Why scale-up domains matter. Cabling density, power, and cooling cap rack size. Hopper (8) → Blackwell (72) → Rubin (~500) is enabled by physical rack redesigns. Bigger scale-up means more aggregate HBM bandwidth, which lowers the latency floor.

Why frontier model size stalled 2022–2025. It took until Blackwell’s 10–20TB scale-up domain to fit a 5T-parameter MoE plus KV caches. Google’s earlier large scale-up partially explains the Gemini lead.

Pipeline parallelism: across racks. Pipelining (different layers on different racks) works because the communication math favors it: each token ships its activations across the rack boundary only once per stage, while reading activated-experts × layers-per-stage worth of weights inside the rack, which easily overcomes the 8x scale-out bandwidth penalty. Frontier inference still mostly avoids pipelining and stays within a single scale-up domain.
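
A back-of-envelope check of that ratio, with assumed expert and layer counts (the talk gives the 8x penalty; the rest is illustrative):

```python
scale_out_penalty = 8        # rack-to-rack bandwidth ~8x slower than NVLink
experts_activated = 8        # experts touched per token per layer (assumed)
layers_per_stage = 30        # layers hosted on one rack (assumed)

# Per token, one activation vector crosses the rack boundary per stage,
# against experts_activated * layers_per_stage weight reads inside the rack.
in_rack_work_per_hop = experts_activated * layers_per_stage   # ~240
pipelining_wins = in_rack_work_per_hop > scale_out_penalty    # easily True
```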

Pipeline parallelism solves weight capacity but not KV capacity. With P stages, each rack holds only 1/P of the weights, but micro-batching keeps all stages busy, so the sequences in flight (and their KV caches) per rack don’t shrink; KV capacity, not weight capacity, remains the binding constraint.
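
A sketch of the per-rack accounting under P pipeline stages (the function and its granularity are mine, not from the talk):

```python
def per_rack_bytes(total_weight_bytes, stages, seqs_in_flight,
                   context_len, kv_bytes_per_token):
    weights = total_weight_bytes / stages   # shrinks with pipeline depth
    # Micro-batching keeps every stage busy, so the sequences in flight
    # (and their KV caches) on each rack do not shrink with depth.
    kv = seqs_in_flight * context_len * kv_bytes_per_token
    return weights + kv
```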

Compute-equalization heuristic for training/RL/inference. The minimum-cost design equalizes pre-training, RL, and inference compute. Working it through implies pre-training data ~ inference tokens served, roughly 100x Chinchilla-optimal over-training. From ~50M tokens/sec/model × 2 months ≈ 200T tokens.
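
The token arithmetic spelled out, using the talk’s ~50M tokens/sec figure and a two-month window:

```python
tokens_per_s = 50e6                 # reported aggregate serving rate
deployment_s = 2 * 30 * 86400       # ~two-month deployment window
inference_tokens = tokens_per_s * deployment_s   # ~2.6e14, a couple hundred T
# Equalizing pre-training with inference compute implies a comparable
# pre-training token count: ~100x Chinchilla-optimal for a frontier model.
```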

Reading API prices. The 200K-token Gemini 3 price step suggests ~2KB/token KV cache (consistent with character.ai-style cross-layer KV sharing or modest sparse attention). The ~5x output/input premium shows decode is memory-bandwidth bound; prefill is compute bound.
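
One way ~2KB/token falls out of an architecture: cross-layer KV sharing collapses the usual per-layer factor. The head counts and dtype below are illustrative assumptions, not disclosed Gemini parameters:

```python
n_kv_heads = 8          # grouped-query KV heads (assumed)
head_dim = 128          # per-head dimension (assumed)
dtype_bytes = 1         # fp8 KV cache (assumed)
kv_per_layer = 2 * n_kv_heads * head_dim * dtype_bytes   # K and V: 2,048 B
shared_groups = 1       # cross-layer sharing: one KV set reused everywhere
kv_bytes_per_token = kv_per_layer * shared_groups        # ~2 KB/token
```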

KV cache memory tiers. Cache-hit pricing tiers map to memory-tier drain times: 5-min ≈ flash, 1-hour ≈ spinning disk. Hyperscalers really do use spinning disk for KV storage.
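
The drain-time logic in miniature: a tier’s natural retention window is its capacity divided by its sustained write rate. Capacities and bandwidths below are illustrative assumptions:

```python
def retention_window_s(capacity_bytes, write_bw_bytes_per_s):
    """Time until a cache tier fills and must start evicting."""
    return capacity_bytes / write_bw_bytes_per_s

flash_tier = retention_window_s(6e12, 20e9)    # 6 TB at 20 GB/s -> ~300 s (5 min)
disk_tier = retention_window_s(720e12, 200e9)  # 720 TB at 200 GB/s -> 3,600 s (1 hr)
```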

The memory wall and context length. Context lengths plateaued at ~100K–1M because HBM bandwidth/capacity, not compute, is the binding constraint. Sparse attention (square-root scaling) helps but is finite. No clear path to 100M-token context — challenges the “in-context learning replaces continual learning” thesis for AGI.
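
What square-root scaling buys per decoded token, assuming the ~2KB/token KV figure from above:

```python
kv_bytes = 2048                 # assumed KV bytes per token
ctx = 1_000_000
dense_fetch = ctx * kv_bytes                  # ~2 GB fetched per token at 1M
sparse_fetch = int(ctx ** 0.5) * kv_bytes     # ~2 MB with sqrt scaling
# At 100M context the sqrt term alone is ~20 MB/token, and the full KV
# must still live somewhere, so HBM remains the wall.
```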

Tangent: cryptography vs. neural nets. Both jumble information aggressively but for opposite purposes. Differentiability is what makes neural nets trainable; differential cryptanalysis exploits the analogous structure to attack symmetric ciphers.

Reversible networks (RevNets / Feistel). A Feistel-style construction makes neural net layers invertible, letting backward pass rematerialize activations instead of storing them — trading more compute for less memory (the opposite of the KV cache tradeoff).
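
A minimal Feistel-style reversible block in the spirit of RevNets (a sketch, not code from the talk; F and G can be arbitrary sub-networks):

```python
import torch

def rev_forward(x1, x2, F, G):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    # Exact inversion lets the backward pass rematerialize activations
    # instead of storing them: more compute, less memory.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

F, G = torch.nn.Linear(64, 64), torch.nn.Linear(64, 64)
x1, x2 = torch.randn(2, 64), torch.randn(2, 64)
y1, y2 = rev_forward(x1, x2, F, G)
r1, r2 = rev_inverse(y1, y2, F, G)
assert torch.allclose(x1, r1, atol=1e-5) and torch.allclose(x2, r2, atol=1e-5)
```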