NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light" — Nader Khalil (Brev), Kyle Kranen (Dynamo)
Most important takeaway
NVIDIA’s Dynamo is a data-center-scale inference serving framework that sits on top of existing inference engines (vLLM, TensorRT-LLM) and unlocks major cost and latency improvements through disaggregation of the pre-fill and decode phases, intelligent scaling, and KV cache optimization. Combined with NVIDIA’s “Speed of Light” (S.O.L.) culture of stripping problems to their physics-level minimum, the episode paints a picture of inference infrastructure becoming the critical bottleneck and opportunity as agents consume exponentially more tokens.
Chapter Summaries
Brev Origin Story and NVIDIA Acquisition — Nader Khalil recounts building Brev, a developer tool for easy GPU provisioning, from scrappy startup (surfboards at GTC booths, foil-printed GPU gift cards) to NVIDIA acquisition. The cultural fit was strong because both teams share a passion for developer experience. Brev now lives at brev.nvidia.com and is growing rapidly inside and outside NVIDIA.
DGX Spark and Developer UX at NVIDIA — NVIDIA’s developer audience has expanded dramatically beyond CUDA engineers to include hobbyists and “vibe coders.” Brev is building tools so users can register a DGX Spark at home and access it remotely like a cloud GPU. NVIDIA’s developer UX strategy is adapting to serve this much wider population.
S.O.L. (Speed of Light) Culture — Jensen Huang’s S.O.L. principle asks teams to identify the theoretical physics-level minimum time for any deliverable, then layer reality back in. It forces first-principles thinking, creates urgency, and prevents accepting arbitrary timelines. The concept originated from GPU hardware benchmarking (theoretical max throughput) and is now applied company-wide to all projects.
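The hardware origin of S.O.L. can be made concrete with a back-of-envelope latency floor: during decode, every generated token must stream the model weights from HBM at least once, so memory bandwidth sets a hard physical minimum. This is an illustrative sketch with assumed numbers, not an NVIDIA formula:

```python
# Illustrative "speed of light" estimate for single-GPU decode latency.
# Assumption: decode is memory-bound, so the floor is weight bytes / bandwidth.

def sol_decode_latency_ms(params_billion: float, bytes_per_param: float,
                          hbm_bandwidth_tb_s: float) -> float:
    """Theoretical minimum milliseconds per decoded token."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_s = hbm_bandwidth_tb_s * 1e12
    return weight_bytes / bandwidth_bytes_s * 1e3

# A 70B-parameter model in FP16 on a GPU with ~3.35 TB/s of HBM bandwidth:
floor = sol_decode_latency_ms(70, 2.0, 3.35)
print(f"S.O.L. floor: {floor:.1f} ms/token")  # ~41.8 ms/token
```

Any real deployment then layers reality back in (kernel launch overhead, batching, communication), exactly as the S.O.L. exercise prescribes.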
Kyle Kranen’s Background: Recommenders, Graph Neural Networks, and NVIDIA Culture — Kyle joined NVIDIA out of college and moved from recommender systems to graph neural networks largely by self-direction. NVIDIA’s culture lets engineers propose projects, email up the chain freely, and pursue “zero billion dollar businesses” — markets with no current revenue that NVIDIA believes will matter in the future (e.g., autonomous driving for a decade before licensing).
Dynamo: Data-Center-Scale Inference — Dynamo addresses the problem that single-replica inference engines hit scaling limits. It disaggregates pre-fill (compute-bound, processes the input prompt) from decode (memory-bound, generates output tokens), allowing each phase to scale independently with dynamic ratios. It includes a Kubernetes component called Grove for orchestrating multi-node, multi-phase inference deployments. On Blackwell with NVLink-72, Dynamo delivers roughly 35x cheaper per-token inference versus Hopper for MoE models.
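The key intuition behind disaggregation can be shown with a toy pool-sizing calculation (a sketch, not Dynamo's actual planner): pre-fill demand scales with input tokens, decode demand with output tokens, so the two pools want different, independently adjustable sizes. All throughput numbers below are assumptions.

```python
# Back-of-envelope sizing for disaggregated prefill/decode GPU pools.
# Prefill is compute-bound (fast in tokens/s), decode is memory-bound (slow),
# so agentic traffic with long prompts shifts the ratio toward prefill.
import math

def pool_sizes(req_per_s: float, prompt_tokens: int, output_tokens: int,
               prefill_tok_s_per_gpu: float, decode_tok_s_per_gpu: float):
    prefill_load = req_per_s * prompt_tokens   # input tokens/s to ingest
    decode_load = req_per_s * output_tokens    # output tokens/s to generate
    return (math.ceil(prefill_load / prefill_tok_s_per_gpu),
            math.ceil(decode_load / decode_tok_s_per_gpu))

# Agentic traffic: 8k-token prompts, 500-token answers (illustrative numbers).
print(pool_sizes(50, 8000, 500, 40000, 2000))  # (10, 13)
```

In a monolithic deployment both phases would share one pool and the slower phase would dictate provisioning; disaggregation lets each side track its own load.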
Model-Hardware Co-Design and Context Scaling — Kyle discusses how models like Kimi K2 and DeepSeek deliberately trade attention heads for more experts to manage context scaling. Current context windows are stuck around 1M tokens, but “unhobbling” breakthroughs (a term from Leopold Aschenbrenner’s “Situational Awareness” essay) like MLA (multi-head latent attention) and grouped-query attention could enable 10-100M token contexts. Kyle theorizes that models doing local pre-fill and global decode could break the quadratic pre-fill barrier.
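The "quadratic pre-fill barrier" is simple to quantify: self-attention cost during pre-fill grows with the square of sequence length, because every token attends to every earlier token. The constants and dimensions below are illustrative scaling assumptions, not exact FLOP accounting for any specific model:

```python
# Why 10-100M token contexts are hard: attention prefill FLOPs scale as n^2.
def attn_prefill_flops(n_tokens: int, n_layers: int, d_model: int) -> float:
    # ~2 * n^2 * d per layer for QK^T plus the attention-weighted V product
    # (rough scaling sketch; constant factors and MLP cost omitted).
    return 2 * n_tokens**2 * d_model * n_layers

base = attn_prefill_flops(1_000_000, 60, 7168)   # ~1M-token context
big = attn_prefill_flops(10_000_000, 60, 7168)   # ~10M-token context
print(big / base)  # 100.0 -- 10x the context costs 100x the attention compute
```

This quadratic growth is exactly what Kyle's "local pre-fill, global decode" idea would have to sidestep.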
Agents in Production and Security — NVIDIA deployed OpenAI Codex across tens of thousands of employees. An engineer built an Outlook CLI that lets coding agents triage email. The security principle: agents should only do two of three things (access files, access internet, write/execute code) — never all three simultaneously. NVIDIA runs agent workloads on Brev VMs isolated from the corporate network.
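The "two of three" rule is mechanical enough to encode as a policy check. This is a minimal sketch of the idea from the episode; the capability names and function are illustrative, not an NVIDIA API:

```python
# Policy sketch: an agent may hold at most two of {filesystem access,
# internet access, code execution} at once. All three together form a
# complete exfiltration chain (read secrets, write malicious code, send out).
CAPABILITIES = {"files", "internet", "execute"}

def is_allowed(requested: set[str]) -> bool:
    """Reject unknown capabilities and any grant of all three at once."""
    return requested <= CAPABILITIES and len(requested) < 3

assert is_allowed({"files", "execute"})                   # local coding agent
assert is_allowed({"internet", "execute"})                # sandboxed researcher
assert not is_allowed({"files", "internet", "execute"})   # full chain: denied
```

Running the agent in an isolated VM, as NVIDIA does with Brev, adds a second layer on top of this capability gate.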
CLIs as the Agent Interface — Every dev tool needs a CLI because pre-training data is overwhelmingly CLI-oriented, CLIs are portable and deterministic, and they limit the attack surface compared to arbitrary API calls. NVIDIA is open-sourcing CLIs for its business applications and encourages an “open CLI foundation” approach.
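What "agent-friendly CLI" means in practice: flat subcommands, deterministic machine-readable output, no interactive prompts. Here is a hedged sketch using only the Python standard library; the `mailcli` tool name and its stubbed data echo the Outlook-CLI idea from the episode but are entirely hypothetical:

```python
# Minimal agent-friendly CLI sketch: argparse subcommands, JSON to stdout,
# nothing interactive. A real tool would call a mail API instead of the stub.
import argparse
import json
import sys

def main(argv=None) -> int:
    parser = argparse.ArgumentParser(prog="mailcli")
    sub = parser.add_subparsers(dest="cmd", required=True)
    ls = sub.add_parser("list", help="list unread messages as JSON")
    ls.add_argument("--limit", type=int, default=10)
    args = parser.parse_args(argv)
    if args.cmd == "list":
        msgs = [{"id": i, "subject": f"msg {i}"} for i in range(args.limit)]
        json.dump(msgs, sys.stdout)  # deterministic, parseable output
    return 0

# An agent would invoke it as: mailcli list --limit 5
main(["list", "--limit", "2"])
```

Because output is plain JSON on stdout, a coding agent can pipe, grep, and parse it without any SDK, which is exactly why pre-training data favors this interface.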
The Year of System-as-Model and Sub-Agents — Kyle frames 2026 as the year of “system as model” where a single API call hides a constellation of models, routers, and agents underneath. Dynamo’s model router for DGX Spark already routes queries between local and cloud models. Agent run times currently average 20-45 minutes of autonomous work; Kyle predicts we will see agents capable of 24+ hour runs with self-consistency by year-end.
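The routing idea can be sketched in a few lines. This is a hedged illustration of "system as model", not Dynamo's actual router: the heuristic, thresholds, and model names below are all assumptions.

```python
# Toy "system as model" router: one entry point that sends short/simple
# queries to a local model and long/hard ones to a cloud frontier model.
def route(prompt: str, max_local_tokens: int = 2000) -> str:
    hard_markers = ("prove", "refactor", "multi-step")  # illustrative heuristic
    too_long = len(prompt.split()) > max_local_tokens
    too_hard = any(m in prompt.lower() for m in hard_markers)
    if too_long or too_hard:
        return "cloud-frontier-model"   # placeholder name
    return "local-spark-model"          # placeholder name

assert route("summarize this paragraph") == "local-spark-model"
assert route("Refactor this service into microservices") == "cloud-frontier-model"
```

A production router would score difficulty with a small classifier model rather than keyword matching, but the shape (one API, many models behind it) is the same.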
Actionable Insights
- Learn inference infrastructure deeply. As models shift from training to serving, the bottleneck is moving to inference. Understanding concepts like pre-fill vs. decode disaggregation, KV cache management, tensor parallelism, and MoE expert routing is increasingly valuable for ML engineers and platform teams.
- Apply S.O.L. thinking to your own projects. Before accepting any timeline, ask: what is the physics-level minimum? Strip away organizational friction, identify the theoretical fastest path, then layer constraints back in. This first-principles approach prevents teams from accepting arbitrary delays.
- Build CLIs for your tools now. Coding agents work best through terminals. If your product or internal tool lacks a CLI, you are invisible to the agent workflow that is rapidly becoming standard. NVIDIA is converting all its business applications to CLI-first; this is a strong signal of where developer tooling is headed.
- Adopt the “two of three” agent security rule. When deploying agents, limit them to two of: file system access, internet access, and code execution. Allowing all three creates a full vulnerability chain. Run agent workloads in isolated VMs off your corporate network.
- Use Brev (brev.nvidia.com) and build.nvidia.com for GPU access. Brev provides fast GPU provisioning with a clean UX. build.nvidia.com offers free, rate-limited API access to open-source models — useful for prototyping before committing to infrastructure.
- Watch for “system as model” architectures. The pattern of routing between local models (e.g., on DGX Spark) and cloud foundation models through an intelligent router is going mainstream. Design your applications to be model-agnostic and multi-model from the start.
- Invest in harness/tool training for your models. Models perform significantly better when post-trained against the specific harness (tools, context structure, system prompts) they will run in production. If you are using off-the-shelf models with custom tooling, budget for post-training or at minimum prompt engineering that aligns with the model’s trained tool format.
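The KV cache management mentioned in the first insight above is easy to size on the back of an envelope: per-token KV bytes are 2 (one K and one V tensor) times layers times KV heads times head dimension times bytes per value. The model configuration below is illustrative, loosely shaped like a 70B-class model with grouped-query attention:

```python
# Back-of-envelope KV-cache sizing. Assumed config: 80 layers, 8 KV heads
# (GQA), head_dim 128, FP16 values -- illustrative, not any specific model.
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_val: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_val * tokens / 1e9

print(f"{kv_cache_gb(128_000, 80, 8, 128):.1f} GB")  # ~41.9 GB at 128k context
```

Numbers like this explain why KV cache offload, reuse, and routing (all Dynamo concerns) dominate long-context serving economics: one long agent session can rival the weights themselves in memory.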
Career Advice
- At NVIDIA, momentum is the only authority. You can propose projects by emailing anyone, including VPs, and just start building. If you show progress and something people can try, it spreads. This applies broadly: in any organization, demonstrating working software is more persuasive than asking permission.
- Follow your passion within large organizations. Kyle moved from recommender systems to graph neural nets to Dynamo largely by self-direction. NVIDIA’s culture explicitly encourages this. Look for similar latitude in your own org.
- Inject your genuine voice into your work. Nader credits early advice to stop stripping personality from his writing. Authenticity in developer content, marketing, and communication is a real competitive advantage.
- The “try again” principle applies everywhere. A recent Google paper showed double-digit accuracy gains from simply retrying prompts with failure context. This basic approach — self-distillation, retry with feedback — works both for AI systems and career problem-solving.
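The "try again" principle from the last bullet maps directly to a small control loop: on failure, append the failure context to the next attempt. This is a generic sketch of the pattern described in the episode, not the cited paper's method; `run_attempt` is a stand-in for any model or tool call:

```python
# Retry-with-feedback sketch: each failed attempt's error is fed back into
# the next prompt, so the model can self-correct instead of repeating itself.
def retry_with_feedback(run_attempt, prompt: str, max_tries: int = 3):
    feedback = ""
    for _ in range(max_tries):
        ok, result = run_attempt(prompt + feedback)
        if ok:
            return result
        feedback = f"\nPrevious attempt failed: {result}. Try again."
    return None  # exhausted retries

# Toy harness: only succeeds once the failure context is present in the prompt.
def toy(p):
    return ("failed" in p, "fixed" if "failed" in p else "bad output")

assert retry_with_feedback(toy, "do the task") == "fixed"
```

The same loop works with a real model call in `run_attempt`; the gains reported in the episode came from nothing more elaborate than this.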
Stocks and Investments Mentioned
- NVIDIA (NVDA) is the central company discussed. Key signals: massive internal adoption of coding agents (Codex deployed to tens of thousands), Dynamo delivering 35x cost improvements on Blackwell, Rubin CPX announced as a pre-fill-specific accelerator, acquisition of Groq (LPU), and continued investment in “zero billion dollar markets.” The DGX Spark ($3,999 consumer GPU box) and RTX Pro 6000 (~$8,000, 96GB VRAM) represent new consumer/prosumer revenue streams.
- The broader inference infrastructure thesis is strongly bullish: token demand is growing exponentially, agents are consuming orders of magnitude more compute, and every improvement in cost/latency only increases demand further. This supports long-term GPU demand growth regardless of which model providers win.