Baseten CEO Tuhin Srivastava on the AI Inference Crunch, Custom Models, and Building the Inference Cloud
Most important takeaway
Inference is “the last market” — even in an AGI future, all that’s left is inference, and compute capacity is the dominant strategic asset. The application layer survives because user signal and workflow data feed reward functions for post-trained, specialized models, creating a tight loop between inference and post-training. Speed and access to compute (across many clouds, with operational excellence) beat any single technical edge in this moment.
Summary
Actionable insights for builders and operators
- Start with capability, then optimize. Customers nearly universally pick the best-in-class model first to prove there is something worth optimizing. Only after PMF should you move to post-training, quantization, and custom inference for cost/speed/quality.
- Own the user signal, not the model. The defensible part of an application company is the proprietary user signal it can encode into reward functions for post-training. Workflow integration (e.g., Abridge in EMRs, Decagon in support) is the moat — frontier labs cannot easily access that signal.
- Specialize models per task. A model that is great at customer support does not need to be great at coding. Specialized post-trained models are usually faster, cheaper, and better for a defined workflow.
- Treat inference and post-training as one loop. Inference produces eval data; evals produce reward functions; reward functions enable continual post-training. Quantization decisions depend on how the model was trained. Build (or buy) for the loop, not isolated stages.
- Compute is the scarce resource; plan financing around it. Securing 1,024 H200s now requires 3–5 year contracts with ~20–30% TCV prepay (see the worked example after this list). Inference businesses look more like infrastructure than SaaS; cost of capital and structured debt matter, and going public earlier may be necessary to fund supply commitments.
- Diversify across clouds early. Baseten runs across 18 clouds and 90 clusters with a single runtime fabric, allowing them to onboard a new provider in roughly half a day. Only ~3–4 providers are “gold tier” operationally; many newer GPU clouds lack SLA discipline.
- Don’t bet against NVIDIA short-term. CUDA, the developer ecosystem, and supply chain mastery make NVIDIA dominant for the next several years. Alt-chip ecosystems struggle when one buyer locks up most supply, killing the developer flywheel needed for adoption.
- Use Chinese open-source models without fear, but invest in US ones. Network-bounded models can’t exfiltrate data; the bigger risk is the US lacking a domestic open-source frontier. Effectively, Chinese government subsidies on these models flow through to US enterprises adopting them.
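To make the financing point concrete, here is back-of-the-envelope math for such a contract. The $2.50/GPU-hour reserved rate and 4-year term are illustrative assumptions, not figures from the conversation; only the fleet size and prepay range come from it.

```python
# Rough math for a reserved-capacity contract (all rates assumed).
gpus = 1024                  # the H200 fleet size mentioned above
rate_per_gpu_hour = 2.50     # assumed reserved $/GPU-hour, illustrative only
hours_per_year = 24 * 365
years = 4                    # within the 3-5 year range

tcv = gpus * rate_per_gpu_hour * hours_per_year * years
prepay = 0.25 * tcv          # midpoint of the ~20-30% prepay range

print(f"Total contract value: ${tcv / 1e6:.0f}M")     # ~$90M
print(f"Upfront prepay:       ${prepay / 1e6:.0f}M")  # ~$22M
```

Even one mid-sized reservation means tens of millions of dollars committed up front, which is why cost of capital and structured debt loom so large.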
Career and team-building advice
- Hire leaders, not just engineers, earlier than you think. Baseten stayed flat too long. Founders feeling the need to be in everything is a signal they don’t have the right people, not that they need to micromanage harder.
- Be ruthlessly explicit about what you optimize for. Generic “smart, hardworking” criteria don’t filter. Baseten optimizes for first-principles thinkers who are kind, low-ego, collaborative, and don’t need to be managed. The clarity makes both fit and misfit obvious fast — and reduces turnover.
- Build for the highest-scale customer. Following the Stripe playbook: serve the frontier customer who pushes you technically, and the enterprise requirements translate down through them (data retention, SLAs, latency, GPU types).
- Cultivate an operations culture if you are running infrastructure. Pages going off mid-meeting is normal. The culture self-selects: engineers who avoid being on-call leave quickly. Speed of response is the cultural muscle.
- Move fast — the market rewards aggression. In a market this big, the answer is always “go faster.” Exhausting but the right call.
Tech patterns to watch
- Runtime fabric across many clouds with abstracted capacity, failover, and reliability — letting you treat 18 clouds as one inference substrate.
- Disentangling pre-fill and decode as separate problems, with inference-specific (even decode-specific) chips emerging.
- KV cache-aware routing, speculation techniques, and continued runtime maturation for LLM serving (a routing sketch follows this list).
- Async batch inference as a first-class product to drive utilization.
- Sandboxes for code agents as a core inference-cloud primitive.
- Continual learning APIs that turn training into a continuous, not discrete, process.
- Jevons paradox confirmed for inference — cheaper inference does not reduce consumption; developers extend agent runtimes and insert more intelligence whenever cost drops.
- Edge cases at scale are mostly systems-level (kernel panics from log workers, hyperscaler primitives hitting unforeseen limits) rather than LLM-specific.
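Of these, KV cache-aware routing is easy to show in miniature: steer requests that share a prompt prefix to the replica most likely to still hold that prefix's KV cache, so prefill work can be skipped. All names below are hypothetical; a production router would also weigh load, cache eviction, and session affinity.

```python
import hashlib
from collections import defaultdict

class PrefixAffinityRouter:
    """Toy KV cache-aware router: requests sharing a prompt prefix are
    pinned to the replica that likely still holds that prefix's KV cache."""

    def __init__(self, replicas, prefix_tokens=64):
        self.replicas = replicas
        self.prefix_tokens = prefix_tokens
        self.affinity = {}            # prefix hash -> replica
        self.load = defaultdict(int)  # replica -> in-flight requests

    def _prefix_key(self, token_ids):
        head = bytes(t % 256 for t in token_ids[: self.prefix_tokens])
        return hashlib.sha256(head).hexdigest()

    def route(self, token_ids):
        key = self._prefix_key(token_ids)
        replica = self.affinity.get(key)
        if replica is None:           # no affinity yet: pick least-loaded
            replica = min(self.replicas, key=lambda r: self.load[r])
            self.affinity[key] = replica
        self.load[replica] += 1
        return replica

router = PrefixAffinityRouter(replicas=["gpu-0", "gpu-1", "gpu-2"])
shared_prompt = list(range(200))      # stand-in for a tokenized system prompt
print(router.route(shared_prompt + [7]))  # both requests land on the same
print(router.route(shared_prompt + [9]))  # replica, so the cache can be reused
```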
Chapter Summaries
Scale and the state of inference
Baseten has grown 30x in the last year and is on track for over $1B in revenue. The drivers: the application layer expanding rapidly, customers in-housing intelligence, open-source models crossing a quality threshold, and post-training/RL becoming mainstream.
Why the application layer survives
Frontier labs can’t easily replicate companies with deep workflow integration and proprietary user signal (Abridge, Decagon, Open Evidence). That signal becomes the reward function for specialized post-trained models. By inference count, ~99% of the market hasn’t come online yet — enterprise adoption is still ahead.
Serving frontier-scale customers
Building for the most demanding customer (Stripe’s playbook) future-proofs the platform. Even though Baseten doesn’t sell directly to enterprises, its customers do, so enterprise requirements (data retention, latency, GPU types, transparency) translate through naturally.
Open-source model dynamics and Chinese models
Customers want frontier capability regardless of origin. Chinese models like DeepSeek can run at ~20% the cost of comparable closed models with similar latency. The US needs domestic open-source frontier capability or it loses access to a critical input. Network-bounded deployments mitigate security risk.
Workload composition: it’s all custom now
~95% of Baseten’s tokens run on dedicated inference where customers have modified the model — for quality and increasingly for performance via custom compilation. The Pi acquisition added research/post-training expertise; post-training and inference are tightly coupled (e.g., quantization depends on training choices).
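A minimal sketch of why quantization depends on training choices: naive per-tensor int8 quantization sets its scale from the largest-magnitude weight, so a model whose training produced outlier weights loses far more precision than one trained to keep ranges tight. This is illustrative numpy only; real pipelines use per-channel scales, calibration data, and formats like FP8.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: one scale for the whole
    tensor, determined by the largest-magnitude weight."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale   # dequantized weights

rng = np.random.default_rng(0)
tight = rng.normal(0, 0.02, 4096)         # well-behaved weight distribution
outliers = tight.copy()
outliers[:4] = 2.0                        # a handful of outlier weights

for name, w in [("tight range", tight), ("with outliers", outliers)]:
    err = np.abs(w - quantize_int8(w)).mean()
    print(f"{name:13s} mean abs error: {err:.6f}")
# The outliers inflate the scale ~25x, so the bulk of the weights fall
# into far coarser buckets and quantization error jumps accordingly.
```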
When to invest in custom models
Only after proving PMF with a best-in-class model. Specialized models win when you have user signal and a clear value-creating workflow to optimize against.
The capacity crunch
Slack compute is essentially nonexistent. Baseten runs at “midnight utilization” most of the time, spans 18 clouds and 90 clusters, and can onboard new providers in half a day. A daily 4 PM standing meeting manages capacity-vs-demand. Many newer suppliers are operationally immature; only ~3–4 are “gold tier.” Securing supply now requires 3–5 year contracts with 20–30% TCV prepay, pushing inference companies toward earlier IPOs and structured debt financing.
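A stripped-down sketch of what a single runtime fabric over many clusters might look like: capacity abstracted behind one interface, with placement preferring operationally mature providers and failing over when a cluster is unhealthy or full. The tiering and names are hypothetical, not Baseten's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    provider: str
    tier: int          # 1 = "gold tier" operational maturity
    free_gpus: int
    healthy: bool = True

class Fabric:
    """Toy capacity fabric: treat many clusters across many clouds as one
    pool, preferring mature providers and failing over past bad capacity."""

    def __init__(self, clusters):
        self.clusters = clusters

    def place(self, gpus_needed):
        ranked = sorted(self.clusters, key=lambda c: (c.tier, -c.free_gpus))
        for c in ranked:
            if c.healthy and c.free_gpus >= gpus_needed:
                c.free_gpus -= gpus_needed
                return c.name
        raise RuntimeError("no capacity anywhere")

fabric = Fabric([
    Cluster("us-east-a", "cloud-1", tier=1, free_gpus=8),
    Cluster("eu-west-b", "cloud-2", tier=1, free_gpus=0),
    Cluster("us-west-c", "cloud-3", tier=3, free_gpus=64),
])
print(fabric.place(16))   # gold-tier clusters are full, so tier 3 catches it
```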
Winning factors
Excellence at everything: software stickiness (no top-30 customer churn, ~400% NDR), access to compute as a strategic asset, and operational maturity. “You can’t make a good hot chocolate without milk” — owning compute is the foundation.
Multi-chip future
Diversification is desirable, and inference-specific (e.g., decode-specific) chips make sense. But NVIDIA’s CUDA, developer ecosystem, and supply chain mastery make displacement difficult in the near term. Alt-chip ecosystems often fail because one buyer ties up most supply, preventing the broader developer ecosystem from forming.
Workload evolution and product roadmap
Focus areas: runtime efficiency on diverse chips, diffusion transformers, sandboxes for coding agents, speculation techniques, KV-cache-aware routing, and disentangling pre-fill/decode. Strategy is to build the inference + post-training loop and partner around it (e.g., Braintrust for evals).
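Of those, speculation is the easiest to show in miniature: a cheap draft model proposes several tokens, the target model verifies them in one pass, and every accepted token comes nearly for free. The placeholder "models" below are plain functions, and the greedy exact-match acceptance is a simplification; real implementations accept or reject against the target's token distribution.

```python
def speculative_decode(target_step, draft_step, prompt, k=4, max_new=12):
    """Toy greedy speculative decoding: the draft proposes k tokens, the
    target verifies them, and we keep the longest matching prefix plus
    one corrected token per round."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        draft, ctx = [], list(tokens)
        for _ in range(k):                 # cheap sequential drafting
            t = draft_step(ctx)
            draft.append(t)
            ctx.append(t)
        accepted, ctx = [], list(tokens)   # in real systems, one batched
        for t in draft:                    # target pass scores all k slots
            expected = target_step(ctx)
            if t != expected:
                accepted.append(expected)  # correct the first mismatch
                break
            accepted.append(t)
            ctx.append(t)
        tokens += accepted
    return tokens[len(prompt):]

# Placeholder models: the draft agrees with the target 3 rounds out of 4.
target_step = lambda ctx: len(ctx) % 7
draft_step = lambda ctx: len(ctx) % 7 + (0 if len(ctx) % 4 else 1)

print(speculative_decode(target_step, draft_step, prompt=[1, 2, 3]))
```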
Edge cases at scale
Hyperscalers don’t actually support infinite scale — every edge case eventually shows up. Recent example: a kernel panic caused by Fluent Bit log workers overwhelming a node. Most surprises are systems-level rather than LLM-specific. LLM runtimes (KV cache handling, etc.) remain immature.
Scaling the team
Baseten stayed too flat for too long. Hiring senior leaders (Danny, Samir, Stephen) was key. Two principles: (1) hire people you can hand a whole problem to — if you’re micromanaging, that’s a founder signal you have the wrong people; (2) be very specific about what you optimize for (first-principles thinkers, low-ego, collaborative, no hero culture). Clear rubric reduces turnover.
Operations culture
Pages going off in meetings is normal in infrastructure. The culture self-selects against engineers who avoid being on-call. P0 means everyone is on the call.
Jevons paradox in inference
Confirmed: cheaper intelligence drives more consumption. Agents run longer, developers insert more intelligence, customers always want better answers. Even in an AGI world, “all that’s left is inference.”
The future
Consumers get concierge-level personalized agents (doctor, education, life). For developers and incumbents, this is an extinction moment for those who don't embed intelligence into their workflows; overall, though, more software gets built, not less.