Moonlake: Causal World Models should be Multimodal, Interactive, and Efficient
Most Important Takeaway
Moonlake is building world models that prioritize causal reasoning, interactivity, and long-term consistency over pixel-level fidelity, arguing that symbolic abstractions combined with multimodal reasoning are fundamentally more efficient and capable than scaling up video diffusion models alone. Their approach uses a reasoning model that employs game engines and other tools as “cognitive tools,” enabling interactive worlds where actions have real consequences — something current video generation models like Sora cannot achieve.
Chapter Summaries
Origins of Moonlake
Fan-Yun Sun describes how the company emerged from his PhD work with NVIDIA Research on generating interactive worlds for embodied AI agents. He observed massive industry spending on interactive training environments and saw an opportunity to build reasoning models that understand the consequences of actions, rather than relying solely on video generation. Chris Manning joined, motivated by the pursuit of multimodal AI beyond language, noting that vision understanding has stalled despite decades of investment.
What Makes a World Model
The team defines a true world model as “action-conditioned” — it must predict how the world changes given some action. This is fundamentally different from video generation models (Sora, Veo 3) that produce visually impressive output but lack 3D understanding, object physics, or consequence prediction. Longer time horizons require abstracted semantic models rather than pixel-level prediction.
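The action-conditioned distinction can be made concrete with a toy sketch (all names here are illustrative, not Moonlake's API): a world-model step takes the current semantic state and an action, and the action changes the predicted outcome, whereas a video model maps past frames to the next frame with no action input at all.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorldState:
    # Toy semantic state: one object's position and velocity on a single axis,
    # standing in for abstracted state (object poses, physics), not raw pixels.
    pos: float
    vel: float

def step(state: WorldState, action: str, dt: float = 0.1) -> WorldState:
    """Action-conditioned prediction: the next state depends on the action."""
    if action == "push":
        vel = state.vel + 1.0 * dt          # the action has a consequence
    elif action == "brake":
        vel = max(0.0, state.vel - 2.0 * dt)
    else:                                    # "noop"
        vel = state.vel
    return WorldState(pos=state.pos + vel * dt, vel=vel)

s0 = WorldState(pos=0.0, vel=0.0)
pushed = step(s0, "push")   # object starts moving
idle = step(s0, "noop")     # world unchanged without an action
```

The same initial state yields different futures depending on the action taken — exactly the conditioning that action-free video prediction cannot express.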
Structure vs. Scale (The Bitter Lesson Debate)
Manning argues that while scale matters, working with symbolic abstractions can achieve results with five orders of magnitude less data than pure pixel-based approaches. They draw an analogy to human cognition: most visual input is never deeply processed; humans work with semantic abstractions. Sun clarifies they are not anti-bitter-lesson but argue that the right abstraction level matters — training on next-byte prediction would be the most bitter-lesson-pure approach, but it is computationally infeasible.
Philosophical Differences with Yann LeCun and JEPA
Manning directly contrasts their approach with LeCun’s, arguing LeCun fundamentally undervalues language and symbolic representations. He makes the evolutionary argument that language gave humans a “cognitive tool” (borrowing Dan Dennett’s term) that vaulted intelligence beyond what chimps can achieve with vision alone. On JEPA specifically, Manning agrees joint embeddings are valuable but argues that transformer weights already serve as a form of joint representation.
Interactive Reasoning Traces and the Bowling Demo
The team walks through detailed reasoning traces showing how their model builds an interactive bowling game — handling physics, scoring, audio triggers, pin dynamics, and timer mechanics. This demonstrates interactivity that competitors like Google’s Genie and World Labs’ Marble demos lack.
Referee: The Diffusion-Based Renderer
Moonlake trains two models: a multimodal reasoning model for causality and logic, and “Referee,” a diffusion model that re-styles the persistent world representation into photorealistic or arbitrary visual styles. This separates world understanding from visual fidelity, with Referee respecting the world state while improving pixel quality. They position this as a replacement for traditional rendering pipelines and DLSS-style upscaling.
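A minimal sketch of that separation, with hypothetical names standing in for the reasoning model's persistent state and Referee's re-styling pass: the renderer decides only how the world looks, while every fact it shows comes from the world state.

```python
# Persistent world state owned by the reasoning model (illustrative values).
world_state = {
    "pins_standing": 7,
    "ball": {"pos": (0.0, 2.5), "vel": (0.0, -4.0)},
}

def render(state: dict, style: str) -> str:
    # Stand-in for a diffusion re-styling pass: output varies with style,
    # but the facts (7 pins standing) come only from the world state.
    return f"[{style}] lane with {state['pins_standing']} pins standing"

photoreal = render(world_state, "photorealistic")
cartoon = render(world_state, "cartoon")
# Both renderings agree on world facts; only presentation differs.
```

Swapping the renderer never changes what is true in the world — which is why stylization can be treated as a separable "skin" layer.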
Evaluating World Models
Evaluation is extremely difficult because metrics depend entirely on use case. For games, it is time spent in-world; for embodied AI, it is policy robustness in target environments. Manning compares this to the broader challenge of evaluating LLMs on open-ended tasks. People “vote with their feet.”
Audio and Multimodal Integration
Moonlake’s spatial audio is enabled by the game engine tools available to the reasoning model, producing audio that is spatially consistent with the world state — something video generation models with bolted-on TTS cannot achieve. They are training toward a combined latent representation across the audio, language, and video modalities.
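As a rough illustration of what "spatially consistent" means (a toy stereo-panning calculation, not Moonlake's actual audio pipeline), per-ear gains can be derived directly from positions stored in the world state:

```python
import math

def spatialize(source_pos, listener_pos):
    """Toy spatial-audio sketch: derive left/right gains from world-state
    positions, so a sound stays consistent with where its source actually is."""
    dx = source_pos[0] - listener_pos[0]
    dy = source_pos[1] - listener_pos[1]
    dist = math.hypot(dx, dy)
    attenuation = 1.0 / (1.0 + dist)           # farther sources are quieter
    pan = 0.0 if dist == 0 else dx / dist      # -1 = hard left, +1 = hard right
    left = attenuation * (1.0 - pan) / 2.0
    right = attenuation * (1.0 + pan) / 2.0
    return left, right

# A pin falling to the listener's right should be louder in the right ear.
left, right = spatialize(source_pos=(3.0, 0.0), listener_pos=(0.0, 0.0))
```

Because the gains are a function of the world state, moving the source or the listener automatically moves the sound — the consistency that bolted-on TTS lacks.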
Chris Manning’s Career Arc
Manning traces his path from language-focused NLP through visual question answering (where he noticed vision understanding was poor) to generative AI and world models. He mentions connections to students like Demi Guo (Pika founder) and Andrej Karpathy, and joint language-vision work with Richard Socher.
Hiring and Company Vision
Moonlake is ~18 people based in San Mateo (moving to SF), hiring people at the intersection of code generation, computer vision, and computer graphics. The three-year vision is a platform where users specify goals (teach kids humility, train rescue drones, fine-tune vacuum robots) and the world model generates appropriate training/evaluation environments.
The Name “Moonlake”
Inspired by DreamWorks and Industrial Light & Magic vibes, the “moon” represents reflection — both the visual concept and the self-improvement loop they see as the path to multimodal general intelligence.
Summary
Actionable Insights:
- The symbolic + neural hybrid approach is gaining momentum. Moonlake’s bet on combining symbolic reasoning with diffusion models mirrors a broader trend (Physical Intelligence made similar observations about storing text for long-term memory). Engineers and researchers should consider structured, abstracted representations rather than defaulting to end-to-end pixel-based approaches for spatial and causal reasoning tasks.
- Action-conditioned data is the bottleneck, not just video data. Observational video data (scraped from the internet) lacks action labels. Companies collecting or generating action-conditioned data (via simulation or robotics) hold a significant advantage. If you are building in embodied AI or world simulation, prioritize environments where actions and consequences are explicitly labeled.
- Rendering is being unbundled from world state. Moonlake’s Referee model treats rendering as a separable “skin” layer on top of a persistent world model. This suggests opportunities in stylization, customization, and programmable rendering that could disrupt traditional game engines and DLSS-style upscaling.
- Evaluation of world models is an unsolved and critical problem. There are no established benchmarks. Companies building in this space should develop domain-specific evaluation frameworks tied to their actual use cases rather than waiting for universal benchmarks.
- Career advice (implicit): The most valuable skill profile for this emerging field is the intersection of code generation, computer vision, and computer graphics. People who have written game engines, trained multimodal models across different objectives, or done multimodal latent space alignment are in especially high demand.
No specific stocks or investments were mentioned in this episode.