Notion's Token Town: 5 Rebuilds, 100+ Tools, MCP vs CLIs and the Software Factory Future — Simon Last & Sarah Sachs of Notion
Most important takeaway
Notion’s AI team has rebuilt its agent harness five times in three years, and the single biggest lesson is to design around what the model actually wants (Markdown, SQL-like queries, goal-driven tools) rather than around your internal data model, then distribute tool ownership across product teams so the agent scales with the company. Their core strategic bet is that Notion wins by being the system of record for enterprise work, so every feature (custom agents, meeting notes, search) is built to make Notion the place where humans and agents collaborate; the bet is explicitly not on training proprietary frontier models.
Summary
Actionable insights
- Build for where the river is flowing, but don’t swim upstream. The Notion AI team’s core skill is distinguishing “the model can’t do this yet” from “we haven’t set up the right infrastructure/context.” If you’re pressing against raw capability, stop; if you’re just missing scaffolding, keep going. Take a portfolio approach: ship things that work now, maintain shipped things, and always keep a few “AGI-pilled” bets 12–18 months out.
- Cater to the model, not your internal data model. Their biggest wins came from scrapping Notion-flavored XML and custom JSON query languages in favor of plain Markdown and SQL-like queries, because those are formats the models already know. Lesson: never expose the model to unnecessary complexity from your system.
- Move from few-shot prompts to goal-driven tool definitions. This was the single biggest velocity unlock — it let them distribute tool ownership across product teams instead of having 5–6 people gatekeeping one giant system-prompt string. If you’re still hand-tuning few-shot examples, plan to rip them out.
- Treat your eval system as an agent harness. Build it so an agent can end-to-end download a dataset, run evals, debug failures, and implement a fix, with humans observing the outer loop. Have three eval tiers: (1) unit/regression tests in CI, (2) launch-quality report cards at 80–90%, and (3) “frontier/headroom” evals you want to be failing at ~30% so you can give frontier labs real feedback.
- Hire “Model Behavior Engineers” — and don’t require engineering backgrounds. Notion staffs a dedicated data-scientist + MBE + evals-engineer trio on the “Notion’s last exam” frontier evals. They explicitly welcome linguistics PhDs, new grads, “misfits.” If you’re building AI products, this is a missing role in most orgs.
- Make demos, not memos. Notion’s design team maintains a separate GitHub repo (“design playground”) of prototype components and ships working prototypes with URLs instead of mocks. For engineers, a prototype = a real feature flag. But raise product-conviction filters accordingly, or you’ll build a flat hill instead of one tall tower.
- Involve security review early, not late. Notion brings security in on day one of a project because late involvement “slows us down way more and causes a lot of tension” — and the product is better when security is a design partner.
- CLIs > MCPs for powerful/bootstrapping agents; MCPs > CLIs for narrow, tightly-permissioned ones. Simon is bullish on CLIs because they live in a terminal environment with progressive disclosure and self-bootstrap (the agent can fix its own broken tools). MCPs win when you want a lightweight agent with a strong permission model. Don’t view them as rivals — they’re different layers of the stack.
- Use deterministic code paths where you can; save model tokens for tasks that truly need them. Sarah explicitly flags that calling an LLM to interface with deterministic third-party APIs is “wasteful” — both a bad deal for customers and bad business. Prefer straight API calls over MCPs over open-ended agents, in that order, whenever possible.
- Design your “software factory” around three primitives. (1) A human-readable, mutable spec layer (Markdown files / a Notion database), (2) a strong self-verification/testing loop, (3) a clear bug-flow workflow (sub-agent → PR → review → merge).
- Compose agents via the system of record, not bespoke message buses. When one Notion employee built 30+ custom agents and was drowning in 70 notifications/day, the fix was a “manager agent” that watches an internal-issue-tracker database the other agents write to. Use databases and pages as the coordination substrate — don’t invent new primitives unless you have to.
- For custom agents, the editing chat and the using chat should be the same chat. Notion’s “flippy” redesign — where you talk to the agent to configure it, test it, and use it in one surface — delayed launch by a month but is clearly the right pattern. If your builder UX is separated from your usage UX, reconsider.
- Don’t assume all tokens are worth the same. Notion sells usage-based credits (not raw tokens) so they can abstract over GPU-hosted open-source models, sandboxes, priority tiers, and cache hit rates — and so enterprise sales motions work. But they deliberately do not price per task value because the complexity isn’t worth it yet.
- Pick the right model for the job — “auto” is not just the cheapest. Notion’s “auto” picker is positioned as the robo-advisor: the model best suited to the task, not the cheapest demo model. If you’re building agent products, invest in the routing layer; the frontier labs are clustering at “very capable / very expensive,” leaving a no-man’s-land in the middle that open-source models (Minimax etc.) are filling.
- Don’t train your own frontier model — but do invest in ranking and retrieval. Simon burned time on fine-tuning and concluded it’s the wrong tradeoff when your tools change daily. The exception: retrieval. Notion is hiring ranking/model-training engineers specifically because agent-driven search has different requirements than human search (top-k recall matters far more than positional ranking; query diversity in parallel fan-out matters more than embedding choice).
- Meeting notes as a data-capture primitive. Notion’s meeting notes is one of their biggest growth levers precisely because it creates a self-reinforcing data flywheel for the agent. If your product depends on user context, invest in the capture surface.
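The shift from few-shot prompts to goal-driven tool definitions (and the distributed ownership it enables) is easier to see in code. A minimal sketch, with an entirely hypothetical schema and registry, not Notion's actual format: instead of a few people hand-tuning one giant system-prompt string, each product team registers a tool described by its goal and parameters.

```python
# Hypothetical sketch of goal-driven tool definitions (not Notion's actual
# schema): each tool carries a goal statement instead of few-shot examples,
# and any team can register one without touching a shared prompt string.

TOOL_REGISTRY: dict[str, dict] = {}

def register_tool(name: str, goal: str, parameters: dict) -> None:
    """Let a product team add a tool independently of other teams."""
    TOOL_REGISTRY[name] = {"name": name, "goal": goal, "parameters": parameters}

# Two different teams own their own tools independently.
register_tool(
    "search_pages",
    goal="Find workspace pages relevant to the user's question.",
    parameters={"query": "string", "limit": "integer"},
)
register_tool(
    "create_task",
    goal="File a task in the team's tracker database.",
    parameters={"title": "string", "assignee": "string"},
)

def render_tools_for_model() -> str:
    """Serialize the registry into the model's context, one goal per tool."""
    return "\n".join(
        f"- {t['name']}: {t['goal']} (params: {sorted(t['parameters'])})"
        for t in TOOL_REGISTRY.values()
    )

print(render_tools_for_model())
```

The design point is the ownership boundary: the registry, not a monolithic prompt, is the unit teams edit.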
Career advice
- Engineering leadership in the AI era is non-hierarchical. Sarah explicitly rejects the “ideas person / technical expert” model of leadership. Her job is to make sure everyone understands the objective, has resources to prioritize, and has an avenue to escalate what they think is important. Most of their best ideas come from prototypes by people who saw a user problem.
- Build a team comfortable deleting its own code. The second rule of engineering leadership at Notion: low ego, driven by what’s best for the company, willing to redo work because the landscape changed. Avoid engineers who “redesign docs because they think it’s their promotion packet.”
- Every software engineer is going through the identity crisis managers already went through. The ability to write code is becoming less important than the ability to delegate-in-context. But unlike managing humans, delegating to agents is actually a rigorous technical design problem — so this is still deeply technical work, just at a higher abstraction level.
- Leverage now accrues to the most curious and excited people. If someone is prototyping something on a weekend because they care about it, that should become the main thing the team is doing. Hackathons are useful for uplifting the general population, but if they’re your only mechanism for innovation, “you’re toast.”
- Model Behavior Engineer is a real career path — and it doesn’t require a CS degree. If you have a linguistics / data-science / PM / prompt-engineering instinct for what models can and can’t do, this is a growing function at top AI teams. Sarah is actively hiring MBEs.
- For AI engineers specifically: stop fine-tuning, start thinking about outer loops. Simon’s advice: when something fails, 99% of the time it’s a bug in one of the tools, not the model. Focus energy on the harness, tool quality, eval velocity, and verification — not on training runs.
Stocks / investments mentioned
No specific public stocks or investments are recommended or actionable. Companies mentioned in passing (for context, not as investment recommendations):
- Notion — private; guests are employees.
- Datadog — used as an analogy (“Datadog on AWS”) for how Notion is a polished wrapper on frontier-lab capability, not a competitive recommendation.
- Robinhood — Sarah’s prior employer; used as an analogy for auto-investing / robo-advisors applied to model routing.
- Anthropic, OpenAI, Google/Gemini, Fireworks, Minimax, Bedrock, Azure — mentioned as model/API vendors Notion partners with; no investment angle.
- Embra — mentioned because Notion’s meeting-notes team manager Zach joined from there; no investment angle.
The one arguably actionable investing mental model in the episode: Sarah notes that frontier labs are clustering at the “very capable / very expensive” end of the intelligence-price-latency triangle, leaving a middle gap that open-source model providers are filling. If you’re making build-vs-buy or vendor decisions, the implication is to plan for multi-model optionality and watch open-source labs closely — not to pick a single frontier-lab bet.
Chapter Summaries
1. Custom agents launch & the “too early” years (2022–2024)
Notion just shipped custom agents — their most successful launch in terms of free trials and conversion. But this was the fourth or fifth rebuild; they started trying to build an agent with GPT-4 access in late 2022, back when “tools” didn’t exist and they were fine-tuning their own function-calling format with OpenAI and Fireworks. Models were too dumb, context was too short, and “glimmers” of working kept them going. The real unlock came with Sonnet 3.6/3.7 in early 2025, followed by a year of reliability and permissions work for background execution.
2. Balancing AGI-pilled bets with shipping
Simon and Sarah describe their portfolio approach: maintenance, near-term shipping, and a few “crazy” long-term bets. Two skills are crucial for building on future capabilities: (1) knowing when you’re swimming upstream against model limits vs. just missing infrastructure, and (2) building the product ahead of the capability so you’re ready when the model catches up. They pitch “coding agents as the kernel of AGI” and the “software factory” — an automated dev → debug → merge → review → maintain loop where agents collaborate inside.
3. The “agent lab” thesis — why Notion isn’t a wrapper
Sarah uses the Datadog/AWS analogy: Datadog couldn’t exist without cloud storage, but AWS’s own CloudWatch doesn’t beat it because Datadog is an expert on how people want observability. Notion’s expertise is how people want to collaborate. Unlike vertical SaaS, Notion is horizontally sliced, so they balance customer feedback with decomposition into clean primitives. They still focus on user journeys (email triage, PDF export) rather than tool-for-tool-sake.
4. Team culture & how they ship
Sarah’s “Tao of engineering leadership”: non-hierarchical, resource-allocating, willing to change direction based on prototypes. Second rule: build a team that will delete its own code. Simon’s role is more AGI-forward prototyping. They run small “Simon vortex” teams with rotating senior engineers, company-wide hackathons for literacy, and “demos over memos”: the design team ships working prototype URLs, and engineers ship working prototypes behind real feature flags.
5. Evals as an agent harness
Sarah passionately separates “evals” into three tiers: CI unit/regression tests, launch-quality report cards (80–90% pass), and frontier/headroom evals at 30% pass rate (this is where “Notion’s last exam” lives). They dedicate a data scientist, a Model Behavior Engineer, and an evals engineer full-time to the frontier tier. Simon’s insight: treat the eval system itself as an agent harness so agents can download datasets, run evals, debug, and fix. They’ve also caught vendors secretly quantizing models by running evals across multiple providers of “the same” model.
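The three-tier structure can be sketched as a tiny classifier. The thresholds are the ones mentioned in the episode; the function and status names are illustrative assumptions, not Notion's tooling.

```python
# Sketch of the three eval tiers: CI regression tests, launch report cards,
# and frontier/headroom evals that are *supposed* to be failing.

def eval_status(tier: str, pass_rate: float) -> str:
    """Classify an eval run by tier (thresholds from the episode).

    - 'ci': unit/regression tests, any failure blocks the merge
    - 'launch': quality report card, ship at roughly 80-90%
    - 'frontier': headroom evals targeting ~30% pass, so there is still
      real signal to feed back to the frontier labs
    """
    if tier == "ci":
        return "pass" if pass_rate == 1.0 else "block_merge"
    if tier == "launch":
        return "ship" if pass_rate >= 0.80 else "hold"
    if tier == "frontier":
        # A high pass rate here means the suite is saturated and needs
        # harder cases, not that the product is done.
        return "has_headroom" if pass_rate <= 0.5 else "saturated"
    raise ValueError(f"unknown tier: {tier}")

print(eval_status("launch", 0.86))    # ship
print(eval_status("frontier", 0.31))  # has_headroom
```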
6. Model Behavior Engineer as a career path
The MBE role started as “data specialists” — linguistics PhDs and new grads manually inspecting outputs — and evolved into a hybrid of data science, PM, prompt engineering, and taste. Sarah firmly believes you don’t need an engineering background; they welcome “misfits.” Now MBEs build agents that write evals and LLM-judges for themselves. Sarah is actively hiring.
7. Are software engineers still needed?
Simon argues we’re on a continuum from typing code → autocomplete → Copilot → long-range agentic PRs; the human role moves to observing/maintaining the outer loop. Sarah adds that every Notion engineer is going through the identity crisis managers already went through — the skill shifts from writing code to delegating-in-context, which is still a deeply technical design problem.
8. The “software factory” vision
Simon outlines three required primitives: (1) a human-readable/mutable spec layer (Markdown or Notion pages), (2) a strong self-verification testing loop, (3) a clear bug-flow workflow (how does a bug become a sub-agent task, a PR, a review, a merge?). The goal is to minimize human intervention while preserving the variances you care about.
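The third primitive, the bug-flow workflow, is essentially a fixed pipeline with humans observing the outer loop. A minimal sketch; the state names are assumptions, not Notion's actual workflow.

```python
# Illustrative bug-flow state machine: a bug moves through a fixed
# pipeline (sub-agent task -> PR -> review -> merge), and a human only
# needs to watch the transitions, not drive each one.

PIPELINE = ["filed", "subagent_assigned", "pr_open", "in_review", "merged"]

def advance(state: str) -> str:
    """Move a bug one step along the pipeline; 'merged' is terminal."""
    i = PIPELINE.index(state)
    return PIPELINE[min(i + 1, len(PIPELINE) - 1)]

state = "filed"
while state != "merged":
    state = advance(state)
    print(state)  # each transition is where a human can observe/intervene
```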
9. Token Town demo — building an agent live
Simon walks through a custom agent he built for “Kernel Labs” tenant applications: it receives emails, creates a Notion database row, web-searches to enrich each applicant profile, and routes. ~15 minutes to set up. Key design point: Notion is already the system of record, so data should live there, not in a third-party tool. Biggest internal use case: bug triage — a custom agent in Slack routes issues to the right team and files database tasks, which “completely changed how Notion functions.”
10. Composing agents via the system of record
When a Notion employee built 30+ custom agents and was getting 70 notifications/day, Simon helped them add a “manager agent” watching an internal-issue-tracker database the other agents write to. The agents coordinate via Notion databases — no special message-bus primitive. Same pattern for memory: no built-in memory concept, just give agents access to a page. “Compose the primitives if you can.”
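The coordination pattern above amounts to: workers append rows to a shared database, and a manager agent polls it and batches a digest instead of pinging a human dozens of times a day. A self-contained sketch with a plain list standing in for a Notion database; all names are hypothetical.

```python
# Sketch of agent coordination via a shared tracker database rather than
# a bespoke message bus. Worker agents write rows; a manager agent reads
# unseen rows and emits one summary.

tracker: list[dict] = []  # stand-in for a Notion issue-tracker database

def worker_report(agent: str, issue: str) -> None:
    """A worker agent files a row instead of notifying the human."""
    tracker.append({"agent": agent, "issue": issue, "seen": False})

def manager_digest() -> str:
    """The manager agent batches everything unseen into one message."""
    unseen = [r for r in tracker if not r["seen"]]
    for r in unseen:
        r["seen"] = True
    agents = {r["agent"] for r in unseen}
    return f"{len(unseen)} new issues from {len(agents)} agents"

worker_report("bug-triage", "crash on export")
worker_report("support-bot", "billing question")
print(manager_digest())  # 2 new issues from 2 agents
```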
11. Notion Mail, Calendar & native vs MCP integrations
Notion builds natively first (Mail, Calendar) so they can hand-tune tool latency and quality; they use MCP when they want the long tail or when third-party capabilities are good enough. Search is the biggest example of not using third-party MCPs (Slack/Linear/Jira/Notion search) because agent search trajectories are too critical to hand off.
12. CLI vs MCP — the long segment
Simon is bullish on CLIs for three reasons: (1) terminal environment gives extra power (pagination, progressive disclosure), (2) bootstrap ability — if a tool breaks, the agent can fix itself in the same environment, (3) flexibility (e.g., an agent giving itself a browser in 100 lines). MCPs win for narrow, lightweight, tightly-permissioned agents where the strong permission model matters. Sarah adds the pricing angle: calling an LLM to wrap deterministic third-party APIs is wasteful for both customers and business; Notion wants the right tool at the right layer. They argue there’s no fundamental conflict — different layers of the stack.
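The "progressive disclosure" advantage of CLIs is concrete: the agent first sees a cheap one-line-per-command surface and only spends context on a tool's full help when it drills in. A toy sketch with hypothetical commands:

```python
# Sketch of progressive disclosure in a CLI: top-level help is a short
# token-cheap listing; full detail is loaded only on request.

COMMANDS = {
    "search": {
        "summary": "Search pages",
        "detail": "usage: search <query> [--limit N] [--scope workspace|page]",
    },
    "export": {
        "summary": "Export a page",
        "detail": "usage: export <page-id> --format md|pdf",
    },
}

def run(argv: list[str]) -> str:
    if argv == ["help"]:
        # Top level: one line per command, cheap on context.
        return "\n".join(f"{n}  {c['summary']}" for n, c in COMMANDS.items())
    if argv[:1] == ["help"]:
        # Drill-down: full usage only for the requested command.
        return COMMANDS[argv[1]]["detail"]
    return f"running {argv[0]}"

print(run(["help"]))
print(run(["help", "search"]))
```

The same shape also explains the bootstrap point: because the tool lives in the agent's own environment, a broken `detail` string (or a broken command) is something the agent can read and patch in place.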
13. The five (plus) rebuilds — a brief history
- Coding-engine version — everything is JavaScript. Failed because models couldn’t code well enough.
- XML tool-calling version — lossless mapping to Notion blocks. Failed because models didn’t know the format.
- Notion-flavored Markdown version — big unlock: cater to what the model knows.
- SQL-like database queries — scrapped the JSON format, use SQL. Models are excellent at it.
- Few-shot → goal-driven tool definitions — the biggest velocity unlock; let them distribute tool ownership across product teams. Previously 5–6 people gatekept one system-prompt string; now teams own their own tools.
- Progressive disclosure for 100+ tools — shipping next week; fixes the problem of new tools nerfing the overall model.
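The SQL rebuild is the clearest instance of "cater to what the model knows": a model can emit standard SQL against a database-shaped surface with no custom DSL to learn. A minimal demo using an in-memory SQLite table as a stand-in for a Notion database (the schema is illustrative, not Notion's):

```python
# Why SQL-like queries beat a bespoke JSON query language: the model
# already knows SQL cold. In-memory table stands in for a Notion database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (title TEXT, status TEXT, priority INTEGER)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?)",
    [
        ("Fix export bug", "open", 1),
        ("Update docs", "done", 3),
        ("Triage crash", "open", 2),
    ],
)

# The kind of query an agent can emit directly, no custom DSL required:
rows = conn.execute(
    "SELECT title FROM tasks WHERE status = 'open' ORDER BY priority"
).fetchall()
print(rows)  # [('Fix export bug',), ('Triage crash',)]
```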
14. Custom agent builder UX — “flippy”
Notion’s custom-agent builder lets you chat with the agent to configure, test, and use it in one surface. The agent sets up its own system prompt, debugs itself, and generates its own emojis. This redesign delayed launch by a month but is clearly the right pattern. Permission model: the agent can’t edit its own permissions unless you enter an explicit admin mode.
15. Pricing, credits, and “auto”
Notion sells usage-based credits rather than raw tokens because not all tokens are equal (GPUs for open-source, sandboxes, priority tiers, cache hit rates). They explicitly chose not to price per task value — too complex, not where the market is. Their “auto” model picker is positioned as a “robo-advisor” for model selection, not a margin-maker. Sarah notes frontier labs are clustering at “very capable / very expensive,” leaving a middle gap that open-source labs (Minimax etc.) are filling.
16. Why Notion won’t train its own frontier model
Simon has burned time on training and is done with it. The outer loop matters more than the inner model — 99% of failures are tool bugs, not model issues. The exception: retrieval. They’re hiring ranking/retrieval engineers because agent search is structurally different from human search (top-k recall matters more than positional ranking; parallel query diversity matters more than embedding choice). They’re also rethinking indexing for agent-generated content like meeting notes.
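The claim that agent search cares about top-k recall rather than positional ranking has a simple metric behind it: the agent reads all k results, so it only matters that the relevant pages land somewhere in the set. A sketch with made-up data:

```python
# Recall@k: the fraction of relevant items appearing anywhere in the
# top k. Order within the top k is irrelevant to an agent that reads
# everything it retrieves, unlike a human scanning a ranked list.

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items present in the first k retrieved."""
    hits = relevant & set(retrieved[:k])
    return len(hits) / len(relevant)

retrieved = ["p7", "p2", "p9", "p1", "p4"]  # relevant pages buried mid-list
relevant = {"p1", "p2"}
print(recall_at_k(retrieved, relevant, 5))  # 1.0 -- fine for an agent
print(recall_at_k(retrieved, relevant, 2))  # 0.5 -- one relevant page missed
```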
17. Meeting notes as a data-capture flywheel
Meeting notes is one of Notion’s biggest growth/retention levers. Sarah uses it for her own self-performance-review. It’s valuable because it creates a data flywheel that feeds back into the agent (“we can’t bring our own data, so it’s amazing when users create it”). Recent improvement: @-mentions in summaries trigger notifications. The team is vertical (quality + UX as one tiger team, managed by Zach from Embra).
18. Wearables, partnerships, and closing
Simon is personally excited about AI wearables; Sarah says Notion would partner with builders rather than build their own — Notion’s job is to be the best place where meeting notes live, not the best capture device. Final note: there are people with Notion tattoos, which surprises everyone given how understated the swag is.