The world of voice AI, with Mati Staniszewski of ElevenLabs

Cheeky Pint · John — Mati Staniszewski · April 14, 2026

Most important takeaway

ElevenLabs has grown to an $11 billion valuation by pioneering voice AI that treats qualities like accent and emotion as emergent properties rather than hard-coded parameters, enabling breakthroughs in text-to-speech, real-time voice agents, and multilingual communication. The biggest career and organizational insight is that high-agency individuals who proactively explore and leverage AI tools are the clear winners of the current technology shift, regardless of seniority level.

Summary

How voice AI models work: ElevenLabs co-founder Piotr innovated a transformer-based approach where voice models predict the next phoneme using contextual text, mel spectrograms, and waveforms. The key breakthrough was making voice characteristics (accent, emotion, prosody) emergent rather than hard-coded, and adding controllability so users can direct how speech is delivered (pacing, dramatic pauses, emotional tone).
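The autoregressive loop described above can be sketched in miniature. This is a toy illustration only: the `predict_next_phoneme` stub below is a hand-written lookup standing in for a transformer, and the phoneme labels and `<eos>` token are assumptions for the example, not ElevenLabs' actual model interface.

```python
# Toy sketch of autoregressive phoneme prediction. A real model conditions
# on contextual text, mel spectrograms, and waveforms; this stub just walks
# a fixed phoneme sequence for the word "hi". Illustrative only.

def predict_next_phoneme(context_text: str, phonemes_so_far: list) -> str:
    # Stand-in for a transformer forward pass: returns the next phoneme
    # given everything generated so far, or <eos> when the word is done.
    sequence = ["HH", "AY"]  # ARPAbet-style phonemes for "hi"
    i = len(phonemes_so_far)
    return sequence[i] if i < len(sequence) else "<eos>"

def synthesize(context_text: str) -> list:
    """Greedy autoregressive loop: append predicted phonemes until <eos>."""
    phonemes = []
    while True:
        nxt = predict_next_phoneme(context_text, phonemes)
        if nxt == "<eos>":
            return phonemes
        phonemes.append(nxt)
```

Because generation is conditioned on the full context rather than per-phoneme rules, qualities like accent and prosody fall out of the learned distribution instead of being hard-coded, which is the emergent-property point above.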

ElevenLabs product ecosystem and company trajectory: The company offers text-to-speech, speech-to-text (transcription), voice agents, and ElevenLabs Reader (for PDFs and audiobooks). They raised $500 million at an $11 billion valuation. Their voice models use a few billion to low tens of billions of parameters — substantially smaller than leading LLMs — making training cheaper but still capital-intensive.

Actionable insights on voice AI adoption:

  • Cascaded vs. speech-to-speech architecture: For enterprise and business use cases, the cascaded approach (speech-to-text, then LLM, then text-to-speech) wins because it provides visibility, reliability, and integration flexibility. Speech-to-speech is faster but less reliable and suited more for companion/consumer applications where hallucinations are less critical.
  • Voice agents boost conversion: ElevenLabs found that replacing web forms with voice agents increased form completion rates and elicited richer, more detailed information from users. People are more open-ended and trusting when speaking versus typing.
  • Personalized transcription is coming: Person-specific transcription models (fine-tuned to individual accents and speech patterns) are expected from ElevenLabs within months. This has major implications for healthcare, smart home devices, and any domain where accurate recognition of a specific speaker matters.
  • Keyword detection already available: For real-time and async settings, ElevenLabs supports keyword detection to improve accuracy in domain-specific contexts (e.g., ordering at a coffee shop, medical commands).
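The cascaded architecture in the first bullet can be sketched as a pipeline whose intermediate outputs stay inspectable. All functions and types below are illustrative stubs I've invented for the sketch, not the real ElevenLabs API; the point is that every stage hands off text, so guardrails, logging, or keyword checks can sit between stages.

```python
# Hedged sketch of a cascaded voice-agent turn: STT -> LLM -> TTS.
# Every name here is a placeholder stub, not a real API.

from dataclasses import dataclass, field

@dataclass
class Turn:
    """One conversational turn, keeping each stage's output visible."""
    transcript: str = ""
    reply_text: str = ""
    reply_audio: bytes = b""
    log: list = field(default_factory=list)

def stub_stt(audio: bytes) -> str:
    # Placeholder: a real system would run a speech-to-text model here.
    return audio.decode("utf-8")  # pretend the audio bytes are just text

def stub_llm(prompt: str) -> str:
    # Placeholder: a real system would call an LLM here.
    return f"You said: {prompt}"

def stub_tts(text: str) -> bytes:
    # Placeholder: a real system would synthesize speech here.
    return text.encode("utf-8")

def cascaded_turn(audio_in: bytes) -> Turn:
    """Run one turn. Each intermediate result is logged and inspectable —
    the visibility/reliability benefit of the cascaded design."""
    turn = Turn()
    turn.transcript = stub_stt(audio_in)
    turn.log.append(("stt", turn.transcript))
    turn.reply_text = stub_llm(turn.transcript)
    turn.log.append(("llm", turn.reply_text))
    turn.reply_audio = stub_tts(turn.reply_text)
    turn.log.append(("tts", len(turn.reply_audio)))
    return turn
```

A speech-to-speech model collapses these three stages into one, trading the inspectable text boundaries (and the ability to swap in any LLM) for lower latency.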

Career advice and organizational insights:

  • Small, flat teams with high agency: ElevenLabs operates with unusually flat hierarchies — the CEO and co-founder each have 15+ direct reports, with roughly 10-person team sizes. They believe this is not just a startup effect but an AI-era organizational model.
  • Technical resources embedded in every team: Even non-technical teams (ops, talent/HR, go-to-market) have a dedicated “tech lead” who helps automate workflows and uplevel the team. Ukraine’s government validated this model by embedding technical resources in every ministry for their DIA digital platform.
  • Agency is the top trait to hire for: Mati emphasized that high-agency people, regardless of experience level, are the biggest beneficiaries of AI advancements. The combination of agency, ownership, and enjoying the craft is what defines ElevenLabs’ culture.
  • AI-native pricing models: Like other AI companies, ElevenLabs is moving toward a hybrid of subscription plans with usage-based overage pricing, recognizing that pure subscriptions with hard limits frustrate power users.
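The hybrid pricing model in the last bullet reduces to a simple billing formula: a flat subscription covers an included usage allowance, and anything beyond is metered rather than hard-capped. The numbers below are invented for illustration, not ElevenLabs' actual prices.

```python
# Hedged sketch of hybrid subscription + usage-based overage pricing.
# All figures are made up for the example.

def monthly_bill(base_fee: float, included_credits: int,
                 used_credits: int, overage_rate: float) -> float:
    """Flat subscription covers included_credits; usage beyond that is
    billed per credit instead of cutting the user off at a hard limit."""
    overage = max(0, used_credits - included_credits)
    return base_fee + overage * overage_rate

# Under the allowance: pay only the subscription.
light_user = monthly_bill(22.0, 100_000, 80_000, 0.0003)
# Over the allowance: subscription plus metered overage, no hard stop.
power_user = monthly_bill(22.0, 100_000, 150_000, 0.0003)
```

The design choice is that power users keep working past the cap and the vendor captures the marginal usage, instead of the "hard limit" frustration a pure subscription creates.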

Specific companies mentioned:

  • ElevenLabs: $11B valuation, $500M raised, pioneering voice AI across speech generation, transcription, and voice agents.
  • Stripe: Sponsor; ElevenLabs uses Stripe Link for faster checkout, reducing abandonment.
  • Ukraine government (DIA platform): Using ElevenLabs voice agents across ministries for citizen services during wartime, representing one of the most advanced digital government implementations.
  • Neuralink: Collaborated with ElevenLabs to restore a patient’s voice through neural interface technology.
  • Audible: Historically blocked AI-generated audiobooks, which prompted ElevenLabs to build their own Reader distribution platform.

Chapter Summaries

How voice AI models work: Mati walks through the history from mechanical speech replication to phoneme concatenation to modern transformer-based neural nets. ElevenLabs’ innovation was making voice parameters emergent rather than hard-coded and improving how context (surrounding text and emotional cues) is encoded and decoded.

Real-time voice and automotive applications: Production-quality real-time voice interaction only became possible about a year ago. Cloud-based voice in cars is expected this year; on-device (without connectivity) is still 2-3 years out.

ElevenLabs Reader and audiobook distribution: Created because platforms like Audible blocked AI audiobooks. Allows users to upload PDFs/text and have them read by high-quality AI voices.

The voice assistant Turing test: Neither Siri, Gemini, nor ChatGPT voice modes have passed a conversational voice Turing test. The orchestration challenges — turn-taking, knowing when to speak vs. wait, tool calls mid-conversation — remain hard research problems. Domain-specific agents (customer support) are closer than general interactive ones.
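The turn-taking problem mentioned above can be caricatured as a tiny policy: the agent must decide, from signals like voice activity and silence duration, whether the caller's turn has ended. This is a deliberately minimal sketch; the threshold is invented, and production agents use learned end-of-turn models plus tool-call orchestration, which is exactly why this remains a hard research problem.

```python
# Hedged sketch: a minimal silence-based turn-taking policy.
# The 700 ms threshold is an arbitrary illustrative value.

def should_speak(user_is_talking: bool, silence_ms: int,
                 end_of_turn_ms: int = 700) -> bool:
    """Speak only once the caller has been silent long enough to suggest
    their turn has ended; otherwise keep listening."""
    return (not user_is_talking) and silence_ms >= end_of_turn_ms
```

A fixed threshold fails on thoughtful pauses and interrupts too eagerly or too slowly, which hints at why no general assistant has passed the conversational voice Turing test yet.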

Personalized transcription and keyword detection: Person-specific transcription (fine-tuned to individual voices/accents) is months away. Speaker diarization already works well. Keyword detection is available now for domain-specific accuracy boosts.

Voice enhancement and controllability: ElevenLabs’ V3 model introduced controllability — directing emotional delivery, pacing, and dramatic pauses. “Expressive mode” lets voice agents detect caller emotions and respond appropriately.

Speech-to-speech vs. cascaded architecture: Cascaded (STT + LLM + TTS) is the enterprise-grade approach offering visibility and reliability. Speech-to-speech is faster but less controllable and accurate. Future may combine both depending on task complexity.

Business and economics of voice models: Voice models are a few billion to low tens of billions of parameters, cheaper to train than LLMs but still capital-intensive. ElevenLabs raised $500M at $11B valuation.

Organizational design in the AI era: Small flat teams, 15+ direct reports per leader, embedded technical resources in every team. High agency is the most valued trait. Ukraine’s government mirrors this model across ministries. Culture scales the company more than any single product or person.