No Priors: Artificial Intelligence | Technology | Startups

The Future of Voice AI: Agents, Dubbing, and Real-Time Translation with ElevenLabs Co-Founder Mati Staniszewski

12/11/2025
Continuing from the introduction... we now meet the architect behind the fastest-growing voice AI in history.

Redefining the Human Signal.

Mati Staniszewski and ElevenLabs have achieved what was previously thought impossible: emotional fidelity in synthetic speech. From a cold start in 2022 to a $300M run rate, they aren't just building tools; they are rewriting the interface of the future.

The Guest

Mati Staniszewski

Co-founder & CEO, ElevenLabs

Location: London / Global Remote

Annual Recurring Revenue
$300M

In just 3 years of operation.

Global Team
350

Employees across London, NYC, Warsaw, SF, & Tokyo.

Monthly Actives
5M+

Creators and developers building on the platform.

Revenue Composition: A Balanced Ecosystem


The Origin Story

"If you watch a movie in Polish... all the voices, male or female, are narrated with one single character. It is a terrible experience."

The frustration with "flat delivery" in dubbing sparked the quest for an AI that could preserve original emotion and intonation across languages.

Voice as the Ultimate Interface

We have spent decades adapting ourselves to machines—typing on glass, clicking mice, staring at screens. Mati argues this interface is fundamentally broken.

The future isn't about better keyboards; it's about returning to humanity's oldest interface: speech. Whether it's interacting with robots, smartphones, or immersive media, the friction of the screen is disappearing. ElevenLabs isn't just generating audio; they are building the conversational layer for the entire internet.


Coming Up Next

Building the Machine: Research vs. Product →

Continuing from Growth & Scale

The Architecture of Invention

We leave the metrics of scale to examine the engine room. How does a startup sequence the delicate dance between pure research and consumer-ready product?

The "Robotic" Barrier

The initial hurdle wasn't scale—it was quality. Early attempts to use existing market models failed because the output was simply "not good speech."

"We realized pretty quickly that the models that existed just produced such a robotic speech that people didn't want to listen to it."

Market Pull vs. Visionary Push

Development sequencing based on user demand vs. internal conviction.

The Lab Structure

01. Voice Lab

The foundational layer. Solving the core issue of human-sounding narration via TTS.

02. Agent Lab

The orchestration layer. Linking Knowledge + LLM + TTS + STT for interaction.

03. Music Lab

The expansion layer. Responding to creator needs for licensed background audio.
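The Agent Lab's orchestration layer described above can be pictured as a simple turn-taking loop: speech-to-text, a knowledge lookup, an LLM reply, then text-to-speech. The sketch below is illustrative only — every function name is a hypothetical stub, not the ElevenLabs API.

```python
# Hypothetical sketch of the Agent Lab pipeline: Knowledge + LLM + TTS + STT.
# All functions are illustrative stubs, not real ElevenLabs endpoints.

def transcribe(audio: bytes) -> str:
    """STT stage: convert the caller's audio into text (stub)."""
    return "where is my refund?"

def retrieve_context(query: str, knowledge_base: dict) -> str:
    """Knowledge stage: naive keyword lookup against a store (stub)."""
    return next((v for k, v in knowledge_base.items() if k in query), "")

def generate_reply(query: str, context: str) -> str:
    """LLM stage: compose a grounded answer from retrieved context (stub)."""
    return f"Based on our records: {context}" if context else "Let me check."

def synthesize(text: str) -> bytes:
    """TTS stage: render the reply text as audio (stub)."""
    return text.encode()

def handle_turn(audio: bytes, knowledge_base: dict) -> bytes:
    """One conversational turn: audio in, audio out."""
    query = transcribe(audio)
    context = retrieve_context(query, knowledge_base)
    reply = generate_reply(query, context)
    return synthesize(reply)
```

The point of the sketch is the sequencing: each lab's output becomes the next stage's input, which is why the Agent Lab sits on top of the Voice Lab rather than beside it.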

The North Star

"The full Babel Fish idea from Hitchhiker's Guide to the Galaxy... breaking down language barriers."

Dubbing wasn't requested by the market—it was built because the future demanded it.

We have the structure. We have the labs. But as we move from static narration to dynamic agents, the challenge shifts.

Next: Quality & Preferences

Continuing the Thread

While research and product development build the engine, the interface remains deeply human. We now pivot from the technical architecture to the subjective art of perception and application.

The Voice Sommelier & The Agentic Future

Buyers aren't machine learning scientists. They don't want benchmarks; they want a feeling. The industry is shifting from static text-to-speech to dynamic, context-aware "Voice Sommeliers" who curate audio identities, paving the way for fully autonomous agents in government and commerce.

4 Key Sectors Transformed

The "Sommelier" Approach to AI

The transcript highlights a critical gap: standard ML evals cannot measure "brand fit." The solution is the Voice Sommelier—a human-in-the-loop expert who pairs enterprise needs with sonic texture.

"We have a voice sommelier... That person is like a voice coach, has an incredible voice themselves, and will partner to help you find the right branding."

Dynamic Personalization

The future isn't one voice fits all. It is dynamic adaptation: A high-energy persona for a morning briefing, a calm, slower cadence for the elderly, or a soothing tone for evening reading.

Case Study: Demographic Tuning

Based on Japan/Korea client data: Optimizing delivery for distinct user bases.
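Dynamic personalization of this kind amounts to mapping context (time of day, audience) to delivery settings. Here is a minimal sketch of that mapping; the setting names `speed` and `energy` are hypothetical placeholders, not actual ElevenLabs API parameters.

```python
# Illustrative context-to-voice mapping for dynamic personalization.
# Setting names are hypothetical, not real API parameters.

def voice_settings(time_of_day: str, audience: str) -> dict:
    """Pick delivery settings from listener context."""
    settings = {"speed": 1.0, "energy": "neutral"}
    if time_of_day == "morning":
        # High-energy persona for a morning briefing.
        settings.update(speed=1.1, energy="high")
    elif time_of_day == "evening":
        # Soothing tone for evening reading.
        settings.update(speed=0.9, energy="soothing")
    if audience == "elderly":
        # Calmer, slower cadence: never exceed a reduced speed cap.
        settings["speed"] = min(settings["speed"], 0.85)
    return settings
```

In practice the demographic tuning described above would be learned from client data rather than hard-coded, but the shape of the decision is the same.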

From Static Support to Immersive Agents

Proactive Commerce

Example: Meesho (India)

Moving beyond "Where is my refund?" to full shopping assistants that navigate catalogs, suggest gifts, and manage checkouts via voice widgets.

Living IP

Example: Epic Games

Static characters become interactive. Millions of players engaging live with Darth Vader in Fortnite, creating a scalable, personalized narrative experience.

Elite Tutoring

Example: Chess.com / Masterclass

Learning from the masters, not just watching them. Interactive negotiation practice with Chris Voss or chess analysis with Magnus Carlsen.

The Agentic State

Example: Ukraine

The most ambitious goal: A fully digital ministry. Proactive citizen engagement, benefits navigation, and education reform run by AI agents.

"It sounds like a big ambitious goal... but the crazy thing is, they are so ahead in actually doing that."

To execute visions as complex as an "Agentic Government," the choice of infrastructure becomes existential.

Coming Up Next

Choosing the Right Technology Partner →

From Capabilities to Strategy

Having established what these agents can do—from customer support to internal training—the conversation shifts to the boardroom dilemma. For the Global 2000, the question isn't just about voice quality anymore; it's about architectural philosophy. Do you hire a consultant, buy a point solution, or partner with a platform?

The Enterprise Decision Matrix

As enterprises look to deploy rich voice interactions, the landscape fractures into three distinct paths. The host posits a choice among consulting giants like Palantir, point solutions like Sierra, and platform technology companies.

The speaker, drawing on his Palantir background, delineates the ElevenLabs approach: it is not a "one-pointed solution." Instead, it functions as an open infrastructure designed to sprawl across an organization—powering support, sales, and internal training simultaneously.

"If you are looking to deploy across a plethora of different experiences... then we are the right solution."

The Vendor Landscape

Consulting (e.g., Palantir)

Best for: Wider digital transformation journeys requiring massive resource allocation.

Point Solutions (e.g., Sierra)

Best for: Specific, contained use cases where an "out of the box" agent is needed immediately.

Platform + Forward Deployed (ElevenLabs)

Best for: Multi-modal deployment across departments (Sales + Support + Training) with custom engineering support.

The Global Audio Elite

Estimated count of top-tier researchers capable of architectural breakthroughs in audio.

~100

Total Researchers Globally

Why Giants Don't Win

The conversation tackles the "no priors" assumption: Why can a startup compete with Google or OpenAI? The answer lies in focus. While the labs prioritize general scale, audio requires specific architectural breakthroughs rather than just raw compute.

The speaker reveals a startling metric: the pool of researchers capable of pushing the frontier in audio is incredibly small—perhaps only 50 to 100 people globally. By concentrating ~10 of these minds under one roof and obsessing over the product layer—integrations, latency, control—ElevenLabs claims to beat the generalists on benchmarks.

FOCUS: RESEARCH DENSITY

Coming Up Next

The Future of Open Source & R&D

Continuing the Narrative

Having established the importance of foundation models, the conversation now shifts to their trajectory. As open source capabilities accelerate, the strategic question becomes: where does defensible value actually live?

Commoditization & The Ecosystem Moat

"Research is just a head start. The long-term value is the ecosystem you build around it."

The 4-Year Horizon

A consensus is forming: base model differences will become negligible. Whether in two years or four, narration and generation will commoditize.

The Shift

Value moves from the "Model" to the "Product Layer"—connecting business logic, workflows, and specific interfaces.

The Defensibility

Technology advantages last 6–12 months. Real defensibility comes from brand, distribution, and integration ecosystems.

The "Buy vs. Build" Rule

How do you decide between waiting for research or building a product hack?

The 3-Month Thumb Rule

If a product fix takes < 3 months, build it immediately. If longer, wait for the underlying research model to improve.

Milestone Predictions

  • Present Day

    Narration & content generation converging on quality.

  • ~12 Months

    Passing the Turing Test

    Conversational AI becomes indistinguishable from human interaction in customer support contexts.

  • ~24 Months

    Real-Time Dubbing

    Seamless, low-latency translation and conversation across languages.

Agent-Side Optimization

Benchmarking the new Scribe v2 (Gen-2) Speech-to-Text model.

*Tested on the FLEURS benchmark. Target: < 150 ms latency.

"Most advantages in technology... they aren't infinitely defensible. They allow you to build momentum and scale for a period of time. That's powerful, but it's not a 'forever' answer."

Looking Ahead

As technical barriers fall and latency drops to imperceptible levels, we move from "tools" to "entities."
Next: The Future of AI Companions.

Moving beyond the technical architecture of open source and R&D strategies, the conversation shifts to the application layer—specifically, how these models will integrate into our daily lives, schools, and homes.

The Age of Companions

From "Jarvis" utility to the classroom of the future.

Archetype Debate

Social Pal vs. Super Pilot

The Social Companion

Solving loneliness, emotional reciprocation, and constant chat. (The Guest is less excited here.)

The "Jarvis" Utility

"I have a super assistant, super pilot... Someone that understands me, tells me what's relevant, opens the blinds, and plays music straight away."

Timeline Projection

  • 01. Decade of Agents
    Dictation and voice become the primary OS. Devices recede into pockets; technology acts on your behalf.
  • 02. Decade of Robots
    Voice becomes the critical input/output interface for embodied AI in the physical world.

The Hybrid Classroom

The guest predicts a split model for future education to maintain human social skills.

"Maybe there's a cool version where you have Richard Feynman or Albert Einstein deliver those lecture notes... It’ll be sick."

Key Takeaway

Voice is not just an input method; it is the bridge to a "super pilot" lifestyle and personalized education on a massive scale. The technology fades; the interaction remains.

End of Segment

Proceed to Conclusion
