The Future of Voice AI: Agents, Dubbing, and Real-Time Translation with ElevenLabs Co-Founder Mati Staniszewski
Redefining the Human Signal.
Mati Staniszewski and ElevenLabs have achieved what was previously thought impossible: emotional fidelity in synthetic speech. From a cold start in 2022 to a $300M run rate, they aren't just building tools; they are rewriting the interface of the future.
Mati Staniszewski
Co-founder & CEO, ElevenLabs
Location: London / Global Remote
In just 3 years of operation.
Employees across London, NYC, Warsaw, SF, & Tokyo.
Creators and developers building on the platform.
Revenue Composition: A Balanced Ecosystem
The Origin Story
"If you watch a movie in Polish... all the voices, male or female, are narrated with one single character. It is a terrible experience."
The frustration with "flat delivery" in dubbing sparked the quest for an AI that could preserve original emotion and intonation across languages.
Voice as the Ultimate Interface
We have spent decades adapting ourselves to machines—typing on glass, clicking mice, staring at screens. Mati argues this interface is fundamentally broken.
The future isn't about better keyboards; it's about returning to humanity's oldest interface: speech. Whether it's interacting with robots, smartphones, or immersive media, the friction of the screen is disappearing. ElevenLabs isn't just generating audio; they are building the conversational layer for the entire internet.
Coming Up Next
Building the Machine: Research vs. Product →
The Architecture of Invention
We leave the metrics of scale to examine the engine room. How does a startup sequence the delicate dance between pure research and consumer-ready product?
The "Robotic" Barrier
The initial hurdle wasn't scale—it was quality. Early attempts to use existing market models failed because the output was simply "not good speech."
"We realized pretty quickly that the models that existed just produced such a robotic speech that people didn't want to listen to it."
Market Pull vs. Visionary Push
Development sequencing based on user demand vs. internal conviction.
The Lab Structure
The foundational layer. Solving the core issue of human-sounding narration via TTS.
The orchestration layer. Linking Knowledge + LLM + TTS + STT for interaction.
The expansion layer. Responding to creator needs for licensed background audio.
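One way to picture the orchestration layer is as a single conversational turn that passes through four stages: speech-to-text, knowledge lookup, LLM response, and text-to-speech. The sketch below is a minimal illustration with stand-in components. Every name in it (VoiceAgent, transcribe, retrieve, generate_reply, synthesize, and the fake_* stubs) is hypothetical, and none of it reflects the actual ElevenLabs API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VoiceAgent:
    """Hypothetical orchestrator: STT -> knowledge -> LLM -> TTS."""
    transcribe: Callable[[bytes], str]               # STT stage
    retrieve: Callable[[str], List[str]]             # knowledge stage
    generate_reply: Callable[[str, List[str]], str]  # LLM stage
    synthesize: Callable[[str], bytes]               # TTS stage

    def handle_turn(self, audio_in: bytes) -> bytes:
        """Run one user turn through all four stages in order."""
        text = self.transcribe(audio_in)
        facts = self.retrieve(text)
        reply = self.generate_reply(text, facts)
        return self.synthesize(reply)


# Stand-in components so the sketch runs without any external service.
# Here "audio" is just UTF-8 text, which is an assumption for illustration.
KNOWLEDGE: Dict[str, str] = {"refund": "Refunds are processed within 5 days."}


def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_retrieve(query: str) -> List[str]:
    # Naive keyword match against the toy knowledge base.
    return [fact for key, fact in KNOWLEDGE.items() if key in query.lower()]


def fake_llm(query: str, facts: List[str]) -> str:
    # A real LLM would compose an answer; this stub echoes the first fact.
    return facts[0] if facts else "Sorry, I do not know."


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


agent = VoiceAgent(fake_stt, fake_retrieve, fake_llm, fake_tts)
print(agent.handle_turn(b"Where is my refund?"))
```

Keeping each stage behind a plain callable means any one component can be swapped without touching the others, which is the point of treating orchestration as its own layer.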
The North Star
"The full Babel Fish idea from Hitchhiker's Guide to the Galaxy... breaking down language barriers."
Dubbing wasn't requested by the market—it was built because the future demanded it.
We have the structure. We have the labs. But as we move from static narration to dynamic agents, the challenge shifts.
Continuing the Thread
While research and product development build the engine, the interface remains deeply human. We now pivot from the technical architecture to the subjective art of perception and application.
The Voice Sommelier & The Agentic Future
Buyers aren't machine learning scientists. They don't want benchmarks; they want a feeling. The industry is shifting from static text-to-speech to dynamic, context-aware voice, guided by "Voice Sommeliers" who curate audio identities, and that shift paves the way for fully autonomous agents in government and commerce.
The "Sommelier" Approach to AI
The transcript highlights a critical gap: standard ML evals cannot measure "brand fit." The solution is the Voice Sommelier—a human-in-the-loop expert who pairs enterprise needs with sonic texture.
"We have a voice sommelier... That person is like a voice coach, has an incredible voice themselves, and will partner to help you find the right branding."
Dynamic Personalization
The future isn't one-voice-fits-all. It is dynamic adaptation: a high-energy persona for a morning briefing, a calm, slower cadence for the elderly, or a soothing tone for evening reading.
Case Study: Demographic Tuning
Based on Japan/Korea client data: Optimizing delivery for distinct user bases.
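The adaptation described above can be sketched as a small rule table that picks a delivery style from context. This is an illustrative sketch only: the VoiceProfile type, the pick_profile function, the age and hour thresholds, and the rate values are all assumptions chosen for the example, not figures from the interview or settings from any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceProfile:
    """Hypothetical delivery settings for a synthetic voice."""
    style: str
    rate: float  # 1.0 = normal speaking speed (assumed scale)


def pick_profile(hour: int, listener_age: int) -> VoiceProfile:
    """Choose a delivery style from time of day and listener age.

    Thresholds are illustrative assumptions.
    """
    if listener_age >= 70:
        # Calm, slower cadence for elderly listeners.
        return VoiceProfile("calm", 0.85)
    if 5 <= hour < 12:
        # High-energy persona for a morning briefing.
        return VoiceProfile("high-energy", 1.1)
    if hour >= 19:
        # Soothing tone for evening reading.
        return VoiceProfile("soothing", 0.95)
    return VoiceProfile("neutral", 1.0)


print(pick_profile(hour=8, listener_age=30))
print(pick_profile(hour=21, listener_age=30))
print(pick_profile(hour=8, listener_age=80))
```

The age rule is checked first so that the slower cadence wins over time-of-day styling, a design choice made here only to keep the example unambiguous.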
From Static Support to Immersive Agents
Proactive Commerce
Example: Meesho (India)
Moving beyond "Where is my refund?" to full shopping assistants that navigate catalogs, suggest gifts, and manage checkouts via voice widgets.
Living IP
Example: Epic Games
Static characters become interactive. Millions of players engaging live with Darth Vader in Fortnite, creating a scalable, personalized narrative experience.
Elite Tutoring
Example: Chess.com / Masterclass
Learning from the masters, not just watching them. Interactive negotiation practice with Chris Voss or chess analysis with Magnus Carlsen.
The Agentic State
Example: Ukraine
The most ambitious goal: A fully digital ministry. Proactive citizen engagement, benefits navigation, and education reform run by AI agents.
"It sounds like a big ambitious goal... but the crazy thing is, they are so ahead in actually doing that."
To execute visions as complex as an "Agentic Government," the choice of infrastructure becomes existential.
From Capabilities to Strategy
Having established what these agents can do—from customer support to internal training—the conversation shifts to the boardroom dilemma. For the Global 2000, the question isn't just about voice quality anymore; it's about architectural philosophy. Do you hire a consultant, buy a point solution, or partner with a platform?
The Enterprise Decision Matrix
As enterprises look to deploy rich voice interactions, the landscape fractures into three distinct paths. The host posits a choice between consulting giants like Palantir, point solutions like Sierra, or platform technology companies.
The speaker, drawing on his Palantir background, delineates the ElevenLabs approach: it is not a "one-pointed solution." Instead, it functions as an open infrastructure designed to sprawl across an organization—powering support, sales, and internal training simultaneously.
"If you are looking to deploy across a plethora of different experiences... then we are the right solution."
The Vendor Landscape
Consulting (e.g., Palantir)
Best for: Wider digital transformation journeys requiring massive resource allocation.
Point Solutions (e.g., Sierra)
Best for: Specific, contained use cases where an "out of the box" agent is needed immediately.
Platform + Forward Deployed (ElevenLabs)
Best for: Multi-modal deployment across departments (Sales + Support + Training) with custom engineering support.
The Global Audio Elite
Estimated count of top-tier researchers globally capable of architectural breakthroughs in audio: ~100
Why Giants Don't Win
The conversation tackles the "no priors" assumption: Why can a startup compete with Google or OpenAI? The answer lies in focus. While the labs prioritize general scale, audio requires specific architectural breakthroughs rather than just raw compute.
The speaker reveals a startling metric: the pool of researchers capable of pushing the frontier in audio is incredibly small—perhaps only 50 to 100 people globally. By concentrating ~10 of these minds under one roof and obsessing over the product layer—integrations, latency, control—ElevenLabs claims to beat the generalists on benchmarks.
Coming Up Next
The Future of Open Source & R&D
Continuing the Narrative
Having established the importance of foundation models, the conversation now shifts to their trajectory. As open source capabilities accelerate, the strategic question becomes: where does defensible value actually live?
Commoditization & The Ecosystem Moat
"Research is just a head start. The long-term value is the ecosystem you build around it."
The 4-Year Horizon
A consensus is forming: base model differences will become negligible. Whether in two years or four, narration and generation will commoditize.
The Shift
Value moves from the "Model" to the "Product Layer"—connecting business logic, workflows, and specific interfaces.
The Defensibility
Technology advantages last 6–12 months. Real defensibility comes from brand, distribution, and integration ecosystems.
The "Buy vs. Build" Rule
How do you decide between waiting for research or building a product hack?
The 3-Month Rule of Thumb
If a product fix takes less than 3 months, build it immediately. If it would take longer, wait for the underlying research model to improve.
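The rule reduces to a single threshold comparison. The sketch below encodes it under stated assumptions: the function name build_or_wait is hypothetical, and treating an estimate of exactly three months as "wait" is a choice made here because the source does not say which way the boundary falls.

```python
def build_or_wait(estimated_months: float, threshold_months: float = 3.0) -> str:
    """Apply the 3-month rule to an estimated product-fix duration.

    Returns "build" when the fix is estimated to take less than the
    threshold, otherwise "wait" (for the research model to improve).
    The boundary case (exactly at the threshold) maps to "wait" by
    assumption.
    """
    if estimated_months < 0:
        raise ValueError("estimated_months must be non-negative")
    return "build" if estimated_months < threshold_months else "wait"


print(build_or_wait(2))   # a two-month fix
print(build_or_wait(6))   # a six-month fix
```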
Milestone Predictions
- Present Day: Narration & content generation converging on quality.
- ~12 Months: Passing the Turing Test. Conversational AI becomes indistinguishable from human interaction in customer support contexts.
- ~24 Months: Real-Time Dubbing. Seamless, low-latency translation and conversation across languages.
Agent-Side Optimization
Benchmarking the new "Scribe v2" (Gen-2) Speech-to-Text Model.
"Most advantages in technology... they aren't infinitely defensible. They allow you to build momentum and scale for a period of time. That's powerful, but it's not a 'forever' answer."
Looking Ahead
As technical barriers fall and latency drops to imperceptible levels, we move from "tools" to "entities."
Next: The Future of AI Companions.
Moving beyond the technical architecture of open source and R&D strategies, the conversation shifts to the application layer—specifically, how these models will integrate into our daily lives, schools, and homes.
The Age of Companions
From "Jarvis" utility to the classroom of the future.
Archetype Debate
Social Pal vs. Super Pilot
The Social Companion
Solving loneliness, emotional reciprocation, and constant chat. (The guest is less excited here.)
The "Jarvis" Utility
"I have a super assistant, super pilot... Someone that understands me, tells me what's relevant, opens the blinds, and plays music straight away."
Timeline Projection
- 01. Decade of Agents: Dictation and voice become the primary OS. Devices recede into pockets; technology acts on your behalf.
- 02. Decade of Robots: Voice becomes the critical input/output interface for embodied AI in the physical world.
The Hybrid Classroom
The guest predicts a split model for future education to maintain human social skills.
"Maybe there's a cool version where you have Richard Feynman or Albert Einstein deliver those lecture notes... It’ll be sick."
Key Takeaway
Voice is not just an input method; it is the bridge to a "super pilot" lifestyle and personalized education on a massive scale. The technology fades; the interaction remains.