A TALK BY MIKE LITMAN

The Audio Layer.

Ten beliefs about the foundational layer of the next decade, drawn from four voice agents in production.

Audio is the layer.

WHY NOW

Every decade has its interface primitive. Audio is ours.

1990s · DESKTOP | 2000s · THE BROWSER | 2010s · THE TOUCHSCREEN | 2020s · AUDIO
BUT VOICE HAS BEEN PROMISED BEFORE

Pre-LLM, voice was command-response.
Post-LLM, voice is conversation.
That is the leap.

SIRI · 2011 | ALEXA · 2014 | ASSISTANT · 2016 | LLMS · THE UNLOCK
WHAT I BUILT INSIDE THE LAYER

Four voice agents. Four verticals. All in production.

FIRST ORDER · DISH DISCOVERY | WITH MOSHI · PHONE ANSWERING | BUGGY SMART · ACCESSIBILITY | QUEUE INDEX · QUEUE LENGTHS
01
BELIEF 01

Audio is not a feature.
It is the layer.

Every computing cycle has an interface primitive. The 90s had the desktop. The 2000s had the browser. The 2010s had the touchscreen. The 2020s have audio. This is structural, not incremental.

DESKTOP | BROWSER | TOUCHSCREEN | AUDIO
02
BELIEF 02

Voice is the entry point.
Audio is the territory.

The market includes music generation, dubbing, audiobooks, podcasts, gaming and hardware. Companies that play only in voice miss the layer.

VOICE AGENTS | MUSIC AI | DUBBING | PODCASTS | GAMING | HARDWARE
03
BELIEF 03

Every company becomes a voice company.
Some choose. Most don't.

The same way every company became mobile in 2010 and social in 2014. The choice is whether you build the architecture or inherit someone else's.

2010 · MOBILE | 2014 · SOCIAL | 2026 · VOICE
04
BELIEF 04

The winners won't call themselves voice AI.
They will call themselves infrastructure.

ElevenLabs calls itself an AI audio company. Cartesia calls itself real-time multimodal. OpenAI calls it Realtime, not Voice. Watch the language, not the demos.

ELEVENLABS · AUDIO INTELLIGENCE | CARTESIA · REAL-TIME MULTIMODAL | OPENAI · REALTIME
05
BELIEF 05

The intelligence and the voice are different races.

Two markets, not one. The intelligence race runs at OpenAI, Anthropic, Google, xAI. The voice race runs at ElevenLabs, Cartesia, Hume, Sesame. Products mix and match across the two. Strategy decisions that conflate them lose money.

LLM RACE | VOICE RACE | MIX-AND-MATCH WINS
BUT WHAT ABOUT GPT-4O

GPT-4o speaks. It does not own the voice race.

Voice cloning, voice libraries, language depth, B2B integration, regulatory compliance. The voice layer is more than the model that ships in a chatbot.

06
BELIEF 06

Sub-200ms is the threshold.
Above it is software.
Below it is presence.

Latency defines the category. Above 500ms a reply feels like software. Below 200ms it feels like presence. The companies closing the gap between the two are building a moat measured in milliseconds.

500ms · SOFTWARE | 200ms · PRESENCE | THE GAP IS THE MOAT
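
The two thresholds can be turned into a simple latency budget. A minimal sketch, assuming the 200ms and 500ms figures from the slide; the per-stage numbers are hypothetical placeholders, not measurements from any agent or vendor.

```python
# Thresholds taken from the slide.
PRESENCE_MS = 200   # below this, a reply feels like presence
SOFTWARE_MS = 500   # above this, a reply feels like software


def classify(total_ms: float) -> str:
    """Place a round-trip latency into one of the three bands."""
    if total_ms < PRESENCE_MS:
        return "presence"
    if total_ms > SOFTWARE_MS:
        return "software"
    return "gap"  # between the two thresholds: the gap the slide calls the moat


# Hypothetical per-stage budget in milliseconds (illustrative only).
budget_ms = {"stt": 80, "llm": 250, "tts": 90, "network": 60}

total_ms = sum(budget_ms.values())
label = classify(total_ms)
print(f"{total_ms}ms -> {label}")
```

With these placeholder numbers the total lands between the thresholds. Getting under 200ms means cutting every stage, not just one.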
07
BELIEF 07

Whoever owns the orchestration layer owns everything above and below it.

Every voice product is a pipeline. Speech-to-text (STT) in. LLM in the middle. Text-to-speech (TTS) out. Whoever owns the middle owns the integration with both ends. Vapi, LiveKit, Pipecat and ElevenLabs Agents are racing for it.

STT → LLM → TTS | VAPI | LIVEKIT | PIPECAT | ELEVENLABS AGENTS
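
The pipeline shape can be sketched in a few lines. This is an illustration only: the `Orchestrator` class, the `Turn` record and the stand-in stage functions are all hypothetical names, and none of them reflect the API of Vapi, LiveKit, Pipecat or ElevenLabs Agents. The point is that the component in the middle holds the wiring, so either end can be swapped without touching the other.

```python
import time
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; real STT, LLM and TTS vendors expose their own APIs.
Transcribe = Callable[[bytes], str]
Respond = Callable[[str], str]
Synthesize = Callable[[str], bytes]


@dataclass
class Turn:
    transcript: str
    reply: str
    audio: bytes
    latency_ms: dict[str, float]


class Orchestrator:
    """Sits in the middle: wires the three stages together and times each one."""

    def __init__(self, stt: Transcribe, llm: Respond, tts: Synthesize) -> None:
        self.stt, self.llm, self.tts = stt, llm, tts

    def handle(self, audio_in: bytes) -> Turn:
        latency_ms: dict[str, float] = {}

        def timed(name, fn, arg):
            start = time.perf_counter()
            out = fn(arg)
            latency_ms[name] = (time.perf_counter() - start) * 1000
            return out

        transcript = timed("stt", self.stt, audio_in)
        reply = timed("llm", self.llm, transcript)
        audio_out = timed("tts", self.tts, reply)
        return Turn(transcript, reply, audio_out, latency_ms)


# Stand-in stages so the sketch runs without any vendor SDK.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


agent = Orchestrator(fake_stt, fake_llm, fake_tts)
turn = agent.handle(b"table for two")
print(turn.reply, turn.latency_ms)
```

Swapping `fake_tts` for a different synthesizer changes one constructor argument. That swap point is what the orchestration layer controls.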
08
BELIEF 08

Authenticity is the regulation that builds the moat.

Deepfakes, watermarking, consent, provenance. The EU AI Act is already enforcing labelling. The companies that solve "who said this and was it real" become infrastructure for the regulated world.

EU AI ACT | C2PA STANDARD | WATERMARKING | PROVENANCE
09
BELIEF 09

Hardware is where the audio layer becomes permanent.

Earbuds, voice-native devices, ambient speakers, automotive. The AirPods generation grew up with audio as its primary computing interface. Software lives in OS shifts. Hardware lives forever.

AIRPODS | AMBIENT SPEAKERS | VOICE-NATIVE WEARABLES | AUTOMOTIVE
10
BELIEF 10

The application layer pays the bills.
The infrastructure layer captures the wealth.

Every platform shift concentrates value at the infrastructure layer eventually. AWS captured more than the SaaS companies it enabled. NVIDIA captured more than every AI app combined. The audio infrastructure companies are building the AWS of audio.

AWS LESSON | NVIDIA LESSON | AUDIO INFRASTRUCTURE
MY VERDICT

If you ask me where I place the bet today, it is on ElevenLabs.

Four voice agents in production. All built on it. The platform race I am watching. Not picking. Yet.

THIS IS THE FOUNDING MANIFESTO

The Audio Layer is the publication tracking these beliefs as they play out.

THE PUBLICATION | THE PODCAST | THE REPORT | THE INDEX
Audio is the layer.
Build accordingly.

mikelitman.me · hello@mikelitman.me
