A TALK BY MIKE LITMAN

The Audio Layer.

Ten beliefs about the foundational layer of the next decade. From four voice agents in production.

Audio is the layer.

WHY NOW

Every decade has its interface primitive. Audio is ours.

1990s · DESKTOP
2000s · THE BROWSER
2010s · THE TOUCHSCREEN
2020s · AUDIO

BUT VOICE HAS BEEN PROMISED BEFORE

Pre-LLM, voice was command-response.
Post-LLM, voice is conversation.
That is the leap.

SIRI · 2011
ALEXA · 2014
ASSISTANT · 2016
LLMS · THE UNLOCK

WHAT I BUILT INSIDE THE LAYER

Four voice agents. Four verticals. All in production.

FIRST ORDER · DISH DISCOVERY
WITH MOSHI · PHONE ANSWERING
BUGGY SMART · ACCESSIBILITY
QUEUE INDEX · QUEUE LENGTHS

BELIEF 01

Audio is not a feature.
It is the layer.

Every computing cycle has an interface primitive. The 90s had the desktop. The 2000s had the web. The 2010s had the touchscreen. The 2020s have audio. This is structural, not incremental.

DESKTOP BROWSER TOUCHSCREEN AUDIO

BELIEF 02

Voice is the entry point.
Audio is the territory.

The market includes music generation, dubbing, audiobooks, podcasts, gaming, hardware. Companies that play voice-only miss the layer.

VOICE AGENTS
MUSIC AI
DUBBING
PODCASTS
GAMING
HARDWARE

BELIEF 03

Every company becomes a voice company.
Some choose. Most don't.

The same way every company became mobile in 2010 and social in 2014. The choice is whether you build the architecture or inherit someone else's.

2010 · MOBILE
2014 · SOCIAL
2026 · VOICE

BELIEF 04

The winners won't call themselves voice AI.
They will call themselves infrastructure.

ElevenLabs calls itself an AI audio company. Cartesia calls itself real-time multimodal. OpenAI calls it Realtime, not Voice. Watch the language, not the demos.

ELEVENLABS · AUDIO INTELLIGENCE
CARTESIA · REAL-TIME MULTIMODAL
OPENAI · REALTIME

BELIEF 05

The intelligence and the voice are different races.

Two markets, not one. The intelligence race runs at OpenAI, Anthropic, Google, xAI. The voice race runs at ElevenLabs, Cartesia, Hume, Sesame. Builders mix and match across the two. Strategy decisions that conflate them lose money.

LLM RACE
VOICE RACE
MIX-AND-MATCH WINS

BUT WHAT ABOUT GPT-4O

GPT-4o speaks. It does not own the voice race.

Voice cloning, voice libraries, language depth, B2B integration, regulatory compliance. The voice layer is more than the model that ships in a chatbot.

BELIEF 06

Sub-200ms is the threshold.
Above it is software.
Below it is presence.

Latency defines the category. Above 500ms it feels like software. Below 200ms it feels like presence. The companies closing that gap are building a moat measured in milliseconds.

500ms · SOFTWARE
200ms · PRESENCE
THE GAP IS THE MOAT
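
To make the thresholds concrete, here is a minimal sketch of a latency budget check. Only the 200ms and 500ms figures come from the talk. The stage names and per-stage numbers are hypothetical placeholders, not measurements from any real product.

```python
# Thresholds from the talk; everything else is illustrative.
PRESENCE_MS = 200.0   # below this, the talk calls it "presence"
SOFTWARE_MS = 500.0   # above this, the talk calls it "software"


def total_latency(stages: dict[str, float]) -> float:
    """Sum the per-stage latencies (in milliseconds) of one voice turn."""
    return sum(stages.values())


def classify(total_ms: float) -> str:
    """Place a round-trip latency relative to the two thresholds."""
    if total_ms < PRESENCE_MS:
        return "presence"
    if total_ms > SOFTWARE_MS:
        return "software"
    return "gap"  # between the thresholds: the gap still to close


# Hypothetical budget for one turn; the numbers are assumptions.
budget = {
    "stt": 60.0,
    "llm_first_token": 90.0,
    "tts_first_audio": 40.0,
    "network": 30.0,
}

total = total_latency(budget)
print(f"{total:.0f} ms -> {classify(total)}")
```

With these assumed numbers the turn sits in the gap: every stage looks fast on its own, yet the sum misses the 200ms line. That is why the budget has to be managed across the whole turn and not per component.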

BELIEF 07

Whoever owns the orchestration layer owns everything above and below it.

Every voice product is a pipeline. STT in. LLM in the middle. TTS out. Whoever owns the middle owns the integration with both ends. Vapi, LiveKit, Pipecat, ElevenLabs Agents are racing for it.

STT → LLM → TTS
VAPI
LIVEKIT
PIPECAT
ELEVENLABS AGENTS
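
The STT → LLM → TTS pipeline can be sketched as three swappable interfaces with an orchestrator in the middle. This is an illustration under stated assumptions, not the API of Vapi, LiveKit, Pipecat, or ElevenLabs Agents. Every class and method name below is hypothetical, and the three stub providers stand in for real vendors.

```python
from dataclasses import dataclass
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, text: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class VoicePipeline:
    """Hypothetical orchestrator: it owns the hand-offs between all stages."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def handle_turn(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)    # STT in
        answer = self.llm.reply(text)        # LLM in the middle
        return self.tts.synthesize(answer)   # TTS out


# Stub providers so the sketch runs without any vendor SDK.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UpperLLM:
    def reply(self, text: str) -> str:
        return text.upper()


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = VoicePipeline(stt=EchoSTT(), llm=UpperLLM(), tts=BytesTTS())
print(pipeline.handle_turn(b"hello"))
```

Swapping a provider means changing one constructor argument. The orchestrator is the only piece that touches all three stages.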

BELIEF 08

Authenticity is the regulation that builds the moat.

Deepfakes, watermarking, consent, provenance. The EU AI Act already writes labelling of AI-generated and deepfake content into law. The companies that solve "who said this and was it real" become infrastructure for the regulated world.

EU AI ACT
C2PA STANDARD
WATERMARKING
PROVENANCE

BELIEF 09

Hardware is where the audio layer becomes permanent.

Earbuds, voice-native devices, ambient speakers, automotive. The AirPods generation grew up with audio as their primary computing interface. Software lives in OS shifts. Hardware lives forever.

AIRPODS
AMBIENT SPEAKERS
VOICE-NATIVE WEARABLES
AUTOMOTIVE

BELIEF 10

The application layer pays the bills.
The infrastructure layer captures the wealth.

Every platform shift concentrates value at the infrastructure layer eventually. AWS captured more than the SaaS companies it enabled. NVIDIA captured more than every AI app combined. The audio infrastructure companies are building the AWS of audio.

AWS LESSON
NVIDIA LESSON
AUDIO INFRASTRUCTURE

MY VERDICT

If you ask me where I place the bet today, it is on ElevenLabs.

Four voice agents in production. All built on it. The platform race I am watching. Not picking. Yet.

THIS IS THE FOUNDING MANIFESTO

The Audio Layer is the publication tracking these beliefs as they play out.

THE PUBLICATION
THE PODCAST
THE REPORT
THE INDEX

Audio is the layer.
Build accordingly.

mikelitman.me · hello@mikelitman.me
