Ten beliefs about the foundational layer of the next decade. From four voice agents in production.
Audio is the layer.
Every decade has its interface primitive. Audio is ours.
Pre-LLM, voice was command-response.
Post-LLM, voice is conversation.
That is the leap.
Four voice agents. Four verticals. All in production.
Every computing cycle has an interface primitive. The 90s had the desktop. The 2000s had the web. The 2010s had the touchscreen. The 2020s have audio. This is structural, not incremental.
The market includes music generation, dubbing, audiobooks, podcasts, gaming, hardware. Companies that play voice-only miss the layer.
The same way every company became mobile in 2010 and social in 2014. The choice is whether you build the architecture or inherit someone else's.
ElevenLabs calls itself an AI audio company. Cartesia calls itself real-time multimodal. OpenAI calls it Realtime, not Voice. Watch the language, not the demos.
Two markets, not one. The intelligence race runs at OpenAI, Anthropic, Google, xAI. The voice race runs at ElevenLabs, Cartesia, Hume, Sesame. They mix-and-match. Strategy decisions that conflate them lose money.
GPT-4o speaks. It does not own the voice race.
Voice cloning, voice libraries, language depth, B2B integration, regulatory compliance. The voice layer is more than the model that ships in a chatbot.
Latency defines the category. Above 500ms it feels like software. Below 200ms it feels like presence. The companies closing that gap are building a moat measured in milliseconds.
Every voice product is a pipeline. STT in. LLM in the middle. TTS out. Whoever owns the middle owns the integration with both ends. Vapi, LiveKit, Pipecat, ElevenLabs Agents are racing for it.
Deepfakes, watermarking, consent, provenance. The EU AI Act is already enforcing labelling. The companies that solve "who said this and was it real" become infrastructure for the regulated world.
Earbuds, voice-native devices, ambient speakers, automotive. The AirPods generation grew up with audio as their primary computing interface. Software lives in OS shifts. Hardware lives forever.
Every platform shift concentrates value at the infrastructure layer eventually. AWS captured more than the SaaS companies it enabled. NVIDIA captured more than every AI app combined. The audio infrastructure companies are building the AWS of audio.
If you ask me where I place the bet today, it is on ElevenLabs.
Four voice agents in production. All built on it. The platform race I am watching. Not picking. Yet.
The Audio Layer is the publication tracking these beliefs as they play out.
Audio is the layer.
Build accordingly.
mikelitman.me · hello@mikelitman.me