Audio is the Layer: 10 Beliefs on Voice AI and the Next Decade

I have built four voice agents in production this year. They call London restaurants for dish recommendations. They answer phones for places that cannot afford a receptionist. They map pram accessibility across more than a thousand venues. They check queue lengths at the spots people are trying to get into. Different verticals, different failure modes, different customers. The same primitive underneath: a synthetic voice having a real conversation with a stranger, end to end, no human in the loop.

Somewhere around the third one, the pattern stopped looking like four separate experiments and started looking like one structural fact about where computing is going. Audio is not a feature being added to existing products. It is the layer that the next generation of products is being built on top of. Most companies have not noticed yet, which is the only reason the window is still open.

Every era has its primitive

The 1990s had the desktop. The 2000s had the web. The 2010s had the touchscreen. Each one redefined what an interface was, which products mattered, and where value concentrated. The pattern is consistent enough to be boring: a new interface primitive arrives, every category re-platforms onto it within a decade, and the companies that controlled the previous primitive find themselves answering to the companies that own the new one.

The 2020s have audio. That claim sounds large, and the natural response is scepticism: voice has been promised before. Siri shipped in 2011. Alexa in 2014. Google Assistant in 2016. None of them produced the platform shift their launches implied. Why is this time different?

The leap

Pre-LLM, voice was command-response. You spoke, it parsed, it returned a structured answer from a finite menu of intents. The interface looked like conversation but the substance was the same as a search box: type a query, get a result. The reason Siri felt like a parlour trick after the third question is that there were only ever a few hundred questions it could actually answer.

Post-LLM, voice is conversation. The same model that can write your email, debug your code and reason about a legal contract is now also speaking. The voice layer inherits the intelligence layer. That is a structural break, not a quality improvement. It is the same kind of break that happened when smartphones replaced feature phones, or when broadband replaced dial-up. The thing that previously did not work suddenly works, and you cannot un-work it. That is the leap.

Voice is the entry point. Audio is the territory.

Most coverage of this market collapses to "voice agents". Voice agents are the visible tip. The full layer spans music generation, dubbing and localisation, audiobook production, podcast intelligence, gaming audio, ambient computing, and hardware as audio interface. The companies that play voice-only miss the layer. The companies that understand the layer is wider than voice are positioning for what comes next.

You can see this in the language the leading companies use. ElevenLabs calls itself an AI audio company, not a voice AI company. Cartesia describes itself as real-time multimodal. OpenAI shipped Realtime, not Voice. Watch the language, not the demos. The winners are not framing themselves as voice products. They are framing themselves as infrastructure.

The bifurcation almost everyone misses

The single most important strategic insight in this market is that the intelligence race and the voice race are different races, running in parallel, with different leaders, different moats, and different timelines.

The intelligence race runs at OpenAI, Anthropic, Google, xAI. They are competing to build the most capable language model. The voice race runs at ElevenLabs, Cartesia, Hume, Sesame. They are competing to build the most capable audio expression layer. The two races touch but they are not the same race. The companies winning each one are not the same companies. The moats look different. The customers buy them separately and compose them.

Which raises the obvious objection: GPT-4o speaks. Doesn't that collapse the voice race into the intelligence race? It does not. The voice race is also about voice cloning, voice libraries, language depth, B2B integration, regulatory compliance, the ability to swap voices without re-training a model, and the ability to ship without OpenAI as a dependency. The voice layer is more than the model that ships in a chatbot. The teams making good strategic decisions in this space are the ones who treat intelligence and voice as separate races and compose the best of each.

Sub-200ms is the threshold

Latency is the metric that defines whether voice AI feels like software or feels like presence. Above half a second, you are talking to a machine that is thinking. Around the two-hundred-millisecond threshold that voice UX practitioners cite for natural conversation, you are having a conversation. The companies closing that gap are building a moat measured in milliseconds. It is the page-load-speed of this category, except the question is not "is this fast" but "is this real".
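To make the arithmetic concrete, here is a minimal sketch of a per-turn latency budget. The two thresholds come from the paragraph above; the middle band, the stage names and the stage timings are illustrative assumptions, not measurements from any real system.

```python
# Thresholds from the essay: ~200 ms feels like conversation,
# above 500 ms feels like a machine that is thinking.
CONVERSATIONAL_MS = 200.0
MACHINE_MS = 500.0


def total_latency_ms(stages: dict[str, float]) -> float:
    """Sum the per-stage latencies of one conversational turn."""
    return sum(stages.values())


def classify(total_ms: float) -> str:
    """Map a turn latency onto the bands described in the text.

    The middle band ("noticeable pause") is an assumption added
    for illustration; the essay only names the two extremes.
    """
    if total_ms <= CONVERSATIONAL_MS:
        return "conversation"
    if total_ms <= MACHINE_MS:
        return "noticeable pause"
    return "machine thinking"


# Hypothetical timings for one turn, in milliseconds.
example_turn = {
    "speech_to_text": 90.0,
    "language_model": 250.0,
    "text_to_speech": 80.0,
    "network": 60.0,
}

print(classify(total_latency_ms(example_turn)))
```

With these made-up numbers the language model alone exceeds the two-hundred-millisecond budget, which is why the gap is closed across the whole pipeline rather than in any single stage.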

This is also where the orchestration layer earns its strategic position. Every voice product is a pipeline: speech in, language model in the middle, voice out. Whoever owns the orchestration in the middle owns the integration with both ends, the routing decisions, the latency optimisation, the reliability surface, and ultimately the customer relationship. Vapi, LiveKit, Pipecat and ElevenLabs Agents are racing for that position. The platform that wins the developer ecosystem here will own the audio layer for a generation, in the same way AWS came to own the cloud not by being the most innovative service but by being the default substrate.
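The pipeline shape described above can be sketched in a few lines. This is a toy illustration, not the API of Vapi, LiveKit, Pipecat or ElevenLabs: `VoicePipeline`, its field names and the stub stages are all hypothetical, and the stubs simply pass text through as bytes so the example runs without any audio or model dependency.

```python
import time
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class VoicePipeline:
    """Hypothetical orchestrator: speech in, language model, voice out."""
    transcribe: Callable[[bytes], str]   # speech-to-text stage
    respond: Callable[[str], str]        # language model stage
    synthesize: Callable[[str], bytes]   # text-to-speech stage

    def handle_turn(self, audio_in: bytes) -> tuple[bytes, dict[str, float]]:
        """Run one turn and record each stage's latency in milliseconds."""
        timings: dict[str, float] = {}

        def timed(name, stage, payload):
            start = time.perf_counter()
            result = stage(payload)
            timings[name] = (time.perf_counter() - start) * 1000.0
            return result

        text_in = timed("speech_in", self.transcribe, audio_in)
        reply = timed("language_model", self.respond, text_in)
        audio_out = timed("voice_out", self.synthesize, reply)
        return audio_out, timings


# Stub stages standing in for real providers.
def stub_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def stub_respond(text: str) -> str:
    return f"You said: {text}"


def stub_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


def stub_loud_synthesize(text: str) -> bytes:
    return text.upper().encode("utf-8")


pipeline = VoicePipeline(stub_transcribe, stub_respond, stub_synthesize)

# Swapping the voice stage leaves the other two stages untouched.
swapped = replace(pipeline, synthesize=stub_loud_synthesize)
```

The point of the design is that the orchestrator is the only component that sees every stage and every timing, which is what makes routing and latency decisions possible from that position.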

Authenticity, hardware, and where the wealth concentrates

Three more beats are worth holding before the conclusion.

Authenticity is the regulation that builds the moat. Deepfakes, watermarking, consent frameworks, provenance standards. The EU AI Act already includes transparency obligations that require AI-generated content to be labelled. The companies that solve "who said this and was it real" become infrastructure for the regulated world. C2PA. Watermarking. Voice cloning consent. This is not a compliance burden; it is a product opportunity that creates a defensible position for the companies that take it seriously early.

Hardware is where the audio layer becomes permanent. Earbuds, voice-native devices, ambient speakers, automotive interfaces. The AirPods generation grew up with audio as a constant computing surface. Software lives in operating system shifts. Hardware lives forever. Every previous interface era was finalised by a hardware product that made it inescapable: the PC, the iPhone, the touchscreen tablet. The audio era's hardware moment is still ahead of us, which is part of what makes this an open market.

Underneath all of it, the same economics that have governed every platform shift: the application layer pays the bills, the infrastructure layer captures the wealth. AWS became more valuable than most of the SaaS companies it enabled. NVIDIA became more valuable than the AI application companies that depend on it. The audio infrastructure companies are building the AWS of audio. The companies that look like products today will look like substrate in five years.

Where I am placing the bet

If you ask me where I am putting my own time and money inside the audio layer right now, the answer is ElevenLabs. Four voice agents in production. All built on it. The voice race already has a leader, and the leader is treating itself as infrastructure rather than as a product. That is the right positioning for what this category becomes.

The platform race I am watching, not picking. Yet. Orchestration is where the next twelve months of category-shaping decisions will be made, and there are credible plays from multiple directions. I would rather be honest about that than fake a conviction I do not have.

What this is

This essay is the long-form version of The Audio Layer manifesto, the founding deck for a new publication of the same name. The publication tracks these beliefs as they play out: the companies, the funding rounds, the regulatory moves, the latency milestones, the use cases, the things that get built on the layer that we do not have names for yet.

The category is moving fast enough that any single piece of writing is a snapshot. The point of the publication is to keep the snapshot current, edition after edition, with a framework that holds. The framework is the ten beliefs in the deck. The publication is what you read in between, as those beliefs are proved right or wrong.

Audio is the layer. Build accordingly.