Mike Litman
The Harness, Not the Model
VOICE AI · 2026

Why the competitive edge in voice AI has never been the model.

01
THE ASSUMPTION

Everyone bets on the model.

GPT, Gemini, Claude, Llama: off-the-shelf APIs you call with a line of code, not custom-trained systems. Pick the most capable one, build on top, ship before competitors make their choice. That is the dominant playbook. Models are where the benchmarks live, where the press releases go, where the funding announcements focus.

The assumption feels right. It is also the wrong bet.

The story everyone tells

Model choice
is the differentiator.

Pick the most capable model. Build around it. Ship before competitors choose theirs. The model is the product.

Benchmark wins drive product decisions
Model announcements become product announcements
Switching models means starting over
What actually happens

Models converge.
The harness doesn't.

Every frontier model provider closes the capability gap within 12 to 18 months. What was a moat becomes a config value. The model is a commodity. The harness is not.

Model quality is a floor, not a ceiling
Business logic cannot live inside a prompt
Production data compounds; model weights do not
WHY NOW

Harness value up.
Model advantage down.

2023: making it work. 2024: making it reliable. 2025: making it scale. 2026: does it hold in production every day without you watching?

[Chart, 2022 to 2026: harness value rising, model advantage falling]

"I built my first voice agent in late 2023. It was a demo. What I ship now is not. The difference is not the model."

Scale AI · 2026
"The model generates the possibility. The harness decides what's real."
DAVID CAMPBELL · Scale AI
DEFINITION

The harness is everything around the model that makes it usable in production.

Input control. Execution rules. Output filters. Retry logic. Business constraints. State management. The model is one component. The harness is the architecture.

In Buggy Smart, the harness is a rules file that runs before every call. The model never sees it. But without it, the model calls venues at midnight and calls them again the next day.

INPUT CONTROL · EXECUTION RULES · OUTPUT FILTERS
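
As a rough sketch of two of those pieces, output filters and retry logic, the harness can refuse any model output outside a fixed label set and retry a bounded number of times. The `classify` callable is a hypothetical stand-in for the model call, not any vendor's API; the label set is borrowed from the yes / difficult / no classification used in Buggy Smart, and the retry count is an assumption.

```python
from typing import Callable, Optional

# Assumed label set, taken from the yes / difficult / no classification in the deck.
ALLOWED_LABELS = {"yes", "difficult", "no"}


def classify_with_harness(
    transcript: str,
    classify: Callable[[str], str],  # hypothetical stand-in for the model call
    max_attempts: int = 3,           # assumed retry budget
) -> Optional[str]:
    """Return a validated label, or None if the model never produces one."""
    for _ in range(max_attempts):
        raw = classify(transcript)
        label = raw.strip().lower()
        if label in ALLOWED_LABELS:  # output filter: only known labels pass
            return label
    return None  # the caller decides what happens next, e.g. flag for review
```

Returning `None` rather than a best guess is a deliberate choice in this sketch: an unclassified call is visible, while a wrong label is silent.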
THE FAILURE MODE

Without a harness,
you have a demo.

Skip the harness and the cost is not lower quality: it is no product at all. A responsive, impressive demo that fails silently at 11pm on a Thursday. My first voice agent called venues at 10pm. Called venues that had already answered. Called the same venue twice in a week. The model performed perfectly. The product was a mess.

The failure was not in the model. It was in everything I had not built yet.

01
HARNESS 01 · BUGGY SMART

Calls London pubs. Classifies buggy-friendliness.

The model does one thing: classify a transcript as yes, difficult, or no. Everything else is the harness. No calls after 6pm. No re-calling answered venues. Nightly re-scoring pipeline. 50-call daily cap. Classification fed back into the live map each night.

1,180 · Venues mapped
10,731+ · Conversations logged
49% · Pass rate (yes)
Daily · Re-scoring cadence
YES 584 (49%) · DIFFICULT 501 (42%) · NO 96 (8%)
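
The Buggy Smart rules listed above could be expressed as a pre-call gate along these lines. This is a minimal sketch, not the actual rules file: the `Venue` type, the `may_call` name, and the choice to treat 6pm exactly as already past the cutoff are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, time

CUTOFF = time(18, 0)  # "No calls after 6pm" (6pm itself is treated as too late: an assumption)
DAILY_CAP = 50        # "50-call daily cap"


@dataclass
class Venue:
    name: str
    answered: bool = False  # True once the venue has given an answer


def may_call(venue: Venue, now: datetime, calls_today: int) -> tuple[bool, str]:
    """Decide in plain code whether a call may be placed. The model is never consulted."""
    if now.time() >= CUTOFF:
        return False, "after cutoff"
    if venue.answered:
        return False, "already answered"
    if calls_today >= DAILY_CAP:
        return False, "daily cap reached"
    return True, "ok"
```

The gate returns a reason alongside the decision so that every skipped call can be logged and explained later.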
02
HARNESS 02 · FIRST ORDER

Calls London restaurants for dish recommendations.

Buggy Smart proved a voice agent could classify at scale. First Order pushed into harder territory: real-time conversation with a goal and no guaranteed outcome. Prompt versioned and frozen at v2.1. Success rate tracked across every call. The harness found the ceiling the model alone could not: roughly 15% of calls return a real recommendation (estimated across ~200 calls). That number tells you more about real-world AI performance than any benchmark.

~15% of calls return
a real recommendation
(estimated · ~200 calls)
~15% SUCCESS RATE · PROMPT v2.1 FROZEN · FULL CALL LOG
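
Tracking a success rate against a frozen prompt version might look like the sketch below. The `CallRecord` fields and the function name are hypothetical; the only assumption about the real system is that each logged call carries a prompt version and a success flag.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CallRecord:
    prompt_version: str       # e.g. "v2.1"
    got_recommendation: bool  # did the call return a real recommendation?


def success_rate_by_version(calls: Iterable[CallRecord]) -> dict[str, float]:
    """Group calls by prompt version and return the share that succeeded."""
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for call in calls:
        totals[call.prompt_version] += 1
        wins[call.prompt_version] += int(call.got_recommendation)
    return {version: wins[version] / totals[version] for version in totals}
```

Keying the rate on the prompt version is what makes freezing the prompt useful: a change in the number can be attributed to the world rather than to an untracked edit.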
03
HARNESS 03 · WITH MOSHI

AI phone answering for UK restaurants.

First Order showed what one agent could do with one repeatable mission. With Moshi had to do the same for thousands of venues, each with different rules: booking systems, hours, policies. The harness is what makes one model serve all of them. Per-restaurant config, booking logic, tier differentiation. The model is constant. The harness adapts.

6,088 London venues
in the dataset
6,088-VENUE DATASET · THREE PRICING TIERS · PER-VENUE CONFIG
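
One way to picture "the model is constant, the harness adapts" is a per-venue config object that feeds both the model's instructions and a hard check in code. Every field name here is an assumption made for illustration, and the sketch ignores venues that close after midnight.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class VenueConfig:
    name: str
    opens: time
    closes: time
    max_party_size: int
    tier: str  # hypothetical tier label; the deck only says there are three tiers


def build_instructions(cfg: VenueConfig) -> str:
    """Render per-venue facts into the text the shared model receives."""
    return (
        f"You answer the phone for {cfg.name}. "
        f"Opening hours: {cfg.opens:%H:%M} to {cfg.closes:%H:%M}. "
        f"Largest party you may book: {cfg.max_party_size}."
    )


def can_book(cfg: VenueConfig, party_size: int, at: time) -> bool:
    """Hard check done in code, regardless of what the model says."""
    return party_size <= cfg.max_party_size and cfg.opens <= at < cfg.closes
```

The same config drives both functions, so the model is told the rule and the code enforces it independently.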
THE PATTERN

Three harnesses. Three shared rules.

01 · State
State lives outside the model
Venue scores, call history, business rules. None of it lives in a prompt. All of it informs what the model receives.
Persistent data layer
02 · Rules
Business rules override the LLM
No calls after 6pm. No re-calling answered venues. Per-venue config. These constraints do not ask the model for permission.
Hard-coded constraints
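
A persistent data layer for call history can be very small. The sketch below uses Python's built-in `sqlite3` module; the table layout, the outcome values, and the function names are assumptions, not the schema behind any of the three products.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the call-history store. State lives here, not in a prompt."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS calls ("
        " venue_id TEXT NOT NULL,"
        " called_at TEXT NOT NULL,"  # ISO 8601 timestamp
        " outcome TEXT NOT NULL)"    # assumed values: 'answered', 'no_answer'
    )
    return conn


def record_call(conn: sqlite3.Connection, venue_id: str, called_at: str, outcome: str) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO calls VALUES (?, ?, ?)", (venue_id, called_at, outcome))


def already_answered(conn: sqlite3.Connection, venue_id: str) -> bool:
    """Back the 'no re-calling answered venues' rule with stored history."""
    row = conn.execute(
        "SELECT 1 FROM calls WHERE venue_id = ? AND outcome = 'answered' LIMIT 1",
        (venue_id,),
    ).fetchone()
    return row is not None
```

Because the rule reads from the store rather than from conversation context, it holds across restarts, prompt changes, and model swaps.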
THE OBJECTION

"Platforms like Vapi and Retell already give you the harness. Just plug in the model and go."

THE ANSWER

Platforms give you
a generic harness.

Your competitive moat requires a specific one. Vapi and Retell abstract away infrastructure, and in doing so they take ownership of your production data. Every call you make teaches their platform. Not yours. When you build your own harness, the learning compounds in your system. The moat is yours.

Use platforms to prototype. Build a harness to compete.

BUILT BY ME · ALONE

Three live agents. No team. No VC.

These are not mockups or demos. Each has made real calls to real phone numbers and returned real answers. Buggy Smart runs every morning at 10:30am. First Order has a full call log. With Moshi has a live website and three pricing tiers.

The model is not the hard part. The harness is where the hours went.

3 LIVE VOICE AGENTS · 10,731+ CONVERSATIONS · 1,180 VENUES MAPPED
THE ARCHITECTURE

Every harness has three layers.

Layer 1 · Input
Input control
What the model receives. Context selection, data enrichment, constraint injection. Garbage in, garbage out is a harness failure, not a model failure.
What gets in
Layer 2 · Execution
Execution control
What the model is allowed to do. Tool permissions, action gates, retry policies. The model proposes; the harness decides.
What it can do
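
"The model proposes; the harness decides" can be sketched as an action gate. The tool names, the `ProposedAction` shape, and the party-size limit are hypothetical; the point is only that permission is checked in code after the model has spoken.

```python
from dataclasses import dataclass, field

ALLOWED_TOOLS = {"lookup_hours", "create_booking"}  # hypothetical tool names


@dataclass
class ProposedAction:
    tool: str
    args: dict = field(default_factory=dict)


def gate(action: ProposedAction, max_party_size: int = 8) -> tuple[bool, str]:
    """Approve or reject an action the model has proposed."""
    if action.tool not in ALLOWED_TOOLS:
        return False, f"tool not permitted: {action.tool}"
    if action.tool == "create_booking":
        size = action.args.get("party_size")
        if not isinstance(size, int) or not 1 <= size <= max_party_size:
            return False, "party size outside allowed range"
    return True, "ok"
```

A rejected action never executes; what to tell the caller instead is a separate decision for the harness.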
THE MOAT

Production learning is the part you cannot buy.

Every call Buggy Smart makes teaches the harness. Version one classified everything as yes or no. Version six has a whole tier for "difficult" venues: pubs that welcome buggies on weekdays but not weekends. The model never changed. The harness learned it from real calls to 1,180 venues. No competitor can buy that data. No model update can replicate it.

"The thing that made it real was forcing a 'difficult' category: the model wanted yes or no, but production taught me there's a whole middle tier, and that insight lives in the harness, not the model."

Mike Litman · Builder, Buggy Smart

IF YOU ARE BUYING

Ask about the harness,
not the model.

The right questions: What are your business rules? How do you handle failures? What have you learned from production that changed the system? What did version two look like versus version one?

Every voice API vendor I evaluated while building these products led with the model. None could answer the harness questions. If a vendor leads with the model, they are selling a demo. The harness is the proof.

IF YOU ARE BUILDING

Three things before you touch the model.

None of this is free. A harness takes weeks to build right. But a demo that breaks in production is not cheaper: it is just more expensive later.

01 · Rules first
Map your constraints
What can your system never do? What must it always do? Write these down before you write a single prompt. Business rules are not an afterthought.
Before any code
02 · Log everything
Build collection from day one
Production data is your moat. Every call, every output, every failure. The harness you ship in six months is only as good as what you collected on day one.
From first call
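
Collection from day one does not need much machinery. A minimal sketch, assuming an append-only JSON Lines file and illustrative field names (none of them taken from the real call logs):

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_call(path: Path, venue_id: str, prompt_version: str,
             transcript: str, outcome: str) -> dict:
    """Append one call record as a single JSON line and return it."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "venue_id": venue_id,
        "prompt_version": prompt_version,
        "transcript": transcript,
        "outcome": outcome,  # failures are logged too, not only successes
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_log(path: Path) -> list[dict]:
    """Load every record back, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

An append-only file is enough to start with: nothing is overwritten, and the format can be loaded into a database later once the questions worth asking are clearer.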
The prediction · Voice AI · 2028

The harness
is the moat.

In two years, every team will have a capable model. The teams that stand out will be the ones whose harness spent those two years in production, collecting what no one else's did. The moat is the data. The data comes from shipping.
