Mike Litman
AI Agents in Production
A TALK BY MIKE LITMAN · WORLD CUP 2026 · INSTRUCTIONS AS INFRASTRUCTURE
AI AGENTS IN PRODUCTION
What a World Cup opening night proved · 11 June 2026
WHO IS TALKING

A strategist, not an engineer.

Fifteen-plus years in brand and culture. I have never written production code by hand. Everything in this product was built by AI coding agents working in long sessions, often several in parallel, often overnight. It runs live, with real users and a real business partner.

15+ YEARS BRAND AND CULTURE · ZERO LINES WRITTEN BY HAND · LIVE PRODUCT, REAL USERS
THE MACHINE

A machine running on promises.

Beat the Gaffer is a World Cup prediction game with an AI pundit whose picks are public and permanent. For weeks the scoring engine had never scored a real match. The leaderboard had never moved. Opening night was the moment every promise got called in at once.

72
Group games
~200
Players with live predictions
One
AI pundit, record public
Zero
Real matches ever scored
22:41, 11 JUNE
141

Mexico 2-0 South Africa lands in the database. One scheduled function fires and settles 141 predictions in a single run. Nothing breaks, nothing double-counts, nobody gets emailed by accident. The Gaffer calls the exact score and banks five points.
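
A single run that "never double-counts" is usually an idempotent one: settling twice gives the same answer as settling once. The sketch below is one minimal way to get that property, not the product's actual engine. The `Prediction` and `Result` shapes and the `settle` function are hypothetical; the only scoring fact taken from the talk is the five points for an exact score, and every other outcome is scored as zero here rather than guessed.

```typescript
// Hypothetical shapes: the talk does not show the real schema.
type Prediction = {
  id: string;
  home: number;
  away: number;
  points: number | null; // null until the prediction has been settled
};

type Result = { home: number; away: number };

// Five points for an exact score comes from the talk.
// Any other scoring tier is unknown, so this sketch awards zero.
const EXACT_SCORE_POINTS = 5;

function scorePrediction(p: Prediction, r: Result): number {
  return p.home === r.home && p.away === r.away ? EXACT_SCORE_POINTS : 0;
}

// Settle each unsettled prediction once. Rows that already carry points are
// returned untouched, so a second run cannot double-count.
function settle(predictions: Prediction[], r: Result): Prediction[] {
  return predictions.map((p) =>
    p.points !== null ? p : { ...p, points: scorePrediction(p, r) }
  );
}

const openerResult: Result = { home: 2, away: 0 };
const firstRun = settle(
  [
    { id: "gaffer", home: 2, away: 0, points: null },
    { id: "player-1", home: 1, away: 1, points: null },
  ],
  openerResult
);
// Running the job again over already-settled rows changes nothing.
const secondRun = settle(firstRun, openerResult);
```

The design choice is that "settled" is recorded on the row itself, so the scheduled function can fire as often as it likes without needing to remember what it did last time.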

01
THE LANDMINE

The guard that had always been right.

Deep in the scoring engine's full-time write sat a guard. It was perfectly correct for as long as scores only appeared at full time. Once live in-play scores shipped that evening, a score already existed by half time, so that one guard would have silently blocked every final result of the tournament.

// the landmine, right since the day it was written
.is('home_score', null)  // only write if no score exists yet
02
THE FIX

Found, reworked and pinned, hours before kickoff.

The guard was reworked the same night to check match status, not score presence, and pinned with tests that fail if the behaviour changes by a byte. Its first ever production execution was the World Cup opener. It worked first time.

STATUS-BASED WRITE · PINNED BY TESTS · FIRST RUN: THE OPENER
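
To make the difference between the two guards concrete, here is a minimal sketch of each as a plain function. It is an illustration under assumptions, not the reworked production code: the `MatchRow` shape, both function names and the `SCHEDULED` and `IN_PLAY` status values are hypothetical. Only `FINISHED` and the score-presence check come from the talk.

```typescript
// Hypothetical status values; only FINISHED appears in the talk.
type MatchStatus = "SCHEDULED" | "IN_PLAY" | "FINISHED";

// Hypothetical row shape for a match.
type MatchRow = {
  status: MatchStatus;
  homeScore: number | null;
  awayScore: number | null;
};

// The landmine: allow the full-time write only if no score exists yet.
function scorePresenceGuard(row: MatchRow): boolean {
  return row.homeScore === null;
}

// The rework, as assumed here: allow the full-time write for any match
// that has not already been marked FINISHED, whatever scores it holds.
function statusGuard(row: MatchRow): boolean {
  return row.status !== "FINISHED";
}

// With live scores on, a match in play already has a score.
const halfTimeRow: MatchRow = { status: "IN_PLAY", homeScore: 1, awayScore: 0 };
const finishedRow: MatchRow = { status: "FINISHED", homeScore: 2, awayScore: 0 };

const oldBlocksFinalWrite = !scorePresenceGuard(halfTimeRow); // true: the final result never lands
const newAllowsFinalWrite = statusGuard(halfTimeRow); // true: the final result is written
const newProtectsFinished = !statusGuard(finishedRow); // true: a finished match is not rewritten
```

A pinning test in this style would assert all three outcomes, so that any later change to the guard's behaviour fails loudly instead of silently.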

Not luck. Not skill. Conventions.

THE STANDARD WORRY, AND IT IS THE RIGHT ONE

How do you trust code you did not write?

Reviewed by nobody. Shipped at speed. Written by agents working in parallel, on a live product with real users and a real business partner. Every instinct from traditional software says this should end badly.

MY ANSWER

"I don't trust the code; I trust the rules it has to pass through before it reaches me."

INSTRUCTIONS AS INFRASTRUCTURE
You do not review every line. You build the rules that make every line reviewable.
THE THESIS · Treat the rules with the seriousness an engineering organisation gives its deploy pipeline
How most people steer agents

Vibes and prompts.

Guidance that lives in a chat box and dies when the session ends.

"Be careful" typed into a prompt
Rules that depend on who is watching
Every new session starts from zero
Instructions as infrastructure

Written, versioned, enforced.

Rules that survive between sessions and apply to whichever agent shows up next. Engineering teams have always had conventions. The new part: the rules replace the review, and a non-engineer can write them.

Conventions in files every session must read
Hooks that block the deploy, not advice that can be ignored
A changelog the next agent inherits

The rules are boring. That is the point.

The working instruction set · Beat the Gaffer, World Cup 2026
01 · Status, not score
Every surface showing a result checks the match is FINISHED, never just that a score exists.
02 · Flags default off
Every new feature ships behind a flag that starts switched off.
03 · No unattended sends
Nothing emails or notifies a real user without a human approving the send.
04 · Scoring pinned by tests
Tests fail if scoring behaviour changes by a single byte.
05 · "Done" needs evidence
Every claim of done requires pasted proof: the deploy log, the live response, the database row.
06 · An honest bible
Every session ends by writing what actually happened, including what was not verified.
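
Two of these rules, flags default off and no unattended sends, come down to what happens when nobody has said anything. A minimal sketch follows, assuming a flag store that is a plain object and a send queue whose items carry an approver. The names `isEnabled`, `sendable` and `approvedBy` are hypothetical and not taken from the product.

```typescript
// Hypothetical flag store: flag name to on/off.
type Flags = Record<string, boolean>;

// A flag is on only if it has been explicitly set to true.
// Missing or unknown flags are treated as off.
function isEnabled(flags: Flags, flagName: string): boolean {
  return flags[flagName] === true;
}

// Hypothetical queued message; approvedBy stays null until a human signs off.
type QueuedSend = { to: string; subject: string; approvedBy: string | null };

// Only messages with a named human approver are eligible to go out.
function sendable(queue: QueuedSend[]): QueuedSend[] {
  return queue.filter((s) => s.approvedBy !== null && s.approvedBy.trim() !== "");
}

const flags: Flags = { "live-scores": true };
const liveScoresOn = isEnabled(flags, "live-scores"); // explicitly switched on
const newFeatureOn = isEnabled(flags, "knockout-bracket"); // never set, so off

const outbox = sendable([
  { to: "player@example.com", subject: "Your points", approvedBy: "a human" },
  { to: "player@example.com", subject: "Drafted by an agent", approvedBy: null },
]);
```

In both cases the safe behaviour is the default, so an agent has to take a visible, recorded step to turn a feature on or to get a message sent.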
THE SYSTEM

How a rule becomes infrastructure.

EVERY SESSION
Agents read the rules first
BEFORE TOUCHING ANYTHING
ENFORCEMENT
Hooks block, not advise
THE DEPLOY STOPS ON A VIOLATION
THE RECORD
An honest bible, every session
EVERY FAILURE FEEDS THE NEXT RULE
THE INSTRUCTION SET
Written, versioned rules
A CHANGELOG OF EVERYTHING THAT WENT WRONG
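
The difference between a hook that blocks and one that advises is small in code: the hook returns a non-zero exit code when any rule is violated, and the deploy step refuses to continue. The sketch below shows that shape under assumptions. The `Check` type, the `runGate` function and the rule labels are hypothetical, and the talk does not say which hook mechanism the product uses.

```typescript
// A hypothetical check: a label plus a function returning a list of violations.
type Check = { label: string; run: () => string[] };

// Run every check and collect violations. Any violation makes the exit code
// non-zero, which is what lets a deploy pipeline stop instead of warn.
function runGate(checks: Check[]): { exitCode: number; report: string[] } {
  const report: string[] = [];
  for (const check of checks) {
    for (const violation of check.run()) {
      report.push(`${check.label}: ${violation}`);
    }
  }
  return { exitCode: report.length === 0 ? 0 : 1, report };
}

const cleanGate = runGate([{ label: "flags-default-off", run: () => [] }]);

const blockedGate = runGate([
  { label: "flags-default-off", run: () => [] },
  { label: "status-not-score", run: () => ["results page checks score presence"] },
]);
```

A real hook would hand `exitCode` back to whatever runs the deploy. The report doubles as the pasted evidence the rules ask for.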
THE DAY BEFORE KICKOFF

The rule did the remembering.

When live scores got the green light, an agent swept every file that reads match results for places assuming "score exists" meant "match finished". Forty-six of the 48 surfaces audited were already safe: a convention written weeks earlier, applied by every agent in every session since.

48
Result surfaces audited
2
Fixes needed, both cosmetic
46
Already safe, by convention
Weeks
Since the rule was written
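
A sweep like this can be approximated with a crude text scan. The sketch below is a deliberately simple heuristic and not the audit the agent actually ran: it flags any file that reads a score column but never mentions `FINISHED`. The `home_score` column name and the `FINISHED` status come from the talk. The `away_score` column, the file names and the `auditResultSurfaces` function are hypothetical.

```typescript
// Hypothetical in-memory view of a source file.
type SourceFile = { path: string; text: string };

// Split files that read a score into "safe" (they also check FINISHED)
// and "flagged" (they read a score with no status check in sight).
// Files that never touch a score are not result surfaces and are skipped.
function auditResultSurfaces(files: SourceFile[]): { safe: string[]; flagged: string[] } {
  const safe: string[] = [];
  const flagged: string[] = [];
  for (const file of files) {
    const readsScore = /\b(home|away)_score\b/.test(file.text);
    if (!readsScore) {
      continue;
    }
    if (file.text.includes("FINISHED")) {
      safe.push(file.path);
    } else {
      flagged.push(file.path);
    }
  }
  return { safe, flagged };
}

const audit = auditResultSurfaces([
  { path: "leaderboard.ts", text: "if (match.status === 'FINISHED') show(match.home_score)" },
  { path: "share-card.ts", text: "if (match.home_score !== null) show(match.home_score)" },
  { path: "about.ts", text: "export const title = 'About'" },
]);
```

A text match can only point at candidates. Whether a flagged surface is a real bug or a cosmetic issue still takes a reader, human or agent, to decide.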
THE SAME NIGHT, THE OTHER LEDGER

Opening night handed over the failures too.

01 · External
The feed lagged
The free data feed ran around two and a half hours behind the match, by design. A single point of failure no internal convention can fix.
Paid tier bought before midnight
02 · Internal
Built is not live
Two features turned out fully built, polished and tested, but invisible: nothing fed them data and nothing linked to them.
New rule within the hour

The instruction set is not a manifesto. It is a changelog of everything that has ever gone wrong.

The question has stopped being "can the model do the work?" It can. The question is what surrounds the model.

03
THE LAYER THAT MATTERS

Judgement, encoded.

The gates, the defaults, the evidence rules: that layer is not technical. It is brand, trust and taste decisions, encoded. It is how a strategist runs an engineering organisation staffed by agents: one human, doing the judgement.

The least glamorous discipline matters most: not claiming things until proven. An honest record lets a fresh agent, or a fresh human, pick up the work cold. Most organisations cannot do this with people. With machines, you can build it in.

NO UNATTENDED SENDS · EVIDENCE BEFORE "DONE"
THE MORNING AFTER

Then it repeated.

Next morning, South Korea beat Czechia 2-1. The upgraded feed tracked the score live mid-match, the database followed it in real time, and the full set of predictions was scored about ten minutes after the final whistle. Not a one-off. A system.

139
Predictions scored
~10 min
After the final whistle
Live
Score tracked mid-match
1 of 2
Gaffer exact scores so far

Opening night proved the conventions hold. Now compounding gets its turn.

103 matches to go. A knockout rule waiting for its first penalty shootout. An AI pundit who will not shut up. The machine meets reality every day for five weeks.

mikelitman.me · hello@mikelitman.me
