AI Agents in Production

A TALK BY MIKE LITMAN · WORLD CUP 2026 · INSTRUCTIONS AS INFRASTRUCTURE

AI AGENTS IN PRODUCTION

What a World Cup opening night proved · 11 June 2026

WHO IS TALKING

A strategist, not an engineer.

Fifteen-plus years in brand and culture. I have never written production code by hand. Everything in this product was built by AI coding agents working in long sessions, often several in parallel, often overnight. It runs live, with real users and a real business partner.

15+ YEARS BRAND AND CULTURE · ZERO LINES WRITTEN BY HAND · LIVE PRODUCT, REAL USERS

THE MACHINE

A machine running on promises.

Beat the Gaffer is a World Cup prediction game with an AI pundit whose picks are public and permanent. For weeks the scoring engine had never scored a real match. The leaderboard had never moved. Opening night was the moment every promise got called in at once.

72

Group games

~200

Players with live predictions

One

AI pundit, record public

Zero

Real matches ever scored

22:41, 11 JUNE

141

Mexico 2-0 South Africa lands in the database. One scheduled function fires and settles 141 predictions in a single run. Nothing breaks, nothing double-counts, nobody gets emailed by accident. The Gaffer calls the exact score and banks five points.
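A minimal sketch of what "settles in a single run, nothing double-counts" can look like. The type names, field names and the `settleMatch` function are hypothetical, since the talk does not show the real schema or the scheduled function. Only the five points for an exact score comes from the talk; scoring everything else as zero is a placeholder, not the game's actual rules.

```typescript
// Hypothetical shapes: the real tables are not shown in the talk.
type Match = {
  id: string;
  status: "SCHEDULED" | "IN_PLAY" | "FINISHED";
  homeScore: number | null;
  awayScore: number | null;
};

type Prediction = {
  id: string;
  matchId: string;
  home: number;
  away: number;
  points: number | null; // null means "not settled yet"
};

// From the talk: an exact score banks five points.
const EXACT_SCORE_POINTS = 5;

// Placeholder scoring: exact score or nothing (an assumption for this sketch).
function scorePrediction(p: Prediction, m: Match): number {
  return p.home === m.homeScore && p.away === m.awayScore ? EXACT_SCORE_POINTS : 0;
}

// Settles every unsettled prediction for a finished match and returns how many
// were settled. Running it twice settles nothing the second time.
function settleMatch(match: Match, predictions: Prediction[]): number {
  if (match.status !== "FINISHED" || match.homeScore === null || match.awayScore === null) {
    return 0; // status, not score: a half-time score settles nothing
  }
  let settled = 0;
  for (const p of predictions) {
    if (p.matchId !== match.id || p.points !== null) continue; // skip other matches and already-settled rows
    p.points = scorePrediction(p, match);
    settled++;
  }
  return settled;
}
```

The skip on `points !== null` is what makes a repeated run harmless: a scheduled function that fires twice finds nothing left to settle.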

01

THE LANDMINE

The guard that had always been right.

Deep in the scoring engine's full-time write sat a guard. It was perfectly correct for as long as scores only appeared at full time. The evening live in-play scores shipped, a score could already exist at half time, and that one guard would have silently blocked every final result of the tournament.

// the landmine, right since the day it was written
.is('home_score', null) // only write if no score exists yet

02

THE FIX

Found, reworked and pinned, hours before kickoff.

The guard was reworked the same night to check match status, not score presence, and pinned with tests that fail if the behaviour changes by a byte. Its first ever production execution was the World Cup opener. It worked first time.

STATUS-BASED WRITE · PINNED BY TESTS · FIRST RUN: THE OPENER
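A sketch of the difference between the two guards, written as plain functions so the behaviour is easy to see. The talk only says the guard now checks match status instead of score presence; the exact condition below ("write unless the match is already marked FINISHED") and every name here are assumptions.

```typescript
type MatchStatus = "SCHEDULED" | "IN_PLAY" | "FINISHED";

// Hypothetical view of the row as it sits in the database before the write.
type StoredMatch = {
  status: MatchStatus;
  homeScore: number | null;
  awayScore: number | null;
};

// The landmine: allow the full-time write only if no score exists yet.
function scoreGuardAllowsFinalWrite(stored: StoredMatch): boolean {
  return stored.homeScore === null;
}

// The rework (assumed form): allow the full-time write unless the match
// has already been marked FINISHED, whatever scores are stored.
function statusGuardAllowsFinalWrite(stored: StoredMatch): boolean {
  return stored.status !== "FINISHED";
}

// A match with live in-play scores already stored at half time.
const liveRow: StoredMatch = { status: "IN_PLAY", homeScore: 1, awayScore: 0 };

console.log(scoreGuardAllowsFinalWrite(liveRow));  // false: the final result is blocked
console.log(statusGuardAllowsFinalWrite(liveRow)); // true: the final result goes through
```

Both guards agree while scores only arrive at full time. They only diverge once a score can exist mid-match, which is why the old one looked right for so long.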

Not luck. Not skill. Conventions.

THE STANDARD WORRY, AND IT IS THE RIGHT ONE

How do you trust code you did not write?

Reviewed by nobody. Shipped at speed. Written by agents working in parallel, on a live product with real users and a real business partner. Every instinct from traditional software says this should end badly.

MY ANSWER

"I don't trust the code; I trust the rules it has to pass through before it reaches me."

INSTRUCTIONS AS INFRASTRUCTURE

"You do not review every line. You build the rules that make every line reviewable."

THE THESIS · Treat the rules with the seriousness an engineering organisation gives its deploy pipeline

How most people steer agents

Vibes and prompts.

Guidance that lives in a chat box and dies when the session ends.

"Be careful" typed into a prompt

Rules that depend on who is watching

Every new session starts from zero

Instructions as infrastructure

Written, versioned, enforced.

Rules that survive between sessions and apply to whichever agent shows up next. Engineering teams have always had conventions. The new part: the rules replace the review, and a non-engineer can write them.

Conventions in files every session must read

Hooks that block the deploy, not advice

A changelog the next agent inherits

The rules are boring. That is the point.

The working instruction set · Beat the Gaffer, World Cup 2026

01.

Status, not score

Every surface showing a result checks the match is FINISHED, never just that a score exists.

02.

Flags default off

Every new feature ships behind a flag that starts switched off.

03.

No unattended sends

Nothing emails or notifies a real user without a human approving the send.

04.

Scoring pinned by tests

Tests fail if scoring behaviour changes by a single byte.

05.

"Done" needs evidence

Every claim of done requires pasted proof: the deploy log, the live response, the database row.

06.

An honest bible

Every session ends by writing what actually happened, including what was not verified.
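Three of the rules above are simple enough to express as small checks. This is an illustrative sketch only: the flag store, the `approvedBy` field and the evidence field names are hypothetical and are not taken from the product.

```typescript
// Rule 02, flags default off: a flag nobody has switched on counts as off.
const flags: Record<string, boolean> = {}; // hypothetical in-memory flag store

function isEnabled(flag: string): boolean {
  return flags[flag] === true; // missing or undefined means off
}

// Rule 03, no unattended sends: a send needs a named human approver.
type SendRequest = { to: string; approvedBy: string | null };

function maySend(req: SendRequest): boolean {
  return req.approvedBy !== null && req.approvedBy.trim().length > 0;
}

// Rule 05, "done" needs evidence: list whichever proof is still missing.
type DoneClaim = { deployLog?: string; liveResponse?: string; databaseRow?: string };

function missingEvidence(claim: DoneClaim): string[] {
  const required: (keyof DoneClaim)[] = ["deployLog", "liveResponse", "databaseRow"];
  return required.filter((key) => (claim[key] ?? "").trim() === "");
}
```

Each check fails closed: an unknown flag is off, an unapproved send is refused, and a claim with any empty field is not done.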

THE SYSTEM

How a rule becomes infrastructure.

EVERY SESSION

Agents read the rules first

BEFORE TOUCHING ANYTHING

ENFORCEMENT

Hooks block, not advise

THE DEPLOY STOPS ON A VIOLATION

THE RECORD

An honest bible, every session

EVERY FAILURE FEEDS THE NEXT RULE

THE INSTRUCTION SET

Written, versioned rules

A CHANGELOG OF EVERYTHING THAT WENT WRONG
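One way to picture "hooks block, not advise": a gate that runs every check and returns a non-zero exit code if any of them reports a violation. The `Check` shape, `runGate` and the example check are assumptions for illustration; the talk does not describe how the real hooks are wired. The example check pins a toy scoring function to an exact string, in the spirit of "scoring pinned by tests".

```typescript
// A check returns a list of violations; an empty list means it passed.
type Check = { name: string; run: () => string[] };

// Runs every check. Any violation at all produces exit code 1,
// which is what lets a deploy script stop instead of merely warning.
function runGate(checks: Check[]): { exitCode: number; report: string[] } {
  const report: string[] = [];
  for (const check of checks) {
    for (const violation of check.run()) {
      report.push(`${check.name}: ${violation}`);
    }
  }
  return { exitCode: report.length === 0 ? 0 : 1, report };
}

// Toy scoring function (hypothetical): five points for an exact score, else zero.
const exactScorePoints = (ph: number, pa: number, h: number, a: number): number =>
  ph === h && pa === a ? 5 : 0;

// The pinned output for two fixed fixtures. Any change in scoring behaviour
// changes this string and fails the check.
const PINNED_OUTPUT = "[5,0]";

const scoringPinned: Check = {
  name: "scoring-pinned",
  run: () => {
    const actual = JSON.stringify([exactScorePoints(2, 0, 2, 0), exactScorePoints(1, 1, 2, 0)]);
    return actual === PINNED_OUTPUT ? [] : [`expected ${PINNED_OUTPUT}, got ${actual}`];
  },
};
```

A deploy script would call `runGate([scoringPinned, ...])` and refuse to continue on a non-zero `exitCode`. The report lines are what the next session reads to learn why.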

THE DAY BEFORE KICKOFF

The rule did the remembering.

When live scores got the green light, an agent swept every file that reads match results, looking for places that assumed "score exists" meant "match finished". Of the 48 surfaces audited, forty-six were already safe: a convention written weeks earlier, applied by every agent in every session since.

48

Result surfaces audited

2

Fixes needed, both cosmetic

46

Already safe, by convention

Weeks

Since the rule was written
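A rough sketch of what such a sweep could look like. It is a crude text heuristic, not the audit the agent actually ran: it flags any line that tests a score column against null without mentioning status. The column names `home_score` and `away_score` follow the snippet shown earlier in the talk; the function name and everything else are assumptions.

```typescript
type Violation = { file: string; line: number; text: string };

// files maps a file path to its source text (hypothetical input shape).
function findScoreOnlyChecks(files: Record<string, string>): Violation[] {
  const violations: Violation[] = [];
  for (const [file, source] of Object.entries(files)) {
    source.split("\n").forEach((text, index) => {
      const readsScore = /home_score|away_score/.test(text);
      const testsNull = /null/.test(text);
      const checksStatus = /status/.test(text);
      // Suspicious: decides something from score presence alone.
      if (readsScore && testsNull && !checksStatus) {
        violations.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return violations;
}
```

A heuristic like this produces candidates for a human or an agent to read, not verdicts. It checks one line at a time, so a status check on a neighbouring line would still be flagged.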

THE SAME NIGHT, THE OTHER LEDGER

Opening night handed over the failures too.

01 · External

The feed lagged

The free data feed ran around two and a half hours behind the match, by design. A single point of failure no internal convention can fix.

Paid tier bought before midnight

02 · Internal

Built is not live

Two features turned out fully built, polished and tested, but invisible: nothing fed them data and nothing linked to them.

New rule within the hour

03 · Process

Three collisions

Parallel agent sessions collided three times in one evening, including one wiping a config file another had to restore.

Each one became a written rule

The instruction set is not a manifesto. It is a changelog of everything that has ever gone wrong.

The question has stopped being "can the model do the work?" It can. The question is what surrounds the model.

03

THE LAYER THAT MATTERS

Judgement, encoded.

The gates, the defaults, the evidence rules: that layer is not technical. It is brand, trust and taste decisions, encoded. It is how a strategist runs an engineering organisation staffed by agents: one human, doing the judgement.

The least glamorous discipline matters most: not claiming anything until it is proven. An honest record lets a fresh agent, or a fresh human, pick up the work cold. Most organisations cannot do this with people. With machines, you can build it in.

NO UNATTENDED SENDS · EVIDENCE BEFORE "DONE"

THE MORNING AFTER

Then it repeated.

Next morning, South Korea beat Czechia 2-1. The upgraded feed tracked the score live mid-match, the database followed it in real time, and the full set of predictions was scored about ten minutes after the final whistle. Not a one-off. A system.

139

Predictions scored

~10 min

After the final whistle

Live

Score tracked mid-match

1 of 2

Gaffer exact scores so far


Opening night proved the conventions hold. Now compounding gets its turn.

103 matches to go. A knockout rule waiting for its first penalty shootout. An AI pundit who will not shut up. The machine meets reality every day for five weeks.

mikelitman.me · hello@mikelitman.me
