AI Agents in Production: What a World Cup Opening Night Proved

Last night at 8pm, Mexico kicked off against South Africa and my software met reality for the first time.

Beat the Gaffer is a World Cup prediction game I built with AI coding agents: a couple of hundred players, all 72 group games, and an AI pundit called the Gaffer whose picks are public and permanent. For weeks it had been a machine running on promises. The scoring engine had never scored a real match. The leaderboard had never moved. The morning damage report had never had any damage to report. Opening night was the moment every one of those promises got called in at once.

At 10.41pm the full-time result landed in the database. One scheduled function fired, wrote the score, and settled 141 predictions in a single run. The damage maths matched the point totals exactly. The result-day share card rendered the real 2-0. The Gaffer, smugly, had called the exact score and banked five points. Nothing broke, nothing double-counted, nobody got emailed by accident.
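"Nothing double-counted" is the property that makes a scheduled settle job safe to re-run. Here is a minimal Python sketch of one way to get it, not the game's actual code. The `Prediction` shape and the `settle` function are hypothetical. Five points for an exact score comes from the night described above; the two points for a correct outcome are an assumption made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    player: str
    home: int
    away: int
    points: Optional[int] = None  # None means "not settled yet"


def outcome(home: int, away: int) -> int:
    """Return 1 for a home win, -1 for an away win, 0 for a draw."""
    return (home > away) - (home < away)


def score_prediction(p: Prediction, home: int, away: int) -> int:
    if (p.home, p.away) == (home, away):
        return 5  # exact score: five points, as the Gaffer banked
    if outcome(p.home, p.away) == outcome(home, away):
        return 2  # ASSUMPTION: illustrative value, not the game's real rule
    return 0


def settle(predictions: list[Prediction], home: int, away: int) -> int:
    """Settle only unsettled predictions; return how many this run settled."""
    settled = 0
    for p in predictions:
        if p.points is not None:
            continue  # already settled: a re-run must leave it alone
        p.points = score_prediction(p, home, away)
        settled += 1
    return settled
```

If the scheduled function fires twice for the same match, the second call settles nothing and changes no totals, because the "already settled" check lives inside the job rather than in whoever triggers it.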

Here is the part worth writing down: the scoring engine's full-time write had been changed that same evening, hours before its first ever real use. A guard deep in the code said "only write the final score if no score exists yet". Perfectly correct for as long as scores only appeared at full time. That evening I shipped live in-play scores, which meant a score would now exist at half time, which meant that guard would have silently blocked every final result of the tournament. We found it, reworked it, pinned it with tests, and its first production execution was the World Cup opener. It worked first time.
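To make the failure concrete, here is a hedged Python sketch of the two guards. The essay does not show the real rework, so `new_guard_allows_final_write` is only one plausible shape for it, and the `Match` type, the function names and the `IN_PLAY` status value are all hypothetical. Only `FINISHED` comes from the product's own conventions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

FINISHED = "FINISHED"


@dataclass
class Match:
    status: str  # "FINISHED" is the product's term; other values are assumed
    score: Optional[Tuple[int, int]] = None


def old_guard_allows_final_write(stored: Match) -> bool:
    """The original rule: only write the final score if no score exists yet."""
    return stored.score is None


def new_guard_allows_final_write(stored: Match) -> bool:
    """One possible rework: key on status, so a live score never blocks the write."""
    return stored.status != FINISHED


def write_final_score(stored: Match, final: Tuple[int, int]) -> bool:
    """Write the full-time result once; return True if a write happened."""
    if not new_guard_allows_final_write(stored):
        return False  # already final: do nothing, so re-runs are harmless
    stored.score = final
    stored.status = FINISHED
    return True
```

With a half-time score of 1-0 already stored, the old guard returns False and the final 2-0 would never be written. The status-based guard lets that write through exactly once and refuses any later attempt.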

That is not a story about luck, and honestly it is not a story about skill either. It is a story about conventions.

I am a strategist by training, not an engineer. Fifteen years in brand and culture, and I have never written production code by hand. Everything in this product was built by AI coding agents working in long sessions, often several in parallel, often overnight. The standard worry about that way of working is the right one: how do you trust code you did not write, reviewed by nobody, shipped at speed, on a live product with real users and a real business partner?

My answer has become a thesis I call instructions as infrastructure. You do not review every line. You build the rules that make every line reviewable, and you treat those rules with the same seriousness an engineering organisation treats its deploy pipeline. Not vibes, not prompts, not "be careful". Written, versioned, enforced instructions that survive between sessions and apply to whichever agent shows up next.

In this product the rules are boring, and that is the point. Every surface that displays a result must check the match status is FINISHED, never just that a score exists. Every new feature ships behind a flag that defaults to off. Nothing sends an email or a push notification unattended; a human approves every send. Scoring behaviour is pinned by tests that fail if a byte changes. Every claim of "done" requires pasted evidence: the deploy log, the live response, the database row. And every session ends by writing an honest account of what actually happened, including what was not verified, into a project bible the next session must read before touching anything.
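Three of those rules are small enough to sketch in a few lines of Python. This is an illustration of the conventions, not the product's code: every name here (`final_result`, `flag_enabled`, `send_email`, `SendNotApproved`, the `live_scores` flag) is hypothetical, and the mail call is a stand-in that sends nothing.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

FINISHED = "FINISHED"


@dataclass
class Match:
    status: str
    score: Optional[Tuple[int, int]] = None


def final_result(match: Match) -> Optional[Tuple[int, int]]:
    """Rule: a result counts only when status is FINISHED, not when a score exists."""
    if match.status != FINISHED:
        return None
    return match.score


def flag_enabled(flags: Dict[str, bool], name: str) -> bool:
    """Rule: features default to off; a missing or unknown flag is off."""
    return flags.get(name, False)


class SendNotApproved(Exception):
    """Raised when a send is attempted without a named human approver."""


def send_email(to: str, body: str, approved_by: Optional[str] = None) -> str:
    """Rule: nothing sends unattended; a human approves every send."""
    if not approved_by:
        raise SendNotApproved(f"refusing to send to {to}: no human approval")
    # Hypothetical stand-in for a real mail call.
    return f"sent to {to} (approved by {approved_by})"
```

The design choice in each case is that the safe behaviour is the default: a display surface that forgets the rule gets `None`, a feature nobody enabled stays dark, and a send without an approver raises instead of going out.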

141 predictions settled in a single run

48 result-reading surfaces audited

103 matches still to come

Opening night was the audit of whether any of that compounds. The day before kickoff, when we decided to add live scores, an agent swept all 48 files that read match results to find places that assumed "score exists" meant "match finished". It found two, both cosmetic. Forty-six surfaces were already safe, not because anyone remembered to check them that day, but because a convention written weeks earlier had been applied by every agent in every session since. That is what infrastructure means. The rule did the remembering.

The same night handed me the failure cases too, which is how you know the lesson is real. The data feed I relied on lagged two hours behind the actual match, a single point of failure no internal convention can fix; I upgraded to the paid tier before midnight. Two features turned out to be fully built, polished and tested, but invisible, because nothing fed them data or linked to them. Built is not the same as live. And running several agent sessions in parallel produced three collisions in one evening, including one session wiping a config file another had to restore. Each became a new written rule within the hour. The instruction set is not a manifesto; it is a changelog of everything that has ever gone wrong.

I think this is roughly what working with AI looks like for the next few years, and not just for code. The interesting question has stopped being "can the model do the work?" It can. The question is what surrounds the model: the gates, the defaults, the evidence requirements, the honest record-keeping. That layer is not technical. It is judgement, encoded. Deciding that no AI sends a message to a real user without human approval is not an engineering decision; it is a brand decision, a trust decision, a taste decision. The people who are good at that layer are not necessarily the people who are good at writing code, which is why a strategist can now run what is effectively a small engineering organisation, staffed by agents, with one human doing the judgement.

The discipline I have come to rate most is the least glamorous one: not claiming things. The audit credit for the morning damage report stayed unclaimed last night because no real user has lived a real morning with it yet. The marketing email that points to it is still a draft. The project bible says so, plainly, where the next session will read it. An honest record is what lets a fresh agent, or a fresh human, pick up the work cold and trust the ground they are standing on. Most organisations cannot do this with people. It turns out you can build it into a working method with machines, if you write it down and enforce it.

The tournament has just started. There are 103 matches left, a knockout-stage scoring rule waiting for its first penalty shootout, and an AI pundit whose public record now stands at one exact score from one game, which he will absolutely not shut up about. The machine will meet reality every single day for the next five weeks. Opening night proved the conventions hold. Now compounding gets its turn.

This essay is also a 19-slide talk: AI Agents in Production.
