How My Sites Started Healing Themselves

Two days. That is how long it took me to notice 394 venues had vanished from one of my sites.

It was Saturday 23 April. I had been on a long afternoon of small fixes, the sort of work where every tab is open and no thought is fully finished. At 14:49 UTC I ran a manual deploy on the Buggy Smart project. The terminal said "deploy complete". Netlify said "production deploy is live". My eyes moved on to the next thing.

Nothing about that command was wrong. Everything about my local copy was wrong. The JSON file I had just shipped contained 753 venues. The one I had just overwritten contained 1,147. The map I had spent weeks growing dropped 394 pins to the floor in a single keystroke, and not one log file complained.

I noticed because I was about to mention the venue count in conversation. Hovered on the press page (still claiming 1,140), tabbed to the live map (showing 753), and felt my stomach do a small thing.

The watcher I should have built earlier

That afternoon was the prompt for a longer thought. Not "how did this happen" (I knew immediately how it happened: I was the entire incident) but "what would have caught it." The honest answer, as a solo founder running 25 small projects, was nothing. Nothing was watching. The site had been silently broken for two days and would have stayed broken until someone outside my head pointed at it.

So I built the watcher.

The first layer was the hourly check. Every hour, a small script compares the live venue count to a 7-day rolling maximum stored in git. If today drops more than 5% below that maximum, two things happen. First, an alert. Second, more important, an automatic restore: the script walks back through Netlify's deploy history, finds the most recent snapshot where the count was healthy, and re-publishes it. The whole thing takes under ten seconds. It runs forever, free, on GitHub Actions. If the April 23 incident happened tonight, the map would be back to 1,150 venues by morning and a notification would be waiting on my phone. The system fixes itself first, tells me second.

The system fixes itself first, tells me second.

The second layer is a daily backfill. Once a day a script walks through the last 30 Netlify deploy snapshots, builds a union of every venue ID that has ever appeared on the live map, and restores any IDs missing from current. It is the safety net for slower drift, the kind that does not trigger a 5% alarm but still bleeds the dataset over weeks. Live wins for venues still present, so latest classifications are preserved. Snapshots only contribute the venues that are no longer there at all.

The third layer is a pre-commit hook on the data file itself. If you try to commit a version of the venue JSON smaller than the version on disk, the commit refuses. The mistake I made on April 23 is now physically impossible to repeat without an explicit override.

Three layers, three different timescales, three different failure modes caught. The hourly catches sudden drops. The daily catches slow ones. The pre-commit catches the keystroke. None of them would have been enough on their own. Together they make the April 23 incident a class of bug that cannot reach production for more than an hour without auto-recovery.

Once you start looking, the pattern is everywhere

That was Buggy Smart. Then I started looking at the others.

The Pattern, my daily AI-curated podcast, had its own incident waiting to be noticed. A listener called Cicely messaged on a commute about a missing episode. Internal monitoring showed green. The CI ran every day. The deploy logs were clean. But the actual feed Spotify reads had been missing the morning episode for six weeks. The CI step that wrote the new file to git had been silently succeeding while the downstream publish step silently failed. By the time Cicely asked, six weeks of episodes were not where they should have been.

The lesson

Monitoring that only watches your own systems misses what your users experience. The site was healthy by every internal measure the whole time. It took a listener on a commute to catch what the infrastructure missed.

The fix on The Pattern looks different from Buggy Smart but rhymes. Per-deploy verification that hits the actual subscriber URL. Five retries before failing. A daily watchdog at noon that checks the last three editions and regenerates from the JSON archive if any 404. A recovery cascade for missing audio: try local git first, then Netlify deploy history, only alert if both fail. Feed regression detection, so if the published episode count drops below expected, the system regenerates from archive automatically.

Once you start looking, the pattern is everywhere. Queue Index has a baseline snapshot that gets restored if the new pipeline produces fewer venues than the live one. First Order rolls back to live recommendations if the regenerated set looks worse. The portfolio site itself runs a post-deploy smoke test with three retries on exponential backoff. EVERYWEAR marks optional pipeline steps as non-fatal, so the newsletter still sends if a brand-page generation step breaks.

Six projects, four different mechanisms, one shared principle: the site notices first. The human gets the alert after the recovery, not before.

SRE-grade reliability without an SRE team

This matters more than it sounds. Solo founders cannot afford an on-call rota. We cannot afford monitoring teams. We cannot afford the vendor stack a real engineering organisation uses to keep things upright. What we can afford, and what AI has unlocked in the last twelve months, is the pattern itself. The same agent that helps build the pipeline writes its self-healing layer. SRE-grade reliability without an SRE team is suddenly available to anyone who can name what they want watched.

The model is not "AI does the work, human checks it." It is "AI does the work, then watches the work." The human contribution is diagnostic curiosity: deciding what counts as broken, where the silent failure modes hide, what shape the recovery takes. AI writes the code. You design the questions.

The one rule worth starting with

The verdict, if you are deciding which of these patterns to build first, is this. Forget hourly watchdogs. Forget cascade logic. Start with the one rule that would have prevented every incident I have actually had: keep yesterday's version on standby.

Keep yesterday's version on standby. The site cannot stay broken longer than 24 hours, because yesterday is always there.

Every site keeps a copy of how it looked yesterday. Every day, an automatic check looks at today. If today is meaningfully worse, the site swaps itself back to yesterday and pings you to investigate. That is the whole pattern. No matter what broke. No matter why. The site cannot stay broken longer than 24 hours, because yesterday is always there.

Every disaster I have actually had was a silent rollback the wrong way around. Today worse than yesterday, and no one noticing for weeks. The rule means the site notices first.

If you want the longer version, the full talk is here: the maturity model, the four-pattern taxonomy, the economics, the failure modes, and the close that makes this a thesis rather than a checklist. Twelve minutes if you read it slowly.

But the watcher you actually need fits on one page. Build that one this week. The rest is layering.

The watcher I should have built earlier

Once you start looking, the pattern is everywhere

SRE-grade reliability without an SRE team

The one rule worth starting with

Enjoyed this?