Six Cultural Capital Labs projects that fix themselves before anyone notices.
Nobody woke up. Nobody opened a laptop. Nobody touched the system.
By morning the data was already restored, the cause was already logged, and the human was already informed, after the fact.
This talk is about how that becomes the default, when you build solo.
Not "the alert fires faster." Not "we have better dashboards." The fix runs without you.
Monitoring is half the job. Healing is the other half.
A manual netlify deploy from local pushed a stale 753-venue JSON over a live 1,147-venue version of the Buggy Smart map.
394 pins disappeared from the public map. Their audio with them.
The press page kept claiming 1,140 venues. The map showed 753. Both were "live".
Time taken for the gap to be noticed.
Solo founders cannot afford an on-call rota. So the system has to notice for itself.
| WHAT BROKE | Manual deploy with stale local JSON |
| DETECTION | Human, by accident, two days later |
| DATA LOST | 394 venue pins + 343 audio files |
| RECOVERY | 90 minutes once started, via Netlify deploy archives + ElevenLabs conversation_id |
| FIX | Three independent self-healing layers + a deploy wrapper |
| LESSON | Every manual path is a future incident. |
| LAYER 1 · HOURLY WATCHDOG | Compares live to a 7-day high-water mark. If live drops more than 5%, alerts AND auto-restores the most recent healthy Netlify deploy. |
| LAYER 2 · DAILY BACKFILL | Walks the last 30 deploy snapshots, unions all venue IDs ever published, restores any missing from current live. |
| LAYER 3 · PRE-COMMIT HOOK | Refuses any local commit that shrinks the JSON. The mistake gets stopped at source. |
No single layer would have caught the April 23 incident on its own. Together they make it impossible.
The Pattern, a daily AI-curated podcast. Cicely, a listener, messaged on a commute about a missing episode. The site was healthy by every internal measure. The episode hadn't shown up in her feed for six weeks.
Monitoring that only watches your own systems misses what your users experience.
A CI step that writes to a git repo is not a deploy. The healthy CI run hid a silent feed-publishing failure for six weeks.
| Per-deploy verification | Hits the actual subscriber URL. Not the API. Retries 5x. |
| Daily watchdog at noon | Independent workflow. Checks last 3 editions. Regenerates HTML from JSON archive if 404. |
| Recovery cascade for audio | Local git, then Netlify deploy archive, then alert. Never auto-spend on regeneration. |
| Feed regression detection | If episode count drops below expected, regenerate from archive automatically. |
Two incidents, two systems, same architectural pattern.
I have shipped 25+ live projects across Netlify, Vercel, and Cloudflare. Almost all started at 0–1. Three are now at 3.
Pick the level your blast radius warrants. Most solo projects only need one. The most public ones combine all four.
Roll a max over a recent window. If today drops more than a threshold below it, alert AND auto-restore the most recent healthy snapshot.
Before any pipeline runs, snapshot the current production data. If the new run produces a worse result, restore the snapshot before it ships.
Most failures are transient. APIs blink. CDNs lag. A second attempt 30 seconds later usually works. Three attempts with exponential delay catches almost everything.
Mark optional steps as non-fatal. The newsletter still sends if the brand page generation fails. The map still loads if a single audio file 404s. The pipeline keeps going.
| START HERE · 1 HOUR | Verify the actual subscriber URL after every deploy. Not the API. Not health.json. The URL your users hit. Retry 5x. |
| DAY 2 · 3 HOURS | Daily independent watchdog. Different schedule from the main pipeline. Different checks. Designed to catch what the pipeline missed. |
| DAY 3 · 1 DAY | Auto-heal the most common failure mode you have actually seen. Regenerate from your authoritative source. Commit. Redeploy. |
| ADVANCED · WEEK+ | Recovery cascade. Try cheap source, fall back to expensive source, only then alert. |
These four stages took me one Saturday afternoon for Buggy Smart.
Every day, if today looks worse than yesterday, fewer things, missing content, broken pages, the site swaps itself back. The site notices. The site fixes itself. You investigate later.
Every disaster I have actually had was a silent rollback the wrong way around: today worse than yesterday, and no one noticing for weeks. This rule means the site notices first.
| Buggy Smart | Hourly watchdog with auto-rollback, daily snapshot backfill, pre-commit hook |
| The Pattern | Per-deploy URL verification, daily watchdog, recovery cascade, feed regression detection |
| Queue Index | Baseline snapshot + catastrophic-drop restoration + suspicious-drop alert |
| First Order | Pipeline rollback to live recommendations on regression |
| Portfolio | Pre-deploy sanity checks + post-deploy smoke tests with exponential backoff |
| EVERYWEAR | Graceful degradation, non-fatal step continuation |
Six projects, four patterns, one principle: the data layer fixes itself.
A year ago, this same April 23 incident would have lost 394 venues permanently. No deploy snapshots checked. No conversation IDs harvested. No watchdog. The map would have resumed at 753 and nobody would have known what was missing.
Most indie projects are still that version. Silent data loss as the steady state.
Heals into a worse state. Auto-rollback restores yesterday's data over today's correct content. Only heal what you can verify is broken; never overwrite something newer.
High-water mark goes stale. If your archive corrupts, the watchdog will "fix" the live site to match the corruption. Cross-check expected against multiple sources before regenerating.
I almost did this twice while building the Buggy Smart watchdog. The first draft would have happily restored a 753-venue snapshot over the 1,150 live one if I had flipped one inequality.
Self-healing without these guards is just automated foot-shooting.
Every case study here is read-mostly: a map, a feed, a guide. The data is generated by the system, not entered by users. Restoring an older snapshot loses nothing the system cannot regenerate.
Write-heavy systems need different patterns. Stateful transactions, customer data, payments, race conditions, all require append-only logs, transactional rollback, or explicit replay.
Different mechanics, same principle: the system catches its own breakage and restores itself before a human notices.
| ONE SRE | £100K / year |
| HOURLY MANUAL CHECK, 25 PROJECTS | ~£36K / month 25 projects × 24 checks/day × 30 days × ~2.5 min @ £48/hr |
| THE BUGGY SMART WATCHDOG | £0 ongoing · <10s per check free on GitHub Actions, fully managed |
The pattern is undeniable once a solo founder sees the math.
Imagine pitching Mumsnet on the morning of 23 April. The press page claims 1,140 verified venues. The live map shows 753.
The cost of the incident is not the engineering. It is the press call you cannot get back.
Self-healing is brand protection, not just ops.
Heal first.
Alert only for failures the system cannot fix.
Alerts require a human to notice, understand, and act, at 10pm, after a long day. The system can do most of it.
The same agent that builds the pipeline writes its self-healing layer. Solo founders now get SRE-grade reliability without an SRE team.
I have never hired an SRE. I run 25+ live projects. The Buggy Smart 3-layer system was built and deployed in one session.
AI writes the watchdog. You decide what counts as broken. Diagnostic curiosity is the human contribution.
import subprocess, json
from pathlib import Path
URL = "https://yoursite.com/page.html"
ARCHIVE = Path("data/page.json")
OUTPUT = Path("deploy/page.html")
status = subprocess.run(["curl","-sI",URL], capture_output=True, text=True)
if "200" not in status.stdout.split("\n")[0]:
data = json.loads(ARCHIVE.read_text())
OUTPUT.write_text(render_page(data))
subprocess.run(["git","add","-f",str(OUTPUT)])
subprocess.run(["git","commit","-m","watchdog auto-heal"])
subprocess.run(["git","push"])
A distilled version of what runs on buggysmart.app every hour. Run it on cron daily. Layer up from there.
The data layer fixes itself.
The human gets the alert
after the recovery, not before.
mikelitman.me · hello@mikelitman.me