A TALK BY MIKE LITMAN

Self-Healing Solo Projects

Six Cultural Capital Labs projects that fix themselves before anyone notices.

02:14 BST · SAT 25 APR 2026

An alert fired at 2am.

Nobody woke up. Nobody opened a laptop. Nobody touched the system.

By morning the data was already restored, the cause was already logged, and the human was already informed, after the fact.

This talk is about how that becomes the default, when you build solo.

DEFINITION

Self-healing means the system fixes itself before a human notices.

Not "the alert fires faster." Not "we have better dashboards." The fix runs without you.

Monitoring is half the job. Healing is the other half.

THE INCIDENT

23 April 2026, 14:49 UTC.

A manual netlify deploy from local pushed a stale 753-venue JSON over a live 1,147-venue version of the Buggy Smart map.

394 pins disappeared from the public map. Their audio with them.

The press page kept claiming 1,140 venues. The map showed 753. Both were "live".

2 days

Time taken for the gap to be noticed.

Solo founders cannot afford an on-call rota. So the system has to notice for itself.

POSTMORTEM

Anatomy of the incident.

WHAT BROKE	Manual deploy with stale local JSON
DETECTION	Human, by accident, two days later
DATA LOST	394 venue pins + 343 audio files
RECOVERY	90 minutes once started, via Netlify deploy archives + ElevenLabs `conversation_id`
FIX	Three independent self-healing layers + a deploy wrapper
LESSON	Every manual path is a future incident.

WHAT WE BUILT · BUGGY SMART

Three independent layers.

LAYER 1 · HOURLY WATCHDOG	Compares live to a 7-day high-water mark. If live drops more than 5%, alerts AND auto-restores the most recent healthy Netlify deploy.
LAYER 2 · DAILY BACKFILL	Walks the last 30 deploy snapshots, unions all venue IDs ever published, restores any missing from current live.
LAYER 3 · PRE-COMMIT HOOK	Refuses any local commit that shrinks the JSON. The mistake gets stopped at source.

No single layer would have caught the April 23 incident on its own. Together they make it impossible.

SECOND CASE STUDY · THE PATTERN

A listener caught what the infrastructure missed.

The Pattern, a daily AI-curated podcast. Cicely, a listener, messaged on a commute about a missing episode. The site was healthy by every internal measure. The episode hadn't shown up in her feed for six weeks.

Monitoring that only watches your own systems misses what your users experience.

A CI step that writes to a git repo is not a deploy. The healthy CI run hid a silent feed-publishing failure for six weeks.

WHAT WE BUILT · THE PATTERN

Four guards on the same site.

Per-deploy verification	Hits the actual subscriber URL. Not the API. Retries 5x.
Daily watchdog at noon	Independent workflow. Checks last 3 editions. Regenerates HTML from JSON archive if 404.
Recovery cascade for audio	Local git, then Netlify deploy archive, then alert. Never auto-spend on regeneration.
Feed regression detection	If episode count drops below expected, regenerate from archive automatically.

Two incidents, two systems, same architectural pattern.

A FIVE-LEVEL MODEL

Where most projects sit.

LEVEL 0 No monitoring. Most weekend projects.

LEVEL 1 Manual checks. Most indie projects.

LEVEL 2 Alerts only. You find out, you fix.

LEVEL 3 Auto-recovery. The system fixes itself, alerts you after.

LEVEL 4 Predictive. The system prevents the issue before it happens.

I have shipped 25+ live projects across Netlify, Vercel, and Cloudflare. Almost all started at 0–1. Three are now at 3.

FOUR PATTERNS

The taxonomy.

Pick the level your blast radius warrants. Most solo projects only need one. The most public ones combine all four.

01

PATTERN 01

High-water mark detection.

Roll a max over a recent window. If today drops more than a threshold below it, alert AND auto-restore the most recent healthy snapshot.

BUGGY SMART 7-DAY WINDOW 5% THRESHOLD HOURLY TICK

02

PATTERN 02

Baseline snapshot restoration.

Before any pipeline runs, snapshot the current production data. If the new run produces a worse result, restore the snapshot before it ships.

QUEUE INDEX FIRST ORDER PRE-PIPELINE

03

PATTERN 03

Retry with backoff.

Most failures are transient. APIs blink. CDNs lag. A second attempt 30 seconds later usually works. Three attempts with exponential delay catches almost everything.

PORTFOLIO POST-DEPLOY SMOKE 3 ATTEMPTS

04

PATTERN 04

Graceful degradation.

Mark optional steps as non-fatal. The newsletter still sends if the brand page generation fails. The map still loads if a single audio file 404s. The pipeline keeps going.

EVERYWEAR NON-FATAL TASKS

WHERE TO START

Effort × blast-radius reduction.

START HERE · 1 HOUR	Verify the actual subscriber URL after every deploy. Not the API. Not health.json. The URL your users hit. Retry 5x.
DAY 2 · 3 HOURS	Daily independent watchdog. Different schedule from the main pipeline. Different checks. Designed to catch what the pipeline missed.
DAY 3 · 1 DAY	Auto-heal the most common failure mode you have actually seen. Regenerate from your authoritative source. Commit. Redeploy.
ADVANCED · WEEK+	Recovery cascade. Try cheap source, fall back to expensive source, only then alert.

These four stages took me one Saturday afternoon for Buggy Smart.

IF YOU ONLY DO ONE THING

Keep yesterday's version
on standby.

Every day, if today looks worse than yesterday, fewer things, missing content, broken pages, the site swaps itself back. The site notices. The site fixes itself. You investigate later.

Every disaster I have actually had was a silent rollback the wrong way around: today worse than yesterday, and no one noticing for weeks. This rule means the site notices first.

SIX LIVE EXAMPLES

Across the portfolio.

Buggy Smart	Hourly watchdog with auto-rollback, daily snapshot backfill, pre-commit hook
The Pattern	Per-deploy URL verification, daily watchdog, recovery cascade, feed regression detection
Queue Index	Baseline snapshot + catastrophic-drop restoration + suspicious-drop alert
First Order	Pipeline rollback to live recommendations on regression
Portfolio	Pre-deploy sanity checks + post-deploy smoke tests with exponential backoff
EVERYWEAR	Graceful degradation, non-fatal step continuation

Six projects, four patterns, one principle: the data layer fixes itself.

WHAT IT LOOKS LIKE WITHOUT

The counter-example.

A year ago, this same April 23 incident would have lost 394 venues permanently. No deploy snapshots checked. No conversation IDs harvested. No watchdog. The map would have resumed at 753 and nobody would have known what was missing.

Most indie projects are still that version. Silent data loss as the steady state.

THE HONEST CAVEAT

Self-healing has its own failure modes.

Heals into a worse state. Auto-rollback restores yesterday's data over today's correct content. Only heal what you can verify is broken; never overwrite something newer.

High-water mark goes stale. If your archive corrupts, the watchdog will "fix" the live site to match the corruption. Cross-check expected against multiple sources before regenerating.

I almost did this twice while building the Buggy Smart watchdog. The first draft would have happily restored a 753-venue snapshot over the 1,150 live one if I had flipped one inequality.

Self-healing without these guards is just automated foot-shooting.

WHERE THIS DOESN'T APPLY

Snapshot healing is for generated data.

Every case study here is read-mostly: a map, a feed, a guide. The data is generated by the system, not entered by users. Restoring an older snapshot loses nothing the system cannot regenerate.

Write-heavy systems need different patterns. Stateful transactions, customer data, payments, race conditions, all require append-only logs, transactional rollback, or explicit replay.

Different mechanics, same principle: the system catches its own breakage and restores itself before a human notices.

THE ECONOMICS

Why solo founders should care.

ONE SRE	£100K / year
HOURLY MANUAL CHECK, 25 PROJECTS	~£36K / month 25 projects × 24 checks/day × 30 days × ~2.5 min @ £48/hr
THE BUGGY SMART WATCHDOG	£0 ongoing · <10s per check free on GitHub Actions, fully managed

The pattern is undeniable once a solo founder sees the math.

THE OTHER COST

What the incident actually cost.

Imagine pitching Mumsnet on the morning of 23 April. The press page claims 1,140 verified venues. The live map shows 753.

The cost of the incident is not the engineering. It is the press call you cannot get back.

Self-healing is brand protection, not just ops.

Heal first.
Alert only for failures the system cannot fix.

Alerts require a human to notice, understand, and act, at 10pm, after a long day. The system can do most of it.

THE THESIS

AI is the on-call engineer
you couldn't hire.

The same agent that builds the pipeline writes its self-healing layer. Solo founders now get SRE-grade reliability without an SRE team.

I have never hired an SRE. I run 25+ live projects. The Buggy Smart 3-layer system was built and deployed in one session.

AI writes the watchdog. You decide what counts as broken. Diagnostic curiosity is the human contribution.

YOUR FIRST WATCHDOG · 30 LINES

Check one URL. Heal one file. Push.

import subprocess, json
from pathlib import Path

URL = "https://yoursite.com/page.html"
ARCHIVE = Path("data/page.json")
OUTPUT = Path("deploy/page.html")

status = subprocess.run(["curl","-sI",URL], capture_output=True, text=True)
if "200" not in status.stdout.split("\n")[0]:
    data = json.loads(ARCHIVE.read_text())
    OUTPUT.write_text(render_page(data))
    subprocess.run(["git","add","-f",str(OUTPUT)])
    subprocess.run(["git","commit","-m","watchdog auto-heal"])
    subprocess.run(["git","push"])

A distilled version of what runs on buggysmart.app every hour. Run it on cron daily. Layer up from there.

The data layer fixes itself.
The human gets the alert
after the recovery, not before.

mikelitman.me · hello@mikelitman.me

Buggy Smart, the case study →