Over coffee during a quarterly server audit at one of our managed clients, we asked a question we now ask everywhere: when did you last actually restore something from backup? The question reads innocently. The answer was even more innocent — 'we've never lost anything, so we've never needed to.' Backups were running. Cron logs were green. That was supposed to be the end of it.
It was the start of a weekend of disaster-recovery drills. Some of it worked. A lot of it didn't.
The question that lit the fuse
The client had two kinds of backup: a nightly `rsync` of files to a NAS, mirrored once a day to S3 Glacier; and a `pg_dump` of the database to the same bucket. The scheme looked clean. The CTO had Datadog screenshots showing a flat green 'backup OK' line for eighteen months without interruption. Properly documented backups. Properly never restored.
On Friday afternoon we proposed an informal drill. Four scenarios: restore a single file, restore one database into staging, restore a full server image into an isolated VM, and — the last one — a cross-test, where the person taking notes wasn't the one who set the backup up.
What we found in 14 hours
- The IAM key the restore script used to pull objects back from Glacier had been rotated nine months earlier, and the new key had never been wired into the script. Backups stored fine. A restore couldn't pull a single file.
- The NAS mount worked. Out of a million files, 94% restored fine. The remaining 6% were corrupt: `rsync --inplace` over a network-mounted NAS had been overwriting files in place and sometimes dying mid-write, leaving half-written files behind.
- `pg_dump` produced files. Restoring into a staging Postgres failed on a role conflict: `pg_dump` had been running as a different user than the one that owned the tables, and Postgres checks ownership on restore. Nobody had ever known, because nobody had ever restored.
- The full disk image restored. It took 11 hours via Glacier expedited retrieval, which cost more than the client expected. And the restored server couldn't activate the OS licence key — it was bound to the MAC address of the old hardware.
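The corrupt 6% in the NAS finding only surfaced because someone actually read the files back. A minimal sketch of how that comparison could be automated, assuming the source tree and a restored copy are both reachable as local paths. The names `sha256_of` and `compare_trees` are illustrative and not part of the client's tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source: Path, restored: Path) -> dict[str, list[str]]:
    """Compare every file under `source` with its counterpart under `restored`.

    Returns relative paths grouped as 'ok', 'missing' or 'corrupt'.
    """
    report: dict[str, list[str]] = {"ok": [], "missing": [], "corrupt": []}
    for src_file in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src_file.relative_to(source)
        candidate = restored / rel
        if not candidate.is_file():
            report["missing"].append(rel.as_posix())
        elif sha256_of(candidate) != sha256_of(src_file):
            # Same path, different bytes: e.g. a half-written file.
            report["corrupt"].append(rel.as_posix())
        else:
            report["ok"].append(rel.as_posix())
    return report
```

On a live source tree, files modified since the last backup will also show up as 'corrupt', so a check like this is more trustworthy when run against a snapshot or against hashes recorded at backup time.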
By Monday morning they had a backup setup that technically ran but could not complete a restore in one of four important scenarios. Roughly speaking, that's a 25% chance of full-blown disaster on the day they'd actually need it.
A backup you've never tested is just hope on a disk. Sometimes not even that.
The four checks we now run every quarter
- Restore a random file across the full chain, exactly the way a junior would do it at three in the morning: from S3, through Glacier expedited retrieval, with an IAM key that actually works. Target: success in under 30 minutes.
- Restore the latest `pg_dump` into an isolated staging Postgres. Target: full schema + indexes + foreign keys loaded without error in under 60 minutes.
- Restore the full disk image into a temporary VM with no network. Target: bootable system with accessible application layer in under 4 hours. (This one takes longest, but the difference between 4 and 24 hours matters.)
- A calendar entry for next quarter. No check exists until it has a date.
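The time targets in the checks above (30, 60 and 240 minutes) are easy to record mechanically. A minimal sketch of a drill timer, assuming each restore step is wrapped in a Python callable that raises on failure. `DrillResult` and `run_drill` are hypothetical names, and the actual restore commands are deliberately left out.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class DrillResult:
    name: str
    succeeded: bool
    minutes: float
    target_minutes: float
    error: str = ""

    @property
    def passed(self) -> bool:
        # A restore that works but misses its time target still fails the drill.
        return self.succeeded and self.minutes <= self.target_minutes


def run_drill(
    name: str,
    restore: Callable[[], None],
    target_minutes: float,
    clock: Callable[[], float] = time.monotonic,
) -> DrillResult:
    """Run one restore step, time it, and compare it against its target."""
    start = clock()
    error = ""
    try:
        restore()
        succeeded = True
    except Exception as exc:  # record the failure instead of aborting the drill
        succeeded = False
        error = str(exc)
    minutes = (clock() - start) / 60
    return DrillResult(name, succeeded, minutes, target_minutes, error)
```

A quarterly run would call `run_drill` once per scenario, for example `run_drill("single file", restore_random_file, 30)`, where `restore_random_file` is a placeholder for whatever script performs the restore, and then keep the resulting `DrillResult` records next to the calendar entry.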
We billed the client two days of work after that weekend. We almost certainly saved them one — possibly two — production outages that would otherwise have shown up the same month. That's the best cost-to-outcome ratio we know in IT. It's part of how our managed IT service works by default — but we'll help set it up for clients who run their own infrastructure too.