An on-prem backup nobody had tested in two years

Over coffee during a quarterly server audit at one of our managed clients, we asked a question we now ask everywhere: when did you last actually restore something from backup? The question reads innocently. The answer was even more innocent — 'we've never lost anything, so we've never needed to.' Backups were running. Cron logs were green. That was supposed to be the end of it.

It was the start of a weekend of disaster-recovery drills. Some of it worked. A lot of it didn't.

The question that lit the fuse

The client had two kinds of backup: a nightly `rsync` of files to a NAS, mirrored once a day to S3 Glacier; and a `pg_dump` of the database to the same bucket. The scheme looked clean. The CTO had Datadog screenshots showing a flat green 'backup OK' line for eighteen months without interruption. Properly documented backups. Properly never restored.

On Friday afternoon we proposed an informal drill. Four scenarios: restore a single file, restore one database into staging, restore a full server image into an isolated VM, and — the last one — a cross-test, where the person taking notes wasn't the one who set the backup up.

What we found in 14 hours

The IAM access key the restore script used to pull objects back from Glacier had been rotated nine months earlier, and the new key had never been wired into the script. Backups stored fine. A restore couldn't pull a single file.

The NAS mount worked. Out of a million files, 94% restored fine. The remaining 6% were corrupt: `rsync --inplace` writes straight into the destination file instead of a temporary copy, so every transfer that died mid-write over the network mount left a half-written file behind.

`pg_dump` produced files. Restoring into a staging Postgres failed on a role conflict: `pg_dump` had been running as a different user than the one owning the tables. Postgres checks ownership on restore. Nobody had ever known, because nobody had ever restored.

The full disk image restored. It took 11 hours via Glacier expedited retrieval, which cost more than the client expected. And the restored server couldn't activate its OS licence, because the key was bound to the MAC address of the old hardware.
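
The corrupt 6% only surfaced because we compared restored files against the originals. Below is a minimal sketch of that kind of comparison in Python, assuming a SHA-256 manifest is taken on the source side at backup time. The function names (`build_manifest`, `verify_restore`) and the manifest shape are our illustration, not what the client ran.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(manifest: dict[str, str], restored_root: Path) -> dict[str, list[str]]:
    """Compare a restored tree against the manifest taken at backup time."""
    missing: list[str] = []
    corrupt: list[str] = []
    for rel_path, expected in manifest.items():
        candidate = restored_root / rel_path
        if not candidate.is_file():
            missing.append(rel_path)      # never made it into the restore
        elif sha256_of(candidate) != expected:
            corrupt.append(rel_path)      # e.g. a half-written file
    return {"missing": missing, "corrupt": corrupt}
```

Run `build_manifest` on the live data before the backup and `verify_restore` on the restored copy. Anything in `corrupt` is a file that exists but does not match what was backed up, which is exactly the failure a green cron log cannot show.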

By Monday morning they had a backup setup that technically ran but failed to restore in one of four important scenarios. Treat those scenarios as equally likely and that's a 25% chance of full-blown disaster on the day they'd actually need it.

A backup you've never tested is just hope on a disk. Sometimes not even that.

The four checks we now run every quarter

Restore a random file across the full chain, exactly the way a junior would do it at three in the morning. From S3, through Glacier expedited retrieval, through an IAM key that actually answers. Target: success in under 30 minutes.

Restore the latest `pg_dump` into an isolated staging Postgres. Target: full schema + indexes + foreign keys loaded without error in under 60 minutes.

Restore the full disk image into a temporary VM with no network. Target: bootable system with accessible application layer in under 4 hours. (This one takes longest, but the difference between 4 and 24 hours matters.)

A calendar entry for next quarter. No check exists until it has a date.
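
The first three checks share one shape: run a restore, time it, compare against a target. One way to keep the drill honest is a small harness like the Python sketch below. The `Check` and `Result` types and `run_drill` are hypothetical names of ours, and each `run` callable is assumed to wrap whatever real restore procedure you use and to raise an exception on failure. The time budgets come from the targets above.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    name: str
    budget_minutes: float
    run: Callable[[], None]  # hypothetical: wraps the real restore, raises on failure


@dataclass
class Result:
    name: str
    ok: bool
    minutes: float
    detail: str


def run_drill(checks: list[Check],
              clock: Callable[[], float] = time.monotonic) -> list[Result]:
    """Run each check, record how long it took, and flag failures or overruns."""
    results: list[Result] = []
    for check in checks:
        started = clock()
        error = ""
        try:
            check.run()
        except Exception as exc:  # a failed restore is a finding, not a crash
            error = f"{type(exc).__name__}: {exc}"
        minutes = (clock() - started) / 60
        if error:
            results.append(Result(check.name, False, minutes, error))
        elif minutes > check.budget_minutes:
            detail = f"over budget ({check.budget_minutes:g} min)"
            results.append(Result(check.name, False, minutes, detail))
        else:
            results.append(Result(check.name, True, minutes, "ok"))
    return results
```

A quarterly run would build three `Check` entries with budgets of 30, 60 and 240 minutes and keep the resulting list next to the previous quarter's. The fourth check stays a calendar entry; no script replaces it.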

We billed the client two days of work after that weekend. We almost certainly saved them one — possibly two — production outages that would otherwise have shown up the same month. That's the best cost-to-outcome ratio we know in IT. It's part of how our managed IT service works by default — but we'll help set it up for clients who run their own infrastructure too.
