Engineering · 22 April 2026
Why we restore every backup, every week
A schedule that completes is not a backup that works. What we found when we started actually trying the snapshots we had been quietly producing for months.
Editor’s note. Verified restores are a roadmap feature, not a shipped one — the agent and storage pipeline exist; the scheduled verifier on the control plane does not yet. This essay is the working principle that shapes how we are building it.
For the first six months of dbcrate’s development, we had a perfectly respectable backup pipeline that nobody had ever asked to give anything back. Files appeared in a bucket on schedule. The control plane recorded their sizes and their checksums. The dashboard showed a calm row of green ticks. By every measure we had bothered to define, the product worked.
Then, on a quiet Thursday in November, we sat down and tried to restore one. It did not work. Neither did the one before it. Neither did the one before that.
What we expected to find
We expected, broadly, two things. First, that the backups would mostly restore, because the path that wrote them was the path we had spent months on. Second, that the failures, when they came, would be the kind of failure you can write a useful paragraph about: a corrupted block, a missing extension, a version mismatch.
What we found instead was that nearly a third of the snapshots we tried over the following week did not restore cleanly into a fresh Postgres of the matching major version. They restored into something, generally. Just not the database we had asked them to restore into.
What we actually found
The failures, in rough order of frequency:
- Extension drift. A backup taken on a host with `pg_stat_statements` 1.10 would not restore on a host with 1.9. We had not been recording extension versions. We are now; a sketch of that capture step follows this list.
- Role and privilege gaps. Roles existed on the source host that were not part of the database’s logical contents. The restore completed. The application could not connect.
- Locale assumptions. A snapshot from a host with `en_GB.UTF-8` as its default collation behaved subtly wrongly on a verifier set up with `C.UTF-8`. The data was right. The index orderings were not.
- Silent truncation. One backup, exactly one, was missing fourteen rows from a partition we had not realised was being filtered out by an old `--exclude-table-data` flag in someone’s notes.
- The boring ones. Disks filling, agents drifting on time, a flaky upload that resumed in the wrong place. The kinds of thing one does, in fact, expect.
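The first and third of those turned out to be bookkeeping problems: record enough about the source environment that the verifier can reproduce it before restoring anything. Here is a minimal sketch of that capture step, in Python with psycopg2. The function name and the JSON shape are ours for illustration; they are not the agent’s actual interface.

```python
# Hypothetical sketch: capture the environment facts that bit us, so the
# verifier can match them before restoring. Not the agent's real code.
import json

import psycopg2


def capture_environment(dsn: str) -> str:
    """Record extension versions and collation defaults for a source database."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # Installed extensions with their exact versions: pg_stat_statements
        # 1.10 on the source and 1.9 on the verifier is a failed restore.
        cur.execute("SELECT extname, extversion FROM pg_extension ORDER BY extname")
        extensions = dict(cur.fetchall())

        # The database's default collation and ctype: en_GB.UTF-8 versus
        # C.UTF-8 changes index orderings even when every row is identical.
        cur.execute(
            "SELECT datcollate, datctype FROM pg_database"
            " WHERE datname = current_database()"
        )
        datcollate, datctype = cur.fetchone()

    return json.dumps(
        {"extensions": extensions, "datcollate": datcollate, "datctype": datctype}
    )
```

The point is less the code than the contract: the verifier reads this back and refuses to restore into an environment that does not match, which turns extension drift and locale drift from silent wrongness into a loud preflight error.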
What we changed
We made the verification mandatory. Every backup that the agent produces is, within a week of being produced, restored into a clean Postgres of the matching major version on disposable infrastructure, with the same extensions installed at the same versions. We run integrity checks (row counts, foreign-key fan-out, index reachability). We run the customer’s own validation queries. The result is recorded with the same dignity as the backup itself.
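To make “integrity checks” concrete, here is a condensed sketch of that pass, again Python with psycopg2, and again with illustrative names rather than the shipped set. It compares row counts against the manifest written at backup time, counts foreign-key orphans (shown for single-column keys), and walks every B-tree index with amcheck’s `bt_index_check`.

```python
# Illustrative sketch of the integrity pass run against the restored copy.
import psycopg2


def verify_restore(dsn: str, expected_counts: dict[str, int]) -> list[str]:
    """Return a list of failures; an empty list means the restore passed."""
    failures = []
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # each check stands alone; no shared transaction
    with conn.cursor() as cur:
        # 1. Row counts, against the manifest written at backup time.
        #    This is the check that catches silent truncation.
        for table, expected in expected_counts.items():
            cur.execute(f"SELECT count(*) FROM {table}")  # names come from our manifest
            (actual,) = cur.fetchone()
            if actual != expected:
                failures.append(f"{table}: expected {expected} rows, found {actual}")

        # 2. Foreign-key fan-out: count child rows whose parent is missing.
        #    Shown for single-column keys; multi-column keys work the same way.
        cur.execute("""
            SELECT c.conrelid::regclass::text, a.attname,
                   c.confrelid::regclass::text, af.attname
            FROM pg_constraint c
            JOIN pg_attribute a  ON a.attrelid = c.conrelid  AND a.attnum = c.conkey[1]
            JOIN pg_attribute af ON af.attrelid = c.confrelid AND af.attnum = c.confkey[1]
            WHERE c.contype = 'f' AND array_length(c.conkey, 1) = 1
        """)
        for child, col, parent, pcol in cur.fetchall():
            cur.execute(
                f"SELECT count(*) FROM {child} c"
                f" WHERE c.{col} IS NOT NULL"
                f" AND NOT EXISTS (SELECT 1 FROM {parent} p WHERE p.{pcol} = c.{col})"
            )
            (orphans,) = cur.fetchone()
            if orphans:
                failures.append(f"{child}.{col} -> {parent}.{pcol}: {orphans} orphans")

        # 3. Index reachability: amcheck walks each B-tree and raises on
        #    structural corruption a plain query might never touch. Needs a
        #    superuser, which the verifier is on its own throwaway instance.
        cur.execute("CREATE EXTENSION IF NOT EXISTS amcheck")
        cur.execute("""
            SELECT c.oid::regclass::text
            FROM pg_class c JOIN pg_am am ON am.oid = c.relam
            WHERE c.relkind = 'i' AND am.amname = 'btree'
              AND c.relnamespace NOT IN
                  ('pg_catalog'::regnamespace, 'pg_toast'::regnamespace)
        """)
        for (index,) in cur.fetchall():
            try:
                cur.execute("SELECT bt_index_check(%s::regclass)", (index,))
            except psycopg2.Error as exc:
                failures.append(f"index {index}: {exc}")
    conn.close()
    return failures
```

Autocommit matters here: a failing `bt_index_check` would otherwise abort the transaction and mask every check after it. The customer’s validation queries run after these, against the same restored copy, and their results go into the same record.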
A backup that has never been restored is not a backup. It is a hopeful file.
The week after we made this change, the dashboard stopped showing a calm row of green ticks. Roughly one row in seven was, instead, an honest amber. We considered this an improvement.
What it costs
It costs the price of a Postgres for the duration of the verification. For most of our beta cohort, that works out to under a dollar per backup per week, on commodity cloud. We think this is approximately the right amount of money to spend on knowing whether your data is recoverable. We have not, so far, met anyone who, having looked at the bill, has chosen to turn the verification off.
The thing we have not solved
Verification, as we do it, tells you that the snapshot can be restored, and that your validation queries return the answers you expect on the restored copy. It does not tell you that the application built on top of the database will run correctly against it. That is a category of test that lives in your CI, not ours, and we do not currently have a clean way to bridge the two. We are thinking about it. If you have opinions, write in.
If you run a self-hosted Postgres and the verified-restore story is the one you wish someone else would tell, the work is happening here. We will write when there is something worth showing, and not a moment sooner.
— 30 —