How I store my files and why you should not rely on fancy tools for backup

Published on 2021-07-27.

Have you ever lost important data? I have. I learned to do backups the hard way after I lost an entire book I had just finished writing!


The year was 2000-something when I had just finished writing a book I had been working on for a couple of years. Somehow I managed to delete all the work I had done.

I don't think I have ever felt as I did in that very moment when I realized what just happened. It was like the world stopped moving. "Nooooooooo!", I shouted, while I desperately jumped around in utter disbelief. "This cannot be happening! This cannot be happening! What did I just do!?"

Well, it did happen. I did manage to delete everything, and I had to re-write the entire book from scratch!

Despite the fact that I learned the hard way, and despite the fact that everyone around me thinks I am both paranoid and crazy regarding backups, I have still managed to occasionally lose data once or twice since I lost my book. I am telling you this because I think it's important to share experience and to learn from mistakes.

Over the years I have tried different approaches and used different tools, but for about the last ten years not much has changed and the method I use has become pretty solid.

Why having a solid backup strategy really matters

Doing real backups involves serious consideration, but the most important thing is to have a solid strategy.

This basically means that you have created a workflow in which regular backup of important data is an integral part of that workflow. It matters because once you have the strategy in place it becomes second nature and you seldom have to think about it.

Even though storage space is relatively cheap nowadays I don't believe in backing up everything. It really is only the important data that requires backup. Data which - if you lose it - would affect you negatively in some way. A friend of mine keeps all his data around, even the non-important data, but I personally prefer to "clean house" once in a while.

The 3-2-1 rule

The 3-2-1 rule of backup is a good minimum part of a solid strategy: keep at least three copies of your data, on two different types of storage media, with one copy stored offsite.

When you keep backup copies of your data both locally and offsite, you increase your protection against unforeseen incidents and disasters.

Use a timetable

How often you back up is based upon your assessment of how much data you are willing to lose.

You can do backups manually or you can integrate some kind of automation. I do both.
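As a minimal sketch of what such automation could look like, this is a hypothetical crontab entry that runs a backup script every night at 02:00. The script name and paths are made up for illustration; the script itself could be anything from a simple rsync call to a ZFS snapshot routine.

    # Hypothetical crontab entry: run a nightly backup at 02:00.
    # "backup.sh" is a placeholder for whatever your backup routine is.
    0 2 * * * /home/user/bin/backup.sh >> /home/user/backup.log 2>&1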

Test your strategy

Once you have a backup strategy planned out, you need to test it. If you don't test it you will never know if it is really going to work out.

Simulate everything from a simple hard drive failure, to a bug in the encryption algorithm, to something like a natural disaster.

Don't skip this part. I have seen people implement all kinds of solutions only to lose all their data because they weren't prepared, and when they finally had to deal with a real recovery, they messed it up.

Test your strategy and prepare for recovery.
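One simple way to rehearse a recovery, assuming your backup is plain files on a mounted drive or reachable over SSH, is to restore it to a scratch directory and compare it against the live data with rsync. The paths below are just examples:

    # Restore the backup to a scratch location (paths are examples only).
    rsync -a /mnt/backup/data/ /tmp/restore-test/

    # Compare the restored copy against the live data.
    # --dry-run reports differences without changing anything, and
    # --checksum forces a full content comparison instead of relying
    # on timestamps and file sizes.
    rsync -a --dry-run --checksum --itemize-changes /tmp/restore-test/ ~/data/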

Step 1 - I keep all important data on a ZFS network share

I have a bunch of different computers I use for different tasks and they run different operating systems. I have tried various options to keep data synchronized between machines, such as revision control, rsync, unison, duplicity and other tools, but I eventually settled on something really simple: I run a ZFS network attached storage server. The storage server runs 24/7 and is mountable using Samba, NFS and SSHFS.

The storage server does other things as well, so it's almost never completely idle. I have attached a couple of spinning drives to the box that spin down when they are not doing anything. A ZFS mirror with multiple datasets is running on these drives. The sole purpose of these drives is to serve as storage media that is shared between multiple computers. A cron job automatically creates snapshots using zfs-auto-snapshot; I have daily, weekly and monthly snapshots.
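As a rough sketch, the setup looks something like the following. The pool, dataset and device names are made up for illustration, and the exact zfs-auto-snapshot options depend on the version you have installed:

    # Create a mirrored pool on two drives and a dataset for shared data.
    zpool create tank mirror /dev/ada1 /dev/ada2
    zfs create tank/data

    # Hypothetical crontab entries calling the zfs-auto-snapshot script,
    # keeping 31 daily and 8 weekly snapshots of every dataset.
    15 0 * * *   zfs-auto-snapshot --quiet --syslog --label=daily  --keep=31 //
    30 0 * * 0   zfs-auto-snapshot --quiet --syslog --label=weekly --keep=8  //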

I then utilize the storage server in two different ways.

  1. I have a ZFS network share called something as simple as data which I mount inside my home directory on all the computers I use (see the mount sketch after this list). In this data directory I have all my important files such as notes, letters, documents, pictures, videos, etc. Everything is organized in directories according to subject or purpose. This is not only a really simple way of keeping these files in sync across all machines, but the files are at the same time duplicated across multiple drives via ZFS and they are also snapshotted. Should one of the drives in the mirror fail I can easily replace it with another.
  2. I have all my dotfiles in my home directory in Git, and I push this repository to a dataset on the storage server. The dotfiles are kept in Git on each machine because I like to have an edit history, and I like to be able to run a machine without having to mount the network storage if I don't need it (one of the reasons why I don't just symlink those files). If I need to do a fresh OS install they are easily and quickly cloned. While ZFS can do a basic diff on files, it cannot give me a detailed edit history.
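For what it's worth, mounting the share on a client machine is as simple as something like this. The hostname, export path and mount point are made up for illustration, and the NFS example assumes a Linux client:

    # Mount the shared "data" dataset inside the home directory over NFS.
    mount -t nfs storage:/tank/data ~/data

    # Or over SSHFS, which only requires SSH access to the server.
    sshfs user@storage:/tank/data ~/data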

Step 2 - I have a ZFS backup server

I do regular backups to a ZFS backup server running a 3-way mirror.

The backup server is not running 24/7, but is only turned on when needed and my basic backup schedule is to run the server at the end of each important work day. The only exception to this is when I absolutely know that I haven't done anything that requires a backup. If I am in doubt, I run the backup.

I use rsync to back up files that don't belong on the shared network storage, such as email and bookmarks, while ZFS send and receive is better suited for the larger datasets from the storage server.
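A rough sketch of what that looks like. The hostnames, pool names, dataset names and snapshot names are placeholders, and the very first transfer would be a full send without the -i option:

    # Back up files that live outside the network share with rsync.
    rsync -a --delete ~/.mail/ backup:/backup/mail/

    # Replicate a dataset from the storage server to the backup server
    # with an incremental ZFS send based on the previous snapshot.
    zfs snapshot tank/data@2021-07-27
    zfs send -i tank/data@2021-07-20 tank/data@2021-07-27 | \
        ssh backup zfs receive backuppool/data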

Step 3 - I do regular backups to external media

At regular intervals I do a full backup of all the data located on the backup server onto external media for storage offsite.

I have a couple of external drives which also run in a ZFS mirror. These drives get attached and mounted, and then an incremental backup is done using rsync and/or ZFS send and receive.

If a lot of files have been deleted because they are no longer needed, I destroy the pool, create a fresh pool and do a full backup from scratch. When the backup is done, the drives are transported to an external location for physical storage.
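A sketch of the fresh-pool routine, with placeholder pool, dataset and device names:

    # Wipe the external pool and re-create the mirror.
    zpool destroy offsite
    zpool create offsite mirror /dev/da0 /dev/da1

    # A full (non-incremental) send of the latest snapshot; -R also
    # includes child datasets and their snapshots.
    zfs send -R backuppool/data@2021-07-27 | zfs receive offsite/data

    # Export the pool before unplugging the drives.
    zpool export offsite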

The next time I need to do a full backup I use a second set of drives and perform the same routine. The second set then replaces the first set at the offsite location, and the first set comes back home for the next round. This is to avoid having a single moment in time where there isn't an external offsite backup in existence.

Besides the above, I also do a backup to yet another external drive, but this is just to keep an extra copy handy on a single drive.

You have to make it easy to make it work

In order to make everything work you have to make it as easy as possible, otherwise you won't do it regularly.

If your storage server is located in the attic and you have to physically walk up a ton of stairs to use it, you will never use it :)

Try to set up your equipment in a way that makes it easy to use. Keep your basic storage media for backup close by. The only exception to this rule is the storage media which you keep at an external location. It's important that it isn't too close to you, like at your neighbor's house, but it also shouldn't require a 4-hour drive each time you want to make a backup :)

Don't trust cloud providers blindly

While I know of some really good cloud providers, such as rsync.net and Tarsnap, I recommend that you never trust cloud providers blindly.

Everything can look really nice "on paper" but you don't know what goes on behind the scenes. I have worked with a lot of different people and I have seen too much crazy shit to fully trust anyone with my important data. A cloud provider may have the best of intentions, but sometimes all it takes is a single grumpy employee or even a minor mistake to do a lot of damage.

That doesn't mean you shouldn't take advantage of some of the different solutions, both rsync.net and Tarsnap are really great and the more you can copy your data around the better it is. Just don't trust a cloud provider without taking additional steps to keep your data safe offline as well.

Free Git hosting such as GitHub, GitLab and others can also be utilized for data that you don't mind storing in public. GitLab and other providers do offer free private repositories, just don't rely fully on that.

You can also rent a cheap VPS and put encrypted data there, but don't trust the encryption because it will always have limitations and don't trust the VPS because it can easily break down. Only use such options for data that you don't mind "putting in the hands of other people".

Don't rely on "fancy" tools for backup

There exist some really cool open source backup solutions such as Borg, Restic and duplicity, but you should never rely solely on these "complex" solutions. These tools work really great, until they don't! In the past I have lost data to duplicity and other tools.

Once a complex tool breaks down and you suddenly have to work with some obscure binary file format or something else that nobody understands, you begin to cherish the simpler solutions.

One could argue that ZFS is complex as well, but that is on the filesystem level, a level on which you cannot avoid complexity no matter what you do.

I am not saying you shouldn't use something like Borg or Restic, but when it comes to backup and file management nothing beats simple and solid tools such as rsync, Tar and GnuPG for encryption.
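For example, a complete encrypted backup of a directory can be nothing more than Tar followed by GnuPG. The file and directory names below are just examples:

    # Create a compressed archive and encrypt it with a passphrase.
    tar -czf documents.tar.gz ~/data/documents
    gpg --symmetric --cipher-algo AES256 documents.tar.gz

    # Restoring later: decrypt and unpack.
    gpg --decrypt documents.tar.gz.gpg > documents.tar.gz
    tar -xzf documents.tar.gz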

When you're dealing with recovery of important data the last thing you want to worry about is layers of added complexity. The more complex a tool gets, the more difficult it becomes to handle data corruption or application bugs.

Only use encryption when it is really needed

While you might consider full disk encryption for your personal laptop and/or desktop, in case one of these gets stolen, you should avoid encryption on backup and storage when it really isn't needed, because encryption adds yet another layer of complexity.

Not only does encryption during data recovery make everything much more difficult, but should you pass away, your family members might not have the skills required to access the data.

When you do utilize encryption, consider the difference between encrypting each file individually and putting a lot of files into a single archive which you then encrypt. When everything is stored in a single encrypted archive, you risk losing more data should something go wrong with the encryption. Always validate your data using checksums or similar utilities.
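A simple way to do that is to write a checksum manifest before the data is archived or encrypted, and verify it after a restore. This sketch assumes GNU coreutils' sha256sum and uses example paths:

    # Create a checksum manifest of everything under ~/data.
    cd ~/data && find . -type f -exec sha256sum {} + > ~/data-checksums.txt

    # After a restore, verify the restored copy against the manifest.
    cd /tmp/restore-test && sha256sum --check ~/data-checksums.txt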

Final notes

If you have never implemented a backup strategy and you doubt whether you actually need one, consider how you would feel if you lost all your data. If you wouldn't mind losing your data, you probably don't need backups. However, most people nowadays have at least pictures and videos of friends and family they care about. Perhaps you have important documents lying around somewhere too. In either case, it's much better to have a backup and not need it, than to need it and not have it.

When you store files on ZFS, it not only ensures data integrity by protecting you against silent data corruption caused by data degradation, but it also helps protect you against power surges (voltage spikes), bugs in disk firmware, phantom writes (the previous write did not make it to disk), misdirected reads/writes (the disk accesses the wrong block), DMA parity errors between the array and server memory or from the driver (since the checksum validates data inside the array), driver errors (data winds up in the wrong buffer inside the kernel), accidental overwrites (such as swapping to a live file system), and much more.
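You can also ask ZFS to actively verify every block against its checksum and repair what it can from the mirror. "tank" is a placeholder pool name:

    # Verify all data in the pool and repair what can be repaired.
    zpool scrub tank

    # Inspect the result; checksum errors and repairs show up here.
    zpool status -v tank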

NOTE

ZFS without ECC memory is no worse than any other file system without ECC memory. Using ECC memory is recommended in situations where the strongest data integrity guarantees are required. Random bit flips caused by cosmic rays or by faulty memory can go undetected without ECC memory, and any filesystem will then write the damaged data from memory to disk without being able to detect the corruption. Also note that ECC memory is often not supported by consumer-grade hardware, and it is more expensive. In any case, you can run ZFS without ECC memory; it is not a requirement. Just make sure to validate your data.

You don't have to rely on expensive hardware to run a ZFS storage or backup server. For a very long period of time I managed perfectly fine running a mirror on two external USB disks attached to a Marvell Armada 510 800 MHz Cubox. It is generally not a good idea to run ZFS on external USB drives; however, neither ZFS nor the little Cubox has ever failed me. I actually only switched it out for something else in order to get better performance.

I have also tested running the same setup with different models of Raspberry Pi's, but I have had nothing but bad experiences with external USB drives attached to them. It doesn't matter what version of Raspberry Pi it is, or how powerful your PSU is, they always seem to suffer from automatic and irregular un-mounting, random changes of device mappings, and other strange errors.

FreeBSD has great support for a lot of hardware and it has a solid ZFS implementation. If you have an old computer lying around, perhaps you can use that either as a storage server or as a backup server.

In any case, unless you truly don't care about your data, you should adopt a solid backup strategy before it's too late ;)