Some personal notes on revision control and backup

Published on 2009-10-12.

This document is based upon some simple notes to myself. I have actually rewritten it a couple of times - each time I changed my directory structure or considered another way to implement some kind of backup solution. A backup may be as simple as a copy or archive file of a tree of files, or it may take advantage of recording only the small differences since a previous backup, to make efficient use of storage space. This document is about my personal preferences with regard to backup solutions, revision control systems, and directory structure.

Using a revision control system as a backup solution

For many years I have been using a revision control system (RCS) as a backup solution for many of my files. First I did it with CVS (which handles binary files really badly), and later, after deciding on which distributed revision control system to use, with Mercurial.

Using a revision control system as a backup solution has both positive and negative side effects.

The negative:

- Non-text files such as PDFs, OpenOffice documents, zipped files and other binary files are only handled efficiently as long as they don't change often.
- If you really don't need revision control, you might as well just use some other backup solution that does incremental backup, like rsync or Unison.

The positive:

- If you use a simple synchronization tool, it's impossible to get an accidentally deleted file back; a revision control system lets you do exactly that.
- If you simply use some archive tool to compress your files, you waste a lot of disk space because the same files go into the archive each time it is created; a revision control system only records the changes.

Since I switched to Mercurial, I have kept a lot of files on backup just by using the RCS, with great results. As long as your binary files don't change often, it's a perfectly valid backup solution.

It all depends on the situation, what kind of files you are dealing with, and your specific needs.

Over the years I have tried several different backup solutions that support incremental backup, such as Unison, rsync, rdiff-backup, Rsnapshot and Duplicity.

Some of these backup solutions, such as Duplicity, provide revision-like file management, where it is possible to pull out specific revisions of a file that has changed over time.

Except for Duplicity and Unison, the other solutions either don't provide exactly what I need, or they do it in a way that I don't like. Rsnapshot is a great tool, but it is supposed to be run as a cron job, whereas I mostly just want to do a manual backup.

Unison

Unison is a file synchronization tool, and it works great for incremental backups. I have been using it for about two years with nothing but great results, except for a few instances where I accidentally deleted a semi-important file (my mistake, not the tool's).

Unison allows two replicas of a collection of files and directories to be stored on different hosts (or different disks on the same host), modified separately, and then brought up to date by propagating the changes in each replica to the other. Unlike simple mirroring or backup utilities, Unison can deal with updates to both replicas of a distributed directory structure. Updates that do not conflict are propagated automatically. Conflicting updates are detected and displayed and the user is asked for the right action.

With Unison you don't have to worry about several copies of the same file lying around.

The side effect of using Unison is that you can't get deleted files back.

Unison will synchronize the files between different computers. Changes to the files may be propagated either way, depending on the time stamps. You can change the files on either computer, and Unison will help you resolve any conflicts you might create. While it is synchronizing, it will prompt you with each file that has changed and ask you what to do with it, so you can see if you are about to overwrite something important.

What's great is the support for incremental backup, which avoids the same files being transferred again and again. Incremental backup means that only new and changed files get into the backup, thus saving a lot of disk space and time.

Unison is great at keeping files synchronized between different computers. If too much time passes between synchronizations, you might forget which computer holds the newest version of a file, but Unison will make sure you can inspect such a problem manually before it makes an attempt to synchronize the files.
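
As a rough sketch of what a synchronization over SSH looks like (the host name "laptop", the user name and the paths are made-up examples):

$ unison -auto ~/rdata/idata ssh://laptop//home/user/rdata/idata

The same pair of roots can be put into a profile, e.g. a hypothetical ~/.unison/idata.prf:

# ~/.unison/idata.prf
root = /home/user/rdata/idata
root = ssh://laptop//home/user/rdata/idata
auto = true

after which running "unison idata" does the same thing.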

Duplicity

Duplicity backs up directories by producing encrypted (encryption can be disabled) tar-format volumes and uploading them to a remote or local file server. Because Duplicity uses librsync, the incremental archives are space efficient and only record the parts of files that have changed since the last backup. Because Duplicity uses GnuPG to encrypt and/or sign these archives, they will be safe from spying and/or modification.

Duplicity also includes the rdiffdir utility. Rdiffdir is an extension of librsync's rdiff to directories; it can be used to produce signatures and deltas of directories as well as regular files. These signatures and deltas are in GNU tar format.
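
As a rough sketch of basic usage (the paths and the file:// target are made-up examples; a remote target such as an scp:// URL works the same way):

$ duplicity ~/rdata/cdata file:///home/user/mnt/backup/duplicity/cdata

Restoring a single file as it looked at some earlier point in time goes roughly like this:

$ duplicity restore --file-to-restore doc/notes.txt --time 2009-09-01 \
    file:///home/user/mnt/backup/duplicity/cdata ~/notes-restored.txt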

Duplicity is a really great tool, but - as of writing - it is still beta, and I don't trust it.

I would use Duplicity on directories where I don't need to keep a track record of things, but after reading up on the bug reports, I think Duplicity is currently too error-prone to be used for anything important.

From the Duplicity changelog:

2009-08-01: Fixed 405734 duplicity fails to restore files that contain a newline character

That's a pretty serious error, but like the project states, it's still beta.

Once Duplicity has become more stable, I am sure it's going to be a great tool used by many people. I like Duplicity very much.

The way I do it

On both my desktop and laptop I have the following directory structure in my home directory:

bin
Desktop
Mail
mnt
rdata

Besides the above (which occasionally may change), my home directory of course contains a lot of hidden directories and hidden files as well.

The mnt directory is where I mount NFS shares, Samba shares, and other directories. Mail is used by my email client, and bin and Desktop are obvious.

The rdata directory is where I keep the "relevant data". Data that I actually physically create, edit, store and delete.

The rdata directory looks like this:

rdata/
    .idata
    idata
    cdata
    pdata
    odata
    udata

I prefer using short names for directories (as long as they make sense to me in a way that prevents me from forgetting what's in them).

The udata directory contains unimportant data such as music. I don't back up this directory, and I don't keep its files synchronized between machines.

The pdata directory is my main repository directory; it contains the different projects and documents that I actually need to keep track of. Each subdirectory is its own project and repository, and it is automatically backed up using Git.

The odata directory only contains outdated data.

The cdata directory contains all the current files that I am working on (not projects), and it looks like this:

cdata/
    code
    db
    doc
    priv
    proj
    web

The idata directory contains private files such as letters and the like, and the "i" is short for "important".

Because of the private nature of the files in the idata directory, I encrypt everything using encfs, hence the .idata directory.
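
As a rough sketch, mounting and unmounting the encrypted directory looks like this (plain encfs usage with the directory names from above):

$ encfs ~/rdata/.idata ~/rdata/idata
$ fusermount -u ~/rdata/idata

The first command mounts the encrypted .idata directory on idata and asks for the passphrase; the second unmounts it again when I am done.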

On my backup server I have created a similar directory structure, and I then use Unison or rsync to keep the idata directories in sync between my backup server and my desktop and laptop.

Here's an example of how I manually backup/synchronize between directories:

$ mount mnt/backup
$ unison -auto rdata/idata ~/mnt/backup/rdata/idata
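
An rsync version of the same copy could look roughly like this (a one-way push; note that with --delete, files removed locally also disappear from the backup):

$ rsync -av --delete ~/rdata/idata/ ~/mnt/backup/rdata/idata/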

Because I use encfs on the backup server as well, the data stays encrypted on both ends. There are many other ways to achieve this, but this way makes me remember to be careful.

Some time ago I used to simply synchronize the encrypted directories, which would prevent me from having to use encfs on the backup server, but if you do that, it's impossible to see which files you are dealing with in case of a problem (unless you leave the filenames unencrypted).

When I work with revision control related projects, I create my projects where they are supposed to work. If the project at hand is a web application I will create the repository directly on my web server in the relevant directory.

I will then create a subdirectory on my desktop/laptop in pdata and clone the repository from the webserver.

I do this because I found that I wasn't very good at committing changes atomically; I never really made the context switch between editing one specific file and another, and then I would find that a week had gone by and I had 50 changes that should really have been committed separately, but I forgot to do so.

When I work directly on the web server inside the specific repository, I tend to remember to commit each change whenever it is made.

Whenever I have done some work, I enter the directory on my desktop/laptop and pull the changes from the original repository and then update the local one with the changes.

This also carries the benefit that if an idea suddenly pops up, but I don't want to switch on the web server and start working, I can edit the files locally and then push the changes back to the web server repository once it gets turned on, and update it (in order to refresh the working copies).
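
The round trip looks roughly like this (the host name and paths are made-up examples):

$ git clone ssh://user@webserver/var/www/some-web-app ~/rdata/pdata/some-web-app
$ cd ~/rdata/pdata/some-web-app
$ git pull                  # fetch and merge the commits made on the web server
# ... edit locally, git add and git commit ...
$ git push                  # send the local commits back to the web server

Pushing into a repository that has a checked-out working copy needs a little care; see the note on working copies and hooks further down.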

Here's what my pdata could look like:

some-web-app/
    .git/
another-web-app/
    .git/
some-testing-stuff/
    .git/
some-cpp-app/
    .git/
another-cpp-app/
    .git/
some-haskell-project/
    .git/
...

I also use Git as a simple backup tool on my cdata and odata directories.

The reason why I do this is that a synchronization tool can't bring back files deleted by accident, and I sometimes need to be able to do that (even though it very rarely happens).

Most of the outdated data I keep in odata I rarely touch, but I still like to have the files around. Losing a file one day because I suddenly feel the need to reorganize the directory structure isn't an option, and Git prevents that.

I create the remote repositories on my backup server as bare repositories. I then clone the repositories onto my desktop or laptop, create files in them, add, commit and push.
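
In practice it looks roughly like this (the paths and host name are made-up examples; the backup server could just as well be reached through an NFS mount):

# on the backup server
$ git init --bare /srv/backup/git/cdata.git

# on the desktop or laptop
$ git clone ssh://backup/srv/backup/git/cdata.git ~/rdata/cdata
$ cd ~/rdata/cdata
$ git add some-document.txt
$ git commit -m "Add some-document.txt"
$ git push origin master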

What about working copies when using a revision control system as a backup tool?

RCS software will never incorporate other people's changes, nor make your own changes available to others, until you explicitly tell it to do so (with hg update in Mercurial or git checkout in Git). Shared repositories most commonly don't have working copies at all, but when you are sharing with yourself and the purpose is backup, it might be wise to remember to update the repository on the backup machine in order to get the working copies too.

If you trust your revision control system, you don't need to keep the working copies on the backup repository, but if something goes wrong, and the revision control software, for example, gets updated and isn't completely backward compatible, having the working copies in the repository on the backup machine is wise. Otherwise you might not be able to restore your files.

With both Git and Mercurial it's possible to create a hook that gets the receiving repository to update automatically after receiving a push.

With Mercurial you add the following section to repo/.hg/hgrc:

[hooks]
changegroup = hg update

After this you may push-and-forget. The changegroup hook is activated once for each push/pull/unbundle, unlike the commit hook, which is run once for each changeset.
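
With Git, a rough equivalent (assuming the receiving repository on the backup machine is non-bare, i.e. has a working copy) is a post-receive hook that refreshes the working copy after each push. This is only a sketch; depending on the Git version you may also need to allow pushes to the checked-out branch via the receive.denyCurrentBranch setting.

#!/bin/sh
# .git/hooks/post-receive in the receiving repository (must be executable).
# The hook runs inside the .git directory, so step up to the working tree
# and force a checkout of the freshly pushed changes.
cd ..
GIT_DIR=.git git checkout -f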

Personally though, I update manually, and because I am using NFS to mount my remote repository, it's easy. If, however, you are pushing data over SSH, it can become rather tiresome to have to log in simply to do an update each time, in which case the hook comes in handy. Rule of thumb when using an RCS as a backup tool: after a push or pull, always update!

Exporting backup to remote location

As an extra precaution I like to keep a copy of everything on a remote location.

Since a remote location cannot be trusted (ever) I always encrypt every directory even if it doesn't contain anything private.

Some time ago, and after doing a lot of testing, I decided to use Duplicity as my remote backup solution because it both provides encryption (using GnuPG) and incremental backup, but as mentioned earlier I don't use Duplicity on a regular basis.

In a couple of tests with Duplicity I ended up losing data.

It is possible to disable encryption and Duplicity will then only gzip the files. If a problem arises it is possible to manually extract the files using Tar.

From time to time I manually use Tar to create compressed archives of all my directories, and I then encrypt these archives using GnuPG.
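
A sketch of what that looks like for a single directory (the file name is a made-up example; --symmetric encrypts with a passphrase, but encrypting to a public key works just as well):

$ tar czf cdata-backup.tar.gz rdata/cdata
$ gpg --symmetric cdata-backup.tar.gz
$ rm cdata-backup.tar.gz

This leaves an encrypted cdata-backup.tar.gz.gpg, which can later be decrypted with gpg and unpacked with Tar if needed.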

The solution is time-consuming because it doesn't support incremental backup.

For now, however, I manage to do pretty well using Tar and GnuPG, and I just erase the old copies once the new ones have been made. This is just an extra precaution when time permits.