Backup in the 21st Century
The term backup (in the context of computers) dates back to the time when most users had all their files on a multi-user server which was managed by a system administrator.
Then came the flood of personal computers in the 1980s and 1990s. These were used in conjunction with servers, so many users kept copies of their important files on the servers. If they did back up their PCs, it was mainly to avoid re-installing and re-configuring the software on them.
In the 21st century we see a move towards more personalised use of computers and digital storage --- laptops and portable disks of various kinds are freely used. As a consequence we need to re-think what we wish to achieve with a backup and how best to achieve our goals.
In this document we examine all these questions and make some recommendations of a theoretical kind. Practical decisions and choices will depend on the context. A worked example may be added at a later date.
Why do a backup?
Here are some of the different tasks that a traditional backup has been used for:
- Recovery from catastrophic failure of the hardware.
- Recovery from accidental deletion.
- Recovery of older versions of data.
- Recovery of archived data.
In the traditional setup it was convenient to group all these tasks under one heading because:
- Clearly some mechanism for disaster recovery was needed; so it was nice that other tasks could be grouped along with it.
- Disk failures were not that uncommon and software for managing disks (file system software) was not very good at error recovery.
- Users were only required to write documents and programs while system administrators were supposed to manage the system --- which included the storage media. Broadly speaking, the users and system administrators were two disjoint classes of people.
- Storage media was expensive enough and difficult enough to access that users could not expect to keep (multiple) copies of all their files without the aid of a system administration infrastructure.
- It was harder to keep track of file integrity and so it was easier to just "backup everything".
However, things have changed a lot since then; specifically the last four points above need to be re-examined. Some additional points that need to be considered nowadays are:
- Users have a lot of data that can easily be re-generated or downloaded.
- Users have access to many different computers and need to keep files synchronised between these computers.
- Users have a lot of private/secret data on their computers.
Let us first concentrate on the "disaster recovery" aspect of backup without worrying about the other "side benefits".
Backups work as follows:
- Backups are taken periodically with some period BP. A predetermined number of such backups is saved.
- In the case of irretrievable damage (to all the data or to some smaller data set), recovery is managed as follows:
- The most recent known-to-be-good backup is restored.
- Successively more recent backups are restored until the damage to the data set is noticed. If the damage is partial, the data set is divided into subsets and the procedure is applied to each.
- The last observed good data set is restored.
The actual procedure is further complicated by the fact that most backups are "incremental". A "full backup" is taken with periodicity FxBP for some F. The remaining backups save incremental changes since the most recent full backup. This means that the first step in the restoration procedure is to restore the most recent known-to-be-good full backup.
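This restore procedure can be sketched as follows (a deliberately simplified model: backups are dicts of file contents, the first backup is assumed to be full, and `is_good` stands in for whatever damage check is available):

```python
def materialise(backups, upto):
    """Rebuild the data set as of backup index `upto` (oldest first).
    A full backup holds the complete file set; an incremental backup
    holds only the files changed since the previous backup."""
    # find the most recent full backup at or before `upto`
    start = max(i for i in range(upto + 1) if backups[i]["full"])
    state = dict(backups[start]["files"])
    for b in backups[start + 1:upto + 1]:
        state.update(b["files"])        # apply each increment in order
    return state

def last_good_state(backups, is_good):
    """Restore successively more recent backups until damage is
    noticed; return the last observed good data set (or None)."""
    good = None
    for i in range(len(backups)):
        state = materialise(backups, i)
        if not is_good(state):
            break                       # damage noticed; stop here
        good = state
    return good
```

Note that `materialise` has to start from a full backup: this is precisely why an incremental backup is useless without the full backup it increments.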
Some questions to answer
Clearly it is important to recognise which full backup can be considered "good". Moreover, given that we do want to do some work and do not want to spend all our time and storage space performing backups, we need to accept that some amount of time and data will be lost. The important questions then are:
- How long does it take to recognise a disaster?
- How many days of work can we afford to lose?
- Which data can we afford to lose?
Let us examine the answer to each question in turn.
The answer to the first question tells us how long we need to save backups. If we can confidently say that more than a week of "creeping errors" will not pass without notice, then we only need to save enough backups to ensure that we can restore the system to what it was one week ago.
The answer to the second question will decide how frequently we need to take a backup. If we do not wish to lose more than a day's worth of work, then we need to take a backup at least once a day. Even so we could lose such changes if there was some creeping error we did not notice.
The answer to the final question will decide how much storage space the backup requires. If we decide that we want to restore the entire system from backup then the backup medium needs to have at least the same capacity that our "live" system has; in fact it will need more.
The Numbers Game
Let us call the answer to the second question the "Rewind Time" (RT). The backup period BP should be no more than RT so that we save those changes to our data set which are at least RT old.
Let us call the answer to the first question our "Disaster Discovery Time" (DDT). The oldest backup that we need to save must have age at least DDT so that we have time between that backup and now to discover a creeping error. In other words we need to save M backups where M is at least DDT/BP. The M-th previous backup will then be the last known-to-be-good backup.
Now it usually takes too long to back up the entire system with period BP, so most backups are incremental. Note that an incremental backup is useless without the full backup of which it is an increment. Suppose that every F-th backup is a full backup; then we need to save N full backups where N is at least M/F. We will also store every incremental backup in between, so we will actually save NxF backups of which N are full backups. The N-th previous full backup will then be the last known-to-be-good full backup.
Let S denote the size of the data set that we want to back up --- that is then the size of a full backup. Let S/K be the amount of this data that changes during one backup period BP, on average. In K periods the cumulative change in the data set will be S, so we may as well take a full backup at that point or earlier; we can therefore assume K is at least F, the actual number of periods after which we take a full backup. The total size T of our backup is
T = NxS + Nx(F-1)xS/K = NxSx(1+(F-1)/K),
which is between NxS and 2NxS.
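These formulas are easy to check with a short calculation (the sample values for DDT, BP, F, S and K below are purely illustrative):

```python
import math

def backup_plan(DDT, BP, F, S, K):
    """Number of backups to keep and total space, per the formulas above.

    DDT : disaster discovery time (days)
    BP  : backup period (days)
    F   : every F-th backup is a full backup
    S   : size of the data set (GB)
    K   : S/K of the data changes per period on average (so F <= K)
    """
    M = math.ceil(DDT / BP)        # backups to keep: M >= DDT/BP
    N = math.ceil(M / F)           # full backups among them: N >= M/F
    T = N * S * (1 + (F - 1) / K)  # total backup size
    return M, N, T

# e.g. discover disasters within a week, daily backups, weekly full
# backups, a 100 GB data set, 2% daily churn (K = 50):
M, N, T = backup_plan(DDT=7, BP=1, F=7, S=100, K=50)
```

With these sample numbers we keep M = 7 backups, of which N = 1 is full, for a total of T = 112 GB --- comfortably between NxS = 100 GB and 2NxS = 200 GB, as expected.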
Reducing Backup Costs
To reduce the cost (time and space) of backups we must thus ask whether we can find ways to:
- Decrease DDT --- discover errors sooner.
- Increase RT --- be able to redo more work.
- Decrease S --- reduce the size of mutable and valuable data.
- Increase K --- keep changes as small as possible.
The system administrator has some techniques available to reduce DDT and a few techniques to reduce S. However, most of the techniques to reduce these two values and the techniques to increase RT and K depend on users.
In the traditional multi-user server system, a system administrator could not really enforce good practices and had to use indirect means like "quotas" to make storage and backups easier to manage.
In the case of a desktop or laptop there is only one user of the system, who often plays a "double role" as the system administrator. This user then has every reason to make life easier for the system administrator by organising files in a way that eases backup and storage management.
What the System Administrator can do
The Disaster Discovery Time can be reduced by preserving the integrity of static files and checking the integrity of files that change only rarely.
The size of the backup can be reduced by specifying which areas of the system contain files which can change (mutable files). A separate one-time bootable backup of the system can be used to aid the recovery process.
Preserving Static Files
There are several ways to prevent static files from being modified.
- You can keep the files on a read-only block device. Even a block device that is not physically read-only can be marked as such by the kernel.
- You can put these files on a read-only file-system. It is possible to work with /usr as a read-only file-system.
- You can use file-system attributes (such as the "immutable" flag) and access control lists to prevent files from being written. Another alternative is to use SELinux policies to deny writes to various files.
- You can also use stackable copy-on-write systems like "unionfs" or "aufs" to ensure that all changes go to a different file-system. This also helps to locate changes to files that are mostly static.
Checking File Integrity
There are a number of tools nowadays that can "aide" system integrity testing. Tools like aide, tripwire, samhain and fcheck can periodically check for changed files in specified directories and inform the system administrator of these changes. To check against changes to upstream installations one can use debsums and cruft on Debian systems or rpm on RPM-based systems.
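The core checksum-and-compare idea behind these tools can be sketched in a few lines of Python (a simplified illustration, not a substitute for aide or tripwire):

```python
import hashlib
import os

def snapshot(directory):
    """Map each file under `directory` to the SHA-256 of its contents."""
    sums = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            sums[os.path.relpath(path, directory)] = digest
    return sums

def changed_files(old, new):
    """Report files added, removed, or modified between two snapshots."""
    return {
        "added":    sorted(set(new) - set(old)),
        "removed":  sorted(set(old) - set(new)),
        "modified": sorted(p for p in set(old) & set(new)
                           if old[p] != new[p]),
    }
```

A periodic job would store the snapshot somewhere the intruder (or creeping error) cannot reach, re-run `snapshot`, and mail the output of `changed_files` to the administrator.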
In addition, using version control to manage '/etc' ensures that one can periodically check to see what changes have been made to vital system configuration files.
System Area Backup Size
It is a good idea to take a bootable backup of the system areas on a read-only medium like a CD/DVD. This eases recovery in the case of drastic failures. The later full and incremental backups then need only cover mutable system areas like '/etc' and those mentioned below.
Mutable System Areas
The hierarchy of files below '/tmp' and '/var' contains files that are created during system installation and use. Often these files are not essential to the "base" functionality of the system. As a rule, therefore, these files can be excluded from backup.
However, there could be directories there that would take quite a lot of time to regenerate, so it is worth backing them up at least once. For example, '/var/lib/dpkg' contains all the packaging information on a Debian system and is worth retaining; in fact, on a system where packages are not removed or installed, this can be treated like a static system area. Other areas worth backing up are database areas and version control repositories, which are often also kept in '/var'. These often change regularly and need to be backed up exactly like user areas.
In particular, the system administrator should create a list of which directories in such mutable system hierarchies should be backed up and how each of them should be treated (static vs. mutable).
Good System Administration Practices
It is generally a good idea to follow the recommended best practices for managing a system; for Debian systems these are explained in the Debian Reference. This ensures that you can keep track of exactly which changes you made to the "default" upstream installation of your system and helps you to recover a system "from scratch" if required.
Often system administrators install software just "to test it" and then remove it. It is nowadays easy to create sandboxes like "chroot"s or "vserver"s, or even run emulators, to do this. This helps preserve the integrity of your system over longer periods and reduces the size of the backup required.
What the User can do
The user area is where most of the changes on the system take place. Users can assist the backup by organising their files in a way that makes it easy to decide which files need a backup.
A user can do various things to reduce the number of things that would count as a disaster. For example:
- By using delete-not-purge tools for file management users can ensure that accidentally deleting a file is not a disaster.
- By using version control tools to keep track of file changes users can ensure that an incorrect file modification does not turn into a disaster.
- By keeping off-line archives of important documents users can ensure that "cleaning out cruft" or "clearing up disk space" does not turn into a disaster.
Damage Control and Temporary Files
One way to limit damage is to carefully separate precious data from temporary, transient or automatically generated data. The output of a program (like a simulation) could be quite precious until it has been processed, so there is also a need for a "semi-transient" category; on the other hand, if such data is not processed then it loses value after a while, since it can be regenerated (by re-running the simulation) instead of being stored. Similarly, instead of deleting files from the precious area one could move them to a "trash box" in a semi-transient area, where data older than a certain age gets purged automatically.
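The "trash box" purge can be sketched as follows (the directory layout and age threshold are illustrative; in practice such a script would be run periodically, e.g. from cron):

```python
import os
import shutil
import time

def purge_old(trash_dir, max_age_days):
    """Delete entries in `trash_dir` whose modification time is older
    than `max_age_days` days.  Returns the names that were purged."""
    cutoff = time.time() - max_age_days * 86400
    purged = []
    for name in os.listdir(trash_dir):
        path = os.path.join(trash_dir, name)
        if os.path.getmtime(path) < cutoff:
            if os.path.isdir(path):
                shutil.rmtree(path)   # purge a whole directory tree
            else:
                os.remove(path)
            purged.append(name)
    return sorted(purged)
```

Because deletion from the precious area is now just a move into the trash box, an "accidental delete" stays recoverable for `max_age_days` days without the trash growing without bound.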
So one way the user can assist backup is by clearly demarcating files according to how long they need to be kept in backups.
A related but not identical technique is to use "version control" for all precious files which are being changed --- think of a program or document that you are writing. Using version control makes it easier for the user to keep track of the precise changes made to such files so that possible disasters are discovered early and can be rewound easily. Most modern version control tools also compress and checksum files as they are committed. This means that integrity is preserved and size is reduced as well. Finally, since version control repositories are usually part of the "mutable system areas", these are automatically backed up.
An equally important way the user can assist backup is to designate some files (or directories) as archival data. These are like "static user files" in the sense that the user feels that these files are not going to change. This reduces the time taken for backup since one need not check these files for changes during the periodic backup. All such files could go onto the read-only medium that is used for the full bootable system backup and so would be recoverable "in perpetuity". If appropriate, some of these files may be put on public archives so that we need not consider them at all!
Rewind Time Increase
Most users would not happily accept an increase in the amount of work they must be willing to re-do. After all, we would ideally not like to lose even a minute of work. However, by doing our work in an organised way we can try to ensure that it can be re-done if necessary.
The more systematically one works, the easier it is to redo the work if required. For example, if you write a program and document the code as you write it, you will recall the program more clearly. Similarly, if you write a document using structural markup then it is easier to concentrate on (and so later recall) the content.
Some creative artists would not care to rewind their work at all. In some cases it is not really necessary to redo the work in exactly the same way; one merely wants to achieve the same results.
In any case, one needs to be able to balance the effort of taking frequent backups with the mental effort of keeping things sufficiently organised so that the work can be re-done if necessary. This will give us an acceptable estimate of RT.