What makes a good backup solution?
Part 1 of 4
Critical Elements of a Backup Plan
It seems that every month or two I see a reference to a large web site going down because of a catastrophic failure where they either didn't have backups or their backups were insufficient. Of course it's not just web sites, but they are the ones that make the news. In this series of articles I will look at the elements of a good backup plan. These elements all work together as a whole, so no single one is the most important; together they will hopefully create a reliable plan of action. I will include a number of different recovery scenarios in the hopes of helping you create a backup/recovery solution that is appropriate for your organization. Some reference will be made to disaster recovery, but only to the extent that it specifically relates to backup and recovery.
First let's start with the critical elements of a good backup solution, in no particular order.
- Frequency: Backups should be taken as often as necessary to ensure that the loss of data from a critical system failure, or multiple failures, is not business-crippling. This will vary from business to business. A good place to start is a daily backup, since a day's lost productivity is not catastrophic for most companies, but that is not a one-size-fits-all solution. In a company doing real-time processing of orders or similar data, daily will be far too infrequent. For systems with few and infrequent data changes, daily backups may be a waste of resources.
- Testing: Backups must be tested on a regular basis. In an ideal world this means recovering an entire system, or multiple systems, from backup to test your recovery process. In the real world, testing is often limited to recovering a few files. If you only recover a few files, be certain that all backup types and methods are tested across all classes of data. In other words, recovering a file here and there from your file server isn't sufficient. Test recoveries for all of your systems, and make sure that databases, e-mail, application data, and any other class of data are recoverable. Keep in mind that if you are not doing complete system recovery testing, you are not as well protected as you would be if you were. That risk may be acceptable for your organization when balanced against the cost of a full system recovery test; as long as the risk has been analyzed, you have done your due diligence. In reality, with VMs being relatively quick, cheap, and easy to spin up, the days of "it's too expensive to do a full test" are probably over for all but the smallest companies. If you have a warm or hot site, test recovery at that site to make sure that all equipment is still working as expected and that backups written at the live site can be read at the backup site.
- Backups should be rotated offsite: Part of the point of backing up our data is being able to recover from a disaster such as a fire, tornado, or flood. If your backups are sitting right next to the system being backed up, you have not really gained much protection. It is shocking how often this is overlooked. "Offsite" will have different meanings depending on budget and what disasters you want to protect against; like everything, it's a cost-benefit analysis. The backups should be geographically dispersed enough that a natural disaster that hits your primary site will not also hit the location that holds your backups. Some regulated industries have regulations that define "geographically dispersed." For non-regulated industries it comes down to what a given company is comfortable with and how much money it is willing to spend. A bank vault several blocks away may be sufficient for some companies, while more risk-averse companies will probably want to use a data archiving company that hauls their backups several miles away to a hardened facility. Even these solutions can break down, as we saw with the Iron Mountain flooding during Hurricane Katrina. An online backup with multiple states separating the backups from the live data is probably the safest bet against this type of risk, but as we will explore later, online backups create their own risks that must be mitigated.
- There should be offline backups: This helps protect against risks such as computer criminals who may destroy or alter your data. If your only backups are online, you must assume that when someone manages to compromise your system they will also have control of the backups, and will change or destroy them at the same time they change or destroy the live data. It is also much easier to create multiple levels of archival backups if you have an offline backup solution in place alongside the online backup. We will discuss revisions next.
- Multiple revisions/points in time: Invariably you will eventually have a situation where someone loses a critical file that has not been accessed recently. Having multiple backup revisions gives you the ability to go back in time and recover these lost files. The trick is deciding how long to keep your rotation period. Backup media is relatively expensive, so too long a rotation wastes money, but too short a rotation may leave you unable to recover files that were accidentally deleted and not noticed for a period of time. The interval will vary somewhat with the system and data being backed up. I usually consider two weeks an absolute bare minimum, and even that is often too short.
- Archival backups: Archival backups are a little different: a backup set is simply pulled from rotation at a given interval (monthly, quarterly, yearly) so that you know exactly what things looked like at that point in time. Archives are often useful for recovering critical lost files that go unnoticed for a long time because they are only used during a certain part of the year. Archival backups are not absolutely necessary, but they are certainly a good idea.
- Multiple media sets: Multiple sets of media are critical. First, you will need at least two sets so one can be onsite while the other is offsite. Second, multiple media sets give you some protection against data loss due to bad media. Obviously you should be testing your media, but sometimes things get missed, and having multiple sets gives you multiple chances at a good recovery. Multiple media sets are not necessarily the same thing as multiple revisions, because multiple backup points can be put on the same media if the media is large enough.
- Backups must be accessible: Backups are useless if you can't get to them. Ask yourself these questions: How quickly can I get to my offsite backups? Do I have the hardware available to use my backups in the event of a catastrophic failure? Do I have to call for a return of my offsite backups to recover an accidentally deleted file? How long does that take? Are all my backups onsite during this period? You get the idea. Accessibility is a balancing act; there is no generic "right" answer, just the right answer for your organization.
- Spread your backups across media types: Ideally your backups should be spread across multiple media types. You might use backups or snapshots to disk, held for a day to a few weeks, for recovery of lost files or simple system losses, and backups to tape for disaster recovery and longer-term recovery. Using different media also protects against certain types of hardware failure. For example, a tape drive may begin to create backups with sporadic corruption that is missed in testing. A second media type that shares no hardware with the first (different controller cards, etc.) will help mitigate this.
- Document the backup and recovery process: It is important to have written documents explaining how to do a recovery under every conceivable scenario, and the recovery process should be tested in as close to a real-world situation as possible. If there are multiple administrators who may have to do a recovery, make sure they all know how, and that everyone who might have to do a recovery has practiced. Of course, in a small company there may be no backup for the administrator who does the backups. In that case the written backup and restore plan needs to be even better, so that a consultant with no deep knowledge of the environment can come in and do a recovery without having to guess.
- Replace your backup media: Any backup media has a finite number of read/write cycles before you start to experience failures. Modern tape media has a fairly long life, but tapes do wear out. My rule of thumb is that as soon as I see any hardware errors on a tape, it gets degaussed, shredded, and replaced with a new tape. That's a little excessive, but losing your data will lead to a very bad day. If you have excessive dirt, wide humidity ranges, or temperature extremes, you will see shorter than promised tape life.
- Local copies of backups: I really like to have a local copy of all data for fast recoveries that don't require recalling backups from offsite. I have mentioned this before, but it is worth repeating: if you have to call for backup media to come back from offsite to do a recovery, you are asking for longer than ideal recovery times and potentially having all your backups onsite at once. If you can manage an online backup for simple recoveries, that is ideal. I will discuss this in more detail later.
- What gets backed up: Know what gets backed up and communicate that to the end users. Do you only back up servers? Do you only back up certain file stores? Do you back up PCs? These questions will have different answers in different environments; the important thing is that the end user knows what to expect. I often tell my end users that if they have files on their PC that get corrupted or lost, there is nothing I can do, but if a file on a server gets corrupted or lost, I have the ability to recover it (within a time window) and will go to exceptional lengths to do so. Maybe in your environment even files that are backed up will only be recovered under certain circumstances, to minimize administrator overhead. Consider these issues, set your expectations, and follow through.
- Automation: Automate your backups. If you want to guarantee that backups won't happen on a regular and predictable basis, do them manually.
- Review the backup set: Your environment is probably dynamic. On a regular basis, review what is being backed up to make sure that changes in the environment have not caused new data to be missed by the backup jobs.
- Don't forget that most of the restore operations you do will be recoveries of accidentally deleted files, not recoveries of entire systems in a disaster scenario. Plan and test for the worst, but make sure that day-to-day recoveries can be done as quickly and painlessly as possible.
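To make the testing point above concrete, here is a minimal sketch of a restore spot-check. Everything in it is a demo assumption: the "live" file and scratch directory are created on the fly, and the `cp` stands in for whatever restore command your backup tool actually uses. The idea is simply to restore one file and verify its checksum against the live copy rather than trusting that the backup "looked fine."

```shell
#!/bin/sh
# Sketch of a restore spot-check. In production, point LIVE_FILE at a real
# file and replace the cp with your actual restore command (tar -x, a
# vendor CLI, etc.). All paths here are temporary stand-ins.
set -eu

LIVE_FILE=$(mktemp)                      # stand-in for a real production file
echo "critical order data" > "$LIVE_FILE"

RESTORE_DIR=$(mktemp -d)                 # scratch area for the test restore
cp "$LIVE_FILE" "$RESTORE_DIR/restored"  # <-- your real restore goes here

# Compare checksums of the live and restored copies.
live_sum=$(sha256sum "$LIVE_FILE" | awk '{print $1}')
restored_sum=$(sha256sum "$RESTORE_DIR/restored" | awk '{print $1}')

if [ "$live_sum" = "$restored_sum" ]; then
    echo "restore test PASSED"
else
    echo "restore test FAILED" >&2
    exit 1
fi
```

A script like this only exercises one file; as the testing bullet says, rotate it across all classes of data (databases, e-mail, application data) rather than checking the same file server path every time.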
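The automation bullet can be sketched as a small nightly wrapper driven by cron. The paths, the use of `tar`, and the schedule below are illustrative assumptions, not any particular product's procedure; the point is that the job runs unattended and logs its result so a missed backup is visible rather than silent.

```shell
#!/bin/sh
# Sketch of an automated nightly backup job. SRC_DIR and DEST_DIR are demo
# directories created here so the sketch is self-contained; in production
# they would be something like /srv/data and /backup/nightly.
set -eu

SRC_DIR=$(mktemp -d)                     # demo source data
echo "payroll records" > "$SRC_DIR/payroll.txt"

DEST_DIR=$(mktemp -d)                    # demo backup destination
STAMP=$(date +%Y-%m-%d)
ARCHIVE="$DEST_DIR/backup-$STAMP.tar.gz"

tar -czf "$ARCHIVE" -C "$SRC_DIR" .      # write the dated archive
echo "backup OK: $ARCHIVE"

# A crontab entry like this runs the script unattended at 01:30 every
# night and appends its output to a log for later review:
# 30 1 * * * /usr/local/bin/nightly-backup.sh >> /var/log/backup.log 2>&1
```

Dated archive names also give you the multiple revisions discussed earlier for free: yesterday's file is not overwritten by tonight's run.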
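Reviewing the backup set can be partially automated too. This sketch assumes a hypothetical flat include list containing one top-level directory name per line (your backup software's real configuration format will differ) and flags anything on disk that the list does not mention:

```shell
#!/bin/sh
# Sketch: flag top-level directories that exist on disk but are missing
# from the backup include list, so new data doesn't silently go
# unprotected. DATA_ROOT and INCLUDE_LIST are demo stand-ins.
set -eu

DATA_ROOT=$(mktemp -d)                   # really your file server's data root
mkdir "$DATA_ROOT/finance" "$DATA_ROOT/hr" "$DATA_ROOT/new-project"

INCLUDE_LIST=$(mktemp)                   # really your backup job's include file
printf 'finance\nhr\n' > "$INCLUDE_LIST"

missing=""
for dir in "$DATA_ROOT"/*/; do
    name=$(basename "$dir")
    # -x matches the whole line, so "hr" doesn't match "hr-archive"
    grep -qx "$name" "$INCLUDE_LIST" || missing="$missing $name"
done
echo "not in backup set:$missing"
```

Run from cron alongside the backups themselves, a check like this turns the "review the backup set" bullet from a calendar reminder into a report that lands in your inbox.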
All four parts of this article: one, two, three and four