What makes a good backup solution? (1 of 4)

What makes a good backup solution?
Part 1 of 4
Critical Elements of a Backup Plan

It seems that every month or two I see reference to a large web site going down because of a catastrophic failure where either they didn’t have backups or their backups were not sufficient. Of course it’s not just web sites, but they are the ones that make the news. In this series of articles I will look at the elements of a good backup plan. These all work together as a whole, so no single element is the most important one. Together they will hopefully create a reliable plan of action. I will include a number of different recovery scenarios in this document in the hope of helping you create a backup/recovery solution that is appropriate for your organization. Some reference will be made to disaster recovery, but only to the extent that it specifically relates to backup and recovery.

First let’s start with the critical elements of a good backup solution in no particular order.

  • Frequency: Backups should be taken as often as is necessary to assure that the loss of data from a critical system loss or multiple system loss is not business-crippling. This is going to vary from business to business. A good place to start is with a daily backup, since a day’s lost productivity is not catastrophic for most companies, but that is not a one-size-fits-all solution. For a company that is doing real-time processing of orders or similar data, that will be far too infrequent. For systems that have few and infrequent data changes, daily backups may be a waste of resources.
  • Testing: Backups must be tested on a regular basis. In an ideal world this will mean that an entire system or multiple systems will be recovered from backup to test your recovery process. In the real world testing is often limited to recovery of a few files. If you only recover a few files, be certain that all backup types and methods are tested across all classes of data. In other words, recovering a file here and there from your file server isn’t sufficient. Test recoveries for all of your systems. Also make sure that databases, e-mail, application data and any other class of data is recoverable. Keep in mind that if you are not doing complete system recovery testing you are not protected as well as you would be if you were. That risk may be acceptable for your organization when balanced against the cost of a full system recovery test. As long as the risk has been analyzed you have done your due diligence. In reality, with VMs being relatively quick, cheap and easy to spin up, the days of "it's too expensive to do a full test" are probably over for all but the smallest companies. If you have a warm or hot site the recovery should be tested at that site to make sure that all equipment is still working as expected and backups written at the live site can be read at the backup site. A rough sketch of a file-level restore spot check appears after this list.
  • Backups should be rotated offsite: Part of the point of backing up our data is being able to recover from a disaster such as fire/tornado/flood, etc. If your backups are sitting right next to the system that is being backed up then you have not really gained much protection. It is shocking how often this is overlooked. Offsite will have different meanings depending on budget and what disasters you want to protect against. Like everything, it’s a cost-benefit analysis. The backups should be geographically dispersed enough that a natural disaster that hits your primary site will not also hit the location that holds your backups. There are regulations that define geographically dispersed for some regulated industries. For non-regulated industries this will come down to what a given company is comfortable with and how much money they are willing to spend. A bank vault several blocks away may be sufficient for some companies, while more risk-averse companies will probably want to use a data archiving company that will haul their backups several miles away to a hardened facility. Even these solutions can break down, as we saw with the Iron Mountain flooding during Hurricane Katrina. An online backup that is geographically dispersed, with multiple states separating the backups from the live data, is probably the safest bet for this type of risk, but as we will explore later, online backups create their own risks that must be mitigated.
  • Offline backups: There should be offline backups. This helps to protect against risks such as computer criminals who may destroy or alter your data. If your only backups are online, you must assume that when someone manages to compromise your system they will also have control of the backups and will change or destroy them at the same time they change or destroy the live data. It is also much easier to create multiple levels of archival backups if you have an offline backup solution in place along with the online backup. We will discuss revisions next.
  • Multiple revisions/points in time: Invariably you will eventually have a situation where someone loses a critical file that has not recently been accessed. Having multiple backup revisions will help to make sure that you have the ability to go back in time and recover these lost files. The trick to this is deciding how long to keep your rotation period. Backup media is relatively expensive, so if you keep too long a backup rotation you are wasting money, but if you make the rotation too short you may not be able to recover files that were accidentally deleted but not noticed for a period of time. The time interval will vary somewhat with the system and data being backed up. I usually consider two weeks an absolute bare minimum, and this is often too short a rotation.
  • Archival backups: Archival backups are a little different. Archival backups are simply a backup set that is pulled from rotation on a given timeframe (monthly, quarterly, yearly) to assure that you know exactly what things looked like at that point in time. Archives are often useful in recovering critical lost files that are not noticed for a long period of time because they are only used during a certain part of the year. Archival backups are not absolutely necessary but they are certainly a good idea.
  • Multiple media sets: Multiple sets of media are critical. First of all, you will need at least two sets so one can be on site while the other is offsite. Secondly, multiple media sets give you some protection against data loss due to bad media. Obviously you should be testing your media, but sometimes things get missed. Having multiple sets gives you multiple chances to get a good recovery. Multiple media sets are not necessarily the same thing as multiple revisions, because multiple backup points can be put on the same media if the media is large enough.
  • Backups must be accessible: Backups are useless if you can’t get to them. Ask yourself these questions. How quickly can I get to my offsite backups? Do I have the hardware available to use my backups in the event of a catastrophic failure? Do I have to call for a return of my offsite backups to recover an accidentally deleted file? How long does that take? Do I have all my backups onsite during this period? You get the idea. Accessibility is a balancing act. There is no generic “right” answer just the right answer for your organization.
  • Spread your backups across media types: Ideally your backups should be spread across multiple media. You might use backups or snapshots to disk, held for a day to a few weeks, for recovery of lost files or of a single lost system, and backups to tape for disaster recovery and longer-term recovery. Using different media also protects against certain types of hardware failure. As an example, a tape drive may begin to create backups that have sporadic corruption that is missed in testing. A second media type that does not use any shared hardware (different controller cards, etc.) will help mitigate this.
  • Document the backup and recovery process: It is important that there are written documents explaining how to do a recovery under every conceivable scenario. The recovery process should be tested in as close to a real world situation as possible. If there are multiple administrators who may have to do the recovery make sure they all know how to do a recovery. Everyone who might have to do a recovery should practice. Of course if we are talking about a small company there may be no backup for the administrator who does backups. In that case the written backup and restore plan needs to be even better so that a consultant who has no deep knowledge can come in and do a recovery without having to make guesses.
  • Replace your backup media: Any backup media will have a finite number of read/write cycles before you start to experience failures. Modern tape media will have a fairly long life, but tapes do wear out. My rule of thumb is that as soon as I see any hardware errors on a tape it gets degaussed, shredded and replaced with a new tape. That’s a little excessive, but losing your data will lead to a very bad day. If you have excessive dirt, wide humidity or temperature ranges, etc., you will see shorter-than-promised tape life.
  • Local copies of backups: I really like to have a local copy of all data for doing fast recovery without having to recover backups from offsite. I have already mentioned this before but it is worth saying again. If you have to call for backup media to come back from offsite to do a recovery, you are asking for longer-than-ideal recovery times and potentially having all your backups onsite. If you can manage an online backup for simple recoveries, that is ideal. I will discuss this in more detail later.
  • What gets backed up: Know what gets backed up and communicate that to the end users. Do you only backup servers? Do you only back up certain file stores? Do you back up PCs? All of these are questions that will have different answers in different environments. The important thing is that the end user knows what to expect. I often tell my end users if they have files on their PC that get corrupted or lost there is nothing I can do. If a file on a server gets corrupted or lost I have the ability to recover the files (within a time window) and I will go to exceptional lengths to do so. Maybe in your environment even files that are backed up will only be recovered under certain circumstances to minimize administrator overhead. Consider these issues, set your expectations, and follow through.
  • Automation: Automate your backups. If you want to guarantee that backups won’t happen on a regular and predictable basis, do them manually. A minimal sketch of a scheduled backup script appears after this list.
  • Review the backup set: Your environment is probably dynamic. On a regular basis, review what is being backed up to make sure that changes in your environment have not caused new data to be ignored by the backup jobs.
  • Plan for the common case: Don’t forget that most of the restore operations you do will be recovery of accidentally deleted files, not recovery of entire systems in a disaster scenario. Plan and test for the worst, but make sure that the day-to-day recoveries can be achieved as quickly and painlessly as possible.
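
To make the restore testing mentioned in the list a little more concrete, here is a minimal Python sketch that spot checks a test restore by hashing a random sample of files and comparing them to the restored copies. The paths and sample size are placeholders for illustration and are not tied to any particular backup product; it assumes you have already restored the backup into a scratch directory with whatever tool you use.

```python
#!/usr/bin/env python3
"""Spot check a test restore by comparing file hashes.

Assumes a backup has already been restored into RESTORE_ROOT with your
backup tool of choice; all paths below are illustrative placeholders.
"""
import hashlib
import random
from pathlib import Path

LIVE_ROOT = Path("/srv/fileserver")   # live data (placeholder)
RESTORE_ROOT = Path("/restore/test")  # where the test restore landed (placeholder)
SAMPLE_SIZE = 50                      # number of files to spot check

def sha256(path: Path) -> str:
    """Hash a file in chunks so large files do not exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

live_files = [p for p in LIVE_ROOT.rglob("*") if p.is_file()]
sample = random.sample(live_files, min(SAMPLE_SIZE, len(live_files)))
problems = 0
for live in sample:
    restored = RESTORE_ROOT / live.relative_to(LIVE_ROOT)
    if not restored.is_file():
        print(f"MISSING:  {restored}")
        problems += 1
    elif sha256(live) != sha256(restored):
        print(f"MISMATCH: {restored}")
        problems += 1

print(f"{problems} problem(s) found in a sample of {len(sample)} files")
```

Files that legitimately changed after the backup was written will show up as mismatches; comparing against a checksum manifest captured at backup time avoids that, but the principle is the same, and none of this replaces the full system recovery tests discussed above.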
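
As an illustration of the automation point, here is a minimal sketch of a backup job that could be run unattended from a scheduler such as cron: it writes a date-stamped archive and prunes anything older than a retention window. The paths and the 14-day window are assumptions for the example, not recommendations.

```python
#!/usr/bin/env python3
"""Minimal scheduled backup sketch: date-stamped archive plus simple retention.

Meant to be run unattended (for example from cron); the paths and the
retention window are illustrative placeholders only.
"""
import datetime
import tarfile
import time
from pathlib import Path

SOURCE = Path("/srv/fileserver")   # what gets backed up (placeholder)
DEST = Path("/backups/daily")      # local backup target (placeholder)
KEEP_DAYS = 14                     # rotation period; adjust to your environment

DEST.mkdir(parents=True, exist_ok=True)
stamp = datetime.date.today().isoformat()
archive = DEST / f"fileserver-{stamp}.tar.gz"

# Write the date-stamped archive.
with tarfile.open(archive, "w:gz") as tar:
    tar.add(str(SOURCE), arcname=SOURCE.name)

# Prune archives older than the retention window.
cutoff = time.time() - KEEP_DAYS * 86400
for old in DEST.glob("fileserver-*.tar.gz"):
    if old.stat().st_mtime < cutoff:
        old.unlink()
```

A real job should also log its result and alert someone on failure; automating the backup does not remove the need to verify that it actually ran, which ties back to the testing and review points above.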

All four parts of this article: one, two, three and four

Comments

Question about retention

In common practice backup media are re-used, overwriting the backup data.
In this way, most of the time, the daily and weekly backups are lost.
What does this mean for the usefulness of retaining monthly, quarterly or yearly backups?
There is no guarantee that your lost files are on these backups, so why should I do this?

RE: Question about retention

You misunderstand. When speaking of archiving, the idea is that the backups (monthly, quarterly, yearly, indefinite, whatever) come out of, or are never in, the rotation, so they are not overwritten. This is why they are archival.

Some companies use a separate set of media for archival backups on their own rotation. For example, if you needed to keep quarterly backups for a year, you could have four sets of tapes that you make quarterly archival backups on and then rotate through those.

Another option I have seen is pulling and replacing the current backup set and making it the "archival set". For example, if you need to keep quarterly backups, every quarter you would pull whatever tape set ran the previous night, mark that as archival, and replace it with new tapes. The major disadvantage is that instead of archiving tapes that are rarely written to (once a year in my previous example) and therefore have little wear on them, you are archiving tapes that may already have significant wear and tear that could cause recovery issues. If you have a good tape rotation policy this is not an issue, but if you run tapes until they fail this is not an optimal archival policy. Then again, running tapes until they fail is also suboptimal. Obviously, when using this method, the backup that is kept must be a full backup.

This is just a point-in-time backup. For this to be relatively failsafe you must have your archival period set the same as or lower than your standard rotation period. Otherwise there is a gap in which files can be lost because they did not exist at the time of the last archive but have already been overwritten on the standard rotation. So if you have a four-week rotation, you will also need an archive every four weeks for maximum recoverability. Having said that, this gets expensive, and something is better than nothing even if there are gaps. Ultimately the archive policy comes down to how long it will take for someone to realize that a file went missing or was corrupted. This question must be balanced against the cost of tapes to provide a full backup set. (I’ll add this paragraph to the article. I think it needs to be explicit.)

Not everyone needs an archival process. I have worked places where if you needed something recovered that was older than the standard rotation time you were just out of luck. Actually I think this is probably the standard. Archiving just adds an additional level of recoverability.
I didn’t use the word retention in this portion of the article so I hope I am guessing correctly at what you meant.

Daniel

Question about retention

I did not say that archival backups are overwritten, but the daily and weekly tapes are normally re-used: the day tape, say, next week and the week tape next month or so. Let me explain my question with an example:
Suppose I create a file on Tuesday in week 1, modify it several times in the following weeks of that month, and just before the end of that month the file is accidentally deleted or corrupted. I think this file is not on the monthly backup.
After 6 months I discover that the file is missing and try to find the file in the archival backup, but can't find it.
So I think there is no guarantee that I can restore my file although I have invested in archiving monthly, quarterly and yearly backups. It is just luck being able to restore, so when this is the case why would I do this when it is only luck?
Or is there a way to prevent this (I cannot save every daily backup for a year)?

RE: Question about retention

Backups, whether archival or day-to-day, are point-in-time snapshots. You are correct that in the case you describe there is a hole for a file that is created and deleted within a short period of its creation. Much the same as if I do daily backups and create and delete a file on the same day: it will never be backed up. That does not make daily backups useless. This archive hole will be as big as your archival interval. So having said that, it really comes down to whether you prefer to make a best effort at being able to restore files that were deleted a period of time larger than your normal tape rotation, or say it is not perfect so you prefer to do nothing. That’s really a personal preference. As I said in the main article, archival backups are not mandatory, just an extra layer of protection. If you go back to your basic IT principles, you lay down layers of protection. No single layer may be perfect, but with a layered approach you eliminate or mitigate the issues that you can eliminate or mitigate.

If you really want a system that will keep a copy of every file that has ever been created, you would need a solution that copies files from disk to disk and keeps a version every time a file updates. Then your exposure is only as large as your copy period (daily, hourly, etc.). You would still need to do traditional backups of some sort to protect against hardware failure or major data center failures (fire, flood, earthquake). Potentially this copied data then needs to be backed up to protect it from loss. Whether you decide to do that ultimately comes down to how failure-proof you want your solution to be. In the event of a catastrophic loss of the disk-to-disk data (the server backed up to), is it acceptable to just start over and lose the recovery points, or is it worth the cost of protecting that data set as well? The nice part is that even if you do need that additional protection, you can do it with just straight backups, because nothing ever gets deleted and changes create a new file. If you do backups to a geographically dispersed Internet location, you could replicate between two hosting facilities and have a pretty robust solution. Using something like Amazon’s S3 can take care of the geographic replication for you. I have used “Super Flexible File Synchronizer” for this in the past. This is not a cheap option but it can be done inexpensively relative to the level of protection you are buying. Ultimately it comes down to how important your data is. I will explain this solution in more detail in a separate article.
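
To make the disk-to-disk versioning idea a little more concrete, here is a rough Python sketch of the concept: each run copies new or changed files into a timestamped snapshot directory and never deletes anything, so every version that existed at a copy interval is retained. The paths are placeholders, and this is only an illustration of the approach, not the specific tools (Super Flexible File Synchronizer, S3 replication) mentioned above.

```python
#!/usr/bin/env python3
"""Rough sketch of disk-to-disk versioning: copy new or changed files into
timestamped snapshot directories and never delete old versions.

All paths are illustrative placeholders.
"""
import datetime
import filecmp
import shutil
from pathlib import Path

SOURCE = Path("/srv/fileserver")      # live data (placeholder)
VERSIONS = Path("/backups/versions")  # version store (placeholder)

stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
snapshot = VERSIONS / stamp           # this run's changed files land here
latest = VERSIONS / "latest"          # rolling mirror of the most recent copy

for src in SOURCE.rglob("*"):
    if not src.is_file():
        continue
    rel = src.relative_to(SOURCE)
    mirror = latest / rel
    # Copy only files that are new or have changed since the last run.
    if not mirror.exists() or not filecmp.cmp(src, mirror, shallow=False):
        dest = snapshot / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        mirror.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, mirror)     # update the mirror for the next run
```

As noted above, the version store itself still needs to be backed up or replicated to a second location if losing those recovery points is not acceptable.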