fluidBlog

August 20, 2007

Website Backup Strategy

Filed under: Backups, Deployment, Drupal, Fedora, Linux — trekr @ 1:46 pm

Introduction

In previous posts, I’ve covered installation of Rails and Drupal, and also some rudimentary security. But running a successful site also requires a strategy for upgrades and backups. In this post I’m going to introduce some initial concepts for a web site backup strategy that will also touch on upgrades. I’ll be using a Drupal CMS powered site as an example, but the principles should apply to other LAMP-based web sites as well. In subsequent posts, I’ll get into how to implement the strategy. I’ll start by stating the desired goals for the backup strategy.

Goals

  • Simple; Use existing tools
  • Automatic; Set up as a cron job
  • Secure; Don’t introduce new security risks
  • Efficient; Minimize space, bandwidth, and restoration time

I find that it’s helpful to categorize the types of files that make up a typical Drupal web site.

  • Drupal source files
  • Contributed modules and themes
  • Custom versions of the above
  • Files uploaded to the site
  • The database

Source, Contributed Source, and Custom Source

Drupal source files, contributed modules and themes, and custom versions of the latter will not change often, so it makes sense to maintain them in a revision control repository.

You have lots of choices for a revision control system. I’ve decided to use git based on an excellent article on Version Control Blog. I like git because the separation of core Drupal from the contributed modules and the custom code seems more intuitive than branching in other systems.

Uploaded Files

Files uploaded to the site also will not often change once uploaded. However, they may be deleted, and there will always be new files to deal with. Uploaded files are added and deleted by users through the CMS, so they live outside the revision control system. Because of the administrative burden of adding and removing uploaded content in revision control, it doesn’t make sense to try to keep them in a repository. If you have a novel solution that solves this problem, I’d love to hear from you.

In Drupal, uploaded files are stored in a directory, /files, making it easy to back up just that directory. The /files directory is an excellent candidate for a novel rotating backup system sometimes known as snapback. Snapback is based on the work of Mike Rubel and others. The basic idea is that for files that haven’t changed over the backup horizon, hard links are maintained instead of copies. The hard links significantly reduce the space and bandwidth required for the backups.

Luckily, there is an excellent Perl implementation of the snapback strategy, called snapback2, from Perusion.
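To make the hard-link idea concrete, here is a minimal Python sketch. It is not snapback2’s code, just an illustration under my own assumptions: the function names `snapshot` and `demo`, the `snap.0`/`snap.1` directory names, and the byte-for-byte comparison are all made up for this example.

```python
import filecmp
import os
import shutil
import tempfile


def snapshot(source, target, previous=None):
    """Copy source into target, hard-linking files unchanged since previous."""
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        os.makedirs(os.path.normpath(os.path.join(target, rel)), exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            new = os.path.normpath(os.path.join(target, rel, name))
            old = os.path.normpath(os.path.join(previous, rel, name)) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, new)        # unchanged: reuse the stored copy
            else:
                shutil.copy2(src, new)   # new or changed: store a fresh copy


def demo():
    """Take two snapshots of a throwaway directory and return link counts."""
    base = tempfile.mkdtemp()
    files = os.path.join(base, "files")
    os.makedirs(files)
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(files, name), "w") as fh:
            fh.write("original " + name)
    snap0 = os.path.join(base, "snap.0")
    snap1 = os.path.join(base, "snap.1")
    snapshot(files, snap0)
    # Change one file between the two snapshots.
    with open(os.path.join(files, "b.txt"), "w") as fh:
        fh.write("b.txt was edited after the first snapshot")
    snapshot(files, snap1, previous=snap0)
    links = {n: os.stat(os.path.join(snap1, n)).st_nlink for n in ("a.txt", "b.txt")}
    shutil.rmtree(base)
    return links
```

In the second snapshot, the unchanged file shares its data with the first snapshot (link count 2), while the edited file gets its own copy. Each snapshot directory still looks like a complete copy when you browse it, which is what makes restoration simple.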

Database

The database consists of structure and data. The structure of the database may not change very frequently; however, the data in several tables probably will. So it makes sense to back up the database structure and data separately. By doing so, we can also avoid backing up data in the cache, sessions, and watchdog tables. Like uploaded files, the database will be backed up periodically in a rotating system.

MySQL is a popular choice for LAMP based systems and Drupal, so we’ll assume a MySQL database and use the mysqldump program to create the backups.
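As a rough sketch of what separate structure and data dumps could look like, the Python below only builds the mysqldump argument lists. The database name `drupal` is a placeholder, the skipped table names are the ones mentioned above (your install may have others), and I’m assuming credentials come from an option file such as ~/.my.cnf rather than the command line.

```python
import subprocess


def dump_commands(database, skip_data_tables):
    """Return (structure_cmd, data_cmd) argument lists for mysqldump."""
    # Structure only: CREATE TABLE statements for every table, no rows.
    structure = ["mysqldump", "--no-data", database]
    # Data only: INSERT statements, leaving out tables we don't want to keep.
    data = ["mysqldump", "--no-create-info"]
    data += [f"--ignore-table={database}.{table}" for table in skip_data_tables]
    data.append(database)
    return structure, data


def run_dump(command, output_path):
    """Run one mysqldump command, writing its output to output_path."""
    with open(output_path, "wb") as out:
        subprocess.run(command, stdout=out, check=True)


# Hypothetical database name; table names taken from the text above.
structure_cmd, data_cmd = dump_commands("drupal", ["cache", "sessions", "watchdog"])
```

Because the structure dump still covers every table, the skipped tables can be recreated empty on restore; only their rows are left out of the data dump.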

Unfortunately, the database backups will probably not benefit from the “snapback” technique. To see why, let’s do some back-of-the-envelope calculations.

Let g be the tar/gzip compression factor (compressed size divided by original size), p the fraction of the files that change between backups, s the total size of the files, and k the number of backups. Then for snapback to use less space than compressed archives, the following must hold:

sgk > spk + s(1-p)

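Here sgk is the space taken by k compressed archives, spk is the space taken by the changed files across k snapshots, and s(1-p) is the size of the unchanged files, of which there is only one copy due to hard links.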

Simplifying by dividing by s,

gk > pk + (1-p)

If we ignore (1-p), which is small relative to the other terms when k is large, then it is clear that for the method to have any benefit, the fraction of changed files must be less than the compression factor.

g > p

In my testing, I’ve been seeing compression factors between .15 and .20 for mysqldump files. It’s hard to imagine a database that changes by less than 15-20% between backups.
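If you want to measure g for your own dumps, something like the following Python sketch would do it. The `sample_dump` helper just fabricates repetitive INSERT statements as a stand-in for a real dump file, so the factor it yields says nothing about real data; the .15 to .20 range above comes from my own mysqldump files.

```python
import gzip


def compression_factor(data: bytes) -> float:
    """Return compressed size divided by original size (the g in the text)."""
    if not data:
        raise ValueError("cannot compute a factor for empty input")
    return len(gzip.compress(data)) / len(data)


def sample_dump(rows: int = 1000) -> bytes:
    """Build made-up INSERT statements standing in for a real dump file."""
    lines = [
        f"INSERT INTO node VALUES ({i}, 'Title {i}', 'Some body text for row {i}');"
        for i in range(rows)
    ]
    return "\n".join(lines).encode("utf-8")


g = compression_factor(sample_dump())
print(f"compression factor g = {g:.2f}")
```

To try it on a real dump, read the file in binary mode and pass its bytes to `compression_factor`.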

For a directory where the files don’t change much and g > p holds, the break-even number of backups, k, is given by

k = ceil((1-p)/(g-p))

For example, assume the fraction of change, p, is 5% and the compression factor, g, is 15%; then the break-even number of backup copies, k, is ceil(0.95/0.10) = 10. A typical backup scheme has 6 hourly, 7 daily, 4 weekly, and 12 monthly = 29 copies; call it thirty. At k = 30, compressed archives take 30 × 0.15s = 4.5s, while the snapshots take 30 × 0.05s + 0.95s = 2.45s, so in this example there is roughly a 1.8X savings over compressed archives, approaching g/p = 3X as the number of copies grows. This is why I like the snapback technique for the uploaded files.
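The arithmetic is easy to check with a few lines of Python. This is only a sketch of the formulas above; the function names are my own, and s is set to 1 so the results read as multiples of the original size.

```python
import math


def archive_space(s, g, k):
    """Space used by k compressed archives: s*g*k."""
    return s * g * k


def snapshot_space(s, p, k):
    """Space used by k hard-link snapshots: s*p*k + s*(1-p)."""
    return s * p * k + s * (1 - p)


def break_even(p, g):
    """Smallest k at which snapshots use less space, or None if g <= p."""
    if g <= p:
        return None
    return math.ceil((1 - p) / (g - p))


p, g, k = 0.05, 0.15, 30
archives = archive_space(1.0, g, k)
snapshots = snapshot_space(1.0, p, k)
print("break-even k:", break_even(p, g))
print(f"archives: {archives:.2f}s  snapshots: {snapshots:.2f}s  "
      f"ratio: {archives / snapshots:.2f}")
```

With p = 5% and g = 15%, the break-even point is 10 copies. At 30 copies the archives take about 4.5 times the original size against about 2.45 for the snapshots, a ratio of roughly 1.8 that tends toward g/p = 3 as k grows. When g is not greater than p, as I expect for database dumps, `break_even` returns None because the snapshots never win.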

Therefore, to implement the database backup, we’ll be using some simple bash scripts on the local backup server and the remote web server that execute mysqldump, archive and compress the output, and move it to the backup server securely.

Well, that wraps up the outline of our strategy. In my next series of posts, I’ll cover in detail what it takes to implement each part of our three-tiered backup strategy.

Part II Part III Part IV


Hakota Design LLC