fluidBlog

August 22, 2007

Website Backup Strategies, Part II

Filed under: Backups, Deployment, Drupal, Fedora, Linux — trekr @ 2:33 pm

The second part of this series will take a look at revision control strategies for the core Drupal source.

CVS

Drupal is maintained in cvs. So cvs seems like a natural option. Pro Drupal Development covers this quite well in Chapter 21 “Development Best Practices”. There is a very short paragraph in the chapter about mixing SVN and CVS to deal with custom source.

Assuming cvs is installed, using CVS is as easy as

$ cd /var/www/html
$ /usr/bin/cvs \
> -d:pserver:anonymous:anonymous@cvs.drupal.org:/cvs/drupal checkout \
> -r DRUPAL-5 drupal

Which will create /var/www/html/drupal

And updates are as simple as

$ cd /var/www/html/drupal
$ /usr/bin/cvs update -dP -r DRUPAL-5

But if you customize anything within the core source, then you’ll start to run into some difficulty. For starters, you will need to store the custom code in a different repository. Remember the short paragraph in Pro Drupal Development about mixing CVS and SVN? If custom source is not stored in another repository, the custom changes will not survive the update.

I’m pretty sure that I’ll be customizing something, headers, logos, maybe some javascript to get transparent images to work properly in IE. So, for my purposes, I’m going to look for another approach.

SVN

Using an svn repository solves the problem of keeping core source and custom source in a single repository at the cost of a little additional work. In the case of Drupal, the source is not in SVN so we cannot simply check out the source from the development team repository. If you are new to svn, a good resource is Mike Mason’s excellent book Pragmatic Version Control: Using Subversion (2nd Edition). Another good resource is Version Control with Subversion.

The following workflow assumes everything is being done on a local machine, and the user has appropriate permissions.

Create a repository

$ mkdir /var/svn/repos
$ svnadmin create /var/svn/repos

Get the source

$ cd /usr/local/src
$ wget http://ftp.drupal.org/files/projects/drupal-5.1.tar.gz

Uncompress the archive

$ tar xvf drupal-5.1.tar.gz

Import the source into the repository

$ svn import --no-auto-props -m"initial import of drupal 5.1" \
> drupal-5.1 file:///var/svn/repos/vendorsrc/drupal/current

Create a tag

$ svn copy -m"Tag 5.1 vendor drop" \
> file:///var/svn/repos/vendorsrc/drupal/current \
> file:///var/svn/repos/vendorsrc/drupal/5.1

Create project

$ svn mkdir -m"Create project myproject" file:///var/svn/repos/myproject

Copy into main development branch

$ svn copy file:///var/svn/repos/vendorsrc/drupal/5.1 \
> file:///var/svn/repos/myproject/trunk

Check out the project to create a working directory

$ svn co file:///var/svn/repos/myproject/trunk \
> /var/www/html/myproject

Ignore files/ directory

$ cd /var/www/html/myproject
$ svn mkdir files files/images files/images/temp files/css files/color

$ svn propset svn:ignore "*" files/ files/images \
> files/images/temp files/css files/color

$ svn commit -m "ignore files/ and subdirectories content from now"

If you don’t set ignore on the files/ directory and its subdirectories then svn status will get pretty busy. You need to set the ignore property on files/ and each subdirectory in files/, for example, files/css, files/images, files/images/temp, etc.

Any custom changes we make can go into the same project repository.

When it’s time to upgrade Drupal to a new version, the basic update workflow is as follows:

  • checkout the current version of the core source in a working directory

  • make the working directory look like a pristine copy of the new version by copying, and svn adding or deleting as required. To make your life easier, use svn_load_dirs.pl

  • commit the new version and add a tag

  • merge the old and new versions in your project’s working directory. Make sure you merge between tags: the branch moves, but a tag is fixed

  • resolve any conflicts and commit the changes to your project

More details can be found at Vendor Branches
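As a rough sketch, the upgrade steps above might look like this for a move from 5.1 to 5.2. The repository layout is the one created earlier in this post; the function name, the DRY_RUN switch, and the 5.2 paths are my own assumptions for illustration. By default the script only prints the commands so you can read them before running anything.

```shell
#!/bin/sh
# Sketch of a vendor-drop upgrade. Prints the commands by default;
# set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}
REPO=file:///var/svn/repos
VENDOR=$REPO/vendorsrc/drupal

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

# upgrade_vendor OLD NEW SRC_DIR WORKING_COPY
upgrade_vendor() {
    old=$1; new=$2; src=$3; wc=$4
    # Load the new release over vendorsrc/drupal/current and tag it.
    # svn_load_dirs.pl works out the adds and deletes.
    run svn_load_dirs.pl -t "$new" "$VENDOR" current "$src"
    # Merge the difference between the two fixed tags into the project.
    run svn merge "$VENDOR/$old" "$VENDOR/$new" "$wc"
    # Conflicts, if any, show up with a C in the first column.
    run svn status "$wc"
}

upgrade_vendor 5.1 5.2 /usr/local/src/drupal-5.2 /var/www/html/myproject
```

The commit is deliberately left out: resolve any conflicts by hand first, then commit the project working copy yourself.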

Finally, you may want to consider using svnsync to mirror your repository.

GIT

In part one, I mentioned the excellent article on Version Control Blog. This is my preferred process for source control of Drupal powered sites. The article is a very comprehensive how-to, so I don’t need to duplicate it here.

The basic workflow involves creating lines of development for

  • drupal - contains the core source distribution
  • drupal-and-modules is a clone of drupal plus contributed modules and themes
  • drupal-production is a clone of drupal-and-modules with project customizations

The lines of development are chained together with cloning and changes propagate from drupal/ through drupal-production/.

I will point out one area of concern with the approach as detailed in the article. Exploding the latest version of Drupal in the drupal/ line of development does not deal with files or directories that may have been deleted in the latest version.
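One way around this, as a sketch only and not the method from the article: remove every tracked file before unpacking the new release, then let git stage additions, changes and deletions in one go. The function name load_release is hypothetical, the tarball path must be absolute because the function changes directory, and the repository is assumed to hold the Drupal tree at its top level.

```shell
#!/bin/sh
# load_release TARBALL REPO: make REPO's tracked tree match TARBALL exactly.
load_release() {
    tarball=$1; repo=$2
    (
        cd "$repo" || exit 1
        # Remove every tracked file so leftovers from the old release go away.
        git ls-files -z | xargs -0 rm -f --
        # Unpack the new release, dropping its top-level drupal-x.y/ directory.
        tar --strip-components=1 -xzf "$tarball"
        # Stage additions, modifications and deletions together.
        git add -A
        git commit -q -m "Import $(basename "$tarball" .tar.gz)"
    )
}
```

For example, `load_release /usr/local/src/drupal-5.2.tar.gz /path/to/drupal` would leave the drupal/ line of development looking exactly like 5.2, with any files dropped upstream recorded as deletions.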

For more info on git, you should start with the documentation.

So there you have it, three approaches to keeping the Drupal source under revision control. I’ve stated my preference but everyone’s situation is different and you can choose.

In the next part, I’ll detail how to backup the files that are not under revision control, for example, the files/ directory. The final segment will address backing up the MySQL database.

Part I Part III Part IV

August 20, 2007

Website Backup Strategy

Filed under: Backups, Deployment, Drupal, Fedora, Linux — trekr @ 1:46 pm

Introduction

In previous posts, I’ve covered installation of Rails and Drupal, and also some rudimentary security. But running a successful site requires a strategy for upgrades and backups. In this post I’m going to introduce some initial concepts for a web site backup strategy that will also touch on upgrades. I’ll be using a Drupal CMS powered site as an example. The principles should apply for other LAMP based web sites as well. In subsequent posts, I’ll be getting into the how-to, to implement the strategy. I’ll start by stating the desired goals for the backup strategy.

Goals

  • Simple; Use existing tools
  • Automatic; Set up as a cron job
  • Secure; Don’t introduce new security risks
  • Efficient; Minimize space, bandwidth, and restoration time

I find that it’s helpful to categorize the types of files that make up a typical Drupal web site.

  • Drupal source files
  • Contributed modules and themes
  • Custom versions of the above
  • Files uploaded to the site
  • The database

Source, Contributed source, and custom source

Drupal source files, contributed modules and themes, and custom versions of the latter will not change often and it makes sense to maintain them in a revision control repository.

You have lots of choices for a revision control system. I’ve decided to use git based on an excellent article on Version Control Blog. I like git because the separation of core Drupal from the contributed modules and the custom code seems more intuitive than branching in other systems.

Uploaded Files

Files uploaded to the site also will not change often once uploaded. However, they may be deleted and there will always be new files to deal with. Uploaded files will be added and deleted by users using the CMS, so they will be outside of the revision control system. Because of the administrative burden of adding and deleting uploaded content to revision control, it doesn’t make sense to try to keep them in a repository. If you have a novel solution that solves this problem I’d love to hear from you.

In Drupal, uploaded files are stored in a directory /files making it easy to back up just that directory. The /files directory is an excellent candidate for a novel rotating backup system sometimes known as snapback. Snapback is based on the work of Mike Rubel and others. The basic idea is that for files that haven’t changed over the backup horizon, hard links instead of copies are maintained. The hard links significantly reduce the space and bandwidth required for the backups.

Luckily, there is an excellent Perl implementation of the snapback strategy, called snapback2, from Perusion.
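To make the hard-link idea concrete, here is a minimal sketch using rsync’s --link-dest option. This is not how snapback2 itself is implemented; the function name and the snapshot.N directory names are made up for illustration, and KEEP is assumed to be at least 2.

```shell
#!/bin/sh
# snapshot SRC DEST KEEP: keep KEEP rotating snapshots of SRC under DEST.
# Files unchanged since the previous snapshot are stored as hard links.
snapshot() {
    src=$1; dest=$2; keep=$3
    mkdir -p "$dest"
    # --link-dest needs an absolute path to behave predictably.
    dest=$(cd "$dest" && pwd)
    # Drop the oldest snapshot, then shift the rest up by one.
    rm -rf "$dest/snapshot.$((keep - 1))"
    i=$((keep - 1))
    while [ "$i" -gt 0 ]; do
        prev=$((i - 1))
        if [ -d "$dest/snapshot.$prev" ]; then
            mv "$dest/snapshot.$prev" "$dest/snapshot.$i"
        fi
        i=$prev
    done
    if [ -d "$dest/snapshot.1" ]; then
        # Copy changed files; link unchanged ones to snapshot.1.
        rsync -a --delete --link-dest="$dest/snapshot.1" "$src/" "$dest/snapshot.0/"
    else
        # First run: nothing to link against yet.
        rsync -a "$src/" "$dest/snapshot.0/"
    fi
}
```

Something like `snapshot /var/www/html/mysite/files /var/backups/mysite 7` from a daily cron job would keep a week of snapshots, with each unchanged file occupying disk space only once.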

Database

The database consists of structure and data. The structure of the database may not change very frequently; however, the data in several tables probably will. So it makes sense to back up the database structure and data separately. By doing so, we can also avoid backing up data in cache, sessions, and watchdog tables. Like uploaded files, the database will be backed up periodically in a rotating system.

MySQL is a popular choice for LAMP based systems and Drupal, so we’ll assume a MySQL database and use the mysqldump program to create the backups.
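As a first sketch of what the split dump could look like: one mysqldump run for the structure and one for the data, skipping the rows of tables that Drupal can rebuild. The function name, output file names and DRY_RUN switch are assumptions of mine, the table names assume no table prefix, and credentials are assumed to come from ~/.my.cnf. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: dump structure and data separately.
# Prints the commands by default; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

# dump_db DATABASE OUTPUT_DIR
dump_db() {
    db=$1; out=$2
    # Structure of every table, no rows.
    run mysqldump --no-data --result-file="$out/$db-structure.sql" "$db"
    # Build one --ignore-table option per disposable table.
    set --
    for t in cache cache_filter cache_menu cache_page sessions watchdog; do
        set -- "$@" "--ignore-table=$db.$t"
    done
    # Rows only, leaving out the disposable tables.
    run mysqldump --no-create-info "$@" --result-file="$out/$db-data.sql" "$db"
}

dump_db mysite /var/backups
```

Because the structure dump still covers every table, restoring structure first and data second brings back empty cache, sessions and watchdog tables.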

Unfortunately the database backups will probably not benefit from the “snapback” technique. To see why, let’s do some back-of-the-envelope calculations.

If g is the tar gzip compression factor, and p is the percentage of the change in the backup files, s is the total size of the files, k is the number of backups, then for snapback to use less space than compressed archives, the following must hold

sgk > spk + s(1-p)

Where s(1-p) is the size of the unchanged files of which there is only one copy due to hard links

Simplifying by dividing by s,

gk > pk + (1-p)

If we ignore (1-p), which becomes negligible next to the other terms when k is large, then it is clear that for the method to have any benefit, the fraction of changed files must be less than the compression factor.

g > p

In my testing, I’ve been seeing compression factors between .15 and .20 for mysqldump files. It’s hard to imagine a database that changes less than 15-20%.

On a directory where the files don’t change much, and g > p holds, then the break even number of backups, k is given by

k = ceil((1-p)/(g-p))

For example, assume the percent of change, p, is 5% and the compression factor, g, is 15%; then the break-even number of backup copies, k, is 10. A typical backup scheme has 6 hourly, 7 daily, 4 weekly, and 12 monthly = 29 copies, call it thirty. At k = 30 the compressed archives take 0.15 * 30 * s = 4.5s while snapback takes 0.05 * 30 * s + 0.95s = 2.45s, so in this example snapback uses a little over half the space of compressed archives. This is why I like the snapback technique for the uploaded files.
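If you want to try other values, the break-even formula is easy to check from the shell. This is a small sketch; the function name is mine, and p and g are passed as percentages so the arithmetic stays exact for whole-number inputs.

```shell
#!/bin/sh
# break_even P G: smallest whole number of backups k with
# k >= (1-p)/(g-p), where P and G are percentages (e.g. 5 and 15).
break_even() {
    awk -v p="$1" -v g="$2" 'BEGIN {
        # Snapback never catches up unless g > p.
        if (g <= p) { print "never"; exit }
        k = (100 - p) / (g - p)
        # ceil(): round up unless k is already a whole number.
        c = int(k)
        if (c < k) c++
        print c
    }'
}

break_even 5 15    # prints 10
```

With p at or above g, as I expect for mysqldump output, it prints "never".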

Therefore, to implement the database backup, we’ll be using some simple bash scripts on the local backup server and the remote web server that execute mysqldump, archive and compress the output, and move it to the remote server securely.

Well, that wraps up the outline of our strategy. In my next series of posts, I’ll cover in detail what it takes to implement each part of our three-tiered backup strategy.

Part II Part III Part IV

August 8, 2007

Installing Drupal on Fedora 6

Filed under: Deployment, Drupal, Fedora, Installation, Linux — trekr @ 11:55 am

In this post I’m going to walk you through the steps to create a Drupal based web site from the ground up starting with a newly minted Fedora 6 slice from Slicehost.

Secure the Slice

The first steps are exactly the same as my previous post, Securing a new Fedora 6 Slice. If you followed the instructions in that post you have accomplished the following

  • changed root’s password
  • yum updated the base installation
  • yum installed sudo
  • created a user with sudo privileges
  • created a public key on your local machine and copied it to your slice
  • disabled password authentication via ssh
  • disabled challenge response authentication via ssh
  • changed the ssh port
  • disabled root login via ssh
  • installed denyhosts (optional)
  • installed and configured a firewall using iptables

Install Required Software

Now that the slice is more secure, we can install the software required by Drupal.

  • Apache Web Server, httpd
  • MySQL Server
  • PHP
  • GD
  • Sendmail

Log in to the slice and yum install the following packages

$ /usr/bin/sudo /usr/bin/yum -y install \
> wget \
> tar \
> gzip \
> make \
> gcc \
> openssh-clients \
> mysql \
> mysql-server \
> php \
> php-mysql \
> php-devel \
> php-gd \
> gd \
> gd-devel \
> httpd  \
> sendmail \
> sendmail-mc \
> sendmail-cf

Start the mysqld server

$ /usr/bin/sudo /etc/init.d/mysqld restart

Ensure MySQL starts at boot

$ /usr/bin/sudo /sbin/chkconfig --add mysqld
$ /usr/bin/sudo /sbin/chkconfig --level 345 mysqld on

Secure Initial MySQL Accounts

see Securing the initial MySQL accounts

$  /bin/su
# /usr/bin/mysql -u root
mysql> SET PASSWORD FOR 'root'@'localhost' = PASSWORD('newpwd');
mysql> exit
# exit
$

Create a Drupal system user

$ /usr/bin/sudo /usr/sbin/useradd -r drupal

Create MySQL user account

see CREATE USER Syntax

$ /bin/su
# mysql -p -u root
mysql> create user 'drupal'@'localhost';

Create the database for your Drupal site

mysql> create database mysite;
mysql> grant all on mysite.* to 'drupal'@'localhost';
mysql> exit;
# exit;
$

Download the Drupal software

see Download

Change directories to /usr/local/src

$ cd /usr/local/src

$ /usr/bin/sudo /usr/bin/wget \
> http://ftp.drupal.org/files/projects/drupal-5.2.tar.gz

$ /usr/bin/sudo /bin/tar xvf drupal-5.2.tar.gz

$ /usr/bin/sudo /bin/cp -r drupal-5.2 /var/www/html/mysite

Make settings.php writeable by the web server user

$ cd /var/www/html/mysite

$ /usr/bin/sudo /bin/chown root.apache \
> /var/www/html/mysite/sites/default/settings.php

$ /usr/bin/sudo /bin/chmod g+w \
> /var/www/html/mysite/sites/default/settings.php

Make the files/ directory and its subdirectories writable by the web server user

$ /usr/bin/sudo /bin/mkdir files files/color files/css \
> files/images files/images/temp

$ /usr/bin/sudo /bin/chown root.apache files files/color files/css \
> files/images files/images/temp

$ /usr/bin/sudo /bin/chmod g+w files files/color files/css \
> files/images files/images/temp

Set up cron

Add a shell script to /etc/cron.hourly

#!/bin/sh
# $Id: cron-curl.sh,v 1.3 2006/08/22 07:38:24 dries Exp $
curl  --silent --compressed http://mysite.com/cron.php

Get optional modules and themes

Setup Apache Web Server

edit /etc/httpd/conf/httpd.conf

#ServerName :80
ServerName www.mysite.com:80
#Listen 12.34.56.78:80
Listen your.slice.ip.addr:80

Add

<Files *.inc>
    Deny From All
</Files>
<Files *.class>
    Deny From All
</Files>
<Files MANIFEST>
    Deny From All
</Files>

See this article for tuning Apache for performance

Create a virtual host

Edit /etc/httpd/conf/httpd.conf

<VirtualHost hostname:80>
ServerAdmin webmaster@mysite.com
DocumentRoot /var/www/html/mysite
ServerName www.mysite.com

Options -Indexes +FollowSymLinks
ErrorLog logs/mysite-error_log
CustomLog logs/mysite-access_log combined
DirectoryIndex index.html index.html.var index.php

<Directory "/var/www/html/mysite">
  AllowOverride all
</Directory>
</VirtualHost>

Start the Web Server

$ /usr/sbin/apachectl configtest
$ /sbin/chkconfig --add httpd
$ /sbin/chkconfig --level 345 httpd on
$ /usr/sbin/apachectl start

Configure PHP

See Description of core php.ini directives

You may need to adjust upload_max_filesize and post_max_size. There is a good security article on the Gallery2 site that is worth reading.

Install Drupal

Navigate to http://mysite.com/install.php and follow along with the online install. When finished, change permission on /var/www/html/mysite/sites/default/settings.php

$ /usr/bin/sudo /bin/chmod g-w /var/www/html/mysite/sites/default/settings.php

Setup Sendmail

Configure Linux Mail Servers is a comprehensive article, I’ll just hit the highlights.

Configure DNS correctly

Add the following records

  • mail pointing to your slice’s IP,
  • MX pointing to mail
  • TXT pointing to v=spf1 a mx -all

Configure /etc/resolv.conf

Add the following line above the line nameserver

    domain mysite.com

Configure /etc/hosts

127.0.0.1       mysite.com localhost.localdomain localhost

Configure /etc/mail/sendmail.mc

Make sure sendmail is listening on all interfaces (0.0.0.0)

$ /bin/netstat -an | grep :25 | grep tcp
tcp 0 0 0.0.0.0:25 0.0.0.0:* LISTEN

Comment out DAEMON_OPTIONS in /etc/mail/sendmail.mc if it is only listening on loopback

dnl DAEMON_OPTIONS(`Port=smtp,Addr=127.0.0.1, Name=MTA')dnl

Make sure these lines are commented out to avoid having your server used to forward spam

dnl FEATURE(`accept_unresolvable_domains')dnl
dnl FEATURE(`relay_based_on_MX')dnl

Configure /etc/mail/access

Add your domain

# by default we allow relaying from localhost...
Connect:localhost.localdomain           RELAY
Connect:localhost                       RELAY
Connect:127.0.0.1                       RELAY
Connect:mysite.com                      RELAY

Configure /etc/mail/local-host-names

Add all aliases for your server

    mysite.com

Configure /etc/mail/virtusertable

Add email address/user pairs

    root@mysite.com myuser
    webmaster@mysite.com myuser
    postmaster@mysite.com myuser
    info@mysite.com myuser
    abuse@mysite.com myuser
    apache@mysite.com myuser

Configure /etc/aliases

Edit user that receives root’s email

    # Person who should get root's mail
    #root:           marc
    root:            webmaster@mysite.com

Update /etc/sysconfig/iptables by adding

#Allow mail
-A INPUT -p tcp --dport 25 -j ACCEPT
-A OUTPUT -p tcp --dport 25 -j ACCEPT
-A INPUT -p tcp --dport 110 -j ACCEPT
-A OUTPUT -p tcp --dport 110 -j ACCEPT

Optionally Configure spam tools

see Configure Linux Mail Servers

Optionally set up POP3

If you want to read your mail using a client on your PC, you need to set up POP3.

$ /usr/bin/sudo /usr/bin/yum -y install dovecot
$ /usr/bin/sudo /sbin/chkconfig --add dovecot
$ /usr/bin/sudo /sbin/chkconfig --level 345 dovecot on
$ /usr/bin/sudo /etc/init.d/dovecot start

Edit /etc/dovecot.conf

    #protocols = imap imaps pop3 pop3s
    protocols = pop3

Configure your client to receive mail from mail.mysite.com

I don’t recommend that you try to configure your slice to relay mail from your PC’s client software. Just use your ISP’s SMTP server to send mail. But if you insist, read this guide first.

Optionally install a mail client

Pine is a simple lightweight mail reader that you can use to read mail from a terminal session when you are logged on to your slice.

$ /usr/bin/sudo /bin/rpm -ivh http://rpm.livna.org/livna-release-6.rpm

ensure enabled=1 is changed to enabled=0 in the following files

    /etc/yum.repos.d/livna.repo
    /etc/yum.repos.d/livna-devel.repo
    /etc/yum.repos.d/livna-testing.repo

This will disable the livna repository for regular yum updates

Then, you can install Pine with:

$ /usr/bin/sudo /usr/bin/yum --enablerepo=livna install pine

July 25, 2007

The Hiring Process, Bayes Theorem and American Idol

Filed under: Business, Hiring, Management — trekr @ 8:00 am

It seems there is no shortage of blog posts about how to hire great workers. Most focus on the interview technique. If you are looking for a job, you should read as many of these blog posts as you can.

Here are two of the best ones …

In this post, I suggest that if you are trying to hire great workers, the process may be more critical to your success than your interviewing technique.

Your success in hiring great workers depends on how many great workers you have the opportunity to interview and your ability to interview.

Suppose you can identify the top 10% of any group 90% of the time. You get it wrong only 10% of the time. After you interview ten candidates, you’ve narrowed the field to one from the top 10% and one from the bottom 90%. If you pick one now, you’re looking at a 50% chance that you pick correctly. You don’t like symmetry? Ok, you only get it wrong 5% of the time. Your odds of picking correctly have only improved to 2/3. Clearly, it’s best not to choose after a single evaluation.

What can be done to improve the odds of choosing correctly? You can either increase the number of great hires within the population of candidates you interview, or use a process of sequential multiple evaluations. This post will focus on the latter approach, improving the process. Unfortunately, everything we do to make our company attractive to great workers will make it attractive to everyone.

One way to improve your results is to make a serial sequence of decisions that narrows the field such that the number of qualified candidates remaining after each decision increases as a percentage of the total remaining candidates. In other words, the process has the effect of increasing the probability that any of the remaining candidates meets the criteria. A single elimination process is the easiest to illustrate.

In this post, I won’t address the merits of any particular evaluation technique. Pick any you like that is more effective than a coin toss. That’s not a flippant comment; you have to do better than 50/50 for the following process to work. Lou Adler has some interesting things to say about how to improve your interview techniques.

What does this look like in practice? It looks like American Idol. The candidate pool is evaluated and only candidates that pass the first evaluation continue. In successive rounds the evaluation focuses on different criteria that are increasingly more challenging and more relevant.

To illustrate, I’ll use another numerical example. Suppose the goal is to hire someone who is in the top 20% of qualified candidates. For the sake of argument assume all interviewers can pick winners 80% of the time and they mistakenly pick unqualified candidates only 20% of the time.

Because I like easy numbers, suppose 100 candidates are to be evaluated, and we need to pick one from the top 20.

After the first interview, 16 from the top 20% make it to the next round and 16 from the bottom 80% go forward as well. Sixty-eight are sent home. If a choice is made after the first interview, again, we only have a 50/50 chance of getting it right. After the second interview, 13 from the top 20% survive and 3 from the bottom 80% survive. One more round, and we’re done with over 94% probability that we pick a top 20% candidate. Pick any numbers you’d like, apply Bayes theorem, the results will point to the same process. In each iteration we are increasing the prior probability that a surviving candidate meets our criteria.
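If you’d like to plug in your own numbers, the expected-value arithmetic above fits in a few lines of shell. This is only a sketch; the function name and argument order are mine, and it tracks expected counts rather than simulating individual candidates.

```shell
#!/bin/sh
# rounds GOOD BAD HIT FALSE_POS N
# GOOD/BAD: qualified and unqualified candidates at the start
# HIT: chance a qualified candidate passes a round
# FALSE_POS: chance an unqualified candidate passes a round
# N: number of elimination rounds
rounds() {
    awk -v good="$1" -v bad="$2" -v hit="$3" -v fp="$4" -v n="$5" 'BEGIN {
        for (i = 1; i <= n; i++) {
            good *= hit    # qualified candidates who pass
            bad  *= fp     # unqualified candidates who pass anyway
            printf "round %d: %.2f qualified, %.2f unqualified, P(qualified) = %.3f\n", i, good, bad, good / (good + bad)
        }
    }'
}

rounds 20 80 0.8 0.2 3
```

The third line of output shows P(qualified) = 0.941, the "over 94%" figure above. The earlier single-interview example works the same way: `rounds 1 9 0.95 0.05 1` gives 0.679, or roughly 2/3.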

Most interviewing processes don’t work this way. Usually, a group interviews every candidate and a group decision is made. Invariably, the group is dominated by one or more influential members. Effectively, there is a single decision maker. But we’ve seen that even if an interviewer has a very good technique for selecting qualified candidates, she doesn’t have a very good chance of getting it right when the prior probability of a candidate meeting the criteria is low.

In the single elimination process I’ve described, it’s important that the interviewers do not discuss candidates prior to interviewing the candidate, or prior to making their decision. It’s equally important that they understand the math and don’t rely on the fact that the candidates have survived previous interviews.

What are the downsides to a single elimination process? No matter how you go about it, qualified candidates will not be selected. Therefore, it’s important that candidates understand the process and are treated respectfully.

Thanks to Greg Yut of Supply Beyond for reading and commenting on a draft of this post.

May 11, 2007

Changing a Drupal Site’s Domain

Filed under: Deployment, Drupal, Fedora, Installation, Linux — trekr @ 7:14 pm

I recently needed to change a Drupal installation from www.example.com to subdomain.example.com. Here’s how I did it. There is probably a shorter way, but these steps leave the current site untouched until you are sure the new one works.

  • Get ready
  • Clear Drupal cache
  • Dump current Drupal database to a backup file
  • Create new database and grant privileges
  • Run stream editor against db backup file to fix paths
  • Load new database
  • Ensure current Drupal site directory is updated in svn
  • Export current site from repository to new directory
  • Edit sites/default/settings.php to point to new db
  • Configure Apache to use new directory
  • Enter a new A record in DNS
  • Test

To get ready, turn off css cache and make sure $base_url is commented out in your sites/default/settings.php file. I also disabled clean URLs, not sure if it’s necessary. Your Drupal directory should already be under subversion control. Put up your maintenance page since we are going to clear cache.

The next step is to clear Drupal cache and dump the database to a backup file.

Put the following code in a page, and set up a menu to access it. Make sure you restrict access to admin for this menu. Don’t be lazy and do it from the command line - it’ll be faster this time and slower next.

db_query("DELETE FROM {cache} WHERE 1");
db_query("DELETE FROM {cache_filter} WHERE 1");
db_query("DELETE FROM {cache_menu} WHERE 1");
db_query("DELETE FROM {cache_page} WHERE 1");

Ok, for the command line purists, it’s like this

mysql> DELETE FROM cache WHERE 1;

Then dump the database to a backup file

$ /usr/bin/mysqldump dbname > mysite-backup.sql

Make the current site accessible again by taking down your maintenance page.

Create a new database and grant privileges

mysql> create database newdbname;
mysql> grant all on newdbname.* to 'drupaluser'@'localhost';

Before loading the new database we need to run a stream editor against the backup file and change any hard coded paths.

Check the backup file by grep’ing for the previous subdomain and path of the current install directory

$ /bin/grep -r mysite *

$ /bin/grep -r www *

Then edit something like this

$ /usr/bin/perl -pi.bak -e's/http:\/\/www.example/http:\/\/subdomain.example/g;
> s/\/html\/mysite/\/html\/mynewsite/g;' mysite-backup.sql

And load the new database

$ /usr/bin/mysql newdbname < mysite-backup.sql

My Drupal installation was in /var/www/html/mysite under subversion control. After cleaning up the working directory and checking in all changes, I exported mysite into a new working directory.

$ cd /var/www/html
$ /usr/bin/svn export file:///var/svn/repos/mysite mynewsite

Make sure the files/ directory is writable by the web server process owner.

Edit sites/default/settings.php and change the name of the database, and make sure $base_url is commented out.

Configure a new virtual host in /etc/httpd/conf/httpd.conf for subdomain.example.com. Copy the <VirtualHost> entry for www.example.com and change the ServerName, DocumentRoot, and log file name. If you have any Directory entries, change the path to your new directory. Check your configuration changes

$ /usr/sbin/apachectl configtest

And restart

$ /usr/sbin/apachectl graceful

Make sure you have an A record in DNS for subdomain.example.com

Finally test every page.


Hakota Design LLC