I needed some VMs to be available on a backup node, which I accomplished with DRBD, the Distributed Replicated Block Device. My host machines run Debian 6.

This post replaces an older one I made.

First install drbd:

aptitude -P install drbd8-utils

Then make some config files. First adjust /etc/drbd.d/global_common.conf (I only had to uncomment the notify handlers):

global {
        usage-count yes;
        # minor-count dialog-refresh disable-ip-verification
}

common {
        protocol C;

        handlers {
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
                # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                split-brain "/usr/lib/drbd/notify-split-brain.sh root";
                out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
                # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
                # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
        }

        startup {
                # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb;
                # The timeout value when the last known state of the other side was available.
                wfc-timeout 0;
                # Timeout value when the last known state was disconnected.
                degr-wfc-timeout 180;
        }

        disk {
                # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
                # no-disk-drain no-md-flushes max-bio-bvecs
        }

        net {
                # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
                # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
                # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
        }

        syncer {
                # rate after al-extents use-rle cpu-mask verify-alg csums-alg
        }
}

Then I made a resource for my existing logical volume:

resource r0 {
  meta-disk internal;
  device /dev/drbd1;

  startup {
    # The timeout value when the last known state of the other side was available.
    wfc-timeout 0;
    # Timeout value when the last known state was disconnected.
    degr-wfc-timeout 180;
  }

  syncer {
    # This is recommended only for low-bandwidth lines, to only send those
    # blocks which really have changed.
    #csums-alg md5;
    # Set to about half your net speed
    rate 8M;
    # It seems that this option moved to the 'net' section in drbd 8.4.
    verify-alg md5;
  }

  net {
    # The manpage says this is recommended only in pre-production (because of its performance), to determine
    # if your LAN card has a TCP checksum offloading bug.
    #data-integrity-alg md5;
  }

  disk {
    # Detach causes the device to work over-the-network-only after the
    # underlying disk fails. Detach is not default for historical reasons, but is
    # recommended by the docs.
    # However, the Debian defaults in drbd.conf suggest the machine will reboot in that event...
    on-io-error detach;
    # LVM doesn't support barriers, so disabling it. It will revert to flush. Check wo: in /proc/drbd. If you don't disable it, you get IO errors.
    no-disk-barrier;
  }

  on top {
    disk /dev/universe/lvtest;
    # An "address <ip>:<port>;" line is also required here.
  }
  on bottom {
    disk /dev/universe/lvtest;
    # An "address <ip>:<port>;" line is also required here.
  }
}

Copy all config files to the secondary machine (and write an rsync script for it…).
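Such an rsync script could look roughly like the sketch below. The function name `sync_drbd_config`, the `DRY_RUN` switch and root SSH access to the peer are assumptions for illustration; `bottom` is the peer hostname from the resource file.

```shell
#!/bin/sh
# Sketch: push the DRBD config to the peer node so both sides stay identical.
# sync_drbd_config and DRY_RUN are hypothetical names, not part of DRBD.

sync_drbd_config() {
    peer="$1"    # hostname of the other node, e.g. "bottom"
    # -a preserves permissions and timestamps of the copied files.
    set -- rsync -a /etc/drbd.conf /etc/drbd.d "root@${peer}:/etc/"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"    # only show what would be run
    else
        "$@"
    fi
}

# Usage:
#   DRY_RUN=1 sync_drbd_config bottom   # print the command
#   sync_drbd_config bottom             # actually copy
```

The dry-run mode makes it easy to check the command before it touches the other node.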

I learned that Linux 3.1 now has write barriers enabled by default for ext3 (they already were for ext4). This causes bugs and IO errors with xen-blkfront, so barriers need to be disabled in the guest's fstab:

# grep barrier /etc/fstab
/dev/xvda2 / ext3 barrier=0 0 1
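To spot guests that still have barriers on, something like the following sketch can be run inside the VM. The helper name `ext3_with_barriers` is made up for this example, and it only inspects fstab, not the live mount options.

```shell
#!/bin/sh
# Sketch: list ext3 mount points in an fstab file that do not disable barriers.
# ext3_with_barriers is a hypothetical helper name.

ext3_with_barriers() {
    fstab="${1:-/etc/fstab}"
    # Field 3 is the filesystem type, field 4 the mount options.
    # Comment lines are skipped; barrier=0 and nobarrier both count as disabled.
    awk '$1 !~ /^#/ && $3 == "ext3" && $4 !~ /(^|,)(barrier=0|nobarrier)(,|$)/ { print $2 }' "$fstab"
}

# Usage: ext3_with_barriers            # checks /etc/fstab
#        ext3_with_barriers /tmp/fstab # checks another file
```

An empty result means every ext3 entry already has barriers turned off.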

I’ll look for existing bug reports and file one if necessary.

Because the resource uses internal metadata, DRBD stores its metadata at the end of the actual LV. So on the primary node, we need to shrink the filesystem to make space (you can also grow the LV instead):

e2fsck -f /dev/universe/lvtest
resize2fs /dev/universe/lvtest 500M # or any size that's a tad smaller than the actual LV.
drbdadm create-md r0
drbdadm up r0
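How much is "a tad smaller"? The DRBD User's Guide gives a formula for the size of internal metadata, which this sketch evaluates. The helper name `drbd_md_sectors` is made up for this example; sizes are in 512-byte sectors.

```shell
#!/bin/sh
# Sketch: estimate DRBD internal metadata size for a backing device.
# Formula from the DRBD User's Guide (sizes in 512-byte sectors):
#   Ms = ceil(Cs / 2^18) * 8 + 72
# drbd_md_sectors is a hypothetical helper name.

drbd_md_sectors() {
    cs="$1"    # size of the backing device in sectors
    # Adding 2^18 - 1 before the integer division rounds up.
    echo $(( (cs + 262143) / 262144 * 8 + 72 ))
}

# Usage on the primary (blockdev --getsz prints the size in 512-byte sectors):
#   drbd_md_sectors "$(blockdev --getsz /dev/universe/lvtest)"
```

For a 1 GiB LV (2097152 sectors) this comes to 136 sectors, or 68 KiB, so shrinking the filesystem by a few megabytes leaves plenty of room.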

On the secondary node, make the device as well:

drbdadm create-md r0
drbdadm up r0

Then we can start syncing and re-grow it. On the primary:

drbdadm -- --overwrite-data-of-peer primary r0 # the -- is necessary because of weird option handling by drbdadm.
resize2fs /dev/drbd1
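Progress of the initial sync, and the write-ordering method mentioned in the resource file, can be read from /proc/drbd. The sketch below pulls a single field out of the stanza for one device. `drbd_field` is a made-up helper name, and the layout it expects is that of DRBD 8.3.

```shell
#!/bin/sh
# Sketch: print one field (cs, ro, ds, wo, ...) for a DRBD minor from /proc/drbd.
# drbd_field is a hypothetical helper; it assumes the DRBD 8.3 /proc/drbd layout.

drbd_field() {
    minor="$1"; field="$2"; file="${3:-/proc/drbd}"
    awk -v m="$minor" -v f="$field" '
        # A line such as " 1: cs:..." starts the stanza of a device.
        /^ *[0-9]+:/ { active = ($1 == m ":") }
        active {
            for (i = 1; i <= NF; i++)
                if (index($i, f ":") == 1) { print substr($i, length(f) + 2); exit }
        }
    ' "$file"
}

# Usage:
#   drbd_field 1 cs   # connection state, e.g. SyncSource while syncing
#   drbd_field 1 ds   # disk states as local/peer
#   drbd_field 1 wo   # write ordering: b(arrier), f(lush), d(rain) or n(one)
```

Once `ds` reads UpToDate/UpToDate, the secondary holds a full copy.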

The logical volume is now recognized as a DRBD device instead of ext3, so it can no longer be mounted directly:

# mount /dev/universe/lvtest /mnt/temp
mount: unknown filesystem type 'drbd'

Then, it is recommended you create /etc/modprobe.d/drbd.conf with:

options drbd disable_sendpage=1

I don’t know what it does, but it’s recommended by the DRBD docs when you put Xen domains on DRBD devices.

In Xen, you can configure the disk device of a VM like this, with the name of the DRBD resource in place of resource (although I later learned that this doesn’t work with pygrub):

disk = [ 'drbd:resource,xvda,w' ]

DRBD installs the necessary scripts in /etc/xen/scripts to support this. Xen will now automatically promote a DRBD device to primary when you start a VM.

Be warned: because of that, don’t put the VM in the /etc/xen/auto dir on the fallback node. Otherwise, whichever machine is faster will start the VM, preventing the other machine from starting it (because you can’t have two primaries).
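Before starting a VM by hand on the fallback node, it can help to check whether the other side still holds the device. A minimal sketch, assuming the local/peer output format of `drbdadm role`; the function name `peer_is_primary` is made up for this example:

```shell
#!/bin/sh
# Sketch: decide from `drbdadm role <resource>` output whether the peer is primary.
# peer_is_primary is a hypothetical helper name.

peer_is_primary() {
    # $1 is the role string, formatted as local/peer, e.g. "Secondary/Primary".
    case "$1" in
        */Primary) return 0 ;;
        *)         return 1 ;;
    esac
}

# Usage:
#   if peer_is_primary "$(drbdadm role r0)"; then
#       echo "r0 is in use on the other node, not starting the VM" >&2
#   fi
```

If the peer is unreachable the role shows up as Unknown, which this check treats as "not primary", so it is no substitute for proper fencing.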

Then, I noticed that Debian arranges its boot process erroneously, starting xendomains before drbd. I commented on an old bug report.

You can fix it by adding xendomains to the following lines in /etc/init.d/drbd:

# X-Start-Before: heartbeat corosync xendomains
# X-Stop-After:   heartbeat corosync xendomains 
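The edit can also be scripted, which is handy if a package upgrade replaces the init script. This is only a sketch: `add_xendomains_dep` is a made-up name, and it assumes the two header lines already exist in the file.

```shell
#!/bin/sh
# Sketch: append xendomains to the X-Start-Before / X-Stop-After headers of the
# drbd init script. add_xendomains_dep is a hypothetical helper name.

add_xendomains_dep() {
    script="${1:-/etc/init.d/drbd}"
    tmp="$(mktemp)" || return 1
    # Only touch the two header lines, and only if xendomains is not there yet.
    sed -e '/^# X-Start-Before:/{/xendomains/!s/[[:space:]]*$/ xendomains/;}' \
        -e '/^# X-Stop-After:/{/xendomains/!s/[[:space:]]*$/ xendomains/;}' \
        "$script" > "$tmp" &&
    cat "$tmp" > "$script"    # overwrite in place so the executable bit survives
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Usage: add_xendomains_dep
# Assumption: with Debian's dependency-based boot the order probably has to be
# regenerated afterwards, e.g. with `insserv drbd`.
```

Running it twice is harmless, because lines that already mention xendomains are left alone.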

Mdadm (software RAID) schedules monthly checks of your array. You can do something similar for DRBD, on the primary node, with a cronjob in /etc/cron.d/ (this one runs weekly, on Sundays at 00:42):

42 0 * * 0    root    /sbin/drbdadm verify all
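Apart from the out-of-sync handler mail, the result of a verify run shows up in the oos: counter in /proc/drbd, in KiB. A small sketch to read it; the helper name `drbd_oos_kib` is made up and the DRBD 8.3 /proc/drbd layout is assumed:

```shell
#!/bin/sh
# Sketch: print the out-of-sync counter (KiB) of one DRBD minor.
# drbd_oos_kib is a hypothetical helper; it assumes the DRBD 8.3 /proc/drbd layout.

drbd_oos_kib() {
    minor="$1"; file="${2:-/proc/drbd}"
    awk -v m="$minor" '
        # A line such as " 1: cs:..." starts the stanza of a device.
        /^ *[0-9]+:/ { active = ($1 == m ":") }
        active {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^oos:[0-9]+$/) { print substr($i, 5); exit }
        }
    ' "$file"
}

# Usage:
#   [ "$(drbd_oos_kib 1)" = "0" ] || echo "drbd1 has out-of-sync blocks" >&2
```

A non-zero value after the verify has finished means the two sides differ and a resync is needed.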

One last thing: the docs state that when you perform a verify and it detects an out-of-sync device, all you have to do is disconnect and connect. That didn’t work for me. Instead, I ran the following on the secondary node (the one I had destroyed with dd) to initiate a resync:

drbdadm invalidate r0