Hard disks are being released that abandon the long-established standard of 512-byte sectors. I just got two Western Digital WD15EARS drives, which use 4 kB sectors. Western Digital refers to this as “Advanced Format”. This poses some serious problems. I will describe them, and what I did to ‘solve’ them.

This article is as much for myself as for other people correcting me. So, if you see faults, let me know 🙂

First, you may want to read this IBM article and this LWN article. Reading these articles is important to understand this issue. To summarize a bit, hard disks can’t really expose 4 kB sectors, because then BIOSes, bootloaders, operating systems, partitioning software and who knows what else would go berserk. So, they still report a sector size of 512 bytes. This emulation causes certain performance issues (explained in the links), which can only be avoided by aligning the filesystems on disk to 4 kB boundaries, which the IBM article explains well. In short, I configured fdisk to use 224 heads and 56 sectors per track. That makes every cylinder a multiple of eight 512-byte sectors, so partitions that start on a cylinder boundary are aligned properly. Do remember, however, to start your first partition at cylinder 2, otherwise there isn’t enough space for the bootloader.
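The arithmetic behind that geometry can be checked with a small sketch. It assumes fdisk numbers cylinders from 1 and that a partition starts on the first sector of its cylinder. The function names are mine, and the 255 heads / 63 sectors geometry is only there as the traditional default to compare against.

```python
SECTOR = 512      # logical sector size the drive reports
PHYSICAL = 4096   # physical sector size of an Advanced Format drive


def cylinder_start_bytes(cylinder, heads, sectors_per_track):
    """Byte offset of the first sector of a 1-based cylinder (assumed numbering)."""
    return (cylinder - 1) * heads * sectors_per_track * SECTOR


def is_aligned(offset_bytes):
    """True if the offset falls on a 4 kB physical sector boundary."""
    return offset_bytes % PHYSICAL == 0


# Geometry from the text: 224 heads, 56 sectors per track.
print(224 * 56)  # sectors per cylinder: 12544, a multiple of 8
print(is_aligned(cylinder_start_bytes(2, 224, 56)))

# Traditional geometry: 255 * 63 = 16065 sectors per cylinder, not a multiple of 8.
print(is_aligned(cylinder_start_bytes(2, 255, 63)))
```

With 224/56 every cylinder boundary is aligned, not just cylinder 2, so later partitions can start on any cylinder.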

Now, what happens if you don’t put a file system on your partition, but make a software RAID1+LVM out of it? It gets a lot more complicated to align the filesystem blocks this way, because of all the layers. I did the following:

First I created the aligned partition with fdisk on both disks. Then I made a RAID1 array out of them. Luckily, Linux MD can store the RAID superblock at the end of the partition. It can also store it at the beginning, but you shouldn’t do that! The RAID superblock is 256 bytes + 2 bytes per drive long, which means it takes up one logical sector, which in turn means that the start of the MD data ends up in the middle of a 4 kB sector on disk. You can use metadata format 1.0 or 0.9 to put the superblock at the end of the partition (see the mdadm man page).
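To illustrate why a one-sector shift hurts, here is a minimal sketch that counts how many physical sectors a single 4 kB write touches. It assumes the scenario described above, where the data begins exactly one 512-byte logical sector into an aligned partition. The real data offset of an existing array may be different, and `mdadm --examine` reports it for 1.x metadata. The function name is made up for this example.

```python
PHYSICAL = 4096  # physical sector size of an Advanced Format drive


def physical_sectors_touched(offset_bytes, length_bytes):
    """Number of 4 kB physical sectors covered by a write at the given byte offset."""
    first = offset_bytes // PHYSICAL
    last = (offset_bytes + length_bytes - 1) // PHYSICAL
    return last - first + 1


# Superblock at the end: data starts at offset 0 of the aligned partition.
print(physical_sectors_touched(0, 4096))    # 1

# Data shifted by one logical sector (assumed superblock-at-start layout).
print(physical_sectors_touched(512, 4096))  # 2
```

In the shifted case each 4 kB filesystem block covers only part of two physical sectors, so the drive has to read, modify and rewrite both of them.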

Then it’s time to set up LVM. When preparing the RAID1 device for use in the volume group, give pvcreate the --dataalignment 4096 option (pvcreate reads a number without a unit suffix as kilobytes, so this aligns to 4 MB, which is still a multiple of 4 kB). Then, with “pvs -o +pe_start”, you should be able to see where the first PE (physical extent) starts. I accidentally created mine with alignment 8, but the first extent is at 136.00 kB (139264 bytes), which is divisible by 4096.

Then, I created the volume group with an extent size of 4 MB, which is also divisible by 4096. Logical volumes are created as multiples of the extent size, so as long as the first extent is aligned, you can now create them at will, and they will be aligned.
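The whole stack can be checked by adding up the offsets of each layer. The following is only a sketch under the assumptions of this article: the partition starts at cylinder 2 with 224 heads and 56 sectors, the MD superblock sits at the end so the array data starts at offset 0 within the partition, and the first extent is at the 139264 bytes reported by pvs. The function and variable names are mine.

```python
PHYSICAL = 4096
MIB = 1024 * 1024


def extent_offset(partition_start, md_data_offset, pe_start, extent_size, index):
    """Absolute byte offset on disk of physical extent number `index`."""
    return partition_start + md_data_offset + pe_start + index * extent_size


partition_start = 224 * 56 * 512  # cylinder 2 (assumed 1-based cylinders)
md_data_offset = 0                # superblock at the end (metadata 1.0 or 0.9)
pe_start = 139264                 # 136.00 kB, as reported by pvs
extent_size = 4 * MIB

# Every extent, and therefore every logical volume, starts on a 4 kB boundary.
print(all(
    extent_offset(partition_start, md_data_offset, pe_start, extent_size, i) % PHYSICAL == 0
    for i in range(1000)
))

# A 512-byte shift in any layer breaks alignment for every extent.
print(extent_offset(partition_start, 512, pe_start, extent_size, 0) % PHYSICAL)
```

Since every term in the sum is a multiple of 4096, so is the total. One misaligned layer is enough to misalign everything on top of it.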

When creating a RAID array that deals with striping, be sure to make the stripe size a multiple of 4 kB. I guess this also applies to logical volumes with striping.

What I would still like to know is whether the file system journal is also made up of 4 kB blocks. The RAID array’s write-intent bitmap (if you have one) is also still unclear to me. Where is it stored? Does it write in multiples of 4 kB?