Here’s how to set up relatively cheap redundant iSCSI storage on Linux. The redundancy is achieved using LVM mirroring, and the storage servers consist of commodity hardware, running the OpenFiler Linux distribution, which expose their disks to the clients using iSCSI over Ethernet. The servers are completely separate entities, and the purpose of this mirroring is to keep the logical volumes available, even while one of the storage servers is down for maintenance or due to hardware failure.
Ultimately the disks of the iSCSI target servers will show up as normal SCSI disks on the client (/dev/sdb, /dev/sdc, …). The data will be moved across the network transparently. It is preferable to use multiple gigabit network interface cards on both the initiator and the target, and bond them together for reliability and speed gain (or use Device Mapper Multipath). A separate VLAN for iSCSI traffic is recommended for security and speed. By default, the traffic is not encrypted so your disk blocks can easily be sniffed using tcpdump.
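It is easy to verify the plaintext claim yourself with tcpdump; a quick sketch (the interface name eth0 is an assumption, substitute your iSCSI-facing NIC):

```shell
# Sketch: watch unencrypted iSCSI traffic on the default port 3260.
# eth0 is an assumption; use your iSCSI-facing interface.
tcpdump -i eth0 -nn -X port 3260
```

With -X you will see the raw disk block contents scroll by in hex and ASCII, which is exactly why a dedicated VLAN (or IPsec) is worth the trouble.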
I created identical logical volumes on both OpenFiler servers and mapped them to iSCSI targets. The iSCSI initiator (client) here is an Ubuntu 9.04 desktop.
Install Open-iSCSI and map targets
On the client, install Open-iSCSI.
```
aptitude install open-iscsi
```
Run the discovery to see available targets (the IP address is the address of one of the servers).
```
iscsiadm -m discovery -t st -p 192.168.1.115
```
You should get a target list as the output.
```
192.168.1.115:3260,1 iqn.2006-01.com.openfiler:linuxtest1lv
```
Map the target to a SCSI disk.
```
iscsiadm -m node -T iqn.2006-01.com.openfiler:linuxtest1lv -p 192.168.1.115 --login
```
dmesg should now show that a new SCSI disk was detected.
```
[600584.938727] scsi 2:0:0:0: Direct-Access     OPNFILER VIRTUAL-DISK     0    PQ: 0 ANSI: 4
[600584.947903] sd 2:0:0:0: [sdb] 4194304 512-byte hardware sectors: (2.14 GB/2.00 GiB)
[600584.983070] sd 2:0:0:0: [sdb] Write Protect is off
[600584.983074] sd 2:0:0:0: [sdb] Mode Sense: 77 00 00 08
[600584.988064] sd 2:0:0:0: [sdb] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[600584.989379] sd 2:0:0:0: [sdb] 4194304 512-byte hardware sectors: (2.14 GB/2.00 GiB)
[600584.989974] sd 2:0:0:0: [sdb] Write Protect is off
[600584.989977] sd 2:0:0:0: [sdb] Mode Sense: 77 00 00 08
[600584.991359] sd 2:0:0:0: [sdb] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[600584.991363] sdb: unknown partition table
[600585.008012] sd 2:0:0:0: [sdb] Attached SCSI disk
[600585.008072] sd 2:0:0:0: Attached scsi generic sg2 type 0
```
You can now use the disk as a normal SCSI disk.
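A few quick ways to confirm the new disk from the client side (the device name sdb comes from the dmesg output above and may differ on your system):

```shell
# Sketch: confirm the iSCSI disk is visible (sdb is an assumption).
iscsiadm -m session    # list active iSCSI sessions
cat /proc/partitions   # the new disk should be listed here
fdisk -l /dev/sdb      # print its (still empty) partition table
```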
Discover the second storage server.
```
iscsiadm -m discovery -t st -p 192.168.1.120
```
Target found:
```
192.168.1.120:3260,1 iqn.2006-01.com.openfiler:linuxtest1lv-2
```
Map the target.
```
iscsiadm -m node -T iqn.2006-01.com.openfiler:linuxtest1lv-2 -p 192.168.1.120 --login
```
Make persistent across reboots
The discovered nodes will automatically show up under /etc/iscsi/nodes. If you wish to make them available automatically after reboot, change the following line in the corresponding node file:
```
node.conn[0].startup = manual
```
Change to:
```
node.conn[0].startup = automatic
```
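To flip every discovered node at once, something like this should work. /etc/iscsi/nodes is the Debian/Ubuntu default location; the exact file layout beneath it varies between Open-iSCSI versions, which is why the sketch uses find rather than a fixed path:

```shell
# Sketch: set startup = automatic in every discovered node record.
# Assumes the Debian/Ubuntu default nodes directory.
find /etc/iscsi/nodes -type f -exec \
    sed -i 's/^node\.conn\[0\]\.startup = manual$/node.conn[0].startup = automatic/' {} +
```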
Partition with fdisk (optional)
I partitioned the disks with fdisk. This is optional, but I like to do it because it makes it easier to identify the type of the disk just by checking the partition table.
```
Disk /dev/sdb: 2147 MB, 2147483648 bytes
67 heads, 62 sectors/track, 1009 cylinders
Units = cylinders of 4154 * 512 = 2126848 bytes
Disk identifier: 0x32d429c4

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1        1009     2095662   8e  Linux LVM

Disk /dev/sdc: 2147 MB, 2147483648 bytes
67 heads, 62 sectors/track, 1009 cylinders
Units = cylinders of 4154 * 512 = 2126848 bytes
Disk identifier: 0x9823ed68

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1        1009     2095662   8e  Linux LVM
```
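If you prefer to script the partitioning instead of answering fdisk prompts, sfdisk can create the single Linux LVM partition non-interactively. The device names are from this setup; double-check them before running, since this overwrites the partition table:

```shell
# Sketch: create one whole-disk partition of type 8e (Linux LVM) per disk.
# DANGER: overwrites any existing partition table; sdb/sdc are from this setup.
echo ',,8e' | sfdisk /dev/sdb
echo ',,8e' | sfdisk /dev/sdc
```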
The LVM Part
Install Logical Volume Manager.
```
aptitude install lvm2
```
Create physical volumes and the volume group.
```
pvcreate /dev/sdb1
pvcreate /dev/sdc1
vgcreate vg0 /dev/sdb1 /dev/sdc1
```
Create a mirrored logical volume.
```
lvcreate --mirrors 1 --corelog --name testlv --size 512M vg0
```
Create filesystem and mount.
```
mke2fs -j /dev/vg0/testlv
mount /dev/vg0/testlv /mnt/test
```
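If you want the volume mounted from /etc/fstab, remember the _netdev option so the mount is deferred until networking (and thus iSCSI) is up; otherwise the boot can hang trying to mount a disk that is not there yet. A sketch of the fstab line for this setup:

```shell
# /etc/fstab entry sketch for the iSCSI-backed LVM volume from this article.
# _netdev defers the mount until the network is up.
/dev/vg0/testlv  /mnt/test  ext3  defaults,_netdev  0  2
```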
Speed
Test read speeds.
```
hdparm -t /dev/mapper/vg0-testlv
```
10 MB per second is about the max I can get with this test system which uses 100 Mbit/s ethernet.
```
/dev/mapper/vg0-testlv:
 Timing buffered disk reads:   32 MB in  3.22 seconds =   9.95 MB/sec
```
On a production system, gigabit is a must (preferably multiple links bonded).
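On Ubuntu of this era, bonding is configured with the ifenslave package and /etc/network/interfaces. A sketch, assuming eth0/eth1 are the iSCSI-facing NICs and an address on the article's test network; the exact stanza names vary between ifenslave versions, so check the package documentation:

```shell
# /etc/network/interfaces sketch (requires the ifenslave package).
# eth0/eth1 and the address are assumptions matching this article's network.
auto bond0
iface bond0 inet static
    address 192.168.1.50
    netmask 255.255.255.0
    slaves eth0 eth1
    bond_mode balance-rr
    bond_miimon 100
```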
Status of the Mirrored Logical Volume
To check the status of the mirrored logical volume, run the command “lvs”:
```
  LV     VG   Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  testlv vg0  mwi-ao 512,00M                        100,00
```
The Copy% will show the percentage of copied extents. 100% indicates the mirrors are synced. Whenever a mirror is out-of-sync and is being updated, the percentage will be less.
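If you want to script a wait-for-sync, lvs can print just that percentage via the copy_percent field. A sketch, using vg0/testlv from this article; note that the decimal separator in the output depends on your locale (the listings here show commas):

```shell
# Sketch: poll until the mirror reports fully synced (assumes vg0/testlv).
while true; do
    pct=$(lvs --noheadings -o copy_percent vg0/testlv | tr -d ' ')
    echo "mirror sync: ${pct}%"
    [ "$pct" = "100.00" ] && break
    sleep 10
done
```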
The commands “lvdisplay -m” and “pvdisplay -m” will show you a detailed map of the extents on the physical volumes:
lvdisplay -m
```
--- Logical volume ---
LV Name                /dev/vg0/testlv
VG Name                vg0
LV UUID                5ookbu-qJ9h-rzBA-D6Ek-mkH2-Vryc-EYYqvp
LV Write Access        read/write
LV Status              available
# open                 1
LV Size                512,00 MB
Current LE             128
Segments               1
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           252:2

--- Segments ---
Logical extent 0 to 127:
  Type                mirror
  Mirrors             2
  Mirror size         128
  Mirror region size  512,00 KB
  Mirror original:
    Logical volume    testlv_mimage_0
    Logical extents   0 to 127
  Mirror destinations:
    Logical volume    testlv_mimage_1
    Logical extents   0 to 127
```
pvdisplay -m
```
--- Physical volume ---
PV Name               /dev/sdb1
VG Name               vg0
PV Size               2,00 GB / not usable 2,54 MB
Allocatable           yes
PE Size (KByte)       4096
Total PE              511
Free PE               383
Allocated PE          128
PV UUID               JINpaF-WiCp-sEH2-2PcK-bEvR-ht8j-mRAg05

--- Physical Segments ---
Physical extent 0 to 127:
  Logical volume      /dev/vg0/testlv_mimage_0
  Logical extents     0 to 127
Physical extent 128 to 510:
  FREE

--- Physical volume ---
PV Name               /dev/sdc1
VG Name               vg0
PV Size               2,00 GB / not usable 2,54 MB
Allocatable           yes
PE Size (KByte)       4096
Total PE              511
Free PE               383
Allocated PE          128
PV UUID               V7dMTV-gWLe-gRWy-H7LU-7mwI-LsBu-2uIC7C

--- Physical Segments ---
Physical extent 0 to 127:
  Logical volume      /dev/vg0/testlv_mimage_1
  Logical extents     0 to 127
Physical extent 128 to 510:
  FREE
```
Testing for failure
When one of the iSCSI servers was brought down, it took about two minutes before the iSCSI initiator gave up on it. After that, the mounted volume kept working without problems. During the two-minute timeout, some slowness and waiting was noticeable.
After the iSCSI server was brought up again, the other half of the mirror was restored and synced automatically. In conclusion, I would say my mirrored logical volume can be thought of as highly available.
It seems that the timeout value can be set in the node configuration file (although I didn’t test it):
```
node.session.timeo.replacement_timeout = 120
```
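Instead of editing the node file by hand, iscsiadm can update the setting for a node record; a sketch, using the target name and portal from the examples above (I have not verified the resulting failover behavior either):

```shell
# Sketch: lower the replacement timeout to 30 seconds for one node record.
iscsiadm -m node -T iqn.2006-01.com.openfiler:linuxtest1lv -p 192.168.1.115 \
    -o update -n node.session.timeo.replacement_timeout -v 30
```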
Comments

Hi,
how did you end up with disabled read cache on sdb?
i’m trying to setup a cluster filesystem on iSCSI, and i can’t disable read cache so all servers see different data on the disk 🙁
thanks…
Bill, I have to admit I don’t know. That’s how it was by default. Perhaps a setting somewhere under /etc/iscsi…?
Hi,
I seem to have the same problem as Bill. I have two nodes sharing the same ISCSI device, and if node1 writes to that device (I used dd), node2 keeps reading from its cache. If I invalidate the cache manually (echo 3 > /proc/sys/vm/drop_caches) on node2, I see the correct content.
Strange enough, this only happens on my Linux Cluster based on VMware. On a bare metal Linux (same version and same Openfiler as ISCSI target) this does not happen.
Any suggestions ?
Many thanks in advance
Reinhard
Actually, this post describes a single host using dual storage servers for mirroring the same data. So it is storage server high availability, not clustering two nodes with a shared filesystem.
I guess you really do need to disable caching on the nodes if you are using more than one client node. I don’t really know how to do that – someone wiser could comment on this.
As a side note, I discovered that GFS works with dual client nodes, but performance is horrible when using LVM mirrored disks underneath (2 GFS nodes, 2 storage servers). I would like to test OCFS2 in this regard, but my test system is down, perhaps permanently, so I probably need to build a new one when I have some time. My guess is that DRBD mirroring is the way to go instead of LVM mirroring.
You absolutely NEED a clustered filesystem to use the same iSCSI disk (LUN) from more than one client. Look into GFS2 or OCFS2. You can make an LVM VG on top of this shared iSCSI disk, but then you need the same for the LVM metadata. Look into CLVM.
How does LVM know which of the two iSCSI disks in a mirror is the one with the newest data? I mean, if one of the iSCSI disks gets disconnected and comes back after some time (or a reboot), which disk is mirrored to the other?