Redundant iSCSI storage for Linux


Here’s how to set up relatively cheap redundant iSCSI storage on Linux. The redundancy is achieved with LVM mirroring, and the storage servers are commodity hardware running the OpenFiler Linux distribution, exposing their disks to clients via iSCSI over Ethernet. The servers are completely separate entities, and the purpose of the mirroring is to keep the logical volumes available even while one of the storage servers is down for maintenance or due to hardware failure.

Ultimately the disks of the iSCSI target servers will show up as normal SCSI disks on the client (/dev/sdb, /dev/sdc, …). The data moves across the network transparently. It is preferable to use multiple gigabit network interface cards on both the initiator and the target, bonded together for reliability and speed (or use Device Mapper Multipath). A separate VLAN for iSCSI traffic is recommended for both security and performance. By default the traffic is not encrypted, so your disk blocks can easily be sniffed with tcpdump.

I created identical logical volumes on both OpenFiler servers and mapped them to iSCSI targets. The iSCSI initiator (client) here is an Ubuntu 9.04 desktop.

Install Open-iSCSI and map targets

On the client, install Open-iSCSI.
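On Ubuntu the package is called open-iscsi, so installing it should be as simple as:

    sudo apt-get install open-iscsi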

Run the discovery to see available targets (the IP address is the address of one of the servers).
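With iscsiadm the sendtargets discovery looks like this (192.168.0.101 is a placeholder; use your first OpenFiler server's address):

    sudo iscsiadm -m discovery -t sendtargets -p 192.168.0.101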

You should get a target list as the output.
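The list has one line per target: portal address and port, the target portal group tag, and the target IQN. With OpenFiler it looks something like this (the IQN below is a made-up placeholder):

    192.168.0.101:3260,1 iqn.2006-01.com.openfiler:tsn.demo1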

Map the target to a SCSI disk.
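Logging in to the target attaches it as a SCSI disk (same placeholder IQN and address as above):

    sudo iscsiadm -m node -T iqn.2006-01.com.openfiler:tsn.demo1 -p 192.168.0.101 --login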

dmesg should now show that a new SCSI disk was detected.
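A quick way to check:

    dmesg | tail
    sudo fdisk -l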

You can now use the disk as a normal SCSI disk.

Discover the second storage server.
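Same command as before, pointed at the second server (again a placeholder address):

    sudo iscsiadm -m discovery -t sendtargets -p 192.168.0.102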

Target found (again, an illustrative placeholder):
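    192.168.0.102:3260,1 iqn.2006-01.com.openfiler:tsn.demo2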

Map the target.
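As before, with the second target's placeholder IQN and address:

    sudo iscsiadm -m node -T iqn.2006-01.com.openfiler:tsn.demo2 -p 192.168.0.102 --login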

Make persistent across reboots

The discovered nodes will automatically show up under /etc/iscsi/nodes. If you wish to make them available automatically after reboot, change the following line in the corresponding node file (the setting in question is node.startup):
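    node.startup = manual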

Change to:
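    node.startup = automatic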

Partition with fdisk (optional)

I partitioned the disks with fdisk. This is optional, but I like to do it because it makes it easier to detect the type of the disk just by checking the partition table.
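A sketch of the idea, assuming the two iSCSI disks appeared as /dev/sdb and /dev/sdc: create one primary partition on each and set its type to 8e (Linux LVM), which is what makes the disk's purpose visible in the partition table.

    sudo fdisk /dev/sdb    # n (new), p (primary), 1, accept defaults, t, 8e, w (write)
    sudo fdisk /dev/sdc    # same steps for the second disk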

The LVM Part

Install Logical Volume Manager.
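On Ubuntu:

    sudo apt-get install lvm2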

Create physical volumes and the volume group.
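Assuming the partitions created above, with an illustrative volume group name:

    sudo pvcreate /dev/sdb1 /dev/sdc1
    sudo vgcreate vg_iscsi /dev/sdb1 /dev/sdc1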

Create a mirrored logical volume.
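With only two physical volumes there is no third device to hold the mirror log, so the log has to be kept in memory. A sketch (size and names illustrative; older lvm2 releases spell the log option --corelog):

    sudo lvcreate -L 10G -m 1 --mirrorlog core -n lv_mirror vg_iscsi

Note that an in-memory log means the mirror is fully resynced after every reboot of the client.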

Create filesystem and mount.
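For example, with ext3 and an arbitrary mount point:

    sudo mkfs.ext3 /dev/vg_iscsi/lv_mirror
    sudo mkdir -p /mnt/mirror
    sudo mount /dev/vg_iscsi/lv_mirror /mnt/mirror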

Speed

Test read speeds.
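A simple sequential read test with dd (block size and count arbitrary):

    sudo dd if=/dev/vg_iscsi/lv_mirror of=/dev/null bs=1M count=1000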

About 10 MB/s is the maximum I can get on this test system, which uses 100 Mbit/s Ethernet; that is close to the theoretical limit of roughly 12.5 MB/s.

On a production system, gigabit is a must (preferably multiple links bonded).

Status of the Mirrored Logical Volume

To check the status of the mirrored logical volume, run the command “lvs”:
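The output looks roughly like this (illustrative, using the volume names from above):

    LV        VG       Attr   LSize  Origin Snap%  Move Log Copy%
    lv_mirror vg_iscsi mwi-ao 10.00G                        100.00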

The Copy% column shows the percentage of synchronized extents; 100% indicates the mirror halves are in sync. While a mirror is out of sync and being resynchronized, the percentage will be lower.

The commands “lvdisplay -m” and “pvdisplay -m” will show you a detailed map of the extents on the physical volumes:

    lvdisplay -m
    pvdisplay -m

Testing for failure

When one of the iSCSI servers was brought down, it took about two minutes before the iSCSI initiator gave up on it. After that, the mounted volume kept working without problems. During the two-minute timeout, I/O stalled and there was noticeable waiting.

After the iSCSI server was brought back up, the other half of the mirror was restored and resynced automatically. In conclusion, I would say my mirrored logical volume can reasonably be called highly available.

It seems that the timeout value can be set in the node configuration file (although I didn’t test it):
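The relevant setting should be replacement_timeout; its open-iscsi default of 120 seconds matches the roughly two minutes observed above:

    node.session.timeo.replacement_timeout = 120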

6 thoughts on “Redundant iSCSI storage for Linux”

  1. Hi,

    How did you end up with the read cache disabled on sdb?

    I'm trying to set up a cluster filesystem on iSCSI, and I can't disable the read cache, so all servers see different data on the disk 🙁

    thanks…

  2. Bill, I have to admit I don’t know. That’s how it was by default. Perhaps a setting somewhere under /etc/iscsi…?

  3. Hi,

    I seem to have the same problem as Bill. I have two nodes sharing the same iSCSI device, and if node1 writes to that device (I used dd), node2 keeps reading from its cache. If I invalidate the cache manually (echo 3 > /proc/sys/vm/drop_caches) on node2, I see the correct content.

    Strangely enough, this only happens on my VMware-based Linux cluster. On bare-metal Linux (same version and same OpenFiler as iSCSI target) this does not happen.

    Any suggestions?

    Many thanks in advance
    Reinhard

  4. Actually, this post describes a single host using dual storage servers for mirroring the same data. So it is storage server high availability, not clustering two nodes with a shared filesystem.

    I guess you really do need to disable caching on the nodes if you are using more than one client node. I don’t really know how to do that – someone wiser could comment on this.

    As a side note, I discovered that GFS works with dual client nodes, but performance is horrible when using LVM mirrored disks underneath (2 GFS nodes, 2 storage servers). I would like to test OCFS2 in this regard, but my test system is down, perhaps permanently, so I probably need to build a new one when I have some time. My guess is that DRBD mirroring is the way to go instead of LVM mirroring.

  5. You absolutely NEED a clustered filesystem to use the same iSCSI disk (LUN) from more than one client. Look for GFS2 or OCFS2. You can create an LVM VG on top of this shared iSCSI disk, but then you need the same protection for the LVM metadata. Look for CLVM.

  6. How does LVM know which of the two iSCSI disks in a mirror is the one with the newest data? I mean, if one of the iSCSI disks gets disconnected and comes back after some time (or a reboot), which disk is mirrored to the other?
