Difference: DESStorageImplement (1 vs. 7)

Revision 7 2012-01-15 - TWikiAdminUser

Line: 1 to 1

META TOPICPARENT	name="ExperimentalStorage"

Implementing a Storage Area Network Prototype

Line: 6 to 6

Objective

Changed:

<
<

This project was the result of earlier projects on the design and simulation of a storage system for the Department of Engineering Science (DES). These projects suggested that a core-edge SAN might provide improved performance and reliability over the existing storage system. NDSG received funding to build a prototype storage system to compare with the models developed.

>
>

Initial Core-Edge SAN Implementation

Changed:

<
<

The network we set out to build was the result of design and simulation work on the DES storage system. The final design of the network is shown below:

>
>

The network we set out to build was the result of design and simulation work on the DES storage system. The final design of the network is shown below:

Revision 6 2011-09-08 - TWikiAdminUser

Line: 1 to 1

Changed:

<
<

META TOPICPARENT	name="ORUA.ExperimentalStorage"

>
>

META TOPICPARENT	name="ExperimentalStorage"

Implementing a Storage Area Network Prototype

Implementing a Storage Area Network Prototype

Objective

Changed:

<
<

>
>

Initial Core-Edge SAN Implementation

Changed:

<
<

The network we set out to build was the result of design and simulation work on the DES storage system. The final design of the network is shown below:

>
>

The network we set out to build was the result of design and simulation work on the DES storage system. The final design of the network is shown below:

Changed:

<
<

(For a full size image see the design project.)

>
>

(For a full size image see the design project.)

As a first step in implementing this design we are building an experimental network consisting only of the New Servers and the Disks with a smaller core-edge network:

Line: 749 to 749

META FILEATTACHMENT	attachment="ndsg_cluster.jpg" attr="h" comment="" date="1292383089" name="ndsg_cluster.jpg" path="ndsg_cluster.jpg" size="133243" stream="ndsg_cluster.jpg" tmpFilename="" user="MichaelOSullivan" version="1"
META FILEATTACHMENT	attachment="ndsg_cluster_drbd.jpg" attr="h" comment="" date="1292445932" name="ndsg_cluster_drbd.jpg" path="ndsg_cluster_drbd.jpg" size="134498" stream="ndsg_cluster_drbd.jpg" tmpFilename="" user="MichaelOSullivan" version="1"
META FILEATTACHMENT	attachment="san_architecture_drbd.jpg" attr="h" comment="" date="1292446437" name="san_architecture_drbd.jpg" path="san_architecture_drbd.jpg" size="68493" stream="san_architecture_drbd.jpg" tmpFilename="" user="MichaelOSullivan" version="1"

Added:

>
>

META TOPICMOVED	by="BaseUserMapping_333" date="1315516702" from="NDSG.DESStorageImplement" to="ORUA.DESStorageImplement"

Revision 5 2011-09-08 - TWikiAdminUser

Line: 1 to 1

Changed:

<
<

META TOPICPARENT	name="StorageImplementation"

>
>

META TOPICPARENT	name="ORUA.ExperimentalStorage"

Implementing a Storage Area Network Prototype

Implementing a Storage Area Network Prototype

Revision 4 2010-12-15 - MichaelOSullivan

Line: 1 to 1

META TOPICPARENT	name="StorageImplementation"

Implementing a Storage Area Network Prototype

Revision 3 2010-12-15 - MichaelOSullivan

 META TOPICPARENT	name="StorageImplementation"

 Implementing a Storage Area Network Prototype
 Both Distributed Replicated Block Device (DRBD) and Red Hat Cluster Suite (RHCS) use a single IP address to identify the systems that are part of the replicated block device (DRBD) or the cluster (RHCS). To make the mirrored block device and the cluster for the core-edge storage system reliable, we want to make sure they do not "break" if the link (or NIC) that DRBD or RHCS uses fails. To do this we create a new load-shared "link" with a new IP address using Linux Advanced Routing and Traffic Control (LARTC).
-<
<
+h2 Using LARTC to create a load-shared IP address
->
>
+ Using LARTC to create a load-shared IP address
 On all the machines we want to create this single (load shared) IP address. To do this:
-<
<
+{code}
->
>
 modprobe sch_teql
tc qdisc add dev eth0 root teql0
tc qdisc add dev eth1 root teql0
ip link set dev teql0 up
-<
<
+{code}
->
>
-<
<
+In our network we have already set the IP addresses as shown in Figure 1, so only the new IP address needs to be set. On Storage 1 we do this:
{code}
->
>
+In our network we have already set the IP addresses as shown below, so only the new IP address needs to be set. On Storage 1 we do this:
 ip addr add dev teql0 192.168.10.1/24
-<
<
+{code}

{image:worksite:/ndsg_cluster_drbd.jpg|ndsg_cluster_drbd.jpg}
->
>
-<
<
+Figure 1 Architecture for DRBD/Multipath Storage Network
->
>
-<
<
+h2 Using DRBD to mirror the disk
->
>
+ Using DRBD to mirror the disk
-<
<
+First, ensure that DRBD is installed on both storage servers (I used yum in CentOS). Then create /etc/drbd.conf:
{code}
->
>
+First, ensure that DRBD is installed on both storage servers (I used yum in CentOS). Then create /etc/drbd.conf:
 common {
    protocol C;
}
         address 192.168.10.2:7799;
    }
}
-<
<
+{code}
->
>
 You may have to "zero" enough space on your device:
-<
<
+{code}
->
>
 dd if=/dev/zero bs=1M count=128 of=/dev/VolGroup00/iscsidisk1; sync
-<
<
+{code}
->
>
 This creates space for the 128MB meta data disk.
-<
<
+Now create the drbd device on the server.
{code}
drbdadm create-md ~~resource~~
drbdadm attach ~~resource~~
drbdadm connect ~~resource~~
cat /proc/drbd
{code}
->
>
+Now create the drbd device on the server.

drbdadm create-md _resource_

drbdadm attach _resource_

drbdadm connect _resource_
cat /proc/drbd
-<
<
+See https://www.drbd.org/docs/install/ for more details.
->
>
+See https://www.drbd.org/docs/install/ for more details.
 Repeat this process on the other server.
-<
<
+On ~~one~~ of the storage servers synchronize the data (I am assuming both servers are currently empty):
{code}
->
>
+On one of the storage servers synchronize the data (I am assuming both servers are currently empty):
 drbdadm -- --overwrite-data-of-peer primary ~~resource~~
-<
<
+{code}
->
>
 Now enable a Primary/Primary configuration by adding:
-<
<
+{code}
->
>
 resource ~~resource~~
  net { 
    allow-two-primaries;
   }
  ...
}
-<
<
+{code}
to /etc/drbd.conf and using the following commands:
{code}
->
>
+to /etc/drbd.conf and using the following commands:
 drbdadm adjust ~~resource~~
drbdadm primary ~~resource~~
-<
<
+{code}

For any problems please consult https://www.drbd.org/.
->
>
+For any problems please consult https://www.drbd.org/.
-<
<
+h2 Use iSCSI to present the DRBD device
->
>
+ Use iSCSI to present the DRBD device
-<
<
+Now use iSCSI as specified in [Building the Alternative (Core-Edge) SAN | Core-Edge SAN Implementation] with two VERY IMPORTANT differences. This time you will be pointing to a single iSCSI target using two different IP addresses. We will also use a serial number to emulate drbd0 being the same disk. Create a single ietd.conf file as before, add ~~ScsiId=DRBD0~~ to the end of the Lun line in ietd.conf:
{code}
->
>
+Now use iSCSI as specified above with two VERY IMPORTANT differences. This time you will be pointing to a single iSCSI target using two different IP addresses. We will also use a serial number to emulate drbd0 being the same disk. Create a single ietd.conf file as before, add ScsiId=DRBD0 to the end of the Lun line in ietd.conf:
         Lun 0 Path=/dev/drbd0,Type=fileio,ScsiId=DRBD0
-<
<
+{code}
and finally copy this file to the other storage server. This should "fool" the iSCSI initiators into thinking /dev/drbd0 is a single device (which for all intents and purposes it is!).
->
>
+and finally copy this file to the other storage server. This should "fool" the iSCSI initiators into thinking /dev/drbd0 is a single device (which for all intents and purposes it is!).
-<
<
+h2 Use Multipath to specify a single device
->
>
+ Use Multipath to specify a single device
 Use
-<
<
+{code}
->
>
 scsi_id -g -u -s /block/sdc
scsi_id -g -u -s /block/sdd
-<
<
+{code}
->
>
 to make sure your iSCSI target is showing up as the same device on each connection (via the two TEQL devices, i.e., 4 NICs).
-<
<
+Now edit multipath.conf so that these SCSI devices (sdc and sdd) are seen as the same device with multiple paths:
{code}
->
>
+Now edit multipath.conf so that these SCSI devices (sdc and sdd) are seen as the same device with multiple paths:
 # This is a basic configuration file with some examples, for device mapper
# multipath.
# For a complete list of the default configuration values, see
 #		path_grouping_policy	multibus
#	}
#}
-<
<
+{code}

h2 Set up GFS on RHCS
->
>
-<
<
+Now we have a single mirrored disk presented to both Server 1 and Server 2, so we can create a cluster and GFS as in [Building the Alternative (Core-Edge) SAN | Core-Edge SAN Implementation]. However, one important difference is the presence of the TEQL devices. Use these IP addresses in the /etc/hosts file and the cluster will not break if a single NIC goes down. Otherwise, use ricci and luci or cman, rgmanager and clvmd as before to set up a cluster with a GFS system.
->
>
+ Set up GFS on RHCS
-<
<
+h3 Problem!
->
>
+Now we have a single mirrored disk presented to both Server 1 and Server 2, so we can create a cluster and GFS as above. However, one important difference is the presence of the TEQL devices. Use these IP addresses in the /etc/hosts file and the cluster will not break if a single NIC goes down. Otherwise, use ricci and luci or cman, rgmanager and clvmd as before to set up a cluster with a GFS system.
->
>
+ Problem!
 It appears that a TEQL device does not work well with RHCS, causing a split-brain problem in a 2-node cluster. We will replace the TEQL devices with bonded NICs since speed is not an issue in the controller cluster.
-<
<
+First, replace /etc/sysconfig/network-scripts/ifcfg-eth0 and /etc/sysconfig/network-scripts/ifcfg-eth1 with the following file:
{code}
->
>
+First, replace /etc/sysconfig/network-scripts/ifcfg-eth0 and /etc/sysconfig/network-scripts/ifcfg-eth1 with the following file:
 DEVICE=ethX # <- use eth0 or eth1 as appropriate
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
-<
<
+{code}
Then, create a new file /etc/sysconfig/network-scripts/ifcfg-bond0:
{code}
->
>
+Then, create a new file /etc/sysconfig/network-scripts/ifcfg-bond0:
 DEVICE=bond0
IPADDR=192.168.10.4
NETMASK=255.255.255.0
 ONBOOT=yes
BOOTPROTO=none
USERCTL=no
-<
<
+{code}
Next, add the following lines to /etc/modprobe.conf:
{code}
->
>
+Next, add the following lines to /etc/modprobe.conf:
 alias bond0 bonding
options bond0 mode=0 # <-mode=0 means balance-rr or balanced round robin routing
-<
<
+{code}
->
>
 Finally, restart the networking
-<
<
+{code}
->
>
 service network restart
-<
<
+{code}
->
>
 alternatively
-<
<
+{code}
->
>
 /etc/rc.d/init.d/network restart
-<
<
+{code}
->
>
 Now, there should be a bonded dual-NIC interface for running RHCS.
-<
<
+h2 More Architecture Problems

It appears that having the entire architecture on the same subnet means that only one switch can be "seen" at a time. Splitting it up using VLANs is not a solution as VLANs and TEQL devices don't seem to mix properly. By changing all eth0 to be on the 192.168.20.xx subnet and all eth1 to be on 192.168.30.xx subnet, we can see both switches (192.168.20.5 and 192.168.30.5 respectively). The TEQL devices can be 192.168.50.xx combining the 192.168.20.xx and 192.168.30.xx interfaces. The 192.168.50.xx subnet is then used for both iSCSI and DRBD. Finally, the bonding can happen on VLAN 4 (so add eth0.4 and eth1.4 to the two cluster servers) to provide redundancy for RHCS (see /etc/hosts).
->
>
+ More Architecture Problems
->
>
+It appears that having the entire architecture on the same subnet means that only one switch can be "seen" at a time. Splitting it up using VLANs is not a solution as VLANs and TEQL devices don't seem to mix properly. By changing all eth0 to be on the 192.168.20.xx subnet and all eth1 to be on 192.168.30.xx subnet, we can see both switches (192.168.20.5 and 192.168.30.5 respectively). The TEQL devices can be 192.168.50.xx combining the 192.168.20.xx and 192.168.30.xx interfaces. The 192.168.50.xx subnet is then used for both iSCSI and DRBD. Finally, the bonding can happen on VLAN 4 (so add eth0.4 and eth1.4 to the two cluster servers) to provide redundancy for RHCS (see /etc/hosts).
-<
<
+{image:worksite:/san_architecture_drbd.jpg|san_architecture_drbd.jpg}

Figure 2 Final architecture
->
>
+The final architecture is shown below:
 -- MichaelOSullivan - 15 Dec 2010
 META FILEATTACHMENT	attachment="san_configuration_small.jpg" attr="h" comment="" date="1292382586" name="san_configuration_small.jpg" path="san_configuration_small.jpg" size="187109" stream="san_configuration_small.jpg" tmpFilename="" user="MichaelOSullivan" version="1"
 META FILEATTACHMENT	attachment="san_config_stripped_down.jpg" attr="h" comment="" date="1292382800" name="san_config_stripped_down.jpg" path="san_config_stripped_down.jpg" size="50555" stream="san_config_stripped_down.jpg" tmpFilename="" user="MichaelOSullivan" version="1"
 META FILEATTACHMENT	attachment="ndsg_cluster.jpg" attr="h" comment="" date="1292383089" name="ndsg_cluster.jpg" path="ndsg_cluster.jpg" size="133243" stream="ndsg_cluster.jpg" tmpFilename="" user="MichaelOSullivan" version="1"
->
>
+META FILEATTACHMENT	attachment="ndsg_cluster_drbd.jpg" attr="h" comment="" date="1292445932" name="ndsg_cluster_drbd.jpg" path="ndsg_cluster_drbd.jpg" size="134498" stream="ndsg_cluster_drbd.jpg" tmpFilename="" user="MichaelOSullivan" version="1"
 META FILEATTACHMENT	attachment="san_architecture_drbd.jpg" attr="h" comment="" date="1292446437" name="san_architecture_drbd.jpg" path="san_architecture_drbd.jpg" size="68493" stream="san_architecture_drbd.jpg" tmpFilename="" user="MichaelOSullivan" version="1"

Revision 2 2010-12-15 - MichaelOSullivan

Line: 1 to 1

META TOPICPARENT	name="StorageImplementation"

Implementing a Storage Area Network Prototype

Added:

>
>

Implementing a Storage Area Network Prototype

Objective

Initial Core-Edge SAN Implementation

The network we set out to build was the result of design and simulation work on the DES storage system. The final design of the network is shown below:

(For a full size image see the design project.)

As a first step in implementing this design we are building an experimental network consisting only of the New Servers and the Disks with a smaller core-edge network:

Purchasing the Network

We purchased the “kit” for building this network from PB Technologies. While they did not have the exact components of the original network, they supplied us with the equivalent network for less cost. Below is a summary of the New Servers, Disks and Link from PB Tech:

New Server 1	HP Proliant + NIC card
New Server 2	HP Proliant + NIC card
Disk 1	Tower, motherboard, processor, 2 x 1TB SATA-2 drives
Disk 2	Tower, motherboard, processor, 2 x 1TB SATA-2 drives
Links	Cat 6e cable
Switches	4 x 3Com OfficeConnect Gigabit Switch 8

Storage System Architecture

We wanted to use iSCSI to present the Disks to the New Servers, mdadm to multipath and RAID the storage and Red Hat Cluster Suite with GFS to allow access to the storage. A diagram of our proposed architecture is shown below:

Configuring the Components

The first task was to add the extra NIC to the HP Servers to provide 2 ports. This involved opening the case and pushing the card into one of the extra NIC slots.

The HP Proliant servers came with the ability to install Windows Server 2003. However, given that we wanted to cluster the servers and use iSCSI we opted to use CentOS 5 on these servers instead. Using 6 installation CDs we successfully installed CentOS on both servers (these servers did not have a DVD drive).

The next task was to build the storage boxes. We had to assemble these PCs by hand, following step-by-step instructions posted on the internet. We were successful in building the boxes (they passed a check by our IT staff). We had to get some extra SATA cables for the disk drives as the DVD drive for these boxes also used SATA. Also, we did not get extra NIC cards for these boxes in our initial purchase, so we had to obtain and install these.

We also installed CentOS on these boxes, but encountered a problem with ACPI. This was overcome by entering the following at the installer boot prompt:

linux acpi=off

Initially we chose to install Virtualisation, Clustering and Storage Clustering (along with the Desktop and Server modules), but this caused a problem with the CentOS boot (xen):

irq #9 ata1: COMRESET (errno = -16)

Fortunately the base kernel was there and working, so I edited grub.conf to set this as the default. This could also have been solved by a complete reinstall.

We ran

yum update -y

on all the boxes to make sure they were up to date.

Building the Network

Building the network was simple (as it is a small configuration). We discovered we needed crossover cables for the inter-switch links (ISLs). We also discovered that one of the added NIC cards was not pushed all the way into its slot and fixed that. One of our IT staff set up the network and also allowed all the boxes to connect to the internet via Server 1. He used the correct iptables commands and put the right settings in the Network Administration dialog to set this up.

The map between the ports and the IP addresses is as follows:

Server 1 eth0 is connected to the internet with IP address xxx.xxx.xxx.155 and to the network with IP address 192.168.100.5. eth1 is assigned IP address 192.168.100.1.
Server 2 eth0 and eth1 are 192.168.100.2 and 192.168.100.3 respectively.
Disk 1 eth0 and eth1 are 192.168.100.6 and 192.168.100.7 respectively.
Disk 2 eth0 and eth1 are 192.168.100.11 and 192.168.100.13 respectively.

Configuring the Storage Devices

First, we needed to install iSCSI Target on the storage boxes. We downloaded the latest version of iSCSI Target from SourceForge. We also needed gcc and the OpenSSL and kernel development packages to build iSCSI Target. We installed the necessary packages as follows:

yum install gcc
yum install openssl-devel
yum install kernel-devel

Then, after unpacking the iSCSI Target tarball (iscsitarget-0.4.16.tar.gz in our case), we ran the following commands in the extracted directory:

make
make install

iSCSI Target was then installed!

We didn't need to edit initiators.allow as only our servers were on the network. However, to keep your target secure you may want to edit this file.

Next, we set up the targets by editing /etc/ietd.conf:

# Edited next line
Target iqn.2008-09.nz.net.ndsg.disk1:storage1
        # Users, who can access this target. The same rules as for discovery
        # users apply here.
        # Leave them alone if you don't want to use authentication.
        #IncomingUser joe secret
        #OutgoingUser jim 12charpasswd
        # Logical Unit definition
        # You must define one logical unit at least.
        # Block devices, regular files, LVM, and RAID can be offered
        # to the initiators as a block device.
        #Lun 0 Path=/dev/sdc,Type=fileio <- removed and added next line
        Lun 0 Path=/dev/sda2,Type=fileio
        # Alias name for this target
        # Alias Test <- removed and added next line
        Alias storage1
        # various iSCSI parameters
        #MaxConnections          1 <- added next line to allow for multipathing
        MaxConnections          2

Initially we tried to add two LUNs to the target (one for each disk drive in the storage box), but we realised this was not possible. In fact, we had to reinstall with a custom partition for the storage. On sda we had

100MB /boot
8GB /
4GB swap
the remaining space was left for the storage

We left sdb entirely for the storage.

We then created a striped logical volume (using Logical Volume Management) over the storage sections of sda and sdb. Since the amount of storage used on sda and sdb needed to be the same for a striped LV, there was ~9.9 GB of space left on sdb.

Then we could present this single LUN as an iSCSI target.

Alternatively, we could have created files to serve as the iSCSI targets

dd bs=1G count=2000 if=/dev/zero of=/iscsi/disk1
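Writing 2000 GB of zeros takes a long time. As a hedged alternative (not something we did on this network), a sparse backing file with the same apparent size can be created almost instantly. The sketch below assumes GNU dd, uses a hypothetical helper name, and works on a small temporary file so it can be tried safely:

```shell
# Hypothetical helper: create a sparse file of the requested size.
# $1 = path, $2 = size in MiB. count=0 writes no data blocks;
# seek= only sets the file length (assumes GNU dd behaviour).
make_sparse_backing_file() {
  dd if=/dev/zero of="$1" bs=1048576 count=0 seek="$2" 2>/dev/null
}

tmp=$(mktemp)
make_sparse_backing_file "$tmp" 16
wc -c < "$tmp"   # apparent size in bytes
rm -f "$tmp"
```

A sparse file only consumes disk blocks as the initiators write to it, so the underlying file system can run out of space later if it is over-committed.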

Next we started iscsi-target

service iscsi-target start

and made sure it persisted by adding it to chkconfig

chkconfig --add iscsi-target
chkconfig --levels 2345 iscsi-target on

Configuring the Servers

First, we installed iSCSI on the New Servers (to act as storage controllers)

yum install iscsi-initiator-utils

Then, one would normally edit /etc/iscsi/iscsid.conf to set up usernames and passwords for iSCSI. Since this network is experimental we are not using usernames and passwords so we skipped this step.

Then we set the name of the initiator in /etc/iscsi/initiatorname.iscsi.

#InitiatorName=iqn.1994-05.com.redhat:2828cb48ea2 <- commented out here, added line below
InitiatorName=iqn.2008-09.edu.byu.et.265r.croi-01:controller1

Next, we started iSCSI

service iscsi start

and made sure it persisted by adding it to chkconfig

chkconfig --add iscsi

chkconfig --levels 2345 iscsi on

Now, we could discover the iSCSI disks.

iscsiadm -m discovery -t sendtargets -p 192.168.100.6
iscsiadm -m discovery -t sendtargets -p 192.168.100.7
iscsiadm -m discovery -t sendtargets -p 192.168.100.11
iscsiadm -m discovery -t sendtargets -p 192.168.100.13
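The four discovery calls above differ only in the portal address, so they can be driven from a list. The following is a minimal dry-run sketch (the function name is ours, for illustration); it prints the commands instead of running them so it can be checked on a machine without iSCSI installed:

```shell
# Portal addresses of Disk 1 and Disk 2, taken from the port map above.
PORTALS="192.168.100.6 192.168.100.7 192.168.100.11 192.168.100.13"

# Hypothetical helper: print one sendtargets discovery command per portal.
discovery_cmds() {
  for portal in $PORTALS; do
    echo "iscsiadm -m discovery -t sendtargets -p $portal"
  done
}

discovery_cmds
```

Piping the output to `sh` on a storage controller would run the discoveries in the listed order.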

Next, change the portals for Disk 2 to be part of portal group 2 (since they point to a different target).

iscsiadm -m node -o delete -T iqn.2008-09.nz.net.ndsg.disk2:storage2 -p 192.168.100.11,1
iscsiadm -m node -o delete -T iqn.2008-09.nz.net.ndsg.disk2:storage2 -p 192.168.100.13,1
iscsiadm -m node -o new -T iqn.2008-09.nz.net.ndsg.disk2:storage2 -p 192.168.100.11,2
iscsiadm -m node -o new -T iqn.2008-09.nz.net.ndsg.disk2:storage2 -p 192.168.100.13,2
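Moving a portal to another portal group is a delete of the old node record followed by the creation of a new one. A small dry-run sketch of that pairing, assuming the Disk 2 target name used above (the function name is hypothetical, and the commands are printed rather than executed):

```shell
TARGET="iqn.2008-09.nz.net.ndsg.disk2:storage2"

# Hypothetical helper: print the delete/new pair for one portal.
# $1 = portal IP, $2 = old portal group tag, $3 = new portal group tag
move_portal_cmds() {
  echo "iscsiadm -m node -o delete -T $TARGET -p $1,$2"
  echo "iscsiadm -m node -o new -T $TARGET -p $1,$3"
}

for ip in 192.168.100.11 192.168.100.13; do
  move_portal_cmds "$ip" 1 2
done
```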

Now, login to your iSCSI nodes

iscsiadm -m node --login

and check the disks are available (note there will be multiple copies of the same disk, e.g., sdc and sdd both point to Disk 1)

fdisk -l

Building the Storage Network

First, make a new loop device in /etc/udev/devices (to make sure it survives a reboot),

[root@server1 ~]# mknod /etc/udev/devices/iscsi2 b 9 102

symbolically link it to a "regular" device

[root@server1 ~]# ln -s /etc/udev/devices/iscsi2 /dev/iscsi2

and use mdadm to turn it into a multipath device

[root@server1 ~]# mdadm -C /dev/iscsi2 --auto=yes --level=multipath --raid-devices=2 /dev/sde /dev/sdf
mdadm: /dev/sde appears to contain an ext2fs file system
    size=1928724480K  mtime=Mon Apr 21 10:39:50 2008
mdadm: /dev/sdf appears to contain an ext2fs file system
    size=1928724480K  mtime=Mon Apr 21 10:39:50 2008
Continue creating array? y
mdadm: array /dev/iscsi2 started.

Once all your disks have been created as multipath devices you can gather them together into a RAID-5 array also using mdadm

mknod /etc/udev/devices/iscsi_raid b 9 100
ln -s /etc/udev/devices/iscsi_raid /dev/iscsi_raid

mdadm -C -v /dev/iscsi_raid --auto=yes -l5 -n2 /dev/iscsi1 /dev/iscsi2
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 64K
mdadm: /dev/iscsi1 appears to contain an ext2fs file system
    size=1932787712K  mtime=Mon Apr 21 10:22:14 2008
mdadm: /dev/iscsi2 appears to contain an ext2fs file system
    size=1928724480K  mtime=Mon Apr 21 10:39:50 2008
mdadm: size set to 1928724352K
Continue creating array? y
mdadm: array /dev/iscsi_raid started.

Now your RAID-5 disk array is complete and you can use it like regular storage. However, we will wait until we have created the storage controller cluster before creating physical volumes, volume groups, etc.

Now create a configuration file for use on any other storage controllers. Copy the output of an mdadm scan to /etc/mdadm.conf

# mdadm --examine --scan
ARRAY /dev/md102 level=multipath num-devices=2 UUID=xxx:xxx:xxx:xxx
ARRAY /dev/md101 level=multipath num-devices=2 UUID=xxx:xxx:xxx:xxx
ARRAY /dev/md100 level=raid5 num-devices=2 UUID=xxx:xxx:xxx:xxx

rename the devices to the loop device created and add the iSCSI devices, e.g.,

DEV /dev/sd[bcde] /dev/iscsi[12]
ARRAY /dev/iscsi2 level=multipath num-devices=2 UUID=xxx:xxx:xxx:xxx
ARRAY /dev/iscsi1 level=multipath num-devices=2 UUID=xxx:xxx:xxx:xxx
ARRAY /dev/iscsi_raid level=raid5 num-devices=2 UUID=xxx:xxx:xxx:xxx

Next, add the appropriate devices to the ARRAY lines

DEV /dev/sd[bcde]
ARRAY /dev/iscsi2 level=multipath num-devices=2 UUID=xxx:xxx:xxx:xxx devices=/dev/sdd,/dev/sde
ARRAY /dev/iscsi1 level=multipath num-devices=2 UUID=xxx:xxx:xxx:xxx devices=/dev/sdb,/dev/sdc
ARRAY /dev/iscsi_raid level=raid5 num-devices=2 UUID=xxx:xxx:xxx:xxx devices=/dev/iscsi1,/dev/iscsi2

and your configuration file is complete.

On any extra storage controllers you need to perform all the iSCSI steps and create the loop devices, but you don't need to build the devices using mdadm. Simply copy the mdadm.conf file to the machine, edit the drive names (e.g., sdb may represent a different drive on the new machine) and use

mdadm --assemble /dev/iscsi2

etc

IMPORTANT: You need to make sure that the iSCSI portals are loaded in the same order on all storage controllers so that /dev/sdb, /dev/sdc, etc. are the same. We created a script to ensure this:

echo -n "Resetting iSCSI drives"
echo
NETS="100"
tpgt=1
NICS="6 7"
for nic in $NICS; do
  for net in $NETS; do
    iscsiadm -m node -T iqn.2008-09.edu.byu.et.265r.sophia$tpgt:storage$tpgt -p 192.168.$net.$nic:3260,$tpgt --logout
  done
done
tpgt=2
NICS="11 13"
for nic in $NICS; do
  for net in $NETS; do
    iscsiadm -m node -T iqn.2008-09.edu.byu.et.265r.sophia$tpgt:storage$tpgt -p 192.168.$net.$nic:3260,$tpgt --logout
  done
done

fdisk -l

# Creating iSCSI drives

echo -n "Creating iSCSI drives"
echo
NETS="100"
tpgt=1
NICS="6 7"
for nic in $NICS; do
  for net in $NETS; do
    iscsiadm -m node -T iqn.2008-09.edu.byu.et.265r.sophia$tpgt:storage$tpgt -p 192.168.$net.$nic:3260,$tpgt --login
    fdisk -l
  done
done
tpgt=2
NICS="11 13"
for nic in $NICS; do
  for net in $NETS; do
    iscsiadm -m node -T iqn.2008-09.edu.byu.et.265r.sophia$tpgt:storage$tpgt -p 192.168.$net.$nic:3260,$tpgt --login
    fdisk -l
  done
done
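The logout and login passes in the script share the same loop structure, so the fixed portal order can be kept in one place. The sketch below is a possible refactoring, not the script we ran: the function names are hypothetical, the target prefix and addresses are copied from the script above, and the commands are only printed:

```shell
TARGET_PREFIX="iqn.2008-09.edu.byu.et.265r.sophia"   # as used in the script above

# Hypothetical helper: print the iscsiadm calls for one portal group.
# $1 = action (login or logout), $2 = portal group tag, rest = last octets
portal_cmds() {
  action=$1; tpgt=$2; shift 2
  for nic in "$@"; do
    echo "iscsiadm -m node -T ${TARGET_PREFIX}${tpgt}:storage${tpgt} -p 192.168.100.${nic}:3260,${tpgt} --${action}"
  done
}

# The order here fixes the order in which /dev/sdX names are handed out.
all_cmds() {
  portal_cmds "$1" 1 6 7
  portal_cmds "$1" 2 11 13
}

all_cmds logout
all_cmds login
```

Keeping the order in a single function means every storage controller that uses the same script logs in to the portals in the same sequence.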

Building the Storage Controller Cluster

The final step is to create a storage controller cluster to create physical volumes, volume groups and a GFS on the iSCSI raid device.

First, make sure that you have the Red Hat Cluster Suite installed (i.e., cman, rgmanager, clvm). I have also installed luci and ricci so I can use the web-based management interface conga. Here is a good reference manual:

RHCS Administration

For example, you can install ricci and luci, as well as the logical volume manager for clusters, as follows:

yum install ricci
yum install luci
yum install lvm2-cluster

You must make sure the correct IP ports are enabled so that the cluster can communicate

Cluster Manager (cman)

iptables -A INPUT -i ~~<ip_address>~~ -m multiport -m state --state NEW -p udp -s ~~<subnet_mask>~~ -d ~~<subnet_mask>~~ --dports 5404,5405 -j ACCEPT

Cluster Node and luci Server (luci)

iptables -A INPUT -i ~~<ip_address>~~ -m multiport -p tcp -s ~~<subnet_mask>~~ -d ~~<subnet_mask>~~ --dports 8084 -j ACCEPT

Cluster Node and luci Server (ricci)

iptables -A INPUT -i ~~<ip_address>~~ -m multiport -m state --state NEW -p tcp -s ~~<subnet_mask>~~ -d ~~<subnet_mask>~~ --dports 11111 -j ACCEPT

Cluster Node and luci Server (modclusred)

iptables -A INPUT -i ~~<ip_address>~~ -m multiport -m state --state NEW -p tcp -s ~~<subnet_mask>~~ -d ~~<subnet_mask>~~ --dports 16851 -j ACCEPT

Cluster Node and luci Server (dlm)

iptables -A INPUT -i ~~<ip_address>~~ -m multiport -m state --state NEW -p tcp -s ~~<subnet_mask>~~ -d ~~<subnet_mask>~~ --dports 21064 -j ACCEPT

Cluster Node and luci Server (rgmanager)

iptables -A INPUT -i ~~<ip_address>~~ -m multiport -m state --state NEW -p tcp -s ~~<subnet_mask>~~ -d ~~<subnet_mask>~~ --dports 41966,41967,41968,41969 -j ACCEPT

Cluster Node and luci Server (ccsd, TCP)

iptables -A INPUT -i ~~<ip_address>~~ -m multiport -m state --state NEW -p tcp -s ~~<subnet_mask>~~ -d ~~<subnet_mask>~~ --dports 50006,50008,50009 -j ACCEPT

Cluster Node and luci Server (ccsd, UDP)

iptables -A INPUT -i ~~<ip_address>~~ -m multiport -m state --state NEW -p udp -s ~~<subnet_mask>~~ -d ~~<subnet_mask>~~ --dports 50007 -j ACCEPT

Note: a subnet mask of 255.255.255.0 corresponds to a /24 prefix (thanks Percy).
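The conversion in the note is just a count of the set bits in the mask. A minimal sketch in POSIX shell (the function name is ours, and it assumes a well-formed, contiguous mask; it does not validate its input):

```shell
# Hypothetical helper: count the set bits in a dotted-decimal netmask.
mask_to_prefix() {
  prefix=0
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the mask into its four octets
  IFS=$old_ifs
  for octet in "$@"; do
    while [ "$octet" -gt 0 ]; do
      prefix=$((prefix + (octet & 1)))
      octet=$((octet >> 1))
    done
  done
  echo "$prefix"
}

mask_to_prefix 255.255.255.0
```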

However, in our case, I created a startup shell for the storage that included

IPTABLES=/sbin/iptables
CLUSTER_INTERFACES="eth0 eth1"
TCP_PORTS="41966 41967 41968 41969 50006 50008 50009 21064 16851 11111 8084 860 3260"
UDP_PORTS="50007 5404 5405 6809"

echo -n "Applying iptables rules"
echo
for interface in $CLUSTER_INTERFACES; do
  for port in $TCP_PORTS; do
    $IPTABLES -I INPUT  -i $interface -p tcp -m tcp --sport $port -j ACCEPT
    $IPTABLES -I INPUT  -i $interface -p tcp -m tcp --dport $port -j ACCEPT
    $IPTABLES -I OUTPUT -o $interface -p tcp -m tcp --sport $port -j ACCEPT
    $IPTABLES -I OUTPUT -o $interface -p tcp -m tcp --dport $port -j ACCEPT
  done
  for port in $UDP_PORTS; do
    $IPTABLES -I INPUT  -i $interface -p udp -m udp --sport $port -j ACCEPT
    $IPTABLES -I INPUT  -i $interface -p udp -m udp --dport $port -j ACCEPT
    $IPTABLES -I OUTPUT -o $interface -p udp -m udp --sport $port -j ACCEPT
    $IPTABLES -I OUTPUT -o $interface -p udp -m udp --dport $port -j ACCEPT
  done
done
echo "[OK]"
echo
echo -n "Saving new rules"
echo
(/etc/init.d/iptables save && \
 echo "[OK]" || echo "[FAILED]")

You may need to add the storage controller names to /etc/hosts if you are not connected to a DNS server, e.g.,

192.168.100.1     server1.ndsg.net.nz
192.168.100.5     server1.ndsg.net.nz
192.168.100.2     server2.ndsg.net.nz
192.168.100.3     server2.ndsg.net.nz

Now create the cluster configuration file using ccs_tool as follows:

ccs_tool create -2 ndsg_cluster

Note the -2 option is for a 2-node cluster only. Add manual fencing:

ccs_tool addfence Human fence_manual

Add nodes:

ccs_tool addnode -n 1 server1.ndsg.net.nz -f Human
ccs_tool addnode -n 2 server2.ndsg.net.nz -f Human

Now start the cluster

service cman start
service rgmanager start
service clvmd start

Next, install the GFS2 service

yum install gfs2-utils kmod-gfs
service gfs2 start
chkconfig --add gfs2
chkconfig --levels 2345 gfs2 on

Change the locking_type to 3 in /etc/lvm/lvm.conf on all cluster nodes.

Now, on one of the cluster nodes, prepare a physical volume on /dev/iscsi_raid

pvcreate /dev/iscsi_raid

Create a volume group for the GFS2 file system

vgcreate -cy iscsi_raid_vg /dev/iscsi_raid

and create a logical volume in this group

lvcreate -L 2T iscsi_raid_vg

Create a GFS2 system on the logical volume

mkfs.gfs2 -t ndsg_cluster:iscsi_raid_gfs -p lock_dlm -j 5 /dev/iscsi_raid_vg/lvol0

Now mount the GFS2 file system on this node and all the other cluster nodes

mount -t gfs2 /dev/iscsi_raid_vg/lvol0 /mnt/iscsi_raid

Mdadm Problems

Unfortunately, once this configuration was complete we became aware that mdadm is NOT cluster aware! We decided to change the architecture to use DRBD and multipath.

Reconfiguring with DRBD and Multipath

Both Distributed Replicated Block Device (DRBD) and Red Hat Cluster Suite (RHCS) use a single IP address to identify the systems that are part of the replicated block device (DRBD) or the cluster (RHCS). To make the mirrored block device and the cluster for the core-edge storage system reliable, we want to make sure they do not "break" if the link (or NIC) that DRBD or RHCS uses fails. To do this we create a new load-shared "link" with a new IP address using Linux Advanced Routing and Traffic Control (LARTC).

h2 Using LARTC to create a load-shared IP address

On all the machines we want to create this single (load-shared) IP address. To do this:

{code}
modprobe sch_teql
tc qdisc add dev eth0 root teql0
tc qdisc add dev eth1 root teql0
ip link set dev teql0 up
{code}

In our network we have already set the IP addresses as shown in Figure 1, so only the new IP address needs to be set. On Storage 1 we do this:

{code}
ip addr add dev teql0 192.168.10.1/24
{code}
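
Since the same TEQL steps are repeated on every machine with a different address, they can be generated from one helper. This is a sketch under assumptions: teql_commands is a hypothetical name, it only prints the commands for review, and it brings teql0 up with "ip link set dev teql0 up" as in the LARTC HOWTO.

```shell
# teql_commands ADDRESS NIC...
# Print the commands that enslave each NIC to teql0 and assign ADDRESS.
# Hypothetical helper: prints only, nothing is executed.
teql_commands() {
  addr=$1; shift
  echo "modprobe sch_teql"
  for nic in "$@"; do
    echo "tc qdisc add dev $nic root teql0"
  done
  echo "ip link set dev teql0 up"
  echo "ip addr add dev teql0 $addr"
}

# Example for Storage 1 with the address used above.
teql_commands 192.168.10.1/24 eth0 eth1
```

Once the printed list looks right, it can be piped to sh as root on the target machine.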

{image:worksite:/ndsg_cluster_drbd.jpg|ndsg_cluster_drbd.jpg}

Figure 1 Architecture for DRBD/Multipath Storage Network

h2 Using DRBD to mirror the disk

First, ensure that DRBD is installed on both storage servers (I used yum in CentOS). Then create /etc/drbd.conf:

{code}
common {
  protocol C;
}
resource r0 {
  device /dev/drbd0;
  meta-disk internal;
  on storage1.ndsg.net.nz {
    disk /dev/VolGroup00/iscsidisk1;
    address 192.168.10.1:7799;
  }
  on storage2.ndsg.net.nz {
    disk /dev/VolGroup00/iscsidisk2;
    address 192.168.10.2:7799;
  }
}
{code}

You may have to "zero" enough space on your device:

{code}
dd if=/dev/zero bs=1M count=128 of=/dev/VolGroup00/iscsidisk1; sync
{code}

This creates space for the 128MB meta data disk.

Now create the DRBD device on the server:

{code}
drbdadm create-md ~~resource~~
drbdadm attach ~~resource~~
drbdadm connect ~~resource~~
cat /proc/drbd
{code}

See https://www.drbd.org/docs/install/ for more details.

Repeat this process on the other server.

On ~~one~~ of the storage servers synchronize the data (I am assuming both servers are currently empty):

{code}
drbdadm -- --overwrite-data-of-peer primary ~~resource~~
{code}

Now enable a Primary/Primary configuration by adding:

{code}
resource ~~resource~~ {
  net {
    allow-two-primaries;
  }
  startup {
    become-primary-on both;
  }
  ...
}
{code}

to /etc/drbd.conf and using the following commands on both servers:

{code}
drbdadm adjust ~~resource~~
drbdadm primary ~~resource~~
{code}
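
After promoting both sides, /proc/drbd should report both roles as Primary. The sketch below checks this for a single resource. The helper names are hypothetical, and the status line format is an assumption: DRBD 8.3 prints the roles in a field starting with "ro:", while some older releases use "st:", so both are accepted.

```shell
# drbd_roles [FILE]
# Print the role pair (for example Primary/Primary) found in DRBD status text.
# Reads FILE if given, otherwise standard input. Hypothetical helper.
drbd_roles() {
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^(ro|st):/) print substr($i, 4) }' "$@"
}

# is_dual_primary [FILE]
# Succeed only if the single resource reports Primary/Primary.
is_dual_primary() {
  [ "$(drbd_roles "$@")" = "Primary/Primary" ]
}

# Example with an assumed status line; on a real server use: is_dual_primary /proc/drbd
sample=' 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----'
if printf '%s\n' "$sample" | is_dual_primary; then
  echo "both nodes are Primary"
fi
```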

For any problems please consult https://www.drbd.org/.

h2 Use iSCSI to present the DRBD device

Now use iSCSI as specified in [Building the Alternative (Core-Edge) SAN | Core-Edge SAN Implementation] with two VERY IMPORTANT differences. This time you will be pointing to a single iSCSI target using two different IP addresses. We will also use a serial number to emulate drbd0 being the same disk. Create a single ietd.conf file as before, add ~~ScsiId=DRBD0~~ to the end of the Lun line in ietd.conf:

{code}
Lun 0 Path=/dev/drbd0,Type=fileio,ScsiId=DRBD0
{code}

and finally copy this file to the other storage server. This should "fool" the iSCSI initiators into thinking /dev/drbd0 is a single device (which for all intents and purposes it is!).

h2 Use Multipath to specify a single device

Use

{code}
scsi_id -g -u -s /block/sdc
scsi_id -g -u -s /block/sdd
{code}

to make sure your iSCSI target is showing up as the same device on each connection (via the two TEQL devices, i.e., 4 NICs).

Now edit multipath.conf so that these SCSI devices (sdc and sdd) are seen as the same device with multiple paths:

{code}
# This is a basic configuration file with some examples, for device mapper
# multipath.
# For a complete list of the default configuration values, see
# /usr/share/doc/device-mapper-multipath-0.4.7/multipath.conf.defaults
# For a list of configuration options with descriptions, see
# /usr/share/doc/device-mapper-multipath-0.4.7/multipath.conf.annotated

# Blacklist all devices by default. Remove this to enable multipathing
# on the default devices.
#blacklist {
#        devnode "*"
#}
blacklist {
        devnode "^sda$"
        devnode "^sda[0-9]$"
        devnode "^sd[f-z]*$"
}

## By default, devices with vendor = "IBM" and product = "S/390.*" are
## blacklisted. To enable multipathing on these devices, uncomment the
## following lines.
#blacklist_exceptions {
#        device {
#                vendor "IBM"
#                product "S/390.*"
#        }
#}

## Use user friendly names, instead of using WWIDs as names.
defaults {
        user_friendly_names yes
        # Added from here
        multipath_tool "/sbin/multipath -v0"
        udev_dir /dev
        polling_interval 10
        selector "round-robin 0"
        path_grouping_policy multibus
        getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout /bin/true
        path_checker readsector0
        rr_min_io 100
        max_fds 8192
        rr_weight priorities
        failback immediate
}
##
## Here is an example of how to configure some standard options.
##
#
#defaults {
#        udev_dir /dev
#        polling_interval 10
#        selector "round-robin 0"
#        path_grouping_policy multibus
#        getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
#        prio_callout /bin/true
#        path_checker readsector0
#        rr_min_io 100
#        max_fds 8192
#        rr_weight priorities
#        failback immediate
#        no_path_retry fail
#        user_friendly_names yes
#}
##
## The wwid line in the following blacklist section is shown as an example
## of how to blacklist devices by wwid. The 2 devnode lines are the
## compiled in default blacklist. If you want to blacklist entire types
## of devices, such as all scsi devices, you should use a devnode line.
## However, if you want to blacklist specific devices, you should use
## a wwid line. Since there is no guarantee that a specific device will
## not change names on reboot (from /dev/sda to /dev/sdb for example)
## devnode lines are not recommended for blacklisting specific devices.
##
#blacklist {
#        wwid 26353900f02796769
#        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
#        devnode "^hd[a-z]"
#}
multipaths {
        multipath {
                wwid ~~your scsi id goes here~~
                alias iscsi
                path_grouping_policy multibus
                path_checker readsector0
                path_selector "round-robin 0"
                failback immediate
        }
}
#multipaths {
#        multipath {
#                wwid 3600508b4000156d700012000000b0000
#                alias yellow
#                path_grouping_policy multibus
#                path_checker readsector0
#                path_selector "round-robin 0"
#                failback manual
#                rr_weight priorities
#                no_path_retry 5
#        }
#        multipath {
#                wwid 1DEC_____321816758474
#                alias red
#        }
#}
#devices {
#        device {
#                vendor "COMPAQ "
#                product "HSV110 (C)COMPAQ"
#                path_grouping_policy multibus
#                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
#                path_checker readsector0
#                path_selector "round-robin 0"
#                hardware_handler "0"
#                failback 15
#                rr_weight priorities
#                no_path_retry queue
#        }
#        device {
#                vendor "COMPAQ "
#                product "MSA1000 "
#                path_grouping_policy multibus
#        }
#}
{code}

h2 Set up GFS on RHCS

Now we have a single mirrored disk presented to both Server 1 and Server 2, so we can create a cluster and GFS as in [Building the Alternative (Core-Edge) SAN | Core-Edge SAN Implementation]. However, one important difference is the presence of the TEQL devices. Use the TEQL IP addresses in the /etc/hosts file and the cluster will not break if a single NIC goes down. Otherwise, use ricci and luci or cman, rgmanager and clvmd as before to set up a cluster with a GFS system.

h3 Problem!

It appears that a TEQL device does not work well with RHCS, causing a split-brain problem in a 2-node cluster. We will replace the TEQL devices with bonded NICs since speed is not an issue in the controller cluster.

First, replace /etc/sysconfig/network-scripts/ifcfg-eth0 and /etc/sysconfig/network-scripts/ifcfg-eth1 with the following file:

{code}
DEVICE=ethX # <- use eth0 or eth1 as appropriate
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
{code}

Then, create a new file /etc/sysconfig/network-scripts/ifcfg-bond0:

{code}
DEVICE=bond0
IPADDR=192.168.10.4
NETMASK=255.255.255.0
NETWORK=192.168.10.0
BROADCAST=192.168.10.255
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
{code}

Next, add the following lines to /etc/modprobe.conf:

{code}
alias bond0 bonding
options bond0 mode=0 # <- mode=0 means balance-rr, i.e., balanced round-robin
{code}

Finally, restart networking:

{code}
service network restart
{code}

or alternatively:

{code}
/etc/rc.d/init.d/network restart
{code}

Now, there should be a bonded dual-NIC interface for running RHCS.
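
To confirm that both NICs actually joined the bond, the kernel's status file /proc/net/bonding/bond0 can be inspected. The sketch below lists the slaves whose link is up. bond_slaves_up is a hypothetical helper, and the sample text is an assumed, abbreviated excerpt of that file.

```shell
# bond_slaves_up [FILE]
# Print the name of every slave interface whose MII status is "up".
# Reads FILE if given, otherwise standard input. Hypothetical helper.
bond_slaves_up() {
  awk '/^Slave Interface:/ { slave = $3 }
       /^MII Status:/ && slave != "" { if ($3 == "up") print slave; slave = "" }' "$@"
}

# Example with assumed sample text; on a real server use:
#   bond_slaves_up /proc/net/bonding/bond0
bond_slaves_up <<'EOF'
Bonding Mode: load balancing (round-robin)
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: down
EOF
```

The first "MII Status" line belongs to the bond itself and is skipped because no slave has been seen yet.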

h2 More Architecture Problems

It appears that having the entire architecture on the same subnet means that only one switch can be "seen" at a time. Splitting it up using VLANs is not a solution, as VLANs and TEQL devices don't seem to mix properly. By changing all eth0 interfaces to be on the 192.168.20.xx subnet and all eth1 interfaces to be on the 192.168.30.xx subnet, we can see both switches (192.168.20.5 and 192.168.30.5 respectively). The TEQL devices can be 192.168.50.xx, combining the 192.168.20.xx and 192.168.30.xx interfaces. The 192.168.50.xx subnet is then used for both iSCSI and DRBD. Finally, the bonding can happen on VLAN 4 (so add eth0.4 and eth1.4 to the two cluster servers) to provide redundancy for RHCS (see /etc/hosts).
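
The addressing scheme can be summarised per machine. The sketch below assumes, purely for illustration, that a machine uses the same final octet on all three subnets; addr_plan is a hypothetical helper and the host number is an example value.

```shell
# addr_plan HOST_NUMBER
# Print the interface-to-address plan for one machine.
# Assumption: the same final octet is used on every subnet.
addr_plan() {
  host=$1
  echo "eth0  192.168.20.$host"
  echo "eth1  192.168.30.$host"
  echo "teql0 192.168.50.$host"
}

# Example for a machine numbered 1.
addr_plan 1
```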

{image:worksite:/san_architecture_drbd.jpg|san_architecture_drbd.jpg}

Figure 2 Final architecture

-- MichaelOSullivan - 15 Dec 2010

Added:

>
>

META FILEATTACHMENT	attachment="san_configuration_small.jpg" attr="h" comment="" date="1292382586" name="san_configuration_small.jpg" path="san_configuration_small.jpg" size="187109" stream="san_configuration_small.jpg" tmpFilename="" user="MichaelOSullivan" version="1"
META FILEATTACHMENT	attachment="san_config_stripped_down.jpg" attr="h" comment="" date="1292382800" name="san_config_stripped_down.jpg" path="san_config_stripped_down.jpg" size="50555" stream="san_config_stripped_down.jpg" tmpFilename="" user="MichaelOSullivan" version="1"
META FILEATTACHMENT	attachment="ndsg_cluster.jpg" attr="h" comment="" date="1292383089" name="ndsg_cluster.jpg" path="ndsg_cluster.jpg" size="133243" stream="ndsg_cluster.jpg" tmpFilename="" user="MichaelOSullivan" version="1"

Revision 12010-12-15 - MichaelOSullivan

Line: 1 to 1

Added:

>
>

META TOPICPARENT	name="StorageImplementation"

Implementing a Storage Area Network Prototype

-- MichaelOSullivan - 15 Dec 2010
