
Fighting with SAN

We've been a big NAS shop for a number of years, since well before I came on board, but we're starting to use SAN more and more. We now have a much more stable SAN fabric (the network side of Fibre Channel storage, for those of you keeping score at home), so I spent several days before the break fighting with various SAN issues. Most of them came down to my lack of experience with our particular SAN implementation and with the host-level tools. The pain of SAN comes largely from the host end. Your SAN device (even in our case with NetApp) is probably pretty good at doing its end and is well documented. But on the Linux side SAN is very vendor specific, which always leads to problems. For example, if you are using EMC you have to get supported HBAs, in some cases run a custom kernel to support those HBAs, and then you probably end up needing vendor-specific tools for handling things. In my setup I don't need a custom kernel, but we do have to support a small vendor package of tools. NetApp is actually pretty good when it comes to Linux support: they package RPMs in most cases and stay current with the versions they support.

One of the things I played with was adding a LUN to a machine and getting it to show up without rebooting. Translating docs gleaned from the web to my configuration was a bit tough at the beginning because we have a highly redundant fabric, meaning we have 2 HBAs in each host, each with 2 fiber paths. What this means is that when I get LUNs to show up, I see each one 4 times. Apparently most people who write about their SAN experiences do it with a single path to their storage device through the fabric. I also went through the rigamarole of removing a LUN from a host (again without rebooting). All in all it was pretty clean, a series of echoes to the /sys subsystem, not nearly as ugly as adding and removing actual SCSI devices. It was also completely non-disruptive to other LUNs and to overall performance.
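For my own notes, the add/remove dance looked roughly like this (host numbers, device names, and the map name are examples from my setup, not gospel):

    # make the new LUN show up: rescan every HBA port so all 4 paths appear
    # (check /sys/class/scsi_host/ for your actual host numbers)
    echo "- - -" > /sys/class/scsi_host/host0/scan
    echo "- - -" > /sys/class/scsi_host/host1/scan
    echo "- - -" > /sys/class/scsi_host/host2/scan
    echo "- - -" > /sys/class/scsi_host/host3/scan

    # remove a LUN: unmount it, flush its multipath map, then delete each
    # underlying sd device (one per path) before unmapping it on the filer
    umount /some/mountpoint
    multipath -f mpath0
    echo 1 > /sys/block/sde/device/delete
    echo 1 > /sys/block/sdf/device/delete
    echo 1 > /sys/block/sdg/device/delete
    echo 1 > /sys/block/sdh/device/delete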

There's been quite a bit of debate among the other SAs at work about how we should handle LUNs at the host level. The original thought was to add LVM on top of the LUN (which with multipath is kind of a bear) and then create the filesystem on top of LVM. The idea was that this would let us grow and shrink as needed and give us flexibility similar to the NFS volumes we are so used to dealing with. Turns out shrinking is still iffy. I've tried it twice now and had catastrophic failures both times. The filer seems to handle it fine, but the host just flat out fails to see it as a valid filesystem once the LUN gets smaller.

With multipath configured correctly, what we see with an fdisk -l is 5 new 'disks': sdX through sd{X+3}, plus dm-X. So depending on how many existing SCSI devices (including other LUNs) we have, that's sde, sdf, sdg, sdh and dm-0 (assuming a, b, c already exist and there are no other SAN LUNs). What's a little confusing is that each of these devices is the same disk; you don't want to use the sdX devices for anything (unless it's a one-time operation) in case you lose a path. So you do everything to the dm-X device created by multipath. The other confusing thing is that while these are 'disks' they also are not. You can create partitions on them but you don't really need to, so it kind of messes with what your brain is used to dealing with.
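The easiest way I've found to keep the sdX-to-dm-X mapping straight is multipath -ll, which lists the sd paths sitting behind each dm device. The output below is only illustrative (WWID, map name, and path details made up), but on a 4-path fabric each map should show 4 paths:

    multipath -ll
    # mpath0 (360a98000xxxxxxxx) dm-0 NETAPP,LUN
    #  \_ 1:0:0:0 sde 8:64  [active][ready]
    #  \_ 1:0:1:0 sdf 8:80  [active][ready]
    #  \_ 2:0:0:0 sdg 8:96  [active][ready]
    #  \_ 2:0:1:0 sdh 8:112 [active][ready]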

So the original plan with LVM was to create a partition spanning the entire disk, add it as a PV, create a volume group, then create an LV on the volume group using the whole size. It struck several of us that this was overkill. Where LVM shines is when you have lots of discrete storage objects and you want to group them together. Logically this 'thing' is a single LUN where all the physical abstraction is already done (with about 4 levels of abstraction in the case of NetApp). The other alternative, which I ended up doing for this particular implementation, was to just create a filesystem right on dm-0. I didn't create a partition, didn't do LVM, just mkfs.ext3 /dev/dm-0. Worked like a charm, no wasted space, very simple.
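For comparison, here is roughly what the two approaches look like side by side (the volume group and LV names are just examples, and the 100%FREE shorthand depends on your LVM2 version; older ones want an extent count from vgdisplay):

    # the LVM route we originally planned
    pvcreate /dev/dm-0
    vgcreate oravg /dev/dm-0
    lvcreate -l 100%FREE -n oralv oravg
    mkfs.ext3 /dev/oravg/oralv

    # versus what I actually did
    mkfs.ext3 /dev/dm-0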

There is a gotcha though. Multipath has the annoying habit of renaming the device-mapper device (dm-X) when the host reboots and it encounters additional LUNs. So if you add a LUN to a machine that already has one and then reboot, it's possible, nigh on likely, that dm-1 and dm-0 will swap to the opposite of what you expect. This is pretty annoying from a mounting standpoint. It's one potential winning point for LVM: since the LVM metadata is written to the disk itself, you get a consistent name to use in fstab and the like. But all that overhead just for a consistent name? Am I really getting anything else out of LVM in this scenario?

Enter ext2/3 labels. Most SAs I know don't like labels, because if you do things like label a disk '/' and try to put it in another machine for recovery purposes, you probably won't get the disk you expect (it'll depend on bus order). However, labels give us a way to consistently name a dm device regardless of what multipath wants to call it. This also lets me give meaningful symbolic names to SAN disks that may move between hosts (Oracle volumes are their current use, so there are 2: 1 for primary and 1 for standby). So I use e2label /dev/dm-0 FOO to label my SAN disk, then in fstab I use LABEL=FOO. An interesting side effect is that df output shows the UUID of the disk rather than its multipath name, but other than that it seems to work.
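Concretely, that's just (the mount point here is an example):

    # label the filesystem on the multipath device
    e2label /dev/dm-0 FOO

    # /etc/fstab entry that survives dm-X renumbering
    LABEL=FOO  /oracle/primary  ext3  defaults  1 2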

Next I need to spend a bunch of time with a non-critical volume and figure out all the ins and outs of growing and (maybe) shrinking the filesystem. All of the above work was done on a RHEL5 system (64-bit); my feeling is that all bets are off when it comes to RHEL4, and LVM might be a very real hard requirement there. I also wonder if multipath is the right way to go. Would it be possible to use LVM to create a fault-tolerant storage device?
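My working theory for the grow case, which I have not actually tested yet, looks something like the sketch below. Treat it as a sketch: the multipathd resize command in particular may not exist on older builds, and the device and map names are examples.

    # 1. grow the LUN on the filer, then rescan each underlying path
    echo 1 > /sys/block/sde/device/rescan
    echo 1 > /sys/block/sdf/device/rescan
    echo 1 > /sys/block/sdg/device/rescan
    echo 1 > /sys/block/sdh/device/rescan

    # 2. get multipath to pick up the new size (newer multipathd builds have
    #    a resize command; older ones may need the map reloaded)
    multipathd -k"resize map mpath0"

    # 3. grow the filesystem; ext3 can be grown online on a recent kernel
    resize2fs /dev/dm-0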

Categories: Sysadminery
  1. December 23, 2008 at 2:22 PM

    From my prior e-mail – (Thanks for enabling comments!)

    1) Qlogic HBAs make it easy to see one path (if you don't want load balancing and you aren't using EMC PowerPath). The ql2xfailover=1 option given to the qla2xxx module in modprobe.conf provides failover functionality and lets you see a single device instead of multiples for each LUN. For Linux boxes I always like Qlogic HBAs, regardless of whether I'm using EMC PowerPath or not. If you are using PowerPath, it takes care of both making device names that do not change (/dev/emcpower[abc][123]) and handling load balancing and failover; EMC recommends against using ql2xfailover=1 for this reason. PowerPath is annoying, however, particularly if you want to use a /dev/emcpowerX1 device as a boot device.
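    Spelled out, that is a one-liner in modprobe.conf (assuming the failover-capable qla2xxx driver is installed; treat it as a sketch):

        # /etc/modprobe.conf - enable the qla2xxx driver's failover mode
        options qla2xxx ql2xfailover=1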

    2) I’ve had similar issues to yours with shrinking LUNs. Better to start small. It’s simple to grow a filesystem but shrinking is nasty.

    3) Labeling your partitions is a great way to find them again and mount them in the right place. However, if you get a wild hair, just tweak udev to create symlinks for the devices you want to use so that they are always there and always in the right order. A good article about that:

    http://www.cyberciti.biz/tips/linux-assign-static-names-to-scsi-devices.html
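    The general idea is a rule along these lines (RHEL5-era udev syntax; the WWID is a placeholder you would get from scsi_id, and the symlink name is just an example):

        # /etc/udev/rules.d/20-san-names.rules - stable symlink for a SAN disk
        KERNEL=="sd*", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/%k", RESULT=="360a98000xxxxxxxx", SYMLINK+="san/oradisk"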

    This was a good post. Keep them coming! Sysadminery lives at miscellaneous.net!

    Your friend,
    -John
