31 Jan 2010 by rayheffer
It’s Monday morning and you arrive late at the office thanks to the trains being delayed yet again. As you grab your morning coffee, several hundred users have already logged in and started launching their email client, web applications, and a myriad of documents and spreadsheets. So far this sounds like any other morning, but what I didn’t mention is that just 30 minutes before you arrived at the office, water from a pipe in the ceiling started to leak into the rack containing your SAN’s disk array. What a nightmare. Not only has the water managed to get into both SAN controllers, but it has also tripped the power switch for that rack. But wait… not a single user has called to say they can’t access their applications or data. Thanks to storage mirroring between two SAN arrays in separate racks, the business has continued to operate and all of the servers are now communicating with your secondary array. Seamless.
Walking back to your desk with your morning coffee, your phone receives the first SMS message. Here it is, an alert from your monitoring system to say that the primary storage array is offline.
09:24 PRI-SAN01 offline, critical.
At this point, it would certainly be pertinent to discuss best practice for data centre design, environment monitoring, and DR procedures. To achieve a solid DR solution for your infrastructure you must have the basics in place before anything else. This means your DR strategy has to be reviewed on a regular basis, and business continuity planning must be in motion with all areas of the business. Without a solid continuity plan, your DR might not serve the actual needs of the business. The focus of this article is disaster recovery for your SAN rather than business continuity, but BCP must never be ignored. Let’s rewind to the implementation of a highly available SAN architecture. It’s far more interesting!
In 2005 I started to look at how storage mirroring can protect your data in this type of situation, and also provide you with ‘zero downtime’ maintenance windows for your SAN. Over the past few years, storage vendors have been implementing mirroring, thin provisioning, snapshots, and asynchronous replication for remote sites in entry-level SAN solutions, not just the large enterprise offerings. Don’t be fooled into thinking that designing a highly available SAN architecture is limited to those with massive budgets. There are other solutions, such as SANmelody or SANsymphony by Datacore, that allow you to present your existing disk arrays or SAN to storage servers. This is far more cost effective than upgrading your entire SAN hardware, and you can even increase performance by using the storage servers’ RAM for your write cache.
Datacore SAN software is what I have been working with, in conjunction with EMC and HP SAN storage, over the past few years. The main reason is that we can present storage from different SAN vendors, and create pooled storage that can then be partitioned up into virtual volumes (or LUNs) for your application servers. On top of that we gain mirroring, thin provisioning, snapshots, and other features that our HP and EMC arrays didn’t have without an expensive upgrade. Datacore are releasing SANSymphony-V in 2010, which I’ve had the pleasure of using in a technology preview recently. Datacore were talking about storage virtualisation back in 1999, so I’d certainly recommend you speak to them about what they can offer.
Let’s familiarise ourselves with some key storage technologies:
Synchronous storage mirroring - When data is written to the primary array it is also written to the secondary array. This requires a high-speed link between both arrays, such as fibre channel or iSCSI. It provides high availability for your SAN, but can double the storage cost in some situations.
Asynchronous mirroring - SAN replication to a DR site or remote office. Data is replicated to the remote site in the background, using queuing, buffering and scheduling. Typically used over WAN connections.
Snapshots - The ability to take a ‘point-in-time’ snapshot of your data. Very useful in a DR scenario, and for testing.
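To make the difference between the two mirroring modes concrete, here is a minimal sketch in Python. It is a toy model, not how any particular SAN product works: the `Array`, `SynchronousMirror` and `AsynchronousMirror` classes are hypothetical names invented for illustration, and a real array deals with caching, ordering and link failures that are ignored here.

```python
from collections import deque


class Array:
    """Toy stand-in for a disk array: maps block number -> data."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block, data):
        self.blocks[block] = data


class SynchronousMirror:
    """Acknowledge a write only after BOTH arrays hold the data."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, block, data):
        self.primary.write(block, data)
        self.secondary.write(block, data)  # host waits for this too
        return "ack"


class AsynchronousMirror:
    """Acknowledge after the local write; ship changes to the DR site later."""

    def __init__(self, primary, remote):
        self.primary = primary
        self.remote = remote
        self.queue = deque()  # changes waiting to cross the WAN

    def write(self, block, data):
        self.primary.write(block, data)
        self.queue.append((block, data))  # buffered, not yet at the DR site
        return "ack"

    def drain(self, max_writes=None):
        """Send queued writes to the remote array; returns how many were sent."""
        sent = 0
        while self.queue and (max_writes is None or sent < max_writes):
            block, data = self.queue.popleft()
            self.remote.write(block, data)
            sent += 1
        return sent
```

With the synchronous mirror both arrays are identical the moment a write is acknowledged, which is why the secondary can take over seamlessly. With the asynchronous mirror, anything still sitting in the queue when disaster strikes has not reached the remote site, so the DR copy can lag slightly behind live data.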
To set the scene I’ll use a typical IT infrastructure that you would find in most SME organisations. They have already implemented virtualisation for at least 50% of the server infrastructure, and have a midrange SAN from a well-known vendor using fibre channel. SAN capacity is up to 8TB, which contains a mix of virtual machine, database, and file store LUNs. The majority of servers are running Microsoft Windows Server 2008, with some Linux servers for key network services.
Using this example you’ll see that virtualisation is already in place with VMware and High Availability, and additional high availability has been implemented with a Microsoft SQL database cluster. There is enough capacity to support a single host failure using VMware high availability, but there are still some physical application servers that are yet to be virtualised. Given this is a typical SME infrastructure, let’s also imagine that the SAN has dual controllers, and it’s connected to a fibre fabric consisting of two core fibre channel switches (A and B). This is a very good situation to be in: most of the servers are virtualised, and the SQL databases are stored on the SAN along with the file server storage for our shared drives.
Implementing on-site HA (High Availability) using synchronous mirroring, even to another building with a fibre link between the two, gives this environment an excellent level of resilience. However, synchronous mirrors do have some pitfalls, mainly cost: because the solution is split into two, you need twice the amount of storage. One rack will contain a SAN with an 8TB array, and the other rack will contain another SAN with an 8TB array, with mirroring between the two. You will then need to decide on the level of disk redundancy within each array. You could use a basic RAID0 stripe, given that you have mirroring between separate arrays, but a single disk failure would then take that whole array out of the mirror. I personally prefer to stick with RAID5 within each array, even though the two arrays are mirrored.
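As a rough illustration of the storage cost, the sketch below counts the disks needed for two mirrored 8TB arrays under RAID0 and RAID5. The 1TB disk size is purely an assumption for the example, and the `disks_needed` helper is hypothetical. It ignores hot spares, formatting overhead and vendor limits on RAID group size, so treat the numbers as indicative only.

```python
import math


def disks_needed(usable_tb, disk_tb, raid_level):
    """Disks required in ONE array to reach the usable capacity.

    RAID0 stripes across all disks with no redundancy.
    RAID5 gives up one disk's worth of capacity to parity.
    """
    data_disks = math.ceil(usable_tb / disk_tb)
    if raid_level == 0:
        return max(data_disks, 2)      # a stripe needs at least two disks
    if raid_level == 5:
        return max(data_disks + 1, 3)  # RAID5 needs at least three disks
    raise ValueError("only RAID0 and RAID5 are modelled in this sketch")


USABLE_TB = 8   # from the example environment
DISK_TB = 1     # assumed disk size, for illustration only
ARRAYS = 2      # primary and secondary in the synchronous mirror

for level in (0, 5):
    per_array = disks_needed(USABLE_TB, DISK_TB, level)
    print(f"RAID{level}: {per_array} disks per array, {per_array * ARRAYS} in total")
```

Under these assumptions RAID5 costs one extra disk per array, which is a small premium for not losing one side of the mirror to a single disk failure.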
An asynchronous mirror is where true disaster recovery comes into play. By selecting key SAN LUNs (or data volumes) to be replicated to a remote site you can specify which databases, virtual machines or file stores are part of the replication. This does introduce an extra layer of complexity though, which you don’t get with synchronous mirrors. First of all you need to have a suitable location for the destination SAN, unless you consider using a co-location service with an ISP. Depending on how much the replicated data changes, the link between these sites could be very busy, so bandwidth is a consideration. That being said, a 20Mbps private circuit between two sites around 40 miles apart should be in a fairly realistic price bracket. If you are using a co-location provider, they should be able to provide this for you. As a rough estimate, I would say £10,000 to £20,000 per annum for a 20Mbps link in the UK.
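To get a feel for whether a link of this size is enough, a back-of-the-envelope calculation helps. The sketch below assumes the link is 20 megabits per second and uses a hypothetical 80% efficiency factor to allow for protocol overhead. The 50GB daily change figure is invented for the example, so substitute the change rate you actually measure on your own LUNs.

```python
def daily_capacity_gb(link_mbps, efficiency=1.0):
    """Data (in GB, decimal) the link can move in 24 hours."""
    bits_per_day = link_mbps * 1e6 * efficiency * 86_400
    return bits_per_day / 8 / 1e9


def hours_to_replicate(changed_gb, link_mbps, efficiency=0.8):
    """Hours needed to ship a given amount of changed data to the DR site."""
    bits_to_send = changed_gb * 1e9 * 8
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    return bits_to_send / usable_bits_per_second / 3600


LINK_MBPS = 20          # link size from the article
DAILY_CHANGE_GB = 50    # hypothetical change rate, for illustration only

print(f"Theoretical ceiling: {daily_capacity_gb(LINK_MBPS):.0f} GB per day")
print(f"Time to replicate {DAILY_CHANGE_GB} GB: "
      f"{hours_to_replicate(DAILY_CHANGE_GB, LINK_MBPS):.1f} hours")
```

If the time to replicate a day’s changes approaches 24 hours, the remote copy will fall further and further behind, and you either need a bigger link or fewer LUNs in the replication set.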
Adding further complexity to the asynchronous mirroring solution is what to do with the destination data in the event of a disaster (or DR test). When a SAN LUN is first presented to an application server, whether that is a VMware, Windows or Linux host, the server will write a disk signature to the disk (LUN). When using asynchronous mirrors, the destination LUN (at the DR site) will have exactly the same signature. In this case you must make sure the disk isn’t re-signatured by the application servers at the remote site. VMware servers (ESX and vSphere) have an advanced option to disable re-signaturing, whereas Windows servers shouldn’t cause an issue unless they are part of a cluster.
When testing your DR site with the replicated data, it is recommended that you take a ‘point in time’ snapshot of the destination volume. The snapshot volume is then presented to the application servers at the DR site, leaving the replication of live data to continue. Using asynchronous mirroring and snapshots provides the ability to carry out DR tests without impacting the live environment, so they can be done during normal business hours in most cases.
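The idea of testing against a snapshot while replication carries on can be sketched as follows. This is a simplified copy-on-write model that I am assuming for illustration. Storage vendors implement snapshots in different ways, the `Volume` and `Snapshot` classes are hypothetical, and the snapshot here is read-only, whereas a real DR test would normally need a writable one.

```python
class Snapshot:
    """Point-in-time view of a volume, kept consistent via copy-on-write."""

    def __init__(self, volume):
        self.volume = volume
        self.saved = {}  # original contents of blocks changed since the snapshot

    def preserve(self, block, old_data):
        # Keep only the FIRST overwritten value: that is the point-in-time data.
        if block not in self.saved:
            self.saved[block] = old_data

    def read(self, block):
        if block in self.saved:
            return self.saved[block]
        return self.volume.blocks.get(block)  # unchanged since the snapshot


class Volume:
    """Destination volume at the DR site, still receiving replicated writes."""

    def __init__(self):
        self.blocks = {}
        self.snapshots = []

    def snapshot(self):
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, block, data):
        # Before overwriting, hand the old contents to every snapshot.
        for snap in self.snapshots:
            snap.preserve(block, self.blocks.get(block))
        self.blocks[block] = data


dr_volume = Volume()
dr_volume.write(0, "monday data")

test_copy = dr_volume.snapshot()        # taken at the start of the DR test
dr_volume.write(0, "tuesday data")      # replication continues regardless

print(test_copy.read(0))   # what the DR test servers see
print(dr_volume.blocks[0]) # what the live replica now holds
```

The DR test servers work against `test_copy`, which stays frozen at the moment the snapshot was taken, while the replicated volume keeps receiving changes from the live site.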
Storage replication and snapshot technology certainly provide the key ingredients to form part of your DR solution, but there are still important factors to consider. Do you want high availability, replication to another site, or both? Does your existing SAN support these technologies, or should you consider an upgrade? Obviously your budget is going to be a major factor, and I’m not here to lecture you on ‘what would it cost if you actually had a disaster’. You can make that decision!
If you decide to adopt mirroring and snapshot technologies as part of your DR solution and you are already running a virtual infrastructure, then you are on your way to an excellent DR solution. There are some technical complexities you need to be aware of, but if you have good knowledge of these areas they are only minor factors.