vSphere Metro Storage Cluster and GAD : Rise of HDS Virtual Storage Machine

Sometime around August HDS will launch what is called Global Active Device, also known as GAD. This is part of the new Virtual Storage Platform G1000.

What is a G1000?

The VSP G1000 is the latest variant of HDS’s high-end Tier-1 array, which scales to thousands of drives and millions of IOPS (if you’re into just counting millions of IOPS like some vendors).

And when we talk about an increase in performance, we’re not talking about going from 50,000 IOPS to 100,000. It’s in the millions. My colleague Hu Yoshida has already written about a benchmark on a G1000 that achieved 3.9 million (nasty random read) IOPS. A random read workload is what breaks many storage arrays, as it often results in cache misses, which force the array to fetch the blocks from disk.

http://blogs.hds.com/hu/2014/05/where-are-the-performance-specs-for-vsp-g1000.html 

In my opinion it’s kind of ironic: scale-out storage fulfills several purposes, and one of them is the ability to scale IOPS together with array-side compute resources. Without that, single or dual chassis controller configurations run into serious trouble once the IOPS go through the roof. The G1000 can do that in the same chassis, but more on that in another post.

The benchmark was using Hitachi Accelerated Flash. I’m gonna write a post about that soon, but suffice it to say that it is also a very elegantly engineered solution, with a CPU embedded in each flash drive to scale performance WITH capacity.

What’s the fuss about GAD?

I believe this will be game-changing technology for many customers that need active-active compute and storage configurations. These customers are subject to the strictest availability requirements for their IT environments where failure is not an option.

I must admit that before I joined HDS I felt HDS created some great engineering (hardware and software) but didn’t always match it up with zingy titles and acronyms. GAD is a case in point. It’s not the most glamorous title (sorry product management).

It may not sound glamorous but the words tell you what it does… Global, Active, Device… It’s awesome technology and something I’ve kind of written about wanting to see before, here:

Metro Storage Cluster Part 1

and here:

Metro Storage Cluster Part 2

GAD will allow customers to move to an active-active, fully fault-tolerant storage configuration without any external hardware (excluding the quorum, which I’ll explain later). With GAD and the G1000, that is here today.

This is something NONE of our competitors can achieve.

Zero RPO/RTO for Storage

The sweet spot for this is zero RPO and zero RTO against a site failure. I’m talking about a dead datacenter taken out by an airplane. Whether VMs or other tenants are impacted will be dictated by the ability of the application to continue without disruption. Some applications, like Oracle RAC, will continue to run.

If VMware eventually increase the number of vCPUs supported in Fault Tolerance, you could see this being used much more frequently, with pairs of virtual machines running across the datacenters under DRS virtual-machine anti-affinity rules. Then you might experience no impact on your VMs.

Obviously if there are “regular” virtual machines in the affected datacenter they will be restarted by vSphere HA on the alternate site, but the storage part will be seamless. And if you were using something like application-level clustering across distance you will have a much improved solution.

Now don’t forget that the vMSC use case is load balancing across datacenters. VMware do not pitch it as a pure BC/DR solution.

The VSP already has redundancy built in to offer zero RPO/RTO against multiple simultaneous failures to the storage frame on its own. It didn’t need GAD for that.

The current schedule (which could be subject to change) is that GAD will launch and GA with support for vSphere Metro Storage Cluster in August. GAD will be relevant for physical servers as well as virtual datacenters. For the sake of this post I want to focus on vMSC, and I think this is where GAD will cause a lot of disruption.

vMSC Configuration

In terms of architecture the magic is the creation of what’s called a Virtual Storage Machine or VSM. What our ESXi cluster sees appears to be a single array that stretches between the sites. The prerequisite is that the Ethernet and storage fabrics must be stretched.

G1000 GAD Blog document

The magic happens with the virtual storage machine using the serial number of the “primary” array as well as the LUN (LDEV) ID of the primary LUN. This means that the application (in this case ESXi) thinks that it has two paths to the same LUN (whose identity is derived from WWN and LDEV ID).

Reads and writes are accepted simultaneously down both paths and both LDEVs are always active.
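To make the identity trick concrete, here is a minimal Python sketch. It is only a toy model and not how the G1000 implements it: the class names, serial strings and LDEV numbers are all made up for illustration, and the grouping function just mimics what a host multipathing layer does in principle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceIdentity:
    """Identity a host uses to recognise a LUN (hypothetical model)."""
    array_serial: str
    ldev_id: int


@dataclass(frozen=True)
class Path:
    """One host path to a volume that lives on a physical array."""
    site: str
    physical_serial: str
    physical_ldev: int
    presented: DeviceIdentity  # what the array reports to the host


def group_paths_by_device(paths):
    """Group paths by reported identity, as a multipathing layer would."""
    devices = {}
    for path in paths:
        devices.setdefault(path.presented, []).append(path)
    return devices


# Assumption: both arrays report the primary's serial number and LDEV ID.
vsm_identity = DeviceIdentity(array_serial="PRIMARY-SERIAL", ldev_id=0x1000)
paths = [
    Path("site-a", "PRIMARY-SERIAL", 0x1000, vsm_identity),
    Path("site-b", "SECONDARY-SERIAL", 0x2000, vsm_identity),
]

devices = group_paths_by_device(paths)
for identity, device_paths in devices.items():
    sites = sorted(p.site for p in device_paths)
    print(f"{identity.array_serial}/{identity.ldev_id:#x}: "
          f"{len(device_paths)} paths via {sites}")
```

Because the two physical LDEVs report the same identity, the host ends up with one device and two paths, which is the whole point of the virtual storage machine.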

This is not HDS High Availability Manager, which was an active-passive architecture. In this model the concept of active-passive goes away.

No external hardware required !!!!

Did I mention that to use this you don’t require ANY expensive, complicated external hardware sitting between your application and the storage? Isn’t that just a bit special?

That’s handled by the G1000, in software 😉

What about data integrity ?

There is still a concept of a primary, in the sense that in-order write integrity must be assured. So if you write to the LUN on Site B, behind the scenes an acknowledgement is only returned to the application once the write has been committed to Site A and then to Site B. But you will not need to worry about that.
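The ordering rule can be sketched in a few lines of Python. This is a simplified in-memory illustration under my own assumptions (the `Volume` and `MirroredPair` classes are hypothetical, and real replication involves far more than two dictionary updates); it only shows that the commit order is the same whichever path the host writes down.

```python
class Volume:
    """In-memory stand-in for one array's copy of the LUN (hypothetical)."""

    def __init__(self, site, trace):
        self.site = site
        self.blocks = {}
        self.trace = trace  # shared list recording the commit order

    def commit(self, lba, data):
        self.blocks[lba] = data
        self.trace.append((self.site, lba))


class MirroredPair:
    """Two always-active copies with a fixed commit order."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, lba, data, via_site):
        # Whichever path the host used, commit to the primary first,
        # then the secondary, and only then acknowledge.
        self.primary.commit(lba, data)
        self.secondary.commit(lba, data)
        return f"ack via {via_site}"

    def read(self, lba, via_site):
        # Reads are served by the copy local to the path that was used.
        local = self.primary if via_site == self.primary.site else self.secondary
        return local.blocks[lba]


trace = []
pair = MirroredPair(Volume("site-a", trace), Volume("site-b", trace))
ack = pair.write(lba=10, data=b"hello", via_site="site-b")
print(ack)
print(trace)
print(pair.read(10, "site-a") == pair.read(10, "site-b"))
```

Even though the write arrived at Site B, the trace shows Site A committing first, and both sites return the same data afterwards.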

In terms of data locality, if you’re using Hitachi’s multipathing (MPP) software, HDLM, it will understand a lot more and leverage read and write locality based on IO latency. HDLM may be required initially for Metro distances but more info will be available on that soon. Make sure to check the VMware HCL, as always, when this goes live.

Failure Scenarios

The quorum does not have to be a VSP but must be an HDS array at the moment. I can cover failure scenarios in more detail in the future but, assuming the quorum device is outside the primary and secondary sites, the following failure scenarios do not result in any loss of connectivity to a datastore from any surviving cluster:

  • Primary array failure
  • Secondary array failure
  • Quorum array failure
  • Primary site failure
  • Quorum site failure
  • Secondary site failure

There are lots of other scenarios, and you could have the primary and the quorum on the same site, but what would be the point? You CANNOT have the quorum on the secondary site as that just wouldn’t make sense.
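Why quorum placement matters can be shown with a small Python sketch. The availability rule here is my own simplifying assumption (an array keeps serving I/O only while it can still see either its partner array or the quorum), not the documented GAD arbitration logic, and the site names are placeholders.

```python
COMPONENTS = ("primary", "secondary", "quorum")


def datastore_available(failed):
    """Assumed rule: an array serves I/O if it is up and can still
    reach either its partner array or the quorum."""
    up = set(COMPONENTS) - set(failed)
    primary_serves = "primary" in up and ("secondary" in up or "quorum" in up)
    secondary_serves = "secondary" in up and ("primary" in up or "quorum" in up)
    return primary_serves or secondary_serves


def survives_site_failure(placement, failed_site):
    """A site failure takes out every component placed at that site."""
    failed = {name for name, site in placement.items() if site == failed_site}
    return datastore_available(failed)


# Quorum on a third site versus quorum co-located with the primary.
three_sites = {"primary": "A", "secondary": "B", "quorum": "C"}
quorum_with_primary = {"primary": "A", "secondary": "B", "quorum": "A"}

for label, placement in (("three sites", three_sites),
                         ("quorum with primary", quorum_with_primary)):
    for site in sorted(set(placement.values())):
        state = "available" if survives_site_failure(placement, site) else "lost"
        print(f"{label}: site {site} fails -> datastore {state}")
```

Under this assumed rule every single-site failure is survivable when the quorum sits on its own site, whereas putting the quorum next to the primary means losing that one site takes the datastore with it.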

Hope that provides an introduction. There’s much more information coming on vMSC and GAD.

Disclaimer: I work for HDS but this is not an officially sanctioned blog and is my personal opinion always. Make sure to consult with your local account team for official queries on anything I’ve mentioned here.

 

12 thoughts on “vSphere Metro Storage Cluster and GAD : Rise of HDS Virtual Storage Machine”

    1. Hey Dave, thanks for your comment. I’m talking about customers who are looking at storage for the highest levels of availability and performance, and I can’t agree with that use case being best served by Lefthand storage for most customers. Lefthand is fine in my view for low to mid range but not enterprise class.

  1. Hello,

    Thank you for your article
    Very nice technology.

    I am thinking about the third site with the quorum disk.
    What is the remote latency requirement for this LUN?
    In our business continuity (PCA) setup we have two sites with the lowest latency, but the third is for disaster recovery (PRA) purposes because of its high latency (30-40 ms in my case).

    Is it OK to put the quorum disk on it?

    Thank you

    stéphane

  2. Hi Paul,
    Do you know where I could get more information about the other failure scenarios and test cases for VSP G1000? My company is going to run some GAD functionality, reliability and portability tests next month.
    Thank you very much for your help!

    1. Hi Jenny,

      There are a number of resources you can use. Firstly I would suggest looking at this document written by my colleagues which should prove very helpful. There are a number of test cases and scenarios covered in this document:
      http://www.hds.com/assets/pdf/deploy-vmware-vsphere-metro-storage-cluster-on-hitachi-vsp-g1000-using-global-active-device.pdf

      I also suggest checking out this link on the HDS community. This blog is written by Paul Morrissey, who is our Global Product Manager for VMware integration. You can ask a question here and I am sure Paul will answer it. If you don’t hear back, feel free to get in touch. If you’re on Twitter just follow me and I’ll follow you and I can DM you my email address.
      The community site also gives access to other people within the company who go on there and answer questions about our products. You can also search this site for further information.

      Finally you can use the launchpad of the HDS.com website for other links and info.
      http://www.hds.com/products/storage-software/global-active-device.html

      For any other questions let me know. I am always a little reluctant to undermine the local PS teams as there is very deep knowledge of GAD in all countries so HDS should be able to assist you. Where are you based ?

      All the best,
      Paul

      1. Hi Paul,
        Thank you very much for your fast answer! Your links and websites are very helpful for me. My company is already in contact with HDS. 🙂
        Kind regards,
        Jenny

        1. OK Jenny. Let me know if I can help in the future. GAD is simply awesome technology so I think it’s gonna be a great experience. Don’t forget to use another methodology to provide recovery points. All the best.

          Paul
