Sometime around August, HDS will launch what is called Global Active Device, also known as GAD. This is part of the new Virtual Storage Platform G1000.
What is a G1000?
The VSP G1000 is the latest variant of HDS's high-end Tier-1 array, scaling to thousands of drives and millions of IOPS (if you're into just counting millions of IOPS like some vendors).
And when we talk about an increase in performance, we're not talking about going from 50,000 IOPS to 100,000. It's in the millions. My colleague Hu Yoshida has written about a benchmark on a G1000 that has already achieved 3.9 million (nasty random read) IOPS. A random read workload is what breaks many storage arrays, as it often results in cache misses that must fetch the blocks from disk.
Scale-out storage fulfills several purposes, and one of them is the ability to scale IOPS together with array-side compute resources. Without that, single- or dual-chassis controller configurations have serious trouble once the IOPS go through the roof. The G1000 can do that in the same chassis, but more on that in another post.
The benchmark was using Hitachi Accelerated Flash. I'm gonna write a post about that soon, but suffice it to say that it is also a very elegantly engineered solution, with a CPU embedded in each flash drive to scale performance WITH capacity.
What’s the fuss about GAD?
I believe this will be game-changing technology for many customers that need active-active compute and storage configurations. These customers are subject to the strictest availability requirements for their IT environments where failure is not an option.
I must admit that before I joined HDS, I felt HDS created some great engineering (hardware and software) but didn't always match it up with zingy titles and acronyms. GAD is a case in point. It's not the most glamorous title (sorry, product management).
It may not sound glamorous, but the words tell you what it does: Global, Active, Device. It's awesome technology and something I've kind of written about wanting to see before, here:
GAD will allow customers to move to an active-active fully fault tolerant storage configuration without any external hardware (excluding quorum, which I’ll explain later). With GAD and G1000 that is here today.
This is something NONE of our competitors can achieve.
Zero RPO/RTO for Storage
The sweet spot for this is zero RPO and zero RTO against a site failure. I'm talking about a dead datacenter taken out by an airplane. Whether VMs or other tenants are impacted will be dictated by the ability of the application to continue without disruption. Some applications, like Oracle RAC, will continue to run.
If VMware eventually increase the number of vCPUs supported in Fault Tolerance, you could see this being used much more frequently, with pairs of virtual machines running across the datacenters under DRS virtual-machine anti-affinity rules. Then you might experience no impact on your VMs.
Obviously, if there are "regular" virtual machines in the affected datacenter, they will be restarted by vSphere HA on the alternate site, but the storage part will be seamless. And if you are using something like application-level clustering across distance, you will have a much improved solution.
Now don't forget that the vMSC use case is load balancing across datacenters. VMware do not pitch it as a pure BC/DR solution.
The VSP already has redundancy built in to offer zero RPO/RTO against multiple simultaneous failures within the storage frame on its own. It didn't need GAD for that.
The current schedule (which could be subject to change) is for GAD to launch and go GA with support for vSphere Metro Storage Cluster in August. GAD will be relevant for physical servers as well as virtual datacenters, but for the sake of this post I want to focus on vMSC, as I think this is where GAD will cause a lot of disruption.
In terms of architecture, the magic is the creation of what's called a Virtual Storage Machine, or VSM. What appears to our ESXi cluster is a single array that stretches between the sites. The prerequisite is that the Ethernet and storage fabrics must be stretched.
The magic happens with the virtual storage machine using the serial number of the "primary" array as well as the LUN (LDEV) ID of the primary LUN. This means that the application – in this case ESXi – thinks it has two paths to the same LUN (whose identity is derived from the WWN and LDEV ID).
Reads and Writes are accepted simultaneously down both paths and both LDEVs are always active.
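To make that concrete, here is a minimal sketch (my own illustration, not HDS code – the class and field names are invented) of why a host sees one LUN: the VSM makes the secondary array present the primary array's serial number and LDEV ID, so both paths report the same identity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LunIdentity:
    array_serial: str   # serial number the array presents to the host
    ldev_id: int        # LDEV (LUN) identifier

@dataclass
class Path:
    site: str               # physical site the path lands on
    presented: LunIdentity  # identity reported on this path

# The VSM makes the secondary array present the PRIMARY array's identity.
primary_id = LunIdentity(array_serial="410001", ldev_id=0x10)

paths = [
    Path(site="A", presented=primary_id),  # the real primary array
    Path(site="B", presented=primary_id),  # secondary, masquerading via the VSM
]

# A host such as ESXi groups paths by the identity they present; because both
# report the same serial/LDEV, it treats them as two active paths to ONE LUN.
unique_luns = {p.presented for p in paths}
assert len(unique_luns) == 1
```

Two arrays, two paths, one device as far as the hypervisor is concerned – which is exactly what lets both sides stay active.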
This is not HDS High Availability Manager, which was an active-passive architecture. In this model, the concept of active-passive goes away.
No external hardware required!
Did I mention that to use this you don't require ANY expensive, complicated external hardware sitting between your application and the storage? Isn't that just a bit special?
That’s handled by the G1000, in software ;-)
What about data integrity?
There is still a concept of a primary, in the sense that in-order write integrity must be assured. So if you write to the LUN on Site B, behind the scenes an acknowledgement is only returned to the application once the write has been committed to Site A followed by Site B. But you will not need to worry about that.
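As a toy model of that ordering guarantee (my own simplification, not HDS internals), the key point is that whichever side receives the write, it is committed to the primary first and the secondary second, and only then acknowledged:

```python
class Site:
    def __init__(self, name):
        self.name = name
        self.log = []  # committed writes, in arrival order

    def commit(self, block, data):
        self.log.append((block, data))

class GadPair:
    """Toy GAD pair: Site A is primary, Site B is secondary."""
    def __init__(self):
        self.primary = Site("A")
        self.secondary = Site("B")

    def write(self, received_at, block, data):
        # In-order integrity: commit to the primary first, then the
        # secondary, regardless of which side received the IO.
        self.primary.commit(block, data)
        self.secondary.commit(block, data)
        return "ACK"  # only now does the application see the acknowledgement

pair = GadPair()
pair.write(received_at="B", block=7, data=b"x")

# Both sides hold identical, identically ordered logs.
assert pair.primary.log == pair.secondary.log
```

Because every write funnels through the same commit order, both copies stay byte-identical and in the same sequence, which is what keeps a crash at either site consistent.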
In terms of data locality, if you're using Hitachi's multipathing software (HDLM), it will understand a lot more and leverage read and write locality based on I/O latency. HDLM may be required initially for metro distances, but more info will be available on that soon. As always, make sure to check the VMware HCL when this goes live.
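The locality idea above boils down to something simple: prefer the path with the lowest observed latency, which at metro distances usually means the local array. A hypothetical sketch (this is not HDLM's actual algorithm, and the path names are invented):

```python
def pick_path(paths):
    """paths: list of (name, avg_latency_ms) tuples observed per path.
    Returns the name of the lowest-latency path."""
    return min(paths, key=lambda p: p[1])[0]

# At metro distance, the round trip to the remote array adds latency,
# so the local array naturally wins read traffic.
observed = [("local-array-A", 0.4), ("remote-array-B", 2.1)]
assert pick_path(observed) == "local-array-A"
```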
The quorum does not have to be a VSP but must currently be an HDS array. I can cover failure scenarios in the future, but assuming the quorum device sits outside the primary and secondary sites, the following failures do not result in any loss of connectivity to a datastore from any surviving cluster:
- Primary array failure
- Secondary array failure
- Quorum array failure
- Primary site failure
- Secondary site failure
There are lots of other scenarios, and you could have the primary and the quorum on the same site, but what would be the point? You CANNOT have the quorum on the secondary site, as that just wouldn't make sense.
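The split-brain reasoning that the scenario list implies can be sketched as a truth table (my own simplification of the arbitration idea, not HDS's actual logic): an array keeps serving I/O if both arrays are alive, or if it is alive and can win arbitration via the quorum.

```python
def surviving_arrays(primary_up, secondary_up, quorum_up):
    """Which arrays may keep serving IO after a failure (simplified model)."""
    if primary_up and secondary_up:
        # Both arrays alive: losing only the quorum doesn't stop IO.
        return {"primary", "secondary"}
    if primary_up and quorum_up:
        return {"primary"}    # primary wins arbitration via the quorum
    if secondary_up and quorum_up:
        return {"secondary"}  # secondary wins arbitration via the quorum
    return set()

# Each scenario from the list above leaves at least one array serving:
assert surviving_arrays(False, True, True) == {"secondary"}  # primary down
assert surviving_arrays(True, False, True) == {"primary"}    # secondary down
assert surviving_arrays(True, True, False) == {"primary", "secondary"}  # quorum down
```

This also shows why co-locating the quorum with an array site is a bad idea: one site failure would then remove both a data copy and the arbiter in the same event.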
Hope that provides an introduction. There’s much more information coming on vMSC and GAD.
Disclaimer: I work for HDS, but this is not an officially sanctioned blog and is always my personal opinion. Make sure to consult your local account team for official queries on anything I've mentioned here.