Some new Study resources for VCP

Part of my remit at HDS is to help colleagues within the EMEA region focus more on being Virtualisation Architects and less on being Storage Architects, and to advise and assist based on my own experience. I am a big advocate of VMware’s certification model and the whole ecosystem, so it makes sense to borrow and replicate the most useful parts.

It’s daunting when you first face down a blueprint – even for VCP.

I didn’t use the blueprint much back then, which might explain why I didn’t really enjoy the study process. I’ve been collecting materials from the blueprint for colleagues and thought, why not share them here for everyone.

I’ve created a new page where I will start to collate information on whatever activities I’m involved in. I hope it’s useful. Similar collections may already exist elsewhere, but if so, now there’s another way to increase your skill levels and make your CV even hotter than it already is ;-).

The first page links to the blueprint resources for VCP550-DV, which is the VCP exam for vSphere 5.5 in the Datacenter Virtualisation pillar/track. Here’s the page: VCP550-DV Blueprint resources.

You can also take the VCP for vSphere 5.1, but I don’t recommend that now. 5.1 has been out a long time and it’s folly to sit that exam, as you will be out of step with the current version. This is particularly the case with Single Sign-On, which is a critical operational and design consideration.

 

vSphere Metro Storage Cluster and GAD: Rise of the HDS Virtual Storage Machine

Sometime around August HDS will launch what is called Global Active Device, also known as GAD. This is part of the new Virtual Storage Platform G1000.

What is a G1000?

The VSP G1000 is the latest variant of HDS’ high-end Tier-1 array and scales to thousands of drives and millions of IOPS (if you’re into just counting millions of IOPS like some vendors).

And when we talk about an increase in performance, we’re not talking about going from 50,000 IOPS to 100,000; it’s in the millions. My colleague Hu Yoshida has written about a benchmark on a G1000 that has already achieved 3.9 million (nasty random read) IOPS. A random read workload is what breaks many storage arrays, as it often results in cache misses that have to fetch the blocks from disk.

http://blogs.hds.com/hu/2014/05/where-are-the-performance-specs-for-vsp-g1000.html 
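To see why cache misses matter so much, here is a back-of-the-envelope sketch of effective read latency versus cache hit ratio. The latency figures are hypothetical, purely for illustration, not G1000 numbers:

```python
# Illustrative only: why a cache-unfriendly random-read workload hurts.
# The latency figures below are hypothetical, not G1000 measurements.

def effective_read_latency_ms(hit_ratio, cache_ms=0.2, backend_ms=5.0):
    """Average read latency for a given cache hit ratio."""
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * backend_ms

for hit_ratio in (0.95, 0.50, 0.05):
    print(f"hit ratio {hit_ratio:.0%}: "
          f"{effective_read_latency_ms(hit_ratio):.2f} ms average read latency")
```

Even a modest drop in hit ratio pushes the average read latency towards the backend figure, which is why a large random-read working set is such a punishing test.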

It’s somewhat ironic, but scale-out storage fulfills different purposes, and one of those is being able to scale IOPS together with array-side compute resources. Without that, single- or dual-chassis controller configurations have serious trouble once the IOPS go through the roof. The G1000 can do that in the same chassis, but more on that in another post.

The benchmark used Hitachi Accelerated Flash. I’m going to write a post about that soon, but suffice it to say that it is also a very elegantly engineered solution, with a CPU embedded in each flash drive to scale performance WITH capacity.

What’s the fuss about GAD?

I believe this will be game-changing technology for many customers that need active-active compute and storage configurations. These customers are subject to the strictest availability requirements for their IT environments where failure is not an option.

I must admit that before I joined HDS I felt HDS created some great engineering (hardware and software) but didn’t always match it up with zingy titles and acronyms. GAD is a case in point: it’s not the most glamorous title (sorry, product management).

It may not sound glamorous, but the words tell you what it does: Global, Active, Device. It’s awesome technology and something I’ve written about wanting to see before, here:

Metro Storage Cluster Part 1

and here:

Metro Storage Cluster Part 2

GAD will allow customers to move to an active-active, fully fault-tolerant storage configuration without any external hardware (excluding quorum, which I’ll explain later). With GAD and the G1000, that is here today.

This is something NONE of our competitors can achieve.

Zero RPO/RTO for Storage

The sweet spot for this is zero RPO and zero RTO against a site failure. I’m talking about a dead datacenter taken out by an airplane. It will be the ability of the application to continue without disruption that dictates whether VMs or other tenants are impacted. Some applications, like Oracle RAC, will continue to run.

If VMware eventually increases the number of vCPUs supported in Fault Tolerance, you could see this being used much more frequently, with pairs of virtual machines running across the datacenters under DRS virtual-machine anti-affinity rules. Then you might experience no impact on your VMs.

Obviously if there are “regular” virtual machines in the affected datacenter they will be restarted by vSphere HA on the alternate site, but the storage part will be seamless. And if you were using something like application-level clustering across distance you will have a much improved solution.

Now don’t forget that the vMSC use case is load balancing across datacenters; VMware do not pitch it as a pure BC/DR solution.

The VSP already has redundancy built in to offer zero RPO/RTO against multiple simultaneous failures within the storage frame on its own. It didn’t need GAD for that.

The current schedule (which could be subject to change) is that GAD will launch and go GA with support for vSphere Metro Storage Cluster in August. GAD will be relevant for physical servers as well as virtual datacenters, but for the sake of this post I want to focus on vMSC, as I think this is where GAD will cause a lot of disruption.

vMSC Configuration

In terms of architecture, the magic is the creation of what’s called a Virtual Storage Machine, or VSM. What appears to our ESXi cluster is a single array that stretches between the sites. The prerequisite is that the Ethernet and storage fabrics must be stretched.

G1000 GAD Blog document

The magic happens with the Virtual Storage Machine using the serial number of the “primary” array as well as the LUN (LDEV) ID of the primary LUN. This means that the application – in this case ESXi – thinks that it has two paths to the same LUN (whose identity is derived from the WWN and LDEV ID).

Reads and Writes are accepted simultaneously down both paths and both LDEVs are always active.
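To make that concrete, here is a minimal sketch of the idea from the host’s point of view: if both paths report the same array serial and LDEV ID, the multipathing layer treats them as paths to a single device. The data structures and values are hypothetical, purely to illustrate the concept, not how the G1000 or ESXi implements it:

```python
# Hypothetical sketch of why ESXi sees one device across two arrays:
# if both paths report the same (virtual) array serial and LDEV ID,
# the multipathing layer treats them as paths to a single LUN.
from collections import defaultdict
from typing import NamedTuple

class Path(NamedTuple):
    site: str
    reported_serial: str   # serial presented by the Virtual Storage Machine
    reported_ldev: str     # LDEV (LUN) ID presented to the host

def group_paths_into_devices(paths):
    """Group paths by reported identity, the way a host collapses them."""
    devices = defaultdict(list)
    for p in paths:
        devices[(p.reported_serial, p.reported_ldev)].append(p)
    return devices

paths = [
    Path("Site A", reported_serial="50001", reported_ldev="00:10"),
    Path("Site B", reported_serial="50001", reported_ldev="00:10"),  # GAD pair LDEV
]

for identity, members in group_paths_into_devices(paths).items():
    print(f"device {identity}: {len(members)} active path(s) "
          f"via {[p.site for p in members]}")
```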

This is not HDS High Availability Manager which was an active-passive architecture. In this model the concept of active-passive goes away.

No external hardware required!

Did I mention that to use this you don’t require ANY expensive, complicated external hardware sitting between your application and the storage? Isn’t that just a bit special?

That’s handled by the G1000, in software ;-)

What about data integrity?

There is still a concept of a primary, in the sense that in-order write integrity must be assured. So if you write to the LUN on Site B, behind the scenes an acknowledgement is only returned to the application once the write has been committed to Site A and then Site B. But you will not need to worry about that.
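As a simplified sketch of that ordering (and only that – this is illustrative pseudo-logic of my own, not HDS’s actual code path), the acknowledgement comes back to the host only after both copies are committed, primary first:

```python
# Simplified sketch of in-order, synchronous write acknowledgement:
# regardless of which site receives the write, it is committed to the
# primary copy first, then the secondary, before the host gets an ack.

class Lun:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def commit(self, lba, data):
        self.blocks[lba] = data
        print(f"committed LBA {lba} on {self.name}")

def gad_write(primary: Lun, secondary: Lun, lba: int, data: bytes) -> str:
    """Acknowledge only after both copies are committed, primary first."""
    primary.commit(lba, data)
    secondary.commit(lba, data)
    return "ACK to host"

site_a = Lun("Site A LDEV")
site_b = Lun("Site B LDEV")

# Host writes via the Site B path; ordering is still primary (A) then B.
print(gad_write(primary=site_a, secondary=site_b, lba=42, data=b"payload"))
```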

In terms of data locality, if you’re using Hitachi multipathing software (HDLM) it will understand a lot more and leverage read and write locality based on I/O latency. HDLM may be required initially for metro distances, but more information will be available on that soon. As always, make sure to check the VMware HCL when this goes live.

Failure Scenarios

The quorum does not have to be a VSP, but it must be an HDS array at the moment. I can cover failure scenarios in more depth in the future, but assuming the quorum device is outside the primary and secondary sites, the following failures do not result in any loss of connectivity to a datastore from the surviving cluster hosts:

  • Primary array failure
  • Secondary array failure
  • Quorum array failure
  • Primary site failure
  • Secondary site failure

There are lots of other scenarios, and you could have the primary and the quorum on the same site, but what would be the point? You CANNOT have the quorum on the secondary site, as that just wouldn’t make sense.
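For a feel of how quorum arbitration resolves these cases, here is a highly simplified illustration (my own sketch, not the actual GAD algorithm): an array keeps serving I/O if it can still see its partner, and if the partner is unreachable it continues only if it can reach the quorum:

```python
# Highly simplified illustration of quorum arbitration (not the actual
# GAD algorithm): an array keeps serving I/O if it can still see its
# partner; if the partner is unreachable, it continues only if it can
# reach the quorum (primary wins ties in this toy model).

def survivors(reach):
    """reach[x] is None if array x is down, else a dict of reachability."""
    arrays = ("primary", "secondary")
    alive = [a for a in arrays if reach[a] is not None]

    # Both arrays up and the replication link intact: both keep serving I/O.
    if len(alive) == 2 and reach["primary"]["secondary"]:
        return set(alive)

    # Otherwise, the first surviving array that can reach the quorum wins.
    for a in alive:
        if reach[a]["quorum"]:
            return {a}
    return set()

# Example: secondary site lost entirely -> primary continues alone.
reach = {
    "primary":   {"secondary": False, "quorum": True},
    "secondary": None,  # None = that array/site is down
}
print(survivors(reach))  # {'primary'}
```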

Hope that provides an introduction. There’s much more information coming on vMSC and GAD.

Disclaimer: I work for HDS but this is not an officially sanctioned blog and is my personal opinion always. Make sure to consult with your local account team for official queries on anything I’ve mentioned here.

 

Synergy Open Source software for increased productivity in your Lab

I previously wrote a post HERE where I described using a gaming PC running VMware Workstation as one virtual datacenter, connected to a MacBook Pro running VMware Fusion hosting another. I run SRM between them to simulate having two datacenters with a small gigabit switch in between. It works well and is a pretty simple configuration.

It’s a pretty easy setup to get up and running. I did it manually, but I always give a shout out to AutoLab, created by Nick Marshall and Alastair Cooke. It’s an awesome way of automating a lab build in a very small footprint, such as a nested system. You provide the ISO images and it does the rest. Go here to read more about it: http://www.labguides.com/autolab/

Keyboard and Mouse Sharing

So running either two different OS flavours or two different machines raises the whole question of keyboard, mouse and screen sharing. For years we’ve been able to share screens, keyboards and so on using KVM switches, both in the datacenter and on the desktop.

Of late I’ve been using Synergy Open Source software successfully with both Mac and Windows 7 to use a single keyboard and mouse and multiple screens.  Synergy is an open source project that originated in a subsidiary of SGI back in 1996. You can find them here: http://synergy-project.org

[Image: Synergy – mouse and keyboard sharing software]

I’m not sure how many people are using it, but for me having multiple monitors set up correctly is pretty important, and Synergy works well when you run multiple systems in one place with a shared network.

Easy Setup. 

When you download it, it’s a breeze to set up and literally takes two minutes. You install one of your machines as the server for your keyboard and mouse and the other(s) as client(s). On my server I downloaded the Mac software and selected that machine as the server. On the other system, which happens to be running Windows 7, I downloaded and installed the Windows package. Apply the configuration and start the service on the server; for the client, you just tick the client box and provide the IP address or hostname of the server.

Once it’s set up, moving between systems is seamless and so fast you won’t notice.

[Image: Synergy configuration screen]

I currently run Synergy on my Mac and have a lovely Apple keyboard and touchpad that work perfectly on Windows.

You can chain up to 15 monitors and systems together and use it for multiple display setups should you choose. You can drag and drop the monitors around as you would locally on a system with multiple screens.

[Image: Synergy server configuration]

Anyway, for those who already know it, that’s great; for those who might need it, I think you’ll find it increases your productivity no end in a lab environment, either with multiple systems (i.e. work and lab) or where you have many servers and want to make things a bit easier.

Enjoy….

 

VMware NSX, SDN and the operational challenges ahead

Earlier today Ed Grigson posted a tweet which turned into a mini conversation involving Dan Barber, Gregg Robertson, Ed and myself. It went like this:

[Image: screenshot of the Twitter conversation, 24 June 2014]

The subject was how experienced IT architects will engage with NSX and other complex SDN solutions. Not only whether we CAN but also whether we SHOULD.

I know Ed and a few of the other guys are attending a beta NSX ICM course in Frimley in the UK. And VMware are now enabling the channel to sell NSX to try and increase adoption rates. It is a mature product that should be ready for the Enterprise.

It is inevitable that this will ultimately lead to trained people conducting design as well as admin – not my buddies on this training, but eventually out in the field.

Is the Enterprise ready for NSX?

There is no doubt this is awesome technology. However, it does open up a debate regarding what the operational support model looks like with vSphere, Hyper-V, OpenStack and advanced SDN solutions under the same umbrella.

So I’m asking: is the Enterprise ready for NSX, or Cisco ACI for that matter? Not from a product perspective; I mean purely from an operational and manageability point of view.

And I haven’t seen this question debated much, yet it seems like one of the most fundamental operational considerations before deploying an SDN solution.

In a previous life, six years ago, I worked on my first (very) complex vSphere implementation: the design of a hosted multi-tenancy cloud platform running on vSphere 4 for Ireland’s PTT. Where now we would use software-defined solutions such as NSX, I worked alongside two Cisco CCIEs – one of whom held three CCIEs. That solution was built using Cisco 6500, 7600 and other existing kit that was made to do a job.

We were absolutely constrained by budget and my colleagues performed some magic to get it all to work. In that case the solution was a commercial product that needed to be designed for scale and expansion.

It comprised shared ESXi hosts with virtualised firewalls, load balancers, multi-VPN access as well as other services such as provisioning of aggregation services for backup/storage-as-a-service for existing co-location customers.

Working on that project made me realise that where security is paramount, design should be the remit of Network engineers. That was 6 years ago. Now potentially that power exists within NSX and the Hypervisor management layer.

Network Architect skills

On that project I worked with engineers familiar with vSphere networking. Trying to engage routing/switching CCIEs with no knowledge of virtualisation is like buying a pig in a poke: it’s getting the wrong person for the job.

And looking at Ed’s tweet and remembering taking the HOL lab on NSX at VMworld last year I realised the power of NSX and what can now be easily completed with software, where once it was necessary to argue and beg Cisco for equipment to try out some of these concepts.

I suggest it would be crazy to let a vSphere Architect take design responsibility for a complex solution like NSX without sufficient networking training and experience. I would also suggest it is folly to let a network specialist (CCIE-level) design such a platform without significant vSphere training and experience. My humble opinion is that a knowledge deficit can now occur in either direction, and that gap needs to be bridged.

And this is the operational challenge. As Dan pointed out in the tweets, VMware now has Datacenter, End User Computing, Cloud and Networking as its four pillars. I’m sure some will achieve triple-VCDX, but can you really be an expert across three or four of what are now distinct disciplines, or almost distinct practices?

Reading material

And now onto the real subject…..

By coincidence about a week or so ago I asked for suggestions for good CCNP-level networking material. I plan to personally get my hands on the following three books. Thanks to Craig Kilborn @craig_kilborn, Ramses Smeyers @rsmeyers and Nicolas Vermandé @nvermande I have these three suggestions for all vSphere architects to up your game, whether VCDX is your plan or just to surprise some of your colleagues:

1. I haven’t read it but always recommend Chris Wahl’s material for great technical content with a great sense of humour that’s easy on the ear/eye:

[Image: book cover]

2. Try this book, recommended by Ramses, for in-depth NX-OS routing and switching.

ISBN-10: 1-58714-304-6

[Image: book cover]

3. Check out this one, recommended by Nicolas, which covers datacenter designs from a virtualisation perspective.

ISBN-13: 978-1-58714-324-3

[Image: book cover]

Enjoy ….

Do you fear loss of control, or worry about a lack of training and knowledge?

What are your thoughts on this?

#VCDX Constraint: LBT (and vSphere Distributed vSwitch)?

While creating a VCDX design, you consider common decision criteria and use cases and revisit them over and over. You baseline & compare design decisions against alternatives. These alternatives can be raised for your consideration during your defence.

One of the most common and subjective examples is the use of different logical network designs in a 10Gb/s scenario. 10Gb/s is becoming standard especially with the resurgence of blades in converged platforms such as Hitachi Unified Compute Platform, Cisco UCS and other solutions.

Within converged and hyper-converged infrastructure with embedded 10Gb/s Ethernet connectivity, there is a reduction in the number of physical uplinks. It’s normal to see a physical blade or host with a maximum of 20Gb/s of bandwidth per host, delivered either as two logical 10Gb/s virtual devices, or as 8 x 2.5Gb/s, or in other combinations.

Compared to the (good/bad?) old days of 6-8 x 1Gb/s Ethernet plus 2 x 4Gb/s FC, this is an embarrassment of riches, right? That’s kind of true until we layer on virtualisation, which raises the spectre of reduced redundancy and increased risk in a networking design.

Let’s be honest and admit that sometimes there’s just no single right way to do this; it boils down to experience, planning and accepting that when things change, the design can be changed to accommodate the impact.

Some Background Reading

For those who want to understand vSphere teaming and failover design/considerations I recommend this VMware document. It’s an excellent resource for decision-making in vSphere/Cloud/Hyper-V networking design:

http://www.vmware.com/files/pdf/vsphere-vnetwork-ds-migration-configuration-wp.pdf

When I started designing vSphere solutions, I was used to Tier-1 storage solutions and active-active architectures at all times; active-passive didn’t cut it. I applied this mindset to networking as well as storage. In hindsight, much of this was due to a lack of knowledge on my part, which made me want to learn more and is how I ended up going down the VCAP/VCDX route to get to the bottom of it.

The document above shows why this is not always optimal. It made everything clear in terms of understanding good practice and moving away from slavishly adhering to “active/active” topologies. Some protocols, such as NFS v3, find it hard to leverage LACP, and in those cases LACP does not provide a performance or management benefit while, in my view, it increases management complexity.

There are many excellent posts on the subject such as this one by Chris Wahl here:

http://wahlnetwork.com/2011/06/08/a-look-at-nfs-on-vmware/

Chris has written a series where he has tested LBT in his lab and established the definitive behaviour of what happens when a link nears traffic saturation.

and by Michael Webster here:

http://longwhiteclouds.com/2012/04/10/etherchannel-and-ip-hash-or-load-based-teaming/

and by Frank Denneman here (complete with gorgeous Mac Omnigraffle logical topologies):

http://frankdenneman.nl/2011/02/24/ip-hash-versus-lbt/

The VMware document is also useful in showing how clear documentation makes for nice and easy deployments, rather than “back of a fag/cigarette packet” designs. You cannot put a value on good, clear documentation laid out like this. When it’s in a visual, 2-D format you can really get a picture of the way different traffic will route. It’s here that you should make changes, not when you’re installing the platform.

How often is this step skipped by VMware partners when installing vSphere and vCenter? I’ve seen it a lot, and it can lead to many issues.

LACP will take care of it 

And sometimes it’s assumed that LACP will “take care of it” as now we have more bandwidth.

This is not the case for a discrete TCP/IP session from a virtual machine to an external destination, or for NFS. Such a session will only ever use a single uplink once the IP hash is calculated, as Frank has shown. Yes, a VM might use multiple uplinks across multiple sessions (TCP/IP ports), but never for one point-to-point “conversation”.

And the typical NFS use case – a vSphere host mounting a datastore from a VIP on a NAS device – will also only ever use a single uplink, as Chris has clearly shown.
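Here is a rough sketch of why that is. vSphere’s “route based on IP hash” policy is commonly described as an XOR of the source and destination IP addresses modulo the number of active uplinks; exact implementation details aside, the point is that the inputs never change for a given conversation, so neither does the chosen uplink:

```python
# Rough sketch of IP-hash uplink selection. vSphere's "route based on
# IP hash" is commonly described as an XOR of source and destination IP
# modulo the number of active uplinks; exact details aside, the key point
# is that the same source/destination pair always maps to the same uplink.
import ipaddress

def pick_uplink(src_ip: str, dst_ip: str, uplink_count: int) -> int:
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % uplink_count

# One ESXi host mounting one NFS datastore from one NAS VIP:
# every packet of that conversation hashes to the same uplink.
print(pick_uplink("192.168.10.21", "192.168.10.50", uplink_count=2))
# A different destination (e.g. another VIP) may hash to another uplink.
print(pick_uplink("192.168.10.21", "192.168.10.51", uplink_count=2))
```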

The VMware document also shines a light on the importance of keeping it simple and avoiding LACP and other more complex topologies that may not deliver any material benefit. Use case, requirements and constraints drive conceptual and logical design decisions.

Your logical network design is a template that will be replicated many times. If the logic behind it is questionable, any adverse issues will be amplified when deployed across one or more clusters. Personally I believe this is the most critical element in designs (from experience) to ensure cluster and host stability and availability. I would put storage as a close second.

Logical Network Design

Making design choices about balancing different workload types across a pair of 10Gb/s adapters can be a game of chance. If you’ve completed your current-state analysis and understand your workload, you can apply some science. You might still suffer from massive organic growth and the effect of previously unknown traffic types such as vMotion.

From a design perspective there are so many things to consider:

  • Understanding the current workload
  • Upstream connectivity (logical and physical)
  • Traffic Type(s)
  • Traffic priority (relative)
  • Latency requirements for different types of traffic
  • Application dependencies (some may benefit or suffer from virtual machine to host affinity)
  • Workload profile (across day/month/year)
  • Bandwidth required
  • Performance under contention
  • Scalability

A balanced solution such as Teaming and Failover based on Physical NIC load (also known as load-based teaming) is an excellent way to ensure traffic is moved between uplinks non-disruptively. It’s like DRS for your network traffic.

So for me LBT without LACP is a good solution and can be used in many use cases. I personally would hold off using Network I/O Control Day 1. It’s better to give all traffic types access to all bandwidth and only put on the handbrake for good reason. NIOC can be applied later in real-time.
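As a minimal sketch of the LBT idea (my own illustration, not the ESXi implementation): the distributed switch periodically samples uplink utilisation and moves a VM’s port off an uplink that stays busy – the documented behaviour is roughly a mean load above 75% over a 30-second window – onto a less-loaded one:

```python
# Minimal sketch of the load-based teaming (LBT) idea, not the ESXi
# implementation: periodically sample uplink utilisation and move a port
# from a busy uplink (documented behaviour is roughly >75% mean load over
# a 30-second window) to the least-loaded one.

THRESHOLD = 0.75

def rebalance(uplinks):
    """uplinks: {uplink_name: {"load": float, "ports": [port names]}}"""
    moves = []
    for name, info in uplinks.items():
        if info["load"] > THRESHOLD and info["ports"]:
            target = min(uplinks, key=lambda u: uplinks[u]["load"])
            if target != name:
                port = info["ports"].pop()
                uplinks[target]["ports"].append(port)
                moves.append((port, name, target))
    return moves

uplinks = {
    "vmnic0": {"load": 0.88, "ports": ["vm-app01", "vm-app02"]},
    "vmnic1": {"load": 0.20, "ports": ["vm-db01"]},
}
print(rebalance(uplinks))   # e.g. [('vm-app02', 'vmnic0', 'vmnic1')]
```

The move is a teaming decision inside the host, so it is non-disruptive to the VM – which is why the “DRS for your network traffic” analogy fits.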

And now the constraint

Unfortunately, LBT is only a feature of the vSphere Distributed Switch. This post is partly an attempt to raise this within the community and make the point that the vDS has been around a long time now (more than 4-5 years) and it’s time it made its way down the SKUs to the Standard edition.

After all Microsoft doesn’t have any such limitations at present.

10Gb/s is now pervasive which means more and more we need a software(policy)-based solution to ensure fairness and optimum use of bandwidth. I believe we have reached the point where VMware Load Based Teaming without LACP is a great solution in many common use cases today to balance traffic under load.

I haven’t gotten into Network I/O Control and the additional priority and Class of Service (CoS/QoS) that can be applied to different traffic types. That’s another tool in the armoury. Maybe more on that later.

KISS

For me, network design comes down to KISS: Keeping It Simple, Stupid.

So while LBT is a great potential solution, without Enterprise Plus you can’t use the vSphere Distributed Switch, and without that you cannot use LBT. This also rules out Network I/O Control, which likewise requires the vDS.

As vSphere has evolved we have seen awesome features appear and become commonplace in lower SKUs. It’s strange that vDS is still an Enterprise Plus feature and I don’t like to have to use this as a design constraint with customers who can’t afford Enterprise Plus.

I hope someday soon this awesome technology will be available to all customers regardless of which license tier they reside in.

Thoughts?

Moving on to HDS

It’s been too long since my last post. I’m glad to say I have something exciting to report.

Next Tuesday I start a new role with Hitachi Data Systems (HDS) as Principal Consultant EMEA for VMware Solutions. This is a wide-ranging role that centres around HDS’ strategic decision to focus (even more) on VMware Solutions as a core growth strategy for this and subsequent years.

The role involves not only engaging with HDS markets across the region to help them grow the HDS VMware solutions business, but also internal evangelism and educating the end-user community on existing and new VMware solutions. It’s a great opportunity to stay very technical while also helping grow HDS’ VMware business, leveraging some of their great product engineering such as VSP, HNAS and the Hitachi Unified Compute Platform (UCP). UCP integration with vCenter is awesome and has surprised even me. It’s going to be a major growth area for us this year and has some unique features, such as logical partitioning, that other vendors can’t match.

While this role focuses on VMware and its growing portfolio, over time I’m sure it will expand to all things OpenStack, Hyper-V, Red Hat and other virtualisation and cloud solutions.

I hope to be able to help those within the VMware and wider community if I can. Just ping me a note and I’ll try to help.

I’ve been working as a freelance consultant over the last year. It has been amazingly challenging due to the financial climate in Ireland, which has been, in one word, horrendous. However, it’s also been the most productive and effective year of my career.

Last year I passed VCAP-DCA and VCAP-DCD, started a blog and my own website, and this year I was awarded vExpert. But it’s the new friends gained through the community that mean the most. Thankfully that remains regardless of role.

I always endeavour to remain neutral or at least make sure never to engage in FUD or any such nonsense. That stays the same. For me the community is paramount and something whose integrity I will ensure I never compromise regardless of employer.

Finally, in a world of hyper-convergence, convergence, low-end, high-end, scale-up, scale-out, web-scale and so on, I feel there is a place for all platforms in different use cases. That’s what makes this exciting: working with customers in the most demanding environments across the continents.

Regular updates coming soon.

Thoughts for Today, Ideas for Tomorrow