While creating a VCDX design, you consider common decision criteria and use cases, and you revisit them over and over. You baseline and compare design decisions against alternatives, and those alternatives can be raised for your consideration during your defence.
One of the most common and subjective examples is the choice between different logical network designs in a 10Gb/s scenario. 10Gb/s is becoming standard, especially with the resurgence of blades in converged platforms such as Hitachi Unified Compute Platform, Cisco UCS and other solutions.
Within converged and hyper-converged infrastructure with embedded 10Gb/s Ethernet connectivity there is a reduction in the number of physical uplinks. It’s normal to see a physical blade or host with a maximum of 20Gb/s of virtual bandwidth, delivered either as two logical 10Gb/s virtual devices, as 8 x 2.5Gb/s, or in other combinations.
Compared to the (good/bad?) old days of 6-8 x 1Gb/s Ethernet plus 2 x 4Gb/s FC, this is an embarrassment of riches, right? That’s kinda true, until we layer on virtualisation, which raises the spectre of reduced redundancy and increased risk in a networking design.
Let’s be honest and admit that sometimes there’s just no single right way to do this. It boils down to experience, planning, and accepting that when things change, the design can be changed to accommodate those impacts.
Some Background Reading
For those who want to understand vSphere teaming and failover design considerations, I recommend this VMware document. It’s an excellent resource for decision-making in vSphere/Cloud/Hyper-V networking design:
When I started designing vSphere solutions, I was used to Tier-1 storage solutions and to using Active-Active architectures at all times; Active-Passive didn’t cut it. I applied this mindset to networking as well as storage. In hindsight, much of this was due to a lack of knowledge on my part. That made me want to learn more, which is how I ended up going down the VCAP-VCDX route to get to the bottom of it.
The document above shows why this is not always optimal. It made everything clear in terms of understanding good practice and moving away from slavishly adhering to “active/active” topologies. Some protocols, such as NFS v3, find it hard to leverage LACP; in those cases LACP provides no performance or management benefit and, in my view, only increases management complexity.
There are many excellent posts on the subject such as this one by Chris Wahl here:
Chris has written a series in which he tested LBT in his lab and established definitively how it behaves when a link nears saturation.
and by Michael Webster here:
and by Frank Denneman here (complete with gorgeous OmniGraffle logical topologies):
The VMware document is also useful in showing how clear documentation makes for nice, easy deployments, rather than “back of a fag/cigarette packet” designs. You cannot put a value on good, clear documentation laid out like this. When it’s in a visual, 2-D format you can really get a picture of the way different traffic will route. That’s where you should make changes, not when you’re installing the platform.
How often do VMware partners skip this step when installing vSphere and vCenter? I’ve seen it a lot, and it can lead to many issues.
LACP will take care of it
And sometimes it’s assumed that LACP will “take care of it” now that we have more bandwidth.
This is not the case for a discrete TCP/IP session from a virtual machine to an external destination, or for NFS. Such a session will only ever use a single uplink, because the IP-hash calculation maps each source/destination pair to one link, as Frank has shown. Yes, a VM might use multiple uplinks across multiple sessions (different TCP/IP ports), but never for a single point-to-point “conversation”.
And the typical NFS use case – a vSphere host mounting a datastore from a VIP on a NAS device – will also only ever use a single uplink, as Chris has clearly shown.
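To make that concrete, here’s a minimal sketch of the idea behind “route based on IP hash”. It’s an illustration of the concept only, not VMware’s actual hash function; the key point is that the uplink choice is a pure function of the address pair, so a single conversation can never spread across links.

```python
# Minimal sketch of "route based on IP hash" uplink selection.
# Illustrative only - not VMware's exact hash function.
import ipaddress

def select_uplink(src_ip: str, dst_ip: str, uplinks: list) -> str:
    """Pick an uplink as a deterministic function of the IP pair."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return uplinks[(src ^ dst) % len(uplinks)]

uplinks = ["vmnic0", "vmnic1"]

# An NFS mount from one host IP to one NAS VIP: always the same uplink,
# no matter how much traffic flows over the session.
print(select_uplink("10.0.0.21", "10.0.0.50", uplinks))
print(select_uplink("10.0.0.21", "10.0.0.50", uplinks))  # identical result

# Different source/destination pairs can land on different uplinks.
print(select_uplink("10.0.0.21", "10.0.0.51", uplinks))
```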
The VMware document also shines a light on the value of keeping it simple: avoid LACP and other more complex topologies that may not deliver any material benefit. Use case, requirements and constraints drive conceptual and logical design decisions.
Your logical network design is a template that will be replicated many times. If the logic behind it is questionable, any adverse issues will be amplified when it is deployed across one or more clusters. From experience, I believe this is the most critical element of a design for ensuring cluster and host stability and availability, with storage a close second.
Logical Network Design
Making design choices about balancing different workload types across a pair of 10Gb/s adapters can be a game of chance. If you’ve completed your current-state analysis and understand your workload, you can apply some science. Even then you might suffer from massive organic growth, or from the effect of previously unencountered traffic types such as vMotion.
From a design perspective there are so many things to consider (a rough bandwidth budget sketch follows the list):
- Understanding the current workload
- Upstream connectivity (logical and physical)
- Traffic Type(s)
- Traffic priority (relative)
- Latency requirements for different types of traffic
- Application dependencies (some may benefit or suffer from virtual machine to host affinity)
- Workload profile (across day/month/year)
- Bandwidth required
- Performance under contention
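For the bandwidth and contention items, a back-of-the-envelope budget helps sanity-check a 2 x 10Gb/s design. The traffic figures below are purely illustrative assumptions for a hypothetical host, not recommendations:

```python
# Hypothetical bandwidth budget for a host with 2 x 10Gb/s uplinks.
# All demand figures are illustrative assumptions, not guidance.
uplinks_gbps = 2 * 10

peak_demand_gbps = {
    "management": 0.1,
    "vmotion": 8.0,      # vMotion will happily saturate a 10Gb/s link
    "nfs_storage": 4.0,
    "vm_traffic": 5.0,
}

total = sum(peak_demand_gbps.values())
print(f"Peak demand: {total:.1f} Gb/s across {uplinks_gbps} Gb/s of uplink")

# The interesting case is failure: can one surviving 10Gb/s uplink
# carry the combined peak, or must some traffic types yield?
surviving_gbps = 10
print(f"Oversubscribed after uplink failure: {total > surviving_gbps}")
```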
A balanced solution such as teaming and failover based on physical NIC load (better known as load-based teaming, or LBT) is an excellent way to ensure traffic is moved between uplinks non-disruptively: when an uplink’s average utilisation exceeds 75% over a 30-second window, LBT remaps ports to a less-loaded uplink. It’s like DRS for your network traffic.
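Here’s a toy sketch of that rebalancing idea. The 75% trigger matches the documented behaviour, but the rest is a simplification of mine, not VMware’s algorithm:

```python
# Toy illustration of load-based teaming: when an uplink's mean
# utilisation crosses 75%, move its busiest port to the quieter uplink.
# Not VMware's implementation - just the rebalancing concept.
THRESHOLD = 0.75
CAPACITY_GBPS = 10.0

# Hypothetical vNIC-port loads (Gb/s) and port -> uplink mapping.
ports = {"vm-a": 6.0, "vm-b": 3.0, "vm-c": 1.0}
mapping = {"vm-a": "vmnic0", "vm-b": "vmnic0", "vm-c": "vmnic1"}

def utilisation(uplink):
    return sum(load for p, load in ports.items() if mapping[p] == uplink) / CAPACITY_GBPS

for uplink in ("vmnic0", "vmnic1"):
    if utilisation(uplink) > THRESHOLD:
        # Move the busiest port on the hot uplink to the other one.
        busiest = max((p for p in ports if mapping[p] == uplink), key=ports.get)
        other = "vmnic1" if uplink == "vmnic0" else "vmnic0"
        print(f"{uplink} at {utilisation(uplink):.0%}: moving {busiest} to {other}")
        mapping[busiest] = other

print(mapping)
```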
So for me, LBT without LACP is a good solution for many use cases. I would personally hold off on enabling Network I/O Control on day one. It’s better to give all traffic types access to all of the bandwidth and only put on the handbrake for good reason; NIOC can be applied later, in real time.
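If and when NIOC does get switched on, the handbrake is proportional: under contention, each active traffic type receives bandwidth in proportion to its shares. A sketch of that arithmetic, with hypothetical share values rather than vSphere defaults:

```python
# Illustration of share-based allocation under contention (the NIOC idea).
# Share values are hypothetical examples, not vSphere defaults.
LINK_GBPS = 10.0

shares = {"vm_traffic": 100, "nfs_storage": 100, "vmotion": 50}

def allocate(active):
    """Split link bandwidth among active traffic types by shares."""
    total = sum(shares[t] for t in active)
    return {t: LINK_GBPS * shares[t] / total for t in active}

# With no vMotion running, VM and storage traffic split the link evenly...
print(allocate(["vm_traffic", "nfs_storage"]))
# ...and when a vMotion kicks off, everyone is squeezed proportionally.
print(allocate(["vm_traffic", "nfs_storage", "vmotion"]))
```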
And now the constraint
Unfortunately, LBT is only a feature of the vSphere Distributed Switch. This post is partly an attempt to raise this within the community and make the point that the vDS has been around a long time now (more than 4-5 years) and it’s time it made its way down the SKUs to Standard edition.
After all, Microsoft has no such limitation at present.
10Gb/s is now pervasive, which means we increasingly need a software (policy-based) solution to ensure fairness and optimal use of bandwidth. I believe we have reached the point where VMware Load Based Teaming without LACP is a great solution for balancing traffic under load in many common use cases today.
I haven’t gone deep into Network I/O Control here, or the additional priority and Class of Service (CoS/QoS) tagging that can be applied to different traffic types. That’s another tool in the armoury. Maybe more on that later.
For me, network design comes down to KISS: Keep It Simple, Stupid.
So while LBT is a great potential solution, without Enterprise Plus you can’t use the vSphere Distributed Switch, and without the vDS you cannot use LBT. This also rules out Network I/O Control, which likewise requires the vDS.
As vSphere has evolved, we have seen awesome features appear and become commonplace in lower SKUs. It’s strange that the vDS is still an Enterprise Plus feature, and I don’t like having to treat it as a design constraint for customers who can’t afford Enterprise Plus.
I hope someday soon this awesome technology will be available to all customers regardless of which license tier they reside in.