Disclaimer: I work for Hitachi Data Systems. However, this post is not officially sanctioned by my company, nor does it speak on its behalf. These thoughts are my own, based on my experience. Use at your own discretion.
VMware vSphere has many features within its storage stack that enhance the operation of a virtual datacenter. When used correctly, these can lead to optimal performance for tenants of the virtual environment. When used incorrectly, these features can conflict with array-based technology such as dynamic tiering, dynamic (thin) pools and other features. This can have a detrimental effect on performance and increase the operational overhead of managing an environment.
It is quite common to observe difficulty among server administrators as well as storage consultants in understanding the VMware technology stack and how it interacts with storage subsystems in servers and storage arrays. It is technically complex and the behavior of certain features changes frequently, as newer versions of vSphere and vCenter are released.
Don’t ever just enable a feature like Storage I/O Control, Storage DRS or any other feature without thoroughly understanding what it does. Always abide by the maxim “Just because you can doesn’t necessarily mean you should.” Be conservative when introducing features to your environment and ensure you understand the end-to-end impact of any such change. It is my experience that some vSphere engineers haven’t a clue how these features really work, so don’t always just take what they say at face value. Always do your homework.
This is the first in a series of posts to introduce the reader to some of the features that must be considered when designing and managing VMware vSphere solutions in conjunction with Hitachi or other storage systems. Many of the design and operational considerations apply across different vendors’ technology and could be considered generic recommendations unless stated otherwise.
In order to understand how vSphere storage technology works, I strongly recommend all vSphere or storage architects read the cluster deep-dive book by Duncan Epping and Frank Denneman. This is the definitive source for a deep dive into how vSphere works under the covers. Also check out any of Cormac Hogan’s posts, which are also the source of much clarification on these matters.
Why write this series?
Whenever I asked myself whether I should or shouldn’t use certain features, I always ended up slightly confused. Thanks go to Twitter and to Duncan, Cormac Hogan, Frank and others, who are always available to answer these questions.
This is an attempt to pull together black and white recommendations regarding whether you should use a certain feature or not in conjunction with storage features, and bring this all into a single series.
The first series focuses on Storage I/O Control, Adaptive Queuing, Storage DRS (Distributed Resource Scheduler) and HDLM/multipathing in VMware environments, and how these features interoperate with storage. I also plan to cover thick vs thin (covered in a previous post), VAAI, VASA, and HBA queue depth and queuing considerations in general. Basically anything that seems relevant.
Should I use or enforce limits within vSphere?
Virtualization has relied on oversubscription of CPU, memory, network and storage in order to provide better utilization of hardware resources. It was common to see average peak CPU and memory utilization of less than 5-10% across a server estate.
While vSphere uses oversubscription as a key resource scheduling strategy, this is designed to take advantage of the idle cycles available. The intention should always be to monitor an environment and ensure an out-of-resources situation does not occur. An administrator could over-subscribe resources on a server leading to contention and degradation in performance. This is the danger of not adopting a conservative approach to design of a vSphere cluster. Many different types of limits can be applied to ensure that this situation does not arise.
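To make the idea of oversubscription concrete, here is a minimal sketch that compares what has been promised to virtual machines with what the hosts physically have. The `Host` and `VM` classes and the `oversubscription` function are hypothetical names invented for this illustration and are not part of any VMware API. A ratio like this is only one monitoring input, not a design strategy on its own.

```python
from dataclasses import dataclass


@dataclass
class Host:
    physical_cores: int   # physical CPU cores in the host
    memory_gb: float      # physical memory in the host


@dataclass
class VM:
    vcpus: int            # virtual CPUs configured for the VM
    memory_gb: float      # memory configured for the VM


def oversubscription(hosts, vms):
    """Return (vCPU:pCPU ratio, configured:physical memory ratio) for a cluster."""
    total_cores = sum(h.physical_cores for h in hosts)
    total_memory = sum(h.memory_gb for h in hosts)
    total_vcpus = sum(v.vcpus for v in vms)
    total_vm_memory = sum(v.memory_gb for v in vms)
    return total_vcpus / total_cores, total_vm_memory / total_memory


# Example: two 16-core, 256 GB hosts running twenty 4-vCPU, 32 GB VMs.
hosts = [Host(16, 256), Host(16, 256)]
vms = [VM(4, 32) for _ in range(20)]
cpu_ratio, mem_ratio = oversubscription(hosts, vms)
print(f"vCPU:pCPU = {cpu_ratio}:1, memory = {mem_ratio}:1")
```

Anything above 1.0 means more has been promised than physically exists, which is fine while VMs are idle and becomes contention when they are not.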
In some environments close to 100% server virtualization has been achieved, so gambling with a company’s full workload running on one or more clusters can impact all the company’s line-of-business applications. That’s why risk mitigation in vSphere design is always the most critical design strategy (IMO).
If at all possible, please be conservative with vSphere design. If you’re putting your eggs in one basket, use common sense. Don’t oversubscribe your infrastructure to death. Always plan for disaster and assume it will happen, as it probably will. And don’t just use tactics such as virtual CPU to physical CPU consolidation ratios as your design strategy. If a customer doesn’t want to pay, explain that the cost of bad design is business risk, which has a serious $$$ impact.
More on Reservations
Reservations not only take active resources from a server, preventing other tenants on the same server from using them, but also require other servers in an HA cluster to hold back resources so that the reservation can still be honored by the cluster in the event of a failure. This feature of vSphere is called High Availability (HA) Admission Control, and it ensures a cluster always keeps resources in hand for when a host failure occurs.
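The knock-on effect of reservations on the rest of the cluster can be sketched roughly as follows. This is a deliberately simplified model and not VMware's actual admission control algorithm: it assumes the worst case is losing the largest hosts and that all reservations must fit in what is left. The function name `can_power_on` and its parameters are hypothetical.

```python
def can_power_on(host_capacities_mhz, existing_reservations_mhz,
                 new_reservation_mhz, host_failures_to_tolerate=1):
    """Simplified check: do all reservations still fit after N hosts fail?

    Assumption (not VMware's real policy): the worst case is losing the
    largest hosts, and reservations can be packed perfectly onto survivors.
    """
    survivors = max(0, len(host_capacities_mhz) - host_failures_to_tolerate)
    # Keep the smallest hosts, i.e. assume the largest ones are the ones lost.
    surviving_capacity = sum(sorted(host_capacities_mhz)[:survivors])
    total_reserved = sum(existing_reservations_mhz) + new_reservation_mhz
    return total_reserved <= surviving_capacity


# Three 10 GHz hosts, tolerating one host failure, leave 20 GHz to reserve.
hosts_mhz = [10000, 10000, 10000]
reserved_mhz = [6000, 6000, 6000]
print(can_power_on(hosts_mhz, reserved_mhz, 2000))   # fits exactly
print(can_power_on(hosts_mhz, reserved_mhz, 2001))   # one MHz too many
```

Even in this toy model, every reservation reduces what the cluster can admit, because a full host's worth of capacity is held back for failover.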
Implementing limits is a double-edged sword in vSphere. Do not introduce Reservations, Resource Pools or any other limits unless absolutely necessary. These decisions should be driven by specific business requirements and informed by monitoring existing performance to achieve the best possible outcome.
In certain cases, such as Microsoft Exchange, it makes complete sense to use reservations, as Exchange is a CPU-sensitive application that should never be oversubscribed! But that decision is driven by an application/business requirement and is a VMware/Microsoft recommendation.
The following text has been taken from the vSphere Resource Management guide for vSphere 5.5 and provides some important guidance regarding enforcing limits.
Limit specifies an upper bound for CPU, memory, or storage I/O resources that can be allocated to a virtual machine. A server can allocate more than the reservation to a virtual machine, but never allocates more than the limit, even if there are unused resources on the system. The limit is expressed in concrete units (megahertz, megabytes, or I/O operations per second). CPU, memory, and storage I/O resource limits default to unlimited. When the memory limit is unlimited, the amount of memory configured for the virtual machine when it was created becomes its effective limit.
In most cases, it is not necessary to specify a limit. There are benefits and drawbacks:
- Benefits — Assigning a limit is useful if you start with a small number of virtual machines and want to manage user expectations. Performance deteriorates as you add more virtual machines. You can simulate having fewer resources available by specifying a limit.
- Drawbacks — You might waste idle resources if you specify a limit. The system does not allow virtual machines to use more resources than the limit, even when the system is underutilized and idle resources are available. Specify the limit only if you have good reasons for doing so.
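The behaviour described in the guide text above can be illustrated with a small sketch. It is a toy model of the semantics only, not how the ESXi scheduler is implemented, and the function names `allocation` and `effective_memory_limit` are hypothetical.

```python
from typing import Optional


def allocation(demand_mhz: int, idle_capacity_mhz: int,
               limit_mhz: Optional[int] = None) -> int:
    """What a VM receives: its demand, bounded by free capacity and any limit."""
    grant = min(demand_mhz, idle_capacity_mhz)
    if limit_mhz is not None:
        # The limit is a hard cap, even when idle capacity is available.
        grant = min(grant, limit_mhz)
    return grant


def effective_memory_limit(configured_mb: int,
                           limit_mb: Optional[int] = None) -> int:
    """With no explicit limit, configured memory is the effective limit."""
    return configured_mb if limit_mb is None else limit_mb


# A VM wants 3000 MHz on a host with 10000 MHz idle.
print(allocation(3000, 10000))                   # no limit: demand is met
print(allocation(3000, 10000, limit_mhz=2000))   # limit: capped, 8000 MHz stays idle
print(effective_memory_limit(8192))              # unlimited: configured size applies
```

The second call is the drawback in miniature: the host has capacity to spare, yet the VM is held at its limit.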
In the next part I cover Storage I/O Control … coming soon.