Disclaimer: I work for Hitachi Data Systems. However, this post is not officially sanctioned by my company, nor does it speak on its behalf. These thoughts are my own, based on my experience. Use at your own discretion.
This is the second part of my storage interoperability series designed to bring it all together.
Here are the other parts:
This started from a desire to be crystal clear about what these settings are and when you should and shouldn’t use them. Many vSphere experts may already be aware of these settings, but I’m hoping to bring them to the masses in a single set for easy access. All feedback gratefully accepted!
In this post I will cover some topics around Storage I/O Control. This is the first post related to SIOC; I will have at least one more, and possibly two, depending on how easy each is to digest.
Storage I/O Control (SIOC) is an I/O queue-throttling mechanism enabled at the datastore level in vSphere. It takes action in real time when I/O congestion occurs on any datastore for which it is enabled.
When I say “takes action”, I mean it reduces the device queue depth (the number of I/Os that can be outstanding against a datastore) in real time, with an immediate effect. It’s like driving with the handbrake on or in a lower gear.
It is important to note that device latency and device queuing relate to the individual queue associated with the datastore itself, not to the HBA or VMkernel queues. In this context, “device” refers to the LUN or volume backing the datastore.
You may have seen the well-known image showing the different queues at play in vSphere; the amount of work the kernel is doing is quite amazing really.
SIOC does not take any action during normal operational conditions.
It only intervenes when one of two thresholds set by the administrator is breached:
- An explicit congestion latency threshold is breached for a datastore. This is an observed latency (response time) in milliseconds (ms).
- A percentage of the datastore’s peak performance has been breached. (The peak performance of the datastore is calculated automatically by vSphere.)
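To make the two triggers concrete, here is a minimal Python sketch of the decision SIOC effectively makes per datastore. The function name, default threshold, and percentage value are my own illustrative assumptions, not VMware’s implementation:

```python
def sioc_should_throttle(observed_latency_ms,
                         congestion_threshold_ms=30.0,
                         observed_throughput=None,
                         peak_throughput=None,
                         percent_of_peak=90.0):
    """Illustrative sketch only: SIOC intervenes when EITHER trigger fires.

    Trigger 1: observed datastore latency exceeds the explicit threshold.
    Trigger 2: observed throughput reaches the configured percentage of
    the datastore's automatically modelled peak performance.
    """
    if observed_latency_ms > congestion_threshold_ms:
        return True
    if observed_throughput is not None and peak_throughput:
        return observed_throughput >= peak_throughput * percent_of_peak / 100.0
    return False
```

For example, a datastore showing 35 ms latency against a 30 ms threshold would trigger throttling even if throughput is well below the modelled peak.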
When you enable Storage I/O Control on a datastore, ESXi begins to monitor the device latency observed by connected hosts via a file called IORMSTATS.SF, which is stored directly on the datastore. All connected hosts can read from and write to this file. When device latency exceeds the defined threshold, the datastore is considered congested, and each virtual machine that accesses it is allocated I/O resources in proportion to its shares.
If all virtual machines accessing the datastore have the same shares (the default is Normal for all), each virtual machine is allowed equal access regardless of its size or whether the VMs are running on the same host. So, by default, critical virtual machines have exactly the same priority (shares) as the most unimportant virtual machines.
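The share-proportional division can be sketched in a few lines of Python. This is a simplified model I wrote for illustration; real SIOC adjusts per-host device queue depths across the cluster rather than dividing a single pool:

```python
def allocate_queue_slots(total_slots, vm_shares):
    """Split a congested datastore's queue slots in proportion to shares.

    vm_shares maps VM name -> share value (e.g. Normal = 1000).
    """
    total_shares = sum(vm_shares.values())
    return {vm: total_slots * shares / total_shares
            for vm, shares in vm_shares.items()}

# With equal (default Normal) shares, every VM gets an equal slice,
# regardless of how important it is:
equal = allocate_queue_slots(64, {"critical-db": 1000, "test-vm": 1000})
```

In the equal-shares case above, both VMs end up with 32 slots each, which is exactly the “critical VM treated the same as an unimportant VM” problem the share settings exist to fix.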
SIOC is not just a cluster-wide setting: it applies to any host connected to the datastore. Connected hosts write to the IORMSTATS.SF file held on the datastore regardless of cluster membership, so you could have standalone ESXi hosts managed by vCenter sharing datastores with a cluster. Not a great idea IMHO, but it’s possible.
In an ideal world, do not share datastores between clusters. It is possible, but it makes performance (or other) problems much more difficult to isolate. This might not be achievable when using VSAN or other similar SDS/hyper-converged systems.
In the past I recommended using “swing LUNs” to move virtual machine files between clusters. This is less of an issue now, given the flexibility vCenter supports for cross-vCenter vMotion and similar features.
Unless a storage vendor or VMware has issued guidance to the contrary, enable Storage I/O Control on your datastores. It is still a good idea to check before making wholesale changes affecting hundreds or thousands of virtual machines.
Without SIOC, hosts have equal device queue depths, and a single VM could max out a datastore even though that datastore is being accessed by VMs on other hosts. This situation persists during congestion, so more critical workloads can end up throttled for the benefit of less important ones. It is always a good idea to enable SIOC regardless of the back-end disk configuration, with a threshold high enough not to inhibit normal performance.
Storage I/O Control can very simply apply powerful controls against runaway workloads and prevent so-called “noisy neighbour” syndrome within a vSphere environment.
When configuring SIOC the following should be taken into account:
- The feature is enabled once per datastore, and the setting is inherited by all hosts. As noted above, this can span multiple clusters or any hosts that share the datastore.
- It takes action when the target latency has been exceeded OR the percentage of the datastore’s performance capability has been reached.
  - Both of these settings are customizable for each datastore.
- Enabling virtual machine shares also allows you to set a ceiling of maximum IOPS per VM if you choose, as well as the relative virtual machine shares.
  - This allows relative priority and a level of Quality of Service (QoS) to be defined and enforced during periods of contention.
- With SIOC disabled, all hosts accessing a datastore get an equal portion of that datastore’s available queue. This means a single virtual machine can become a “noisy neighbour” and drive more than its fair share of the available IOPS. In this scenario, no intervention will occur to correct the situation.
By default, all virtual machine shares are set to Normal (1000) with unlimited IOPS. If you want to prioritize some virtual machines over others, the following share values are available:
- Low: 500
- Normal: 1000
- High: 2000
- Custom: Set a custom value for the VM
So the ratio of Low:Normal:High is 1:2:4. This can be adjusted, or custom values can be set, if required.
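A small worked example of that ratio, including the optional per-VM IOPS ceiling, continuing the illustrative Python model. The VM names and numbers are hypothetical, and for simplicity this ignores how IOPS freed up by a limit would be redistributed:

```python
# Built-in share levels; a custom integer can be used in place of a level name.
SHARE_LEVELS = {"low": 500, "normal": 1000, "high": 2000}

def contended_iops(total_iops, vms):
    """vms maps VM name -> (share level or custom value, IOPS limit or None)."""
    shares = {name: SHARE_LEVELS.get(level, level)
              for name, (level, _limit) in vms.items()}
    total_shares = sum(shares.values())
    result = {}
    for name, (_level, limit) in vms.items():
        alloc = total_iops * shares[name] / total_shares
        # An explicit per-VM IOPS limit acts as a ceiling on the allocation.
        result[name] = min(alloc, limit) if limit is not None else alloc
    return result

# 5000 IOPS of datastore capability shared under contention:
demo = contended_iops(5000, {"critical-db": ("high", None),
                             "test-vm": ("low", None)})
# critical-db receives 4x what test-vm does (2000 vs 500 shares)
```

The 1:4 split in the example follows directly from the 500:2000 share values, which is the Low-to-High ratio described above.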
Storage I/O Control requirements and limitations
The following conditions apply when considering SIOC:
- Datastores that are enabled for SIOC must be managed by a single vCenter Server system. Note that vCenter is not part of the data plane; it is only used for management and configuration of the feature.
- Storage I/O Control is supported on Fibre Channel-connected, iSCSI-connected, and NFS-connected storage.
- Raw Device Mapping (RDM) is not supported for SIOC.
- Storage I/O Control does not support datastores with multiple extents. (An extent is analogous to a member of a concatenated, non-striped volume.)
It’s best not to allow clusters or hosts managed by multiple vCenter instances concurrent access to a datastore backed by a common disk set. Wherever possible, ensure back-end spindles are dedicated to clusters or individual hosts managed by the same vCenter instance.
Many people wonder whether you can use storage tiering with SIOC. I will cover this in the next post, but it is safe to use, with some caveats, and this is one you should check with your vendor to ensure they have qualified it!
Set a threshold that is “impossibly” high so it is not “in play” in the normal operational state. For an SSD-backed tiered pool, for example, make sure the threshold is at least 10-20 ms; SIOC should then never intervene unless serious problems occur.
It should make sense now that you should not allow an external physical host to use the same backend disk pool (HDP or HDT) as a vSphere host or cluster on which SIOC is enabled. This is just bad design and an accident waiting to happen.
NOTE: Don’t forget about VADP backup proxy servers accessing datastores/LUNs directly for SAN-based backup, and the IOPS impact this could have on your back-end storage.
Until the next time …….