Switch Port Mismatch
Recently, while setting up a network configuration in vSphere, we ran into a strange scenario. It was not a greenfield site from a networking or vSphere perspective, and the config was subject to certain constraints which I won’t go into here.
We had reason to have the following network configuration, using vSphere standard switches. A green link indicates an Active adapter for the VMkernel port or virtual machine port group; an amber link denotes Standby. This environment did not have Enterprise Plus, so no vDS.
As you can see, the plan was to isolate vMotion on its own active vmnic.
This is a deterministic logical design, where we are not using LAG/LACP/PortChannel/EtherChannel.
The main reason for isolating vMotion was that it was being used for a large number of VM migrations from an older cluster to a newer one. The number of uplinks was not ideal, and in fact this configuration has been modified completely in terms of port groups, uplinks and VLANs, to anonymise the customer setup.
So vmnic0 is set up as an access port on the switch. All other physical ports are set up as trunk ports, as they carry multiple VLANs. Most people know that’s the definition of a trunk port – the ability to carry multiple VLANs.
The vMotion VMkernel port is configured to use vmnic0 as its active path and vmnic1 as its standby.
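For reference, the same active/standby teaming can be set from the ESXi shell. This is a sketch only – the port group name “vMotion” and the vmnic order are assumptions about this particular setup:

```shell
# Set vmnic0 active and vmnic1 standby for the vMotion port group
# (port group name assumed; adjust to match your environment)
esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name=vMotion \
    --active-uplinks=vmnic0 \
    --standby-uplinks=vmnic1

# Verify the resulting teaming policy
esxcli network vswitch standard portgroup policy failover get \
    --portgroup-name=vMotion
```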
And because the Management port group is also using VLAN 670, rightly or wrongly, and uses vmnic0 as its standby, only VLAN 670 is required on vmnic0. That’s why Port 0 is configured as an access port.
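On the physical side, the original switch configuration looked roughly like this. This is a hedged sketch in Cisco IOS syntax – the interface names and the trunk VLAN list are assumptions, not the customer’s actual config:

```
! Port 0 - access port, carries only the Management/vMotion VLAN
interface GigabitEthernet0/0
 switchport mode access
 switchport access vlan 670
!
! Remaining ports - trunks carrying multiple VLANs (VLAN list assumed)
interface GigabitEthernet0/1
 switchport mode trunk
 switchport trunk allowed vlan 670,671,672
```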
[ Aside: If you’re going to use VLANs, don’t use the same VLAN for Management and vMotion VMkernel ports. That’s not good practice, but this was a legacy configuration. ]
The problem occurs immediately after installation of ESXi (5.1U1 or 5.1U2 – the version doesn’t matter). Once the management IP address is configured and the VLAN is set, all is good. We can ping the gateway and look up DNS.
Once we add the second vmnic (vmnic1) to the management configuration on the DCUI, the server disappears off the network. We were unable to ping it from another system on the same subnet.
We restarted the management agents (not the network) using the troubleshooting submenu on the DCUI.
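The equivalent of the DCUI’s restart can also be done from the ESXi 5.x shell – a sketch, which restarts the hostd and vpxa agents without touching the network stack:

```shell
# Restart the host agent and the vCenter agent on ESXi 5.x
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
```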
More weirdness ensued…..
- Firstly, we noticed the hostd daemon was continually crashing on the host in question.
- We also noticed that sometimes when we performed a “Test Management Network” on the host, it worked the first time.
- The next time we did it immediately after, everything failed (ping gateway, DNS servers and resolve DNS name).
- When we googled it, we found a KB article describing the creation of a file called /etc/hosts.backup in this scenario, along with a resolution:
As suggested, we removed the file and could again test the management network and ping the host. However, when we tested the management network, hosts.backup was created again, and the next “Test Management Network” failed.
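The workaround from the KB boiled down to deleting the stale backup file from the ESXi shell – a sketch:

```shell
# Remove the stale backup so "Test Management Network" can succeed again
if [ -f /etc/hosts.backup ]; then
    rm /etc/hosts.backup
fi
```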
And round and round we go…..
Initially we suspected we had a hardware issue.
- We had two CCIEs on site who could see nothing wrong with the network configuration.
- We had a second host which was working OK, with no visible problems. The failing host was the main focus initially, so we tried to rule out hardware issues or a possibly bad install by rebuilding the host and swapping physical switch ports between the servers. None of this resolved it.
It was actually one of my colleagues who figured it out. Despite the reassurances from our network comrades, it turned out to be simple.
We had to change switch Port 0 from an access port to a trunk port, still with a single VLAN defined.
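In IOS terms, the fix amounts to something like this (a sketch; the interface name is an assumption):

```
! Port 0 reconfigured as a trunk, still carrying only VLAN 670
interface GigabitEthernet0/0
 switchport mode trunk
 switchport trunk allowed vlan 670
```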
All of a sudden it all worked.
If you think about it, when the Management VMkernel port is on vmnic1, frames are VLAN-tagged and the trunk port works as expected. When we pull the vmnic1 cable and it fails over to vmnic0, we are then using an access port. The VLAN is no longer tagged, so to speak, at least from the pSwitch’s perspective, and this causes the port group to malfunction.
If we apply the same logic to the vMotion port group, the starting position must be to leave it untagged on the vSwitch; otherwise it also malfunctions from the get-go. It works fine until we pull vmnic0. Then it fails over to Port 1, which is a trunk port, and with no VLAN tagging it immediately gets into trouble.
So we have an inconsistent model. Even though the access port is set for VLAN670 and should work properly, it doesn’t.
Now …. from the Cisco website:
Configuring Access and Trunk Interfaces
Configuring a LAN Interface as an Ethernet Access Port
You can configure an Ethernet interface as an access port. An access port transmits packets on only one, untagged VLAN. You specify which VLAN traffic that the interface carries. If you do not specify a VLAN for an access port, the interface carries traffic only on the default VLAN. The default VLAN is VLAN1.
The VLAN must exist before you can specify that VLAN as an access VLAN. The system shuts down an access port that is assigned to an access VLAN that does not exist.
Anyway, this was an issue where the suspicion of host-specific problems distracted us from the real cause. It’s incredible how something like this can lead to arbitrary, inconsistent behaviour.