Multi NIC vMotion with jumbo frames on directly connected ESXi 5 hosts

For licensing reasons I run a 2-node cluster with 1 quad-core CPU per node. Each host has 8 NICs. I’m using 2 for VM network traffic and want to use 2 for the management network. I wanted to use the other 4 NICs for multi NIC vMotion. To save switch ports, I connected both hosts directly. And because I saw no reason not to use jumbo frames, I wanted to set those up too.

Now, to enable the vMotion process to make use of all the uplinks, you’ll have to assign one VMKernel port to one vmnic only. It works much the same as the software iSCSI setup. Thank you Duncan for explaining this in detail. Just creating a vSwitch with one VMKernel port group and assigning multiple uplinks won’t cut it. That will only use the multiple uplinks when vMotioning multiple VMs: one vMotion per uplink. In multi NIC vMotion, all uplinks are used for every vMotion.

Ok, let’s get started. To make things easy for yourself, make sure the UTP cables are connected to the same NIC ports on both hosts. If the hosts are installed in the same manner, the physical ports should map to the same vmnics. In my example, I’m using vmnic1 & vmnic2 of an Intel 82850 quad port NIC and vmnic5 & vmnic6 of a Broadcom BCM5719 quad port NIC. Although vMotion is not that critical an operation, there is no reason not to apply the same redundancy best practices as you would on your VM networks and management network(s). It’s never comfortable to have a vMotion interrupted because one of your NICs fails. Because of auto-MDIX you can connect the NICs directly without any risk. I have, however, configured the 4 vmnics in one host as ‘1000/FULL’. The 4 vmnics in the other host are configured to ‘Auto negotiate’.
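
I set the speed and duplex from the vSphere Client, but if you prefer the ESXi shell, something along these lines should do the same (the vmnic names are just my example):

    # Show all physical NICs with their current link state, speed and duplex
    esxcfg-nics -l
    # Force a vmnic to 1000 Mbps full duplex (repeat for each vMotion vmnic)
    esxcfg-nics -s 1000 -d full vmnic1
    # Or put a vmnic (back) on auto negotiation
    esxcfg-nics -a vmnic1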

NOTE: After several hours of research (see the problem at the end of this post), I have to conclude that adding multiple NICs to 1 vSwitch will NOT (always) work. Although port groups are bound to a vmnic, it seems the switch does not send IP traffic through that vmnic by default. It observes the IP ranges and then decides which vmnic to use. The simplest way to solve this is to use 1 vSwitch per port group. It’s a little more work, but all other settings below are the same. And it will work every time =)

You’ll end up with something similar to this (notice 1 vmnic per vSwitch):

Create a new vSwitch with 4 VMKernel ports and add the vmnics to the vSwitch (vmnic1, 2, 5 and 6).
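
For those who like the command line better than the vSphere Client, the vSwitch and port group creation looks roughly like this from the ESXi shell (treat it as a sketch; the vSwitch and port group names are just examples, and for the 1-vSwitch-per-port-group variant from the note above you simply repeat it with one vmnic per vSwitch):

    # Create the vSwitch and attach the 4 uplinks
    esxcli network vswitch standard add --vswitch-name=vSwitch1
    esxcli network vswitch standard uplink add --vswitch-name=vSwitch1 --uplink-name=vmnic1
    esxcli network vswitch standard uplink add --vswitch-name=vSwitch1 --uplink-name=vmnic2
    esxcli network vswitch standard uplink add --vswitch-name=vSwitch1 --uplink-name=vmnic5
    esxcli network vswitch standard uplink add --vswitch-name=vSwitch1 --uplink-name=vmnic6
    # Create one port group per VMKernel port
    esxcli network vswitch standard portgroup add --vswitch-name=vSwitch1 --portgroup-name=vMotion-1
    esxcli network vswitch standard portgroup add --vswitch-name=vSwitch1 --portgroup-name=vMotion-2
    esxcli network vswitch standard portgroup add --vswitch-name=vSwitch1 --portgroup-name=vMotion-3
    esxcli network vswitch standard portgroup add --vswitch-name=vSwitch1 --portgroup-name=vMotion-4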

Enable vMotion on all port groups and enable jumbo frames (MTU 9000) on the vSwitch AND on all 4 port groups.
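
Again, I did this in the vSphere Client, but the shell equivalent should look roughly like this (vmk1 and vMotion-1 are example names, repeat for the other 3):

    # Jumbo frames on the vSwitch
    esxcli network vswitch standard set --vswitch-name=vSwitch1 --mtu=9000
    # Create the VMKernel interface in the port group and raise its MTU as well
    esxcli network ip interface add --interface-name=vmk1 --portgroup-name=vMotion-1
    esxcli network ip interface set --interface-name=vmk1 --mtu=9000
    # Enable vMotion on the VMKernel interface
    vim-cmd hostsvc/vmotion/vnic_set vmk1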

Now, just like with software iSCSI, you’ll have to override the switch failover order and assign 1 active vmnic to each port group (the same on both servers!) and mark the others as ‘unused’.
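
From the shell, the override should look something like this (I only specify the active uplink; the uplinks you leave out should end up as ‘unused’, but do verify that in the client):

    esxcli network vswitch standard portgroup policy failover set --portgroup-name=vMotion-1 --active-uplinks=vmnic1
    esxcli network vswitch standard portgroup policy failover set --portgroup-name=vMotion-2 --active-uplinks=vmnic2
    esxcli network vswitch standard portgroup policy failover set --portgroup-name=vMotion-3 --active-uplinks=vmnic5
    esxcli network vswitch standard portgroup policy failover set --portgroup-name=vMotion-4 --active-uplinks=vmnic6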

When you created the VMKernel port groups, you probably chose a class A, B or C private network with a /24 netmask, right? It’s even given as an example in the link I provided above. Now when you try to vMotion, everything seems to go fine until the progress bar gets stuck at 9% and you receive the error:

The vMotion migrations failed because the ESX hosts were not able to connect over the vMotion network. Check the vMotion network settings and physical network configuration.
vMotion migration [168364033:1341590814020069] vMotion migration [168364033:1341590814020069] stream thread failed to connect to the remote host <10.9.8.30>: The ESX hosts failed to connect over the VMotion network
Migration [168364033:1341590814020069] failed to connect to remote host <10.9.8.30> from host <10.9.8.29>: Timeout
Module Migrate power on failed.
vMotion migration [168364033:1341590814020069] failed to read stream keepalive: Connection closed by remote host, possibly due to timeout

The error is correct: the ESXi hosts can’t connect! Think about it. Even though you separated the port groups and connected the NICs with directly attached UTP cables, there is nothing in VMware that prevents it from randomly connecting to one of the other 4 IP addresses of the other host for the vMotion operation. You put them all in one /24 subnet, didn’t you? This is exactly why these settings will work when you have the hosts connected to a switch, but not when you use a directly connected cable.

To solve this, you’ll have to create separate networks for every VMKernel port group. I recommend using /30 networks. I haven’t tried /31 networks, but I can’t really see a reason why you would need them. Maybe one of you can comment. To make it easy to choose the correct IP addresses for the hosts, use this cheat sheet.
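
If you haven’t used /30 networks before, here is the full layout of the first subnet I picked, as an example (which host gets which of the 2 usable addresses doesn’t matter, as long as you keep it consistent per link):

    10.9.8.0/30 (netmask 255.255.255.252)
      10.9.8.0  network address
      10.9.8.1  vMotion VMKernel port on host A
      10.9.8.2  vMotion VMKernel port on host B
      10.9.8.3  broadcast address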

So I created 4 networks: 10.9.8.0/30, 10.9.8.28/30, 10.9.8.60/30 and 10.9.8.92/30. I assigned the 2 usable IP addresses from each subnet to the corresponding VMKernel port group on each host. The subnet mask is 255.255.255.252.
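
Assigning the addresses from the shell would look something like this on the first host (the vmk names and the host A/B split are just my example; the second host gets the other address from each /30):

    esxcli network ip interface ipv4 set --interface-name=vmk1 --ipv4=10.9.8.1  --netmask=255.255.255.252 --type=static
    esxcli network ip interface ipv4 set --interface-name=vmk2 --ipv4=10.9.8.29 --netmask=255.255.255.252 --type=static
    esxcli network ip interface ipv4 set --interface-name=vmk3 --ipv4=10.9.8.61 --netmask=255.255.255.252 --type=static
    esxcli network ip interface ipv4 set --interface-name=vmk4 --ipv4=10.9.8.93 --netmask=255.255.255.252 --type=static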

This way you make sure there is only 1 possible connection: each IP connection can only exist between the IP addresses on both ends of the directly attached UTP cable. This is also the reason why you should mark the other vmnics as ‘unused’ instead of ‘standby’ in the failover order on the port group: it is impossible for one vmnic to take the place of another in case of a failure.
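
Before firing off an actual vMotion, a quick vmkping from one host to the other end of every /30 tells you whether the links and the jumbo frames are OK (8972 bytes is 9000 minus the IP and ICMP headers; -d forbids fragmentation):

    vmkping -d -s 8972 10.9.8.2
    vmkping -d -s 8972 10.9.8.30
    vmkping -d -s 8972 10.9.8.62
    vmkping -d -s 8972 10.9.8.94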

To test the connection, I made a 16GB WS2K8R2SP1 VM and moved it from 1 host to the other. Here are the esxtop screenshots:

Beautiful! vMotion completed in 21 seconds 😀

I’ll report back when the upgrade to 10GBit is completed 😉

UPDATE:

After rebooting 1 of the 2 hosts in the cluster, the vMotion error returned. I checked all settings, but no config was lost. I managed to get things going again by changing the active adapter to a standby adapter, ignoring the warning, and then changing it back to active again.

I only had to do this on the rebooted host. I’m still not sure why this is happening. It seems the binding of the port group’s IP address occurs on all adapters at boot. If that is the case, this would certainly be a bug. Maybe a reader can comment?
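
If someone wants to dig into this after a reboot, these commands show what the host itself thinks of the VMKernel interfaces and of the failover policy of a port group (vMotion-1 again being my example name):

    # List all VMKernel interfaces with their port group, IP and MTU
    esxcfg-vmknic -l
    esxcli network ip interface list
    # Show the effective failover policy of one of the vMotion port groups
    esxcli network vswitch standard portgroup policy failover get --portgroup-name=vMotion-1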

NOTE: The problem keeps coming back. See the note at the top of this post for a solution.


About Yuri de Jager
Technology Addict

One Response to Multi NIC vMotion with jumbo frames on directly connected ESXi 5 hosts

  1. Cedric says:

    We have a similar problem when we deploy with kickstart, but never when we configure it from the vSphere Client. We have a vSwitch with the management network and 2 vMotion VMkernel ports (and 3 vmnics), all using different subnets. vMotion stops at 9% with the same error, and when we look at it, it tries to reach the destination vmkernel through the wrong device (which is in another subnet).

    When the problem occurs, we are obliged to delete the port groups and reconfigure vMotion on all ESXi hosts manually… and then it always works. For management network redundancy, we can’t afford to have distinct vSwitches.

    There are a lot of problems with vmkernel and networking. It would be good if VMware fixed them.
