Securing your HTTPS Apache 2.4 web server with correct parameters

Warning: Keep in mind this is an ongoing field that is quickly changing. Vulnerabilities in protocols and implementations are discovered daily. If you read this information in a few months or even weeks, things could be radically different.

Over the last few days I’ve been researching HTTPS connections on Apache 2.4 webservers. This research was sparked by the recent Heartbleed bug in OpenSSL. I’ve been reading up on vulnerabilities, cipher suites, encryption and hashing methods ever since.

The general move to SSL certificates with a bitlength > 1024bit also fueled this research. Microsoft removed support for 1024bit root CA certificates from their Operating System through a patch in March 2014.

One particular website that has a lot of information is https://www.ssllabs.com. They host a nifty server test tool at https://www.ssllabs.com/ssltest/. When testing our production webservers, we got a very poor result: a C, for supporting weak cipher suites. We were also not mitigated against the BEAST vulnerability. This was tested after we patched Heartbleed.

[Screenshot: SSL Labs report showing grade C because of weak cipher suites]

As you can see in the first line, the server had no preference in cipher suite order. Why is this a problem you ask? With modern browsers you always use a newer cipher suite anyway. The point is that external parties (hackers) can force your web server to use an insecure cipher suite to communicate with them.
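To make the effect of server-side preference a bit more concrete, here is a minimal Python sketch of how a suite could be picked from the two lists. It is a simplified model for illustration, not how OpenSSL or mod_ssl is actually implemented; the function name `negotiate` and both sample lists are made up.

```python
def negotiate(client_ciphers, server_ciphers, honor_server_order):
    """Return the first mutually supported suite, or None.

    Simplified model: without server preference the client's list
    decides the outcome, with server preference the server's list does.
    """
    if honor_server_order:
        preferred, other = server_ciphers, client_ciphers
    else:
        preferred, other = client_ciphers, server_ciphers
    for suite in preferred:
        if suite in other:
            return suite
    return None


# Hypothetical lists: this client deliberately puts a weak suite first.
client = ["RC4-MD5", "ECDHE-RSA-AES256-GCM-SHA384"]
server = ["ECDHE-RSA-AES256-GCM-SHA384", "RC4-MD5"]

print(negotiate(client, server, honor_server_order=False))
print(negotiate(client, server, honor_server_order=True))
```

Keep in mind that server preference only helps when the weak suites are also removed from the server list. A client that offers nothing but a weak suite would still get it in this model.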

What shocked me most was that we, as sysadmins, myself included, never really gave much thought to this. The general idea is that you put an SSL certificate on your webserver and it’s secured. As with so many security related products, this assumption is dead wrong. The default mod_ssl settings are optimized for compatibility, not security. With our web servers getting 10 million+ hits a month and serving privacy sensitive information, it didn’t take much thought to see that this needed some serious attention.

After some time we came to the following requirements:

  • Create a config that is as secure as possible.
  • All common browsers should be able to connect to our websites securely. The highly unsafe IE browsers on the Windows XP platform included.
  • SSL3 must be supported for older browsers/platforms.
  • We must be mitigated against all known attacks, such as BEAST, CRIME, BREACH, Lucky Thirteen, padding attacks, renegotiation, etc.

This turned out to be quite a challenge. BEAST deserved some special attention, because mitigating BEAST on the server side is only possible by using RC4 cipher suites in TLS1.0 and SSLv3 connections. Unfortunately, research into cracking RC4 encryption got a serious boost in March 2013.

After days of testing, I came up with the following:

  • Use Apache 2.4.7. This version does not allow the key exchange in Diffie-Hellman cipher suites to be less than 2048bit. It will use the bitlength of the SSL certificate but will use no less than 2048bit.
  • Use an Apache binary that is compiled against a recent version (1.0.1g) of the OpenSSL lib. This will ensure the serving of ECDHE and ECDHE-ECDSA cipher suites.
  • Use a 4096bit SSL certificate. This will strengthen the DHE key exchange mechanism.
  • Specifically disable SSLv2 even though it is not supported anymore with recent OpenSSL libs.
  • Force the cipher suite order in mod_ssl.conf.
  • Use a custom cipher suite list that satisfies your specific needs.
  • Use mod_socache_shmcb to allow session caching and session resumption.
  • Use the Strict-Transport-Security parameter in your Virtual Host config to support HSTS.

Cipher suites are selected on the following criteria in order of importance:

  • Prefer ECDHE-ECDSA cipher suites
  • Prefer ECDHE cipher suites
  • Prefer DHE cipher suites
  • Prefer GCM block cipher suites
  • Prefer CBC block cipher suites
  • Prefer SHA384 hashing
  • Prefer SHA256 hashing
  • Prefer SHA hashing
  • Prefer cipher suites which support Forward Secrecy
  • Prefer cipher suites with 256bit encryption
  • Prefer cipher suites with 128bit encryption
  • Prefer cipher suites with 112bit encryption

As you can see, encryption bit length is only a minor factor in this. Forward Secrecy preference is implicitly done by preferring the ECDHE and DHE cipher suites.
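The criteria above can be read as a sort key. The following Python sketch ranks OpenSSL-style suite names that way. It is only an illustration of the ordering idea: the function name `preference_key` and the sample list are assumptions, the name parsing is deliberately naive, and it does not replace the OpenSSL cipher string syntax used in the config.

```python
def preference_key(name):
    """Sort key for an OpenSSL-style suite name; lower tuples sort first."""
    parts = name.split("-")

    # Key exchange / authentication: ECDHE-ECDSA, then ECDHE, then DHE.
    if name.startswith("ECDHE-ECDSA"):
        kx = 0
    elif name.startswith("ECDHE-"):
        kx = 1
    elif name.startswith(("DHE-", "EDH-")):
        kx = 2
    else:
        kx = 3  # plain RSA key exchange, no Forward Secrecy

    # Block cipher mode: GCM before CBC.
    mode = 0 if "GCM" in parts else 1

    # Hash: SHA384, then SHA256, then SHA.
    digest = {"SHA384": 0, "SHA256": 1}.get(parts[-1], 2)

    # Encryption bit length comes last.
    if "AES256" in parts or "CAMELLIA256" in parts:
        bits = 0
    elif "AES128" in parts or "CAMELLIA128" in parts:
        bits = 1
    else:
        bits = 2  # everything else, e.g. 3DES (RC4 and SEED are not modelled)

    return (kx, mode, digest, bits)


suites = [
    "AES128-SHA",
    "DHE-RSA-AES128-GCM-SHA256",
    "ECDHE-RSA-AES256-SHA384",
    "ECDHE-ECDSA-AES128-GCM-SHA256",
    "DES-CBC3-SHA",
]
for suite in sorted(suites, key=preference_key):
    print(suite)
```

Because the key exchange is the first element of the tuple, a 128bit ECDHE-ECDSA suite sorts before a 256bit ECDHE-RSA suite, which matches the remark that bit length is only a minor factor.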

For us, this would result in the mod_ssl config:

SSLSessionCache shmcb:/var/cache/mod_ssl/scache(512000)
SSLSessionCacheTimeout 300
SSLProtocol all -SSLv2
SSLHonorCipherOrder On
SSLCompression Off
SSLCipherSuite EECDH+ECDSA+AESGCM:EECDH+aRSA+AESGCM:EECDH+ECDSA+SHA384:EECDH+ECDSA+SHA256:EECDH+aRSA+SHA384:EECDH+aRSA+SHA256:EECDH+aRSA+RC4:EECDH:EDH+aRSA:DHE-RSA-CAMELLIA256-SHA:DHE-RSA-CAMELLIA128-SHA:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-RSA-SEED-SHA:AES256-SHA256:AES128-SHA256:AES128-SHA:DHE-RSA-DES-CBC3-SHA:DES-CBC3-SHA:RC4-SHA:!aNULL:!eNULL:!ADH:!EXP:!LOW:!DES:!MD5:!PSK:!SRP:!DSS

And the httpd config:

<VirtualHost *:443>

Header always set Strict-Transport-Security "max-age=63072000; includeSubDomains"
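To see what the header value above actually tells a browser, here is a small Python sketch that splits it into its directives. The helper name `parse_hsts` is made up for this example and it only handles the two directives used here.

```python
def parse_hsts(value):
    """Split a Strict-Transport-Security value into (max_age_seconds, include_subdomains)."""
    max_age = None
    include_subdomains = False
    for directive in value.split(";"):
        directive = directive.strip()
        if directive.lower().startswith("max-age="):
            max_age = int(directive.split("=", 1)[1].strip('"'))
        elif directive.lower() == "includesubdomains":
            include_subdomains = True
    return max_age, include_subdomains


max_age, subdomains = parse_hsts("max-age=63072000; includeSubDomains")
print(max_age // 86400, "days, subdomains included:", subdomains)
```

With 86400 seconds in a day, a max-age of 63072000 works out to 730 days, so browsers will remember to use HTTPS for this site and its subdomains for about two years after their last visit.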

Alternatively, you can choose not to support CAMELLIA and SEED ciphers with the following parameter:

SSLCipherSuite EECDH+ECDSA+AESGCM:EECDH+aRSA+AESGCM:EECDH+ECDSA+SHA384:EECDH+ECDSA+SHA256:EECDH+aRSA+SHA384:EECDH+aRSA+SHA256:EECDH+aRSA+RC4:EECDH:EDH+aRSA:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:AES256-SHA256:AES128-SHA256:AES128-SHA:DHE-RSA-DES-CBC3-SHA:DES-CBC3-SHA:RC4-SHA:!aNULL:!eNULL:!ADH:!EXP:!LOW:!DES:!MD5:!PSK:!SRP:!DSS

It’s up for debate if TLS_DHE_RSA_WITH_3DES_EDE_CBC_SHA (DHE-RSA-DES-CBC3-SHA) and TLS_RSA_WITH_3DES_EDE_CBC_SHA (DES-CBC3-SHA) are still wanted. These create connections with an effective cipher strength of 112bit instead of the 168bit you might expect. If you require 128bit, leave them out. The DHE-RSA-DES-CBC3-SHA cipher does provide Forward Secrecy, so the key to decode one session cannot be used for another session.

You might want to put TLS_DHE_RSA_WITH_AES_128_CBC_SHA (DHE-RSA-AES128-SHA) before TLS_DHE_RSA_WITH_3DES_EDE_CBC_SHA (DHE-RSA-DES-CBC3-SHA), but since it’s included in EDH+aRSA, you’ll have to write it out.

Above settings result in an A+ rated site with BEAST mitigation, Strict Transport Security (HSTS), Forward Secrecy with all common browsers and no RC4 in vulnerable cipher suites:

 

[Screenshots: SSL Labs A+ rating, protocol details, handshake simulation, and cipher suites]

If you are not interested in mitigating BEAST (as most browsers are patched), you could use the following order:

SSLCipherSuite ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-RSA-DES-CBC3-SHA:ECDH-ECDSA-AES256-GCM-SHA384:ECDH-ECDSA-AES256-SHA384:ECDH-RSA-AES256-GCM-SHA384:ECDH-RSA-AES256-SHA384:ECDH-ECDSA-AES128-GCM-SHA256:ECDH-ECDSA-AES128-SHA256:ECDH-RSA-AES128-GCM-SHA256:ECDH-RSA-AES128-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-SHA256:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES128-SHA256:DHE-RSA-CAMELLIA256-SHA:DHE-RSA-CAMELLIA128-SHA:DHE-RSA-SEED-SHA:CAMELLIA256-SHA:CAMELLIA128-SHA:DES-CBC3-SHA:!aNULL:!eNULL:!ADH:!EXP:!LOW:!DES:!MD5:!PSK:!SRP:!DSS

Again, CAMELLIA and SEED could be left out:

SSLCipherSuite ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-RSA-DES-CBC3-SHA:ECDH-ECDSA-AES256-GCM-SHA384:ECDH-ECDSA-AES256-SHA384:ECDH-RSA-AES256-GCM-SHA384:ECDH-RSA-AES256-SHA384:ECDH-ECDSA-AES128-GCM-SHA256:ECDH-ECDSA-AES128-SHA256:ECDH-RSA-AES128-GCM-SHA256:ECDH-RSA-AES128-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-SHA256:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES128-SHA256:DES-CBC3-SHA:!aNULL:!eNULL:!ADH:!EXP:!LOW:!DES:!MD5:!PSK:!SRP:!DSS

And explicit denial of all RC4 encryption might be preferable:

SSLCipherSuite ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-RSA-DES-CBC3-SHA:ECDH-ECDSA-AES256-GCM-SHA384:ECDH-ECDSA-AES256-SHA384:ECDH-RSA-AES256-GCM-SHA384:ECDH-RSA-AES256-SHA384:ECDH-ECDSA-AES128-GCM-SHA256:ECDH-ECDSA-AES128-SHA256:ECDH-RSA-AES128-GCM-SHA256:ECDH-RSA-AES128-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-SHA256:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES128-SHA256:DES-CBC3-SHA:!aNULL:!eNULL:!ADH:!EXP:!LOW:!DES:!MD5:!PSK:!SRP:!DSS:!RC4

I would also advise leaving out the 112bit TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA (ECDHE-RSA-DES-CBC3-SHA) cipher suite:

SSLCipherSuite ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDH-ECDSA-AES256-GCM-SHA384:ECDH-ECDSA-AES256-SHA384:ECDH-RSA-AES256-GCM-SHA384:ECDH-RSA-AES256-SHA384:ECDH-ECDSA-AES128-GCM-SHA256:ECDH-ECDSA-AES128-SHA256:ECDH-RSA-AES128-GCM-SHA256:ECDH-RSA-AES128-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-SHA256:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES128-SHA256:DES-CBC3-SHA:!aNULL:!eNULL:!ADH:!EXP:!LOW:!DES:!MD5:!PSK:!SRP:!DSS:!RC4

This results in all modern browsers/platforms (IE on Vista and higher, Android 4.0.4 and higher, Firefox, Chrome and Safari) using ECDHE cipher suites with 256bit encryption. The TLS_RSA_WITH_3DES_EDE_CBC_SHA (DES-CBC3-SHA) suite will be used for all exceptions. Encryption with the potentially vulnerable RC4 cipher is also prevented.
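Cipher strings this long are easy to get wrong by hand. As a rough aid, the Python sketch below does a plain text check on an SSLCipherSuite value: it reports suites that are listed twice and entries that still enable RC4 or 3DES. The function name `lint_cipher_string` and the shortened sample string are assumptions for this example, and the check does not expand OpenSSL aliases the way OpenSSL itself does.

```python
from collections import Counter


def lint_cipher_string(cipher_string):
    """Report duplicate, RC4 and 3DES entries in a colon-separated cipher string."""
    tokens = cipher_string.split(":")
    # Entries starting with "!" are exclusions, so they are not checked.
    enabled = [t for t in tokens if not t.startswith("!")]
    counts = Counter(enabled)
    return {
        "duplicates": [t for t in counts if counts[t] > 1],
        "rc4": [t for t in enabled if "RC4" in t],
        "3des": [t for t in enabled if "DES-CBC3" in t or "3DES" in t],
    }


# Shortened, hypothetical sample string for illustration.
sample = (
    "ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:"
    "ECDHE-RSA-AES256-SHA384:DES-CBC3-SHA:!aNULL:!RC4"
)
print(lint_cipher_string(sample))
```

A duplicate entry is harmless, but it is usually a sign of a copy and paste slip. To see what a string really expands to on your system, the `openssl ciphers` command remains the authoritative check.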

 

References:
“Elliptic curve cryptography” http://en.wikipedia.org/wiki/Elliptic_curve_cryptography
“Elliptic Curve DSA” http://en.wikipedia.org/wiki/Elliptic_Curve_DSA
“Elliptic curve Diffie–Hellman” http://en.wikipedia.org/wiki/ECDHE
“SSL/TLS Deployment Best Practices” https://www.ssllabs.com/projects/best-practices/
“Configuring Apache, Nginx, and OpenSSL for Forward Secrecy” https://community.qualys.com/blogs/securitylabs/2013/08/05/configuring-apache-nginx-and-openssl-for-forward-secrecy
“RC4 in TLS is Broken: Now What?” https://community.qualys.com/blogs/securitylabs/2013/03/19/rc4-in-tls-is-broken-now-what
“Updated SSL/TLS Deployment Best Practices Deprecate RC4” https://community.qualys.com/blogs/securitylabs/2013/09/17/updated-ssltls-deployment-best-practices-deprecate-rc4
“SSL Labs Test for the Heartbleed Attack” https://community.qualys.com/blogs/securitylabs/2014/04/08/ssl-labs-test-for-the-heartbleed-attack
“SSL Labs: Stricter Security Requirements for 2014” https://community.qualys.com/blogs/securitylabs/2014/01/21/ssl-labs-stricter-security-requirements-for-2014
“SPDY” http://en.wikipedia.org/wiki/SPDY
“ChaCha20” http://en.wikipedia.org/wiki/ChaCha20#ChaCha_variant
“Poly1305-AES” http://en.wikipedia.org/wiki/Poly1305
“QUIC” http://en.wikipedia.org/wiki/QUIC
Session Resumption http://en.wikipedia.org/wiki/Transport_Layer_Security#Session_IDs
“HTTP Strict Transport Security” http://en.wikipedia.org/wiki/HTTP_Strict_Transport_Security
“Nginx” http://en.wikipedia.org/wiki/Nginx
“Strong SSL Security on Apache2” https://raymii.org/s/tutorials/Strong_SSL_Security_On_Apache2.html
“Increasing DHE strength on Apache 2.4.x” http://blog.ivanristic.com/2013/08/increasing-dhe-strength-on-apache.html
“OpenSSL cipher suites” https://www.openssl.org/docs/apps/ciphers.html
https://www.ssllabs.com/ssltest/analyze.html?d=payload.hu. This site runs Apache. Its cipher suites are: “EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH+aRSA+RC4 EECDH EDH+aRSA RC4 !aNULL !eNULL !LOW !3DES !MD5 !EXP !PSK !SRP !DSS”.
https://www.ssllabs.com/ssltest/analyze.html?d=icnseo.com This server runs nginx. Its cipher suites are: ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-AES256-SHA:HIGH:!aNULL:!eNULL:!LOW:!3DES:!MD5:!EXP:!CBC:!EDH:!kEDH:!PSK:!SRP:!kECDH;
https://www.ssllabs.com/ssltest/analyze.html?d=blck.io. This site uses mod_spdy for Apache 2.4.
https://github.com/eousphoros/mod-spdy mod_spdy.
chrome://net-internals/#spdy Monitor SPDY connections in the Chrome browser

Remediate entity error: Host cannot download files from VMware vSphere Update Manager patch store. Check the network connectivity and firewall setup, and check esxupdate logs for details.

While trying to remediate an ESXi server I got the following error:

Host cannot download files from VMware vSphere Update Manager patch store. Check the network connectivity and firewall setup, and check esxupdate logs for details.

A quick cat of /var/log/esxupdate.log revealed what was going on

vmware-remediate-error-01

After changing the config to hold the correct IP addresses of the DNS servers all went fine 🙂

Remove host from cluster error “Cannot remove the host because it’s part of VDS dvSwitch”

I received an error while trying to remove a host from a cluster.

Cannot remove the host [hostname] because it’s part of VDS [dvSwitch name]

remove-host-dvswitch-01

 

 

This error is correct 🙂 As you can probably see by its name, this host is connected to a dvSwitch called dvSwitch-vMotion. Obviously it’s used for vMotion. To be able to remove the host from the cluster I had to disconnect it from the dvSwitch.

Press Ctrl-Shift+N and you get to the networking part of your vCenter Server Inventory. Select your dvSwitch

remove-host-dvswitch-02

 

and select the Host tab

remove-host-dvswitch-03

 

Now select the host you want to remove and right-click on it. Select the ‘Remove from vSphere Distributed Switch’ option

remove-host-dvswitch-04

 

Read the following warning very closely. You could get yourself into trouble if you disconnect the wrong port-groups

remove-host-dvswitch-05

 

If you have decided it’s safe to remove the host anyway, you will most probably receive the following error

remove-host-dvswitch-06

 

vDS [dvSwitch name] port [port number] is still on host [hostname] connected to [hostname] nic=vmk1 type=hostVmknic

This error is also correct 😉 You have not disconnected the virtual adapter from the portgroup!

Go back to your host. Select the Configuration tab and go to the Networking config. Select vSphere Distributed Switch. Here you can see the virtual adapter that’s still connected to the portgroup.

remove-host-dvswitch-07

 

Select Manage Virtual Adapters. You’ll see a list of the connected adapters. In my case it’s only 1. Select the adapter you want to remove and click Remove. Again, think a second about the warning.

remove-host-dvswitch-08

 

Now you’re able to disconnect the server from the dvSwitch. Press Ctrl-Shift+N again and remove the host from the vSphere Distributed Switch. The remove process will disconnect your dvUplinks for you and your physical adapters will be free for other use.

remove-host-dvswitch-09

 

And now you’re able to remove your host from the cluster.

remove-host-dvswitch-10

 

Congrats, you did it! Have a good one.

Update/Change your ESXi hosts DNS IP address settings with PowerCLI

Long time since last blogpost. Reason: Our third child was born. And it runs with higher priority than work related stuff =)

Anyways, moving right along….

During a domain upgrade we introduced 2 new DC’s in our AD domain that will take over the DNS server role from the old DC’s. Hence all servers need to be updated with the new DNS server IP addresses including the ESXi hosts.

This proved to be relatively easy. I updated all ESXi hosts (managed with vCenter Server) with the command:

Get-VMHost | Get-VMHostNetwork | Set-VMHostNetwork -DnsAddress [DNS1 IP address],[DNS2 IP address]

Test it first on one server by specifying it:

Get-VMHost -Name [FQDN of ESXi host] | Get-VMHostNetwork | Set-VMHostNetwork -DnsAddress [DNS1 IP address],[DNS2 IP address]

You can also change other parameters, like the Domain and SearchDomain

Get-VMHost | Get-VMHostNetwork | Set-VMHostNetwork -DnsAddress [DNS1 IP address],[DNS2 IP address] -Domain [Domain name] -SearchDomain [Search domain name]

Easy! PowerCLI FTW!

Reference: https://www.vmware.com/support/developer/PowerCLI/PowerCLI41U1/html/Set-VMHostNetwork.html

VMware vSphere 5.1 is announced. Available from September 12th 2012. What’s New?

NOTE: I’m still digging through the ‘what’s new’ documents, so I’ll probably be adding info to this document in the coming days.

This post sums up all the notable changes and new features in vSphere 5.1 for future reference.

VM:

  • 64 vCPU is now possible (up from 32 in 5.0)
  • VM version 9. Now supports GPU acceleration (Only usable in VMware View) on selected nVidia GPU’s
  • New feature: Guest OS Storage Reclamation (Only usable in VMware View)
  • New feature: VHV (virtualized hardware virtualization) to run VM’s inside of VM’s
  • A shift from using virtual hardware versioning to ‘Compatibility levels’

Platform:

  • Upgrading to VM version 9 doesn’t require the VM to be down
  • Shell activity is no longer logged as root, but as the logged on user
  • New feature: Support for SNMPv3
  • vMotion and svMotion can now be combined in one action
  • Windows 8/2012 are supported
  • Piledriver, Sandy Bridge E and Ivy bridge CPUs are now supported

Network:

  • dvSwitch supports LACP
  • New feature: Network Health Check
  • New feature: Configuration Backup & Restore
  • New feature: Roll Back and Recovery
  • SR-IOV support

Availability:

  • vMotion is now possible on non-shared storage
  • vMotion and svMotion can now be done simultaneously (“Unified” vMotion)
  • New feature: vSphere Data Protection. Agentless VM to disk backup with dedup
  • New feature: vSphere Replication. Replication on VM level over SAN or WAN
  • After installing VMware Tools 5.1 (and reboot), no more reboots are needed with future installs of VMware Tools

Security:

  • New feature: VMware vShield Endpoint is now included

Automation:

  • sDRS and Storage Driven Profiles are now integrated in VMware vCloud Director (5.1?)
  • vSphere Auto Deploy has 2 new methods for deploying vSphere hosts: Stateless Caching and Stateful Installs
  • Up to 80 concurrent hosts boots are now supported by the Auto Deploy server

Management:

  • vSphere Web Client is greatly improved
  • New feature: vCenter Single Sign-on
  • vCenter Orchestrator has newly designed workflows and can now be launched by the vSphere Web Client

Licensing:

  • vSphere licensing is back to CPU sockets instead of a specific amount of vRAM per CPU socket. (Thanks Michael)

Storage:

  • Max number of hosts that can share a read-only file is now 32 (up from 8 in vSphere 5.0)
  • New feature: Virtual disk type SE sparse disk. Used to enable wipe/UNMAP free disk space initiated from within the VM (VMware Tools). HW version 9 required. VMware View only.
  • Grain size of VMDK is now tunable, but not by users =) Default size is still 4KB. Redo logs still use 512B.
  • Improved APD and PDL handling: Misc.APDHandlingEnable, Misc.APDTimeout, disk.terminateVMOnPDLDefault, das.maskCleanShutdownEnabled
  • Extends detection to PDL for iSCSI arrays with single LUN per target
  • Booting from software FCoE adapter is now supported
  • 16Gb speed support for 16Gb FC HBA’s
  • vCloud Director 5.1 can now also make use of VAAI NAS primitives (using a plugin from the vendor) to off-load the creation of linked clones
  • New namespaces in esxcli (esxcli storage san) for troubleshooting
  • New feature: Smartd daemon. Collects SMART info from disks. Can only be used from esxcli.
  • SIOC will now calculate the latency threshold from the 90% throughput value (90% of peak value)
  • SIOC will now be enabled in stats only mode by default
  • When involving vCloud Director linked clones, sDRS will now not recommend placing them on datastores that do not contain the base disk or a shadow vm copy of the base disk
  • The datastore correlation detector now uses the I/O injector, so no VASA support is needed anymore
  • New SIOC metric: VmObservedLatency
  • Support for 4 parallel disk copies per svMotion operation to distinct datastores
  • Jumbo frame support for all iSCSI adapters

Duncan Epping posted the links to the ‘what’s new’ documents. If you want to read them yourself, click here.

Multi NIC vMotion with jumbo frames on directly connected ESXi 5 hosts

For licensing reasons I run a 2 node cluster with 1 quad core CPU per node. Each host has 8 NICs. I’m using 2 for VM network traffic and want to use 2 for the management network. I wanted to use the other 4 NICs for multi NIC vMotion. To save switch ports, I connected both hosts directly. Because I saw no reason not to use jumbo frames, I wanted to set this up too.

Now, to enable the vMotion process to make use of all the uplinks, you’ll have to assign one VMKernel port to one vmnic only. It works kind of the same as the software iSCSI setup. Thank you Duncan for explaining this in detail. Just creating a vSwitch with one VMKernel portgroup and assigning multiple uplinks won’t cut it. This will only use the multiple uplinks when vMotioning multiple VM’s. One vMotion per uplink. In multi NIC vMotion, all uplinks are used for every vMotion.

Ok, let’s get started. To make things easy for yourself, make sure the UTP cables are connected to the same NIC ports on both hosts. If the hosts are installed in the same manner, the physical ports should map to the same vmnics. In my example, I’m using vmnic1 & vmnic2 of an Intel 82850 quad port NIC and vmnic5 & vmnic6 of a Broadcom BCM5719 quad port NIC. Although the vMotion capability is not that critical of an operation, there is no reason to not take into account the same redundancy best practices as you would apply on your VM networks and management network(s). It’s never comfortable to have an interrupted vMotion because one of your NICs fails. Because of auto-MDIX you can connect the NICs without any risk. I have, however, configured 4 vmnics in one host as ‘1000/FULL’. The 4 vmnics in the other host are configured to ‘Auto negotiate’.

NOTE: After several hours of research (see problem at the end of this post), I have to conclude that adding multiple NICs to 1 vSwitch will NOT (always) work. Although port groups are bound to a vmnic, it seems the switch does not send IP traffic through that vmnic by default. It observes the IP ranges and then decides which vmnic to use. The simplest way to solve this is to use 1 vSwitch per port group. It’s a little more work, but all other settings below are the same. And it will work every time =)

You’ll end up with something similar to this (notice 1 vmnic per vSwitch):

Create a new vSwitch with 4 VMKernel ports. Add the vmnics to the vswitch (vmnic1,2,5,6).

Enable vMotion on all portgroups and enable jumbo frames (MTU 9000) on the vSwitch

AND on all 4 portgroups.

Now just like software iSCSI, you’ll have to override the switch failover order and assign 1 active vmnic to each port group (the same on both servers!) and ‘disable’ the others

When you created the VMKernel port groups, you probably chose a class A, B or C private network with a /24 network mask, right? It’s even given as an example in the link I provided above. Now when you try to vMotion, everything seems to go right until the progress bar gets stuck at 9% and you receive the error:

The vMotion migrations failed because the ESX hosts were not able to connect over the vMotion network. Check the vMotion network settings and physical network configuration.
vMotion migration [168364033:1341590814020069] vMotion migration [168364033:1341590814020069] stream thread failed to connect to the remote host <10.9.8.30>: The ESX hosts failed to connect over the VMotion network
Migration [168364033:1341590814020069] failed to connect to remote host <10.9.8.30> from host <10.9.8.29>: Timeout
Module Migrate power on failed.
vMotion migration [168364033:1341590814020069] failed to read stream keepalive: Connection closed by remote host, possibly due to timeout

The error is correct, the ESXi host can’t connect! Think about it. Even though you separated the portgroups and connected the NICs with directly connected UTP cables, there is nothing in VMware that prevents it from connecting randomly to one of the other 4 IP addresses from the other host for the vMotion operation. You put them in one /24 subnet, didn’t you? This is exactly the reason why these settings will work when you have the hosts connected to a switch, but not when you use a directly connected cable.

To solve this, you’ll have to create separate networks for every VMKernel portgroup. I recommend using /30 networks. I haven’t tried /31 networks, but I can’t really see a reason why you should use them. Maybe one of you can comment. To make it easy to choose the correct IP addresses for hosts, use this cheat sheet.

So I created 4 networks: 10.9.8.0/30, 10.9.8.28/30, 10.9.8.60/30 and 10.9.8.92/30. I assigned the 2 IP addresses from each subnet to the VMKernel portgroup on each host. The subnet mask is 255.255.255.252.
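If you don’t have a cheat sheet at hand, Python’s standard `ipaddress` module can list the two usable host addresses in each /30 network (subnet mask 255.255.255.252). This is just a small sketch; the helper name `host_pair` is made up.

```python
import ipaddress

# The four point-to-point networks, one per directly connected NIC pair.
networks = ["10.9.8.0/30", "10.9.8.28/30", "10.9.8.60/30", "10.9.8.92/30"]


def host_pair(cidr):
    """Return the subnet mask and the usable host addresses of a network."""
    net = ipaddress.ip_network(cidr)
    return str(net.netmask), [str(host) for host in net.hosts()]


for cidr in networks:
    mask, hosts = host_pair(cidr)
    print(cidr, mask, hosts)
```

Each network has exactly two usable addresses, one for each end of the cable. For 10.9.8.28/30 these are 10.9.8.29 and 10.9.8.30, the same pair that shows up in the error messages above.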

This way you make sure there is only 1 possible connection. Each IP connection can only exist between the IP addresses on both ends of the directly attached UTP cable. This is the reason why you should mark the other vmnics as ‘unused’ instead of ‘standby’ in the failover order on the portgroup: no vmnic can take the place of another in case of a failure.

To test the connection, I made a 16GB WS2K8R2SP1 VM and moved it from 1 host to the other. Here are the esxtop screenshots

Beautiful! vMotion completed in 21 seconds 😀

I’ll report back when the upgrade to 10GBit is completed 😉

UPDATE:

After rebooting 1 of the 2 hosts in the cluster, the vMotion error returned. I checked all settings, but no config was lost. I managed to get things going again by changing the active adapter to standby adapter,

ignoring the warning,

and then change it back to active again.

I only had to do this on the rebooted host. I’m still not sure why this is happening. It seems the binding of the IP address of the port group occurs on all adapters at boot. If that is the case, this would certainly be a bug. Maybe a reader can comment?

NOTE: The problem is recurring. See the note above for a solution to this problem.

Creating an HP IRF (Intelligent Resilient Framework) Networking stack between 2 switches

A short checklist for creating an IRF stack on 2 HP switches. I executed this on 2 HP A3600 EI switches:

  1. Log in to the switch using the console port
  2. sys (Enter system view)
  3. show version (Ensure that both switches are running the same software version)
  4. reset saved-configuration (Reset the config)
  5. irf member 1 renumber 1 (Assign an IRF member number to the first switch)
  6. irf member 1 renumber 2 (Assign an IRF member number to the second switch)
  7. quit (Quit to user view)
  8. save (Save the config)
  9. reboot (Reboot the switches)
  10. irf mac-address persistent always (Enable MAC address persistence)
  11. irf member 1 priority 32 (Set the highest prio on the first member/switch)
  12. irf member 2 priority 30 (Set the second highest prio on the second member/switch)
  13. int GigabitEthernet 1/0/51
  14. shut
  15. int GigabitEthernet 1/0/52
  16. shut
  17. int GigabitEthernet 2/0/51
  18. shut
  19. int GigabitEthernet 2/0/52
  20. shut (shutdown all interfaces you want to use for IRF on both switches)
  21. irf port 1/1 (Create IRF port 1/1 on the first member)
  22. port group interface GigabitEthernet 1/0/51 (add the switch port to the IRF port)
  23. quit
  24. irf port 1/2 (Create IRF port 1/2 on the first member)
  25. port group interface GigabitEthernet 1/0/52 (add the switch port to the IRF port)
  26. quit
  27. irf port 2/1 (Create IRF port 2/1 on the second member)
  28. port group interface GigabitEthernet 2/0/51 (add the switch port to the IRF port)
  29. quit
  30. irf port 2/2 (Create IRF port 2/2 on the second member)
  31. port group interface GigabitEthernet 2/0/52 (add the switch port to the IRF port)
  32. quit
  33. save (Save config)
  34. interface GigabitEthernet 1/0/51
  35. undo shut
  36. interface GigabitEthernet 1/0/52
  37. undo shut
  38. interface GigabitEthernet 2/0/51
  39. undo shut
  40. interface GigabitEthernet 2/0/52
  41. undo shut (enable all interfaces you want to use for IRF on both switches)
  42. irf-port-configuration active (Activate the IRF config on BOTH switches)
  43. Now connect your fiber CROSSWISE, so 1/0/51 to 2/0/52 and 1/0/52 to 2/0/51
  44. ATTENTION: The second IRF member will reboot! Wait for it to get back up. You will see the switches negotiate for about 30 seconds before the IRF becomes active.
  45. If all works well;
  46. quit
  47. save
  48. reboot
  49. disp irf (Display the IRF setup)
  50. disp irf topology (Display the IRF Topology)
  51. Both IRF ports should be up. If one is DOWN or DIS(abled), something went wrong.
  52. Check the IRF prio of the second member. It should be 30.
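The IRF port binding steps in the checklist (21 to 32) follow a fixed pattern per member and per uplink. As a sketch, this Python snippet generates those command lines so you can paste them into the console. The function name `irf_port_commands` is made up, and the port numbers 51 and 52 are simply the ones used in this checklist; adjust them to the ports on your own switches.

```python
def irf_port_commands(member, uplink_ports):
    """Return the command lines that bind physical ports to the IRF ports of one member."""
    commands = []
    for index, port in enumerate(uplink_ports, start=1):
        commands.append(f"irf port {member}/{index}")
        commands.append(f"port group interface GigabitEthernet {member}/0/{port}")
        commands.append("quit")
    return commands


# Two members, each using ports 51 and 52 for IRF, as in the checklist.
for member in (1, 2):
    print("\n".join(irf_port_commands(member, [51, 52])))
```

This only prints text and does not talk to the switch. The shutdown, save and activation steps around it still have to be done as described in the checklist.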

Download the IRF Configuration Guide for more information on configuring IRF including optional parameters

I got my info mainly from here and adjusted it so it applies to a stack with just 2 switches.
