Improving performance and reducing CPU ready times by removing vCPUs: a real-world example

When I started to work on a customer’s vSphere environment, there was one thing I noticed immediately: VMs had 2 vCPUs by default. On top of that, there were several 4- and even 8-vCPU VMs. The total number of vCPUs was roughly 3.5 times the number of physical CPU cores: 144 physical cores (without Hyper-Threading) against over 500 vCPUs. Normally this wouldn’t be an issue, but because of the multi-vCPU VMs, considerable CPU ready time was being logged.

I started to investigate the reasons behind the ‘2 vCPU default’. It turned out there were two reasons. The first (and foremost) was the one that we as virtualization admins probably hear the most: ‘because it’s faster’ and ‘just to be sure’. I count these as one reason, because they come from the same ‘legacy’ mindset. The other reason was to be able to handle unpredictable peaks in load.

It was clear this environment was being run by admins who still had the ‘physical way of doing things’. Mind you, I’m not blaming anyone. The techniques, concepts and ideas behind virtualization are actually quite complex, and there is no shame in sticking to what you know will work. This, however, made correcting the problem a lot easier.

I went through a lot of logging and statistics in the weeks after this. It became clear that, at that moment, there was no real performance impact except during backup hours. Ready times were below 10%, averaging about 3%–6% with some spikes to 8% during production hours. During backup hours, however, the ready times went through the roof. But as I said, that was of less concern. The real problem lay ahead: if the growth of 2-vCPU VMs continued at this pace, ready times would become a serious problem in about six months. Seeing how most admins try to solve CPU-related problems, namely by adding more CPUs, it was clear I had to act now 😀

I looked at CPU performance counters and graphs for every single VM separately to determine whether more than 1 vCPU was justified. Unsurprisingly, I found very few VMs where this was the case. I consulted with application admins to figure out which VMs could expect real CPU load, and I also managed to convince management to drastically reduce the number of vCPUs overall. The other admins had a healthy amount of skepticism. I won them over by explaining that it would be far easier to add a vCPU to a VM after reducing the total number of vCPUs. My reduction plan was by no means a ‘hard’ deadline; exceptions could always be made.
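The per-VM triage described above can be sketched as a simple decision rule. This is a hypothetical illustration, not the author’s actual criteria: the function name, and the 50% average / 80% peak thresholds, are my assumptions.

```python
# Hypothetical right-sizing rule (illustrative only; thresholds are assumed,
# not taken from the article): given a VM's average and peak CPU usage as a
# percentage of its current allocation, decide whether >1 vCPU is justified.
def multi_vcpu_justified(avg_pct, peak_pct, avg_threshold=50, peak_threshold=80):
    """Return True if sustained or peak load suggests keeping extra vCPUs."""
    return avg_pct >= avg_threshold or peak_pct >= peak_threshold

print(multi_vcpu_justified(12, 40))  # → False: a typical right-sizing candidate
print(multi_vcpu_justified(60, 95))  # → True: keeps its extra vCPUs
```

In practice a rule like this is only a first filter; the article’s approach of confirming with the application admins catches the VMs whose load is bursty or seasonal.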

I managed to reduce the number of vCPUs on almost 98% of all VMs. Over 85% of all VMs ended up with 1 vCPU. All 8-vCPU VMs went to 4 vCPUs, and all 4-vCPU VMs went to 1 or 2 vCPUs. The total number of vCPUs went from 500+ to 270, bringing the vCPU-to-core ratio down to just under 2. And, most importantly, I managed to make the 1-vCPU VM the de facto standard.
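For reference, the overcommit arithmetic behind these figures (144 physical cores, 500+ vCPUs before the change, 270 after):

```python
# Overcommit ratio = total vCPUs / physical cores (Hyper-Threading excluded,
# as in the article). Figures taken from the text above.
cores = 144
vcpus_before, vcpus_after = 500, 270

print(f"before: {vcpus_before / cores:.2f} vCPUs per core")  # → before: 3.47 vCPUs per core
print(f"after:  {vcpus_after / cores:.2f} vCPUs per core")   # → after:  1.88 vCPUs per core
```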

Here are some graphs showing the improvements. CPU ready times are in red, CPU usage in blue/grey. The change was made on June 2nd. To extract the CPU ready time percentage from these graphs, please read this blog post by Jason Boche.
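For those without that post at hand, the conversion is straightforward: the ‘summation’ counter records milliseconds of ready time per sampling interval, so dividing by the interval length gives a percentage. A minimal sketch (20 seconds is the realtime chart’s sampling interval; use the chart’s own interval for other rollups):

```python
# Convert a CPU ready 'summation' sample (milliseconds of ready time
# accumulated per sampling interval) into a percentage.
def ready_percent(ready_ms, interval_s=20):  # 20 s = realtime chart interval
    return ready_ms / (interval_s * 1000) * 100

# A realtime sample of 1600 ms of ready time:
print(ready_percent(1600))  # → 8.0 (%)
```

Note that for multi-vCPU VMs the counter sums across vCPUs, so divide by the vCPU count if you want a per-vCPU figure.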

4vCPU to 2vCPU:

8vCPU to 4vCPU:

4vCPU to 2vCPU:

2vCPU to 1vCPU:

The conclusions are:

  • No impact on performance for applications and end users
  • Reduced CPU ready times to almost 0%, thereby improving performance
  • Created enough free resources to enable future growth
  • Extended the hardware life cycle, thereby increasing ROI

Overall, the change was a success. The DB admins identified one VM they expect to grow significantly in the near future. We will be monitoring its CPU hunger and will add vCPUs accordingly.


ESXi host error: Unable to apply DRS resource settings on host

After upgrading vCenter Server to 5.0 Update 1, there was an ESXi 4.1 host that displayed the error:

Unable to apply DRS resource settings on host (Reason: A general system error occurred: Invalid Fault). This can significantly reduce the effectiveness of DRS.

I quickly stumbled upon KB1004667. The solution seemed quite an undertaking. At the end of the article, however, is a small note that reads:

Note: It was reported on one occasion that when logging directly into the ESX host encountering the issue with a vSphere Client a hung VMware Tools install was found. The tools install was resolved by right clicking on the VM to complete the install. Once the tools install was completed both HA and DRS reconfigured successfully on the cluster.

Well, I guess they can make that ‘on two occasions’ 😀 After connecting to the ESXi host, I saw that one VM had initiated a VMware Tools install about three weeks earlier and it had never finished. Completing the install made the error disappear.
