Improving performance and reducing CPU Ready Times by removing vCPUs: a real-world example

When I started working on a customer's vSphere environment, there was one thing I noticed immediately: VMs had 2 vCPUs by default. On top of that, there were several 4- and even 8-vCPU VMs. The total number of vCPUs was more than three times the number of physical CPU cores: 144 physical cores (without Hyperthreading) against over 500 vCPUs. Normally this wouldn't be an issue, but because of the multi-vCPU VMs, considerable CPU ready time was being logged.

I started to investigate the reasons behind the '2 vCPU default'. It turned out there were two. The first (and foremost) was the one we as virtualization admins probably hear the most: 'because it's faster' and 'just to be sure'. I count these as a single reason, because they come from the same 'legacy' mindset. The other reason was to be able to handle unpredictable peaks in load.

It was clear this environment was being run by admins who still had the 'physical way of doing things'. Mind you, I'm not blaming anyone. The techniques, concepts and ideas behind virtualization are actually quite complex, and there is no shame in sticking to what you know will work. It did, however, make correcting this problem a lot easier.

I went through a lot of logs and statistics in the weeks that followed. It became clear that, at that moment, there was no real performance impact except during backup hours. Ready times were below 10%, averaging about 3%-6% with some spikes to 8% during production hours. During backup hours, however, ready times went through the roof. But as I said, that was of less concern. The real problem lay ahead: if the growth in 2-vCPU VMs continued at this pace, ready times would become a serious problem within about six months. Seeing how most admins try to solve CPU-related problems, namely by adding more CPUs, it was clear I had to act now 😀

I looked at the CPU performance counters and graphs for every single VM separately to determine whether more than one vCPU was justified. Unsurprisingly, I found very few VMs where this was the case. I consulted with application admins to figure out which VMs could expect real CPU load, and I also managed to convince management to drastically reduce the number of vCPUs overall. The other admins had a healthy amount of skepticism. I won them over by explaining that it would be far easier to add a vCPU back to a VM after reducing the total number of vCPUs. My reduction plan was by no means a 'hard' deadline; exceptions could always be made.
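The per-VM review above can be reduced to a rough rule of thumb. This is my own sketch, not the method from the post: the headroom factor and the idea of rounding up from cores actually consumed are assumptions.

```python
import math

def suggested_vcpus(current_vcpus, peak_usage_pct, headroom=1.5):
    """Rough right-sizing heuristic (assumed values, not from the post).

    peak_usage_pct: VM-level CPU usage, 0-100, averaged across all vCPUs.
    Cores actually consumed = current_vcpus * peak_usage_pct / 100; add
    headroom for unpredictable spikes, round up, and never exceed the
    current allocation.
    """
    cores_used = current_vcpus * peak_usage_pct / 100
    return max(1, min(current_vcpus, math.ceil(cores_used * headroom)))

# A 4-vCPU VM peaking at 20% overall usage fits comfortably in 2 vCPUs:
print(suggested_vcpus(4, 20))   # → 2
# A 2-vCPU VM peaking at 10% is a clear candidate for 1 vCPU:
print(suggested_vcpus(2, 10))   # → 1
```

A genuinely busy VM is left alone: an 8-vCPU VM peaking at 90% keeps all 8 vCPUs under this rule.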

I managed to reduce the vCPU count on almost 98% of all VMs. Over 85% of all VMs got 1 vCPU. All 8-vCPU VMs went to 4 vCPUs, and all 4-vCPU VMs went to 1 or 2 vCPUs. The total number of vCPUs went from 500+ to 270, bringing the vCPU-to-core ratio down to about 2. And, most importantly, I made 1-vCPU VMs the de facto standard.
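The before/after overcommit ratios can be checked directly from the figures in the post ("over 500" taken as 500):

```python
physical_cores = 144                    # per the post, Hyperthreading excluded
vcpus_before, vcpus_after = 500, 270

print(round(vcpus_before / physical_cores, 2))  # → 3.47 overcommit before
print(round(vcpus_after / physical_cores, 2))   # → 1.88, the "about 2" ratio
```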

Here are some graphs that show the improvements. CPU ready times are in red, CPU usage in blue/grey. The change was made on June 2nd. To learn how to extract the CPU ready time percentage from these graphs, read this blog post by Jason Boche.

4vCPU to 2vCPU:

8vCPU to 4vCPU:

4vCPU to 2vCPU:

2vCPU to 1 vCPU:

Conclusions:

  • No impact on performance for applications and end users
  • Reduced CPU ready times to almost 0%, thereby improving performance
  • Created enough free resources to enable future growth
  • Extended the hardware life cycle, thereby increasing ROI

Overall, the change was a success. The DB admins identified one VM they expect to grow significantly in the near future. We will be monitoring its CPU appetite and adding vCPUs accordingly.


About Yuri de Jager
Technology Addict

7 Responses to Improving performance and reducing CPU Ready Times by removing vCPUs: a real-world example

  1. Nitin Pitre says:

    I might be a little late in reading this blog, but I am glad I read it, and I am glad I read it now…just last week I embarked upon virtualization and I think I will benefit from your blog. I will post the result after I implement your solution.
    The blog is very well written, providing just the right amount of detail with facts. Thanks!

    • Thank you for the compliment Nitin! I’m very curious about what problems you encountered in your environment. I hope my blog will help you on your path towards virtualization 🙂

  2. Kevin Gorman says:

Yuri, I have recently inherited a VMware farm and am in the midst of making similar changes, reducing vCPUs where applicable. Unfortunately I have an Oracle database farm, and it appears 11g R2 'expects' at least 2 vCPUs for maintenance activities, even in our development environment. This is likely because we have some older ESX hosts in the farm which don't have great support for VT/HT. CPU %RDY values are in the 20-35% range, and convincing application admins to 'lower' the vCPU count has been challenging. Dropping the vCPU count down to 1 has been met with consistent 10%+ performance gains.

    I have moved application servers between farms, reducing the CPU %RDY values from as high as 25% down to 1%. This change resulted in an immediate 25-30%+ increase in performance, but keep in mind this move also involved newer hardware.

    My experiences have shown that hardware makes a huge difference. We have HP Proliant DL385 G2 and DL380 G5/G7s. The vCPU:pCPU ratios for each environment increase substantially from old to new hardware.

I appreciate your post. Nice to see others going through similar experiences.

    • Hi Kevin,

Thanks for sharing your experience. I recognise your struggle with your application admins 🙂 Even with the hard facts in hand, convincing them is hard. It goes against everything they ever learned in the physical world.

      Good to see that Canadian provinces do also have ‘real world’ vSphere environments. I wish you the best of luck and great successes!

  3. Abhishek Mehta says:

Yuri, I have a similar problem in one of our vSphere clusters: the vKernel report says the constraining resource is CPU ready. Each ESX server has two physical hex-core CPUs. In the VI Client, CPU utilization for each host averages only 20-25%, and memory utilization ranges between 50-65%. Each host houses an average of 55 VMs, configured with 1 to 4 vCPUs and 4 to 8 GB of memory. All these VMs are Linux VMs that R&D uses for development, and per their request we have to give them more than 1 vCPU per VM. Can you please suggest how to tackle this problem?

    Thanks,
    Abhishek

  4. vmPete says:

Good post. One thing I'd say challenges some environments is VMs provisioned to compile source code. The compilers are amazingly efficient at utilizing every single vCPU available to them: 4 vCPUs, 8 vCPUs, etc., perfectly distributed across all of them. The business case is there, as the developers need the builds as fast as possible, but this makes sizing very difficult. I'm currently building out a new cluster with dual-socket 8-core Sandy Bridge chips so that all code-compile VMs can go from 4 vCPUs to 8 vCPUs. Hope to get a blog post out on it soon.

    • another-pete says:

vmPete, your devs should be in their own cluster, so allowing them the additional vCPUs should not affect the live clusters 🙂
