Install new Broadcom bnx2/bnx2x/bnx2i/cnic drivers on your ESXi 5.0 hosts

To exclude the driver as the cause of a seemingly randomly occurring problem, I decided to update the Broadcom NIC drivers on 2 ESXi 5.0 hosts.

Download the latest drivers here and extract the …..-offline_bundle-xxxxxx.zip file. If you read this blog post months or years after I wrote it, check for newer versions of the driver here.

Unfortunately I don’t have version 5.0 of the vMA, and you can’t use vMA 4.1 to update ESXi 5.0 hosts: vihostupdate can no longer be used and has been replaced by new esxcli commands that are not available in vMA 4.1. You will receive the following error if you try anyway:

Error: Unknown namespace software

esxcli can only be used with version 4.0 or newer servers

If you DO have vMA version 5.0, here is an excellent guide on how to update drivers.

So I’m using an SSH connection to the ESXi 5.0 hosts and updating them from the command line.

First, use WinSCP to transfer the update file, BCM-NetXtremeII-1.0-offline_bundle-553511.zip in my case, to the ESXi 5.0 host. I placed the file in /tmp because it will be deleted when the host is rebooted.
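
If you prefer a command line over WinSCP, scp works just as well once SSH is enabled on the host (see the next step). This is only a sketch, with the host name as a placeholder:

scp BCM-NetXtremeII-1.0-offline_bundle-553511.zip root@[ESXi host FQDN]:/tmp/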

Enable the TSM-SSH service on the ESXi 5.0 host, connect using an SSH client (e.g. PuTTY) and install the new drivers using the command

esxcli software vib update --depot=/tmp/BCM-NetXtremeII-1.0-offline_bundle-553511.zip

The result should be something like this:

~ # esxcli software vib update --depot=/tmp/BCM-NetXtremeII-1.0-offline_bundle-553511.zip
Installation Result
Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
Reboot Required: true
VIBs Installed: Broadcom_bootbank_misc-cnic-register_1.70.0.v50.9-1OEM.500.0.0.472560, Broadcom_bootbank_net-bnx2_2.1.12b.v50.3-1OEM.500.0.0.472560, Broadcom_bootbank_net-bnx2x_1.70.34.v50.1-1OEM.500.0.0.472560, Broadcom_bootbank_net-cnic_1.11.18.v50.1-1OEM.500.0.0.472560, Broadcom_bootbank_scsi-bnx2i_2.70.1k.v50.2-1OEM.500.0.0.472560
VIBs Removed: VMware_bootbank_misc-cnic-register_1.1-1vmw.500.0.0.469512, VMware_bootbank_net-bnx2_2.0.15g.v50.11-5vmw.500.0.0.469512, VMware_bootbank_net-bnx2x_1.61.15.v50.1-1vmw.500.0.0.469512, VMware_bootbank_net-cnic_1.10.2j.v50.7-2vmw.500.0.0.469512, VMware_bootbank_scsi-bnx2i_1.9.1d.v50.1-3vmw.500.0.0.469512
VIBs Skipped: Broadcom_bootbank_scsi-bnx2fc_1.0.1v.v50.1-1OEM.500.0.0.406165

You can see a reboot is required. After the reboot, you can verify the driver version using the command

ethtool -i vmnic[your vmnic number]

It should return something like this:

~ # ethtool -i vmnic0
driver: bnx2
version: 2.1.12b.v50.3
firmware-version: bc 1.9.6
bus-info: 0000:03:00.0

or you can look at the Hardware Status tab of your host in the vSphere Client. You might need to refresh the information first using the link in the upper right of the tab.
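
If the host has several Broadcom NICs, a small loop from the SSH session saves some typing. This is just a sketch relying on esxcli network nic list and the busybox sh/awk that ship with ESXi 5.0; adjust to taste:

# print driver and firmware info for every vmnic after the reboot
for nic in $(esxcli network nic list | awk 'NR>2 {print $1}'); do
  echo "=== $nic ==="
  ethtool -i "$nic"
done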

Enable SNMP on your ESXi 4.0/4.1/5.0 hosts from the vMA

To enable SNMP:

vicfg-snmp --server [ESXi host FQDN] -c public -p 161 -t [SNMP manager IP address]@161/public,[SNMP manager IP address]@161/public --username root

Separate multiple managers with commas, without spaces. Adjust your community names accordingly.

Enable the SNMP agent:

vicfg-snmp --server [ESXi host FQDN] -E --username root

Test:

vicfg-snmp --server [ESXi host FQDN] -s --username root
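
To verify from the other end, a quick walk of the system MIB from the SNMP manager should return data. This assumes the net-snmp command line tools are installed there and that you kept the community name public; the host name is a placeholder:

snmpwalk -v 2c -c public [ESXi host FQDN] 1.3.6.1.2.1.1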

svMotion vCloud Director VMs using the vCloud REST API and cURL

Because I had to clear out 2 LUNs, I ran into the problem that moving vCloud Director VMs directly through vSphere is a bad idea.

Luckily I remembered a blog post made by the brilliant William Lam not so long ago: Performing A Storage vMotion in vCloud Director Using vCloud REST API. This will be the basis of the solution. Please note I’m using vCloud Director 1.5.

In preparation for this, I did the usual:

  1. Create new LUNs
  2. Present them to the ESXi hosts
  3. Create new datastores

The next thing to do is to add the datastores to vCloud Director. I was trying to add them in the vSphere Resources section, but couldn’t find the functionality there.

The correct place to do this is in the Provider vDC.

Now it’s time to call the REST API. First, download cURL. Since I’m using Windows, I got the curl-7.25.0-ssh2-ssl-sspi-zlib-idn-static-bin-w32.zip from the Win32 section.

Before issuing the commands, you need a session key, and you need to provide credentials to get one. I don’t know exactly which permissions are required to perform the svMotion, so I’ll just use the vCloud Administrator account. Execute the following command in a Command Prompt to get the session key:

curl -i -k -H "Accept:application/*+xml;version=1.5" -u Administrator@system:[vCloud Administrator password] -X POST https://[Your vCloud Server FQDN]/api/sessions

Your reply should look something like this. The session key is the x-vcloud-authorization value:

HTTP/1.1 200 OK
Date: Fri, 18 May 2012 10:22:51 GMT
x-vcloud-authorization: y27xZ3cYNlmmtMkR9oXhHwca2+U8pb3pIxWlFCXIn78=
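
If you happen to run this from a Linux shell (the vMA, for example) instead of the Windows Command Prompt, you can filter out just the session key. A rough sketch, with the same placeholders as above:

curl -s -i -k -H "Accept:application/*+xml;version=1.5" -u Administrator@system:[vCloud Administrator password] -X POST "https://[Your vCloud Server FQDN]/api/sessions" | grep -i x-vcloud-authorization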

Now, to get all the VMs in your vCloud Director environment, execute the following command. Please note I use the session key I just retrieved:

curl -i -k -H "Accept:application/*+xml;version=1.5" -H "x-vcloud-authorization: y27xZ3cYNlmmtMkR9oXhHwca2+U8pb3pIxWlFCXIn78=" -X GET https://[Your vCloud Server FQDN]/api/query?type=adminVM

This command differs from what is posted on William Lam’s blog. When I specified the ‘&fields=name,vCloud’ parameter, I received the error ‘’fields’ is not recognized as an internal or external command, operable program or batch file.’, so I left it out. The cause is that the Windows Command Prompt treats the unquoted & as a command separator; wrapping the whole URL in double quotes should avoid this, as in the sketch below.
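
For reference, a quoted version would look roughly like this (fields=name is just an illustration, not necessarily the field list you want, and the session key is a placeholder):

curl -i -k -H "Accept:application/*+xml;version=1.5" -H "x-vcloud-authorization: [your session key]" -X GET "https://[Your vCloud Server FQDN]/api/query?type=adminVM&fields=name"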

You will get another HTTP/1.1 200 OK response followed by a lot of XML. Look for the <AdminVMRecord tag: it marks the start of a record for a VM.

<AdminVMRecord vmToolsVersion="8290" vdc="https://[Your vCloud Server FQDN]/api/vdc/acace623-5973-4c1f-9b81-f6f5bf88aaa7"
vc="https://[Your vCloud Server FQDN]/api/admin/extension/vimServer/9d57c5dc-6e58-4514-9e6e-a3cf2f4cddfb"
status="POWERED_OFF" org="https://[Your vCloud Server FQDN]/api/org/21749a1d-5b02-4228-ba9b-2b0db996b629"
numberOfCpus="1" networkName="vApp Intern" name="WinXP" moref="vm-682" memoryMB="2048" isVdcEnabled="true"
isVAppTemplate="false" isPublished="false" isDeployed="false" isDeleted="false" hostName="[ESXi host]"
hardwareVersion="8" guestOs="Microsoft Windows XP Professional (32-bit)" datastoreName="[Your datastore name]"
containerName="[Your vApp name]" container="https://[Your vCloud Server FQDN]/api/vApp/vapp-ad1fcab3-fa67-4e87-9f9d-7db9e7d1f3eb"
href="https://[Your vCloud Server FQDN]/api/vApp/vm-0302a63d-b749-4c10-9fb3-019ab1e2ef09"
pvdcHighestSupportedHardwareVersion="8" containerStatus="RESOLVED"/>

Look for the href value and copy the URL. Do this for all <AdminVMRecord blocks and you’ll end up with a list of VMs with their href URLs:

name="WinXP1" href="https://[Your vCloud Server FQDN]/api/vApp/vm-0302a63d-b749-4c10-9fb3-019ab1e2ef09"
name="Linux1" href="https://[Your vCloud Server FQDN]/api/vApp/vm-55bc30b8-738f-4388-8165-59caa61bfd63"
name="WinXP2" href="https://[Your vCloud Server FQDN]/api/vApp/vm-6aa01bf8-3ae8-48d9-bb39-0de7fad9850c"
name="Linux2" href="https://[Your vCloud Server FQDN]/api/vApp/vm-7b653086-1f89-4d86-9802-d37f61922c79"
name="Linux3" href="https://[Your vCloud Server FQDN]/api/vApp/vm-bb28fd62-65b1-4474-af55-86621d791558"
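
If you have a lot of VMs, copying these by hand gets tedious. From a Linux shell with GNU grep, something like the following rough sketch pulls out just the name and href attributes (it is not bulletproof XML parsing, just good enough for a one-off list; the session key is a placeholder):

curl -s -k -H "Accept:application/*+xml;version=1.5" -H "x-vcloud-authorization: [your session key]" "https://[Your vCloud Server FQDN]/api/query?type=adminVM" | grep -Eo ' (name|href)="[^"]*"'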

Now do something similar to get your datastore hrefs:

curl -i -k -H "Accept:application/*+xml;version=1.5" -H "x-vcloud-authorization: y27xZ3cYNlmmtMkR9oXhHwca2+U8pb3pIxWlFCXIn78=" -X GET https://[Your vCloud Server FQDN]/api/query?type=datastore

Look for the <DatastoreRecord tag in the XML output and copy the corresponding href values:

<DatastoreRecord vcName="[Your VC name]" vc="https://[Your vCloud Server FQDN]/api/admin/extension/vimServer/9d57c5dc-6e58-4514-9e6e-a3cf2f4cddfb"
storageUsedMB="200149" storageMB="767744" requestedStorageMB="246784" provisionedStorageMB="503169"
numberOfProviderVdcs="1" name="[Your datastore name]" moref="datastore-689" isEnabled="true" isDeleted="false"
datastoreType="VMFS5" href="https://[Your vCloud Server FQDN]/api/admin/extension/datastore/6f16b43f-3c6d-433e-b4a2-55d0976dda9c"
taskStatus="success" task="https://[Your vCloud Server FQDN]/api/task/36e286f1-d6bc-42ed-9a43-01b5e663e1ac"
taskStatusName="jobAdd"/>

Again, you will end up with a list similar to this:

name="vCloud-SourceDisk01" href="https://[Your vCloud Server FQDN]/api/admin/extension/datastore/10f678c7-9a9f-4b2d-907d-c129d9d0119d"
name="vCloud-SourceDisk02" href="https://[Your vCloud Server FQDN]/api/admin/extension/datastore/855a6cf8-1cf3-4e1e-ba63-9863f1607402"
name="vCloud-DestinationDisk01" href="https://[Your vCloud Server FQDN]/api/admin/extension/datastore/80816ef7-80fe-4084-88eb-d58ead85e7d5"
name="vCloud-DestinationDisk02" href="https://[Your vCloud Server FQDN]/api/admin/extension/datastore/6f16b43f-3c6d-433e-b4a2-55d0976dda9c"

You’ll now have to create a text file for every destination datastore for the vCloud VMs. In my case I had 2 destination datastores, so I made 2 files, relocate-response-destinationdisk01.txt and relocate-response-destinationdisk02.txt, with the following content:

<RelocateParams xmlns="http://www.vmware.com/vcloud/v1.5">
<Datastore href="https://[Your vCloud Server FQDN]/api/admin/extension/datastore/80816ef7-80fe-4084-88eb-d58ead85e7d5"/>
</RelocateParams>

and

<RelocateParams xmlns="http://www.vmware.com/vcloud/v1.5">
<Datastore href="https://[Your vCloud Server FQDN]/api/admin/extension/datastore/6f16b43f-3c6d-433e-b4a2-55d0976dda9c"/>
</RelocateParams>

These are, of course, the destination datastore href values. Now you’re ready to give the actual svMotion command:

curl -i -k -H "Accept:application/*+xml;version=1.5" -H "x-vcloud-authorization: y27xZ3cYNlmmtMkR9oXhHwca2+U8pb3pIxWlFCXIn78=" -H "Content-Type:application/vnd.vmware.vcloud.relocateVmParams+xml" -X POST https://[Your vCloud Server FQDN]/api/vApp/vm-0302a63d-b749-4c10-9fb3-019ab1e2ef09/action/relocate -d @relocate-response-destinationdisk01.txt

You’ll have to do this for every VM; take the href values from your list of VMs and change the relocate file accordingly (a rough loop is sketched below). You can see the Storage vMotion execute in the vSphere Client.
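
If you have a Linux shell available, a small loop over a list of VM hrefs saves some copy-and-paste work. This is only a sketch: vm-hrefs.txt (one href per line) is a hypothetical file, the session key is a placeholder, and every VM in the list goes to the same destination datastore, so split your lists per relocate file:

KEY="[your session key]"
while read VMHREF; do
  curl -i -k -H "Accept:application/*+xml;version=1.5" \
    -H "x-vcloud-authorization: $KEY" \
    -H "Content-Type:application/vnd.vmware.vcloud.relocateVmParams+xml" \
    -X POST "$VMHREF/action/relocate" \
    -d @relocate-response-destinationdisk01.txt
done < vm-hrefs.txt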

On several occasions I saw the vCloud Director server issue a consolidate command on a VM before it executed the svMotion. That is exactly the reason why you can’t do this through vSphere.

Because some of the svMotions took a long time, I had to get a new session key several times. Just execute the first curl command again and replace the session key in your svMotion command.

Good luck and I want to give another shout out to William Lam for publishing his solutions to difficult, highly technical problems. Follow his blog here.

Deleting the undeletable datastore: The resource is in use; Call “HostDatastoreSystem.RemoveDatastore” for object datastoresystem on ESXi failed

While trying to delete a datastore, I received the error “The resource ‘[VMFS Volume ID]’ is in use”.

After checking the datastore I found a vmware.log file. Its contents pointed me to the VM that was using a hardcoded path to the datastore as the log.filename parameter in its .vmx. Shutting down the VM and editing its .vmx with vi quickly fixed this problem.
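
If there is no convenient vmware.log lying around, a rough way to find any registered VM that still references the datastore is to grep the .vmx files from an SSH session on the host. Just a sketch; the datastore name is a placeholder:

grep -l "[Your datastore name]" /vmfs/volumes/*/*/*.vmx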

After this, however, I was still receiving the same error. I decided to try to force the deletion of the datastore. After browsing through several KB articles (KB2008021, KB1003344, KB1004230) and even trying partedUtil (KB1036609) and dd (KB1008886) without success, I was left baffled and empty-handed. I decided to sleep on it for a night because there was probably something I did wrong or something I overlooked.

The next day brought me the OBVIOUS solution. Because we had consolidated our log files onto a dedicated LUN a few weeks earlier, I was pretty sure those settings were correct. I executed the PowerCLI command

Get-VMHost | Get-VMHostAdvancedConfiguration -Name "ScratchConfig.ConfiguredScratchLocation"

to make sure. And indeed, everything seemed OK.

It wasn’t until I executed the command

Get-VMHost | Get-VMHostAdvancedConfiguration -Name "Syslog.Local.DatastorePath"

that I was presented with the obvious truth: the local syslog path still pointed to the datastore I was trying to delete.

A simple

Get-VMHost | Set-VMHostAdvancedConfiguration -Name "Syslog.Local.DatastorePath" -Value "[] /scratch/log/messages"

fixed this config error, and after that I had no problem deleting the datastore.

What this experience really taught me is that if VMware says the resource is in use, IT ACTUALLY IS IN USE. This blog post is a reminder to myself that sometimes taking a step back is better than diving straight into the problem.

Oh yeah, and of course to READ the error.

High %CSTP in your VM. Delete snapshot failed: How to consolidate orphaned snapshots into your VM

This blog post is based on a real-life incident. The platform is vCenter Server & ESXi 4.1 Update 2.

Please note that vSphere 5 has an improved method for this situation: KB2003638.

So what do you do when you have to do maintenance on your Exchange VM? Stop all services and take a snapshot, right? It might be a good idea to stop the VM first, because snapshotting a lot of RAM will take some time. After you’ve made your change and you’re satisfied it’s working properly again, you delete the snapshot.

So what if you forget to delete the snapshot? Performance will degrade fairly quickly, depending on the number of users (i.e. load). Exchange will become sluggish and esxtop will show you why:

66% CSTP. Not good. VMware made a nice KB article about it: KB2000058. Solution? Pretty simple: consolidate your snapshot.

But… what if that process fails? In my case, the VM had high CPU load because it was running on a snapshot disk, and high I/O load because a backup process was running. On top of that, I tried to delete the snapshot. When I realised the backup was also running, I immediately paused it. Of course, strictly speaking the 2 processes shouldn’t influence each other, but the VM was really sluggish at this point. To make matters worse, during the consolidation process vCenter Server lost the connection to the host, so the deletion of the snapshot timed out.

First thing to do is wait. Just because your vSphere Client doesn’t report it, doesn’t mean the consolidation process has failed. How long you have to wait depends on the size of the snapshot (and the speed of your storage). In my case I waited another 20 minutes. The VM was still sluggish. I checked the hard disk location of the VM and saw it was running on <name>-00000x.vmdk files. Those are snapshot files, so I knew by then the snapshot consolidation process had really failed.

This is where it becomes interesting. You are running on snapshot files and your VM is sluggish because of that. Nothing has really changed: you still have to consolidate the snapshot. But that has become impossible to do from the vSphere Client, because vCenter doesn’t ‘see’ that the VM has a snapshot.

The solution to this problem is fairly simple and is explained in KB 1002310: take a (new) snapshot (preferably from the vSphere Client). When that finishes, delete all snapshots. The ‘orphaned’ snapshot files will be consolidated together with the new snapshot.
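
For reference, a command line equivalent of that KB procedure would look roughly like this from an SSH session on the host (the VMID comes from vim-cmd vmsvc/getallvms; the snapshot name and description are placeholders):

vim-cmd vmsvc/snapshot.create [VMID] consolidate-helper "temporary snapshot" 0 0
vim-cmd vmsvc/snapshot.removeall [VMID]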

In my case, I tried the command line approach and logged into the ESXi host. I executed the command

vim-cmd vmsvc/getallvms

to get the list of VMs. I executed the command

vim-cmd vmsvc/snapshot.get [VMID]

to get the snapshot. For some reason, which is still unknown to me, no snapshot info was returned. Looking back, it is possible the snapshot had only just finished consolidating at that point; unfortunately I didn’t check at the time. It is also very possible that corruption had already occurred. Fact is that the creation of the snapshot in the following step went wrong, so something was clearly off. It bothers me that I can’t pinpoint the problem to this date.

So I tried to create a snapshot using the command

vim-cmd vmsvc/snapshot.create 3 snapshot1 snapshot 0 0

The command failed with the ever so lovely error ‘Snapshot creation failed’. I’m not sure this was the exact error returned, but the information given was basically the same. It reminded me of the legendary Windows error ‘An error has occurred.’

Anyway, I also checked KB1008058, and I may very well have done something wrong while trying to make the snapshot from the command line, because I ended up with 2 snapshot files with wrong CIDs in the .vmdk descriptor files. ESXi also saw the error and shut down the VM. This was actually very good, because a shutdown is always better than writing your data blocks to the wrong data file.

I removed the VM from the inventory and added it again. Of the 3 virtual HDDs, one showed 0 GB. This was the disk with the Exchange database.

I wasn’t too worried though, because I knew I still had all the data. Time for some intense analysis and a read-up on KB1007969.

Note: The correct thing to do at this point is to contact VMware support. They will basically do the same as I describe below, but it might be a good idea to let them do it. I only continued because I knew what the problem was and I was pretty sure I still had all my data.

I checked the .vmx file to determine the ‘lost’ disk. It was scsi0:1, the second disk. This seemed correct. -000002.vmdk proved to be the latest snapshot file the VM was running on.

My snapshots were not correctly ‘aligned’, so I couldn’t get any reliable info from the .vmsd and .vmsn files.

I had 2 snapshot files: -000002 (which was the newest snapshot and the one the VM was running on) and -000004. Unfortunately I don’t have a screenshot of the -000002 vmdk file.

The parent disk was on another LUN. Its vmdk descriptor file seemed OK.

The problem was that -000002’s parentCID pointed to fcbc7dd4, while -000004 claimed CID fcbc7dd4, as you can see below.

Knowing that the CID identifiers change during boot, I was quite comfortable changing the CID (NOT the parentCID!). I changed -000004’s parentCID to fcbc7dd4, changed -000004’s CID to 499d08dd, and also changed the final 8 characters of the ddb.longContentID to 499d08dd. Basically, I swapped the IDs. Now -000004 points to the correct parent. The path to the parent disk was already correct, so I didn’t change that.

Now I only had to point -000002 to -000004 as its parent, so I changed its parentCID to 499d08dd. The path was already correct. I made sure the -000002 and -000004 vmdk descriptor files listed createType as vmfsSparse and that the extent description also included VMFSSPARSE.
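
To make the descriptor edits easier to picture, this is roughly what the relevant lines of the -000004.vmdk descriptor looked like after the changes. It is a sketch, not a copy of the real file: the parent path, the extent size and the first 24 characters of the longContentID are placeholders; only the CID, parentCID, createType and the VMFSSPARSE extent type matter here.

# Disk DescriptorFile
version=1
CID=499d08dd
parentCID=fcbc7dd4
createType="vmfsSparse"
parentFileNameHint="/vmfs/volumes/[other LUN]/[VM name]/[parent disk].vmdk"

# Extent description
RW [number of sectors] VMFSSPARSE "[VM name]-000004-delta.vmdk"

# The Disk Data Base
ddb.longContentID = "xxxxxxxxxxxxxxxxxxxxxxxx499d08dd"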

I then removed the VM from the inventory and added it again. The disks were now all back to their original sizes. Because of the mention of possible data corruption in KB1007969, we contacted VMware support.

After 1 hour Adrian White, one of VMware’s Technical Support Engineers in Ireland, assured us that data corruption was pretty unlikely (not impossible). He checked my steps above and verified they were correct.

We were now basically back to the point where the only problem was the orphaned snapshot files. The solution was still the same: take a snapshot and consolidate all snapshots. This time, the vSphere Client was used. Taking the snapshot only took a moment because the VM was still shut down. Choosing the ‘Delete all’ option in the Snapshot Manager resulted in a slowly progressing progress bar, which, of course, was a good thing =)

This time around the deletion was successful. The VM powered up and booted without any problems. There were no further incidents and the VM performed splendidly.

To date, there have been no reports of data loss or data corruption. So in spite of losing mail capabilities for 4 hours (1 of which was spent waiting for VMware support), the organisation was satisfied with the performance once the mail functionality was restored. This helped a lot in the acceptance of the loss of productivity.

I also want to mention KB1015180, which explains how snapshots work.