VMware Update Manager - The good, the bad, the ugly
Today I wanted to try out VMwares shiny new Update Manager for VI3 (ESX 3.5, Virtual Center 2.5) because I heard so many cool things about it on VMworld. Unfortunately, my expectations to this product were much higher than what I got out of it. Let me summarize what they should improve in the next version.
I have three ESX hosts, one "power" machine (dual core, each CPU has about 3GHz, 16GB RAM) and two smaller machines of equal size (dual core, each CPU has about 3GHz, 8GB ram). Previously I always had to manually hot migrate (using VMotion) my VMs from the ESX host to be updated to the other ESX hosts which will be upgraded later on in this process. After applying all the updates, I had to manually move them back, update the next server, etc. VMware Update Manager claims to do exactly all that for your virtual infrastructure making updates a lot easier because they can even be scheduled at times when the server load is not that high or at the weekends, etc.
Well, thanks god I didn't try that schedule thing but was in front of my Virtual Center when trying Update Manager.
To start the whole update procedure, I clicked on "Remediate..." on my ESX Cluster and Update Manager picked one of my ESX hosts (oddly it didn't pick the one with the lowest load), started opening firewall ports, installing some stuff and finally wanted to put the host into maintenance mode. Due to me having the DRS automation level set to "partially automated" (I'm paranoid, you know - I don't even trust my virtual infrastructure), entering the maintenance mode would have timed out because the DRS migrations to move all the VMs off of the ESX host that currently gets updated were just recommendations. I needed to manually apply the generated recommendations and then it started to migrate the VMs away.
My VMotion network is currently only connected via 100MBit/s which I know is not recommended by VMware but it works (migrations take longer, but that doesn't bother me that much), _BUT_ because migrations take longer, the "put $esxhost into maintenance mode" task times out and what's even much worse: The parent task of this update process (called "Remediate Entity") stalls at a certain percentage level and stops working. You can't cancel it, you can't restart it, in fact, trying to start a new remediation only makes things worse.
Another thing that isn't very smart would be the automatically generated DRS recommendations. At the time Update Manager tries to get one ESX host out of duty it scans the cluster for available resources and in my case, having two additional ESX servers with average (low) load, it only chose _ONE_ of them to host the VMs to be migrated. Bad idea. During the migration, the load on ESX host A started to increase and DRS moves machines from ESX A away to ESX B to "balance average CPU loads", as it said... Well, what about generating new DRS recommendations after having migrated two VMs off the target ESX server? Things might have changed after that... Never thought about that? Don't worry, I already know.
Anyway, what helped to get the Update Manager processes disappear was to manually kill the update-manager.exe service on the Virtual Center server (stopping the service also timed out), wait a few seconds, start it again and wait for Virtual Center to reintialize the Update Manager extension. If it doesn't, set it manually to "Enable" again and all the previous, stalled, tasks should now have been quit with "VMware Update Manager had a failure". That's good, because now you can start over with patching your VI.
After having all machines manually migrated off my target ESX host, I put it manually into maintenance mode and started a new remediation on cluster level to see if it would be so clever to choose the host already in maintenance mode but it didn't (OK, that might be a good idea; you never know _WHY_ this host is currently in maintenance mode). What puzzled me was VMware Update Manager's overestimation of its capabilities, because one of my hosts was put to maintenance mode (therefore it wasn't an active part of my cluster anymore) and the other two ESX hosts hosted all my virtual machines being under quite some load trying to cope with that. But not enough, it tried to consolidate the two remaining ESX servers to get one free for applying updates onto it.
VMware, could you please ask me if I really want to do this? Doing this causes my whole VI to simply stop working because one ESX host can't handle this load. It's simply ridiculous to start the update process when resources are that low...
So, after my first date with VMware Update Manager I decided to not trust it as much as I would have liked to.
What worked for me was to manually (!!) migrate all the VMs of one of my ESX servers onto the other two ESX servers (I used a very, very complex algorithm to find out what VMs to move onto what ESX server to "balance the average CPU loads") and afterwards started the remediation of the critical updates on the empty ESX server.
During the time I'm writing this I'm currently giving Update Manager a second chance to prove that it could be my friend. To make stuff easier, I changed the DRS automation level to "fully automated" and bingo, that worked now. Update Manager was able to put the host into maintenance mode and it did a fairly good job in migrating the VMs to the other hosts. It is currently installing the updates and maybe afterwards I'll do some tests on VMware HA (isolation and stuff seems interesting...).
Long story short: Do extensive testing on VMware Update Manager before letting it do its work unattended.