VMware ESX August 12 bug - Survived!
Some of you may have already heard about the VMware ESX Server 3.5 Update 2 August 12 bug
(lovingly called KB1006716 by VMware), which has hit thousands of data center farms around the world.
I don't want to complain about bugs in software. We all know they exist and that you can never be sure you have found them all. But this bug was a time bomb, which makes it pretty hard to detect, even with testing.
In my case, I had upgraded one of my ESX servers to U2 a while ago and it ran very smoothly for more than a week (enough testing for me), so I upgraded the other ESX servers too. All was fine.
Yesterday I received a call from a client who told me that he was unable to start his virtual machine. I suspected the new VI client had issues with permission handling for restricted users and tried to start it with my own account, but still no dice.
I then looked at the logfiles and tried several things. After about an hour of troubleshooting I googled the error messages I was getting and was redirected to an unresponsive VMware KB article, which was the first indication that something was terribly wrong. The site is still down (or at least _VERY_ slow; I have been waiting about 5 minutes for it to load now), but fortunately the download of the patch itself was quite fast, so I'm currently upgrading one of my ESX servers to get it all back up and running.
What really bugs me is that you can't even migrate virtual machines away from a host to do the upgrade without downtime: due to this bug, VMotion only works when the VMs are migrated to unaffected hosts.
The good news is that I have one server in my farm that currently hosts only two production machines (one webserver and one interface server), and neither is _REALLY_ important, so it's no big deal if they're down for a few minutes after 4pm. That's where I'm currently at.
Server #4 is installing the patch through Update Manager, and when it's back up, I can move the virtual machines off all the other servers and do the upgrade without any further downtime. That's the theory, at least...
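
For anyone planning the same dance, here is a rough Python sketch of the ordering described above: patch one host with downtime first, then empty each remaining host onto an already patched one before patching it. It only builds a plan and does not talk to ESX or VirtualCenter. The host and VM names are made up, the function name `plan_rolling_upgrade` is my own, and it assumes every patched host has enough capacity for whatever gets moved onto it.

```python
def plan_rolling_upgrade(hosts, first):
    """Build an ordered upgrade plan.

    hosts: dict mapping host name -> list of VM names (hypothetical inventory).
    first: the host that gets patched first and takes the downtime.
    Assumption: capacity is ignored; a real plan would have to check it.
    """
    placement = {h: list(vms) for h, vms in hosts.items()}
    # The first host has no patched target yet, so its VMs go down.
    steps = [("patch_with_downtime", first, tuple(placement[first]))]
    patched = [first]

    for host in hosts:
        if host == first:
            continue
        # Move everything to the least loaded host that is already patched.
        target = min(patched, key=lambda h: len(placement[h]))
        moved = tuple(placement[host])
        placement[target].extend(moved)
        placement[host] = []
        steps.append(("vmotion", host, target, moved))
        steps.append(("patch", host))
        patched.append(host)

    return steps


if __name__ == "__main__":
    farm = {
        "esx1": ["a", "b", "c"],
        "esx2": ["d", "e"],
        "esx4": ["web", "iface"],
    }
    for step in plan_rolling_upgrade(farm, "esx4"):
        print(step)
```

Picking the least loaded patched host as the target is just one simple way to spread the VMs; in practice you would pick targets by free memory and CPU.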