Originally published May 21, 2016 @ 11:13 pm
So, being a Unix admin on-call for the week, I just spent half of my Saturday fixing a dead Windows VM. Very annoying. As best I can tell, the issue occurred during a snapshot operation: some genius either powered the VM down or powered it back up while the snapshot was still in progress. After that, the VM could not be started, cloned, snapped, or migrated to another cluster.
Any attempt to do so would result in an “invalid option” or “VMDK locked” error. The only thing I could do was to migrate the VM from one ESX host to another within the same cluster in a fruitless attempt to find the host that was holding the VMDK lock. No luck there.
Checking all the hosts in the ESX cluster via SSH showed no “*.lck” files. Trying to migrate the VM to another cluster failed with the same error: some VMDK was apparently locked. Running “lsof” or “ps” on any of the cluster nodes showed no active processes holding the VMDK. VMware KB articles, with their tiny gray-on-white fonts and full-screen width, proved unhelpful. There were a bunch of <vm_name>-00000[0-9] files in the VM directory on the datastore and they were all “locked”.
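The file checks above can be rolled into one small shell function. This is only a rough sketch, not a VMware tool: the function name list_vm_debris and the VMFS_ROOT override are made up for illustration, and the glob simply mirrors the <vm_name>-00000[0-9] naming mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: list lock files and snapshot leftovers for one VM.
# VMFS_ROOT is an assumed override for trying this outside ESX;
# it is not an ESX setting. On a host it falls back to /vmfs/volumes.
list_vm_debris() (
    datastore=$1
    vm_name=$2
    root=${VMFS_ROOT:-/vmfs/volumes}
    vm_dir="${root}/${datastore}/${vm_name}"

    [ -d "${vm_dir}" ] || { echo "no such directory: ${vm_dir}" >&2; exit 1; }

    # Lock files (or lock directories), if there are any
    find "${vm_dir}" -name '*.lck'

    # Snapshot leftovers named <vm_name>-00000[0-9]*
    for f in "${vm_dir}/${vm_name}"-00000[0-9]*; do
        if [ -e "${f}" ]; then
            echo "${f}"
        fi
    done
)

# Example (on the ESX host):
#   list_vm_debris my_awesome_datastore my_awesome_vm
```

An empty result means no lock files and no snapshot leftovers were found in that directory. It says nothing about locks held elsewhere.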
The solution required two things: root SSH access to the ESX cluster host where the VM was registered and, for the last few steps, the vSphere Client GUI. You would need to enable root SSH (from the stupid vSphere GUI) and then, if you’re feeling particularly efficient today, add your key to “/etc/ssh/keys-root/authorized_keys”.
To find out which host holds your VM hostage, just SSH to each host and run:
vim-cmd vmsvc/getallvms | grep <vm_name>
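If the cluster has more than a couple of hosts, the same check can be looped from your workstation. A minimal sketch, assuming key-based root SSH already works on every host; the function name find_vm_host and the host names in the example are placeholders, not anything from a real cluster.

```shell
#!/bin/sh
# Hypothetical helper: print every host whose inventory lists the VM.
# Usage: find_vm_host VM_NAME HOST [HOST...]
find_vm_host() (
    vm_name=$1
    shift
    for host in "$@"; do
        # grep -F does a plain substring match, so "vm1" also matches "vm10"
        if ssh "root@${host}" vim-cmd vmsvc/getallvms 2>/dev/null \
            | grep -qF -- "${vm_name}"; then
            echo "${host}"
        fi
    done
)

# Example:
#   find_vm_host my_awesome_vm esx01 esx02 esx03
```

A host that is unreachable is silently skipped, so an empty result means "not found on the hosts that answered".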
The next step is to clone the main VMDK:
datastore=my_awesome_datastore
vm_name=my_awesome_vm
mkdir /vmfs/volumes/${datastore}/${vm_name}_clone
vmkfstools -i /vmfs/volumes/${datastore}/${vm_name}/${vm_name}.vmdk \
  /vmfs/volumes/${datastore}/${vm_name}_clone/${vm_name}_clone.vmdk
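If you expect to do this more than once, the clone step can be wrapped in a function with a couple of sanity checks. Again, just a sketch under assumptions: clone_vmdk and the VMFS_ROOT override are hypothetical names, and the only vmkfstools option used is the same “-i” as above.

```shell
#!/bin/sh
# Hypothetical wrapper around the clone step.
# VMFS_ROOT is an assumed override for dry runs, not an ESX setting.
clone_vmdk() (
    datastore=$1
    vm_name=$2
    root=${VMFS_ROOT:-/vmfs/volumes}
    src="${root}/${datastore}/${vm_name}/${vm_name}.vmdk"
    dst_dir="${root}/${datastore}/${vm_name}_clone"
    dst="${dst_dir}/${vm_name}_clone.vmdk"

    # Refuse to run if the source is missing or a previous clone is in the way
    [ -f "${src}" ] || { echo "source disk not found: ${src}" >&2; exit 1; }
    [ ! -e "${dst_dir}" ] || { echo "clone directory already exists: ${dst_dir}" >&2; exit 1; }

    mkdir "${dst_dir}" || exit 1
    vmkfstools -i "${src}" "${dst}"
)

# Example (on the ESX host):
#   clone_vmdk my_awesome_datastore my_awesome_vm
```

The function returns whatever vmkfstools returns, so a failed clone can be caught with a plain “|| echo failed”.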
Then, the epic fail: I had to use the vSphere Client GUI. Browse the datastore, drill down to “/vmfs/volumes/${datastore}/${vm_name}_clone”, right-click and “Add to Inventory”. Also in the GUI, edit the VM settings, remove the old “hard drive” and add the “/vmfs/volumes/${datastore}/${vm_name}_clone/${vm_name}_clone.vmdk” one.
Now, remove the original VM from the inventory and power on the clone. The stupid thing came right back up. You can move the old datastore VM folder to “something_fubar”. The last step is to rename the VM to its original name. Virtualization is for managers, GUIs are for amateurs, and Windows is for… never mind. Ugh…