This is the first guest post of what I hope to be many from the great Herb Estrella:
In my personal experience Nutanix one-click upgrades work as advertised, but there are few items that should be accounted for in preparation of installing ESXi patches on a Nutanix cluster. This post will cover a few pre-requisites to look for, touch on the subtasks of the patching procedure, and finally close out with some troubleshooting tips and links to resources that I found helpful.
If you’ve seen the Dr. Strange movie you’ll find that going through the one-click upgrade process is loosely akin to reading from the “Book of Cagliostro” in that “the warnings come after you read the spell.”
There is a pre-upgrade process that is done prior patching that catches a few items but here a few pre-requisites that I found will set you up nicely for success:
- vSphere HA/DRS settings need to be set according to best practices aka “recommended practices” as these account for the CVM and a few other items that make a Nutanix cluster in vSphere different.
- DRS Rules (affinity/anti-affinity rules), if in use, can also cause problems. For example, if you have a 3 node cluster and 3 VMs part of anti-affinity rules, it is a good idea to temporarily disable the rules. Re-enable the rules when patching is complete.
- ISOs mounted (namely due to VMware Tools installs) are major culprits for VMs not moving when scheduled to by DRS or moved manually. I recommend to unmount any ISOs that aren’t accessible from all hosts within a cluster.
Subtasks are the steps in the one-click upgrade sequence from start to finish. Below are a listing of them with some observations from each.
- Downloading hypervisor bundle on this CVM
- When the patch is initially uploaded it is stored in a CVM’s following directory: /home/nutanix/software_uploads/hypervisor/. How you access Prism determines which CVM this hypervisor bundle (aka patch) will reside on first. This should be mostly transparent but this is one of those “good to know” items. The hypervisor bundle needs to be copied from the initial source location onto the CVM for which its host is being upgraded by the one-click upgrade process if this fails “no soup for you.”
- Waiting to upgrade the hypervisor
- …nothing to see here…
- Starting upgrade process on this hypervisor
- …keep it moving…
- Copying bundle to hypervisor
- …business as usual…
- Migrating user VMs on the hypervisor
- Huzzah! This is a good one to pay attention to especially if the pre-requisites previously covered are not addressed. The upgrade will most likely timeout/fail here and it may not give you any helpful information as to why.
- This is also a good spot to watch the Tasks/Events tab on the ESXi host being patched to get some better insight in the process.
- Installing bundle on the hypervisor
- If all VMs have been successfully migrated, the host should be in maintenance mode with the CVM shutdown. This step also takes the longest…so patience is key.
- Completed hypervisor upgrade on this node
- At this stage the host is ready to run VMs as it should now be out of maintenance mode with the CVM powered on.
In the test environment I was working with I made a lot of assumptions and just dove head first. The results as you can imagine were not good. Here are a few troubleshooting measures I used to help right my wrongs.
- The upload process for getting the ESXi patch to the CVM is straight forward; however there are two ways to do it: download a json direct from the Nutanix support portal or enter the MD5 info from the patch’s associated KB article. I chose to upload a json and purposefully use the wrong patch and now I can’t delete the json even after completing the upgrade. If I find out how to resolve this issue I’ll update this post. This is where knowing the file location of the patch on the CVM can be helpful (/home/nutanix/software_upload/hypervisor) because the patch can be deleted or replaced.
- Restarting Genesis! This one is a major key. For example, the one-click upgrade is stuck, a VN didn’t migrate, and even after the VM is manually migrated the one-click upgrade won’t just continue where it left off. In my experience to resolve this you’ll need to give it a little nudge in the form of a genesis restart. Run this command (genesis restart) on the CVM that failed, if that doesn’t work trying restarting genesis on the other hosts in the cluster. I was doing this in a test environment and did an allssh genesis restart and was able to get the process moving, but results may vary. If you err on the side of caution restart genesis one at a time manually.
- Some helpful commands to find errors in logs
- grep ERROR ~/data/logs/host_preupgrade.out
- grep ERROR ~/data/logs/host_upgrade.out
- For the admins that aren’t about that GUI life you can run the one-click upgrade command from a CVM
- cluster –md5sum=<md5 from vmware portal> –bundle<full path to the bundle location on the CVM> host_upgrade
- To check on the status host_upgrade_status
- One click upgrades via vmwaremine
- Troubleshooting KB article via Nutanix Support Portal, may require Portal access to view.
vSphere settings via Nutanix Support Portal, may require Portal access to view.
Bonus thoughts: Do I need vSphere Update Manager if I’m using Nutanix? This could be a post on it’s own (and it still might be) but I have some thoughts I’d like to share.
- In a traditional setup you will most likely have vSphere Update Manager installed on a supported Windows VM (unless VCSA 6.5) with either SQL Express or a DB created on an existing SQL server. One-click upgrade is built into Prism.
- Prism has visibility into the ESXi hosts for versioning so if a host was “not like the others” then it would pop up on a NCC check or in the Alerts in Prism.
- vCenter Plugin
- This one is worth mentioning but really not a huge deal. It’s one less thing to worry about and ties back into the resources statements above.
- My Answer
- It depends on if I’m all in with Nutanix because if my entire infrastructure were Nutanix hosts then I would not deploy vSphere Update Manager.