While working on some older DL360s, I ran into the infamous Purple Screen of Death. Similar to Microsoft’s Blue Screen of Death, this occurs when the VMkernel on a VMware ESXi host hits an unrecoverable error, the equivalent of a kernel panic on Linux, and halts the system. It is typically caused by a driver issue, a hardware issue, or in my case, a recent patch.
When I patched the system, the ESXi host restarted as normal, but got stuck on a task appropriately called VMK Shutdown: World_DestroyAllUsersWorlds.
The task hung for a few minutes and then puked up the Purple Screen of Death. The specific updates in question were these version 8 patches:
It’s unclear to me which of these patches specifically caused the crash, but thankfully an additional reboot resolved it. The system cleaned up the failed task, restarted all services, and was perfectly fine afterwards.
In retrospect, this incident underscores the need for maintaining an N+1 system configuration in a VMware vSphere cluster. While the crash I encountered was easily remediated with a reboot, a more prolonged outage without adequate resources on the remaining hosts could have put the entire cluster at risk. By implementing an N+1 configuration, you gain the flexibility to take a host out of production without compromising the stability and availability of your cluster.
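To make the N+1 idea concrete, here is a minimal sketch of the arithmetic, looking at memory only. The function name, the host sizes, and the demand figures are all hypothetical, and it ignores CPU, hypervisor overhead, and vSphere HA admission control settings, so treat it as a back-of-the-napkin check rather than a sizing tool.

```python
def tolerates_one_host_failure(host_memory_gb, vm_memory_demand_gb):
    """Return True if the VMs still fit after the largest host is lost.

    host_memory_gb: usable memory per host, in GB (hypothetical values).
    vm_memory_demand_gb: total memory the running VMs need, in GB.
    """
    if len(host_memory_gb) < 2:
        # A single host has nothing to fail over to.
        return False
    # Worst case: the biggest host is the one that goes down.
    remaining = sum(host_memory_gb) - max(host_memory_gb)
    return remaining >= vm_memory_demand_gb


if __name__ == "__main__":
    hosts = [256, 256, 256]  # three hypothetical hosts with 256 GB each
    print(tolerates_one_host_failure(hosts, 400))  # 512 GB left, 400 GB needed
    print(tolerates_one_host_failure(hosts, 600))  # 512 GB left, 600 GB needed
```

If the check fails, a single purple screen or even a routine patch reboot means some VMs have nowhere to go until the host comes back.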
The Purple Screen of Death is an unwelcome sight for any virtualization enthusiast. Through my firsthand encounter, I experienced the panic it can induce. However, by promptly addressing the issue and embracing an N+1 system configuration, we can minimize the potential impact of such incidents. Remember, in the realm of VMware, preparedness and redundancy are key to ensuring a stable and resilient virtual environment.