Had an ESXi PSOD today. That does not happen very often, so I was quite surprised to find out that the root cause was not hardware related.
VMware analyzed the memory dump, and it turned out to be a faulty driver. That made sense, since a PSOD often comes from drivers or agents when it is not a hardware issue.
The PSOD I got was the following:
#PF Exception 14 in World xxxxxxx:vmnicX-pollw IP xxxxxxxxxx addr xxxxxxxx
Three things are worth noticing. First, the uptime of the server was more than 77 days. Initially I suspected a memory leak, since none of the servers had had any issues for more than 77 days and then they started crashing. At first two servers crashed, but then a server that had just been rebooted crashed as well, so we ruled that theory out.
Second, vmnicX-pollw was listed each time as the world ESXi was running at the time of the crash, and third, the qfle3 driver is mentioned.
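To see which driver each vmnic is actually bound to, something like the following could be run in the ESXi shell. This is only a minimal sketch: nic_drivers is a hypothetical helper name, and it assumes the usual table layout of `esxcli network nic list` (two header lines, then rows with the NIC name in the first column and the driver in the third).

```shell
# Hypothetical helper: print "<vmnic> <driver>" for every physical NIC.
# Assumes the default table layout of `esxcli network nic list`:
# two header lines, then rows whose 1st column is the NIC name
# and whose 3rd column is the driver.
nic_drivers() {
  esxcli network nic list | awk 'NR > 2 && NF >= 3 { print $1, $3 }'
}
```

Calling nic_drivers on the host would then list one line per NIC, which makes it easy to spot the adaptors that are running on qfle3.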
Turns out there is a VMware KB that might be relevant to this problem: https://kb.vmware.com/s/article/52044
The KB mentions that during the upgrade from 6.5 to 6.7 the driver might change from bnx2x to qfle3, and that there is a way to switch back to the bnx2x driver.
The commands to switch back to the bnx2x driver are:
esxcli system module set --enabled=true --module=bnx2x
esxcli system module set --enabled=false --module=qfle3
After the commands are executed on the host, a reboot is needed.
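After the reboot it can be worth confirming that the module switch actually took effect. The sketch below is one way to do that; module_state is a hypothetical helper name, and it assumes that `esxcli system module list` prints the columns Name, Is Loaded and Is Enabled in that order.

```shell
# Hypothetical helper: print "<is loaded> <is enabled>" for one kernel module.
# Assumes `esxcli system module list` prints: Name, Is Loaded, Is Enabled.
module_state() {
  esxcli system module list | awk -v m="$1" '$1 == m { print $2, $3 }'
}
```

Running module_state bnx2x and module_state qfle3 shows whether each module is loaded and enabled. If the switch worked, bnx2x should be reported as enabled and qfle3 as not enabled.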
Unfortunately, this solution did not work in our case. The bnx2x driver was not compatible with our storage NICs, so another solution was required.
Also HPE no longer supports that driver for certain network adaptors: https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00044039en_us&docLocale=en_US
In our case we use this adaptor type for both networking and FCoE storage, and there is something special to notice here. The 126.96.36.199 driver is unfortunately not supported for FCoE, so it is a little unfortunate that HPE provides that driver through their vibsdepot: you cannot assume that you are in a supported state, and you really have to pay attention and micromanage your drivers.
To downgrade to the 188.8.131.52 version, you can download the driver from VMware here: https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI67-QLOGIC-QCNIC-20647&productId=742
To install the driver I found that the easiest way was to unpack it and upload the offline bundle to the datastore, and then install it using PowerCLI after putting the host into maintenance mode.
Here are the commands needed to install the bundle:
$vmhost = Get-VMHost "<HostName>"
$esxcli = Get-EsxCli -VMHost $vmhost
$esxcli.software.vib.install("<datastore path>/driver-offline-bundle.zip")
Afterwards a reboot of the host is needed.
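Once the host is back up, the installed driver version can be checked from the ESXi shell. The following is a minimal sketch, not a definitive procedure: vib_version is a hypothetical helper name, it assumes that `esxcli software vib list` prints the VIB name in the first column and the version in the second, and it assumes that the driver VIB is named qfle3.

```shell
# Hypothetical helper: print the installed version of a VIB by exact name.
# Assumes `esxcli software vib list` prints: Name, Version, Vendor, ...
vib_version() {
  esxcli software vib list | awk -v n="$1" '$1 == n { print $2 }'
}
```

Running vib_version qfle3 should then print the version that was just installed. An empty result would suggest that the VIB has a different name on your host.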
I hope this helps you out. If it turns out that this problem persists, I will update this article with relevant information.
Please perform any operations seen in this article at your own risk.
Sergei in the comments pointed me to this HPE advisory, a00077550en_us, which states that you need to upgrade to 184.108.40.206, but also that the problem should only affect 220.127.116.11 drivers and older. That does not seem to be entirely correct, since we had the problem with the 18.104.22.168 driver.
vSphere 6.5 Driver Download Link
vSphere 6.7 Driver Download Link