ESXi 6.7: Strange storage issues

m3xiz

New Member
Jan 21, 2019
11
1
3
Hi,

I have been running a whitebox server since 2013. ESXi has been upgraded over the years and is now at 6.7.

Since December last year, I have been experiencing a strange issue, and despite searching Google and various forums I could not find any guidance on how to understand or fix it.

I currently have 4 hard drives attached to my system. Sometimes, and I have no clue about the root cause, I lose the datastores: the GUI still shows the 4 disks, but they all report 0 GB capacity, provisioned and free. A "rescan" of the storage or the adapters does nothing, and there is no message on the console.

Of course:
- all VMs accessing the drives have issues, and most of them crash;
- I have no logs, as the disk is not accessible (I am currently trying to send the logs to a separate syslog server).
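Because the local logs vanish together with the datastores, forwarding them to a remote syslog server is the way to capture what happens. A minimal sketch, assuming the standard ESXi 6.x `esxcli system syslog` and `esxcli network firewall` namespaces; the helper name `configure_remote_syslog` and the collector address `udp://192.0.2.10:514` are hypothetical placeholders, not from the original post:

```shell
#!/bin/sh
# Sketch: forward ESXi host logs to a remote syslog collector so vmkernel
# messages survive when the local datastores disappear.
# configure_remote_syslog is a hypothetical helper; the address is a placeholder.

configure_remote_syslog() {
  loghost="$1"   # e.g. udp://192.0.2.10:514
  # Add the remote collector as a log destination.
  esxcli system syslog config set --loghost="$loghost" || return 1
  # Open the outbound syslog ruleset in the ESXi firewall.
  esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true || return 1
  esxcli network firewall refresh || return 1
  # Reload the syslog service so the new destination takes effect.
  esxcli system syslog reload
}

# Usage on the ESXi shell (placeholder address):
#   configure_remote_syslog "udp://192.0.2.10:514"
```

The firewall step matters: without the `syslog` ruleset enabled, the host may accept the new loghost setting but never actually deliver the messages.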

If I reboot my ESXi host, everything comes back to normal.

My first thought was a hardware issue, but if that is the case, why does a reboot solve it? If it is not hardware, what could it be?

Thanks in advance.
 

zedascuras

New Member
Feb 15, 2015
12
1
3
39
Are you able to let us know what storage hardware you have on that whitebox?
Do you have an HBA, or are you connecting directly to the motherboard?
 

m3xiz

The HDDs are connected directly to the motherboard, a GIGABYTE GA-Z87-D3HP.
 

m3xiz

The issue occurred again. I was able to get some logs by sending syslog to an external server. However, they are meaningless to me:
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu2:2099916)vmw_ahci[0000001f]: CompletionBottomHalf:hotplug port status: 400040, IPM(0), SPD(0), DET(0)
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu2:2099916)vmw_ahci[0000001f]: LogExceptionSignal:port 0, Signal: RM|--|--|--|--|--|--|--|--|--|--|-- (0x0001) Curr: --|--|--|--|--|--|--|--|--|--|--|-- (0x0000)
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu2:2099916)vmw_ahci[0000001f]: CompletionBottomHalf:port Status Reporting Port Connect Enable: Clearing PxSERR.DIAG.x
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu2:2099916)vmw_ahci[0000001f]: CompletionBottomHalf:port Status Reporting Phy Ready Enable: Clearing PxSERR.DIAG.n
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu2:2099916)vmw_ahci[0000001f]: CompletionBottomHalf:hotplug port status: 400040, IPM(0), SPD(0), DET(0)
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu2:2099916)vmw_ahci[0000001f]: LogExceptionSignal:port 1, Signal: RM|--|--|--|--|--|--|--|--|--|--|-- (0x0001) Curr: --|--|--|--|--|--|--|--|--|--|--|-- (0x0000)
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu5:2097598)vmw_ahci[0000001f]: LogExceptionProcess:port 0, Process: RM|--|--|--|--|--|--|--|--|--|--|-- (0x0001) Curr: RM|--|--|--|--|--|--|--|--|--|--|-- (0x0001)
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu2:2099916)vmw_ahci[0000001f]: CompletionBottomHalf:port Status Reporting Port Connect Enable: Clearing PxSERR.DIAG.x
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu2:2099916)vmw_ahci[0000001f]: CompletionBottomHalf:port Status Reporting Phy Ready Enable: Clearing PxSERR.DIAG.n
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu2:2099916)vmw_ahci[0000001f]: CompletionBottomHalf:hotplug port status: 400040, IPM(0), SPD(0), DET(0)
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu2:2099916)vmw_ahci[0000001f]: LogExceptionSignal:port 2, Signal: RM|--|--|--|--|--|--|--|--|--|--|-- (0x0001) Curr: --|--|--|--|--|--|--|--|--|--|--|-- (0x0000)
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu0:2097599)vmw_ahci[0000001f]: LogExceptionProcess:port 1, Process: RM|--|--|--|--|--|--|--|--|--|--|-- (0x0001) Curr: RM|--|--|--|--|--|--|--|--|--|--|-- (0x0001)
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu5:2097598)vmw_ahci[0000001f]: ExceptionHandlerWorld:processing device removal...
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu0:2097599)vmw_ahci[0000001f]: ExceptionHandlerWorld:processing device removal...
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu2:2099916)vmw_ahci[0000001f]: CompletionBottomHalf:port Status Reporting Port Connect Enable: Clearing PxSERR.DIAG.x
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu2:2099916)vmw_ahci[0000001f]: CompletionBottomHalf:port Status Reporting Phy Ready Enable: Clearing PxSERR.DIAG.n
Jan 21 23:42:13 esxi.mno.nul vmkernel: cpu2:2099916)vmw_ahci[0000001f]: CompletionBottomHalf:hotplug port status: 400040, IPM(0), SPD(0), DET(0)
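The excerpt above shows `LogExceptionSignal` entries with an `RM` flag on ports 0, 1 and 2, followed by "processing device removal...", which suggests (an interpretation, not something the driver documents here) that the controller saw every disk drop off at once. A small filter over the remote syslog file makes that pattern easy to spot. This is a sketch: `ahci_removed_ports` is a hypothetical helper, and the line layout is assumed from the excerpt only:

```shell
#!/bin/sh
# Sketch: list the vmw_ahci ports that logged an RM (removal) exception signal.
# Reads vmkernel log lines on stdin; the layout is assumed from the excerpt above.

ahci_removed_ports() {
  # Keep only "LogExceptionSignal:port N, Signal: RM..." lines, print N,
  # then de-duplicate so each affected port appears once.
  sed -n 's/.*vmw_ahci.*LogExceptionSignal:port \([0-9][0-9]*\), Signal: RM.*/\1/p' \
    | sort -n | uniq
}

# Usage (path on the syslog server is an assumption):
#   ahci_removed_ports < /var/log/esxi/vmkernel.log
```

If several ports report a removal within the same second, a shared cause (controller, driver, power or cabling) is more plausible than a single failing disk.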


These log lines are, of course, only the first ones. My guess is that I have an issue with the vmw_ahci driver. I checked that the HBA is a Lynx Point controller and that the ESXi driver is the very latest one:
[root@esxi:~] esxcli system module get -m vmw_ahci
Module: vmw_ahci
Module File: /usr/lib/vmware/vmkmod/vmw_ahci
License: BSD
Version: 1.2.3-1vmw.670.1.28.10302608
Build Type: release
Provided Namespaces:
Required Namespaces: com.vmware.vmkapi@v2_5_0_0
Containing VIB: vmw-ahci
VIB Acceptance Level: certified
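The module query above confirms the installed vmw_ahci version, but not which driver actually owns each SATA adapter. `esxcli storage core adapter list` reports that per adapter. A minimal sketch, assuming the usual table layout (two header lines, HBA name in the first column, driver in the second); `ahci_adapter_drivers` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: print each storage adapter (vmhbaN) with the driver bound to it.
# Assumes the first two output lines are the header and dashed separator.

ahci_adapter_drivers() {
  esxcli storage core adapter list | awk 'NR > 2 { print $1, $2 }'
}

# Usage on the ESXi shell:
#   ahci_adapter_drivers
```

Running it before and after any driver change shows whether the onboard controller moved to a different driver.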

A clue? Anyone?

Thanks in advance
 

pricklypunter

Well-Known Member
Nov 10, 2015
1,708
515
113
Canada
Reminds me of an old Promise RAID controller failing...

Does this still happen with an earlier version of ESXi, version 6.0 or 6.5 for example?
 

m3xiz

I was running 6.0 before and upgraded directly to 6.7. 6.0 had no such issue. 6.7 ran stable for about a month before the instability started.

If I don't find a solution on 6.7, I will probably downgrade to 6.5. I would like to avoid that, but since yesterday evening I have had to reboot my server 3 times: this is not viable.
 

pricklypunter

What media is ESXi booting from? Try booting ESXi from different media; it's possible that your installation is simply corrupted by some external event, like a power outage or a random reboot :)
 

m3xiz

I am booting from a USB key. Your advice is quite interesting.

So far, I have changed the device driver for the controller (vmw_ahci).
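The post does not say which driver was swapped in. A workaround often cited for AHCI trouble on ESXi 6.5/6.7 is to disable the native vmw_ahci module so that the legacy sata-ahci driver claims the controller after a reboot. The sketch below assumes that is the kind of change meant here; `disable_native_ahci` and `enable_native_ahci` are hypothetical helper names:

```shell
#!/bin/sh
# Sketch (assumption): fall back from the native vmw_ahci driver to the
# legacy sata-ahci driver, with a way to revert. Helper names are hypothetical.

disable_native_ahci() {
  # Mark vmw_ahci as disabled; the change applies at the next boot.
  esxcli system module set --enabled=false --module=vmw_ahci || return 1
  echo "vmw_ahci disabled; reboot the host to apply."
}

enable_native_ahci() {
  # Revert to the native driver if the legacy one does not help.
  esxcli system module set --enabled=true --module=vmw_ahci || return 1
  echo "vmw_ahci re-enabled; reboot the host to apply."
}
```

Keeping the revert path alongside the change makes it cheap to test one variable at a time before moving on to reinstalling or downgrading.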

If this does not solve the issue, I will follow your advice and reinstall on another USB key.

If it is still unstable, I will downgrade to 6.5. If it is still not stable... no clue :-(

Thx
 

m3xiz

For the record: since my last message, the server seems to be completely stable.

Thanks all for your help
 