ESXi 7 host unresponsive after approx. a week?

lovschal · Jan 16, 2023

Hi,

I updated my hardware some weeks ago, and now my ESXi host becomes unresponsive about once a week. I have to kill the server with the power button and then boot it again.

My old host (running ESXi 6.7) didn't do it, but the new one does.

The new one runs ESXi 7.0U3g.

It's an HP ProDesk 600 G5 SFF with a 6-core Intel Core i5-9500 CPU and 16GB of RAM, so plenty of power for my two VMs (Ubuntu and Raspbian).

I'm rather new to ESXi so I don't know how and where to find the reason for the issue.

Can you guys point me in the right direction? Please let me know what info you might need.

Thanks in advance.

Best regards

A. Lovschal

Rand__ · Jan 16, 2023

Anything on the GPU output (purple screen of death)? (Assuming you don't have an IPMI/BMC console.)

Checked logs for errors yet?

CPU/SSD/NVME thermals? Defective memory?

lovschal · Jan 16, 2023

Normally I don't have a screen on it, but I will connect one the next time it becomes unresponsive. By the way, it happened earlier today, and I again powered it off and on. Everything boots up normally afterwards.

Which logs should I focus on?

lovschal · Jan 16, 2023

These are the last lines from hostd.log before it became unresponsive:

Code:

2023-01-16T01:57:06.459Z info hostd[265223] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 143 : Hardware Sensor Status: Processor Green, Memory Green, Fan Green, Voltage Green, Temperature Green, Power Green, System Board Green, Battery Green, Storage Green, Other Green
2023-01-16T01:57:06.459Z warning hostd[265223] [Originator@6876 sub=Cimsvc] Numeric sensors reset to unknown state
2023-01-16T01:57:06.459Z error hostd[265223] [Originator@6876 sub=Default] IpmiIfcOpenIpmiOpen: open(/dev/ipmi0, RDWR) failed 2 m
2023-01-16T02:07:06.463Z warning hostd[265110] [Originator@6876 sub=Cimsvc] Numeric sensors reset to unknown state
2023-01-16T02:07:06.463Z error hostd[265110] [Originator@6876 sub=Default] IpmiIfcOpenIpmiOpen: open(/dev/ipmi0, RDWR) failed 2 m
2023-01-16T02:17:06.467Z warning hostd[264250] [Originator@6876 sub=Cimsvc] Numeric sensors reset to unknown state
2023-01-16T02:17:06.467Z error hostd[264250] [Originator@6876 sub=Default] IpmiIfcOpenIpmiOpen: open(/dev/ipmi0, RDWR) failed 2 m
2023-01-16T02:27:06.470Z warning hostd[265108] [Originator@6876 sub=Cimsvc] Numeric sensors reset to unknown state
2023-01-16T02:27:06.470Z error hostd[265108] [Originator@6876 sub=Default] IpmiIfcOpenIpmiOpen: open(/dev/ipmi0, RDWR) failed 2 m
2023-01-16T02:37:06.474Z warning hostd[265110] [Originator@6876 sub=Cimsvc] Numeric sensors reset to unknown state
2023-01-16T02:37:06.474Z error hostd[265110] [Originator@6876 sub=Default] IpmiIfcOpenIpmiOpen: open(/dev/ipmi0, RDWR) failed 2 m
2023-01-16T02:47:06.479Z warning hostd[265108] [Originator@6876 sub=Cimsvc] Numeric sensors reset to unknown state
2023-01-16T02:47:06.479Z error hostd[265108] [Originator@6876 sub=Default] IpmiIfcOpenIpmiOpen: open(/dev/ipmi0, RDWR) failed 2 m
2023-01-16T02:52:40.790Z info hostd[264245] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/63ac9540-ffd738e8-1dbf-e8d8d1bfedd6/Ubuntu server/Ubuntu server.vmx opID=vim-cmd-dd-eafb-eafc] Send config update invoked
--> [context]zKq7AVICAgAAAKEvNgEPaG9zdGQAAA2hQmxpYnZtYWNvcmUuc28AADJoHQB90RsBlffAaG9zdGQAAZHiwQFP6sEBGfvBAVQg3wEHLIUBNNDNAKzHLQA0Ay4A4hA/Ajt9AGxpYnB0aHJlYWQuc28uMAADbdEObGliYy5zby42AA==[/context]
2023-01-16T02:52:40.803Z verbose hostd[264245] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/63ac9540-ffd738e8-1dbf-e8d8d1bfedd6/Ubuntu server/Ubuntu server.vmx opID=vim-cmd-dd-eafb-eafc] Time to gather config: 11 (msecs)
2023-01-16T02:57:06.484Z warning hostd[265226] [Originator@6876 sub=Cimsvc] Numeric sensors reset to unknown state
2023-01-16T02:57:06.484Z error hostd[265226] [Originator@6876 sub=Default] IpmiIfcOpenIpmiOpen: open(/dev/ipmi0, RDWR) failed 2 m
2023-01-16T03:07:06.488Z warning hostd[265219] [Originator@6876 sub=Cimsvc] Numeric sensors reset to unknown state
2023-01-16T03:07:06.488Z error hostd[265219] [Originator@6876 sub=Default] IpmiIfcOpenIpmiOpen: open(/dev/ipmi0, RDWR) failed 2 m
2023-01-16T03:17:06.492Z warning hostd[265221] [Originator@6876 sub=Cimsvc] Numeric sensors reset to unknown state
2023-01-16T03:17:06.492Z error hostd[265221] [Originator@6876 sub=Default] IpmiIfcOpenIpmiOpen: open(/dev/ipmi0, RDWR) failed 2 m
2023-01-16T03:27:06.496Z warning hostd[265224] [Originator@6876 sub=Cimsvc] Numeric sensors reset to unknown state
2023-01-16T03:27:06.496Z error hostd[265224] [Originator@6876 sub=Default] IpmiIfcOpenIpmiOpen: open(/dev/ipmi0, RDWR) failed 2 m

These are the last lines from syslog.log before it became unresponsive:

Code:

2023-01-16T03:20:00.481Z crond[262730]: USER root pid 311974 cmd /bin/crx-cli gc
2023-01-16T03:25:00.493Z crond[262730]: USER root pid 311987 cmd /bin/hostd-probe.sh ++group=host/vim/vmvisor/hostd-probe/stats/sh
2023-01-16T03:28:22.134Z smartd[264445]: [warn] t10.NVMe____SK_hynix_PC601_HFS256GD9TNG2DL2A0A_______3C9A080A002EE4AC: REALLOCATED SECTOR CT below threshold (0 < 95)
2023-01-16T03:30:00.504Z crond[262730]: USER root pid 311999 cmd /bin/hostd-probe.sh ++group=host/vim/vmvisor/hostd-probe/stats/sh
2023-01-16T03:30:00.505Z crond[262730]: USER root pid 312000 cmd /bin/crx-cli gc
2023-01-16T03:35:00.513Z crond[262730]: USER root pid 312013 cmd /bin/hostd-probe.sh ++group=host/vim/vmvisor/hostd-probe/stats/sh

These are the last lines from vmkernel.log before it became unresponsive:

Code:

2023-01-16T03:34:31.648Z cpu0:262523)INFO (ne1000): false RX hang detected on vmnic1
2023-01-16T03:34:45.221Z cpu0:262283)StorageDevice: 7059: End path evaluation for device t10.ATA_____ST9320423AS_________________________________________5VH3RQR6
2023-01-16T03:34:45.221Z cpu0:262283)StorageDevice: 7059: End path evaluation for device t10.NVMe____SK_hynix_PC601_HFS256GD9TNG2DL2A0A_______3C9A080A002EE4AC
2023-01-16T03:34:46.748Z cpu0:262523)INFO (ne1000): false RX hang detected on vmnic1
2023-01-16T03:34:52.848Z cpu0:262523)INFO (ne1000): false RX hang detected on vmnic1
2023-01-16T03:35:04.948Z cpu0:262523)INFO (ne1000): false RX hang detected on vmnic1
2023-01-16T03:35:11.049Z cpu0:262523)INFO (ne1000): false RX hang detected on vmnic1
2023-01-16T03:35:17.149Z cpu0:262523)INFO (ne1000): false RX hang detected on vmnic1
2023-01-16T03:35:32.250Z cpu0:262523)INFO (ne1000): false RX hang detected on vmnic1
2023-01-16T03:35:38.350Z cpu0:262523)INFO (ne1000): false RX hang detected on vmnic1
2023-01-16T03:35:44.450Z cpu0:262523)INFO (ne1000): false RX hang detected on vmnic1
2023-01-16T03:35:50.551Z cpu0:262523)INFO (ne1000): false RX hang detected on vmnic1
2023-01-16T03:35:56.651Z cpu0:262523)INFO (ne1000): false RX hang detected on vmnic1
2023-01-16T03:36:05.751Z cpu0:262523)INFO (ne1000): false RX hang detected on vmnic1
2023-01-16T03:36:11.852Z cpu0:262523)INFO (ne1000): false RX hang detected on vmnic1
2023-01-16T03:36:29.952Z cpu0:262523)INFO (ne1000): false RX hang detected on vmnic1

These are the last lines from vpxa.log before it became unresponsive:

Code:

2023-01-16T03:30:46.419Z info vpxa[264788] [Originator@6876 sub=vpxLro opID=1ab86873] [VpxLRO] -- BEGIN lro-1820 -- vpxa -- vpxapi.VpxaService.getVpxaInfo -- 52ef656e-4a6a-b112-83fb-c9dc0c4a6737
2023-01-16T03:30:46.419Z info vpxa[264788] [Originator@6876 sub=vpxLro opID=1ab86873] [VpxLRO] -- FINISH lro-1820
2023-01-16T03:35:46.425Z info vpxa[264790] [Originator@6876 sub=vpxLro opID=40d163b9] [VpxLRO] -- BEGIN lro-1821 -- vpxa -- vpxapi.VpxaService.getVpxaInfo -- 52121dfa-2c65-ef3e-9eaa-efaf81db8f4d
2023-01-16T03:35:46.425Z info vpxa[264790] [Originator@6876 sub=vpxLro opID=40d163b9] [VpxLRO] -- FINISH lro-1821

I hope that can give you some clue.

All other logfiles don't contain anything at the given point where it became unresponsive.
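
To narrow down when the host actually froze, it can help to compare the newest timestamp in each log. Below is a minimal Python sketch of that idea, assuming the logs have been copied off the host as plain text. The function names and the `SAMPLE_LOGS` dictionary are made up for illustration; the sample lines are the last entry of each excerpt above.

```python
import re
from datetime import datetime, timezone

# The excerpts above start each entry with an ISO-8601 UTC timestamp.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\s")


def parse_timestamp(line):
    """Return the leading UTC timestamp of a log line, or None if absent."""
    match = TS_RE.match(line)
    if match is None:
        return None  # e.g. '--> [context]...' continuation lines
    ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
    return ts.replace(tzinfo=timezone.utc)


def last_timestamp(lines):
    """Return the newest timestamp found in an iterable of log lines."""
    stamps = [ts for ts in map(parse_timestamp, lines) if ts is not None]
    return max(stamps) if stamps else None


def last_entries(logs):
    """Return (log name, newest timestamp) pairs, newest log first."""
    pairs = [(name, last_timestamp(lines)) for name, lines in logs.items()]
    pairs = [(name, ts) for name, ts in pairs if ts is not None]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical container; in practice each value could be open(path).
SAMPLE_LOGS = {
    "hostd.log": [
        "2023-01-16T03:27:06.496Z error hostd[265224] [Originator@6876 sub=Default] "
        "IpmiIfcOpenIpmiOpen: open(/dev/ipmi0, RDWR) failed 2 m",
    ],
    "syslog.log": [
        "2023-01-16T03:35:00.513Z crond[262730]: USER root pid 312013 cmd "
        "/bin/hostd-probe.sh ++group=host/vim/vmvisor/hostd-probe/stats/sh",
    ],
    "vmkernel.log": [
        "2023-01-16T03:36:29.952Z cpu0:262523)INFO (ne1000): false RX hang detected on vmnic1",
    ],
    "vpxa.log": [
        "2023-01-16T03:35:46.425Z info vpxa[264790] [Originator@6876 sub=vpxLro "
        "opID=40d163b9] [VpxLRO] -- FINISH lro-1821",
    ],
}

if __name__ == "__main__":
    for name, ts in last_entries(SAMPLE_LOGS):
        print(f"{name}: last entry at {ts.isoformat()}")
```

With the sample lines, vmkernel.log has the newest entry (03:36:29 UTC), so the freeze would have happened some time after that. This only brackets the time of the hang; it says nothing about the cause.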

Rand__ · Jan 16, 2023

normally vmkernel.log should give info.
Nothing causing any major issues in these logs...

Console output (ALT-F10 or F11 iirc) might help if it's not a PSOD

lovschal · Jan 16, 2023

What do you mean by "Console output (ALT-F10 or F11 iirc)"?

Sorry for the NOOB-question..

Rand__ · Jan 16, 2023

https://kb.vmware.com/s/article/2148363

lovschal · Jan 16, 2023

Ahh, makes sense... Thanks for clarifying.

I just had a thought: could it be that the NVMe disk is disconnecting after some time, so the host is unable to read and write?
It's the major hardware change from my old setup to this one.
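
One way to look into that idea is to pull every log line that mentions the NVMe device and read them in order. Here is a small Python sketch, assuming the logs are available as text; `lines_mentioning` and `SAMPLE_LINES` are illustrative names, and the device identifier is the one from the syslog.log and vmkernel.log excerpts above.

```python
# Device identifier as it appears in the log excerpts above.
NVME_ID = "t10.NVMe____SK_hynix_PC601_HFS256GD9TNG2DL2A0A_______3C9A080A002EE4AC"


def lines_mentioning(lines, needle):
    """Return the log lines that contain `needle`, in their original order."""
    return [line.rstrip("\n") for line in lines if needle in line]


# Hypothetical sample built from lines in the excerpts above; in practice
# `lines` could be an open file handle for a copy of vmkernel.log or syslog.log.
SAMPLE_LINES = [
    "2023-01-16T03:28:22.134Z smartd[264445]: [warn] " + NVME_ID
    + ": REALLOCATED SECTOR CT below threshold (0 < 95)",
    "2023-01-16T03:34:31.648Z cpu0:262523)INFO (ne1000): false RX hang detected on vmnic1",
    "2023-01-16T03:34:45.221Z cpu0:262283)StorageDevice: 7059: "
    "End path evaluation for device " + NVME_ID,
]

if __name__ == "__main__":
    for hit in lines_mentioning(SAMPLE_LINES, NVME_ID):
        print(hit)
```

The sketch only filters lines. Whether any of them points to a disconnect still has to be judged by reading them.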

Rand__ · Jan 16, 2023

Yes that might be a reason

lovschal · Jan 16, 2023

During the night I moved everything off the NVMe.
So now I just have to wait and see if it happens again.
