Looking for some help here from more proficient VMware guys...
So I made a stupid mistake today and hot-unplugged a Mellanox card from my box running ESXi 6.5.
I also accidentally shook a network cable loose, but that happens once in a while, so I wasn't worried when the box didn't react. I didn't think of the cable at first, though, and rebooted the box via the power switch. Never had an issue with that before.
After the reboot the box came up, but only some of my VMs started as expected; the others reported a missing datastore.
Turns out the NVMe drive (Intel P3500) those VMs (the important ones, of course) resided on was no longer mounted as a datastore.
In fact, it's no longer recognized as a datastore or storage adapter at all.
It's still there: visible in lspci and in the passthrough list, but not under storage adapters.
I then removed some potentially stale PCI passthrough configs and rebooted. Nothing.
I moved the adapter to another host: same issue.
I see some errors in the vmkernel log but can't make much of them:
2017-01-13T21:12:02.675Z cpu4:66024)VMK_PCI: 915: device 0000:07:00.0 pciBar 0 bus_addr 0xfb710000 size 0x4000
2017-01-13T21:12:02.675Z cpu4:66024)DMA: 646: DMA Engine 'nvmeCtrlrDmaEngine' created using mapper 'DMANull'.
2017-01-13T21:12:02.675Z cpu4:66024)VMK_PCI: 765: device 0000:07:00.0 allocated 2 MSIX interrupts
2017-01-13T21:12:08.393Z cpu2:65641)nvme:nvmeCoreLogError:370:command failed: 0x4305592b8f90.
2017-01-13T21:12:08.393Z cpu2:65641)nvme:nvmeCoreLogError:370:command failed: 0x4305592b9110.
2017-01-13T21:12:08.394Z cpu2:65551)nvme:nvmeCoreLogError:370:command failed: 0x4305592b9290.
2017-01-13T21:12:10.396Z cpu4:66024)nvme:NvmeCore_SubmitCommandWait:1044:command 0x4305592b9410 failed, putting to abort queue.
2017-01-13T21:12:10.396Z cpu4:66024)nvme:NvmeCtrlr_RequestIoQueues:1164:Failed requesting nr_io_queues 0x0
2017-01-13T21:12:10.396Z cpu4:66024)nvme:NvmeCtrlr_Start:1647:Failed to allocate hardware IO queues.
2017-01-13T21:12:10.396Z cpu7:66275)nvme:NvmeCore_SubmitCommandWait:1044:command 0x4305592b9590 failed, putting to abort queue.
2017-01-13T21:12:12.496Z cpu7:66275)nvme:NvmeCore_SubmitCommandWait:1044:command 0x4305592b9710 failed, putting to abort queue.
2017-01-13T21:12:12.496Z cpu7:66275)nvme:NvmeCtrlr_ConfigAsyncEvents:2763:Async event config failed
2017-01-13T21:12:14.498Z cpu7:66275)nvme:NvmeCore_SubmitCommandWait:1044:command 0x4305592b9890 failed, putting to abort queue.
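In case it helps anyone sift through a longer vmkernel log the same way, here is a rough Python sketch that counts nvme driver messages per reporting function. The regex is only inferred from the line format in the excerpt above, and the names `NVME_LINE` and `summarize_nvme_errors` are made up for illustration; this is not a VMware tool.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the excerpt above:
# <timestamp> cpu<N>:<world id>)nvme:<function>:<source line>:<message>
NVME_LINE = re.compile(
    r"^(?P<ts>\S+) cpu\d+:\d+\)nvme:(?P<func>\w+):(?P<line>\d+):\s*(?P<msg>.*)$"
)


def summarize_nvme_errors(lines):
    """Count nvme driver messages per reporting function (hypothetical helper)."""
    counts = Counter()
    for raw in lines:
        match = NVME_LINE.match(raw.strip())
        if match:
            counts[match.group("func")] += 1
    return counts


# A few lines copied from the log above; the VMK_PCI line is skipped by the regex.
sample = [
    "2017-01-13T21:12:02.675Z cpu4:66024)VMK_PCI: 765: device 0000:07:00.0 allocated 2 MSIX interrupts",
    "2017-01-13T21:12:08.393Z cpu2:65641)nvme:nvmeCoreLogError:370:command failed: 0x4305592b8f90.",
    "2017-01-13T21:12:08.393Z cpu2:65641)nvme:nvmeCoreLogError:370:command failed: 0x4305592b9110.",
    "2017-01-13T21:12:10.396Z cpu4:66024)nvme:NvmeCtrlr_Start:1647:Failed to allocate hardware IO queues.",
]

if __name__ == "__main__":
    for func, count in summarize_nvme_errors(sample).most_common():
        print(func, count)
```

To run it against a real log, pass an open file handle instead of `sample`, since the function only needs an iterable of lines.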
So I am kinda stumped now... any ideas?
And no, of course I don't have a recent backup, since I'm currently redesigning the whole environment...
And no, if those VMs are gone then shit, but I'll survive; it's just 'vCenter, PDC and all end-user VMs'. It just means a whole bunch of work and some complaints from the in-house (home) users.
Edit:
I have now booted Windows on that box; it sees the controller but doesn't see it as a drive either.
Not good :/
So I don't think it's a VMware issue at all, more like a hardware problem.
Will run the Intel drive tool once all the prereqs are installed...