LSI 9211-8i in IT mode: driver error and no drives available on CentOS

Ch33rios

Member
Nov 29, 2016
I seem to be having an intermittent problem with my recently purchased LSI 9211-8i, potentially related to an update, although I can't figure out which one. The card came pre-flashed to IT mode, and after I installed it in my ESXi 6.5 server and passed it through to my NAS install (I use RockStor, which is CentOS 6 IIRC), it worked fine.

I run updates at least monthly. The last round was about a month ago, but because there was no reboot afterwards, apparently nothing was affected at the time. Fast forward to now: after another weird issue forced me to reboot the entire ESXi host, the drives in my NAS show as 'detached' in the UI.

I looked at the lsblk output and, sure enough, my drives are not there. Then I went through dmesg and found the following:

Code:
mpt2sas_cm0: failure at drivers/scsi/mpt3sas/mpt3sas_scsih.c:8937/_scsih_probe()!
sd 3:0:0:0: Attached scsi generic sg0 type 0
sd 3:0:1:0: Attached scsi generic sg1 type 0
sd 3:0:2:0: Attached scsi generic sg2 type 0
sd 3:0:3:0: Attached scsi generic sg3 type 0
sd 3:0:4:0: Attached scsi generic sg4 type 0
sd 3:0:5:0: Attached scsi generic sg5 type 0
Oddly enough, it seems to detect that there are attached devices (I assume that's what the "sd 3:...: Attached..." lines indicate), but at the same time, because the SCSI driver fails outright for the passed-through 9211 card, it doesn't make them available.

I am going to try and do a snapshot reversal to see what I get but any thoughts on this error?

UPDATE: Snapshot reversal didn't help. This happened once before and it ended up fixing itself somehow. Is it something on the ESXi side?
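
One way to test the assumption that the "sd 3:..." devices really sit behind the 9211 is to ask sysfs which low-level driver owns SCSI host 3. This is a minimal sketch, assuming the usual /sys/class/scsi_host/hostN/proc_name attribute is present in the guest; the function name scsi_host_driver and the SYSFS_ROOT override are made up for illustration.

```shell
#!/bin/sh
# Print the name of the low-level driver that owns SCSI host number $1.
# SYSFS_ROOT is a hypothetical override (useful for dry runs); it defaults to /sys.
scsi_host_driver() {
    cat "${SYSFS_ROOT:-/sys}/class/scsi_host/host$1/proc_name"
}
```

Calling scsi_host_driver 3 inside the VM and getting something other than an mpt2sas/mpt3sas name would suggest those six devices belong to a different controller rather than to the passed-through HBA.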
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
With an error like that, my money would currently be on a hardware failure. Assuming you've pasted straight out of dmesg, the drives re-attach straight after the _scsih_probe function fails.

What kernel are you currently running? Looking at the function (which in the latest code as of writing starts at line 10574 rather than 8937), the relevant block that produces your message is this bit, starting at line 10760:

Code:
	/* event thread */
	snprintf(ioc->firmware_event_name, sizeof(ioc->firmware_event_name),
	    "fw_event_%s%d", ioc->driver_name, ioc->id);
	ioc->firmware_event_thread = alloc_ordered_workqueue(
	    ioc->firmware_event_name, 0);
	if (!ioc->firmware_event_thread) {
		pr_err(MPT3SAS_FMT "failure at %s:%d/%s()!\n",
		    ioc->name, __FILE__, __LINE__, __func__);
		rv = -ENODEV;
		goto out_thread_fail;
	}

	ioc->is_driver_loading = 1;
	if ((mpt3sas_base_attach(ioc))) {
		pr_err(MPT3SAS_FMT "failure at %s:%d/%s()!\n",
		    ioc->name, __FILE__, __LINE__, __func__);
		rv = -ENODEV;
		goto out_attach_fail;
	}

	if (ioc->is_warpdrive) {
		if (ioc->mfg_pg10_hide_flag ==  MFG_PAGE10_EXPOSE_ALL_DISKS)
			ioc->hide_drives = 0;
		else if (ioc->mfg_pg10_hide_flag ==  MFG_PAGE10_HIDE_ALL_DISKS)
			ioc->hide_drives = 1;
		else {
			if (mpt3sas_get_num_volumes(ioc))
				ioc->hide_drives = 1;
			else
				ioc->hide_drives = 0;
		}
	} else
		ioc->hide_drives = 0;

	rv = scsi_add_host(shost, &pdev->dev);
	if (rv) {
		pr_err(MPT3SAS_FMT "failure at %s:%d/%s()!\n",
		    ioc->name, __FILE__, __LINE__, __func__);
		goto out_add_shost_fail;
	}
I'm no kernel or C expert, but were there any messages preceding your error above? There are three possible failures in this block that can produce that message (the workqueue allocation, mpt3sas_base_attach() and scsi_add_host()), and the "failure at..." strings here should all be preceded by further details of the error.

Edit: found a couple of bug reports reporting similar issues under various ESX + passthrough combinations. The workaround there was to add the following as a boot parameter:
mpt3sas.msix_disable=1
e.g. [Solution] No disks detected after 6.2 beta upgrade (ESXi passthrough issue)
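
To look for those preceding details, something like the following could pull the lines just before the probe failure out of the kernel log. It is only a sketch: probe_failure_context is a hypothetical helper name, the default of 20 lines is arbitrary, and it relies on the -B option that GNU grep provides.

```shell
#!/bin/sh
# Read a kernel log on stdin and print every _scsih_probe "failure at" line
# together with the N lines that precede it (N = $1, default 20).
probe_failure_context() {
    grep -B "${1:-20}" -E 'failure at .*_scsih_probe'
}
```

On the affected VM this would be used as dmesg | probe_failure_context 30, so that any mpt2sas/mpt3sas messages logged right before the failure show up alongside it.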
 

Ch33rios

I'll check tonight for other errors, but when you say add it as a boot parameter, do you mean in the VM advanced settings?

One other note: I tried attaching the card to a Windows Server 2016 install and it had a similar problem, where it could see the card but not "activate" it.
 

EffrafaxOfWug

From what I've read, it should be added as a (GRUB?) boot parameter in the OS that you're passing the HBA through to. I'm not familiar with how CentOS does it, but you can test it for a single boot by editing the entry in whatever bootloader you're using and adding mpt3sas.msix_disable=1 to the kernel command line. E.g. on my Debian install I'd change the following line:
Code:
linux   /vmlinuz-4.9.0-6-amd64 root=/dev/mapper/vg_root-lv_root ro quiet
...like so:
Code:
linux   /vmlinuz-4.9.0-6-amd64 root=/dev/mapper/vg_root-lv_root ro quiet mpt3sas.msix_disable=1
To make it permanent you'd edit /etc/default/grub to read:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet mpt3sas.msix_disable=1"
...and then run update-grub to regenerate grub.cfg. If you want to double-check what parameters the kernel was booted with, cat /proc/cmdline.

I assume CentOS has something similar...?
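
For what it's worth, here is a rough sketch of the equivalent on a GRUB2-based CentOS install. It assumes the stock layout, where /etc/default/grub carries a GRUB_CMDLINE_LINUX="..." line instead of Debian's GRUB_CMDLINE_LINUX_DEFAULT; add_kernel_arg is a made-up helper name, and sed -i assumes GNU sed.

```shell
#!/bin/sh
# Append kernel argument $2 to the GRUB_CMDLINE_LINUX="..." line in file $1,
# unless that line already contains it (so repeated runs change nothing).
add_kernel_arg() {
    file=$1
    arg=$2
    if grep '^GRUB_CMDLINE_LINUX=' "$file" | grep -qF -- "$arg"; then
        return 0
    fi
    # Insert the argument just before the closing quote of the line.
    sed -i "s|^\(GRUB_CMDLINE_LINUX=\"[^\"]*\)\"|\1 $arg\"|" "$file"
}
```

As root this would be add_kernel_arg /etc/default/grub mpt3sas.msix_disable=1, followed by grub2-mkconfig -o /boot/grub2/grub.cfg on a BIOS-booted system (the grub.cfg path differs under UEFI) and a reboot. Back up the file first, and check the result with cat /proc/cmdline as above.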
 

Ch33rios

I tried updating the GRUB2 boot config and no go.

Here's more detailed output from dmesg:

Code:
[    1.247397] sd 3:0:5:0: [sdf] Attached SCSI disk                                                                                                                      
[    1.247415] sd 3:0:1:0: [sdb] Attached SCSI disk                                                                                                                      
[    1.284196] usb 2-2: New USB device found, idVendor=0e0f, idProduct=0002                                                                                              
[    1.284197] usb 2-2: New USB device strings: Mfr=0, Product=1, SerialNumber=0                                                                                         
[    1.284198] usb 2-2: Product: VMware Virtual USB Hub                                                                                                                  
[    1.293661] hub 2-2:1.0: USB hub found                                                                                                                                
[    1.298715] hub 2-2:1.0: 7 ports detected                                                                                                                             
[    1.369927] random: crng init done                                                                                                                                    
[    1.384318] tsc: Refined TSC clocksource calibration: 3407.997 MHz                                                                                                    
[    1.384323] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fd171fc9, max_idle_ns: 440795303639 ns                                                        
[    1.384331] clocksource: Switched to clocksource tsc                                                                                                                  
[    1.398192] ata7: SATA link down (SStatus 0 SControl 300)                                                                                                             
[    1.398225] ata8: SATA link down (SStatus 0 SControl 300)                                                                                                             
[    1.398263] ata9: SATA link down (SStatus 0 SControl 300)                                                                                                             
[    1.398284] ata4: SATA link down (SStatus 0 SControl 300)                                                                                                             
[    1.398310] ata14: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.398344] ata12: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.398349] ata17: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.398353] ata10: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.398357] ata3: SATA link down (SStatus 0 SControl 300)                                                                                                             
[    1.398362] ata15: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.398368] ata11: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.398374] ata13: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.398380] ata16: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.398386] ata6: SATA link down (SStatus 0 SControl 300)                                                                                                             
[    1.398391] ata5: SATA link down (SStatus 0 SControl 300)                                                                                                             
[    1.405762] ata21: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.405795] ata20: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.405834] ata18: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.405839] ata19: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.413738] ata25: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.413765] ata24: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.413786] ata22: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.413812] ata23: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.413816] ata26: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.421826] ata32: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.421836] ata28: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.421841] ata30: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.421846] ata31: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.421850] ata27: SATA link down (SStatus 0 SControl 300)                                                                                                            
[    1.421858] ata29: SATA link down (SStatus 0 SControl 300)                                                                                                            
[   16.509668] mpt2sas_cm0: failure at drivers/scsi/mpt3sas/mpt3sas_scsih.c:8937/_scsih_probe()!                                                                         
[   16.537234] raid6: sse2x1   gen() 12574 MB/s                                                                                                                          
[   16.554232] raid6: sse2x1   xor()  9501 MB/s                                                                                                                          
[   16.571234] raid6: sse2x2   gen() 15597 MB/s                                                                                                                          
[   16.588233] raid6: sse2x2   xor() 10789 MB/s                                                                                                                          
[   16.605231] raid6: sse2x4   gen() 17792 MB/s                                                                                                                          
[   16.622233] raid6: sse2x4   xor() 12289 MB/s                                                                                                                          
[   16.639233] raid6: avx2x1   gen() 25214 MB/s                                                                                                                          
[   16.656232] raid6: avx2x1   xor() 18134 MB/s                                                                                                                          
[   16.673230] raid6: avx2x2   gen() 28851 MB/s                                                                                                                          
[   16.690231] raid6: avx2x2   xor() 19783 MB/s                                                                                                                          
[   16.707231] raid6: avx2x4   gen() 31480 MB/s                                                                                                                          
[   16.724231] raid6: avx2x4   xor() 23357 MB/s                                                                                                                          
[   16.724232] raid6: using algorithm avx2x4 gen() 31480 MB/s                                                                                                            
[   16.724232] raid6: .... xor() 23357 MB/s, rmw enabled                                                                                                                 
[   16.724233] raid6: using avx2x2 recovery algorithm
I should also have noted earlier that I have an ICY-DOCK MB326SP-B rack for the 6 SSDs, which are connected to the 9211-8i. I'm wondering whether something is off with the rack, so I'll have to test it on a separate machine.
 

Ch33rios

Now I can't even get ESXi to toggle passthrough mode off. I think I'm going to try moving the card to a different PCIe slot and see what happens.
 

Ch33rios

Full power down. Moved the 9211 card from the top PCIe slot down one to the second slot and re-seated the cables. Powered it back up, and after several minutes the drives were reported successfully (why the heck does it take so long to get this thing to boot?).

drives_detected_cold_boot.png

I let ESXi boot up and looked in the storage devices, and they all show up.

drives_esxi.PNG

So far so good. I'm calling it a night for now but tomorrow I'll give it a whirl with re-enabling passthrough and attaching it to my NAS VM.
 

Ch33rios

Welp, everything is back to operating as expected. I re-enabled passthrough, configured my CentOS VM and voilà, there they be! The real test will be powering it down and back up to see what happens. It's the powering off that has caused some weird instability in the past.

Thanks for the input, @EffrafaxOfWug!