LSI9211-8i on Ubuntu 15.10 timeouts

sullivan

New Member
Mar 27, 2016
24
16
3
I believe this problem is likely an ESXi bug with message-signaled interrupts (MSI) that was exposed by a change to the Linux x86 interrupt handling code after kernel 4.1.

There is a thread discussing the issue on the linux-scsi mailing list:

http://www.spinics.net/lists/linux-scsi/msg95506.html

If you follow the links in that message, there is a description (quoted below). The claim is that it will be fixed in ESXi 6.0p3/5.5p8 mid-year.

The workaround that is working for me now is to add "mpt2sas.msix_disable=1" to the kernel command line. Note that for 4.4 and later kernels, I think you will need to use "mpt3sas.msix_disable=1" instead, as the mpt2sas and mpt3sas drivers are merged at that point. Actually, you could add both and nothing bad will happen.
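To confirm the option really reached the kernel, something like the following could be used. This is only a minimal sketch: the function name msix_opts is made up for illustration, and it assumes the running kernel's command line is readable from /proc/cmdline.

```shell
#!/bin/sh
# Print every mpt2sas/mpt3sas msix_disable option found in a kernel
# command line. With no argument, the running kernel's command line
# (/proc/cmdline) is checked. Prints nothing if the option is absent.
# msix_opts is a hypothetical helper name, not part of any package.
msix_opts() (
    set -f   # no glob expansion while the command line is split into words
    for word in ${1-$(cat /proc/cmdline 2>/dev/null)}; do
        case $word in
            mpt2sas.msix_disable=*|mpt3sas.msix_disable=*) echo "$word" ;;
        esac
    done
)

msix_opts
```

An empty result after a reboot would suggest the option was mistyped (for example with an underscore instead of the dot) or never made it into the boot entry.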

Note that there is some odd behavior to this bug. The problem seems to be related to how the interrupt logic is reinitialized (or not reinitialized) on VM reboots. So if the previous boot (with a different kernel or different command-line options) left the HW in a "good" state, you might not see the issue on the next boot, but subsequent boots will then fail.
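Since the result can differ from one boot to the next, it may help to check after each boot which interrupt mode the driver actually ended up with. A rough sketch, assuming the driver's interrupt lines in /proc/interrupts contain "mpt2sas" or "mpt3sas" and that MSI/MSI-X vectors are labelled "PCI-MSI" there (mpt_irq_mode is a made-up name):

```shell
#!/bin/sh
# Classify the interrupt mode of the mpt2sas/mpt3sas driver from text in
# /proc/interrupts format read on stdin.
# Prints "msi" (MSI or MSI-X vectors), "legacy" (IO-APIC line) or "none".
# Assumption: MSI/MSI-X vectors show up with the label "PCI-MSI".
mpt_irq_mode() {
    awk '
        /mpt[23]sas/ {
            found = 1
            if ($0 ~ /PCI-MSI/) msi = 1
        }
        END {
            if (!found)   print "none"
            else if (msi) print "msi"
            else          print "legacy"
        }'
}

if [ -r /proc/interrupts ]; then
    mpt_irq_mode < /proc/interrupts
fi
```

With msix_disable=1 in effect I would expect "legacy" here, but treat that as an assumption to verify on your own system.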

For what it's worth, I am seeing this on a Fedora install. I use a custom configured kernel with the drivers linked directly in, so it's not a systemd/module issue. I am using P19 LSI firmware on LSI 9211 cards.

List: linux-kernel
Subject: Re: VMware PCI passthrough regression
From: Thomas Gleixner <tglx () linutronix ! de>
Date: 2016-01-14 21:15:59


Jason,

On Thu, 14 Jan 2016, Jason Taylor wrote:

> I've run into a regression using PCI passthrough with the 4.4
> kernel.

Actually that is a 4.2 kernel according to the dmesg in the bug tracker.

> Attempting to passthrough an LSI card with ESXi version 6. Seeing
> timeouts and oops in the log and the disks do not show up.

The timeouts are probably related to missing irq delivery, so it might be
related to the big overhaul of the x86 irq subsystem.

The oops is a genuine driver bug probably unearthed by the irq issue. That
should be reported separately to the megasas folks.

> I performed a bisect which tracked down the issue to the commit below.
>
> More details are available in a bug report I filed with Ubuntu:
> Bug #1528849 “PCI passthrough of LSI card fails” : Bugs : linux package : Ubuntu
>
> commit 52f518a3a7c2f80551a38d38be28bc9f335e713c
> x86/MSI: Use hierarchical irqdomains to manage MSI interrupts

I have no idea how that breaks the vmware passthrough. Can you please verify

- whether that kernel works on the real hardware with that LSI card

- whether that kernel works in a KVM guest with that card passed through

If one of those things break, we can certainly help to analyse that. If not,
then you need to talk to vmware.
 

canta

Well-Known Member
Nov 26, 2014
1,028
198
63
39
That is the workaround: disabling MSI-X.
mpt2sas and mpt3sas are mostly identical; there is almost no difference between 2 and 3 in the source.
The compiled driver will load the newer mptsas 2.5 code when detected.

MSI-X means message-signaled (pseudo/emulated) PC interrupts, which support more interrupts than the limited number of hardware interrupt lines.

I am still doubtful that MSI-X and passthrough (ESXi 6) are the real problem.
ESXi could be applying workarounds in the background, and since it is not open source we cannot tell.
*This is the reason I jumped to Proxmox* last year. There I can debug and find out what is not correct.

If you look at PCI loading in the Linux kernel, it is loaded first, before the driver modules compiled into the kernel.

I would still bet that systemd messes something up in the sequence, or that an update to mpt3sas has a glitch.
The easy test is to downgrade the kernel to 3.1 on a system that has this issue and uses systemd.
If it works on every soft boot or hard boot, I will admit that I am wrong :D

BTW, good workaround as a temporary solution.
 

whitey

Moderator
Jun 30, 2014
2,770
865
113
37
mpt2sas.msix_disable=1
THAT...FREAKING...WORKED on Ubuntu LTS 14.04.4, which was broken before. Let me try mpt3sas.msix_disable=1 on Ubuntu Xenial 16.04 w/ the newer 4.4 kernel and see what she does there, but you are THE MAN in my book.

Ubuntu 16.04 works as well w/ mpt2sas.msix_disable=1, although my boot device seems to have switched to /dev/sdc, with the 2 pass-thru HBA devices as sda/sdb...weird disk enumeration but I'll take it woohoooo

NICE @sullivan!
 

whitey

Moderator
Jun 30, 2014
2,770
865
113
37
Update/more info

mpt2sas.msix_disable=1 (for 4.3 or older kernels)
or
mpt3sas.msix_disable=1 (for 4.4 or newer kernels)

These are added to the kernel boot line parameters; adding them here as well for posterity's sake.

Under Ubuntu or CentOS using grub2, to make it stick, edit /etc/default/grub to contain the following:

GRUB_CMDLINE_LINUX_DEFAULT="mpt2sas.msix_disable=1"
(mpt3sas.msix_disable=1 for 4.4 or newer kernels)

Save the file, then update the grub2 files:

Ubuntu - update-grub
CentOS - grub2-mkconfig -o /boot/grub2/grub.cfg (BIOS based machines)
CentOS - grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg (UEFI based machines; the directory is EFI/redhat on RHEL)
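Note that setting GRUB_CMDLINE_LINUX_DEFAULT to just the one option drops anything that was already there (such as "quiet splash"). A small sketch of appending the option instead is below. The helper name add_grub_opt is made up, it assumes the value is wrapped in double quotes, and the demo runs against a scratch file rather than the real /etc/default/grub.

```shell
#!/bin/sh
# Append an option to GRUB_CMDLINE_LINUX_DEFAULT in a grub defaults file,
# keeping whatever options are already set. Does nothing if the option is
# already present. add_grub_opt is a hypothetical helper name.
# Assumption: the option contains no "/" or "&" (they are special to sed).
add_grub_opt() {
    file=$1
    opt=$2
    if ! grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file"; then
        # No such line yet: add one.
        printf 'GRUB_CMDLINE_LINUX_DEFAULT="%s"\n' "$opt" >> "$file"
    elif ! grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$opt"; then
        # Line exists without the option: append inside the closing quote.
        sed "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"/GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $opt\"/" \
            "$file" > "$file.new" && mv "$file.new" "$file"
    fi
}

# Demo on a scratch copy, not on the real /etc/default/grub.
demo=$(mktemp)
printf 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n' > "$demo"
add_grub_opt "$demo" mpt3sas.msix_disable=1
grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$demo"
rm -f "$demo"
```

On the real file you would run it as root against /etc/default/grub and then regenerate grub.cfg with the commands above.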
 

DigitalDJ

New Member
Apr 23, 2016
3
0
1
40
Has anyone got this working with the final release of 16.04 and ESX6.0U1?

I tried the MSIX fixes in this thread (I also had the problem on 15.10) but I still can't seem to get it to work :(

Usually, I would boot a 15.04 Server Live CD then reboot to get things to work...but now that isn't even working...

I have three 9311-8i cards passed through.

dmesg as follows:

Code:
[  33.131900] mpt3sas 0000:0c:00.0: enabling device (0000 -> 0002)
[  33.134841] mpt3sas_cm1: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (32929796 kB)
[  33.190699] mpt3sas1: IO-APIC enabled: IRQ 19
[  33.191092] mpt3sas_cm1: iomem(0x00000000fe140000), mapped(0xffffc90003200000), size(65536)
[  33.191437] mpt3sas_cm1: ioport(0x0000000000006000), size(256)
[  33.295004] mpt3sas_cm1: Allocated physical memory: size(10931 kB)
[  33.295378] mpt3sas_cm1: Current Controller Queue Depth(3960),Max Controller Queue Depth(4096)
[  33.295785] mpt3sas_cm1: Scatter Gather Elements per IO(128)
[  63.340341] mpt3sas_cm1: _base_event_notification: timeout
[  63.340817] mf:

[  63.341182] 07000000
[  63.341575] 00000000
[  63.341579] 00000000
[  63.341975] 00000000
[  63.341978] 00000000
[  63.342347] 0f2f7fff
[  63.342350] ffffff7c
[  63.342769] ffffffff
[  63.342772]

[  63.343542] ffffffff
[  63.343545] 00000000
[  63.343943] 00000000

[  63.344814] mpt3sas_cm1: sending diag reset !!
[  64.478172] mpt3sas_cm1: diag reset: SUCCESS
[  64.917328] mpt3sas_cm1: failure at /build/linux-Ay7j_C/linux-4.4.0/drivers/scsi/mpt3sas/mpt3sas_scsih.c:8800/_scsih_probe()!
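For anyone comparing their own logs against the one above, a quick filter for the two tell-tale lines could look like this. It is only a sketch: the patterns are taken from this particular log, other driver versions may word the messages differently, and mpt_failures is a made-up name.

```shell
#!/bin/sh
# Keep only the mpt2sas/mpt3sas lines that match the failure signature
# seen in the log above: the event notification timeout and the failed
# _scsih_probe(). Reads kernel log text on stdin.
mpt_failures() {
    grep -E 'mpt[23]sas.*(_base_event_notification: timeout|failure at .*_scsih_probe)'
}

# dmesg may need root on some systems; no match is not treated as an error.
dmesg 2>/dev/null | mpt_failures || true
```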
 

ArmoredDragon

New Member
Apr 24, 2016
1
0
1
38
For what it's worth, I've been able to make this fail reliably if the VM's firmware is configured as EFI, whereas it works reliably with BIOS. YMMV of course, and there's no apparent reason why.
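If you are not sure which firmware mode a guest actually booted in, the usual check from inside Linux is whether /sys/firmware/efi exists. A minimal sketch (boot_mode is a made-up name, and the optional argument exists only so the check can be pointed at a different sysfs root):

```shell
#!/bin/sh
# Report whether the running system was booted via EFI or legacy BIOS.
# Linux exposes /sys/firmware/efi only when it was booted through EFI.
# $1 (optional): alternative sysfs root, defaults to /sys.
boot_mode() {
    sysroot=${1-/sys}
    if [ -d "$sysroot/firmware/efi" ]; then
        echo EFI
    else
        echo BIOS
    fi
}

boot_mode
```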

Also another FWIW, my ESXi host is at build 3620759 (15-mar-2016 build, which is latest as of this post) and I'm using a new copy (as in, the ISO has all updates til this date) of Ubuntu 16.04 server.

Does anybody know if VMware has posted a KB for this? If so, could you link it? It's possible they'll post a link to a VIB patch without having to wait for a new build.
 

DigitalDJ

New Member
Apr 23, 2016
3
0
1
40
Amazing! It worked. EFI booting definitely doesn't work (at least with multiple controllers), even with the msix_disable flag.

Converting my install back to BIOS boot seems to work with Linux Kernel 4.4, but you still need to use the msix_disable flags for the kernel, otherwise it won't work.

You don't need to set pciPassthruX.msiEnabled = "FALSE" in the VM's VMX file.

First, I converted my EFI Ubuntu install to BIOS:
- Expand the drive by 1MB in VMware, so that a bios_grub partition can be added to the end of the drive (since it's a GPT disk, not MBR)
- Rescan the host bus in the guest VM to pick up the new drive size
Code:
echo "- - -" > /sys/class/scsi_host/host0/scan
- Use gdisk to add a 1MB partition (partition type 0xEF02)
Code:
gdisk /dev/sda
n
<enter> (default partition number)
<enter> (default first sector)
<enter> (default last sector -- end of drive)
ef02 (Hex Code for bios_grub)
w
- Install BIOS grub2. This removes grub-efi-amd64, so you will need to reinstall it if you want to switch back to EFI
Code:
apt-get install grub-pc
- Edit /etc/default/grub and add/modify GRUB_CMDLINE_LINUX_DEFAULT to disable msix in the mpt driver (use mpt2sas, if you're using the older driver - 4.4 comes with mpt3sas):
Code:
GRUB_CMDLINE_LINUX_DEFAULT="mpt3sas.msix_disable=1"
- Update grub.cfg
Code:
update-grub
- Do a FULL shutdown of the VM, to clean up any residual initialization of the controller, and boot the VM back up again :)
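The interactive gdisk dialogue in the steps above could probably also be scripted with sgdisk (from the same gdisk package). The sketch below only prints the commands unless RUN=1 is set. The function name add_bios_grub and the RUN switch are made up for illustration, and /dev/sda is simply the disk used in the steps above. Double-check against your own layout before running anything that rewrites a partition table.

```shell
#!/bin/sh
# Print (or, with RUN=1 in the environment, execute) the sgdisk commands
# that add a bios_grub partition in the free space at the end of a GPT disk.
# add_bios_grub and RUN are hypothetical names used for this sketch.
add_bios_grub() {
    disk=$1
    run=echo
    if [ "${RUN-0}" = 1 ]; then
        run=
    fi
    # Move the backup GPT structures to the new end of the enlarged disk.
    $run sgdisk -e "$disk"
    # New partition (next free number) filling the largest free block,
    # with type code EF02 (BIOS boot partition).
    $run sgdisk -n 0:0:0 -t 0:ef02 "$disk"
}

# Dry run: only shows what would be executed.
add_bios_grub /dev/sda
```

Keeping the default as a dry run is deliberate, since a wrong disk name here is not something you can undo.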

If converting to BIOS doesn't work, unfortunately the best solution is to revert to kernel 4.1 (the last known working version).

You can manually install the debs on 16.04 from here:

linux-headers-4.1.22-040122-generic_4.1.22-040122.201604200432_amd64.deb
linux-headers-4.1.22-040122_4.1.22-040122.201604200432_all.deb
linux-image-4.1.22-040122-generic_4.1.22-040122.201604200432_amd64.deb

Just run dpkg -i <deb> on each one, in the order listed above.
 

sullivan

New Member
Mar 27, 2016
24
16
3
It looks like this may finally be fixed in the latest ESXi 6.0.0 patch (build 4192238) released on 8/5.

I am able to boot 10 out of 10 times with MSIX re-enabled and I am not seeing any hangs.