having some instability during warm boot...

BLinux · Dec 16, 2017

i have a (new to me) system i've been building over the last few months, based on Supermicro 846 chassis. here are the specs:

Supermicro X9DR3-LN4F+ (rev 1.10 w/ latest BIOS 3.2)
dual E5-2660 (v1)
256GB (16x16GB)
peripherals are:
- USB 3.0 PCI-E card
- Fusion IO IoDrive2 1.2TB
- Mellanox ConnectX-2 dual port NIC
- Adaptec 71605H HBA

The 846A backplane has 4 ports connected to the 71605H, and 2 ports connected to the onboard SAS of the motherboard. There are 2x Intel S3710 SSDs in a mirror boot drive setup. The backplane has all 8TB WD Reds.

The OS is CentOS 7.4. During previous testing I never noticed any problems, but today I decided to update the system from CentOS 7.3 to CentOS 7.4. I finally got to power it up again to work on it and I noticed some really strange behavior after the upgrade to CentOS 7.4 and *ONLY ON WARM REBOOT*. There's no problem from a cold start; I've tested this several times, and cold booting is actually my current workaround.

So, when I do a warm reboot, the system seems to get very erratic, mostly during the initramfs:

1) sometimes it gets stuck for a very long time at "reached basic target". when I reboot (hard reset button) and disable the "rhgb" and "quiet" options in grub to see what's going on, systemd appears to be stuck somewhere in "initializing component devices"; in particular, there's a service related to "iWarp/Infiniband/RDMA" that keeps retrying and failing.

2) sometimes, while it is struggling with the iWarp/Infiniband/RDMA thing, it does continue to boot, but then gets stuck when it tries to mount the root mirror off the S3710 SSDs.

3) sometimes, after a long delay, it gets past mounting the root mirror and starts to load the rest of the OS and services, but then I notice 2 of my 8TB HDDs have disappeared and ZFS is complaining that my pool is degraded (raidz2). this part is particularly concerning because if this problem is going to randomly drop HDDs, i have a risk of data loss. a cold reboot completely resolves this and the system comes up just fine.

4) sometimes it gets past mounting the root mirror, but then randomly has I/O errors on several of the hard drives, which is probably another way #3 is manifesting. eventually it's broken enough that it drops into emergency mode.

Again, all these problems go away if I power off and then power on - system boots right up and quite quickly at that. After a proper cold boot up, I can run heavy, heavy I/O benchmarks on the ZFS pool without any instability. Once I warm reboot, all the problems come back.
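Since symptom #3 only shows up as a degraded pool after the fact, a small health check run after each reboot could flag it early. The following is a minimal sketch, not something from the original setup: it assumes the `zpool` command is on the PATH and relies on `zpool list -H -o name,health`, which prints one tab-separated line per pool without headers. The function names are hypothetical.

```python
import subprocess


def unhealthy_pools(zpool_list_output):
    """Map pool name -> health for every pool that is not ONLINE.

    Expects the output of `zpool list -H -o name,health`
    (one "name<TAB>health" line per pool, no header).
    """
    bad = {}
    for line in zpool_list_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, health = fields[0], fields[1]
        if health != "ONLINE":
            bad[name] = health
    return bad


def check_pools():
    """Run zpool and return the unhealthy pools (requires ZFS to be installed)."""
    out = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,health"],
        check=True, capture_output=True, text=True,
    ).stdout
    return unhealthy_pools(out)
```

Calling `check_pools()` after a warm reboot and after a cold boot would show whether the pool state really differs between the two, without having to read through the full `zpool status` output each time.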

So, any thoughts or suggestions as to what to look into?

This "iWarp/Infiniband/RDMA" service is something i've never seen before as I don't really use that. I wonder if it is new from the CentOS 7.3->7.4 update? I'm also not sure exactly what service that is and if I can disable it from starting and wondering if that would help?

I have a strange feeling this all has something to do with the Mellanox card. Sometimes when symptoms #3 and #4 happen, the system starts booting more normally (with the eventual fail) after the mlnx.d script runs. That and the iWarp/RDMA stuff have me wondering if the Mellanox card gets into some weird state during a warm reboot that clears when it is cold booted. One thing I want to try later is removing the Mellanox card to see if the problem goes away...

Rand__ · Dec 17, 2017

So why wait for removing the card? Should be quicker to test than the other options

Bios update? AIC firmware update? Different OS again (usb stick)?

RedX1 · Dec 17, 2017

Hi

You may find this useful background. I have the same problem with a SM X9DRi-F and Win10
Supermicro X9DRL-IF & Windows 10 Issue

Good luck

RedX1

BLinux · Dec 18, 2017

Rand__ said:
So why wait for removing the card? Should be quicker to test than the other options
Bios update? AIC firmware update? Different OS again (usb stick)?

well, removed the NIC today and the problem still persists.

not only that, found something strange. the "iWarp/Infiniband/RDMA" messages apparently come from rdma.service, which is disabled and yet it is starting during boot? my guess is that there's something else that depends on it and systemd is trying to be smart and starting up dependencies?
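In systemd, "disabled" only means the unit's own install symlinks are removed; another unit that wants or requires it, or something that triggers it, can still start it. A minimal sketch for checking that guess is below. It assumes `systemctl show` is available and reads the reverse-dependency properties `WantedBy`, `RequiredBy`, `BoundBy`, and `TriggeredBy`; the function names are hypothetical.

```python
import subprocess

# systemd unit properties that list who pulls a unit in.
REVERSE_PROPS = ("WantedBy", "RequiredBy", "BoundBy", "TriggeredBy")


def parse_reverse_deps(show_output):
    """Parse `systemctl show` KEY=VALUE output into {property: [units]}.

    Properties with an empty value are left out.
    """
    deps = {}
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key in REVERSE_PROPS and value.strip():
            deps[key] = value.split()
    return deps


def reverse_deps(unit):
    """Ask systemd which units reference `unit` (needs a running systemd)."""
    cmd = ["systemctl", "show", unit]
    for prop in REVERSE_PROPS:
        cmd += ["-p", prop]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return parse_reverse_deps(out)
```

For example, `reverse_deps("rdma.service")` would list any units that pull the service in. Interactively, `systemctl list-dependencies --reverse rdma.service` shows the same relationship as a tree.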

i'm not sure at this point if it has anything to do with the Mellanox card. but the Mellanox driver install sure did leave a lot of stuff around that isn't clear to me - i think i'm going to uninstall it and just use the stock CentOS 7 kernel drivers for it.

RedX1 said:
Hi

You may find this useful background. I have the same problem with a SM X9DRi-F and Win10
Supermicro X9DRL-IF & Windows 10 Issue

Good luck

RedX1

thanks @RedX1 but that almost sounds like the opposite to my issue... no cold boot, but ok warm boot on reset. my issue is cold boot is fine, but warm reboot fails and requires hard reset.

anyway, the problem still persists and I can't figure out what it is... the last couple of times it got stuck after detecting the boot drive SSD mirror pair... it says it detected change in capacity of 0 to X bytes. do the Intel S3710 SSDs have known issues like this?

MiniKnight · Dec 18, 2017

I've seen SSDs not initialize fast enough as ZFS mirrors even on warm boots. Happens on 1 of 5 Proxmox installs I do nowadays.

The fix is adding a boot delay in GRUB.
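A minimal sketch of what this could look like, assuming the suggestion refers to the kernel's `rootdelay=` command-line parameter (a delay in seconds before the kernel tries to mount the root filesystem). The 10-second value and the helper name `add_kernel_param` are illustrative assumptions, and whether a given initramfs honors the parameter depends on the distribution. The helper only rewrites the text of `/etc/default/grub`; it does not touch the file itself.

```python
import re


def add_kernel_param(grub_defaults, param, variable="GRUB_CMDLINE_LINUX"):
    """Return the text of /etc/default/grub with `param` added to `variable`.

    An existing setting with the same key (the part before '=') is replaced,
    so calling this twice does not duplicate the parameter.
    """
    key = param.split("=", 1)[0]
    # Match a line like: GRUB_CMDLINE_LINUX="... existing params ..."
    pattern = re.compile(
        r'^(%s=")(.*)(")[ \t]*$' % re.escape(variable), re.MULTILINE
    )

    def rewrite(match):
        kept = [w for w in match.group(2).split()
                if w.split("=", 1)[0] != key]
        return match.group(1) + " ".join(kept + [param]) + match.group(3)

    new_text, count = pattern.subn(rewrite, grub_defaults)
    if count == 0:
        raise ValueError("%s not found" % variable)
    return new_text
```

After writing the result back, the GRUB configuration still has to be regenerated, for example with `grub2-mkconfig -o /boot/grub2/grub.cfg` on a BIOS-booted CentOS 7 system or `update-grub` on Debian-based systems such as Proxmox.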

BLinux · Dec 18, 2017

MiniKnight said:
I've seen SSDs not initialize fast enough as ZFS mirrors even on warm boots. Happens on 1 of 5 Proxmox installs I do nowadays.

The fix is adding a boot delay in GRUB.

what do you mean by boot delay in grub? wouldn't it need to be done in the initramfs?

BLinux · Dec 20, 2017

so the problem continues...

yanked the 2x S3710 SSDs and put in a pair of S3500 I have and re-installed the OS without updates or adding 3rd party/vendor drivers. that made no difference and it is still having the same problem. so, that rules out the mellanox driver setup, or fusion-io setup, or the latest OS update causing the issue.

I turned on OptionROM on all the PCI-E slots and noticed the Adaptec 71605H complaining about temperatures "violating threshold", but I don't think there is a real temperature problem: the rest of the server's temp sensors are all reading < 40C. What I have noticed is that every time I see those "violating threshold" messages during POST, the boot process gets stuck as described above. I don't think it's an overheating issue, because all i have to do is a power reset, with no cool down period, and on the next boot the messages are gone and everything boots up to the OS fast and easy. I'm sort of thinking the 71605H is getting into some weird stuck state during a warm reboot, and systemd is stuck waiting for components to initialize because the pm80xx driver for the 71605H is stalling. That would explain why sometimes it eventually boots up (the driver times out and finally continues) but is missing a few drives. I also see error messages in dmesg from the pm80xx driver.
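One way to make the pm80xx observation concrete is to save the kernel log from a cold boot and from a warm reboot and compare what the driver reports in each. This is only a sketch under assumptions: the logs are captured as text (for example with `dmesg > warm.txt`), driver messages contain the literal string `pm80xx`, and the function names are hypothetical.

```python
def driver_lines(kernel_log, driver="pm80xx"):
    """Return the kernel log lines that mention `driver`."""
    return [line for line in kernel_log.splitlines() if driver in line]


def strip_timestamp(line):
    """Drop a leading "[  12.345678] " prefix so messages compare across boots."""
    if line.startswith("[") and "]" in line:
        return line.split("]", 1)[1].strip()
    return line.strip()


def compare_boots(cold_log, warm_log, driver="pm80xx"):
    """Count driver messages per boot and list those seen only after the warm boot."""
    cold = [strip_timestamp(l) for l in driver_lines(cold_log, driver)]
    warm = [strip_timestamp(l) for l in driver_lines(warm_log, driver)]
    return {
        "cold": len(cold),
        "warm": len(warm),
        "warm_only": sorted(set(warm) - set(cold)),
    }
```

The `warm_only` list is the interesting part: messages that never appear after a cold boot are the ones most likely tied to the stuck state. If the journal is persistent, `journalctl -k -b -1` can retrieve the kernel log of the previous boot instead of saving `dmesg` by hand.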

so, now my suspect is the 71605H card. going to try swapping that card out for some LSI cards I have around, but have to see if I can find the right cabling. will probably also check the heatsink of the 71605H and see if it is properly attached, just in case.

anyone else ever use a 71605H with CentOS or any flavor of Linux?

BLinux · Dec 21, 2017

ok. just as follow up... swapped the 71605H with a pair of H330 (LSI SAS3008) spares I had since they had the right 8643 connector and the problem is completely gone. not sure what's up with the 71605H, but at least I've pinpointed the problem down to one component.

Rand__ · Dec 21, 2017
