I have a (new to me) system I've been building over the last few months, based on a Supermicro 846 chassis. Here are the specs:
Supermicro X9DR3-LN4F+ (rev 1.10 w/ latest BIOS 3.2)
dual E5-2660 (v1)
256GB (16x16GB)
peripherals are:
- USB 3.0 PCI-E card
- Fusion IO IoDrive2 1.2TB
- Mellanox ConnectX-2 dual port NIC
- Adaptec 71605H HBA
The 846A backplane has 4 ports connected to the 71605H, and 2 ports connected to the onboard SAS of the motherboard. There are 2x Intel S3710 SSDs in a mirror boot drive setup. The drives on the backplane are all 8TB WD Reds.
The OS is CentOS 7.4. During previous testing I never noticed any problems, but today I decided to update the system from CentOS 7.3 to CentOS 7.4. I finally got to power it up again to work on it, and I noticed some really strange behavior after the upgrade to CentOS 7.4, and *ONLY ON WARM REBOOT*. There's no problem from a cold start; I've tested this several times, and cold booting is actually my current workaround.
So, when I do a warm reboot, the system seems to get very erratic, mostly during the initramfs:
1) Sometimes it gets stuck for a very long time at "reached basic target". When I reboot (hard reset button) and disable the "rhgb" and "quiet" options in grub to see what's going on, systemd appears to be stuck somewhere in "initializing component devices", and in particular there's a service related to "iWarp/Infiniband/RDMA" that keeps retrying and failing.
2) Sometimes, while it is struggling with the iWarp/Infiniband/RDMA thing, it does continue to boot, but then gets stuck when it tries to mount the root mirror off the S3710 SSDs.
3) Sometimes, after a long delay, it gets past mounting the root mirror and starts to load the rest of the OS and services, but then I notice 2 of my 8TB HDDs have disappeared and ZFS is complaining that my pool (raidz2) is degraded. This part is particularly concerning, because if this problem is going to randomly drop HDDs, I have a risk of data loss. A cold reboot completely resolves this and the system comes up just fine.
4) Sometimes it gets past mounting the root mirror, but then randomly has I/O errors on several of the hard drives, which is probably another way #3 is manifesting. Eventually it's broken enough that it drops into emergency mode.
Again, all these problems go away if I power off and then power on - system boots right up and quite quickly at that. After a proper cold boot up, I can run heavy, heavy I/O benchmarks on the ZFS pool without any instability. Once I warm reboot, all the problems come back.
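Since cold boots are clean and warm reboots are not, one way to narrow this down might be to save the kernel log from each kind of boot and look only at the messages that appear in the warm one. The following is a minimal Python sketch of that comparison, not a tested diagnostic tool. The function names and the file names `cold.txt` and `warm.txt` are made up for illustration, and it assumes the logs were saved with something like `dmesg > cold.txt`, so each line starts with the usual bracketed timestamp.

```python
import re

# Matches the "[   12.345678] " timestamp prefix that dmesg puts on each line.
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def normalize(line):
    """Strip the dmesg timestamp so identical messages compare equal."""
    return _TIMESTAMP.sub("", line.rstrip("\n"))


def only_in_warm(cold_log, warm_log):
    """Return messages found in warm_log but not in cold_log.

    Both arguments are the full log text. Order of first appearance in the
    warm log is kept, and repeated messages are reported once.
    """
    cold = {normalize(line) for line in cold_log.splitlines()}
    seen = set()
    result = []
    for line in warm_log.splitlines():
        msg = normalize(line)
        if msg and msg not in cold and msg not in seen:
            seen.add(msg)
            result.append(msg)
    return result


def only_in_warm_files(cold_path, warm_path):
    """Same comparison, reading the two saved logs from disk."""
    with open(cold_path) as f:
        cold_log = f.read()
    with open(warm_path) as f:
        warm_log = f.read()
    return only_in_warm(cold_log, warm_log)

# Usage (hypothetical file names):
#   for msg in only_in_warm_files("cold.txt", "warm.txt"):
#       print(msg)
```

Messages that embed device ordering or addresses will still show up as differences even when nothing is wrong, so the output is a starting point for reading, not a verdict.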
So, any thoughts or suggestions as to what to look into?
This "iWarp/Infiniband/RDMA" service is something I've never seen before, as I don't really use that. I wonder if it is new from the CentOS 7.3->7.4 update? I'm also not sure exactly which service it is, whether I can disable it from starting, or whether that would help.
I have a strange feeling this all has something to do with the Mellanox card. Sometimes when symptoms #3 and #4 happen, the system starts booting more normally (with the eventual failure) after the mlnx.d script runs. That and the iWarp/RDMA stuff has me wondering if the Mellanox card gets into some weird state during a warm reboot that clears when the system is cold booted. One thing I want to try later is removing the Mellanox card to see if the problem goes away...
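To work out which unit that message actually comes from, one option is to list the installed systemd unit files and filter for names that look RDMA-related. Below is a rough Python sketch of that idea. The keyword list is only a guess based on the text of the boot message, the function names are made up for illustration, and whether disabling any unit it finds actually helps is untested.

```python
import subprocess

# Guessed from the wording of the boot message; adjust as needed.
KEYWORDS = ("rdma", "iwarp", "infiniband")


def parse_unit_files(text):
    """Parse `systemctl list-unit-files --no-legend` output into (unit, state) pairs."""
    units = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            units.append((parts[0], parts[1]))
    return units


def rdma_related(units, keywords=KEYWORDS):
    """Keep only units whose name contains one of the keywords."""
    return [(name, state) for name, state in units
            if any(k in name.lower() for k in keywords)]


def list_unit_files():
    """Ask systemd for all installed unit files (needs a systemd host)."""
    return subprocess.check_output(
        ["systemctl", "list-unit-files", "--no-legend", "--no-pager"],
        universal_newlines=True,
    )

# Usage on the affected machine:
#   for name, state in rdma_related(parse_unit_files(list_unit_files())):
#       print(name, state)
```

A name match only shows that a unit exists and whether it is enabled; matching it to the failing boot message still means checking the journal for that unit.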