Hi everyone, first time poster here, please be gentle ;-)
I have a setup made of one Dell R620 running Rocky Linux 9 and one KTN-STL3 full of 3TB SAS HGST drives (formatted with 512-byte sectors) and it is causing me great headaches.
I am trying to run OpenZFS (2.1.6) and while it works perfectly well under normal use, as soon as I enable ZED (for monitoring) or request fault/locate LED status, I start getting mpt2sas errors and IO errors, up to the point where I completely lose the enclosure under /sys/class/enclosure. This happens within about 10 seconds and a full log of such an event is attached.
What I have tried so far:
- I initially tested with a Dell PERC H810 reflashed to IT mode. I then switched to an LSI 9207-8e (running P20), without luck.
- As this HBA is not officially supported on RHEL9, I tested on RHEL8 and RHEL7, without luck (exact same errors on all three OSes).
- I changed the single SAS cable that I am using and tried controller B instead of controller A, without success
- I actually have a second enclosure and have the problem on both.
- I have run read/write badblocks tests on all the disks for several days.
- I have tested my memory with Memtest86+ for several days as well.
I first had the impression that ZED was the culprit and decided to stop using it. Later, I discovered that running the command "zpool status -c locate_led" (as a non-root user) would also crash this setup, which makes it too fragile for production. I therefore need to keep looking for a solution (or call it a day and give up on this setup).
This "zpool status" command mostly iterates over the block devices and issues "cat /sys/.../locate". The strange part is that so far I have been unable to crash the system by manually hitting the /sys interface with "cat" or "echo" commands on the fault/locate LEDs. Only zed/zpool seem to be able to trigger that behavior.
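One untested way to narrow this down would be to read the LED attributes for all slots at once rather than one at a time with "cat". The sketch below is only that, a sketch: it assumes the relevant difference between the manual tests and zed/zpool is that many reads arrive concurrently, which has not been confirmed here. The function names (read_attr, read_leds) and the worker count are made up for illustration, and the glob pattern assumes the usual /sys/class/enclosure/<enclosure>/<slot>/locate layout.

```python
import glob
import os
from concurrent.futures import ThreadPoolExecutor


def read_attr(path):
    """Read one sysfs attribute and return (path, value or error text)."""
    try:
        with open(path) as f:
            return path, f.read().strip()
    except OSError as exc:
        # Keep going on errors so one failing slot does not hide the others.
        return path, f"error: {exc.strerror}"


def read_leds(root="/sys/class/enclosure", attrs=("locate", "fault"), workers=16):
    """Read every <root>/<enclosure>/<slot>/<attr> file using a thread pool.

    Assumption: issuing the reads concurrently is closer to what zed/zpool
    do than a sequential shell loop. Returns a dict mapping path to value.
    """
    paths = []
    for attr in attrs:
        paths.extend(glob.glob(os.path.join(root, "*", "*", attr)))
    paths.sort()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(read_attr, paths))
```

Calling read_leds() in a loop while watching dmesg for mpt2sas messages would show whether concurrent reads alone are enough to upset the enclosure. The code only reads the attributes and never writes to them.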
Also, I have been unable to crash a small pool with up to 4 VDEVs; only larger pools seem to have this weakness.
I think my question is: did any of you experience strange SAS errors while running LED-related commands on this enclosure, and did you ever find a solution?
Thanks for any hint/input. Cheers. Patrick!