LSI 9400-16i Power-on or device reset occurred

Darkwing314

New Member
Oct 30, 2025
You are absolutely sure you cannot reproduce the issue on the rear backplane?

I suspect it has something to do with the load the command puts on the PM8044 expander chip. The PM8043 in the rear backplane is probably more resilient, as it does not have to manage as many drives. I had a look at what changed between TrueNAS 25.04 and 25.10, and it’s a lot. It might be stricter timings for SAS/SATA commands that work on the smaller backplane but fail on the bigger one. If you checked with other Linux distributions with kernel 6.12 or higher and could not reproduce the error, chances are the 6.12.33 LTS kernel in TrueNAS 25.10 has a regression that was fixed in newer LTS versions. I’d suggest you open a bug ticket and stay with 25.04 for the time being.

Honestly, I am a bit bummed they do not update minor LTS kernel releases regularly. The 6.12.33 release is a year old, and 6.12.80 is current.
I've just spent a couple of hours doing some more testing. Now that I have a way to reproduce the issue on demand, testing is much easier. Until now I had not gone back into TrueNAS to test with the command (smartctl -l scterc /dev/sdX), so I've done that now.
  • TrueNAS 25.10 has the problem every 90 minutes and also when I trigger it with the command.
  • TrueNAS 25.04 DOES also have the problem when I trigger it manually. It doesn't trigger the problem by itself every 90 minutes though, so I conclude that 25.10 changed something in its drive health checking that now triggers the issue.
  • I tried to reproduce it on the rear backplane as well but was unable to (the production install connected to the rear backplane is on 25.10 as well). I have both an 8x RAIDZ2 pool and a single-drive pool there, and I can't reproduce it on either of them. I can run the problematic command all day while they're under load and it never times out.
  • I also reproduced it at a much lower write speed. Until now, my testing involved copying from my production pool to the test pool (across VMs, but on the same Proxmox host) at about 400MB/s, which I thought might be contributing. So I put some test files on an old, slow USB drive and copied from that to the test pool at about 20MB/s. Judging by the iostats for the pool, the data is being cached somewhere (most likely RAM) and then written out quickly once enough has accumulated. If I query while it's caching (which is most of the time for this slow transfer), it does not time out. But if I spam the command and catch it while it's writing, it times out just like when it's under constant load.
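For anyone wanting to repeat this kind of on-demand test, here is a minimal sketch of a helper that times one query and flags it when it is slow. The function name time_query and the threshold argument are made up for this illustration; only the smartctl -l scterc command comes from the post above.

```shell
#!/usr/bin/env bash
# Sketch only: time one command and flag it when it takes too long.
# time_query and its threshold argument are hypothetical helpers,
# not part of smartmontools or TrueNAS.

# time_query THRESHOLD_SECONDS COMMAND [ARGS...]
# Prints "ok ..." or "SLOW ..." with the elapsed whole seconds.
# Returns 1 when the command took THRESHOLD_SECONDS or longer.
time_query() {
    local threshold=$1
    shift
    local start end elapsed
    start=$(date +%s)
    # The command's own exit status is ignored; only its duration matters.
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s)
    elapsed=$(( end - start ))
    if (( elapsed >= threshold )); then
        echo "SLOW ${elapsed}s: $*"
        return 1
    fi
    echo "ok ${elapsed}s: $*"
}
```

A possible way to use it, as root and while the pool is writing: `while time_query 5 smartctl -l scterc /dev/sdX; do sleep 1; done`. The loop stops at the first query that takes 5 seconds or more. The 5-second threshold is an arbitrary choice for the sketch.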
I can only conclude at this point that this issue must still be backplane related in some way due to the fact that it only occurs with the front backplane. That leaves me with a couple options:
  1. Stick with 25.04 indefinitely and rely on the fact that it isn't issuing this problematic command. I might get stuck here forever though which isn't great.
  2. Figure out how to fix this problem in 25.10 (and future versions). I suppose I could open a bug and see if it goes anywhere, but I suspect they're going to just say it's the backplane and leave it at that.
  3. Try again to get in contact with the backplane manufacturer and get some support on it (maybe they have a new firmware?).
  4. Scrap this chassis and pick up a proper Supermicro one (CSE-847). To be honest, I'm tempted to pick up a Supermicro backplane and try that first. Maybe I can even retrofit it into this chassis, although it seems unlikely to line up with this chassis's drive cages and LED indicators. Maybe I'm better off just biting the bullet and getting the proper chassis that I can be confident will work?
Not a huge fan of options (1) and (2), especially because I had actual data loss on my test pool at one point through this so I don't want to risk it ever being an issue.
 

TrevorH

Active Member
Oct 25, 2024
There is some stuff in `man smartctl` that might possibly help. It says you can use it to change the timeout values.
scterc[,READTIME,WRITETIME][,p|reset] - [ATA only] prints values and descriptions of the SCT Error Recovery Control settings. These are equivalent to TLER (as used by Western Digital), CCTL (as used by Samsung and Hitachi/HGST) and ERC (as used by Seagate). READTIME and WRITETIME arguments (deciseconds) set the specified values. Values of 0 disable the feature, other values less than 65 are probably not supported. For RAID configurations, this is typically set to 70,70 deciseconds.
If 'scterc,READTIME,WRITETIME,p' is specified, these time values will be persistent over a power-on reset. If 'scterc,p' is specified, the persistent over power-on values are printed. If 'scterc,reset' is specified, all SCT timer settings are restored to the manufacturer's default value. The ',p' and ',reset' options require the device to support ATA ACS-4 or higher.
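As a rough illustration of the syntax quoted above, the sketch below applies the 70,70 deciseconds setting to a list of drives. The set_erc function and the DRY_RUN switch are hypothetical names for this example; the device paths are placeholders. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch only: apply SCT ERC limits to several drives.
# set_erc and DRY_RUN are hypothetical names for this example.

# set_erc READ_DS WRITE_DS DEVICE...
# READ_DS and WRITE_DS are in deciseconds (70 = 7.0 seconds).
# DRY_RUN defaults to 1, so commands are printed, not executed.
# Set DRY_RUN=0 to really call smartctl (needs root).
set_erc() {
    local read_ds=$1 write_ds=$2 dev
    shift 2
    for dev in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "smartctl -l scterc,${read_ds},${write_ds} ${dev}"
        else
            smartctl -l "scterc,${read_ds},${write_ds}" "$dev"
        fi
    done
}
```

For example, `set_erc 70 70 /dev/sdb /dev/sdc` prints the two smartctl commands, and `DRY_RUN=0 set_erc 70 70 /dev/sdb /dev/sdc` runs them. Without the ,p suffix the values are not persistent over a power-on reset, as the man page excerpt notes.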
 

Darkwing314

New Member
Oct 30, 2025
Yes, I've actually done that to set the timeouts on my WD drives, which had them disabled by default. But these are different timeouts we're talking about. This command queries and sets the TLER timeout in the drives, which for me is 10 seconds on my EXOS drives (they came like that) and 7 seconds on the WD drives. TLER, as I understand it, is how long the drive keeps trying to solve a write issue itself before it defers back to the host to sort it out. When it's disabled, the drive will just keep trying forever, which causes problems with RAID cards and ZFS. We want it enabled so that the drive fails the write and lets ZFS decide what to do about it.

The problem I'm having, though, is that simply querying the value of this timeout (or querying the temperature history of the drive) while a drive is writing triggers a communications timeout of some sort at the device level, which then causes a complete device reset. This timeout happens after 30 seconds (not the 7 or 10 I have configured for TLER, which applies to individual writes), so I don't think it's related to the TLER timeouts themselves.
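One thing that may be worth checking here is an assumption, not something confirmed in this thread. The Linux SCSI layer has its own per-device command timeout, exposed in sysfs at /sys/block/sdX/device/timeout, and its usual default is 30 seconds. The sketch below lists that value for each sd device. The function name show_scsi_timeouts is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only: list the kernel SCSI command timeout for each sd device.
# show_scsi_timeouts is a hypothetical helper name.

# show_scsi_timeouts [SYSFS_BLOCK_DIR]
# Prints "<device> <timeout in seconds>" per line.
# SYSFS_BLOCK_DIR defaults to /sys/block; the argument exists so the
# function can be pointed at a test directory.
show_scsi_timeouts() {
    local root=${1:-/sys/block}
    local f dev
    for f in "$root"/sd*/device/timeout; do
        # Skip the unexpanded glob when no sd devices are present.
        [ -r "$f" ] || continue
        dev=${f#"$root"/}
        dev=${dev%%/*}
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}
```

Running `show_scsi_timeouts` with no arguments reads the live values. If they show 30, that would at least be consistent with the 30-second figure above, though it does not prove this is the timeout being hit.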
 

Darkwing314

New Member
Oct 30, 2025
Good news everyone! I think my problem is solved!

Gooxi (manufacturer of the RMC4136-670-HSE server chassis I have) finally got back to me a few days ago and after a bit of back and forth they provided a newer backplane firmware and IT SOLVED MY PROBLEMS ENTIRELY!!!! You have no idea how excited I am that this is actually finally fixed. I have spent so many hours trying to solve this.

Maybe it's too soon to say for sure, but I've been transferring continuously for the last 24 hr on TrueNAS 25.10 and I cannot trigger the problem manually and it does not happen every 90 minutes either.

The backplane firmware version is shown in the various logs I've been sharing all along, or you can print it with lsscsi -g as shown below. It appears on the enclosu line (B137 is the firmware revision in the listing below). Mine was originally B134. Upgrading to B137 solved the issue. Gooxi wasn't able to provide a changelog or anything like that, and warned me that I proceed with the upgrade at my own risk and that they can't help me if it makes things worse. They also provided the same firmware for the rear backplane, so I updated that as well to be safe.

Code:
admin@ubuntu:~$ lsscsi -g
[0:0:0:0]    disk    ATA      WDC WD140EDGZ-11 0A85  /dev/sdb   /dev/sg1
[0:0:1:0]    disk    ATA      WDC WD140EDGZ-11 0A85  /dev/sdc   /dev/sg2
[0:0:2:0]    disk    ATA      WDC WD140EDGZ-11 0A85  /dev/sdd   /dev/sg3
[0:0:3:0]    disk    ATA      WDC WD140EDGZ-11 0A85  /dev/sde   /dev/sg4
[0:0:4:0]    disk    ATA      WDC WD140EDFZ-11 0A81  /dev/sdf   /dev/sg5
[0:0:5:0]    disk    ATA      WDC WD140EDGZ-11 0A85  /dev/sdg   /dev/sg6
[0:0:6:0]    disk    ATA      WDC WD140EDGZ-11 0A85  /dev/sdh   /dev/sg7
[0:0:7:0]    disk    ATA      ST14000NM005G-2K CAP3  /dev/sdi   /dev/sg8
[0:0:8:0]    enclosu GOOXIBM  4U24SXP 36Sx12G  B137  -          /dev/sg9
[1:0:0:0]    disk    QEMU     QEMU HARDDISK    2.5+  /dev/sda   /dev/sg0
[3:0:0:0]    cd/dvd  QEMU     QEMU DVD-ROM     2.5+  /dev/sr0   /dev/sg10
If you find yourself with this same chassis and the same problem, hit me up and I can share the firmware files they sent me. Or email their support@ address a few times and hopefully they will get back to you.
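To check the backplane firmware revision without reading the listing by eye, something like the following sketch could work. It assumes the lsscsi -g column layout shown in the listing above, where the revision is the third field from the end on the enclosu line. The function name enclosure_fw is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only: pull the firmware revision out of `lsscsi -g` output.
# enclosure_fw is a hypothetical helper name.

# enclosure_fw
# Reads `lsscsi -g` output on stdin and prints the revision column of
# every enclosure line. With -g the last two fields are the block
# device (or "-") and the sg device, so the revision is field NF-2.
# Without -g there is no sg column and this offset would be wrong.
enclosure_fw() {
    awk '$2 == "enclosu" { print $(NF-2) }'
}
```

Usage would be `lsscsi -g | enclosure_fw`, which for the listing above should print B137. Counting fields from the end sidesteps the spaces inside the model column.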
 

nredar

New Member
Apr 13, 2026
I had the very same issue with the same backplane expander just yesterday. Upgrading to B137 also resolved my issues; I came from B015.

My case is an X-Case RM424 Pro-EX V2 (a rebranded RMC4124-670-HSE) with the same expander.

Many thanks @Darkwing314!
 

BLinux

cat lover server enthusiast
Jul 7, 2016
artofserver.com
@Darkwing314 Congrats on figuring out a way to trigger the problem and finding a resolution. When I said:

1) Could be a firmware or interoperability bug between the SAS expander chip and the HBA. Your post seems to suggest it's a PMC SAS expander chip, so Broadcom never tested their card with that.
I was hoping it wouldn't be the case because getting a resolution via firmware update can be difficult sometimes. It's really great that Gooxi worked with you to get you the firmware update that fixed this issue.

I'm going to have to tuck away this little bit of knowledge in case any of my customers run into this. Thank you for sharing the resolution with us here. :)
 

Darkwing314

New Member
Oct 30, 2025
Ha, yeah you and me both. When I emailed Gooxi I was not too optimistic that they would be able to or willing to help me. I was already measuring things and planning for how to retrofit two of the smaller rear backplanes into the front. I'm glad I made one last attempt to reach them because once they responded they were very willing to at least get me the latest firmware which fortunately already solved this problem. I imagine if the newer firmware had the same problem then I'd be out of luck.

In the end I'm super happy that this chassis worked out. 36 drive bays for $200 (CAD too! that's like $150 USD!) Way better than the $1,000 I would have spent getting a CSE-847 here from the US.
 

Minxster

New Member
May 15, 2026
Wow, if ever there was a reason to join a forum, it's because of your post :D

I've had this problem on and off for some time. I even replaced my HBA in the process of diagnosing it (I kept the old card :)). I found I could trigger the problem, or should I say make matters worse, by just probing the disks to read the temperatures at high frequency. Today was the day to start Googling again, and I found your post...

I'm trying to get the firmware, I've emailed support [at] gooxi.com and gooxi.us, so I'm really hoping I get a response soon.
 

Cortexian

New Member
May 24, 2026
I also just picked up a Gooxi 36-bay chassis and have been fighting to determine what the issue is with my 9400-16i! @Darkwing314 if you're able to share the B137 firmware and flashing instructions from Gooxi I'd appreciate it! I'm on B134 as well with lots of issues.
 

Cortexian

New Member
May 24, 2026
Flash successful and verified on B137! Let's see if this fixes my issues... Symptoms were almost identical so here's hoping!

Edit: Almost 48 hours with no error log spam or crashes!
 