Are HBA data cables extremely susceptible to EMI?


Levent

New Member
May 5, 2024
5
0
1
I have a Corsair 230T filled to the brim with 4 HDDs and 26 SSDs. Due to physical limitations, I had to "cable manage" the HBA data cables through every gap I could find.

I am running a 5700G, B550 PG Riptide, 128 GB RAM, a 9300-16i and a 9207-8i, along with the onboard SATA controller.

The drives are 26 DRAM-less Hikvision 1 TB SSDs and four 4 TB Reds.

The problem lies with the 26-disk RAIDZ2 array. Once every 3-4 days, the array starts to complain about write errors, read errors or checksum errors on one disk. I swap that disk for whatever offline replacement I have, resilver, and the problem goes away. A few days later, a different disk shows the same thing. I replace the complaining disk with the "broken" disk I pulled out the other day, resilver, and the problem goes away again.

This cycle has been repeating on and on for a month now. I am kind of frustrated.
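
To keep track of which disk complains in each cycle, the error counters can be pulled out of saved `zpool status` output. The following is a minimal sketch, not a definitive tool: it assumes the usual NAME / STATE / READ / WRITE / CKSUM column layout, and the pool and device names in `SAMPLE` are hypothetical.

```python
import re

# Matches an indented device row of `zpool status`:
# NAME  STATE  READ  WRITE  CKSUM
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\S+)\s+(\S+)\s+(\S+)")

def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with a non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        name, _state, *counters = m.groups()
        if name == "NAME":  # skip the column header row
            continue
        # Counters are kept as strings because zpool may abbreviate them.
        if any(c != "0" for c in counters):
            flagged[name] = tuple(counters)
    return flagged

# Hypothetical output, for illustration only.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

    NAME        STATE     READ WRITE CKSUM
    tank        ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        sda     ONLINE       0     0     0
        sdb     ONLINE       0     3     0
        sdc     ONLINE       0     0    12

errors: No known data errors
"""

if __name__ == "__main__":
    for dev, (r, w, c) in devices_with_errors(SAMPLE).items():
        print(f"{dev}: read={r} write={w} cksum={c}")
```

Logging this output with a timestamp after each incident gives a record that can later be compared against the cabling.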

I check the SMART data of the "broken" disks each time I pull them, and they are all fine. I even ran surface checks on them a few times and found nothing.

Motherboard, HBAs, Proxmox and TrueNAS Scale are all up to date. I run Scale virtualized in PVE and pass all storage-related PCIe devices through to Scale.

Any ideas? Pulling this thing apart is a massive hassle, so I wanted to get some thoughts before it comes to that.
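
Since the question is about the cables, one SMART value worth a look is attribute 199 (commonly named UDMA_CRC_Error_Count), which counts link-level CRC errors on SATA drives. The sketch below assumes the JSON layout produced by `smartctl -j -A /dev/sdX`. The sample dictionary is hypothetical, and it is an assumption that these particular SSDs report attribute 199 at all.

```python
import json

CRC_ATTRIBUTE_ID = 199  # commonly reported as UDMA_CRC_Error_Count

def crc_error_count(smart_json):
    """Return the raw value of attribute 199, or None if the drive omits it."""
    table = smart_json.get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        if attr.get("id") == CRC_ATTRIBUTE_ID:
            return attr.get("raw", {}).get("value")
    return None

# Hypothetical, trimmed-down smartctl JSON for illustration only.
SAMPLE_JSON = json.loads("""
{
  "ata_smart_attributes": {
    "table": [
      {"id": 9,   "name": "Power_On_Hours",       "raw": {"value": 1234}},
      {"id": 199, "name": "UDMA_CRC_Error_Count", "raw": {"value": 7}}
    ]
  }
}
""")

if __name__ == "__main__":
    print("CRC errors:", crc_error_count(SAMPLE_JSON))
```

A count that keeps climbing on a drive between checks would point toward the link (cable, backplane or connector) rather than the flash itself.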
 

SlowmoDK

Active Member
Oct 4, 2023
197
121
43
I suspect the issue might be that 9300-16i... I've had similar problems that all went away when the, let's be honest, crappy 9300-16i was replaced.

The 9300-16i is just two 9300-8i slapped together on one PCB. Find a 9400-16i to replace it with (the Lenovo 430-16i seems to be the lowest priced).
 

Levent

New Member
May 5, 2024
5
0
1
Sadly I cannot order from the typical sources anymore (long story, government BS), and I literally cannot find any new or used HBAs. That is why I had to order these from China in the first place.

Now that I have taken another look at the 9300-16i, I wonder if not using the external 6-pin power connector is causing the issue. I just plugged it in; hopefully that makes the difference. If not, I am going to have to tear everything down and test the 9300-16i alone.
 

SlowmoDK

Active Member
Oct 4, 2023
197
121
43
It should be easy enough to source a 9400-16i from China if needed; plenty are listed on eBay.

The power connector made no difference here.
 

Levent

New Member
May 5, 2024
5
0
1
Sadly even that is out of the question for me nowadays. Any item priced above $5 requires me to drive to a customs office 2 hours each way and pay 80% import tax.

Sellers on AliExpress, for example, do not even ship to me anymore. The same goes for eBay.

As much as I hate to say it, I am stuck with what I've got.
 

Stephan

Well-Known Member
Apr 21, 2017
992
757
93
Germany
Curious what gov bs happened to you?

The cables are shielded well enough that EMI should never be a problem, even under adverse conditions, unless they are the cheapest of the cheap. That is rare, because nobody buys low-end SAS parts.

Are those controllers each cooled by at least an 80 mm, 800 rpm fan?

Is it always the same controller, or different random ports? Is the onboard controller also affected? If these are Chinese knock-offs, then the chips came from another board and were recycled. Results vary. I prefer OEM originals: the more OEM labels, and some brand name on the PCB, the merrier.

Marginal power supply?
 

Levent

New Member
May 5, 2024
5
0
1
I have an Arctic P12 zip-tied next to the HBAs, running at 100%. I also repasted the cards today; the attached screenshot shows their temperatures during a heavy I/O task.
[Attached image: 1727883287081.png]
As for the government BS, the gist is as follows. The big honcho decided to lower the value of goods that can be imported without involving customs. I used to be able to order goods up to $150 and pay tax during AliExpress checkout; now that limit is $5. For anything above that, I have to pay for more expensive shipping, involve a customs agent, OR drive 2 hours, spend 2-3 hours at multiple locations to pay upwards of 80% tax, then drive 2 hours back. Anyway, the more I talk, the more I risk jail time. (My username is a district in the most crowded province of this country, so that should give you a hint.) Most sellers (I am talking 99%) straight up stopped shipping my way because of it.

All of the problem devices are currently connected to the 9300-16i. I may have had issues with SSDs connected to the 9207 as well, but I might be wrong.

I have also been experiencing these SSDs "unplugging" themselves: they become unresponsive until I pull the sled/disk out and put it back in. I thought this was related, but I am starting to feel it is a separate problem.

I took apart the spare disk I have, and it has a Maxio MA1102 controller paired with Intel 144-layer QLC NAND (which was rather surprising, as these were the cheapest SSDs out there).

I am going to do some more testing with these changes. So far I have only had one disk "unplug" itself, which was fixed by reseating it as described above.

Power supply is a brand new RM850E.
 

Levent

New Member
May 5, 2024
5
0
1
Latest update:
  • I took meticulous notes of which disk is connected to which HBA. So far I have had ZFS errors on 1 disk on the 2308 (the same disk twice) and 1 disk on the 3008.
  • To reduce the number of things that might be going wrong, I disabled PBO2, C-states and Cool'n'Quiet, and set a static CPU frequency and voltage that I know is 100% stable (and tested again for 12 hours just to be safe).
  • I flashed the latest firmware available on Broadcom's website for both HBAs. I was previously running the unreleased/TrueNAS-released 3008 firmware (which does not exist on Broadcom's website). I also somehow had a BIOS newer than what was publicly available for the 2308, so I flashed that with the latest version on Broadcom's website as well.
  • I checked, double-checked and triple-checked all cables. I logged the locations of all disks and which enclosures they are connected to. So far there is no correlation.
  • Switched the legacy BIOSes off for both cards in the host BIOS and disabled CSM on the host.
  • Forced PCIe 3.0 in the host BIOS and set PCIe distribution to 8x8x4 (which should have made a difference due to the CPU and motherboard layout).
  • Tested RAM stability for 24 hours; it passed with flying colors.
I also forgot to mention that this server runs on an online UPS, so combined with the RM850E, power is really the least of my worries. I have yet to find a solution. I will update if I ever do.
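
The notes on which disk sits on which HBA and enclosure can be turned into a quick tally, so any correlation shows up as the error log grows. This is only a sketch: the serials, enclosure names and events below are made up for illustration, and `tally_errors` is a hypothetical helper, not part of any tool mentioned in this thread.

```python
from collections import Counter

def tally_errors(disk_map, error_events):
    """Count error events per HBA and per enclosure.

    disk_map:     {serial: (hba, enclosure)} taken from the cabling notes
    error_events: list of serials, one entry per ZFS error incident
    """
    by_hba = Counter()
    by_enclosure = Counter()
    for serial in error_events:
        hba, enclosure = disk_map[serial]
        by_hba[hba] += 1
        by_enclosure[enclosure] += 1
    return by_hba, by_enclosure

# Hypothetical cabling notes and incident log.
DISK_MAP = {
    "SSD-01": ("9300-16i", "cage-A"),
    "SSD-02": ("9300-16i", "cage-B"),
    "SSD-03": ("9207-8i", "cage-B"),
}
EVENTS = ["SSD-03", "SSD-03", "SSD-01"]

if __name__ == "__main__":
    by_hba, by_enclosure = tally_errors(DISK_MAP, EVENTS)
    print("per HBA:      ", dict(by_hba))
    print("per enclosure:", dict(by_enclosure))
```

With only a handful of incidents the counts prove little, but a lopsided tally after a few more weeks would narrow down where to look.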
 
Sep 30, 2024
48
8
8
How did you even manage to connect so many SSDs to the PSU (and the controllers)? That sounds like a nightmare unless you use a suitable server case with that many SFF bays.

It could still be a power supply issue. The PSU has plenty of power, but when so many disks are connected to a single rail, there could be interference. Maybe you can test with only 4 or so disks connected to see if the errors still occur.
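
A rough way to sanity-check the single-rail concern is to compare the combined peak draw of the drives against the rail's rated current. The arithmetic is sketched below; every number in the example is a placeholder assumption, not a measured or datasheet value for the Hikvision SSDs or the RM850E, so substitute real figures before drawing conclusions.

```python
def rail_load(n_drives, amps_per_drive, rail_limit_amps):
    """Return (total_amps, fraction_of_rail_limit) for drives sharing one rail."""
    total = n_drives * amps_per_drive
    return total, total / rail_limit_amps

if __name__ == "__main__":
    # Placeholder values: 26 SSDs, an assumed 1.0 A peak each on 5 V,
    # and an assumed 20 A limit on the 5 V rail.
    total, fraction = rail_load(26, 1.0, 20.0)
    print(f"total draw: {total:.1f} A, {fraction:.0%} of the rail limit")
```

If the result lands near or above 100% with real numbers, testing with fewer disks connected, as suggested above, becomes more informative.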

And use ECC RAM ...