Are HBA data cables extremely susceptible to EMI?


Levent

New Member
May 5, 2024
8
2
3
I have a Corsair 230T filled to the brim with 4x HDDs and 26x SSDs. Due to physical limitations I had to "cable manage" the HBA data cables into every gap I could find.

I am running a 5700G, a B550 PG Riptide, 128GB of RAM, a 9300-16i, a 9207-8i, and the onboard SATA controller.

I have 26 DRAM-less Hikvision 1TB SSDs and 4x 4TB WD Reds.

The problem lies with the 26-disk RAIDZ2 array. Once every 3-4 days, the array starts complaining about write, read, or checksum errors on a disk. I swap that disk with whatever offline spare I have, resilver, and the problem goes away. A few days later, a different disk shows the same thing. I replace the complaining disk with the "broken" disk I pulled out the other day, resilver, and the problem goes away.

This cycle has been repeating over and over for a month now. I am kind of frustrated.

I check the SMART data of the "broken" disks each time I pull them out and they are all fine. I even went as far as surface-checking them a few times, and there was nothing there.

The motherboard, HBAs, Proxmox, and TrueNAS Scale are all up to date. I run Scale virtualized in PVE and pass through all storage-related PCIe devices to Scale.

Any ideas? Pulling this thing apart is a massive hassle, so I wanted to get some thoughts before doing that.
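For context, the check I keep repeating by hand boils down to watching `zpool status` for new READ/WRITE/CKSUM counters. A rough sketch of how that could be polled automatically is below; the pool name "tank" and the interval are placeholders, not my actual setup.

```python
# Rough sketch of the manual check described above: poll `zpool status -x`
# and log anything that is not reported as healthy. Pool name and interval
# are placeholders; adjust for the actual setup.
import subprocess
import time

POOL = "tank"          # placeholder pool name
INTERVAL = 3600        # seconds between checks

while True:
    out = subprocess.run(
        ["zpool", "status", "-x", POOL],
        capture_output=True, text=True,
    ).stdout
    if "is healthy" not in out:
        # Log the full status so the failing vdev/disk and its
        # READ/WRITE/CKSUM counters are captured with a timestamp.
        print(time.strftime("%F %T"), out)
    time.sleep(INTERVAL)
```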
 

SlowmoDK

Active Member
Oct 4, 2023
211
130
43
I suspect the issue might be that 9300-16i... I've had similar problems that all went away when the, let's be honest, crappy 9300-16i was replaced.

The 9300-16i is just two 9300-8i controllers slapped together on one PCB. Find a 9400-16i to replace it with (the Lenovo 430-16i seems to be the lowest priced).
 
  • Like
Reactions: nexox

Levent

New Member
May 5, 2024
8
2
3
I suspect the issue might be that 9300-16i... I've had similar problems that all went away when the, let's be honest, crappy 9300-16i was replaced.

The 9300-16i is two controllers slapped together on one PCB. Find a 9400-16i to replace it with (the Lenovo 430-16i seems to be the lowest priced).
Sadly I cannot order from typical sources anymore (long story, government bs) and I literally cannot find any new or used HBAs. For that reason I had to order these from China in the first place.

Now that I've taken another look at the 9300-16i, I wonder if not using the external 6-pin power connector is causing the issue. I just plugged it in; maybe, hopefully, that makes the difference. If not, I am going to have to tear everything down and test the 9300-16i alone.
 

SlowmoDK

Active Member
Oct 4, 2023
211
130
43
It should be easy enough to source a 9400-16i from China if needed, plenty listed on ebay

Power pin made no difference here
 

Levent

New Member
May 5, 2024
8
2
3
It should be easy enough to source a 9400-16i from China if needed, plenty listed on ebay

Power pin made no difference here
Sadly even that is out of the question for me nowadays. Any item priced above $5 requires me to drive two hours each way to a customs office and pay an 80% import tax.

Sellers on AliExpress, for example, don't even ship to me anymore. Same goes for eBay.

As much as I hate to say it, I am stuck with what I got.
 

Stephan

Well-Known Member
Apr 21, 2017
1,046
807
113
Germany
Curious what gov bs happened to you?

The cables are shielded well enough that this should never be an EMI problem, even under adverse EMI conditions, unless they are the cheapest of the cheap. Which is rare, because nobody buys low-end SAS parts.

Are those controllers each cooled by at least an 80mm 800rpm fan?

Same controller or random different ports? Is the onboard controller also affected? If these are Chinese knock-offs, then the chips came from another board and were recycled; results vary. I always prefer OEM originals. The more OEM labels and brand names on the PCB, the merrier.

Marginal power supply?
 

Levent

New Member
May 5, 2024
8
2
3
Curious what gov bs happened to you?

The cables are shielded well enough that this should never be an EMI problem, even under adverse EMI conditions, unless they are the cheapest of the cheap. Which is rare, because nobody buys low-end SAS parts.

Are those controllers each cooled by at least an 80mm 800rpm fan?

Same controller or random different ports? Is the onboard controller also affected? If these are Chinese knock-offs, then the chips came from another board and were recycled; results vary. I always prefer OEM originals. The more OEM labels and brand names on the PCB, the merrier.

Marginal power supply?
I have an Arctic P12 zip-tied next to the HBAs and it's running at 100%. I also repasted the cards today; temperatures during a heavy IO task are as follows.
[Screenshot: HBA temperatures under heavy IO]
As far as the government bs goes, the gist of it is this: the big honcho decided to restrict the value of goods that can be imported without involving customs. I used to be able to order goods up to $150 and pay the tax during AliExpress checkout; now that limit is $5. Anything above that and I have to pay for more expensive shipping and involve a customs agent, OR drive 2 hours, spend 2-3 hours at multiple locations to pay upwards of 80% tax, then drive 2 hours back. Anyway, the more I talk the more I risk jail time. (My username is a district in the most crowded province in this country, so that should give you a hint.) Most (I am talking 99%) sellers straight up stopped shipping my way because of it.

All of the problem devices are currently connected to the 9300-16i. I may have had issues with SSDs connected to the 9207-8i as well, but I might be wrong.

I have also been experiencing these SSDs "unplugging" themselves and basically becoming unresponsive unless I pull the sled/disk out and put it back in. I thought this was related to my problem, but I am starting to feel like it isn't.

I took apart the spare disk I have and it has a Maxio MA1102 controller paired with Intel 144-layer QLC NAND (which was rather surprising, as these were the cheapest SSDs out there).

I am going to do some more testing with these changes. So far I have only had one disk "unplug" itself, which was fixed by reseating it as described above.

The power supply is a brand new RM850e.
 

Levent

New Member
May 5, 2024
8
2
3
Latest update:
  • I took meticulous notes of which disk is connected to which HBA. So far I have had ZFS errors on one disk on the 2308 (the same disk twice) and one disk on the 3008.
  • In order to reduce the number of things that might be going wrong, I disabled PBO2, C-states, and Cool'n'Quiet, and set a static CPU frequency and voltage that I know is 100% stable (and tested it again for 12 hours just to be safe).
  • I flashed the latest firmware available on Broadcom's website to both HBAs. I was previously running an unreleased, TrueNAS-distributed firmware on the 3008 (which does not exist on Broadcom's website). I also somehow had a BIOS newer than what was publicly available for the 2308, so I flashed it with the latest version from Broadcom's website as well.
  • I checked, double-checked, and triple-checked all cables. I logged the locations of all disks and which enclosures they are connected to (a rough sketch of how this can be automated is below). So far there is no correlation.
  • Switched off the legacy BIOS for both cards and disabled CSM in the host BIOS.
  • Forced PCIe Gen3 in the host BIOS and set the PCIe lane distribution to x8/x8/x4 (which should have made a difference due to the CPU and motherboard layout).
  • Tested RAM stability for 24 hours, it passed with flying colors.
I also forgot to mention that this server is running on an online UPS, so power is really the least of my worries, combined with the RM850e. I have yet to find a solution. I will update if I ever do.
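For reference, the disk-to-controller bookkeeping can be pulled straight from the kernel on Linux instead of kept by hand. A small sketch, assuming the usual /dev/disk/by-path symlinks exist (their names encode the controller's PCI address and the SAS phy/slot):

```python
# Sketch: list which block device hangs off which controller/port, using the
# /dev/disk/by-path symlinks the kernel creates. Purely illustrative.
import os

BY_PATH = "/dev/disk/by-path"

for name in sorted(os.listdir(BY_PATH)):
    if "-part" in name:          # skip partition entries, keep whole disks
        continue
    target = os.path.realpath(os.path.join(BY_PATH, name))
    # e.g. /dev/sdq  <-  pci-0000:2b:00.0-sas-phy5-lun-0
    print(f"{target:<12} <- {name}")
```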
 
Sep 30, 2024
134
19
18
How did you even manage to connect so many SSDs to the PSU (and the controllers)? That sounds like a nightmare unless you use a suitable server case with so many SFF bays.

It could still be a power supply issue. The PSU has plenty of power, but when so many disks are connected to a single rail, there could be interference. Maybe you can do a test with only 4 disks or so connected to see if there are still errors.

And use ECC RAM ...
 

Levent

New Member
May 5, 2024
8
2
3
How did you even manage to connect so many SSDs to the PSU (and the controllers)? That sounds like a nightmare unless you use a suitable server case with so many SFF bays.

It could still be a power supply issue. The PSU has plenty of power, but when so many disks are connected to a single rail, there could be interference. Maybe you can do a test with only 4 disks or so connected to see if there are still errors.

And use ECC RAM ...
I have 3x 5.25" bay adapters that take six 2.5" disks each. I also designed and printed my own 2.5" disk enclosure for the remaining 4 disks. I really meant it when I said I had it filled to the brim.


Swap faulty disks to different cables, see if error moves with it.
I am not sure if it was this or the sudden drop in temperatures (they finally dropped by 15°C compared to when I first started having issues). However, I haven't had errors for the last 10 days.
 

Stephan

Well-Known Member
Apr 21, 2017
1,046
807
113
Germany
Maybe you bought marginal twinax cables from a Chinese trash heap that somebody resold instead of sending to recycling. Marginal but working with some disks' PHYs, marginal but producing errors with others. Or oxidized or dirty contacts, and by replugging like a madman you scratched that off, so the gold plating is making good contact again.
 
  • Like
Reactions: nexox
Sep 30, 2024
134
19
18
I have 3x 5.25" bay adapters that take six 2.5" disks each. I also designed and printed my own 2.5" disk enclosure for the remaining 4 disks. I really meant it when I said I had it filled to the brim.
Twice I put 8x 3.5" and 2x 2.5" disks into HP Z820s, so I have an idea of what "full" can mean :) There's even some room left, but with all the cables it gets tedious.

If your adapters supply power to all the drives, at least you don't have so many power cables ...

A drop of 15°C is quite a lot. Perhaps you need better airflow in the case?
 

Levent

New Member
May 5, 2024
8
2
3
Alright, hopefully the final update.

I think I have the problem triaged and, *crossing my fingers*, solved. TL;DR: it was the RAM/Infinity Fabric/memory controller.

After disassembling the server in question, I took apart my own gaming PC and tested the components in different configurations. The problem only kept popping back up when the server's RAM was paired with the server's CPU at 3600 MT/s. I dialed it back to 3200 MT/s and so far it has been error-free for the last 10 days (which had never happened before, and the date coincides with the final day of testing when I decreased the RAM speed).

Despite using many different pieces of testing software and different operating systems, and even running different VMs without a single crash, I can't believe it turned out to be RAM related.

Thanks everyone!
 

UhClem

just another Bozo on the bus
Jun 26, 2012
483
294
63
NH, USA
Congratulations! I do believe (and hope) you've solved it.
Tested RAM stability for 24 hours, it passed with flying colors.
Was this some flavor of memtest?
I've seen quite a few troubleshooting situations where a "memory tester" indicated "no problem" but Prime95/mprime torture test would provoke errors. The former tests the RAM; the latter expands the scope of testing to the memory subsystem, including the combined (side-)effect of a stressed CPU (as with the RAIDZ2 and checksums of ZFS).
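To make that distinction concrete, here is a toy sketch of the second kind of test. It is not a substitute for Prime95: several processes keep rewriting and re-hashing large buffers, so the data has to move through the memory controller under concurrent CPU load, and any silent corruption shows up as a hash mismatch. Buffer size and pass count are arbitrary placeholders.

```python
# Toy memory-subsystem stress sketch: workers repeatedly rewrite and re-hash
# large buffers under concurrent CPU load and flag any checksum mismatch.
import hashlib
import multiprocessing as mp
import os

BUF_MB = 512        # per-worker buffer size in MiB; adjust to fit your RAM
PASSES = 100        # rewrite/verify cycles per worker

def worker(worker_id: int) -> None:
    pattern = os.urandom(1024 * 1024)          # 1 MiB of random data
    buf = bytearray(pattern * BUF_MB)          # BUF_MB MiB working buffer
    expected = hashlib.sha256(buf).hexdigest()
    for i in range(PASSES):
        buf[:] = pattern * BUF_MB              # rewrite, then re-verify
        if hashlib.sha256(buf).hexdigest() != expected:
            print(f"worker {worker_id}: hash mismatch on pass {i}!")
            return
    print(f"worker {worker_id}: {PASSES} passes clean")

if __name__ == "__main__":
    procs = [mp.Process(target=worker, args=(n,)) for n in range(mp.cpu_count())]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```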

I don't use ZFS, but based on its professional origins (Sun, then Oracle), I'm surprised that it didn't give you better "clues"; tsk, tsk.
 
  • Like
Reactions: Levent

BLinux

cat lover server enthusiast
Jul 7, 2016
2,718
1,108
113
artofserver.com
@Levent congratulations on figuring out the problem. This is why I think using ECC RAM is a good idea. Not only would it have corrected bit errors, more importantly it would have reported the problem in the first place.
 

Levent

New Member
May 5, 2024
8
2
3
Congratulations! I do believe (and hope) you've solved it.

Was this some flavor of memtest?
I've seen quite a few troubleshooting situations where a "memory tester" indicated "no problem" but Prime95/mprime torture test would provoke errors. The former tests the RAM; the latter expands the scope of testing to the memory subsystem, including the combined (side-)effect of a stressed CPU (as with the RAIDZ2 and checksums of ZFS).

I don't use ZFS, but based on its professional origins (Sun, then Oracle), I'm surprised that it didn't give you better "clues"; tsk, tsk.
TestMem, mprime/Prime95, and OCCT. All of them passed; I believe I even tried Memtest from the Proxmox GRUB menu. I even ran Prime95 for 5 days non-stop with not a single indication of a problem.

@Levent congratulations on figuring out the problem. This is why I think using ECC RAM is a good idea. Not only would it have corrected bit errors, more importantly it would have reported the problem in the first place.
Honestly, I am kind of surprised no other VM in PVE indicated any sort of problem; I kind of expected one would.
 

BLinux

cat lover server enthusiast
Jul 7, 2016
2,718
1,108
113
artofserver.com
Honestly, I am kind of surprised no other VM in PVE indicated any sort of problem; I kind of expected one would.
Memory bit errors without ECC will not affect operation much unless the bit error happens in executable code. Even then, it might just cause some strange error and keep going. If most of the memory errors are happening in data and not code, you'll never notice them without ECC, or without some level of checksumming somewhere... which is exactly the capability something like ZFS has.
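A minimal illustration of that last point, not ZFS code, just the principle of an end-to-end checksum catching a single silent bit flip in data:

```python
# Minimal illustration: a checksum stored with the data detects a silent
# single-bit flip that nothing else in the system would notice.
import hashlib

data = bytearray(b"important payload " * 64)
stored_checksum = hashlib.sha256(data).hexdigest()   # written alongside the data

data[100] ^= 0x01     # simulate one bit flipping in RAM or in flight

if hashlib.sha256(data).hexdigest() != stored_checksum:
    print("checksum mismatch: corruption detected (ZFS would report/repair this)")
else:
    print("data appears intact")
```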
 
  • Like
Reactions: Levent