massive problems on my build.. need help!


jcizzo

Member
Jan 31, 2023
37
5
8
ok folks, i dunno if this is the place to post this request because i dunno if it's hardware or software that's causing my problems.. it's a doozy, so here goes.

for starters, this is a bare metal truenas core (latest stable) build.
the hardware is as follows:

supermicro x11ssh-f (latest firmware, i believe.. flashed it all very recently after downloading it from supermicro, so..)
I3-7100T (hyperthreading is disabled in the bios, so it's just 2 cores = 2 threads).
32Gigs of nemix ecc udimm (as per the spec that supermicro posted, it should work. i ran a memtest64 on it recently for 4 days+ straight, no errors).
LSI 9211-8i hba
intel x710-da2 nic
cpu cooler is complete over-kill as is the heatsink with fan on the hba.. basically, before anyone asks, heat is NOT a problem.. i can promise you that..
2 samsung ssd's for the OS (256Gig... had them lying around.. they're not the problem)
2 samsung 870 evo's (2.5" SATA III, 500GB each).. here's where the problem starts.
5 spinners (not a problem).

as previously stated, this is a bare metal nas and does NOT have any plugins installed, and probably never will. its only job will be A) important file storage (resumes, financial, legal documentation, basic very important stuff for me), and B) media (movies, tv shows, music; less important, but important nonetheless).

I want it configured as follows:
2x 500Gig SSD's mirrored to hold my ultra important files.
5x spinners will hold my less important media.

so here we go with the problem:
my SSD's are connected directly to the motherboard, while the spinners are connected to the LSI HBA.
during testing, when i'm performing large file copies to the mirrored data ssd's, it starts off fine, but after a few seconds the file copy slows to a crawl.. after a few more seconds the ssd data pool spits out errors and i get a warning saying that the pool has been degraded. from there, the only way i can get the system back is to connect to the IPMI and do a hard reset, as attempting to reboot via the truenas core webgui doesn't work; the cli just shows random pids saying "waiting for whatever".

and at that point the SSDs are HOT.. truenas core claims 45C but they seem far warmer than that.. i've never felt a 2.5 SATA SSD get that warm.
the first time i experienced this, i thought "ok, an ssd died, just replace it.." so that's what i did.. after replacing it with a brand new one, POOF!! same thing after only about 15 minutes of use.
plugging all ssds into a usb caddy and scanning them on my win10 pro pc yields no problems whatsoever. (one drive is brand new, the other 2 were within a year old and were hardly used).

furthermore, i swapped in a pair of brand-new crucial 1TB mx500 SSD's and received the same results..

FURTHERMORE, i tried connecting the SSD's to different SATA ports on the motherboard, SAME PROBLEM!!
i tried connecting the SSD's to the HBA, SAME PROBLEM!!

copying the same 110GB movie folder to the 5 spinners (raidz1) is perfect.. copying the same folder to the nvme (in 2x mode), is perfect..

i can't imagine it being caused by an overloaded cpu that just can't keep up with data writes.. it doesn't happen with the spinners, it doesn't happen with the nvme, and it happens regardless of whether i send the files across the 10G nic or the 1G nic. none of this makes any sense.. and yes, i reloaded truenas core to see if that was it.. same problem. the only difference is that over the 10G nic it blows up after 10-15 seconds, whereas over the 1G nic the copy makes it to the end before blowing up..

if you've made it this far, thank you for taking the time!
 

reasonsandreasons

Active Member
May 16, 2022
133
88
28
Try swapping out the SATA cable connected to the drive. I've had issues like that triggered by a bad cable, though not to the same extent (no full system lock-up).
 

oneplane

Well-Known Member
Jul 23, 2021
845
484
63
Those entry-level SATA consumer SSDs have a small cache that runs out rather quickly, after which the performance drops significantly. There are SSDs with better controllers, better/bigger DRAM, and better NAND, but they usually only come as NVMe models, where the sustained throughput is advertised rather boldly and they have to back that up with at least somewhat better hardware.

Spinners don't have that issue because the cache is tiny by comparison and you're on the 'real' throughput in about 2 seconds anyway.

One way to get around this is using enterprise SATA SSDs (used is fine). Another way is to move to different models, as the market is quite segmented by I/O interface (which technically shouldn't matter, but manufacturers make it matter anyway).

Sometimes you get an initial "good" performance on consumer SSDs because the unused NAND is used to accelerate writing when the wear and block allocation is low, but that degrades over time.
 

jcizzo

that's probably the one thing i haven't done..

if that's it i'll go berserk. lol
i just remembered; it can't be that because i also tried it with the HBA which, aside from being a different controller, is also on different sata cables (the ones connected to the hba).
 

Stephan

Well-Known Member
Apr 21, 2017
945
714
93
Germany
What does smartctl -ax /dev/adaX of both 500 GB SSDs say? Core is still FreeBSD, right? So use "camcontrol devlist" to list drives instead of lsblk. See if any SMART value is off, i.e. at critical or warning level. If you can't get it to work, boot a Linux live image from a Ventoy USB stick.

On Linux and with a second PC you could try "shred -vzn1 /dev/sdX" while simultaneously watching the speed with iotop. See if one or both are misbehaving. Run blkdiscard /dev/sdX afterwards to drop all blocks so the SSD controller frees them internally again too.

Edit: shred is, as the name implies, a destructive test. Check the man page. So is blkdiscard.
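
For anyone following along, here is a minimal Python sketch of that SMART check. It assumes smartmontools 7.0 or newer (for the --json flag) and that the report lists ATA attributes under ata_smart_attributes.table, which is how recent smartctl versions lay it out. The function names and the choice to single out attribute 199 (the interface CRC error counter, which usually points at cabling or connectors rather than the flash) are my own assumptions, not something from the posts above.

```python
import json
import subprocess

CRC_ERROR_ATTR_ID = 199  # interface CRC errors; the name varies by vendor


def flag_smart_attributes(report):
    """Return (id, name, reason) tuples for attributes worth a closer look.

    `report` is the dict parsed from `smartctl --json` output.
    """
    flagged = []
    table = report.get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        when_failed = attr.get("when_failed", "")
        raw = attr.get("raw", {}).get("value", 0)
        if when_failed:
            # smartctl fills this in when the value crossed its threshold
            flagged.append((attr["id"], attr["name"], f"failed: {when_failed}"))
        elif attr["id"] == CRC_ERROR_ATTR_ID and raw > 0:
            flagged.append((attr["id"], attr["name"], f"{raw} CRC errors"))
    return flagged


def read_smart(device):
    """Run smartctl on `device` (e.g. /dev/ada0) and parse its JSON report."""
    # smartctl uses non-zero exit codes as status bits, so do not use check=True
    result = subprocess.run(
        ["smartctl", "-x", "--json", device],
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Usage on the NAS itself (needs root; device name is a placeholder):
#   for item in flag_smart_attributes(read_smart("/dev/ada0")):
#       print(item)
```

A clean result here does not rule out a hardware fault; it only says the drive itself has not logged one.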
 

Markess

Well-Known Member
May 19, 2018
1,166
783
113
Northern California
Stephan said:
"What does smartctl -ax /dev/adaX of both 500 GB SSDs say? Core is still FreeBSD right, so "camcontrol devlist" to list drives instead of lsblk. See if any SMART value is off aka critical or warning level. If you can't get it to work, use a bootable Linux and Ventoy from USB stick.

On Linux and with a second PC you could try a "shred -vzn1 /dev/sdX" while simultaneously watching the speed with iotop. See if one or both are misbehaving. Run a blkdiscard /dev/sdX afterwards to drop all blocks so SSD controller frees them internally too again."

This. But first.....were you using the same SATA power cabling for all your tests? Maybe try a different SATA power cable string? If on a backplane, and you have the ability to swap out the power connector from the PSU, do that, and/or swap the drives to the far side of the chassis?
 

UhClem

just another Bozo on the bus
Jun 26, 2012
438
252
63
NH, USA
Enable write-caching on those two SSDs; your problems (both slowness and overheating) will go away.

On Linux, it would be hdparm -W1 /dev/sdX

[ Last time I used BSD, there were no SSDs ... nor SATA, nor SAS. ]
 

jcizzo

Markess said: "This. But first.....were you using the same SATA power cabling for all your tests? Maybe try a different SATA power cable string? If on a backplane, and you have the ability to swap out the power connector from the PSU, do that, and/or swap the drives to the far side of the chassis?"
i can try that..

i could've sworn i used this power supply with another motherboard/cpu combo.. that combo was older.. the os drives are on the same cable so i'd think i'd have problems with them as well, although they're not accessed or written to at all except at boot up..

you might be onto something since i did switch out the power supply.. pretty sure i got this sata power cable from the package of cables that came with the power supply.. but at this point i've been fighting with this stupid thing for so long that i can't remember these details..
 

Markess

UhClem said:
"Enable write-caching on those two SSDs; your problems (both slowness and overheating) will go away.

On Linux, it would be hdparm -W1 /dev/sdX

[ Last time I used BSD, there were no SSDs ... nor SATA, nor SAS. ]"
I'm throwing this out because I'm genuinely curious about the answer. It's not a recommendation or an expert statement of fact ...

Since OP is using Core, shouldn't ZFS already be configured for write caching in the...umm..."ZFS"ish way? I understand from what I read in the past, when my own Free/TrueNAS installations gave me problems (usually due to operator error, hence my disclaimer that it's not expert advice), that ZFS caching would address asynchronous writes to the extent that RAM was available. Synchronous writes would still be an issue, and be something that a SLOG would be meant to address. Although since the drives at issue are already SSDs, I'm not sure that would help, except to spread out the writes to an additional drive?

Again, I'm curious about the answer...not because I know it.

Some anecdotal info: I currently have one machine on TrueNAS Scale with 128GB of RAM and a RAIDZ2 of 10 mixed NVMe and SATA 2.5 SSDs (probably not the most brilliant move I know). In the initial dump of files onto it, the ZFS cache quickly shot up from 1.7GB to over 70GB, and that was on a 1G connection. The 2.5" drives (Samsung SV843) temps went up a little over the course of a couple hours, but not a lot. By comparison, the OP has 1/4 as much RAM and a 10 times (or more) faster network connection. So, their write cache is probably maxed out in very short order and after that the drives are hammered with throughput?

In the past, I'd set up a FreeNAS system with 64GB of RAM, SAS3 SSDs (Toshiba PX02SMF020) & a 10G connection. In more sustained throughput testing, transfer was crazy fast, but the cache maxed out almost instantly and the drives became incredibly hot in very short order. Outside of testing though (and after I'd moved the initial batch of data onto the system) it all smoothed out and was pretty manageable with my usual routine of writes and reads in smaller bursts.
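
To put rough numbers on the "fast network fills the cache quickly" idea, here is a small back-of-the-envelope sketch in Python. The 110 GB folder and the 1G/10G links come from the OP's posts; the 4 GB write buffer and the 0.25 GB/s sustained drive write rate are made-up illustrative values, not measurements or ZFS defaults. It also ignores protocol overhead, so treat it as an upper bound on speed.

```python
def line_rate_gb_per_s(gbit_per_s):
    """Convert a link speed in Gbit/s to GB/s (8 bits per byte, no overhead)."""
    return gbit_per_s / 8


def seconds_to_transfer(size_gb, rate_gb_per_s):
    """Time to move size_gb at a steady rate."""
    return size_gb / rate_gb_per_s


def seconds_until_buffer_full(buffer_gb, ingest_gb_per_s, drain_gb_per_s):
    """Time until a write buffer fills, or None if the drives keep up."""
    if ingest_gb_per_s <= drain_gb_per_s:
        return None
    return buffer_gb / (ingest_gb_per_s - drain_gb_per_s)


FOLDER_GB = 110          # from the OP's test copy
BUFFER_GB = 4            # hypothetical in-RAM write buffer
SSD_DRAIN_GB_S = 0.25    # hypothetical sustained SSD write rate

for link in (1, 10):
    rate = line_rate_gb_per_s(link)
    print(link, "Gbit/s:",
          seconds_to_transfer(FOLDER_GB, rate), "s for the folder,",
          "buffer full after", seconds_until_buffer_full(BUFFER_GB, rate, SSD_DRAIN_GB_S))
# 1 Gbit/s: 880.0 s for the folder, buffer full after None
# 10 Gbit/s: 88.0 s for the folder, buffer full after 4.0
```

With these assumed numbers, a 1G link never outruns the drives, while a 10G link saturates the buffer within seconds and leaves the drives writing flat out for the rest of the copy.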
 

jcizzo

Markess said: "This. But first.....were you using the same SATA power cabling for all your tests? Maybe try a different SATA power cable string? If on a backplane, and you have the ability to swap out the power connector from the PSU, do that, and/or swap the drives to the far side of the chassis?"

MARKESS YOU BEAUTY!!!!! THAT DID IT!!

ohh maaan!! the award for biggest tech a-hole of the year goes to me in a massive way!!!

i forgot that i had swapped power supplies and the sata power cables looked the same and i didn't bother changing them.. upon inspection, the one that was running my ssds was different.. barely noticeable, but different.. when i looked at the wires coming outta the plug, it was different..

just been running 10Gig tests on those drives and all is running PERFECTLY!!! this has been causing me headaches for the past 2-3 weeks..

thank you again!!
 

Markess

jcizzo said: "i forgot that i had swapped power supplies and the sata power cables looked the same and i didn't bother changing them.. upon inspection, the one that was running my ssds was different.. barely noticeable, but different.. when i looked at the wires coming outta the plug, it was different.."
I've learned (the hard way) to label modular PSU cables with the brand and model # of the power supply they came with (if they didn't come pre-labeled). Most brands use the same 6 & 8 pin plug types for drives and peripherals on the PSU end, but the pin assignments are anything but standardized!

"Just for the record, what exactly was wrong with that power cable?"
I get the impression he had a SATA power cable of one brand/model plugged into a PSU from another brand/model & the two must have had differing pin assignments.
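
Looking back, the way this one got narrowed down (swap one part at a time, then ask what every failing test still had in common) can be sketched with Python sets. The component labels below are my own shorthand for the tests the OP described, and the assumption that the spinners ran off a different power cable string is mine; the thread does not spell that out.

```python
def common_suspects(failing_runs, passing_runs):
    """Components present in every failing run and in no passing run.

    This treats anything that took part in a passing run as cleared,
    which is a simplification: a marginal part can still pass sometimes.
    """
    if not failing_runs:
        return set()
    suspects = set.intersection(*(set(run) for run in failing_runs))
    for run in passing_runs:
        suspects -= set(run)
    return suspects


# Hypothetical encoding of the tests described in the thread.
FAILING = [
    {"samsung_870_evo", "onboard_sata", "ssd_power_cable", "nic_10g"},
    {"crucial_mx500", "onboard_sata", "ssd_power_cable", "nic_10g"},
    {"samsung_870_evo", "hba", "ssd_power_cable", "nic_10g"},
    {"samsung_870_evo", "onboard_sata", "ssd_power_cable", "nic_1g"},
]
PASSING = [
    {"spinners", "hba", "hdd_power_cable", "nic_10g"},
    {"nvme", "nic_10g"},
]

print(common_suspects(FAILING, PASSING))
# {'ssd_power_cable'}
```

The point is less the code than the habit: write down what each test actually shared, including the boring parts like power cabling, before swapping more drives.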