Does the SAS 2008 chip throttle when overheating?


Mastakilla

Member
Hi all,

While running the https://www.ixsystems.com/community/resources/solnet-array-test.1/
script on FreeNAS, I saw some very weird behaviour during the seek-stress-read test (which uses dd and takes about a week to complete).
During the test my IPMI and network interfaces became unreachable (only an existing SSH connection remained open for a couple more days; after that it also crashed while running some command, and I had to power off the server completely).
The 3 HDDs from the screenshots were the only ones that completed the test; the other 5 stayed throttled for even longer periods...
The dd commands did continue to run after the system became unreachable, though, and completed on all 8 HDDs after a few more days (but I was no longer able to take screenshots in the GUI).
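
For reference, the parallel-read passes of that script essentially boil down to reading every raw disk front-to-back with dd at the same time. A simplified sketch of what I mean (not the actual script, and the device names are just my 8 disks):
Code:
# Simplified illustration only (not the actual solnet script): sequentially
# read every raw disk in parallel and let dd report the throughput.
for d in da0 da1 da2 da3 da4 da5 da6 da7; do
    dd if=/dev/$d of=/dev/null bs=1048576 &
done
wait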



(screenshots)

After thinking it through, this (the yellow) looks very much like something throttling, but it doesn't seem to throttle all 8 HDDs by the same amount at the same time; rather, it throttled them "inconsistently"...

I was wondering if that could have been my HBA that was throttling? It would make sense, as it sits in a desktop case with only very little airflow (a 600 rpm PSU fan at 13 cm and a 1000 rpm case fan at 30 cm, with the HDDs in between).


In the meantime, I've tried to replace the heatsink on the HBA, but I broke off a resistor :( Although I managed to solder it back on (extremely badly) and the HBA still seems to work fine, I no longer trust it and I'm looking to replace it...


As I hate small high-rpm fans, I've created a 120mm fan holder to get some airflow over my future HBA.


But... I'm still searching for some more certainty about what the hell happened that weird day... So does anyone know if these HBAs can throttle?
 
  • Like
Reactions: lowfat and gb00s

Spartacus

Well-Known Member
Austin, TX
I had a similar HBA cooling need.
1) Make sure you replace the TIM; that stuff is usually insanely old.

2) Check out this: PCI Side-Blown Fan Mount Universal Bracket Holder | eBay
or a US version if you don't wanna wait as long (but it costs more):
https://www.amazon.com/dp/B012T3Q5S6/ref=cm_sw_em_r_mt_dp_U_FArLEb6W4MD9E

Here's what it looks like: I cool 2 HBAs + a 10G NIC, and when I move it all the way to the end, it even cools the onboard heatsink.
(picture)

Edit: BTW, I used Dell H310 cards flashed to IT mode. They work great, and the heatsink is just held on by metal hooks, so it's super easy to remove and put back on (but difficult to swap for a different heatsink due to the mounting).
 
  • Like
Reactions: Mastakilla

Mastakilla

Member
I had a similar HBA cooling need.
1) Make sure you replace the TIM; that stuff is usually insanely old.

2) Check out this: PCI Side-Blown Fan Mount Universal Bracket Holder | eBay
or a US version if you don't wanna wait as long (but it costs more):
https://www.amazon.com/dp/B012T3Q5S6/ref=cm_sw_em_r_mt_dp_U_FArLEb6W4MD9E

Here's what it looks like: I cool 2 HBAs + a 10G NIC, and when I move it all the way to the end, it even cools the onboard heatsink.
(picture)

Edit: BTW, I used Dell H310 cards flashed to IT mode. They work great, and the heatsink is just held on by metal hooks, so it's super easy to remove and put back on (but difficult to swap for a different heatsink due to the mounting).
Thanks for the advice! I actually saw that type of fan bracket, but I didn't really trust it to be stable enough, as it's only held on one side and puts all the weight on the slot. Doesn't the fan shake when it's held only by this bracket?

Why don't you put a 40*40*10mm fan on the metallic cooling block? Much easier.

To my knowledge the most powerful (highest airflow) 40*40*10mm fan:

Xilence - XF031 XPF40.W Case Fan 40 x 40 x 10, White Box | 40mm
Thanks for the advice! I won't do that, though, because I'm trying to make my NAS as silent as possible and I've only had bad experiences with small high-rpm fans.
 

BeTeP

Well-Known Member
I was wondering if that could have been my HBA that was throttling?
No. "throttling" is reducing clock frequency to allow the chip to cool down and gracefully recover. What you have observed is just drop in throughput caused by gradually increasing number of errors.
 
  • Like
Reactions: Mastakilla

Mastakilla

Member
Your CPU fan makes more noise than that little fan. Don't be fooled by the fan's high speed.
Thanks for that insight; that does surprise me a little...
However, I already have a dead-silent 120mm fan and I spent a whole afternoon making that wooden fan holder, so unless someone thinks it isn't sufficient, I'll try that first... ;)

No. "throttling" is reducing clock frequency to allow the chip to cool down and gracefully recover. What you have observed is just drop in throughput caused by gradually increasing number of errors.
Ah yes, didn't think of that yet... Is that a known behaviour of the SAS2008 chip?

Also a bit strange that it happens so randomly inconsistent over the 8 disks. I would expect that it would be spread at the same time over all disks, as they're all using that same SAS2008 chip at the same time...?? Some disks just keep on going full speed, while others drop to 20% of their normal speed... Then a bit later some more drop to 20% and near the end some jumped back to full speed, while others kept on being sluggish at 20%...?
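
Next time I'll also try to watch the per-disk throughput live at the console instead of only in the GUI graphs; if I read the FreeBSD iostat man page right, something like this should do it (10-second samples of my 8 disks):
Code:
# Extended per-device statistics every 10 seconds for the 8 disks
iostat -x -w 10 da0 da1 da2 da3 da4 da5 da6 da7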
 

Spartacus

Well-Known Member
Austin, TX
Thanks for the advice! I actually saw that type of fan bracket, but I didn't really trust it to be stable enough, as it's only held on one side and puts all the weight on the slot. Doesn't the fan shake when it's held only by this bracket?
I had the same thought and bought 2 of them, one for each side, but if you screw it down over all 3 of the PCIe slots it attaches to, it doesn't have any wiggle up and down at all. It vibrates a little more, but when tightened down properly it's reasonably stable even all the way at the end. As long as you don't hit the fins, most fans are pretty well balanced.

With the thickness of my HBA and fan, the bracket is actually pressing the fan against the HBA (which is fine, since the guard is on that side to keep it from hitting anything). So it's pressed against the HBA, which removes any front-to-back movement too.
 
  • Like
Reactions: Mastakilla

BLinux

cat lover server enthusiast
artofserver.com
Did you check the system logs? I don't use FreeNAS that much, but at least with Linux, when the SAS2008 IOC overheats it usually causes a reset, and the driver will show this in the logs. It will then try to recover and reload the driver, and you'll see messages where the HBA comes back online... If it is really bad, it will just report that the IOC is in a fault state or something similar. At least, that's been my experience... Of course, that could also be caused by other things, and the SAS2008 doesn't have a temperature sensor to tell you for sure, but I've seen this symptom match overheating before, so it's a possible diagnosis.
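
Something like this should surface those messages if they're there (just a sketch; on FreeBSD/FreeNAS the SAS2008 uses the mps driver, on Linux it's mpt2sas/mpt3sas):
Code:
# FreeBSD / FreeNAS
dmesg | grep -i mps
grep -i mps /var/log/messages

# Linux
dmesg | grep -iE 'mpt2sas|mpt3sas|ioc'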
 
  • Like
Reactions: Mastakilla

zack$

Well-Known Member
I had the same thought and bought 2 of them, one for each side, but if you screw it down over all 3 of the PCIe slots it attaches to, it doesn't have any wiggle up and down at all. It vibrates a little more, but when tightened down properly it's reasonably stable even all the way at the end. As long as you don't hit the fins, most fans are pretty well balanced.

With the thickness of my HBA and fan, the bracket is actually pressing the fan against the HBA (which is fine, since the guard is on that side to keep it from hitting anything). So it's pressed against the HBA, which removes any front-to-back movement too.
If you use 3 slots, can you still use two of those slots for add-in cards?
 

Mastakilla

Member
Did you check the system logs? I don't use FreeNAS that much, but at least with Linux, when the SAS2008 IOC overheats it usually causes a reset, and the driver will show this in the logs. It will then try to recover and reload the driver, and you'll see messages where the HBA comes back online... If it is really bad, it will just report that the IOC is in a fault state or something similar. At least, that's been my experience... Of course, that could also be caused by other things, and the SAS2008 doesn't have a temperature sensor to tell you for sure, but I've seen this symptom match overheating before, so it's a possible diagnosis.
Yes, I did check all of the log files in /var/log but couldn't find anything related.

All I found that could be even slightly related to losing all connectivity was:
Feb 20 12:41:20 FreeNAS kernel: ix0: link state changed to DOWN
Feb 20 12:41:20 FreeNAS kernel: ix0: link state changed to DOWN
Feb 20 12:42:45 FreeNAS kernel: ix0: link state changed to UP
Feb 20 12:42:45 FreeNAS kernel: ix0: link state changed to UP

But that doesn't say much either...
 

Mastakilla

Member
As I didn't really trust my LSI controller anymore, I bought a new (well, second-hand) Dell H310. I replaced its thermal paste and placed a 120mm fan right next to it (as in the picture above).

Then I re-ran the solnet array test script, and although things have improved compared to the LSI, there are still some weird things happening...

The good:
  • Heatsink temp during a couple of days of non-stop seek-stress testing is (just) "ok to hold your finger on", so very OK.
  • IOstat serial read = 228-242 MB/sec, so every HDD has a normal / consistent speed. The FreeNAS I/O report also shows a nice, normal declining line. da1/da5 are the slowest and da0/da4/da7 are the fastest.
  • dd parallel read = 228-244 MB/sec, so every HDD has a normal / consistent speed. The FreeNAS I/O report also shows a nice, normal declining line. da1/da5 are the slowest and da0/da4/da7 are the fastest.
  • So normal read speeds are very consistent for each HDD.
  • The system didn't crash and didn't become unreachable.

The bad:
  • Nothing really bad this time.

The ugly:
  • The dd parallel seek-stress read has improved (a little?), but there is still weird behaviour and large inconsistency (30%).
  • Although the time for the parallel read is measured correctly by the solnet array test script (when comparing it to the FreeNAS I/O reporting graphs), something seems wrong with the seek-stress read time measurement. The slowest HDD, according to the script, took 289132 seconds (3d8h) to complete, while that test actually took more than 5 days (see the output below and the quick check further down). The FreeNAS I/O reporting graphs also confirm this huge, weird difference.
Code:
Performing initial parallel seek-stress array read
Sat Apr 25 14:33:46 CEST 2020
...
Awaiting completion: initial parallel seek-stress array read
Thu Apr 30 15:47:22 CEST 2020
Completed: initial parallel seek-stress array read
  • The parallel seek-stress read, according to the solnet array test script's timers per HDD:
    • The test took between 2d21h and 3d8h per HDD.
    • da1/da4 are the slowest and da0/da6 are the fastest during the dd parallel seek-stress read (so different from the normal read).
    • It still marks an HDD as "fast" or "slow" in the output. If I understand the script correctly, it only does this when results "jump out" and are "not normal"?
    • Compared to the "LSI run" from some months ago, it ran almost 25% faster.
  • The parallel seek-stress read, according to the FreeNAS I/O reporting graphs per HDD:
    • The test took between 3d22h and 5d2h per HDD (so an insane difference).
    • da5/da6/da3 are the slowest and da4/da2/da0 are the fastest during the dd parallel seek-stress read (so very different from all of the above).
    • Compared to the "LSI run" from some months ago, it only ran about 5% faster.
    • The graph starts out pretty normal and consistent, but near the end there is again quite some variance, with both I/O drops and I/O peaks. Exactly the same as with the LSI, but perhaps a little less extreme (it took a little less time and the minimum speed was higher).
As I've now replaced the HBA and solved the temperature issues, I was actually hoping for even more consistency. It is better, but it still looks a little problematic, no?
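
The timing mismatch in plain numbers (just a quick back-of-the-envelope check):
Code:
# Slowest disk according to the script (da4): 289132 seconds
echo $(( 289132 / 3600 ))                        # ~80 hours, i.e. ~3d8h
# Wall clock of the whole seek-stress pass:
# Sat Apr 25 14:33:46 -> Thu Apr 30 15:47:22 = 5 days + 1h13m36s
echo $(( 5 * 86400 + 1 * 3600 + 13 * 60 + 36 ))  # 436416 seconds, ~5d1h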

Below is the full output of the script and the screenshots of the FreeNAS I/O reporting graphs (you may ignore the first peak before 25 April, as it was from an aborted run).
Performing initial serial array read (baseline speeds)
Fri Apr 24 23:12:06 CEST 2020
Fri Apr 24 23:30:10 CEST 2020
Completed: initial serial array read (baseline speeds)

Array's average speed is 237.013 MB/sec per disk

Disk Disk Size MB/sec %ofAvg
------- ---------- ------ ------
da0 9537536MB 241 102
da1 9537536MB 228 96
da2 9537536MB 237 100
da3 9537536MB 240 101
da4 9537536MB 241 102
da5 9537536MB 231 97
da6 9537536MB 236 99
da7 9537536MB 242 102

Performing initial parallel array read
Fri Apr 24 23:30:10 CEST 2020
The disk da0 appears to be 9537536 MB.
Disk is reading at about 242 MB/sec
This suggests that this pass may take around 656 minutes

Serial Parall % of
Disk Disk Size MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da0 9537536MB 241 244 101
da1 9537536MB 228 228 100
da2 9537536MB 237 240 101
da3 9537536MB 240 240 100
da4 9537536MB 241 241 100
da5 9537536MB 231 233 101
da6 9537536MB 236 236 100
da7 9537536MB 242 242 100

Awaiting completion: initial parallel array read
Sat Apr 25 14:33:46 CEST 2020
Completed: initial parallel array read

Disk's average time is 52391 seconds per disk

Disk Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
da0 10000831348736 51051 97
da1 10000831348736 54216 103
da2 10000831348736 52273 100
da3 10000831348736 52096 99
da4 10000831348736 51460 98
da5 10000831348736 53648 102
da6 10000831348736 52694 101
da7 10000831348736 51690 99

Performing initial parallel seek-stress array read
Sat Apr 25 14:33:46 CEST 2020
The disk da0 appears to be 9537536 MB.
Disk is reading at about 227 MB/sec
This suggests that this pass may take around 699 minutes

Serial Parall % of
Disk Disk Size MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da0 9537536MB 241 228 95
da1 9537536MB 228 209 92
da2 9537536MB 237 220 93
da3 9537536MB 240 225 94
da4 9537536MB 241 222 92
da5 9537536MB 231 213 92
da6 9537536MB 236 224 95
da7 9537536MB 242 224 93

Awaiting completion: initial parallel seek-stress array read
Thu Apr 30 15:47:22 CEST 2020
Completed: initial parallel seek-stress array read

Disk's average time is 267933 seconds per disk

Disk Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
da0 10000831348736 248448 93
da1 10000831348736 288345 108 --SLOW--
da2 10000831348736 263132 98
da3 10000831348736 266320 99
da4 10000831348736 289132 108 --SLOW--
da5 10000831348736 270470 101
da6 10000831348736 252665 94
da7 10000831348736 264953 99
(8 screenshots of the FreeNAS I/O reporting graphs)
 

Mastakilla

Member
Thanks for your response!!

A couple of minor remarks:
So in your case, your average disk speeds are OK. Also remember that FreeNAS (ZFS) won't hit your disks for data once it has been transferred to the ARC (your RAM), so that is possibly why you see slower disk access or slower transfers from your disks, as it doesn't need to hit the disks for much as time goes on. This assumes you pull the same or related data from your disks. This is just my guess.
Currently the disks have not been added to a ZFS pool yet. The filesystem from a previous FreeNAS installation may still be present, but as far as the current FreeNAS install knows, there is no ZFS filesystem. I suppose this means that the ARC is not playing a role here (although your explanation does sound very plausible).
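
Also, as far as I understand it, the script's dd reads go against the raw /dev/daX devices and bypass ZFS entirely, so the ARC shouldn't even see them. Just to be sure, the ARC counters can be checked during a run (a sketch; these sysctls should be present on any FreeNAS install):
Code:
# ARC size and hit/miss counters; with no pool imported and raw-device
# reads these should stay small / flat during the test.
sysctl kstat.zfs.misc.arcstats.size
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses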

As for the HBA, you did the right thing replacing it. A broken resistor might have caused some data corruption errors, and possibly some flipped bits, although ZFS usually fixes stuff like that up to a point.
Just FYI: the results from the start of this thread were taken before I damaged that resistor...

When you initiate a sequential transfer to a Samba or Windows share, do you get your rated network speed? Are you using 1Gb or 10Gb Ethernet? As long as you see those network speeds, you are fine. The hard drives are offering up their data, and the ARC/L2ARC is taking it and pushing it down the network pipe.
10GbE indeed, and the limited testing I've done so far matched my expectations.
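
One way I know of to check the raw 10GbE path separately from the disks is iperf3 (available on FreeNAS; the hostname below is just an example):
Code:
# On the FreeNAS box:
iperf3 -s
# On a client on the same 10GbE network (30-second run):
iperf3 -c freenas.local -t 30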

If you must see a graphical representation of the health of your drives, you could sideload Windows and use HDSentinel to see the SMART data visualized, or you can use smartctl in FreeNAS to get the SMART data. As long as the drives are healthy, don't get too wrapped up in which drive is transferring slower or faster. As long as your drives are not sick, you are OK!
I already have Windows with HDSentinel sideloaded :) And the HDDs have also already been fully tested using HDSentinel.

That the HDDs themselves are probably fine, I already knew. But all these tests test more than just the HDDs: they test how well the HDDs work with the HBA, motherboard and OS. And there I do still find these results a bit strange and too inconsistent.
What also surprises me a bit is that this solnet array test script is more than 10 years old and FreeNAS advises everyone to burn-in test their HDDs, yet no one seems to have done so thoroughly. No one (so far) can tell me whether these results are "normal" or not.
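
In case it helps anyone compare: besides this script and HDSentinel's surface test, the other burn-in step that usually gets recommended is a long SMART self-test per disk. A rough sketch (adjust device names as needed):
Code:
# Start a long self-test on each disk...
for d in da0 da1 da2 da3 da4 da5 da6 da7; do
    smartctl -t long /dev/$d
done
# ...and check the results hours later, per disk:
smartctl -l selftest /dev/da0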