Intel RS3DC040 troubleshooting - CRC unexpected sense and frequent Port0 drive failures.

gdmaddog

New Member
Apr 15, 2022
2
0
1
The interwebz
I have a fairly non-standard setup here: an Intel RS3DC040 controlling a local 4-drive RAID10 used for video editing work. I am hoping someone can offer some guidance, because it has repeatedly killed drives, most recently a number of WD Gold 6TB, and I am struggling to determine the root cause. The array originally held four 4TB WD Blacks (not ideal, I recognize this), but the controller was killing the Port0 drive so frequently that I moved to WD Gold 6TBs, in the hope that an enterprise drive would help resolve the issue (and that I could eventually migrate all 4 drives to 6TB Golds and up my capacity).

There is no backplane (though there was one at one point), simply 4 drives, and it is always the Port0 drive that fails.

Tonight it happened again, in less than 36 hours; I had not even ordered a new spare yet (a 4TB will now arrive on Saturday). The log shows "Unexpected sense", reset, and CRC errors after about 17 hours of uptime and light drive use.

There is no indication of overheating in the log. This controller reportedly has an operating temperature range up to 105C, and at the time the alarm sounded it was sitting at 89C (I have never seen it higher, though I am now looking into ways to improve airflow to it, just in case).

Intel RWC3 shows three of the drives at ~22C, but the "failed" WD Gold at ~45C - it sits in a slightly higher location than the others.



Since these issues started, I have:

Removed the backplane
Replaced the controller SAS to SATA cable
Replaced the PSU power cable
Replaced the PSU
Changed from WD Black to WD Gold enterprise drives.


Each time the controller fails the drive, I replace it, and how frequently this happens is becoming almost comical. I am fairly new at maintaining this RAID, and hope someone here can point me in the right direction and mindset for troubleshooting this frustrating scenario. I have 3 (now 4) "failed" drives sitting on my desk, waiting for me to work out the best method for determining how and why they failed - at least one refuses to mount on another PC here.
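One way to start sorting the "failed" drives on another PC is to look at their SMART attribute tables. The sketch below is only a rough heuristic, assuming the table was saved as text from smartmontools (`smartctl -A`) with the drive on a plain SATA port; the attribute set in `WATCHED`, the `classify` rule, and the `SAMPLE` table are illustrative assumptions, not readings from these drives. The idea is that a rising `UDMA_CRC_Error_Count` with clean media counters points more toward the link (cable, port, controller) than toward the platters.

```python
# Attribute names as printed by `smartctl -A`. Which ones a given drive
# reports varies by model, so missing entries are simply skipped.
MEDIA_COUNTERS = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}
LINK_COUNTER = "UDMA_CRC_Error_Count"
WATCHED = MEDIA_COUNTERS | {LINK_COUNTER}


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for the watched attributes."""
    found = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the 10th column is the raw value.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        name, raw = parts[1], parts[9]
        if name in WATCHED and raw.isdigit():
            found[name] = int(raw)
    return found


def classify(attrs):
    """Rough triage only: 'media', 'link', or 'clean' (hypothetical labels)."""
    if any(attrs.get(name, 0) > 0 for name in MEDIA_COUNTERS):
        return "media"
    if attrs.get(LINK_COUNTER, 0) > 0:
        return "link"
    return "clean"


# Made-up sample table for illustration; NOT output from the drives in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       37
"""

attrs = parse_smart_attributes(SAMPLE)
print(attrs, classify(attrs))
```

A "link" result here would not prove the drive is healthy; it only suggests checking the same drive on a different port and cable before writing it off.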
 

gdmaddog

Today I did the inadvisable: out of curiosity, I backed up whatever I would need in a pinch and set the "failed" drive to rebuild.

I didn't watch it as closely as I intended, and a few hours ago it started throwing "Unexpected sense occurred; Additional Sense Info: Power on, reset, or bus device reset occurred." in the log every few seconds (sometimes much more frequently), punctuated by an occasional pair of actual warnings: "Command timeout" followed immediately by "Reset".
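To put numbers on how often these messages appear, the log could be exported to text and tallied. This is a minimal sketch that matches only the message strings quoted above; it assumes nothing about timestamps or event IDs, and the function names and the `sample` lines are hypothetical.

```python
from collections import Counter

# Checked in order, case-insensitively. "Unexpected sense" comes first because
# that message also contains the word "reset".
PATTERNS = [
    ("unexpected_sense", "unexpected sense occurred"),
    ("command_timeout", "command timeout"),
    ("reset", "reset"),
]


def classify_line(line):
    """Return the label of the first pattern found in the line, or None."""
    lowered = line.lower()
    for label, needle in PATTERNS:
        if needle in lowered:
            return label
    return None


def tally_events(lines):
    """Count how many lines fall under each label."""
    return Counter(label for label in map(classify_line, lines) if label)


def count_timeout_reset_pairs(lines):
    """Count 'Command timeout' lines immediately followed by a 'Reset' line."""
    labels = [classify_line(line) for line in lines]
    return sum(
        1
        for first, second in zip(labels, labels[1:])
        if first == "command_timeout" and second == "reset"
    )


# Illustrative lines built from the quoted messages, not a real log export.
sample = [
    "Unexpected sense occurred; Additional Sense Info: Power on, reset, or bus device reset occurred.",
    "Unexpected sense occurred; Additional Sense Info: Power on, reset, or bus device reset occurred.",
    "Command timeout",
    "Reset",
    "Unexpected sense occurred; Additional Sense Info: Power on, reset, or bus device reset occurred.",
]

print(dict(tally_events(sample)), count_timeout_reset_pairs(sample))
```

Comparing these counts between a run with normal fan speeds and a run with better cooling would show whether the errors actually track controller temperature.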

I noted the controller hit 95C just rebuilding in an otherwise idle machine. My suspicion now is that the airflow in this non-server tower may simply not be sufficient for a server controller with passive cooling, and that this may be the source of my issues.

Wondering if I can modify or replace this heatsink with an active fan...

[UPDATE: running all fans at full speed has dropped the temp to 70C, but this is unfortunately not a long-term solution for this machine in its current case/location - a spot cooling fan, or a fan mounted directly onto the heatsink, is being considered.]
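For reference, the margin between each reading mentioned so far and the reported 105C limit works out as below. This is just the arithmetic, assuming the 105C figure is right (it is the reported value, not checked against a datasheet); the labels are shorthand for the situations described in the posts.

```python
# Reported upper operating limit for the controller (unverified assumption).
RATED_MAX_C = 105

# Controller temperatures mentioned in the thread.
READINGS_C = {
    "alarm under light use": 89,
    "rebuild, otherwise idle": 95,
    "all fans at full speed": 70,
}


def headroom(reading_c, limit_c=RATED_MAX_C):
    """Degrees C remaining before the reported limit."""
    return limit_c - reading_c


for label, temp in READINGS_C.items():
    print(f"{label}: {temp}C, {headroom(temp)}C below the reported limit")
```

The rebuild left only 10C of margin, against 35C with all fans at full speed.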
 