Losing HDDs at random in my 24-disk server

vl1969

Active Member
Feb 5, 2014
634
76
28
Hi, not sure if this is the right place, so moderators, please move this where it belongs.

I have a home server set up in an oldish Supermicro SC846 enclosure.
The details are in my sig, but it is a dual Xeon with a SAS controller in IT mode and a 24-lane expander going to a
SAS846TQ Supermicro 4U 24-bay SAS backplane.

I have a mix of 2T, 3T, and 4T disks in several ZFS pools.
For the last few months I have been experiencing random disk dropouts: one or two 2T disks in a couple of vdevs. I even got 4 new 4T drives to replace and expand the pools, but now a couple of those are dropping out as well. So my pools go into degraded mode and I have to online the disks and resilver the pools. No data loss so far, but it is annoying.

Can someone chime in with ideas on why this might be happening?

I have replaced the cables, the controller, and the expander card.

Could this be the backplane?
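
For anyone following along, the degraded state described above can be picked out of `zpool status` output with a few lines of Python. This is only a minimal sketch: the sample text, the pool name `tank`, and the device names are made up for illustration, and the helper `unhealthy_entries` is hypothetical, not part of any ZFS tooling.

```python
# Hypothetical sample in the general layout that `zpool status` prints.
# Pool and device names are invented for illustration.
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

    NAME        STATE     READ WRITE CKSUM
    tank        DEGRADED     0     0     0
      raidz1-0  DEGRADED     0     0     0
        sda     ONLINE       0     0     0
        sdb     REMOVED      0     0     0
        sdc     ONLINE       0     0     0
"""

# Device states this sketch recognises in the STATE column.
STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}


def unhealthy_entries(status_text):
    """Return (name, state) for every config row whose state is not ONLINE."""
    rows = []
    for line in status_text.splitlines():
        parts = line.split()
        # Config rows have at least NAME, STATE, READ, WRITE, CKSUM columns;
        # this also skips the two-field "state: DEGRADED" header line.
        if len(parts) >= 5 and parts[1] in STATES and parts[1] != "ONLINE":
            rows.append((parts[0], parts[1]))
    return rows


if __name__ == "__main__":
    for name, state in unhealthy_entries(SAMPLE):
        print(name, state)
```

On a live system the same function could be fed the text captured from running `zpool status`, and logging the result on a schedule would give a timestamped record of which disks dropped and when.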
 

nabsltd

Active Member
Jan 26, 2022
345
211
43
My first thought would be power supply issues. Yes, it could technically be the expander card with the issue, but random drop out of different disks usually points to not enough power.

Unless you have really small power supplies (like 500W) in that chassis, the raw total power is likely not the issue. Most came with at least 700W, with much higher being very common.
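
To put rough numbers on the "raw total power" point, here is a back-of-the-envelope sketch. The per-drive and base-system wattages are placeholder assumptions, not measurements from this chassis; swap in figures from the drive datasheets and a wall meter before drawing any conclusions.

```python
# All figures below are illustrative assumptions, not measured values.
ASSUMED_DRIVE_W = 10   # placeholder steady-state draw per 3.5" drive
ASSUMED_BASE_W = 300   # placeholder draw for board, CPUs, RAM and fans


def headroom_watts(psu_w, n_drives, drive_w=ASSUMED_DRIVE_W, base_w=ASSUMED_BASE_W):
    """PSU rating minus the estimated total load; negative means overloaded."""
    return psu_w - (n_drives * drive_w + base_w)


if __name__ == "__main__":
    for psu in (500, 700):
        print(psu, "W PSU headroom:", headroom_watts(psu, 24), "W")
```

This only looks at total steady-state wattage. It ignores spin-up surges and per-rail limits, which can matter even when the total looks fine.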
 

i386

Well-Known Member
Mar 18, 2016
4,222
1,541
113
34
Germany
vl1969 said: "I even got 4 new 4T drives"
New as in manufactured in the last 5 years? Or new as in drives that were not used in this system before?
Consumer drives (thinking of WD Greens and similar)?
Original fans or "frankensteined" cooling/system?

My guesses:
0. (heat, if the chassis or the fans have been modded)
1. PSU
2. cables
3. expander (many connectors -> more chances that something was damaged)
4. HBA/RAID controller
nabsltd said: "Most came with at least 700W, with much higher being very common."
Some newer JBOD chassis now come with a Titanium 600 W PSU, and I think SM had 846 servers with Bronze 500 W PSUs...
 

vl1969

OK, to cover all these questions in one go.

FYI, this is what you would call a kind of frankenbuild.
The chassis is a Supermicro SC846,
the older kind. It came with Opteron CPUs.
I have upgraded the MB to dual Xeon 56xx and 124 GB of RAM.

I also gutted the whole thing and moved to a standard ATX 1050 W PSU.

So I have plenty of power.
I don't think it's cooling, but I will see if I can improve it.
It has been working for 6 years since the upgrade with no issues, though.

The cables are new. I had a controller failure 2 years ago, so back then I had to swap the 3x 8-disk controllers for one controller plus an expander and 8087-to-4x-SATA breakout cables.

Reading all of the comments so far, the backplane seems to be the likely suspect.



Power should not be an issue.
 

Sean Ho

seanho.com
Nov 19, 2019
768
352
63
Vancouver, BC
seanho.com
Does the issue follow the drives, or the bays? ZFS will have no problem with you swapping the drives around. Do the old failing drives work (and pass SMART tests, etc.) on another system?
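
One way to answer the drives-or-bays question is to keep a small log of each dropout and count how often each drive and each bay shows up. A minimal sketch, assuming you note the drive serial and bay number by hand every time a disk falls out; the serials and bay numbers below are invented, and `tally` is a hypothetical helper:

```python
from collections import Counter

# Hypothetical dropout log: one (drive serial, bay number) pair per event.
events = [
    ("SER-A", 3),
    ("SER-B", 3),
    ("SER-C", 3),
    ("SER-A", 7),
]


def tally(events):
    """Count dropouts per drive serial and per bay."""
    by_drive = Counter(serial for serial, _ in events)
    by_bay = Counter(bay for _, bay in events)
    return by_drive, by_bay


if __name__ == "__main__":
    by_drive, by_bay = tally(events)
    print("per drive:", by_drive.most_common())
    print("per bay:", by_bay.most_common())
```

If one serial dominates, suspect that drive. If one bay dominates, suspect that slot or its cable. If the counts are spread evenly across both, something shared such as power, the expander, or the backplane becomes the more likely candidate.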
 

vl1969

I haven't run SMART tests on all of them yet, but SMART on Debian does not report anything. I swapped drives around, and the issue doesn't follow the drives or the bays. It's like one or two ZFS pools suddenly lose a disk now and again.
I will try hunting for a new backplane.
 

vl1969

OK, I guess I am not getting any more suggestions here.

I ordered a new BPN-SAS2-846EL1; we will see how things work out.

Thanks
 

vl1969

The drives work fine in another system.
All drives worked in this system until about 6 months ago. And then I started losing drives from pools.
 

bilbo1337

Member
Sep 18, 2020
79
45
18
Florida
I was just reading on another forum and it reminded me of this thread. It was talking about how an HBA that gets too hot can cause dropouts. Another thing I was just thinking about was allocation size: if the firmware is buggy, then having it emulate some strange sector size could be throwing things off too.
 

vl1969

Thanks bilbo1337.
I read about that too, but the server was working fine for at least 6 or 7 years.
That's why I thought it was the HDDs at first. But I checked the drives that fell out and they are OK. The health check report is OK. And when I put them back and resilver, they work.

Anyhow, I got a new backplane and all looks good so far. For the last week all pools have been up and running with no dropouts or issues.

So I guess the 10-year-old backplane just went bye-bye.
It gave me a chance to move to the SAS2 EL1 type with a built-in expander. I don't have those 24 SATA cables all over the place anymore. Just 2 nice and neat 8087 breakout cables running to my controller. I now have a free PCIe slot as well.