Could PowerSupply cause random drives to fail?


Myth

Member
Feb 27, 2018
148
7
18
Los Angeles
Hey Guys,

We have a big server with 48 spinning 8TB Seagate enterprise hard drives. We installed this server about three months ago and things had been going smoothly. However, last week during the nightly defragmentation, two drives died at the same time.

Let me first explain the RAID setup. We have three 16-drive RAID 6 arrays of about 100TB each, which we call Volumes A, B, and C. Volume C lost two drives at the exact same time - strange. We also use Adaptec controller cards, FYI.
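As a quick sanity check on those volume sizes (a sketch; it assumes the marketing 8TB capacity and ignores filesystem overhead):

```python
# Usable capacity of one 16-drive RAID 6 array: two drives' worth of
# space goes to parity, the rest is usable.
drives_per_array = 16
parity_drives = 2      # RAID 6 = dual parity
drive_tb = 8

usable_tb = (drives_per_array - parity_drives) * drive_tb
print(usable_tb)  # 112 -> "about 100TB" once formatting overhead is taken out
```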

So I rebuilt one of the drives, which took about 24 hours. Then I restarted the server; remember, there was still one failed HDD in Volume C that I hadn't swapped yet. After the reboot, the failed HDD came alive again and started auto-rebuilding. Hmm...

So I figured the failed HDD had gotten a sputter of life and would soon die for good, but then a few days later another, different drive died. So Volume C is back to only one level of protection, effectively RAID 5: once a single drive fails in a RAID 6 array, it runs on one remaining parity until the rebuild restores both.

Anyway, now I'm really curious. It's one thing for two hard drives to die at the exact same time, but a third drive so close behind means something is up. So I took the first failed HDD and plugged it into a different server, and sure enough, the drive is fine. So it's not the drives themselves that are failing.

My first thought: could it be the backplane? Would a failing backplane cause random drives to die, then have them come back when you reboot the server?

My other thought is the power supply. It seems that each drive "died" during the maximum-performance hour (the nightly defragmentation), which runs all three volumes at 100% IO. So maybe the power going to the drives is sagging under load, which would explain why the drives come back after a reboot.

My last thought is that it's the Adaptec controller card, but I see no errors at all. Just failed drives.

My plan is to replace all three this weekend: the PSU, the backplane, and the controller card. It's not an easy job, but I wanted to get your thoughts first. Do you think a failing power supply could cause certain drives to die, but then come back online after a reboot? What do you guys think?

Also, the power usage: each drive is 11.4W max, then 35W for each Fibre Channel HBA, 20W for each Myricom 10GbE card, 120W for the CPU, maybe 85W for the Supermicro mobo, and I don't know, maybe 50W per backplane? That all adds up to:
550W for the HDDs
70W for the HBAs
40W for the Myricoms
205W for mobo + CPU
150W for the backplanes (probably overquoted; Supermicro 16-bay 3.5", SAS 936A or something like that)
20W for 32GB ECC RAM (4x 8GB sticks; am I wrong about this number?)
15W for 13x Noctua 2000rpm fans
Total = 1,050W
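The same budget as a quick script (all figures are the estimates from the list above, so treat the total as rough):

```python
# Sustained power budget, re-derived from the figures quoted above.
budget_w = {
    "48x HDD @ 11.4W":        48 * 11.4,  # ~547W
    "2x FC HBA @ 35W":        2 * 35,
    "2x Myricom 10GbE @ 20W": 2 * 20,
    "CPU + Supermicro mobo":  120 + 85,
    "3x backplane @ ~50W":    3 * 50,     # a guess, probably overquoted
    "32GB ECC RAM":           20,
    "13x Noctua fans":        15,
}
total_w = sum(budget_w.values())
print(round(total_w))         # 1047, i.e. the ~1,050W above
print(round(1200 - total_w))  # 153W of headroom on a 1,200W PSU
```

That headroom disappears fast if spin-up or peak seek current runs higher than the 11.4W nameplate figure.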

The PSU is a Zippy rated at 1,200W. What do you guys think?

By the way there is some corrupted media. :(

Best,
Myth
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,640
2,057
113
SATA drives? Cable distance from the drives to the Adaptec? Within spec?

Link to the exact power supply. Is it 2x 1200W redundant, so 50/50 utilization, or one PSU taking all the load? Have you measured the power draw, or can you, to see actual utilization?

"Volume C lost two drives at the exact same time - strange."
- Not really that strange, especially during a rebuild or heavy utilization with same-age drives (the defrag in this case).
 

Myth

Member
Feb 27, 2018
148
7
18
Los Angeles
Well, it seems strange that the drives come back online after a reboot, or when I swap them into a different server.

They are all about one year old. They are 12Gb SAS. Cables are 3m or less, and the cable length is the same on all our servers, yet this one is the only one having difficulties.

The original server was installed about one year ago. Then, about three months ago, we swapped the drives into a new server because the power supply fuse popped. We think that had to do with a faulty UPS.

Other things to note: I also updated the firmware to E03 and changed the stripe size from 256KB to 512KB. I don't think that has anything to do with it.

And remember, all 48 drives are the same make and model with the same hours. The only difference is that Volume C is striped at 512KB with E03 firmware, while Volumes A and B are striped at 256KB with E02 firmware. But I don't think it's a firmware issue, since the drives work after a reboot or when moved into different servers.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,640
2,057
113
Link to the exact power supply. Is it 2x 1200W redundant, so 50/50 utilization, or one PSU taking all the load? Have you measured the power draw, or can you, to see actual utilization?
 

i386

Well-Known Member
Mar 18, 2016
4,241
1,546
113
34
Germany
Are you using staggered spin-up?

This is from a server with 15 drives without staggered spin up:
 

Myth

Member
Feb 27, 2018
148
7
18
Los Angeles
I don't use staggered spin-up unless Adaptec does it automatically. Thinking about it, the drives do seem to light up one at a time. Still, for 48 drives that's a lot of juice; more like 1,500W during boot-up.
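To put a rough number on the boot-up surge without staggered spin-up (the per-drive spin-up current here is an assumption, roughly typical for 7200rpm 3.5" drives; the real figure is in the Seagate datasheet for the exact model):

```python
# Worst case: all 48 drives spin up at once.
n_drives = 48
spinup_w = 12 * 2.0 + 5 * 0.7  # assumed ~2.0A on 12V + ~0.7A on 5V per drive

surge_w = n_drives * spinup_w
print(round(surge_w))  # 1320W from the drives alone, before CPU/HBAs/fans
```

With staggered spin-up (one drive or one group at a time), the surge drops to roughly idle power plus one drive's spin-up draw, which is why the drives lighting up one at a time matters.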

I'll be servicing the server tomorrow morning at 9am. I brought a power meter with me to test actual usage.

The power supply is a single PSU, and the box just says Zippy, 1,200W max.

As an additional note, I have seen backplanes that make certain drives fail while others keep working. So it could be a backplane, which makes sense since the failing drives are all on one volume linked to one backplane. Then again, this power supply has only been in use for three months, and maybe it isn't delivering enough power to all the drives, which could cause them to drop out. The strange thing is that the drives don't show many errors or any warnings; they just die. So I'm really not sure what the problem is. I hope it's not actually the drives, but I'll replace those too.

So I'm going to replace the PSU, the backplane, and the third failed HDD. I might also replace the controller card.

I'm also concerned about the shadow copy service failing on the operating system. It's Windows Server 2012 (Storage), and for some reason I see a bunch of VSS errors in the event log. I don't know if it's related; we don't use or need shadow copies, but I'm not sure why the service has failed. When I go to Shadow Copies in the properties of any drive, it says the service is not available, and the event log shows the VSS service failing with some strange string of random numbers. Anyway, I don't know what the hell is going on with this server.

Maybe it really is just three failing HDDs: two at the same time, then a third a few days later. But it confuses me that the HDDs come back to life when I move them around, so I'm hoping it's just the backplane. I also think the server may be pulling too much power for 48 spinning drives. Come to think of it, I would install a 1,500W power supply if I had to build this server over again. Maybe something like this:

CORSAIR AX1600i CP-9020087-NA 1600W ATX 80 PLUS TITANIUM Certified Full Modular Digital ATX Power Supply - Newegg.com

It's less than $400, which seems like a good price. I would love dual redundant power supplies at 1,500W each, but I don't think that would fit in the chassis; I can only use whatever PSU fits an ATX case. Anyway, thanks for all your help. If anyone has PSU recommendations, I think that's the best bet.

I mean, based on i386's post, 48 drives could easily pull over 1,500W on boot-up alone, let alone when they are all running at full capacity. Man.

I'm just glad my boss is out of the picture; the company I work for went bankrupt. I'm still servicing the clients, so now I can properly fix the servers and blame the problem on the design without my boss getting defensive. I guess I'll just charge the clients for my time and have them pay me directly, since my boss has fled the country and cut off all means of communication. Bad situation to say the least, but that's a separate issue.

I'm under a bit of stress and don't really know how to deal with all the politics, both legal and ethical. But I think it's best to get this server working reliably first, then worry about how to get my check. I don't know; I'll just pray.
 

Tha_14

Server Newbie
Mar 9, 2017
72
10
8
Myth said:
I'm under a bit of stress, don't really know how to deal with all the politics, both legal and ethical. But I think it's best to just try to get this server working reliably then worry about how to get my check? I don't know, I'll just pray.
Good luck, man. Working as a freelancer can be a real bother when it comes to getting paid sometimes. If you had any connection with the client while you were working for the company, it will most probably go smoothly.
 

funkywizard

mmm.... bandwidth.
Jan 15, 2017
848
402
63
USA
ioflood.com
Myth said:
Well it seems strange that the drives come back online after a reboot or if I swap them to a different server.
That's normal Seagate bullsh**. If you pop the drives in the freezer for 15 minutes, there's a better than 50% chance they'll work "for a little while". You can sometimes get the same result just by powering off the server for a while.

The failure rate on Seagates is astronomical. It's totally normal for a 4-drive RAID to have more than one bad drive at any given time, let alone 48 of them.

My advice: pick out any drives that are clean (no bad sectors) and sell them on eBay. Plenty of people don't realize how trash Seagate is and will buy them. Then buy HGST drives to replace them.
 