Hey Guys,
We have a big server with 48 spinning 8 TB Seagate enterprise hard drives. We installed this server about 3 months ago and things had been going smoothly. However, last week, during the nightly defragmentation, two drives died at the same time.
Let me first explain the RAID setup. We have three RAID 6 arrays, each built from 16 drives and giving roughly 100 TB of usable space. We call them Volumes A, B, and C. Volume C is the one that lost two drives at the exact same time, which is strange. We also use Adaptec controller cards, FYI.
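In case anyone wants to check my capacity math, here's a quick sketch (assuming decimal 8 TB drives and two drives' worth of parity per array):

```python
# Rough usable capacity of one 16-drive RAID 6 array of 8 TB disks.
# Assumes decimal (vendor) terabytes and two drives' worth of parity; ignores
# filesystem overhead, so the real usable figure comes out a bit lower.
drives_per_array = 16
drive_tb = 8            # decimal TB per drive
parity_drives = 2       # RAID 6 = dual parity

raw_tb = drives_per_array * drive_tb                        # 128 TB raw
usable_tb = (drives_per_array - parity_drives) * drive_tb   # 112 TB usable
usable_tib = usable_tb * 1e12 / 2**40                       # ~102 TiB, i.e. "about 100 TB"
print(f"raw {raw_tb} TB, usable {usable_tb} TB ({usable_tib:.0f} TiB)")
```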
So I rebuilt one of the drives, which took about 24 hours. Then I restarted the server. Remember, there was still one failed HDD in Volume C that I hadn't swapped yet. After the reboot, that "failed" HDD came alive again and started auto-rebuilding. Hmm...
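Quick sanity check on that rebuild time (the 24 hours is my rough recollection, not a logged figure):

```python
# Average write rate implied by rebuilding an 8 TB drive in roughly 24 hours.
drive_bytes = 8e12              # 8 TB, decimal
rebuild_seconds = 24 * 3600

rate_mb_s = drive_bytes / rebuild_seconds / 1e6
print(f"average rebuild rate: {rate_mb_s:.0f} MB/s")   # ~93 MB/s
```

Around 93 MB/s, well below what these drives can stream sequentially, which seems about right for a rebuild competing with normal IO on the array.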
So I figured the failed HDD had gotten a last sputter of life and would soon die for good, but then a few days later another, different drive died. So the array is back to only one level of protection, effectively RAID 5. I say RAID 5 because once a drive fails in a RAID 6 array, it can only tolerate one more failure, just like RAID 5, until the rebuild restores both parities.
Anyway, now I'm really curious. It's one thing for two hard drives to die at the exact same time, but when a third drive follows so close behind, something is up. So I took the first failed HDD and plugged it into a different server, and sure enough, the drive is fine. So it's not the drives themselves that are failing.
My first thought is that it could be the backplane. Would a failing backplane cause random drives to drop out, only for them to come back when you reboot the server?
My other thought is that it might be the power supply. It seems to me that each of the drives "died" during the period of maximum load (the nightly defragmentation), which runs all three volumes at 100% IO. So I'm thinking maybe the power going to the drives is sagging under load, which would also explain why the drives come back after a reboot.
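To put rough numbers on that theory, here's a sketch of how much the drive load alone swings between idle and full seek, plus what spin-up looks like. The per-drive idle and spin-up figures are typical values for 3.5" enterprise drives, not measurements from this particular Seagate model:

```python
# Rough drive power swing for 48 enterprise 3.5" HDDs.
# Assumed per-drive figures (typical datasheet-style values, not measured here):
#   idle      ~7 W
#   seeking   ~11.4 W (the max figure from my power budget below)
#   spin-up   ~24 W peak (around 2 A on the 12 V rail)
drives = 48
idle_w, busy_w, spinup_w = 7.0, 11.4, 24.0

print(f"all idle:        {drives * idle_w:>5.0f} W")    # ~336 W
print(f"all seeking:     {drives * busy_w:>5.0f} W")    # ~547 W
print(f"all spinning up: {drives * spinup_w:>5.0f} W")  # ~1152 W without staggered spin-up
```

So the defrag window adds a couple hundred watts of drive draw on top of idle, and a power-on without staggered spin-up could briefly want more than the PSU can deliver on its own. Either way, a marginal PSU or a weak 12 V rail would show up exactly at times like these.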
My last thought is that it's the Adaptec controller card, but I see no errors at all, just failed drives.
My plan is to replace all three this weekend: the PSU, the backplane, and the controller card. It's not an easy job, so I wanted to get your thoughts first. Do you think a failing power supply could cause certain drives to "die" and then come back online after a reboot? What do you guys think?
Also, here's my power budget. Each drive is 11.4 W max, then 35 W for each Fibre Channel HBA, 20 W for each Myricom 10GbE card, 120 W for the CPU, maybe 85 W for the Supermicro mobo, and I don't know, maybe 50 W per backplane? Anything else? That all adds up to:
550 W for the HDDs (48 × 11.4 W)
70 W for the FC HBAs (2 × 35 W)
40 W for the Myricom cards (2 × 20 W)
205 W for mobo + CPU
150 W for the backplanes (probably overestimated); they're Supermicro 16-bay 3.5" backplanes, SAS 936A or something like that
20 W for 32 GB of ECC RAM (4 × 8 GB sticks). Am I wrong about this number?
15 W for 13 × Noctua 2,000 RPM fans
Total = 1,050 W
The PSU is a Zippy rated at 1,200 W. What do you guys think?
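Here's the same budget re-added as a quick sketch against the PSU rating (same nameplate numbers as above, so it inherits whatever error those estimates have):

```python
# Re-adding the power budget from above and comparing it to the PSU rating.
# These are nameplate/estimated figures, not measured draw.
loads_w = {
    "48 HDDs @ 11.4 W":       48 * 11.4,  # ~547 W
    "2 FC HBAs @ 35 W":       2 * 35,
    "2 Myricom 10GbE @ 20 W": 2 * 20,
    "CPU + mobo":             120 + 85,
    "3 backplanes @ 50 W":    3 * 50,
    "32 GB ECC RAM":          20,
    "13 Noctua fans":         15,
}

total_w = sum(loads_w.values())
psu_w = 1200
print(f"total ~{total_w:.0f} W of a {psu_w} W PSU ({total_w / psu_w:.0%} load)")
# total ~1047 W, about 87% load
```

Sitting near 90% of the PSU's rating during the defrag window doesn't leave much headroom, which is part of why the PSU is high on my list.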
By the way, there is also some corrupted media.
Best,
Myth