Metadata corruption on Supermicro H8DG6


fossxplorer

Active Member
Mar 17, 2016
554
97
28
Oslo, Norway
(Not sure if this is the right forum for this topic, but starting here.)
Hi, I thought I'd share a strange issue on a Supermicro system with STH forum users and hopefully get some feedback on what you think about it.
I bought the barebone system in May 2015, and it ran as a test system before being put into production in January 2016.

The system details:
1x SC826E1-R800LPB
1x H8DG6-F
2x AMD Opteron 6376
2x SNK-P0043P
8 x 8GB Samsung PC3-10600U DDR3 1333 1.5v Reg Memory
2 x Intel 320 160GB - SSD
1 x Intel DC S3500 300GB - SSD
1 x WD VelociRaptor 10K SATA 160GB - HDD
1 x Innodisk SATA DOM 8GB

Description of the issue:
The system experienced metadata corruption on 4 disks (3 SSDs and 1 HDD) on the very same day at the end of February 2016. Some of the reboots also resulted in the partition being lost from the disks. This is easy to reproduce: even when we don't see metadata corruption, the partitions are lost on some reboots of the server.
Details about disks:
2 of Intel 320 160GB - SSD
1 of Intel DC S3500 300GB - SSD
1 of WD VelociRaptor 160GB - HDD

All of the SSDs were in bays connected to the onboard SAS2 controller through the direct-attached backplane. The HDD was in a bay connected to one of the onboard SATA ports.

After experiencing the issue, I started to suspect the disks and controllers. To take the onboard SAS2 controller and the backplane out of the equation, I took ALL disks out of the bays and connected them one by one directly to the SATA ports on the motherboard. All of them experienced the same issue, although it could take some time to appear.
I've also updated the BIOS and all firmware to the latest versions during the troubleshooting process.

The server was running CentOS 7.2 at the time of the issue.
Suspecting a software issue, we reinstalled the OS and then tried different kernel versions. Reinstalling the OS did not make any difference.
E.g. using ELRepo, we installed kernel 4.4.3, but the issue was reproducible running this kernel as well.
Then we downgraded from the official CentOS 7.2 kernel to the 7.1 kernel, version 3.10.0-229. Unfortunately, that again resulted in metadata corruption on the disks.

Lastly, we booted a Fedora live image, which resulted in the same issue.
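
One way to make the reproduction measurable across reboots, kernels and controllers is to write files with known checksums and re-verify them after each reboot. The following is only a sketch in Python, not something from the original posts: the function names, file names and manifest layout are made up for illustration, and it checks file contents rather than XFS metadata directly (a file that has become unreadable is simply counted as damaged).

```python
import hashlib
import json
import os
from pathlib import Path


def write_test_files(directory, count=4, size=1 << 20):
    """Write `count` files of random bytes and record their SHA-256 digests."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for i in range(count):
        data = os.urandom(size)
        path = directory / f"block_{i:04d}.bin"
        with open(path, "wb") as fh:
            fh.write(data)
            fh.flush()
            # Ask the OS to push the data to the device, not just the page cache.
            os.fsync(fh.fileno())
        manifest[path.name] = hashlib.sha256(data).hexdigest()
    (directory / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest


def verify_test_files(directory):
    """Return the sorted names of files that no longer match the manifest."""
    directory = Path(directory)
    manifest = json.loads((directory / "manifest.json").read_text())
    damaged = []
    for name, expected in manifest.items():
        try:
            actual = hashlib.sha256((directory / name).read_bytes()).hexdigest()
        except OSError:
            # Missing or unreadable files are reported as damaged too.
            damaged.append(name)
            continue
        if actual != expected:
            damaged.append(name)
    return sorted(damaged)
```

Usage would be to call `write_test_files("/mnt/test")` once on the disk under test (the mount point is hypothetical), reboot, and then call `verify_test_files("/mnt/test")`; an empty list means every file still matches its recorded digest. Keeping the manifest on the same disk is a simplification; a copy stored elsewhere would be safer.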

We've run memtest86+ and Prime95 for over 24 hours each without any errors, so we've concluded that there are (most likely) no issues with the CPUs or RAM.

On 31/03/2016 we installed a new SAS controller (an IBM M1015, with different SAS-to-SATA breakout cables) and connected the backplane to it instead of the onboard SAS controller. Unfortunately, I was still able to reproduce the issue on the SSD (S3500).

The server is in a DC in the Netherlands, and before I travel over there (which costs a significant amount), I was thinking of obtaining a new MB, and possibly RAM as well.

What do you think? Should I do something else before replacing the MB?

TIA!
 

fossxplorer

I would think so.
Here is a recent log:
Apr 01 22:40:45 localhost.localdomain crond[8270]: (CRON) INFO (@reboot jobs will be run at computer's startup.)
Apr 01 22:46:42 localhost.localdomain kernel: XFS (sda1): Metadata corruption detected at xfs_dir3_data_reada_verify+0x42/0x80 [xfs], block 0x58
Apr 01 22:46:42 localhost.localdomain kernel: XFS (sda1): Unmount and run xfs_repair
Apr 01 22:46:42 localhost.localdomain kernel: XFS (sda1): First 64 bytes of corrupted metadata buffer:
Apr 01 22:46:42 localhost.localdomain kernel: ffff880410bcc000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Apr 01 22:46:42 localhost.localdomain kernel: ffff880410bcc010: 00 00 00 00 00 00 00 00 00 b0 10 00 00 00 00 00 ................
Apr 01 22:46:42 localhost.localdomain kernel: ffff880410bcc020: 00 a0 c1 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
Apr 01 22:46:42 localhost.localdomain kernel: ffff880410bcc030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Apr 01 22:46:42 localhost.localdomain kernel: XFS (sda1): metadata I/O error: block 0x58 ("xfs_trans_read_buf_map") error 117 numblks 8
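
For anyone comparing many such log lines across reboots, here is a small Python sketch that pulls the device, block address, error number and length out of the "metadata I/O error" lines. The regular expression is written against the exact line format shown above and may not match other kernel versions; the function name is made up. On Linux, error 117 is EUCLEAN ("Structure needs cleaning"), which XFS uses as its EFSCORRUPTED code, so this line reports corruption found by a verifier rather than a failed read (which would be EIO, error 5).

```python
import re

# Written against lines such as:
#   XFS (sda1): metadata I/O error: block 0x58 ("xfs_trans_read_buf_map") error 117 numblks 8
IO_ERROR_RE = re.compile(
    r'XFS \((?P<dev>[^)]+)\): metadata I/O error: block (?P<block>0x[0-9a-fA-F]+) '
    r'\("(?P<func>[^"]+)"\) error (?P<errno>\d+) numblks (?P<numblks>\d+)'
)

# Only the two codes discussed here; anything else is reported as "unknown".
ERRNO_NAMES = {5: "EIO", 117: "EUCLEAN (EFSCORRUPTED)"}


def parse_xfs_io_errors(lines):
    """Return one dict per XFS 'metadata I/O error' line found in `lines`."""
    results = []
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match is None:
            continue
        errno_value = int(match.group("errno"))
        results.append({
            "device": match.group("dev"),
            "block": int(match.group("block"), 16),
            "function": match.group("func"),
            "errno": errno_value,
            "errno_name": ERRNO_NAMES.get(errno_value, "unknown"),
            "numblks": int(match.group("numblks")),
        })
    return results
```

Feeding it the lines of a saved journal or `/var/log/messages` excerpt gives a list that can be grouped by device and block to see whether the same addresses are hit every time.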

When you say "lost partition" does that mean block zero on the harddrive was wiped?
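
The question about block zero can be checked directly by reading the first sector of the disk. Below is a minimal Python sketch, assuming an MBR-style partition table and 512-byte sectors; the function names are made up, and reading a device node such as `/dev/sda` needs root. On a GPT disk, sector 0 holds only a protective MBR entry (type 0xEE), so this alone would not tell the whole story there.

```python
SECTOR_SIZE = 512
MBR_SIGNATURE = b"\x55\xaa"      # last two bytes of a valid MBR sector
PARTITION_TABLE_OFFSET = 446     # four 16-byte entries start here
PARTITION_ENTRY_SIZE = 16


def read_sector_zero(path):
    """Read the first 512 bytes of a disk device or disk image."""
    with open(path, "rb") as fh:
        return fh.read(SECTOR_SIZE)


def inspect_sector_zero(sector):
    """Summarise sector 0: wiped, signature present, which MBR entries are used."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least 512 bytes")
    sector = bytes(sector[:SECTOR_SIZE])
    used_entries = []
    for index in range(4):
        start = PARTITION_TABLE_OFFSET + index * PARTITION_ENTRY_SIZE
        entry = sector[start:start + PARTITION_ENTRY_SIZE]
        if any(entry):
            # Byte 4 of each entry is the partition type (e.g. 0x83 for Linux).
            used_entries.append((index, entry[4]))
    return {
        "all_zero": not any(sector),
        "has_mbr_signature": sector[510:512] == MBR_SIGNATURE,
        "used_entries": used_entries,
    }
```

If `all_zero` is true after a bad reboot, sector 0 really was wiped; if the signature and entries are intact but the kernel still shows no partitions, the problem is more likely in what was read back than in what is on the disk.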
 

unwind-protect

Active Member
Mar 7, 2016
415
156
43
Boston
That looks like block 0 in the filesystem, not block 0 on the disk.
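
To make the filesystem-versus-disk distinction concrete: the address in the XFS message is counted in 512-byte basic blocks from the start of the filesystem on sda1, not from sector 0 of the disk. A short Python sketch of the arithmetic follows; the partition start of 2048 sectors used in the example is an assumption (a common default), not something taken from the posts, and the function name is made up.

```python
BASIC_BLOCK_SIZE = 512  # XFS disk addresses in these messages use 512-byte basic blocks


def locate_xfs_buffer(daddr, numblks, partition_start_lba=0):
    """Translate an XFS block address and length into byte and disk-sector positions.

    Assumes the disk also uses 512-byte logical sectors, so one basic block
    corresponds to one LBA.
    """
    return {
        "byte_offset_in_fs": daddr * BASIC_BLOCK_SIZE,
        "length_bytes": numblks * BASIC_BLOCK_SIZE,
        "first_disk_lba": partition_start_lba + daddr,
        "last_disk_lba": partition_start_lba + daddr + numblks - 1,
    }


# Values from the log: block 0x58, numblks 8; partition start is hypothetical.
print(locate_xfs_buffer(0x58, 8, partition_start_lba=2048))
```

With those inputs the damaged buffer is 4096 bytes long and sits 45056 bytes into the filesystem, which under the assumed partition start corresponds to disk sectors 2136 to 2143, well away from the partition table in sector 0.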
 

fossxplorer

Yeah, I had that suspicion early on, so I pulled out the PSU it was running on and connected power to the second PSU.
Same issue on the other PSU as well :(

But yeah, this issue appeared all of a sudden, so could it be the DC's power delivery?

My money is on an intermittent power supply :)
 

pricklypunter

Well-Known Member
Nov 10, 2015
1,708
515
113
Canada
The problem could be with the power supply backplane rather than an individual hot-swap supply module, or it could be a bad cable connector somewhere. My suggestion would be to use an external power supply on it as a test. You might need to make up some leads/extenders etc. to achieve that, but it rules out your power supply entirely should the fault persist. A power supply tester with a low-voltage alarm function might also be handy, but you will need to test every connection with it. Also, don't rule out the mains cable, plug and the socket or circuit you are feeding it with :)
 

fossxplorer

That's a good suggestion indeed! I totally forgot that the PSUs are connected to a power supply backplane, which in turn has the leads to the drives, the board etc. It's hard, as I've never seen the server personally; it was shipped directly to the DC.
Using the web interface of the BMC/IPMI, the voltages etc. seem fine, but if the fault is intermittent I can't easily see it, of course.
I've sent a request to the DC asking if they can confirm nothing is wrong with their mains, power etc.!
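
Since a single glance at the BMC page can miss an intermittent dip, one option is to log the voltage readings over time and flag anything outside tolerance. The Python sketch below only does the checking part and assumes the readings have already been collected somehow as (timestamp, rail, volts) tuples. The ±5% limit for the +12V, +5V and +3.3V rails comes from the ATX specification; the rail labels and the function name are made up, and real BMC sensor names will differ. BMC sensors are also polled slowly, so very short dips could still go unseen.

```python
ATX_TOLERANCE = 0.05  # +/-5% on the main positive rails, per the ATX specification

# Hypothetical rail labels; map your BMC's sensor names onto these.
NOMINAL_RAILS = {"12V": 12.0, "5V": 5.0, "3.3V": 3.3}


def out_of_tolerance(samples, tolerance=ATX_TOLERANCE):
    """Return the (timestamp, rail, volts) samples that fall outside nominal +/- tolerance.

    Samples for rails not listed in NOMINAL_RAILS are ignored.
    """
    flagged = []
    for timestamp, rail, volts in samples:
        nominal = NOMINAL_RAILS.get(rail)
        if nominal is None:
            continue
        low = nominal * (1 - tolerance)
        high = nominal * (1 + tolerance)
        if volts < low or volts > high:
            flagged.append((timestamp, rail, volts))
    return flagged
```

Run against a few hours of logged readings, an empty result does not prove the supply is good, but any flagged sample would be a strong hint to follow up with a proper meter on that rail.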

I'm trying to arrange cheap shipping to get it sent to me so I can troubleshoot it myself. Then I'll be able to connect an external PSU and test. Should a 400W PSU work when I only use 1 SSD?

It's really hard to believe the H8DG6 board is faulty. I have the same setup in another server, and it's been rock solid for 2 years.



 

pricklypunter

400W should be fine; any old supply that's known to be good should do for testing. I'll be surprised if your whole server isn't hovering around that mark or less anyway. I wouldn't rule out the mainboard, or anything else for that matter, just yet. Just take steps to eliminate all other possibilities first in a methodical manner, and you'll either nail whatever the problem is along the way, or you'll eliminate everything else and be left with only one possible cause. The power supply system is, in my opinion anyway, not only the most likely cause but also one of the easiest to rule out as a first troubleshooting step :)
 

fossxplorer

Finally I have the server in front of me at home. I could reproduce the issue with one SSD on a SATA port on the MB, powered from another PC since there is no SATA power cable on the PSU backplane, and with another SSD connected to the backplane. So both use SATA ports on the MB, but one goes through the SAS backplane and the other is connected directly. Even though one SSD had external power, I still can't rule out the server's PSU, since the MB gets its main power from it.
But I took a 400W PSU out of an SM 732 chassis, and it has a 20+4-pin connector, another 4-pin and a 6-pin, while the H8DG6-F MB has two 8-pin sockets for the CPUs' 12V right next to the 24-pin socket.

The M.B manual says: JPW2/3 +12V 8-pin CPU Power Connectors
http://www.supermicro.com/manuals/motherboard/SR56x0/MNL-H8DGi(6)(-F).pdf

I wonder if there is still a way to use this PSU?
 

fossxplorer

Hmm, the fact that I used external power for one of the SSDs and still got corruption makes me very suspicious of the MB rather than the PSU or PSU backplane. Should I start replacing the MB and CPU?
 

pricklypunter

Replacing the mainboard and CPU will obviously rule them out, even better if you replace one at a time. But if it is a borderline or otherwise intermittent PSU and the replacement doesn't pull as much power as the original one... it may seem fine for a few weeks, you know, just long enough for you to declare it's fixed and return it! Then there's the fact that simply disturbing it all may temporarily "fix" a dry solder joint in the PSU or PSU backplane etc. If you can, best to leave it all undisturbed as much as possible until you have ruled out the PSU with certainty, imo :)
 

fossxplorer

Makes a lot of sense, yeah. I've spent almost 2 months getting this server shipped to me and just can't ship it back with uncertainty or in the false belief of having fixed the issue!
I'll take a look at some older servers at work that have been taken out of production to see if they have a PSU with 2x 8-pin connectors in addition to a 24-pin connector.
Otherwise I'll buy one.
If you have any tips for a cheap PSU, please let me know.
Thanks!
 

pricklypunter

I personally like the smaller Seasonic and Zippy supplies, and 3Y have always treated me well :)
However, those are definitely in server/enterprise land. Maybe go with a decent-sized EVGA gold-rated unit or something along those lines so that you can use the PSU in a PC build later?
 

pricklypunter

If it has the correct connectors for your mainboard and any peripherals you have installed, it should do the job fine. Just double check against your manuals to confirm this though. Either way, it's at least as good a quality supply as any of the other mid to high end ones available and the price isn't too bad for a new one. Providing it has the correct connectors and can handle the power needed, I would probably go for it :)
 

fossxplorer

Yeah, I checked last night and it has the CPU1 and CPU2 8-pin connectors, which many PSUs I looked at seem to lack. Often they have only one, so they are meant for single-CPU MBs.
The problem is the Molex connectors from the PSU backplane to the SAS backplane; I don't think this PSU has those, but I don't need them for the test.
I can test the disks directly on SATA or through the SAS-to-SATA breakout cables and still power the disks from the SATA power connector on this new PSU. Our aim, after all, is to test with the new PSU to see if the problem is reproducible. The SAS backplane is irrelevant here.

That should be fine, right?

Gonna place order soon.