(Not sure if this is the right forum for this topic, but starting here.)
Hi, I thought I'd share a strange issue I'm having with a SuperMicro system with the STH forum users, and hopefully get some feedback on it.
I bought the barebone system in May 2015, and it ran as a test system before being put into production in January 2016.
The system details:
1x SC826E1-R800LPB
1x H8DG6-F
2x AMD Opteron 6376
2x SNK-P0043P
8x 8GB Samsung PC3-10600U DDR3 1333 1.5v Reg Memory
2x Intel 320 160GB - SSD
1x Intel DC S3500 300GB - SSD
1x WD Velocity Raptor 10K SATA 160GB - HDD
1x Innodisk SATA DOM 8GB
Description of the issue:
The system experienced metadata corruption on 4 disks (3 SSDs and 1 HDD) on the very same day at the end of February 2016. Some of the reboots also resulted in the partitions being lost from the disks. This is easy to reproduce: even when we don't see metadata corruption, the partitions are lost on some reboots of the server.
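To pin down exactly which reboot loses the partitions, something like the following could be run before and after each reboot. This is only a sketch, not something we actually ran: the function names are made up for illustration, and it assumes that hashing the first 1 MiB of each disk is enough to cover the MBR or the primary GPT header and entries (the backup GPT at the end of the disk is not covered).

```python
import hashlib


def region_fingerprint(path, length=1024 * 1024, offset=0):
    """Return the SHA-256 hex digest of up to `length` bytes of `path`, starting at `offset`."""
    digest = hashlib.sha256()
    with open(path, "rb") as dev:
        dev.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = dev.read(min(65536, remaining))
            if not chunk:
                break  # reached the end of the device or file
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def changed_devices(before, after):
    """Return the sorted names whose fingerprint differs between two snapshots (dicts)."""
    return sorted(name for name in before if after.get(name) != before[name])
```

The idea would be to store `{"sda": region_fingerprint("/dev/sda"), ...}` before a reboot, rebuild it afterwards, and pass both to `changed_devices`. Reading the raw devices needs root, and the device names here are just placeholders.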
Details about disks:
2 of Intel 320 160GB - SSD
1 of Intel DC S3500 300GB - SSD
1 of WD Velocity Raptor 160GB - HDD.
All of the SSDs were in bays connected to the onboard SAS2 controller through the direct-attached backplane. The HDD was in a bay connected to one of the onboard SATA ports.
After experiencing the issue, I started to suspect the disks and controllers. To rule out the onboard SAS2 controller and the backplane, I took ALL of the disks out of the bays and connected them one by one directly to the SATA ports on the motherboard. All of them showed the same issue, although it could take some time to appear.
I also updated the BIOS and all firmware to the latest versions during the troubleshooting process.
The server was running CentOS 7.2 at the time of the issue.
Suspecting a software issue, we reinstalled the OS and then tried different kernel versions. Reinstalling the OS did not make any difference.
For example, we installed kernel 4.4.3 from ELRepo, but the issue was reproducible on that kernel as well.
Then we downgraded from the stock CentOS 7.2 kernel to the CentOS 7.1 kernel, version 3.10.0-229. Unfortunately, that again resulted in metadata corruption on the disks.
Lastly, we booted a Fedora live image, with the same result.
We've run memtest86+ and Prime95 for over 24 hours each without any errors, so we've concluded that there are (most likely) no issues with the CPUs or RAM.
On 31 March 2016 we installed a new SAS controller (an IBM M1015, with different SAS-to-SATA breakout cables) and connected the backplane to that instead of the onboard SAS controller. Unfortunately, I was still able to reproduce the issue on the SSD (S3500).
The server is in a DC in the Netherlands, and travelling there costs a significant amount, so before I plan a trip I was thinking of obtaining a new MB, and possibly new RAM as well.
What do you think? Should I try something else before replacing the MB?
TIA!