Supermicro H8DCL-iF crashes randomly with strange kernel messages

Discussion in 'RAID Controllers and Host Bus Adapters' started by lpallard, Feb 19, 2014.

  1. lpallard

    lpallard Member

    Joined:
    Aug 17, 2013
    Messages:
    204
    Likes Received:
    5
    Hi there,

    OK, this time I *really* need the pros! About a month ago (around the time I got my Intel PRO/1000 quad-port NIC and installed it... coincidence?) I noticed strange messages in dmesg. At first I thought some application had crashed and didn't bother with it. The server appeared to run normally. Then the occurrences suddenly increased, and today I lost MySQL and the databases started crashing.

    The messages in dmesg:
    Code:
    [   31.117103] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
    [   40.440181] NET: Registered protocol family 10
    [   40.640740] svc: failed to register lockdv1 RPC service (errno 97).
    [   40.640907] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
    [   40.640915] NFSD: unable to find recovery directory /var/lib/nfs/v4recovery
    [   40.640918] NFSD: starting 90-second grace period
    [   51.210086] eth0: no IPv6 routers present
    [   59.804882] xfsettingsd[4055]: segfault at 1 ip 000000000040c261 sp 00007fff28cd9210 error 4 in xfsettingsd[400000+14000]
    [   62.337378] ata1.00: configured for UDMA/133
    [   62.337381] ata1: EH complete
    [   62.390605] ata2.00: configured for UDMA/133
    [   62.390608] ata2: EH complete
    [   62.418051] ata3.00: configured for UDMA/133
    [   62.418055] ata3: EH complete
    [   67.222901] EXT4-fs (md2): re-mounted. Opts: commit=0
    [   67.225728] EXT4-fs (md0): re-mounted. Opts: commit=0
    [   67.227650] EXT4-fs (md3): re-mounted. Opts: data=writeback,stripe=48,barrier=0,errors=remount-ro,commit=0
    [B][264481.220106] INFO: task syslogd:2517 blocked for more than 120 seconds.
    [264481.220112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [264481.220117] syslogd         D ffff88101ec31c40     0  2517      1 0x00000000
    [264481.220126]  ffff880fb42cbde8 0000000000000082 ffff880fb42cbd88 ffffffff00000000
    [264481.220134]  ffff880fb8226720 ffff880fb42cbfd8 ffff880fb42cbfd8 ffff880fb42cbfd8
    [264481.220141]  ffff880fb81044c0 ffff880fb8226720 0000000000000001 0000000100000246
    [264481.220148] Call Trace:
    [264481.220166]  [<ffffffff81b2fcff>] schedule+0x3f/0x60
    [264481.220175]  [<ffffffff8126ae05>] jbd2_log_wait_commit+0xb5/0x130
    [264481.220185]  [<ffffffff81074c90>] ? finish_wait+0x80/0x80
    [264481.220192]  [<ffffffff8126cc61>] jbd2_complete_transaction+0x51/0xa0
    [264481.220200]  [<ffffffff81217548>] ext4_sync_file+0x198/0x3a0
    [264481.220210]  [<ffffffff81161795>] do_fsync+0x55/0x80
    [264481.220217]  [<ffffffff81161ac0>] sys_fsync+0x10/0x20
    [264481.220223]  [<ffffffff81b3246b>] system_call_fastpath+0x16/0x1b
    [264481.220244] INFO: task mysqld:3971 blocked for more than 120 seconds.
    [264481.220248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [264481.220253] mysqld          D ffff88101ec11c40     0  3971   3057 0x00000000
    [264481.220260]  ffff880ef3c17de8 0000000000000082 ffff880ef3c17d88 ffffffff810f2935
    [264481.220267]  ffff880ef3e8d280 ffff880ef3c17fd8 ffff880ef3c17fd8 ffff880ef3c17fd8
    [264481.220273]  ffff8804c6afd280 ffff880ef3e8d280 0000000000000001 0000000000000246
    [264481.220286] Call Trace:
    [264481.220289]  [<ffffffff810f2935>] ? pagevec_lookup_tag+0x25/0x40
    [264481.220292]  [<ffffffff81b2fcff>] schedule+0x3f/0x60
    [264481.220295]  [<ffffffff8126ae05>] jbd2_log_wait_commit+0xb5/0x130
    [264481.220298]  [<ffffffff81074c90>] ? finish_wait+0x80/0x80
    [264481.220300]  [<ffffffff8126cc61>] jbd2_complete_transaction+0x51/0xa0
    [264481.220303]  [<ffffffff81217548>] ext4_sync_file+0x198/0x3a0
    [264481.220307]  [<ffffffff81089cbd>] ? sys_futex+0x8d/0x190
    [264481.220310]  [<ffffffff81161795>] do_fsync+0x55/0x80
    [264481.220312]  [<ffffffff81161ac0>] sys_fsync+0x10/0x20
    [264481.220314]  [<ffffffff81b3246b>] system_call_fastpath+0x16/0x1b
    [265561.220103] INFO: task syslogd:2517 blocked for more than 120 seconds.
    [265561.220109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [265561.220115] syslogd         D ffff88101ecd1c40     0  2517      1 0x00000000
    [265561.220124]  ffff880fb42cbde8 0000000000000082 ffff880fb42cbd88 ffffffff00000000
    [265561.220132]  ffff880fb8226720 ffff880fb42cbfd8 ffff880fb42cbfd8 ffff880fb42cbfd8
    [265561.220139]  ffff880fb8185280 ffff880fb8226720 0000000000000001 0000000100000246
    [265561.220146] Call Trace:
    [265561.220163]  [<ffffffff81b2fcff>] schedule+0x3f/0x60
    [265561.220173]  [<ffffffff8126ae05>] jbd2_log_wait_commit+0xb5/0x130
    [265561.220182]  [<ffffffff81074c90>] ? finish_wait+0x80/0x80
    [265561.220189]  [<ffffffff8126cc61>] jbd2_complete_transaction+0x51/0xa0
    [265561.220197]  [<ffffffff81217548>] ext4_sync_file+0x198/0x3a0
    [265561.220207]  [<ffffffff81161795>] do_fsync+0x55/0x80
    [265561.220214]  [<ffffffff81161ac0>] sys_fsync+0x10/0x20
    [265561.220220]  [<ffffffff81b3246b>] system_call_fastpath+0x16/0x1b[/B]
    After a reboot, these were gone, but a few hours or days later they came back. Like I said, tonight the apps started to behave strangely at the exact moment these errors appeared in dmesg.
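To make the hung-task reports above easier to compare, here is a minimal sketch (not something I ran on the server itself) that pulls the task name, PID and the reliable call-trace frames out of saved dmesg text. The function name `summarize_hung_tasks` and the regular expressions are my own assumptions, based only on the line format shown in the log above. In the traces above every blocked task sits in `jbd2_log_wait_commit` under `sys_fsync`, which suggests the processes are waiting for an ext4 journal commit to reach the disks.

```python
import re

# Matches lines such as:
# [264481.220106] INFO: task syslogd:2517 blocked for more than 120 seconds.
HUNG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<name>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# Matches call-trace frames such as:
# [264481.220175]  [<ffffffff8126ae05>] jbd2_log_wait_commit+0xb5/0x130
# A leading "? " marks a frame the kernel considers unreliable.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(?P<maybe>\? )?(?P<sym>[\w.]+)\+0x")


def summarize_hung_tasks(dmesg_text):
    """Return one dict per hung-task report found in dmesg_text.

    Frames are attached to the most recent "INFO: task ..." line, so
    unrelated traces that follow a report in the same text would be
    attributed to it as well; feed in only the relevant excerpt.
    """
    reports = []
    current = None
    for line in dmesg_text.splitlines():
        hung = HUNG_RE.search(line)
        if hung:
            current = {
                "time": float(hung["ts"]),
                "task": hung["name"],
                "pid": int(hung["pid"]),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        # Skip unreliable ("? ") frames and frames seen before any report.
        if frame and current is not None and not frame["maybe"]:
            current["frames"].append(frame["sym"])
    return reports
```

Passing the saved dmesg excerpt to `summarize_hung_tasks` and printing each report's `task`, `pid` and `frames` gives a compact view of where each process was stuck.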

    At first I suspected filesystem corruption, seeing "ext4" in the errors. Then, after a forced FS repair (which found quite a lot of errors), I rebooted the server, but the same thing happened again. This time I suspected that one of the hard drives forming the RAID1 array where / is located had gone bad. I ran smartctl on both drives:

    Code:
    bash-4.2# smartctl -a /dev/sdj
    smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.2.45] (local build)
    Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
    
    === START OF INFORMATION SECTION ===
    Model Family:     Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
    Device Model:     ST2000DM001-1CH164
    Serial Number:    S1E1REY8
    LU WWN Device Id: 5 000c50 060fb47fd
    Firmware Version: CC24
    User Capacity:    2,000,398,934,016 bytes [2.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   8
    ATA Standard is:  ATA-8-ACS revision 4
    Local Time is:    Wed Feb 19 19:42:42 2014 EST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    General SMART Values:
    Offline data collection status:  (0x82)    Offline data collection activity
                        was completed without error.
                        Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0)    The previous self-test routine completed
                        without error or no self-test has ever 
                        been run.
    Total time to complete Offline 
    data collection:         (  575) seconds.
    Offline data collection
    capabilities:              (0x7b) SMART execute Offline immediate.
                        Auto Offline data collection on/off support.
                        Suspend Offline collection upon new
                        command.
                        Offline surface scan supported.
                        Self-test supported.
                        Conveyance Self-test supported.
                        Selective Self-test supported.
    SMART capabilities:            (0x0003)    Saves SMART data before entering
                        power-saving mode.
                        Supports SMART auto save timer.
    Error logging capability:        (0x01)    Error logging supported.
                        General Purpose Logging supported.
    Short self-test routine 
    recommended polling time:      (   1) minutes.
    Extended self-test routine
    recommended polling time:      ( 217) minutes.
    Conveyance self-test routine
    recommended polling time:      (   2) minutes.
    SCT capabilities:            (0x3085)    SCT Status supported.
    
    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   105   099   006    Pre-fail  Always       -       7776808
      3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       39
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       101348117
      9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3869
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       39
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
    189 High_Fly_Writes         0x003a   096   096   000    Old_age   Always       -       4
    190 Airflow_Temperature_Cel 0x0022   074   064   045    Old_age   Always       -       26 (Min/Max 22/28)
    191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       12
    193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       90
    194 Temperature_Celsius     0x0022   026   040   000    Old_age   Always       -       26 (0 18 0 0 0)
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       121229746900763
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       33625870858
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       10563155383
    
    SMART Error Log Version: 1
    No Errors Logged
    
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed without error       00%      3869         -
    # 2  Short offline       Completed without error       00%      3858         -
    # 3  Short offline       Completed without error       00%      3834         -
    # 4  Short offline       Completed without error       00%      3810         -
    # 5  Short offline       Completed without error       00%      3786         -
    # 6  Short offline       Completed without error       00%      3762         -
    # 7  Short offline       Completed without error       00%      3738         -
    # 8  Short offline       Completed without error       00%      3714         -
    # 9  Short offline       Completed without error       00%      3690         -
    #10  Short offline       Completed without error       00%      3666         -
    #11  Short offline       Completed without error       00%      3642         -
    #12  Short offline       Completed without error       00%      3618         -
    #13  Short offline       Completed without error       00%      3594         -
    #14  Short offline       Completed without error       00%      3570         -
    #15  Short offline       Completed without error       00%      3546         -
    #16  Short offline       Completed without error       00%      3522         -
    #17  Short offline       Completed without error       00%      3498         -
    #18  Short offline       Completed without error       00%      3474         -
    #19  Short offline       Completed without error       00%      3450         -
    #20  Short offline       Completed without error       00%      3426         -
    #21  Short offline       Completed without error       00%      3402         -
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
    Code:
    bash-4.2# smartctl -a /dev/sdk
    smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.2.45] (local build)
    Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
    
    === START OF INFORMATION SECTION ===
    Model Family:     Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
    Device Model:     ST2000DM001-1CH164
    Serial Number:    S1E1RH1L
    LU WWN Device Id: 5 000c50 060fae855
    Firmware Version: CC24
    User Capacity:    2,000,398,934,016 bytes [2.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   8
    ATA Standard is:  ATA-8-ACS revision 4
    Local Time is:    Wed Feb 19 19:46:27 2014 EST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    General SMART Values:
    Offline data collection status:  (0x82)    Offline data collection activity
                        was completed without error.
                        Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0)    The previous self-test routine completed
                        without error or no self-test has ever 
                        been run.
    Total time to complete Offline 
    data collection:         (  575) seconds.
    Offline data collection
    capabilities:              (0x7b) SMART execute Offline immediate.
                        Auto Offline data collection on/off support.
                        Suspend Offline collection upon new
                        command.
                        Offline surface scan supported.
                        Self-test supported.
                        Conveyance Self-test supported.
                        Selective Self-test supported.
    SMART capabilities:            (0x0003)    Saves SMART data before entering
                        power-saving mode.
                        Supports SMART auto save timer.
    Error logging capability:        (0x01)    Error logging supported.
                        General Purpose Logging supported.
    Short self-test routine 
    recommended polling time:      (   1) minutes.
    Extended self-test routine
    recommended polling time:      ( 210) minutes.
    Conveyance self-test routine
    recommended polling time:      (   2) minutes.
    SCT capabilities:            (0x3085)    SCT Status supported.
    
    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       189150976
      3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       42
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       4395936207
      9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3873
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       42
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
    189 High_Fly_Writes         0x003a   089   089   000    Old_age   Always       -       11
    190 Airflow_Temperature_Cel 0x0022   073   064   045    Old_age   Always       -       27 (Min/Max 22/29)
    191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       13
    193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       108
    194 Temperature_Celsius     0x0022   027   040   000    Old_age   Always       -       27 (0 18 0 0 0)
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       158832185577246
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       36909133199
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       7977923618
    
    SMART Error Log Version: 1
    No Errors Logged
    
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed without error       00%      3873         -
    # 2  Short offline       Completed without error       00%      3862         -
    # 3  Short offline       Completed without error       00%      3838         -
    # 4  Short offline       Completed without error       00%      3814         -
    # 5  Short offline       Completed without error       00%      3790         -
    # 6  Short offline       Completed without error       00%      3766         -
    # 7  Short offline       Completed without error       00%      3742         -
    # 8  Short offline       Completed without error       00%      3718         -
    # 9  Short offline       Completed without error       00%      3694         -
    #10  Short offline       Completed without error       00%      3670         -
    #11  Short offline       Completed without error       00%      3646         -
    #12  Short offline       Completed without error       00%      3622         -
    #13  Short offline       Completed without error       00%      3598         -
    #14  Short offline       Completed without error       00%      3574         -
    #15  Short offline       Completed without error       00%      3550         -
    #16  Short offline       Completed without error       00%      3526         -
    #17  Short offline       Completed without error       00%      3502         -
    #18  Short offline       Completed without error       00%      3478         -
    #19  Short offline       Completed without error       00%      3454         -
    #20  Short offline       Completed without error       00%      3430         -
    #21  Short offline       Completed without error       00%      3406         -
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
    mdadm doesn't seem to see anything wrong, and both drives came back with zero reallocated sectors. That points to the next in line: the motherboard's SATA controller. Or the RAM, or the PSU?
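For anyone wanting to repeat the SMART check without eyeballing the tables, here is a minimal sketch that parses the attribute table from saved `smartctl -a` output and flags the counters I looked at. The names `parse_attributes`, `suspicious` and the `WATCHED` set are my own assumptions (a common choice of attributes, not an official list), and the parser assumes the column layout shown in the output above.

```python
# Attributes whose raw value should normally stay at 0 on a healthy drive.
# This selection is an assumption, not a list defined by smartmontools.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Command_Timeout",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}


def parse_attributes(smartctl_output):
    """Parse the 'Vendor Specific SMART Attributes' table into a dict.

    Expected columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED
    WHEN_FAILED RAW_VALUE (only the first token of RAW_VALUE is kept).
    """
    rows = {}
    for line in smartctl_output.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and carry a hex FLAG column.
        if len(parts) >= 10 and parts[0].isdigit() and parts[2].startswith("0x"):
            rows[parts[1]] = {
                "value": int(parts[3]),
                "worst": int(parts[4]),
                "thresh": int(parts[5]),
                "raw": parts[9],
            }
    return rows


def suspicious(rows):
    """Return human-readable findings for attributes that look unhealthy."""
    findings = []
    for name, attr in rows.items():
        raw = attr["raw"]
        if name in WATCHED and raw.isdigit() and int(raw) != 0:
            findings.append(f"{name}: raw value {raw} is non-zero")
        # A normalized VALUE at or below a non-zero THRESH means the
        # attribute has crossed the vendor's failure threshold.
        if attr["thresh"] > 0 and attr["value"] <= attr["thresh"]:
            findings.append(
                f"{name}: value {attr['value']} at or below threshold {attr['thresh']}"
            )
    return findings
```

On the two outputs above this kind of check reports nothing, which matches what I saw by hand. The large raw values for Raw_Read_Error_Rate and Seek_Error_Rate are deliberately not in `WATCHED`, since their raw numbers are vendor-specific and not simple error counts.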
     
    #1
  2. lpallard

    Next I ran "hwinfo --disk" to find out which controller may be faulty. It turns out that md2's underlying drives are connected to the mainboard's SATA controller (Supermicro H8DCL-iF). After I ran the hwinfo utility, I re-ran dmesg and got a slightly different (but not better) output.

    Code:
    [  303.019706] ------------[ cut here ]------------
    [  303.019714] WARNING: at fs/sysfs/dir.c:481 sysfs_add_one+0xd4/0x100()
    [  303.019716] Hardware name: H8DCL
    [  303.019717] sysfs: cannot create duplicate filename '/class/scsi_tape'
    [  303.019719] Modules linked in: st(+) ipv6 ipmi_si ipmi_devintf ipmi_msghandler agpgart lp ppdev parport_pc parport pcspkr fuse joydev processor thermal_sys psmouse e1000e k10temp evdev fam15h_power sp5100_tco hwmon i2c_piix4 i2c_core serio_raw button loop usbhid hid
    [  303.019738] Pid: 8219, comm: modprobe Not tainted 3.2.45 #2
    [  303.019740] Call Trace:
    [  303.019747]  [<ffffffff8105344f>] warn_slowpath_common+0x7f/0xc0
    [  303.019751]  [<ffffffff81053546>] warn_slowpath_fmt+0x46/0x50
    [  303.019756]  [<ffffffff8158f275>] ? strlcat+0x65/0x90
    [  303.019759]  [<ffffffff811a4bf4>] sysfs_add_one+0xd4/0x100
    [  303.019763]  [<ffffffff811a4c97>] create_dir+0x77/0xd0
    [  303.019766]  [<ffffffff811a4d8d>] sysfs_create_dir+0x7d/0xc0
    [  303.019770]  [<ffffffff81588579>] kobject_add_internal+0xa9/0x1f0
    [  303.019774]  [<ffffffff81588d39>] kset_register+0x29/0x60
    [  303.019779]  [<ffffffff8164468c>] __class_register+0xec/0x200
    [  303.019782]  [<ffffffff81644803>] __class_create+0x63/0xb0
    [  303.019785]  [<ffffffffa002b000>] ? 0xffffffffa002afff
    [  303.019790]  [<ffffffffa002b065>] init_st+0x65/0x1a5 [st]
    [  303.019792]  [<ffffffffa002b000>] ? 0xffffffffa002afff
    [  303.019796]  [<ffffffff810002b2>] do_one_initcall+0x122/0x170
    [  303.019801]  [<ffffffff81090e44>] sys_init_module+0x84/0x1e0
    [  303.019805]  [<ffffffff81b3246b>] system_call_fastpath+0x16/0x1b
    [  303.019808] ---[ end trace b3a5b89a15be52ad ]---
    [  303.019811] kobject_add_internal failed for scsi_tape with -EEXIST, don't try to register things with the same name in the same directory.
    [  303.019815] Pid: 8219, comm: modprobe Tainted: G        W    3.2.45 #2
    [  303.019816] Call Trace:
    [  303.019819]  [<ffffffff815885c2>] kobject_add_internal+0xf2/0x1f0
    [  303.019823]  [<ffffffff81588d39>] kset_register+0x29/0x60
    [  303.019826]  [<ffffffff8164468c>] __class_register+0xec/0x200
    [  303.019829]  [<ffffffff81644803>] __class_create+0x63/0xb0
    [  303.019832]  [<ffffffffa002b000>] ? 0xffffffffa002afff
    [  303.019836]  [<ffffffffa002b065>] init_st+0x65/0x1a5 [st]
    [  303.019838]  [<ffffffffa002b000>] ? 0xffffffffa002afff
    [  303.019841]  [<ffffffff810002b2>] do_one_initcall+0x122/0x170
    [  303.019843]  [<ffffffff81090e44>] sys_init_module+0x84/0x1e0
    [  303.019845]  [<ffffffff81b3246b>] system_call_fastpath+0x16/0x1b
    [  303.019847] Unable create sysfs class for SCSI tapes
    [  303.096048] sd 0:0:0:0: Attached scsi generic sg0 type 0
    [  303.096092] sd 0:0:1:0: Attached scsi generic sg1 type 0
    [  303.096132] sd 0:0:2:0: Attached scsi generic sg2 type 0
    [  303.096172] sd 0:0:3:0: Attached scsi generic sg3 type 0
    [  303.096215] sd 0:0:4:0: Attached scsi generic sg4 type 0
    [  303.096260] sd 0:0:5:0: Attached scsi generic sg5 type 0
    [  303.096310] sd 0:0:6:0: Attached scsi generic sg6 type 0
    [  303.096356] sd 0:0:7:0: Attached scsi generic sg7 type 0
    [  303.096450] sd 1:0:0:0: Attached scsi generic sg8 type 0
    [  303.096517] sd 2:0:0:0: Attached scsi generic sg9 type 0
    [  303.096650] sd 3:0:0:0: Attached scsi generic sg10 type 0
    [  303.151329] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
    [  303.151332] EDD information not available.
    [  315.046517] st: Version 20101219, fixed bufsize 32768, s/g segs 256
    [  315.046527] ------------[ cut here ]------------
    [  315.046535] WARNING: at fs/sysfs/dir.c:481 sysfs_add_one+0xd4/0x100()
    [  315.046537] Hardware name: H8DCL
    [  315.046539] sysfs: cannot create duplicate filename '/class/scsi_tape'
    [  315.046544] Modules linked in: st(+) sg ipv6 ipmi_si ipmi_devintf ipmi_msghandler agpgart lp ppdev parport_pc parport pcspkr fuse joydev processor thermal_sys psmouse e1000e k10temp evdev fam15h_power sp5100_tco hwmon i2c_piix4 i2c_core serio_raw button loop usbhid hid
    [  315.046563] Pid: 8360, comm: modprobe Tainted: G        W    3.2.45 #2
    [  315.046566] Call Trace:
    [  315.046573]  [<ffffffff8105344f>] warn_slowpath_common+0x7f/0xc0
    [  315.046577]  [<ffffffff81053546>] warn_slowpath_fmt+0x46/0x50
    [  315.046582]  [<ffffffff8158f275>] ? strlcat+0x65/0x90
    [  315.046586]  [<ffffffff811a4bf4>] sysfs_add_one+0xd4/0x100
    [  315.046590]  [<ffffffff811a4c97>] create_dir+0x77/0xd0
    [  315.046593]  [<ffffffff811a4d8d>] sysfs_create_dir+0x7d/0xc0
    [  315.046597]  [<ffffffff81588579>] kobject_add_internal+0xa9/0x1f0
    [  315.046601]  [<ffffffff81588d39>] kset_register+0x29/0x60
    [  315.046606]  [<ffffffff8164468c>] __class_register+0xec/0x200
    [  315.046609]  [<ffffffff81644803>] __class_create+0x63/0xb0
    [  315.046613]  [<ffffffffa002b000>] ? 0xffffffffa002afff
    [  315.046617]  [<ffffffffa002b065>] init_st+0x65/0x1a5 [st]
    [  315.046620]  [<ffffffffa002b000>] ? 0xffffffffa002afff
    [  315.046624]  [<ffffffff810002b2>] do_one_initcall+0x122/0x170
    [  315.046629]  [<ffffffff81090e44>] sys_init_module+0x84/0x1e0
    [  315.046633]  [<ffffffff81b3246b>] system_call_fastpath+0x16/0x1b
    [  315.046636] ---[ end trace b3a5b89a15be52ae ]---
    [  315.046640] kobject_add_internal failed for scsi_tape with -EEXIST, don't try to register things with the same name in the same directory.
    [  315.046643] Pid: 8360, comm: modprobe Tainted: G        W    3.2.45 #2
    [  315.046645] Call Trace:
    [  315.046648]  [<ffffffff815885c2>] kobject_add_internal+0xf2/0x1f0
    [  315.046651]  [<ffffffff81588d39>] kset_register+0x29/0x60
    [  315.046654]  [<ffffffff8164468c>] __class_register+0xec/0x200
    [  315.046660]  [<ffffffff81644803>] __class_create+0x63/0xb0
    [  315.046663]  [<ffffffffa002b000>] ? 0xffffffffa002afff
    [  315.046667]  [<ffffffffa002b065>] init_st+0x65/0x1a5 [st]
    [  315.046670]  [<ffffffffa002b000>] ? 0xffffffffa002afff
    [  315.046672]  [<ffffffff810002b2>] do_one_initcall+0x122/0x170
    [  315.046675]  [<ffffffff81090e44>] sys_init_module+0x84/0x1e0
    [  315.046678]  [<ffffffff81b3246b>] system_call_fastpath+0x16/0x1b
    [  315.046680] Unable create sysfs class for SCSI tapes
    [  315.100633] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
    [  315.100635] EDD information not available.
    [  654.310746] lp: driver loaded but no devices found
    [  654.325259] st: Version 20101219, fixed bufsize 32768, s/g segs 256
    [  654.325267] ------------[ cut here ]------------
    [  654.325275] WARNING: at fs/sysfs/dir.c:481 sysfs_add_one+0xd4/0x100()
    [  654.325278] Hardware name: H8DCL
    [  654.325280] sysfs: cannot create duplicate filename '/class/scsi_tape'
    [  654.325282] Modules linked in: st(+) lp parport_pc sg ipv6 ipmi_si ipmi_devintf ipmi_msghandler agpgart ppdev parport pcspkr fuse joydev processor thermal_sys psmouse e1000e k10temp evdev fam15h_power sp5100_tco hwmon i2c_piix4 i2c_core serio_raw button loop usbhid hid [last unloaded: parport_pc]
    [  654.325303] Pid: 8481, comm: modprobe Tainted: G        W    3.2.45 #2
    [  654.325306] Call Trace:
    [  654.325313]  [<ffffffff8105344f>] warn_slowpath_common+0x7f/0xc0
    [  654.325317]  [<ffffffff81053546>] warn_slowpath_fmt+0x46/0x50
    [  654.325323]  [<ffffffff8158f275>] ? strlcat+0x65/0x90
    [  654.325327]  [<ffffffff811a4bf4>] sysfs_add_one+0xd4/0x100
    [  654.325330]  [<ffffffff811a4c97>] create_dir+0x77/0xd0
    [  654.325334]  [<ffffffff811a4d8d>] sysfs_create_dir+0x7d/0xc0
    [  654.325338]  [<ffffffff81588579>] kobject_add_internal+0xa9/0x1f0
    [  654.325342]  [<ffffffff81588d39>] kset_register+0x29/0x60
    [  654.325347]  [<ffffffff8164468c>] __class_register+0xec/0x200
    [  654.325350]  [<ffffffff81644803>] __class_create+0x63/0xb0
    [  654.325354]  [<ffffffffa006c000>] ? 0xffffffffa006bfff
    [  654.325358]  [<ffffffffa006c065>] init_st+0x65/0x1a5 [st]
    [  654.325361]  [<ffffffffa006c000>] ? 0xffffffffa006bfff
    [  654.325365]  [<ffffffff810002b2>] do_one_initcall+0x122/0x170
    [  654.325369]  [<ffffffff81090e44>] sys_init_module+0x84/0x1e0
    [  654.325374]  [<ffffffff81b3246b>] system_call_fastpath+0x16/0x1b
    [  654.325377] ---[ end trace b3a5b89a15be52af ]---
    [  654.325380] kobject_add_internal failed for scsi_tape with -EEXIST, don't try to register things with the same name in the same directory.
    [  654.325384] Pid: 8481, comm: modprobe Tainted: G        W    3.2.45 #2
    [  654.325385] Call Trace:
    [  654.325388]  [<ffffffff815885c2>] kobject_add_internal+0xf2/0x1f0
    [  654.325391]  [<ffffffff81588d39>] kset_register+0x29/0x60
    [  654.325395]  [<ffffffff8164468c>] __class_register+0xec/0x200
    [  654.325398]  [<ffffffff81644803>] __class_create+0x63/0xb0
    [  654.325401]  [<ffffffffa006c000>] ? 0xffffffffa006bfff
    [  654.325405]  [<ffffffffa006c065>] init_st+0x65/0x1a5 [st]
    [  654.325407]  [<ffffffffa006c000>] ? 0xffffffffa006bfff
    [  654.325410]  [<ffffffff810002b2>] do_one_initcall+0x122/0x170
    [  654.325412]  [<ffffffff81090e44>] sys_init_module+0x84/0x1e0
    [  654.325415]  [<ffffffff81b3246b>] system_call_fastpath+0x16/0x1b
    [  654.325417] Unable create sysfs class for SCSI tapes
    [  654.383088] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
    [  654.383090] EDD information not available.
    [  689.583198] lp: driver loaded but no devices found
    [  689.598456] st: Version 20101219, fixed bufsize 32768, s/g segs 256
    [  689.598463] ------------[ cut here ]------------
    [  689.598469] WARNING: at fs/sysfs/dir.c:481 sysfs_add_one+0xd4/0x100()
    [  689.598471] Hardware name: H8DCL
    [  689.598473] sysfs: cannot create duplicate filename '/class/scsi_tape'
    [  689.598474] Modules linked in: st(+) lp parport_pc sg ipv6 ipmi_si ipmi_devintf ipmi_msghandler agpgart ppdev parport pcspkr fuse joydev processor thermal_sys psmouse e1000e k10temp evdev fam15h_power sp5100_tco hwmon i2c_piix4 i2c_core serio_raw button loop usbhid hid [last unloaded: parport_pc]
    [  689.598493] Pid: 8573, comm: modprobe Tainted: G        W    3.2.45 #2
    [  689.598495] Call Trace:
    [  689.598501]  [<ffffffff8105344f>] warn_slowpath_common+0x7f/0xc0
    [  689.598504]  [<ffffffff81053546>] warn_slowpath_fmt+0x46/0x50
    [  689.598508]  [<ffffffff8158f275>] ? strlcat+0x65/0x90
    [  689.598511]  [<ffffffff811a4bf4>] sysfs_add_one+0xd4/0x100
    [  689.598514]  [<ffffffff811a4c97>] create_dir+0x77/0xd0
    [  689.598517]  [<ffffffff811a4d8d>] sysfs_create_dir+0x7d/0xc0
    [  689.598520]  [<ffffffff81588579>] kobject_add_internal+0xa9/0x1f0
    [  689.598523]  [<ffffffff81588d39>] kset_register+0x29/0x60
    [  689.598527]  [<ffffffff8164468c>] __class_register+0xec/0x200
    [  689.598529]  [<ffffffff81644803>] __class_create+0x63/0xb0
    [  689.598533]  [<ffffffffa006c000>] ? 0xffffffffa006bfff
    [  689.598536]  [<ffffffffa006c065>] init_st+0x65/0x1a5 [st]
    [  689.598539]  [<ffffffffa006c000>] ? 0xffffffffa006bfff
    [  689.598542]  [<ffffffff810002b2>] do_one_initcall+0x122/0x170
    [  689.598545]  [<ffffffff81090e44>] sys_init_module+0x84/0x1e0
    [  689.598548]  [<ffffffff81b3246b>] system_call_fastpath+0x16/0x1b
    [  689.598550] ---[ end trace b3a5b89a15be52b0 ]---
    [  689.598554] kobject_add_internal failed for scsi_tape with -EEXIST, don't try to register things with the same name in the same directory.
    [  689.598557] Pid: 8573, comm: modprobe Tainted: G        W    3.2.45 #2
    [  689.598559] Call Trace:
    [  689.598561]  [<ffffffff815885c2>] kobject_add_internal+0xf2/0x1f0
    [  689.598564]  [<ffffffff81588d39>] kset_register+0x29/0x60
    [  689.598566]  [<ffffffff8164468c>] __class_register+0xec/0x200
    [  689.598569]  [<ffffffff81644803>] __class_create+0x63/0xb0
    [  689.598572]  [<ffffffffa006c000>] ? 0xffffffffa006bfff
    [  689.598575]  [<ffffffffa006c065>] init_st+0x65/0x1a5 [st]
    [  689.598577]  [<ffffffffa006c000>] ? 0xffffffffa006bfff
    [  689.598580]  [<ffffffff810002b2>] do_one_initcall+0x122/0x170
    [  689.598582]  [<ffffffff81090e44>] sys_init_module+0x84/0x1e0
    [  689.598585]  [<ffffffff81b3246b>] system_call_fastpath+0x16/0x1b
    [  689.598587] Unable create sysfs class for SCSI tapes
    [  689.651287] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
    [  689.651289] EDD information not available.
    [  731.942239] st: Version 20101219, fixed bufsize 32768, s/g segs 256
    [  731.942246] ------------[ cut here ]------------
    [  731.942253] WARNING: at fs/sysfs/dir.c:481 sysfs_add_one+0xd4/0x100()
    [  731.942255] Hardware name: H8DCL
    [  731.942257] sysfs: cannot create duplicate filename '/class/scsi_tape'
    [  731.942259] Modules linked in: st(+) lp parport_pc sg ipv6 ipmi_si ipmi_devintf ipmi_msghandler agpgart ppdev parport pcspkr fuse joydev processor thermal_sys psmouse e1000e k10temp evdev fam15h_power sp5100_tco hwmon i2c_piix4 i2c_core serio_raw button loop usbhid hid [last unloaded: parport_pc]
    [  731.942275] Pid: 8641, comm: modprobe Tainted: G        W    3.2.45 #2
    [  731.942277] Call Trace:
    [  731.942284]  [<ffffffff8105344f>] warn_slowpath_common+0x7f/0xc0
    [  731.942287]  [<ffffffff81053546>] warn_slowpath_fmt+0x46/0x50
    [  731.942292]  [<ffffffff8158f275>] ? strlcat+0x65/0x90
    [  731.942295]  [<ffffffff811a4bf4>] sysfs_add_one+0xd4/0x100
    [  731.942298]  [<ffffffff811a4c97>] create_dir+0x77/0xd0
    [  731.942301]  [<ffffffff811a4d8d>] sysfs_create_dir+0x7d/0xc0
    [  731.942305]  [<ffffffff81588579>] kobject_add_internal+0xa9/0x1f0
    [  731.942308]  [<ffffffff81588d39>] kset_register+0x29/0x60
    [  731.942312]  [<ffffffff8164468c>] __class_register+0xec/0x200
    [  731.942315]  [<ffffffff81644803>] __class_create+0x63/0xb0
    [  731.942318]  [<ffffffffa006c000>] ? 0xffffffffa006bfff
    [  731.942321]  [<ffffffffa006c065>] init_st+0x65/0x1a5 [st]
    [  731.942324]  [<ffffffffa006c000>] ? 0xffffffffa006bfff
    [  731.942327]  [<ffffffff810002b2>] do_one_initcall+0x122/0x170
    [  731.942331]  [<ffffffff81090e44>] sys_init_module+0x84/0x1e0
    [  731.942335]  [<ffffffff81b3246b>] system_call_fastpath+0x16/0x1b
    [  731.942337] ---[ end trace b3a5b89a15be52b1 ]---
    [  731.942340] kobject_add_internal failed for scsi_tape with -EEXIST, don't try to register things with the same name in the same directory.
    [  731.942343] Pid: 8641, comm: modprobe Tainted: G        W    3.2.45 #2
    [  731.942345] Call Trace:
    [  731.942347]  [<ffffffff815885c2>] kobject_add_internal+0xf2/0x1f0
    [  731.942350]  [<ffffffff81588d39>] kset_register+0x29/0x60
    [  731.942353]  [<ffffffff8164468c>] __class_register+0xec/0x200
    [  731.942356]  [<ffffffff81644803>] __class_create+0x63/0xb0
    [  731.942358]  [<ffffffffa006c000>] ? 0xffffffffa006bfff
    [  731.942361]  [<ffffffffa006c065>] init_st+0x65/0x1a5 [st]
    [  731.942364]  [<ffffffffa006c000>] ? 0xffffffffa006bfff
    [  731.942366]  [<ffffffff810002b2>] do_one_initcall+0x122/0x170
    [  731.942368]  [<ffffffff81090e44>] sys_init_module+0x84/0x1e0
    [  731.942371]  [<ffffffff81b3246b>] system_call_fastpath+0x16/0x1b
    [  731.942373] Unable create sysfs class for SCSI tapes
    [  731.997439] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
    [  731.997442] EDD information not available.
    It was suggested that I run memtest to check the RAM. I couldn't do it...

    1. The server has no CD/DVD drive. I tried to take the drive from my media center and install it in my server, but it's a headless machine, the only monitor I have has only a DVI cable, the server has only a VGA port, and I don't have a DVI-to-VGA adapter (only lots of VGA-to-DVI ones)...

    2. The IPMI's virtual console (or whatever it is called) doesn't work. I tried "Remote control > Launch SOL", and all I get is a blank window with a blinking cursor.

    3. The console redirection also doesn't seem to work. I get a totally blank window. Nothing there..

    4. For uploading a bootable image through the IPMI platform, there are 2 options: floppy image and ISO. The floppy image option accepts a maximum of 1.44MB (while just about all the images out there are bigger than 1.44MB) and accepts only .img or .ima files.. The ISO option requires storing an ISO on a Windows share.. I have no Windows machine.

    5. I tried to create a bootable USB UBCD stick.. Couldn't create the bootable stick. I followed the instructions at http://nlug.ml1.co.uk/2012/04/instal...ory-stick/2512 but when I got to the step where I need to run syslinux on the USB stick, I got a permission denied error, even as root (stick of course NOT mounted).

    6. I tried to create a System Rescue CD bootable USB stick following http://www.sysresccd.org/Sysresccd-m...n_an_USB-stick but it also failed. Their script couldn't detect my USB stick..

    I'm desperate.. I need a vacation

    Sounds like my only avenue now is to hope/pray that my monthly incremental backups on my hot-swappable drive are good, and that the data actually stored on the drives is not too corrupted either. I know for a fact that there is some database damage, as MySQL has been having a hard time lately and tonight an application crashed catastrophically..

    What would be the safe course of action here? Keep the server powered down to prevent further corruption until I can get the RAM checked or the mobo exchanged under warranty?
     
    #2
  3. lpallard

    Sorry for the 3rd post in a row.. a bit of panic, I guess..

    I found this interesting thread: "Disk IO errors on all sata disks, disks dropped out"

    I have the same hardware (more or less): 2 Seagate Barracuda LP drives assembled as an mdadm RAID1 on a Supermicro mobo. Could this be the culprit?

    Maybe it's time to install that shiny new M5016 controller and get myself two nice SAS drives....
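
    Since the two drives here form an mdadm RAID1, one quick thing to look at is whether the kernel still sees both members. A minimal sketch, assuming the usual /proc/mdstat layout (an "mdN : ..." line followed by a "[total/active] [UU]" status line). The function name and the sample text are made up for illustration and are not taken from this server:

```python
import re

# "md0 : active raid1 sdb1[1] sda1[0]" starts a new array entry.
ARRAY_RE = re.compile(r"^(md\w+)\s*:")
# "[2/1] [U_]" means 2 configured members, 1 active, second slot missing.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of arrays that report a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        array = ARRAY_RE.match(line)
        if array:
            current = array.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            total = int(status.group(1))
            active = int(status.group(2))
            if active < total or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


# Hypothetical sample, shaped like /proc/mdstat output.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md2 : active raid1 sda3[0]
      976630336 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

    On the real machine the input would be the contents of /proc/mdstat instead of SAMPLE. A degraded array only says a member dropped out; it does not tell the drive, the SATA controller and the power supply apart.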
     
    #3
  4. lpallard

    So I spent an hour in a remote desktop session with Supermicro. Their tech rep was very good and seemed to think that it may be either one of the drives (due to a failure or a firmware bug of some sort) or the mobo's SATA controller..

    He did not recommend anything, however. I am left a bit clueless, but for now I am running memtest to eliminate RAM issues. I highly doubt the memory is faulty, but I'm free to test so why not!

    Right now I am contemplating getting 2 identical SAS enterprise drives to plug into my M5016 controller and proceeding with my virtualized setup as I was planning about a month ago, before this nightmare started.

    Would it be a waste of money, or do you guys suggest I seriously look at the SAS drives?

    In the $0-$200 CAD range, there are:

    • Seagate Constellation ES.3 ST1000NM0043 1TB 128MB cache 7200RPM
    • Seagate Constellation ES.3 ST1000NM0023 1TB 128MB cache 7200RPM
    • HGST Ultrastar 15K600 HUS156030VLS600(0B23661) 300GB 16MB 15000RPM

    I can't tell the difference between the first two drives, but they are Seagate and I'm getting awfully displeased with Seagate these days, so even with the capacity vs. spindle speed tradeoff, I am leaning toward 2x the HGST 300GB..

    Not cheap. $437 CAD including shipping.

    But on the other hand, when I built this server, I wanted to use it for my personal document repository, finances, databases, etc. Right now, every 4 to 6 months or so, I have a catastrophic failure like the current one. I am tired of it. It's costly and unreliable.

    Either I put all the chances on my side to have a reliable server and eliminate the consumer crap from it for good, or I get rid of the entire thing... Do you guys recommend spending more money to have something that I may some day be able to rely on??

    Sincerely asking opinions here..

    Cheers
     
    #4
  5. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,498
    Likes Received:
    4,441
    Or just verify with $15 Cheetah 15K.5's
     
    #5
  6. lpallard

    Dear friend, not sure I understand you there... $15 ?
     
    #6
  7. Patrick

    #7
  8. lpallard

    Patrick,

    Would you generally recommend second-hand drives from a reliability standpoint? Most of the drives on eBay will be older generations without any warranty left.. I know a warranty doesn't help me recover my data, but at least if the drives die, I can get them replaced fairly easily..
     
    #8
  9. MiniKnight

    MiniKnight Well-Known Member

    Joined:
    Mar 30, 2012
    Messages:
    2,927
    Likes Received:
    854
    I would never buy drives this old for production. I'd think you could inexpensively test whether SAS drives would throw errors. If SAS drives work, then potentially invest more in them, selling off whatever you bought for the test.
     
    #9
    Last edited: Feb 21, 2014
  10. lpallard


    And we have a winner!
    Turns out, this morning while I was trying to boot the server into init 1 to run fsck on the RAID partitions, every time a hard drive on the mobo was accessed, I could hear a strange metallic noise (kinda high-pitched, but more metallic, like pennies being dropped on a glass-top table..) coming out of the PSU.
    I pulled the PSU out of the case to be able to stick my ear to it and confirm, and I could very clearly hear it coming from inside the PSU.
    So right now, I'm not ruling out a SATA controller failure on the motherboard, but I am leaning toward a PSU failure causing controller hiccups..

    I will get the PSU replaced under warranty. Stupid Corsair.. With a $275 PSU you'd expect more than 5 months of life!!!!!!
    Now my next question, as you will have guessed: How can I be sure the rest of the server has not suffered from the failing PSU? Corsair's warranty doesn't cover damage caused to equipment connected to their PSU..
     
    #10
  11. mrkrad

    mrkrad Well-Known Member

    Joined:
    Oct 13, 2012
    Messages:
    1,234
    Likes Received:
    49
    Don't you have redundant PSUs? Might be a good reason to move to a rackmount server with RPS features :)
     
    #11
  12. lpallard

    No, that wasn't a redundant PSU.. Only a single Corsair HX1050

    How would redundant PSUs protect my server from electrical damage? Wouldn't they only protect against downtime, kind of like a RAID1 setup for hard drives?
     
    #12
  13. lpallard

    I'm about to shell out $450 for SAS drives.. Can somebody recommend these or deny their usefulness?

    Newegg.ca - HGST Ultrastar 15K600 HUS156030VLS600(0B23661) 300GB 15000 RPM 16MB Cache SAS 6Gb/s 3.5" Enterprise Hard Drive Bare Drive

    These drives will have to last 5 years or more.. If you can recommend anything better, please do!
     
    #13
  14. lpallard

    OK, a few days later... The PSU is on its way to Cali for RMA... It should take about a month before I get the replacement..

    Until then, I am planning to:

    -Visually check the motherboard for damage (not useful but who knows)
    -Check which FW I am running on ALL hard drives & upgrade the ones that aren't on the latest (is the only danger bricking the drive, or is there anything else?) Should I even go there?
    -Run a long SMART test on ALL drives
    -Get reliable drives (2X) for rebuilding the server, and get SAS cables
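
    The firmware check and the long SMART test in the list above could be scripted roughly like this. A sketch only, assuming smartmontools is installed: "smartctl -i" prints identity fields such as "Firmware Version" for SATA drives, and "smartctl -t long" starts an extended self-test. The helper names and the sample text are invented for illustration; the model, serial and firmware strings are placeholders, not values read from these drives.

```python
import subprocess

# Identity fields of interest in `smartctl -i` output for ATA/SATA drives.
WANTED = ("Device Model", "Serial Number", "Firmware Version")


def parse_identity(info_text):
    """Pick the model/serial/firmware fields out of `smartctl -i` output."""
    fields = {}
    for line in info_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in WANTED:
            fields[key.strip()] = value.strip()
    return fields


def read_identity(device):
    """Run `smartctl -i` on one device (usually needs root) and parse it."""
    result = subprocess.run(
        ["smartctl", "-i", device],
        capture_output=True, text=True, check=False,
    )
    return parse_identity(result.stdout)


def long_test_command(device):
    """Command line that asks the drive to start an extended self-test."""
    return ["smartctl", "-t", "long", device]


# Placeholder text shaped like `smartctl -i` output; values are not real.
SAMPLE = """\
Device Model:     EXAMPLE-MODEL
Serial Number:    EXAMPLE-SERIAL
Firmware Version: X001
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
"""

print(parse_identity(SAMPLE))
print(" ".join(long_test_command("/dev/sda")))
```

    The self-test runs inside the drive and takes hours on large disks; the result shows up afterwards in "smartctl -l selftest /dev/sdX". Whether a given firmware is the latest still has to be checked against the vendor's site by hand.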

    I have tried to get a hint of a confirmation from HGST about the compatibility of the Ultrastar drives with the M5016. They wouldn't confirm anything.. IBM seems to be impossible to talk to without a 5-year maintenance agreement, and it seems to be more complicated than buying a house...

    Most places where I can get my hands on SAS drives will likely not take them back because of compatibility issues, so I need to ensure they are 100% compatible with the M5016.

    Are none of you guys using SAS Ultrastars on IBM ServeRAID controllers?

    There are always Cheetahs, but being Seagate and all, I'd rather avoid them..

    Going back to the initial topic, I must say, this issue is not solved yet. The more I search on this issue (using keywords from the dmesg output), the more I find sites where people are reporting kernel bugs or other incidents where they were using a Linux MD RAID and not a real hardware RAID...

    I'm surprised none of you guys have seen this before.. Seems to be "fairly" frequent...

    Ronny Egners Blog » INFO: task blocked for more than 120 seconds.
    https://www.linuxquestions.org/ques...e-than-120-seconds-errors-and-crashes-890981/
    0004515: Server hangs, processes being blocked for more than 120 seconds - CentOS Bug Tracker
    https://www.linuxquestions.org/ques...server-crash-kernel-info-task-blocked-908600/
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=516374

    Some users have suggested that a process may be using too much RAM (MySQL with its InnoDB buffer at 80% of installed RAM??), while others have suggested that the storage subsystem may be struggling to achieve the required throughput (I/O wise). Finally, few (if any) have indicated controller failure, but some have proposed that HDD failure could be the culprit.

    That's why I am leaning toward a SAS setup. It may be some $$ up front, but if I can forget this stupid server for a few months without hardware failing on me, I'm willing to spend...
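
    To see whether the hangs cluster around particular processes, the "blocked for more than 120 seconds" lines can be tallied from a saved dmesg dump. A minimal sketch: the regex follows the message format quoted in the first post of this thread, the function name is made up, and the third sample line is hypothetical rather than copied from this server's log.

```python
import re
from collections import Counter

# Matches e.g. "[264481.220106] INFO: task syslogd:2517 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] INFO: task (.+):(\d+) blocked for more than (\d+) seconds"
)


def hung_tasks(dmesg_text):
    """Count hung-task reports per process name in a dmesg dump."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = HUNG_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


SAMPLE = """\
[264481.220106] INFO: task syslogd:2517 blocked for more than 120 seconds.
[264481.220112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[264601.220100] INFO: task syslogd:2517 blocked for more than 120 seconds.
"""

for name, count in hung_tasks(SAMPLE).most_common():
    print(name, count)
```

    If nearly every process that touches the disks shows up in the tally, that points at the storage path as a whole (drives, controller, power) rather than at one misbehaving application, though it cannot say which of those is at fault.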
     
    #14
  15. lpallard

    #15
  16. Peter_U

    Peter_U New Member

    Joined:
    Apr 11, 2012
    Messages:
    22
    Likes Received:
    0
    #16
  17. lpallard

    Perhaps it's better to stay away then..

    I meant to ask, and did on other forums to get the community's "pulse", but:

    What about CentOS instead of Slackware for this server?

    Slackware is my default distro because I know Slackware better than any other distro and I love its KISS principle and the UNIX way of doing things. However, I am very tired of rebuilding such a system from scratch, because it requires tons of packages built from source, CPAN modules, configuring, tweaking, and there are always exceptions due to Slackware (builds fail because of.., apps not running because of..), etc... Slackware doesn't (and probably never will) use a real package manager, so dependency resolution is on the user's shoulders. Nice if you end up using the same system for ages, but after rebuilding it 3 times in a single year, I've had enough of it.

    I found that to have a working server OS, I had to spend SEVERAL hours on it, whereas, I hope, with CentOS or another server-tailored OS, I would have this working state almost right after the initial OS installation. Dependency handling and compatibility with server-grade hardware should be better (or at least tested somehow) with CentOS?

    I'm just wondering.. Extensively searching the web, I found only CentOS, RHEL & Ubuntu as recommended (supported) server OSes. Hell, the Supermicro tech reps didn't even know what Slackware was!

    Yesterday I installed CentOS 6.5 in a VBox machine to test/play with it. In the meantime I'm hoping to hear about real-life server admins' experiences with CentOS or another Linux-based server OS.
     
    #17
  18. MiniKnight

    CentOS is based largely on RHEL. You are not going to have issues running it.

    Ubuntu server is also viable.

    I'd bet most Web servers are using one of the two.
     
    #18
  19. Salami

    Salami New Member

    Joined:
    Oct 12, 2012
    Messages:
    31
    Likes Received:
    0
    There is an Amazon seller with that drive at the same price.
     
    #19
  20. lpallard

    Are the Hitachi Ultrastars even good to begin with? Before I spend hundreds, I'd like to have feedback from an actual human instead of reading crappy useless websites .... Maybe I should post on the Hitachi HDD thread?!
     
    #20