Hi there,
OK, this time I *really* need the pros! About a month ago (around when I got my Intel PRO/1000 quad-port NIC and installed it... coincidence?) I noticed strange messages in dmesg. At first I thought some application had crashed and didn't bother with it. The server appeared to run normally. Then the occurrences suddenly increased, and today I lost MySQL and databases started crashing.
The messages in dmesg (full output in the first Code block below):
After a reboot, these were gone, but a few hours or days later they came back. Like I said, tonight the apps started to behave strangely at the exact moment these errors appeared in dmesg.
At first I suspected filesystem corruption, seeing "ext4" in the errors. Then, after a forced FS repair (which found quite a lot of errors), I rebooted the server, but the same thing happened again. This time I suspected that one of the hard drives forming the RAID1 array where / is located had gone bad. I ran smartctl on both drives (output in the two smartctl Code blocks below):
mdadm does not seem to see anything wrong, and both drives came back with zero reallocated sectors. That points to the next in line: the motherboard's SATA controller. Or RAM, or the PSU?
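For what it's worth, the "mdadm sees nothing wrong" check can be scripted so it is easy to repeat after each incident. This is only a minimal sketch, assuming the usual /proc/mdstat layout (an array line such as "md0 : active raid1 ..." followed by a status line ending in something like "[2/2] [UU]"). The function name `degraded_arrays` and the sample text are made up for illustration; the sample is not output from this server.

```python
import re

# Array header line, e.g. "md0 : active raid1 sdk1[1] sdj1[0]"
ARRAY_RE = re.compile(r"^(md\w+) :")
# Status line, e.g. "1048512 blocks [2/2] [UU]" (expected/active members, then flags)
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of arrays whose status line shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            flags = status.group(3)
            # "_" marks a failed or missing member; "U" marks a healthy one.
            if active < expected or "_" in flags:
                degraded.append(current)
            current = None
    return degraded


# Hypothetical sample, NOT taken from this machine.
SAMPLE = """\
md0 : active raid1 sdk1[1] sdj1[0]
      1048512 blocks [2/2] [UU]
md2 : active raid1 sdj3[0]
      1953382336 blocks super 1.2 [2/1] [U_]
"""

print(degraded_arrays(SAMPLE))
```

On a real system the input would be the contents of /proc/mdstat. An empty result only says md has not kicked a member out; it does not rule out a controller, cable, RAM or PSU problem.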
Code:
[ 31.117103] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[ 40.440181] NET: Registered protocol family 10
[ 40.640740] svc: failed to register lockdv1 RPC service (errno 97).
[ 40.640907] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[ 40.640915] NFSD: unable to find recovery directory /var/lib/nfs/v4recovery
[ 40.640918] NFSD: starting 90-second grace period
[ 51.210086] eth0: no IPv6 routers present
[ 59.804882] xfsettingsd[4055]: segfault at 1 ip 000000000040c261 sp 00007fff28cd9210 error 4 in xfsettingsd[400000+14000]
[ 62.337378] ata1.00: configured for UDMA/133
[ 62.337381] ata1: EH complete
[ 62.390605] ata2.00: configured for UDMA/133
[ 62.390608] ata2: EH complete
[ 62.418051] ata3.00: configured for UDMA/133
[ 62.418055] ata3: EH complete
[ 67.222901] EXT4-fs (md2): re-mounted. Opts: commit=0
[ 67.225728] EXT4-fs (md0): re-mounted. Opts: commit=0
[ 67.227650] EXT4-fs (md3): re-mounted. Opts: data=writeback,stripe=48,barrier=0,errors=remount-ro,commit=0
[B][264481.220106] INFO: task syslogd:2517 blocked for more than 120 seconds.
[264481.220112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[264481.220117] syslogd D ffff88101ec31c40 0 2517 1 0x00000000
[264481.220126] ffff880fb42cbde8 0000000000000082 ffff880fb42cbd88 ffffffff00000000
[264481.220134] ffff880fb8226720 ffff880fb42cbfd8 ffff880fb42cbfd8 ffff880fb42cbfd8
[264481.220141] ffff880fb81044c0 ffff880fb8226720 0000000000000001 0000000100000246
[264481.220148] Call Trace:
[264481.220166] [<ffffffff81b2fcff>] schedule+0x3f/0x60
[264481.220175] [<ffffffff8126ae05>] jbd2_log_wait_commit+0xb5/0x130
[264481.220185] [<ffffffff81074c90>] ? finish_wait+0x80/0x80
[264481.220192] [<ffffffff8126cc61>] jbd2_complete_transaction+0x51/0xa0
[264481.220200] [<ffffffff81217548>] ext4_sync_file+0x198/0x3a0
[264481.220210] [<ffffffff81161795>] do_fsync+0x55/0x80
[264481.220217] [<ffffffff81161ac0>] sys_fsync+0x10/0x20
[264481.220223] [<ffffffff81b3246b>] system_call_fastpath+0x16/0x1b
[264481.220244] INFO: task mysqld:3971 blocked for more than 120 seconds.
[264481.220248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[264481.220253] mysqld D ffff88101ec11c40 0 3971 3057 0x00000000
[264481.220260] ffff880ef3c17de8 0000000000000082 ffff880ef3c17d88 ffffffff810f2935
[264481.220267] ffff880ef3e8d280 ffff880ef3c17fd8 ffff880ef3c17fd8 ffff880ef3c17fd8
[264481.220273] ffff8804c6afd280 ffff880ef3e8d280 0000000000000001 0000000000000246
[264481.220286] Call Trace:
[264481.220289] [<ffffffff810f2935>] ? pagevec_lookup_tag+0x25/0x40
[264481.220292] [<ffffffff81b2fcff>] schedule+0x3f/0x60
[264481.220295] [<ffffffff8126ae05>] jbd2_log_wait_commit+0xb5/0x130
[264481.220298] [<ffffffff81074c90>] ? finish_wait+0x80/0x80
[264481.220300] [<ffffffff8126cc61>] jbd2_complete_transaction+0x51/0xa0
[264481.220303] [<ffffffff81217548>] ext4_sync_file+0x198/0x3a0
[264481.220307] [<ffffffff81089cbd>] ? sys_futex+0x8d/0x190
[264481.220310] [<ffffffff81161795>] do_fsync+0x55/0x80
[264481.220312] [<ffffffff81161ac0>] sys_fsync+0x10/0x20
[264481.220314] [<ffffffff81b3246b>] system_call_fastpath+0x16/0x1b
[265561.220103] INFO: task syslogd:2517 blocked for more than 120 seconds.
[265561.220109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[265561.220115] syslogd D ffff88101ecd1c40 0 2517 1 0x00000000
[265561.220124] ffff880fb42cbde8 0000000000000082 ffff880fb42cbd88 ffffffff00000000
[265561.220132] ffff880fb8226720 ffff880fb42cbfd8 ffff880fb42cbfd8 ffff880fb42cbfd8
[265561.220139] ffff880fb8185280 ffff880fb8226720 0000000000000001 0000000100000246
[265561.220146] Call Trace:
[265561.220163] [<ffffffff81b2fcff>] schedule+0x3f/0x60
[265561.220173] [<ffffffff8126ae05>] jbd2_log_wait_commit+0xb5/0x130
[265561.220182] [<ffffffff81074c90>] ? finish_wait+0x80/0x80
[265561.220189] [<ffffffff8126cc61>] jbd2_complete_transaction+0x51/0xa0
[265561.220197] [<ffffffff81217548>] ext4_sync_file+0x198/0x3a0
[265561.220207] [<ffffffff81161795>] do_fsync+0x55/0x80
[265561.220214] [<ffffffff81161ac0>] sys_fsync+0x10/0x20
[265561.220220] [<ffffffff81b3246b>] system_call_fastpath+0x16/0x1b[/B]
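To see at a glance which tasks hang and where they are stuck, the hung-task reports above can be summarized with a small script. This is a rough sketch, assuming the "INFO: task NAME:PID blocked for more than N seconds" wording and the "[<address>] function+0x..." frame format shown in this log; `summarize_hung_tasks` is a name I made up, and the excerpt fed to it is copied from the dmesg output above.

```python
import re
from collections import Counter

# e.g. "[264481.220106] INFO: task syslogd:2517 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<name>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)
# e.g. "[264481.220175] [<ffffffff8126ae05>] jbd2_log_wait_commit+0xb5/0x130"
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(?P<maybe>\? )?(?P<func>[\w.]+)\+0x")


def summarize_hung_tasks(dmesg_text):
    """Return (reports, frames): hung-task reports and a count of trace functions."""
    reports = []
    frames = Counter()
    for line in dmesg_text.splitlines():
        hung = HUNG_RE.search(line)
        if hung:
            reports.append((float(hung["ts"]), hung["name"], int(hung["pid"])))
            continue
        frame = FRAME_RE.search(line)
        # Frames prefixed with "?" are marked unreliable by the kernel; skip them.
        if frame and reports and not frame["maybe"]:
            frames[frame["func"]] += 1
    return reports, frames


# Short excerpt copied from the log above.
EXCERPT = """\
[264481.220106] INFO: task syslogd:2517 blocked for more than 120 seconds.
[264481.220166] [<ffffffff81b2fcff>] schedule+0x3f/0x60
[264481.220175] [<ffffffff8126ae05>] jbd2_log_wait_commit+0xb5/0x130
[264481.220185] [<ffffffff81074c90>] ? finish_wait+0x80/0x80
[264481.220244] INFO: task mysqld:3971 blocked for more than 120 seconds.
[264481.220295] [<ffffffff8126ae05>] jbd2_log_wait_commit+0xb5/0x130
"""

reports, frames = summarize_hung_tasks(EXCERPT)
for ts, name, pid in reports:
    print(f"{ts:.0f}s after boot: {name} (pid {pid}) blocked")
print(frames.most_common(3))
```

In this log every blocked task is sitting in jbd2_log_wait_commit under sys_fsync, meaning it is waiting for an ext4 journal commit to reach the disk. That is consistent with slow or stalled I/O underneath the filesystem, but it does not by itself say which component is at fault.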
Code:
bash-4.2# smartctl -a /dev/sdj
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.2.45] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model: ST2000DM001-1CH164
Serial Number: S1E1REY8
LU WWN Device Id: 5 000c50 060fb47fd
Firmware Version: CC24
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Feb 19 19:42:42 2014 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 575) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 217) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 105 099 006 Pre-fail Always - 7776808
3 Spin_Up_Time 0x0003 095 095 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 39
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 080 060 030 Pre-fail Always - 101348117
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3869
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 39
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 096 096 000 Old_age Always - 4
190 Airflow_Temperature_Cel 0x0022 074 064 045 Old_age Always - 26 (Min/Max 22/28)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 12
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 90
194 Temperature_Celsius 0x0022 026 040 000 Old_age Always - 26 (0 18 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 121229746900763
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 33625870858
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 10563155383
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 3869 -
# 2 Short offline Completed without error 00% 3858 -
# 3 Short offline Completed without error 00% 3834 -
# 4 Short offline Completed without error 00% 3810 -
# 5 Short offline Completed without error 00% 3786 -
# 6 Short offline Completed without error 00% 3762 -
# 7 Short offline Completed without error 00% 3738 -
# 8 Short offline Completed without error 00% 3714 -
# 9 Short offline Completed without error 00% 3690 -
#10 Short offline Completed without error 00% 3666 -
#11 Short offline Completed without error 00% 3642 -
#12 Short offline Completed without error 00% 3618 -
#13 Short offline Completed without error 00% 3594 -
#14 Short offline Completed without error 00% 3570 -
#15 Short offline Completed without error 00% 3546 -
#16 Short offline Completed without error 00% 3522 -
#17 Short offline Completed without error 00% 3498 -
#18 Short offline Completed without error 00% 3474 -
#19 Short offline Completed without error 00% 3450 -
#20 Short offline Completed without error 00% 3426 -
#21 Short offline Completed without error 00% 3402 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Code:
bash-4.2# smartctl -a /dev/sdk
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.2.45] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model: ST2000DM001-1CH164
Serial Number: S1E1RH1L
LU WWN Device Id: 5 000c50 060fae855
Firmware Version: CC24
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Feb 19 19:46:27 2014 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 575) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 210) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 189150976
3 Spin_Up_Time 0x0003 095 095 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 42
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 080 060 030 Pre-fail Always - 4395936207
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3873
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 42
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 089 089 000 Old_age Always - 11
190 Airflow_Temperature_Cel 0x0022 073 064 045 Old_age Always - 27 (Min/Max 22/29)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 13
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 108
194 Temperature_Celsius 0x0022 027 040 000 Old_age Always - 27 (0 18 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 158832185577246
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 36909133199
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 7977923618
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 3873 -
# 2 Short offline Completed without error 00% 3862 -
# 3 Short offline Completed without error 00% 3838 -
# 4 Short offline Completed without error 00% 3814 -
# 5 Short offline Completed without error 00% 3790 -
# 6 Short offline Completed without error 00% 3766 -
# 7 Short offline Completed without error 00% 3742 -
# 8 Short offline Completed without error 00% 3718 -
# 9 Short offline Completed without error 00% 3694 -
#10 Short offline Completed without error 00% 3670 -
#11 Short offline Completed without error 00% 3646 -
#12 Short offline Completed without error 00% 3622 -
#13 Short offline Completed without error 00% 3598 -
#14 Short offline Completed without error 00% 3574 -
#15 Short offline Completed without error 00% 3550 -
#16 Short offline Completed without error 00% 3526 -
#17 Short offline Completed without error 00% 3502 -
#18 Short offline Completed without error 00% 3478 -
#19 Short offline Completed without error 00% 3454 -
#20 Short offline Completed without error 00% 3430 -
#21 Short offline Completed without error 00% 3406 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
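Since both reports are long, here is a small sketch that pulls out only the attributes I would look at first. It assumes the ten-column attribute table printed by `smartctl -a` as shown above; the set of watched attributes is my own selection (an assumption, not an official list), and `parse_attributes` and `suspicious` are made-up helper names. The rows in the example are copied from the /dev/sdj output.

```python
# Attributes whose raw value should normally stay at 0 (my selection, an assumption).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Command_Timeout",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}


def parse_attributes(smartctl_text):
    """Map ATTRIBUTE_NAME to the first integer of RAW_VALUE."""
    attrs = {}
    for line in smartctl_text.splitlines():
        parts = line.split()
        # Row layout: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
        if len(parts) >= 10 and parts[0].isdigit() and parts[2].startswith("0x"):
            raw = parts[9]
            if raw.isdigit():
                attrs[parts[1]] = int(raw)
    return attrs


def suspicious(attrs):
    """Return the watched attributes that have a non-zero raw value."""
    return {name: raw for name, raw in attrs.items() if name in WATCHED and raw > 0}


# Rows copied from the /dev/sdj report above.
SDJ_ROWS = """\
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 096 096 000 Old_age Always - 4
190 Airflow_Temperature_Cel 0x0022 074 064 045 Old_age Always - 26 (Min/Max 22/28)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

print(suspicious(parse_attributes(SDJ_ROWS)))
```

For both drives above, every watched attribute has a raw value of 0, including UDMA_CRC_Error_Count, which is the one I would expect to climb with a bad SATA cable or port. Note also that only short self-tests appear in the logs; an extended test (`smartctl -t long`) has not been run on either drive.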