Hi,
I'm new to the STH Forum, so if I'm missing some information, please let me know and I will add as much information as possible to debug this issue.
My problem is with a SAS1068E controller (06:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)) in an HP ProLiant DL180 G6 (LFF Bays) which is randomly failing under high load conditions which is mostly random read/write to all disks (WD RE4 2TB) on Ubuntu 16.04.
dmesg contains a lot of complains from mptbase and storage related systems. When there is no disk activity, dmesg contains the following messages at a random interval between 1 and ~600 seconds:
Here's a list of all mptbase-related messages sorted by frequency:
A lot of these messages appear in dmesg while the random IO is not causing any issues and the controller is pushing around 80MiB/s to each hard disk (around 1GiB/s total) but after some time, all IO to all disks fail at the same time for more than half a minute and the following messages from the kernel core are printed to dmesg:
IO continues for some time and then stops again with different messages being printed:
At some point, the controller completely crashes and stalls all IO to all attached hard disk (even the ones that weren't involved in testing).
Some other messages:
The full dmesg output is attached, load test begins at 1668249.031352.
I hope someone can help me or answer some of my main questions which are:
- Is this controller totaled? It's pretty old, but never caused any issues on another system where the load was kept pretty low. Issues started when I used it to power my ZFS on FreeBSD.
- Is this fixable? I rarely update hardware firmwares, and if I do it's mostly just BIOS updates or something that doesn't require (much) interaction, like the HP Service Pack "DVD"s.
That being said, I really appreciate that this forum exists and although I work in IT, there's always something new to learn and I hope that I can contribute back to the community at one point.
I'm new to the STH Forum, so if I'm missing some information, please let me know and I will add as much information as possible to debug this issue.
My problem is with a SAS1068E controller (06:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)) in an HP ProLiant DL180 G6 (LFF Bays) which is randomly failing under high load conditions which is mostly random read/write to all disks (WD RE4 2TB) on Ubuntu 16.04.
Code:
0000:06:00.0 SCSI storage controller [0100]: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS [1000:0058] (rev 08)
Subsystem: Super Micro Computer Inc SAS1068E PCI-Express Fusion-MPT SAS [15d9:a580]
Physical Slot: 2
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 38
Region 0: I/O ports at 6000 [size=256]
Region 1: Memory at faef0000 (64-bit, non-prefetchable) [size=16K]
Region 3: Memory at faee0000 (64-bit, non-prefetchable) [size=64K]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Capabilities: [98] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
Vector table: BAR=1 offset=00002000
PBA: BAR=1 offset=00003000
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Kernel driver in use: mptsas
Kernel modules: mptsas
dmesg contains a lot of complains from mptbase and storage related systems. When there is no disk activity, dmesg contains the following messages at a random interval between 1 and ~600 seconds:
Code:
mptbase: ioc0: LogInfo(0x30030108): Originator={IOP}, Code={Invalid Page}, SubCode(0x0108) cb_idx mptctl_reply
Code:
3222 mptbase: ioc0: LogInfo(0x31120403): Originator={PL}, Code={Abort}, SubCode(0x0403) cb_idx mptscsih_io_done
3178 mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000) cb_idx mptscsih_io_done
2049 mptbase: ioc0: LogInfo(0x30030108): Originator={IOP}, Code={Invalid Page}, SubCode(0x0108) cb_idx mptctl_reply
262 mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000) cb_idx mptbase_reply
254 mptbase: ioc0: LogInfo(0x31120403): Originator={PL}, Code={Abort}, SubCode(0x0403) cb_idx mptbase_reply
67 mptbase: ioc0: LogInfo(0x31120101): Originator={PL}, Code={Abort}, SubCode(0x0101) cb_idx mptscsih_io_done
53 mptbase: ioc0: LogInfo(0x31110900): Originator={PL}, Code={Reset}, SubCode(0x0900) cb_idx mptscsih_io_done
49 mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000) cb_idx mptscsih_io_done
19 mptbase: ioc0: LogInfo(0x31126000): Originator={PL}, Code={Abort}, SubCode(0x6000) cb_idx mptscsih_io_done
15 mptbase: ioc0: LogInfo(0x31110e00): Originator={PL}, Code={Reset}, SubCode(0x0e00) cb_idx mptscsih_io_done
9 mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
7 mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000) cb_idx mptbase_reply
7 mptbase: ioc0: LogInfo(0x31110900): Originator={PL}, Code={Reset}, SubCode(0x0900) cb_idx mptbase_reply
6 mptbase: ioc0: LogInfo(0x31120101): Originator={PL}, Code={Abort}, SubCode(0x0101) cb_idx mptbase_reply
2 mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
2 mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
2 mptbase: ioc0: LogInfo(0x31126000): Originator={PL}, Code={Abort}, SubCode(0x6000) cb_idx mptbase_reply
2 mptbase: ioc0: LogInfo(0x31110e00): Originator={PL}, Code={Reset}, SubCode(0x0e00) cb_idx mptbase_reply
Code:
[1669631.767544] scsi_io_completion: 51 callbacks suppressed
[1669631.767554] sd 4:0:5:0: [sdh] tag#85 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1669631.767559] sd 4:0:5:0: [sdh] tag#85 Sense Key : Medium Error [current]
[1669631.767563] sd 4:0:5:0: [sdh] tag#85 Add. Sense: Record not found
[1669631.767568] sd 4:0:5:0: [sdh] tag#85 CDB: Write(10) 2a 00 00 00 60 00 00 08 00 00
[1669631.767570] blk_update_request: 51 callbacks suppressed
[1669631.767573] blk_update_request: critical medium error, dev sdh, sector 24576
Code:
[1669657.052284] mptscsih: ioc0: attempting task abort! (sc=ffff8812df1c4000)
[1669657.052292] sd 4:0:5:0: [sdh] tag#4 CDB: Read(10) 28 00 02 8e 28 00 00 02 00 00
[1669658.476844] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff8812df1c4000)
[1669658.476861] mptscsih: ioc0: attempting task abort! (sc=ffff881237405800)
[1669658.476867] sd 4:0:5:0: [sdh] tag#1 CDB: Write(10) 2a 00 09 f6 28 00 00 08 00 00
[1669658.476872] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff881237405800)
[1669658.476880] mptscsih: ioc0: attempting task abort! (sc=ffff881237405e00)
[1669658.476885] sd 4:0:5:0: [sdh] tag#74 CDB: Write(10) 2a 00 09 f6 20 00 00 08 00 00
[1669658.476888] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff881237405e00)
[1669658.476894] mptscsih: ioc0: attempting task abort! (sc=ffff881237404780)
[1669658.476898] sd 4:0:5:0: [sdh] tag#17 CDB: Write(10) 2a 00 09 f6 18 00 00 08 00 00
[1669658.476901] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff881237404780)
[1669658.476906] mptscsih: ioc0: attempting task abort! (sc=ffff881237404c00)
[1669658.476910] sd 4:0:5:0: [sdh] tag#102 CDB: Write(10) 2a 00 09 f6 10 00 00 08 00 00
[1669658.476913] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff881237404c00)
[1669658.476918] mptscsih: ioc0: attempting task abort! (sc=ffff880ff18f5080)
[1669658.476922] sd 4:0:5:0: [sdh] tag#61 CDB: Write(10) 2a 00 09 f6 08 00 00 08 00 00
[1669658.476925] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff880ff18f5080)
[1669658.476929] mptscsih: ioc0: attempting task abort! (sc=ffff880ff18f7780)
[1669658.476933] sd 4:0:5:0: [sdh] tag#44 CDB: Write(10) 2a 00 09 f6 00 00 00 08 00 00
[1669658.476936] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff880ff18f7780)
Some other messages:
Code:
[1669944.006820] sd 4:0:2:0: [sde] tag#82 CDB: Write(10) 2a 00 0a d7 a0 00 00 08 00 00
[1669944.134950] sd 4:0:10:0: [sdm] tag#46 CDB: Read(10) 28 00 02 93 ce 08 00 02 00 00
[1669948.408968] sd 4:0:2:0: [sde] tag#121 CDB: Read(10) 28 00 02 92 7a 00 00 01 00 00
[1669948.408985] sd 4:0:2:0: [sde] tag#32 CDB: Read(10) 28 00 02 92 79 00 00 01 00 00
[1669948.408999] sd 4:0:2:0: [sde] tag#20 CDB: Write(10) 2a 00 00 00 e8 00 00 08 00 00
[1669948.409010] sd 4:0:2:0: [sde] tag#13 CDB: Write(10) 2a 00 00 00 e0 00 00 08 00 00
[1669948.409022] sd 4:0:2:0: [sde] tag#96 CDB: Write(10) 2a 00 00 00 d8 00 00 08 00 00
[1669948.409034] sd 4:0:2:0: [sde] tag#108 CDB: Write(10) 2a 00 00 00 d0 00 00 08 00 00
[1669948.409046] sd 4:0:2:0: [sde] tag#80 CDB: Write(10) 2a 00 00 00 c8 00 00 08 00 00
[1669948.409058] sd 4:0:2:0: [sde] tag#101 CDB: Write(10) 2a 00 00 00 c0 00 00 08 00 00
[1669948.409069] sd 4:0:2:0: [sde] tag#33 CDB: Write(10) 2a 00 00 00 b8 00 00 08 00 00
[1669948.409080] sd 4:0:2:0: [sde] tag#30 CDB: Write(10) 2a 00 00 00 b0 00 00 08 00 00
[1669948.409092] sd 4:0:9:0: [sdl] tag#25 CDB: Write(10) 2a 00 0a 5c f0 00 00 08 00 00
[1669948.409105] sd 4:0:7:0: [sdj] tag#95 CDB: Write(10) 2a 00 0b bd c0 00 00 08 00 00
[1669948.409116] sd 4:0:9:0: [sdl] tag#88 CDB: Write(10) 2a 00 0a 5c e8 00 00 08 00 00
[1669948.409127] sd 4:0:7:0: [sdj] tag#73 CDB: Write(10) 2a 00 0b bd b8 00 00 08 00 00
[1669948.409139] sd 4:0:9:0: [sdl] tag#2 CDB: Write(10) 2a 00 0a 5c e0 00 00 08 00 00
[1669948.409150] sd 4:0:7:0: [sdj] tag#41 CDB: Write(10) 2a 00 0b bd b0 00 00 08 00 00
[1669948.409161] sd 4:0:9:0: [sdl] tag#91 CDB: Write(10) 2a 00 0a 5c d8 00 00 08 00 00
[1669948.409173] sd 4:0:7:0: [sdj] tag#62 CDB: Write(10) 2a 00 0b bd a8 00 00 08 00 00
[1669948.409184] sd 4:0:9:0: [sdl] tag#116 CDB: Write(10) 2a 00 0a 5c d0 00 00 08 00 00
[1669948.409195] sd 4:0:7:0: [sdj] tag#14 CDB: Write(10) 2a 00 0b bd a0 00 00 08 00 00
[1669948.409206] sd 4:0:9:0: [sdl] tag#56 CDB: Read(10) 28 00 02 84 8c 00 00 02 00 00
[1669948.409217] sd 4:0:7:0: [sdj] tag#42 CDB: Read(10) 28 00 02 96 67 00 00 02 00 00
[1669948.409229] sd 4:0:11:0: [sdn] tag#118 CDB: Read(10) 28 00 02 8d 1c 00 00 02 00 00
[1669948.409241] sd 4:0:11:0: [sdn] tag#124 CDB: Write(10) 2a 00 00 00 98 00 00 08 00 00
[1669948.409252] sd 4:0:11:0: [sdn] tag#10 CDB: Write(10) 2a 00 00 00 90 00 00 08 00 00
[1669948.409263] sd 4:0:11:0: [sdn] tag#48 CDB: Write(10) 2a 00 00 00 88 00 00 08 00 00
[1669948.409274] sd 4:0:11:0: [sdn] tag#76 CDB: Write(10) 2a 00 00 00 80 00 00 08 00 00
[1669948.409285] sd 4:0:11:0: [sdn] tag#90 CDB: Write(10) 2a 00 00 00 78 00 00 08 00 00
[1669948.409297] sd 4:0:11:0: [sdn] tag#12 CDB: Write(10) 2a 00 00 00 70 00 00 08 00 00
[1669948.409308] sd 4:0:11:0: [sdn] tag#84 CDB: Write(10) 2a 00 00 00 68 00 00 08 00 00
Code:
[1670058.867952] INFO: task dd:16474 blocked for more than 120 seconds.
[1670058.868891] Tainted: G W I 4.4.0-72-generic #93-Ubuntu
[1670058.869896] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1670058.871008] dd D ffff8801f5253968 0 16474 16470 0x00000000
[1670058.871014] ffff8801f5253968 ffff8815e0901c28 ffff880adfc92640 ffff880ae1538cc0
[1670058.871018] ffff8801f5254000 ffff880aebad6dc0 7fffffffffffffff ffff8815e4108200
[1670058.871021] ffff880ae07ec880 ffff8801f5253980 ffffffff81838565 0000000000000000
[1670058.871024] Call Trace:
[1670058.871034] [<ffffffff81838565>] schedule+0x35/0x80
[1670058.871038] [<ffffffff8183b6b5>] schedule_timeout+0x1b5/0x270
[1670058.871044] [<ffffffff813c41f3>] ? queue_unplugged+0xa3/0xb0
[1670058.871048] [<ffffffff81837a94>] io_schedule_timeout+0xa4/0x110
[1670058.871053] [<ffffffff8140b4be>] ? iov_iter_get_pages+0x9e/0x210
[1670058.871058] [<ffffffff8124e7f2>] do_blockdev_direct_IO+0x19e2/0x3410
[1670058.871063] [<ffffffff81249e20>] ? I_BDEV+0x20/0x20
[1670058.871066] [<ffffffff81250263>] __blockdev_direct_IO+0x43/0x50
[1670058.871069] [<ffffffff8124a8c8>] blkdev_direct_IO+0x58/0x80
[1670058.871074] [<ffffffff81190c09>] generic_file_direct_write+0xb9/0x180
[1670058.871077] [<ffffffff81190d92>] __generic_file_write_iter+0xc2/0x1e0
[1670058.871080] [<ffffffff8124abbf>] blkdev_write_iter+0x8f/0x130
[1670058.871085] [<ffffffff8120e60b>] new_sync_write+0x9b/0xe0
[1670058.871089] [<ffffffff8120e676>] __vfs_write+0x26/0x40
[1670058.871092] [<ffffffff8120eff9>] vfs_write+0xa9/0x1a0
[1670058.871095] [<ffffffff8120ef34>] ? vfs_read+0x114/0x130
[1670058.871098] [<ffffffff8120fcb5>] SyS_write+0x55/0xc0
[1670058.871102] [<ffffffff8183c672>] entry_SYSCALL_64_fastpath+0x16/0x71
[1670186.589639] sd 4:0:13:0: [sdo] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1670186.589665] sd 4:0:13:0: [sdo] tag#5 Sense Key : Unit Attention [current]
[1670186.589671] sd 4:0:13:0: [sdo] tag#5 Add. Sense: Power on, reset, or bus device reset occurred
[1670186.589678] sd 4:0:13:0: [sdo] tag#5 CDB: Write(10) 2a 00 09 c8 b8 00 00 04 00 00
[1670186.589682] blk_update_request: I/O error, dev sdo, sector 164149248
[1670186.590993] sd 4:0:10:0: [sdm] tag#8 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1670186.590998] sd 4:0:10:0: [sdm] tag#8 Sense Key : Unit Attention [current]
[1670186.591001] sd 4:0:10:0: [sdm] tag#8 Add. Sense: Power on, reset, or bus device reset occurred
[1670186.591005] sd 4:0:10:0: [sdm] tag#8 CDB: Write(10) 2a 00 0a 76 68 00 00 07 f8 00
[1670186.591008] blk_update_request: I/O error, dev sdm, sector 175532032
[1670186.592257] sd 4:0:8:0: [sdk] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1670186.592261] sd 4:0:8:0: [sdk] tag#9 Sense Key : Unit Attention [current]
[1670186.592265] sd 4:0:8:0: [sdk] tag#9 Add. Sense: Power on, reset, or bus device reset occurred
[1670186.592271] sd 4:0:8:0: [sdk] tag#9 CDB: Write(10) 2a 00 0a be e0 00 00 08 00 00
I hope someone can help me or answer some of my main questions which are:
- Is this controller totaled? It's pretty old, but never caused any issues on another system where the load was kept pretty low. Issues started when I used it to power my ZFS on FreeBSD.
- Is this fixable? I rarely update hardware firmwares, and if I do it's mostly just BIOS updates or something that doesn't require (much) interaction, like the HP Service Pack "DVD"s.
That being said, I really appreciate that this forum exists and although I work in IT, there's always something new to learn and I hope that I can contribute back to the community at one point.
Attachments
-
209.5 KB Views: 1
Last edited: