LSI SAS1068E errors and crashes during high load

Fusl · May 10, 2017

Hi,

I'm new to the STH Forum, so if I'm missing some information, please let me know and I will add as much information as possible to debug this issue.

My problem is with a SAS1068E controller (06:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)) in an HP ProLiant DL180 G6 (LFF Bays) which is randomly failing under high load conditions which is mostly random read/write to all disks (WD RE4 2TB) on Ubuntu 16.04.

Code:

0000:06:00.0 SCSI storage controller [0100]: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS [1000:0058] (rev 08)
        Subsystem: Super Micro Computer Inc SAS1068E PCI-Express Fusion-MPT SAS [15d9:a580]
        Physical Slot: 2
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 38
        Region 0: I/O ports at 6000 [size=256]
        Region 1: Memory at faef0000 (64-bit, non-prefetchable) [size=16K]
        Region 3: Memory at faee0000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE- FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [98] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
                Vector table: BAR=1 offset=00002000
                PBA: BAR=1 offset=00003000
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Kernel driver in use: mptsas
        Kernel modules: mptsas

dmesg contains a lot of complains from mptbase and storage related systems. When there is no disk activity, dmesg contains the following messages at a random interval between 1 and ~600 seconds:

Code:

mptbase: ioc0: LogInfo(0x30030108): Originator={IOP}, Code={Invalid Page}, SubCode(0x0108) cb_idx mptctl_reply

Here's a list of all mptbase-related messages sorted by frequency:

Code:

3222 mptbase: ioc0: LogInfo(0x31120403): Originator={PL}, Code={Abort}, SubCode(0x0403) cb_idx mptscsih_io_done
3178 mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000) cb_idx mptscsih_io_done
2049 mptbase: ioc0: LogInfo(0x30030108): Originator={IOP}, Code={Invalid Page}, SubCode(0x0108) cb_idx mptctl_reply
 262 mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000) cb_idx mptbase_reply
 254 mptbase: ioc0: LogInfo(0x31120403): Originator={PL}, Code={Abort}, SubCode(0x0403) cb_idx mptbase_reply
  67 mptbase: ioc0: LogInfo(0x31120101): Originator={PL}, Code={Abort}, SubCode(0x0101) cb_idx mptscsih_io_done
  53 mptbase: ioc0: LogInfo(0x31110900): Originator={PL}, Code={Reset}, SubCode(0x0900) cb_idx mptscsih_io_done
  49 mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000) cb_idx mptscsih_io_done
  19 mptbase: ioc0: LogInfo(0x31126000): Originator={PL}, Code={Abort}, SubCode(0x6000) cb_idx mptscsih_io_done
  15 mptbase: ioc0: LogInfo(0x31110e00): Originator={PL}, Code={Reset}, SubCode(0x0e00) cb_idx mptscsih_io_done
   9 mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
   7 mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000) cb_idx mptbase_reply
   7 mptbase: ioc0: LogInfo(0x31110900): Originator={PL}, Code={Reset}, SubCode(0x0900) cb_idx mptbase_reply
   6 mptbase: ioc0: LogInfo(0x31120101): Originator={PL}, Code={Abort}, SubCode(0x0101) cb_idx mptbase_reply
   2 mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
   2 mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
   2 mptbase: ioc0: LogInfo(0x31126000): Originator={PL}, Code={Abort}, SubCode(0x6000) cb_idx mptbase_reply
   2 mptbase: ioc0: LogInfo(0x31110e00): Originator={PL}, Code={Reset}, SubCode(0x0e00) cb_idx mptbase_reply

A lot of these messages appear in dmesg while the random IO is not causing any issues and the controller is pushing around 80MiB/s to each hard disk (around 1GiB/s total) but after some time, all IO to all disks fail at the same time for more than half a minute and the following messages from the kernel core are printed to dmesg:

Code:

[1669631.767544] scsi_io_completion: 51 callbacks suppressed
[1669631.767554] sd 4:0:5:0: [sdh] tag#85 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1669631.767559] sd 4:0:5:0: [sdh] tag#85 Sense Key : Medium Error [current]
[1669631.767563] sd 4:0:5:0: [sdh] tag#85 Add. Sense: Record not found
[1669631.767568] sd 4:0:5:0: [sdh] tag#85 CDB: Write(10) 2a 00 00 00 60 00 00 08 00 00
[1669631.767570] blk_update_request: 51 callbacks suppressed
[1669631.767573] blk_update_request: critical medium error, dev sdh, sector 24576

IO continues for some time and then stops again with different messages being printed:

Code:

[1669657.052284] mptscsih: ioc0: attempting task abort! (sc=ffff8812df1c4000)
[1669657.052292] sd 4:0:5:0: [sdh] tag#4 CDB: Read(10) 28 00 02 8e 28 00 00 02 00 00
[1669658.476844] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff8812df1c4000)
[1669658.476861] mptscsih: ioc0: attempting task abort! (sc=ffff881237405800)
[1669658.476867] sd 4:0:5:0: [sdh] tag#1 CDB: Write(10) 2a 00 09 f6 28 00 00 08 00 00
[1669658.476872] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff881237405800)
[1669658.476880] mptscsih: ioc0: attempting task abort! (sc=ffff881237405e00)
[1669658.476885] sd 4:0:5:0: [sdh] tag#74 CDB: Write(10) 2a 00 09 f6 20 00 00 08 00 00
[1669658.476888] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff881237405e00)
[1669658.476894] mptscsih: ioc0: attempting task abort! (sc=ffff881237404780)
[1669658.476898] sd 4:0:5:0: [sdh] tag#17 CDB: Write(10) 2a 00 09 f6 18 00 00 08 00 00
[1669658.476901] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff881237404780)
[1669658.476906] mptscsih: ioc0: attempting task abort! (sc=ffff881237404c00)
[1669658.476910] sd 4:0:5:0: [sdh] tag#102 CDB: Write(10) 2a 00 09 f6 10 00 00 08 00 00
[1669658.476913] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff881237404c00)
[1669658.476918] mptscsih: ioc0: attempting task abort! (sc=ffff880ff18f5080)
[1669658.476922] sd 4:0:5:0: [sdh] tag#61 CDB: Write(10) 2a 00 09 f6 08 00 00 08 00 00
[1669658.476925] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff880ff18f5080)
[1669658.476929] mptscsih: ioc0: attempting task abort! (sc=ffff880ff18f7780)
[1669658.476933] sd 4:0:5:0: [sdh] tag#44 CDB: Write(10) 2a 00 09 f6 00 00 00 08 00 00
[1669658.476936] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff880ff18f7780)

At some point, the controller completely crashes and stalls all IO to all attached hard disk (even the ones that weren't involved in testing).

Some other messages:

Code:

[1669944.006820] sd 4:0:2:0: [sde] tag#82 CDB: Write(10) 2a 00 0a d7 a0 00 00 08 00 00
[1669944.134950] sd 4:0:10:0: [sdm] tag#46 CDB: Read(10) 28 00 02 93 ce 08 00 02 00 00
[1669948.408968] sd 4:0:2:0: [sde] tag#121 CDB: Read(10) 28 00 02 92 7a 00 00 01 00 00
[1669948.408985] sd 4:0:2:0: [sde] tag#32 CDB: Read(10) 28 00 02 92 79 00 00 01 00 00
[1669948.408999] sd 4:0:2:0: [sde] tag#20 CDB: Write(10) 2a 00 00 00 e8 00 00 08 00 00
[1669948.409010] sd 4:0:2:0: [sde] tag#13 CDB: Write(10) 2a 00 00 00 e0 00 00 08 00 00
[1669948.409022] sd 4:0:2:0: [sde] tag#96 CDB: Write(10) 2a 00 00 00 d8 00 00 08 00 00
[1669948.409034] sd 4:0:2:0: [sde] tag#108 CDB: Write(10) 2a 00 00 00 d0 00 00 08 00 00
[1669948.409046] sd 4:0:2:0: [sde] tag#80 CDB: Write(10) 2a 00 00 00 c8 00 00 08 00 00
[1669948.409058] sd 4:0:2:0: [sde] tag#101 CDB: Write(10) 2a 00 00 00 c0 00 00 08 00 00
[1669948.409069] sd 4:0:2:0: [sde] tag#33 CDB: Write(10) 2a 00 00 00 b8 00 00 08 00 00
[1669948.409080] sd 4:0:2:0: [sde] tag#30 CDB: Write(10) 2a 00 00 00 b0 00 00 08 00 00
[1669948.409092] sd 4:0:9:0: [sdl] tag#25 CDB: Write(10) 2a 00 0a 5c f0 00 00 08 00 00
[1669948.409105] sd 4:0:7:0: [sdj] tag#95 CDB: Write(10) 2a 00 0b bd c0 00 00 08 00 00
[1669948.409116] sd 4:0:9:0: [sdl] tag#88 CDB: Write(10) 2a 00 0a 5c e8 00 00 08 00 00
[1669948.409127] sd 4:0:7:0: [sdj] tag#73 CDB: Write(10) 2a 00 0b bd b8 00 00 08 00 00
[1669948.409139] sd 4:0:9:0: [sdl] tag#2 CDB: Write(10) 2a 00 0a 5c e0 00 00 08 00 00
[1669948.409150] sd 4:0:7:0: [sdj] tag#41 CDB: Write(10) 2a 00 0b bd b0 00 00 08 00 00
[1669948.409161] sd 4:0:9:0: [sdl] tag#91 CDB: Write(10) 2a 00 0a 5c d8 00 00 08 00 00
[1669948.409173] sd 4:0:7:0: [sdj] tag#62 CDB: Write(10) 2a 00 0b bd a8 00 00 08 00 00
[1669948.409184] sd 4:0:9:0: [sdl] tag#116 CDB: Write(10) 2a 00 0a 5c d0 00 00 08 00 00
[1669948.409195] sd 4:0:7:0: [sdj] tag#14 CDB: Write(10) 2a 00 0b bd a0 00 00 08 00 00
[1669948.409206] sd 4:0:9:0: [sdl] tag#56 CDB: Read(10) 28 00 02 84 8c 00 00 02 00 00
[1669948.409217] sd 4:0:7:0: [sdj] tag#42 CDB: Read(10) 28 00 02 96 67 00 00 02 00 00
[1669948.409229] sd 4:0:11:0: [sdn] tag#118 CDB: Read(10) 28 00 02 8d 1c 00 00 02 00 00
[1669948.409241] sd 4:0:11:0: [sdn] tag#124 CDB: Write(10) 2a 00 00 00 98 00 00 08 00 00
[1669948.409252] sd 4:0:11:0: [sdn] tag#10 CDB: Write(10) 2a 00 00 00 90 00 00 08 00 00
[1669948.409263] sd 4:0:11:0: [sdn] tag#48 CDB: Write(10) 2a 00 00 00 88 00 00 08 00 00
[1669948.409274] sd 4:0:11:0: [sdn] tag#76 CDB: Write(10) 2a 00 00 00 80 00 00 08 00 00
[1669948.409285] sd 4:0:11:0: [sdn] tag#90 CDB: Write(10) 2a 00 00 00 78 00 00 08 00 00
[1669948.409297] sd 4:0:11:0: [sdn] tag#12 CDB: Write(10) 2a 00 00 00 70 00 00 08 00 00
[1669948.409308] sd 4:0:11:0: [sdn] tag#84 CDB: Write(10) 2a 00 00 00 68 00 00 08 00 00

Code:

[1670058.867952] INFO: task dd:16474 blocked for more than 120 seconds.
[1670058.868891]       Tainted: G        W I     4.4.0-72-generic #93-Ubuntu
[1670058.869896] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1670058.871008] dd              D ffff8801f5253968     0 16474  16470 0x00000000
[1670058.871014]  ffff8801f5253968 ffff8815e0901c28 ffff880adfc92640 ffff880ae1538cc0
[1670058.871018]  ffff8801f5254000 ffff880aebad6dc0 7fffffffffffffff ffff8815e4108200
[1670058.871021]  ffff880ae07ec880 ffff8801f5253980 ffffffff81838565 0000000000000000
[1670058.871024] Call Trace:
[1670058.871034]  [<ffffffff81838565>] schedule+0x35/0x80
[1670058.871038]  [<ffffffff8183b6b5>] schedule_timeout+0x1b5/0x270
[1670058.871044]  [<ffffffff813c41f3>] ? queue_unplugged+0xa3/0xb0
[1670058.871048]  [<ffffffff81837a94>] io_schedule_timeout+0xa4/0x110
[1670058.871053]  [<ffffffff8140b4be>] ? iov_iter_get_pages+0x9e/0x210
[1670058.871058]  [<ffffffff8124e7f2>] do_blockdev_direct_IO+0x19e2/0x3410
[1670058.871063]  [<ffffffff81249e20>] ? I_BDEV+0x20/0x20
[1670058.871066]  [<ffffffff81250263>] __blockdev_direct_IO+0x43/0x50
[1670058.871069]  [<ffffffff8124a8c8>] blkdev_direct_IO+0x58/0x80
[1670058.871074]  [<ffffffff81190c09>] generic_file_direct_write+0xb9/0x180
[1670058.871077]  [<ffffffff81190d92>] __generic_file_write_iter+0xc2/0x1e0
[1670058.871080]  [<ffffffff8124abbf>] blkdev_write_iter+0x8f/0x130
[1670058.871085]  [<ffffffff8120e60b>] new_sync_write+0x9b/0xe0
[1670058.871089]  [<ffffffff8120e676>] __vfs_write+0x26/0x40
[1670058.871092]  [<ffffffff8120eff9>] vfs_write+0xa9/0x1a0
[1670058.871095]  [<ffffffff8120ef34>] ? vfs_read+0x114/0x130
[1670058.871098]  [<ffffffff8120fcb5>] SyS_write+0x55/0xc0
[1670058.871102]  [<ffffffff8183c672>] entry_SYSCALL_64_fastpath+0x16/0x71
[1670186.589639] sd 4:0:13:0: [sdo] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1670186.589665] sd 4:0:13:0: [sdo] tag#5 Sense Key : Unit Attention [current]
[1670186.589671] sd 4:0:13:0: [sdo] tag#5 Add. Sense: Power on, reset, or bus device reset occurred
[1670186.589678] sd 4:0:13:0: [sdo] tag#5 CDB: Write(10) 2a 00 09 c8 b8 00 00 04 00 00
[1670186.589682] blk_update_request: I/O error, dev sdo, sector 164149248
[1670186.590993] sd 4:0:10:0: [sdm] tag#8 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1670186.590998] sd 4:0:10:0: [sdm] tag#8 Sense Key : Unit Attention [current]
[1670186.591001] sd 4:0:10:0: [sdm] tag#8 Add. Sense: Power on, reset, or bus device reset occurred
[1670186.591005] sd 4:0:10:0: [sdm] tag#8 CDB: Write(10) 2a 00 0a 76 68 00 00 07 f8 00
[1670186.591008] blk_update_request: I/O error, dev sdm, sector 175532032
[1670186.592257] sd 4:0:8:0: [sdk] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1670186.592261] sd 4:0:8:0: [sdk] tag#9 Sense Key : Unit Attention [current]
[1670186.592265] sd 4:0:8:0: [sdk] tag#9 Add. Sense: Power on, reset, or bus device reset occurred
[1670186.592271] sd 4:0:8:0: [sdk] tag#9 CDB: Write(10) 2a 00 0a be e0 00 00 08 00 00

The full dmesg output is attached, load test begins at 1668249.031352.

I hope someone can help me or answer some of my main questions which are:
- Is this controller totaled? It's pretty old, but never caused any issues on another system where the load was kept pretty low. Issues started when I used it to power my ZFS on FreeBSD.
- Is this fixable? I rarely update hardware firmwares, and if I do it's mostly just BIOS updates or something that doesn't require (much) interaction, like the HP Service Pack "DVD"s.

That being said, I really appreciate that this forum exists and although I work in IT, there's always something new to learn and I hope that I can contribute back to the community at one point.

Terry Kennedy · May 10, 2017

Fusl said:
I hope someone can help me or answer some of my main questions which are:
- Is this controller totaled? It's pretty old, but never caused any issues on another system where the load was kept pretty low. Issues started when I used it to power my ZFS on FreeBSD.

My first guess would be that it is overheating. After you start getting the errors, if you shut the system down and take the cover off, is the heatsink on the card hot?

- Is this fixable? I rarely update hardware firmwares, and if I do it's mostly just BIOS updates or something that doesn't require (much) interaction, like the HP Service Pack "DVD"s.

Normally I would suggest switching to a newer-generation LSI card. But HP systems are a bit picky and may complain about non-HP cards which can limit your choices.

Fusl · May 10, 2017

Terry Kennedy said:
if you shut the system down and take the cover off, is the heatsink on the card hot

I'm unable to check this right now as I recently moved this server to a datecenter 350km away from where I live, but considering the controller didn't behave like this in my apartment which has no air condition and now behaves like this in the datacenter where there is a proper air condition and the temperatures (CPUs, HDDs, chassis', etc.) have also dropped, I don't think that temperature is the cause of this.

The previous operating system where this has been working fine on was Debian Lenny. Behavior of the controller changed negatively when I used ZFS on FreeBSD - not immediately, but rather some months in.

Terry Kennedy said:
HP systems are a bit picky

Off-topic, but, I tried using a HP H220 HBA in this server and it didn't work in this HP ProLiant DL180 G6 (server just refused to POST), but the H220 controller works in an older model of the DL180 G6 series (the one which does not have the integrated iLO management but rather requires an external module to be attached to the mainboard).

Terry Kennedy said:
I would suggest switching to a newer-generation LSI card

Can you recommend a specific card to use with a DL180 G6 and its backplane or should I just read through and follow the LSI RAID Controller and HBA Complete Listing Plus OEM Models thread?

JustinH · May 10, 2017

A lot of the Initial messages were about SDH. Maybe a HDD is failing on you (or its connection is flimsy) and it's messing with everything else on the SAS bus?
(I have a SATA segate that would cause random lockups to the controller. When I took it out and put it on a normal SATA port it wasn't even detected)

Sent from my iPhone using Tapatalk

Fusl · May 11, 2017

JustinH said:
Maybe a HDD is failing on you

Those are all "new" drives from eBay, probably the power-on hours were just wiped so for the sake of it, I tried with a stack of 320GB, 500GB and 1TB drives (mixed) that I know are good and the controller still shows the same behavior (complaining about 1, 2 or even 3 disks at first, then starting to complain alot and suddenly failing completely).

Terry Kennedy · May 11, 2017

Fusl said:
I'm unable to check this right now as I recently moved this server to a datecenter 350km away from where I live, but considering the controller didn't behave like this in my apartment which has no air condition and now behaves like this in the datacenter where there is a proper air condition and the temperatures (CPUs, HDDs, chassis', etc.) have also dropped, I don't think that temperature is the cause of this.

OK. As I said, it was just my first guess. Airflow in some of the low-profile chassis can be a bit problematic, buf if I'm looking at the right picture, there is an air duct over the expansion card which should provide adequate cooling to the 1068 card.

You have disks up to (at least) sdn - 14 drives - and I believe the SAS1068 has a maximum of 8 ports, so there may be a SAS expander involved. If you are using SATA drives, SAS expanders can be problematic. The basic issue is that part of the expander's error recovery in that situation is going "Resets for everybody!" which is happening "behind the back" of the 1068 card / driver, so it has no idea what is going on, except that a bunch of the disks it knows about all started acting oddly at the same time.

The previous operating system where this has been working fine on was Debian Lenny. Behavior of the controller changed negatively when I used ZFS on FreeBSD - not immediately, but rather some months in.

If it didn't happen right away, I think we can rule out a driver or other operating system problem. Has the workload increased, have there been any power hits or other problems around the time this issue started? Does the SMART data for all of your drives look good?

Off-topic, but, I tried using a HP H220 HBA in this server and it didn't work in this HP ProLiant DL180 G6 (server just refused to POST), but the H220 controller works in an older model of the DL180 G6 series (the one which does not have the integrated iLO management but rather requires an external module to be attached to the mainboard).

HP systems are known to be rather annoying, both due to the restrictions in the BIOS regarding what adapters are supported (unupported adapters may be ignored or may work but force the system fans to run at high speed all the time) and because updates for officially supported HP cards require a service contract to be in place for you to be able to download the updates.

Can you recommend a specific card to use with a DL180 G6 and its backplane or should I just read through and follow the LSI RAID Controller and HBA Complete Listing Plus OEM Models thread?

If you are using the IR (integrated RAID) functionality of the 1068 card, a card in the 9211 family (4 or 8 ports, internal or external) would be the closest equivalent in the SAS2 generation. If you are using IT mode, a 9201 card would suffice. You may need to find a HP-branded card, due to how picky the HP systems are. But that at least gives you a place to look in the other thread.

Fusl · May 11, 2017

Terry Kennedy said:
there is an air duct over the expansion card

This, and the fans (8 in total) are spinning at 100% (as configured in the BIOS).

Terry Kennedy said:
14 drives

12 drives on an HP DL18xG6BP:

Code:

[4:0:12:0]   enclosu HP       DL18xG6BP        2.06  -          /dev/sg14       -
|- [4:0:0:0]    disk    ATA      WDC WD2003FYPS-2 5G11  /dev/sdc   /dev/sg2   2.00TB
|- [4:0:1:0]    disk    ATA      WDC WD2003FYPS-2 5G11  /dev/sdd   /dev/sg3   2.00TB
|- [4:0:2:0]    disk    ATA      WDC WD2003FYPS-2 5G09  /dev/sde   /dev/sg4   2.00TB
|- [4:0:3:0]    disk    ATA      WDC WD2003FYPS-2 5G11  /dev/sdf   /dev/sg5   2.00TB
|- [4:0:4:0]    disk    ATA      WDC WD2003FYPS-2 5G11  /dev/sdg   /dev/sg6   2.00TB
|- [4:0:5:0]    disk    ATA      WDC WD2003FYPS-2 5G09  /dev/sdh   /dev/sg7   2.00TB
|- [4:0:6:0]    disk    ATA      WDC WD2003FYPS-2 5G11  /dev/sdi   /dev/sg8   2.00TB
|- [4:0:7:0]    disk    ATA      WDC WD2003FYPS-2 5G09  /dev/sdj   /dev/sg9   2.00TB
|- [4:0:8:0]    disk    ATA      WDC WD2003FYPS-2 5G11  /dev/sdk   /dev/sg10  2.00TB
|- [4:0:9:0]    disk    ATA      WDC WD2003FYPS-2 5G11  /dev/sdl   /dev/sg11  2.00TB
|- [4:0:10:0]   disk    ATA      WDC WD2003FYPS-2 5G11  /dev/sdm   /dev/sg12  2.00TB
`- [4:0:11:0]   disk    ATA      WDC WD2003FYPS-2 5G11  /dev/sdn   /dev/sg13  2.00TB
[0:0:0:0]    disk    ATA      Corsair Force GT 5.05  /dev/sda   /dev/sg0    480GB
[0:0:1:0]    disk    ATA      Corsair Force GT 5.02  /dev/sdb   /dev/sg1    480GB

Terry Kennedy said:
Has the workload increased

The workload has increased by at least 4 times (previously we were limited to 2x 1Gbit/s bonded NICs, now we're using 2x 10Gbit/s failover NICs).

Terry Kennedy said:
have there been any power hits

There have been some (at least 10) unexpected electrical power cuts and few times power flickering as well, but I don't think those would cause such issues.

Terry Kennedy said:
Does the SMART data for all of your drives look good?

They look all pretty good. There's a couple read errors on all of them, it increases every time I run the load test for most of them as well, so I think those are caused by the controller crashing.

Terry Kennedy said:
IT mode

Terry Kennedy said:
You may need to find a HP-branded card, due to how picky the HP systems are

The card which us currently in place is not HP-branded, it's a pretty cheap Supermicro AOC-USAS-L8i controller, so I guess it eats almost everything?

Terry Kennedy said:
look in the other thread

I'll do that and probably get some random cards to test in this server.

Search

LSI SAS1068E errors and crashes during high load

Fusl

New Member

Attachments

Terry Kennedy

Well-Known Member

Fusl

New Member

JustinH

Active Member

Fusl

New Member

Terry Kennedy

Well-Known Member

Fusl

New Member