Some drives not seen after reboot...

rthorntn

Member
Apr 28, 2011
81
12
8
Hi,

Two Supermicro BPN-SAS-216A with 42 drives connected (open frame), LSI HBAs (2x9400-16i & 9200-8i) and a RES3TV360.

Drives are 4/5TB Seagate 2.5" SATAs. EXT4 formatted JBOD.

Corsair HX1200

Ubuntu 20.04

There seems to be no rhyme or reason: every time I reboot, some of the drives don't come up. Once I figure out which ones are missing (a few days ago it was 6), a hot-unplug followed by a hot-plug usually gets them listed in lsblk. I'm using UUIDs in fstab to mount everything; the issue is that the drives don't show up in lsblk at all.
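
For working out which drives are missing after a reboot, here is a minimal Python sketch that compares the UUID= entries in fstab against what udev has populated under /dev/disk/by-uuid. The function names (`fstab_uuids`, `missing_uuids`, `report`) are made up for illustration, and it assumes every data drive is mounted via a plain `UUID=` entry as described above.

```python
import os


def fstab_uuids(fstab_text):
    """Return the UUIDs named by UUID= entries in fstab-formatted text."""
    uuids = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        spec = line.split()[0]  # first field is the device spec
        if spec.startswith("UUID="):
            uuids.append(spec[len("UUID="):].strip('"'))
    return uuids


def missing_uuids(fstab_text, present):
    """Return fstab UUIDs that are not in the set of UUIDs currently present."""
    return [u for u in fstab_uuids(fstab_text) if u not in present]


def report(fstab_path="/etc/fstab", by_uuid_dir="/dev/disk/by-uuid"):
    """Print each fstab UUID that has no matching entry under by_uuid_dir."""
    with open(fstab_path) as f:
        text = f.read()
    present = set(os.listdir(by_uuid_dir)) if os.path.isdir(by_uuid_dir) else set()
    for uuid in missing_uuids(text, present):
        print("missing:", uuid)
```

Calling `report()` after boot should list the UUIDs of the drives that did not enumerate, which can then be matched to mount points in fstab.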

Maybe it's SAS cables, dust, heat, power, HBA/expander firmware or fsck-on-boot issues; I don't really know where to start.

Anyone had similar issues?

My train of thought was around power: the 2.5" SATA drives run their motors off 5V, and most modern PSUs have weak 5V rails. I doubt the backplanes are converting 12V to 5V, and 40 drives at 3.75W startup is 150W (the HX1200 has 150W on the 5V rail). The backplanes may do staggered spinup, I'm not sure, and the capacitors on them may dampen the spinup "surge".
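
The arithmetic behind that worry can be sketched in a few lines of Python. This is a worst-case estimate only: it assumes every drive spins up at the same moment and that the whole 3.75W startup figure is drawn from the 5V rail, neither of which has been measured here. The function name is made up for illustration.

```python
RAIL_5V_W = 150.0  # 5V rail capacity quoted above for the HX1200


def startup_power_w(drive_count, watts_per_drive=3.75):
    """Worst-case 5V startup draw if all drives spin up simultaneously."""
    return drive_count * watts_per_drive


for n in (40, 42):
    need = startup_power_w(n)
    print(f"{n} drives: {need:.1f} W needed, headroom {RAIL_5V_W - need:+.1f} W")
```

Under those assumptions 40 drives land exactly on the 150W rail limit and 42 drives exceed it by 7.5W, so any staggered spinup on the backplanes or HBAs would matter a lot.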

Most of the recent drive issues may be on the 2nd BPN-SAS-216A, which is the one connected to the RES3TV360.

I've also started thinking that Linux might boot too quickly to properly support 40+ JBOD drives.

Or it could be connectors.

Or maybe I need to upgrade firmware. If it's that, I have no real idea how to update Intel expander firmware with LSI HBAs.

Thanks for looking.

Cheers
Richard
 

rthorntn

Also, I've started getting drives dropping out. This just happened; all I had to do was remount it:
Code:
[171961.514565] sd 6:0:31:0: device_block, handle(0x003a)
[171964.263997] sd 6:0:31:0: device_unblock and setting to running, handle(0x003a)
[171964.265526] blk_update_request: I/O error, dev sdas, sector 13144 op 0x1:(WRITE) flags 0x3000 phys_seg 1 prio class 0
[171964.265575] Buffer I/O error on dev sdas, logical block 1643, lost async page write
[171964.265700] blk_update_request: I/O error, dev sdas, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[171964.301739] JBD2: Error while async write back metadata bh 1643.
[171964.301740] Aborting journal on device sdas-8.
[171964.302066] blk_update_request: I/O error, dev sdas, sector 4882432000 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[171964.302088] Buffer I/O error on dev sdas, logical block 610304000, lost sync page write
[171964.302590] JBD2: Error -5 detected when updating journal superblock for sdas-8.
[171964.303841] sd 6:0:31:0: [sdas] Synchronizing SCSI cache
[171964.303854] sd 6:0:31:0: [sdas] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[171964.343598] mpt3sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x50000d17039c2c0c)
[171964.343600] mpt3sas_cm1: removing handle(0x003a), sas_addr(0x50000d17039c2c0c)
[171964.343601] mpt3sas_cm1: enclosure logical id(0x50000d17039c2c3e), slot(12)
[171964.343602] mpt3sas_cm1: enclosure level(0x0000), connector name( C0  )
[171967.014952] scsi 6:0:44:0: Direct-Access     ATA      ST5000LM000-2AN1 0001 PQ: 0 ANSI: 6
[171967.014964] scsi 6:0:44:0: SATA: handle(0x003a), sas_addr(0x50000d17039c2c0c), phy(12), device_name(0x0000000000000000)
[171967.014966] scsi 6:0:44:0: enclosure logical id (0x50000d17039c2c3e), slot(12)
[171967.014968] scsi 6:0:44:0: enclosure level(0x0000), connector name( C0  )
[171967.015012] scsi 6:0:44:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[171967.015014] scsi 6:0:44:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[171967.020953] sd 6:0:44:0: Power-on or device reset occurred
[171967.020963] sd 6:0:44:0: Attached scsi generic sg47 type 0
[171967.021012] mpt3sas_cm1: log_info(0x31200205): originator(PL), code(0x20), sub_code(0x0205)
[171967.023381]  end_device-6:0:35: add: handle(0x003a), sas_addr(0x50000d17039c2c0c)
[171967.362848] sd 6:0:44:0: [sdas] 9767541168 512-byte logical blocks: (5.00 TB/4.55 TiB)
[171967.362851] sd 6:0:44:0: [sdas] 4096-byte physical blocks
[171967.447252] sd 6:0:44:0: [sdas] Write Protect is off
[171967.447254] sd 6:0:44:0: [sdas] Mode Sense: 9b 00 10 08
[171967.450959] sd 6:0:44:0: [sdas] Write cache: enabled, read cache: enabled, supports DPO and FUA
[171967.772715] sd 6:0:44:0: [sdas] Attached SCSI disk
[178260.816156] EXT4-fs (sdas): recovery complete
[178260.838887] EXT4-fs (sdas): mounted filesystem with ordered data mode. Opts: (null)
 

rthorntn

This was the one that dropped out before:
Code:
[54041.277101] sd 6:0:35:0: [sdau] tag#2820 CDB: Read(16) 88 00 00 00 00 00 c8 40 62 b8 00 00 00 18 00 00
[54044.301937] blk_update_request: I/O error, dev sdau, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[54044.317313] sd 6:0:35:0: [sdau] Synchronizing SCSI cache
[54044.317339] sd 6:0:35:0: [sdau] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[54045.048760] scsi 6:0:35:0: [sdau] tag#781 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[54045.048769] scsi 6:0:35:0: [sdau] tag#4014 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[54045.048780] scsi 6:0:35:0: [sdau] tag#781 CDB: Read(16) 88 00 00 00 00 01 87 f5 a7 68 00 00 00 18 00 00
[54045.048782] scsi 6:0:35:0: [sdau] tag#4014 CDB: Read(16) 88 00 00 00 00 01 51 e5 e2 58 00 00 00 40 00 00
[54045.048787] blk_update_request: I/O error, dev sdau, sector 6575990632 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[54045.048790] blk_update_request: I/O error, dev sdau, sector 5668987480 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[54045.048894] scsi 6:0:35:0: [sdau] tag#2820 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[54045.048897] scsi 6:0:35:0: [sdau] tag#2820 CDB: Read(16) 88 00 00 00 00 00 c8 40 62 b8 00 00 00 18 00 00
[54045.048900] blk_update_request: I/O error, dev sdau, sector 3359662776 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
I did a hot-unplug followed by a hot-plug and this came back up as sdba.
 

rthorntn

Thanks, I can get a photo tomorrow. It's all over the place at the moment: the BPN-SAS-216A cages were drilled out of SC216s, and I have fans blowing on them.

I don't have fans blowing on the expander or the HBAs. It's winter here and the house is cold; I plan to add fans when they arrive.
 

Spearfoot

Active Member
Apr 22, 2015
111
51
28
If that BPN-SAS-216A backplane is a SAS1 device -- and I'm pretty sure it is -- then your problem is that SAS1 backplanes don't fully support drives larger than 2.2TB. Sometimes you can get away with using just a few larger disks, but your system seems to be fully populated.

Best bet is to get a SAS2 backplane.
 

rthorntn

And again, different drive...

Code:
[236542.019646] sd 6:0:32:0: attempting task abort!scmd(0x000000008ec390dd), outstanding for 30732 ms & timeout 30000 ms
[236542.019651] sd 6:0:32:0: [sdat] tag#1636 CDB: Read(16) 88 00 00 00 00 01 44 77 eb 10 00 00 00 18 00 00
[236542.019654] scsi target6:0:32: handle(0x0039), sas_address(0x50000d17039c2c0b), phy(11)
[236542.019656] scsi target6:0:32: enclosure logical id(0x50000d17039c2c3e), slot(11)
[236542.019658] scsi target6:0:32: enclosure level(0x0000), connector name( C0  )
[236542.395301] sd 6:0:32:0: device_block, handle(0x0039)
[236545.144793] sd 6:0:32:0: device_unblock and setting to running, handle(0x0039)
[236545.145772] blk_update_request: I/O error, dev sdat, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[236545.188495] sd 6:0:32:0: [sdat] Synchronizing SCSI cache
[236545.188533] sd 6:0:32:0: [sdat] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[236545.893211] scsi 6:0:32:0: task abort: SUCCESS scmd(0x000000008ec390dd)
[236545.893223] scsi 6:0:32:0: [sdat] tag#1636 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[236545.893226] scsi 6:0:32:0: [sdat] tag#1636 CDB: Read(16) 88 00 00 00 00 01 44 77 eb 10 00 00 00 18 00 00
[236545.893230] blk_update_request: I/O error, dev sdat, sector 5443676944 op 0x0:(READ) flags 0x80700 phys_seg 2 prio class 0
[236545.893726] scsi 6:0:32:0: attempting task abort!scmd(0x00000000fb2866a2), outstanding for 34604 ms & timeout 30000 ms
[236545.893728] scsi 6:0:32:0: [sdat] tag#1245 CDB: Read(16) 88 00 00 00 00 01 95 8d f6 98 00 00 00 18 00 00
[236545.893731] scsi target6:0:32: handle(0x0039), sas_address(0x50000d17039c2c0b), phy(11)
[236545.893733] scsi target6:0:32: enclosure logical id(0x50000d17039c2c3e), slot(11)
[236545.893734] scsi target6:0:32: enclosure level(0x0000), connector name( C0  )
[236545.893735] scsi 6:0:32:0: No reference found at driver, assuming scmd(0x00000000fb2866a2) might have completed
[236545.893737] scsi 6:0:32:0: task abort: SUCCESS scmd(0x00000000fb2866a2)
[236545.893739] scsi 6:0:32:0: [sdat] tag#1245 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[236545.893740] scsi 6:0:32:0: [sdat] tag#1245 CDB: Read(16) 88 00 00 00 00 01 95 8d f6 98 00 00 00 18 00 00
[236545.893742] blk_update_request: I/O error, dev sdat, sector 6804076184 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[236545.898685] mpt3sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x50000d17039c2c0b)
[236545.898687] mpt3sas_cm1: removing handle(0x0039), sas_addr(0x50000d17039c2c0b)
[236545.898688] mpt3sas_cm1: enclosure logical id(0x50000d17039c2c3e), slot(11)
[236545.898689] mpt3sas_cm1: enclosure level(0x0000), connector name( C0  )
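
To see whether the drop-outs cluster on particular slots, the removal events in logs like the ones above can be tallied. This is a rough Python sketch, not a tested tool: it assumes the mpt3sas driver always prints the `removing handle(...)` line followed by an `enclosure logical id(...), slot(N)` line from the same controller, as it does in these excerpts. The function name `count_removals` is made up for illustration.

```python
import re
from collections import Counter

# Patterns taken from the mpt3sas lines in the excerpts above.
REMOVE_RE = re.compile(r"(mpt3sas_cm\d+): removing handle\((0x[0-9a-f]+)\)")
SLOT_RE = re.compile(
    r"(mpt3sas_cm\d+): enclosure logical id\((0x[0-9a-f]+)\), slot\((\d+)\)"
)


def count_removals(log_text):
    """Count device removals per (controller, enclosure id, slot)."""
    counts = Counter()
    pending = None  # controller whose removal is still waiting for its slot line
    for line in log_text.splitlines():
        m = REMOVE_RE.search(line)
        if m:
            pending = m.group(1)
            continue
        m = SLOT_RE.search(line)
        if m and pending == m.group(1):
            counts[(m.group(1), m.group(2), int(m.group(3)))] += 1
            pending = None
    return counts
```

Passing the full text of `dmesg` to `count_removals` gives a per-slot count, which should make it easier to tell whether the drops follow one backplane, one cable, or are spread evenly.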
 

rthorntn

8 drives just dropped. I was able to remount 3 of them, but I have to unplug and replug the other 5.

This issue is getting worse.

I'm thinking it could be power, because really the only thing that's changed is the number of drives.

I'm at 42 drives now; this time last week I was at 36. I have been adding a couple of drives every couple of days.