Degraded Volume, discuss before I replace disk

JayG30 · Jul 28, 2015

I'm cross posting this with the freenas forums.
Hoping someone here might know anything about this.

Today I noticed one of my freenas servers was in a degraded state. I found out a bit late it seems because my email moved the messages to "clutter" (sigh). Anyway, I'm just trying to determine if anyone might see something other than a disk issue.

When I logged into the web GUI, it showed a few hundred write errors for that disk (as did the initial zpool status shown below). Mind you, I've seen something like this when a disk falls out of the ZFS RAID while data is being written: it keeps trying for a while before it realizes the disk isn't available.

In dmesg it shows;

Code:

(da5:mps0:0:13:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 555 command timeout cm 0xffffff8000b02718 ccb 0xfffffe004101f000
(noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xffffff8000b02718
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 93 50 00 00 40 00 length 32768 SMID 337 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 93 10 00 00 40 00 length 32768 SMID 363 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 92 d0 00 00 40 00 length 32768 SMID 841 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 92 90 00 00 40 00 length 32768 SMID 220 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 92 50 00 00 40 00 length 32768 SMID 748 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 92 10 00 00 40 00 length 32768 SMID 321 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 91 d0 00 00 40 00 length 32768 SMID 515 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 91 90 00 00 40 00 length 32768 SMID 745 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 91 50 00 00 40 00 length 32768 SMID 868 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 8e a0 00 00 40 00 length 32768 SMID 632 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 466 terminated ioc 804b scsi 0 state c xfer 0
mps0: IOCStatus = 0x4b while resetting device 0xf
(da5:mps0:0:13:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da5:mps0:0:13:0): CAM status: Command timeout
(da5:mps0:0:13:0): Retrying command
da5 at mps0 bus 0 scbus0 target 13 lun 0
da5: <ATA TOSHIBA MG03ACA3 FL1A> s/n            53K7K7JPF detached
(da5:mps0:0:13:0): Periph destroyed
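
For anyone trying to read the WRITE(10) lines above, here is a minimal Python sketch that decodes the CDB bytes. It assumes the standard 10-byte READ/WRITE(10) layout (opcode, flags, 4-byte big-endian LBA, group, 2-byte big-endian transfer length, control) and a 512-byte logical block size. The function name `decode_write10` is made up for illustration and is not part of any tool mentioned in this thread.

```python
import struct


def decode_write10(cdb_hex, block_size=512):
    """Decode a 10-byte WRITE(10) CDB given as space-separated hex bytes.

    block_size=512 is an assumption about the drive's logical sector size.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    # Big-endian: opcode, flags, LBA (4 bytes), group, transfer length (2 bytes), control
    opcode, _flags, lba, _group, blocks, _control = struct.unpack(">BBIBHB", cdb)
    return {
        "opcode": opcode,
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,
    }


if __name__ == "__main__":
    # First WRITE(10) line from the dmesg output above
    print(decode_write10("2a 00 10 74 93 50 00 00 40 00"))
```

Under those assumptions, the transfer length field 00 40 is 64 blocks, and 64 x 512 = 32768, which matches the "length 32768" printed on each WRITE(10) line. The failed writes are all ordinary 32 KiB writes to neighbouring LBAs.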

The volume initially showed the disk as unavailable;

Code:

[root@freenas] ~# zpool status -v store
  pool: store
state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 0h26m with 0 errors on Sun Jul 19 00:26:39 2015
config:
        NAME                                            STATE     READ WRITE CKSUM
        store                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/1c383e96-d315-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            gptid/90b50eaf-d315-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            gptid/284a6fc3-d316-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            gptid/c66e0391-d317-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            gptid/14a02475-d318-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            559548462891584750                          UNAVAIL      3   246     0  was /dev/gptid/5178ef38-d319-11e4-98c7-0cc47a335ac4
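
If you want to catch this sort of thing without relying on email alerts, a rough Python sketch like the following can pick the troubled devices out of saved `zpool status` text. The function name `troubled_devices` and the set of states it treats as trouble are my own assumptions for illustration; the sample is trimmed from the output above.

```python
# States treated as "needs attention" here; this set is an assumption for the sketch.
TROUBLE_STATES = {"UNAVAIL", "FAULTED", "REMOVED", "OFFLINE"}

SAMPLE = """\
config:
        NAME                                            STATE     READ WRITE CKSUM
        store                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/1c383e96-d315-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            559548462891584750                          UNAVAIL      3   246     0  was /dev/gptid/5178ef38-d319-11e4-98c7-0cc47a335ac4
"""


def troubled_devices(status_text):
    """Return config rows from `zpool status` text whose state is in TROUBLE_STATES."""
    rows = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME":
            # Header row of the config table; device rows follow
            in_config = True
            continue
        if not in_config or len(fields) < 5:
            continue
        name, state, read, write, cksum = fields[:5]
        if state in TROUBLE_STATES:
            # Counters are kept as strings since zpool may abbreviate large values
            rows.append({"name": name, "state": state,
                         "read": read, "write": write, "cksum": cksum})
    return rows


if __name__ == "__main__":
    for row in troubled_devices(SAMPLE):
        print(row)
```

This only reads text that has already been captured; it does not run `zpool` itself.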

I rebooted the server, but nothing changed. So I had someone on site remove the disk for me, first to check the S/N and second to see if I could online it and have it rebuild itself. After removing it, the status of the disk changed to "removed". Subsequent reboots of the server have made the volume show as "resilvering", but the disk never came online, even after trying to force it online with the zpool online command. It seems the disk initially shows as "unavailable" after a reboot and during resilvering, but it is now back to showing "removed".

Furthermore, I can't even see the disk in smartctl. It seems it is being removed, per the dmesg shown above: "(da5:mps0:0:13:0): Periph destroyed". I had hoped to check the smartctl readings, but can't since the disk isn't showing up at all.

My gut says the disk went bad. I filed an RMA for it and will go down to check it tomorrow. But perhaps someone might have an idea.

Some more information that I'm sure people will be looking for:
The disks are all the same make/model.
The server was put together in late March/early April.
The server was stress tested using the scripts jgreco posted in the freenas forums somewhere. Had no issues.
This is the first real issue I've had with it.

Specifications;

Code:

Case: SuperMicro CSE-826E16-R1200LPB
Backplane: BPN-SAS2-826EL1
Motherboard: SUPERMICRO MBD-X10SL7-F-O
HBA: onboard LSI 2308 (firmware P16, as recommended by FreeNAS)
CPU: Intel Xeon E3-1231V3
RAM: Crucial CT2KIT102472BD160B (2 x 8GB)
HDD: 6 x Toshiba MG03ACA300 3TB Enterprise SATA
Norco SFF8087 reverse breakout cable

The errors at the top of dmesg look a lot like the ones in THIS thread.

MiniKnight · Jul 28, 2015

I'd replace the disk if you don't have confidence in it. I'd imagine your data is worth more than the $130 for a new drive.

EffrafaxOfWug · Jul 29, 2015

First off, now is one of the times to make double, triple, quadruple sure that you have a recent backup and that it's restorable.

Secondly, if the array has thrown out the drive (or possibly port, but that's another story) already and it's not being seen by the OS, it's either already toast or is being lowered into the toaster as we speak.

Thirdly, RMA'd drives (in my experience here in the UK) are almost always refurbs rather than being brand new (which is why if a disc fails in the first year I always push for a refund instead of a replacement). Every minute counts for rebuilds on such large arrays so personally I'd have bought a new drive straight away and kept the RMA on the shelf as a cold-standby replacement.

JayG30 · Jul 29, 2015

Thanks guys. I'm not too concerned, to be honest.
This is a 2nd tier backup device and is set up in a RAIDZ2. A lot would have to go wrong all at once for me to lose anything.
I just wanted to make sure nobody saw anything strange, like a backplane/expander/etc issue that I wasn't seeing.

I'm going down now to check it (make sure someone didn't just pop the disk out and not understand how to reinsert it) and the RMA with Toshiba is already underway. The warranty on it is active until 2019 via THIS link. Depending on how long it takes for them to ship me a replacement I might buy one to put in and keep the spare on the shelf when it eventually arrives.

Packing these HDDs for warranty replacement looks like a ton of fun, with all the conditions that could get them out of providing it, lol.

JayG30 · Jul 29, 2015

Well, I took the disk out of the server and plugged it into my laptop using one of those multi-format USB adapter things I have.
As the disk started up I heard a LOT of noise (and it doesn't go away), which obviously was the first bad sign. It booted, but since it was originally set up in FreeNAS, Windows can't see it. Loaded up magic partition to see if I could wipe it out. The disk shows only ~746GB (3TB disk). Obviously something is wrong. But I was able to format that 746GB of space and use CrystalDiskInfo to check the SMART info.
Health Status: Caution
Reallocated Sectors Count: 100 100 50
Current Pending Sector Count: 100 100 0

So it looks like the disk went bad. Only 6553 hours, it seems; not good.
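
A side note on the ~746GB figure, offered as a guess rather than a diagnosis: the number is what you would get if something in the USB path kept only the low 32 bits of the sector count, which would make it an adapter limitation and not a symptom of the failing disk. The short Python sketch below works through the arithmetic using the sector count that CrystalDiskInfo reports for this drive (5860533168) and an assumed 512-byte sector. The helper name `wrapped_capacity` is hypothetical.

```python
def wrapped_capacity(total_sectors, sector_size=512, lba_bits=32):
    """Capacity in bytes if the sector count is truncated to `lba_bits` bits.

    The 32-bit truncation is an assumption about the USB adapter, not a known fact.
    """
    wrapped_sectors = total_sectors % (2 ** lba_bits)
    return wrapped_sectors * sector_size


if __name__ == "__main__":
    total_sectors = 5860533168  # "# of Sectors" from the CrystalDiskInfo output in this post
    full_bytes = total_sectors * 512
    seen_bytes = wrapped_capacity(total_sectors)
    print("full size:   ", full_bytes, "bytes")
    print("wrapped size:", seen_bytes, "bytes")
    print("wrapped size in GiB:", seen_bytes / 2 ** 30)
```

The wrapped size comes to 801,569,726,464 bytes, which is about 746.5 GiB and also lines up with the 801.5 in the "Disk Size" line that CrystalDiskInfo prints for this drive.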

Code:

----------------------------------------------------------------------------
CrystalDiskInfo 6.5.2 (C) 2008-2015 hiyohiyo
                                Crystal Dew World : http://crystalmark.info/
----------------------------------------------------------------------------
-- Disk List ---------------------------------------------------------------
 (2) TOSHIBA MG03ACA300 : 3000.5 GB [1/X/X, jm1] (V=152D, P=2338)
----------------------------------------------------------------------------
 (2) TOSHIBA MG03ACA300
----------------------------------------------------------------------------
       Enclosure : TOSHIBA MG03ACA300 USB Device (V=152D, P=2338, jm1)
           Model : TOSHIBA MG03ACA300
        Firmware : FL1A
   Serial Number : 53K7K7JPF
       Disk Size : 3000.5 GB (8.4/137.4/3000.5/801.5)
     Buffer Size : Unknown
     Queue Depth : 32
    # of Sectors : 5860533168
   Rotation Rate : 7200 RPM
       Interface : USB (Serial ATA)
   Major Version : ATA8-ACS
   Minor Version : ----
   Transfer Mode : SATA/150 | SATA/600
  Power On Hours : 6553 hours
  Power On Count : 45 count
     Temperature : 37 C (98 F)
   Health Status : Caution
        Features : S.M.A.R.T., APM, 48bit LBA, NCQ
       APM Level : 0080h [ON]
       AAM Level : ----

-- S.M.A.R.T. --------------------------------------------------------------
ID Cur Wor Thr RawValues(6) Attribute Name
01 _99 _99 _50 000000000000 Read Error Rate
02 100 100 _50 000000000000 Throughput Performance
03 100 100 __1 000000002E6B Spin-Up Time
04 100 100 __0 00000000002D Start/Stop Count
05 100 100 _50 000000001466 Reallocated Sectors Count
07 100 _99 _50 000000000000 Seek Error Rate
08 100 100 _50 000000000000 Seek Time Performance
09 _84 _84 __0 000000001999 Power-On Hours
0A 100 100 _30 000000000000 Spin Retry Count
0C 100 100 __0 00000000002D Power Cycle Count
BF 100 100 __0 000000000002 G-Sense Error Rate
C0 100 100 __0 000000000024 Power-off Retract Count
C1 100 100 __0 00000000003E Load/Unload Cycle Count
C2 100 100 __0 002D000F0025 Temperature
C4 100 100 __0 0000000002BD Reallocation Event Count
C5 100 100 __0 000000000013 Current Pending Sector Count
C6 100 100 __0 000000000000 Uncorrectable Sector Count
C7 200 200 __0 000000000000 UltraDMA CRC Error Count
DC 100 100 __0 000000000000 Disk Shift
DE _84 _84 __0 000000001983 Loaded Hours
DF 100 100 __0 000000000000 Load/Unload Retry Count
E0 100 100 __0 000000000000 Load Friction
E2 100 100 __0 000000000069 Load 'In'-time
F0 __1 __1 __1 000000000010 Head Flying Hours
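
The raw values in the table above are hexadecimal, so the normalized "100 100 50" figures hide the real counts. Here is a minimal Python sketch for converting a line of that table; it assumes the CrystalDiskInfo column order shown (ID, current, worst, threshold, raw, name) with underscores used as padding. The function name `parse_smart_line` is made up for this example.

```python
def parse_smart_line(line):
    """Parse one CrystalDiskInfo S.M.A.R.T. row into a dict with decoded numbers."""
    ident, cur, wor, thr, raw, name = line.split(maxsplit=5)
    return {
        "id": int(ident, 16),
        # Normalized values are padded with underscores, e.g. "_50" or "__0"
        "current": int(cur.lstrip("_")),
        "worst": int(wor.lstrip("_")),
        "threshold": int(thr.lstrip("_")),
        # Raw value is a 12-digit hex string
        "raw": int(raw, 16),
        "name": name.strip(),
    }


if __name__ == "__main__":
    rows = [
        "05 100 100 _50 000000001466 Reallocated Sectors Count",
        "09 _84 _84 __0 000000001999 Power-On Hours",
        "C4 100 100 __0 0000000002BD Reallocation Event Count",
        "C5 100 100 __0 000000000013 Current Pending Sector Count",
    ]
    for row in rows:
        attr = parse_smart_line(row)
        print(attr["name"], "=", attr["raw"])
```

Read this way, 0x1466 is 5222 reallocated sectors, 0x2BD is 701 reallocation events, 0x13 is 19 pending sectors, and 0x1999 is the 6553 power-on hours shown in the summary. Treating the raw field as a single integer does not make sense for every attribute; the Temperature row, for example, appears to pack several values into one field.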

Terry Kennedy · Jul 29, 2015

The good news is you have a RAIDZ2, so you can lose one more drive and still access all of your data. The bad news is that since a replace operation is more stressful to the drives than normal I/O, another drive/drives can fail during the rebuild. That's one reason I never suggest doing a pull/reinsert/rebuild on any type of array - if the array software / controller detects the re-inserted drive as good and starts a rebuild, the rebuild may fail due to the same drive dropping out again. But you'll have stressed the other drives in the array for no purpose. Take the drive out and exercise / test it for an extended period before trying to re-use it (or discover it is bad, as yours is).

A common cause for DOA and low-hour drive failures is mishandling during shipping - either from the manufacturer to the distributor and reseller, or from the reseller to you. The last is the most common, as the manufacturers do a pretty good job of shipping bulk drives. Not to name names, but one popular vendor was (is?) well-known for breaking up bulk packs and just dumping naked antistatic-bagged drives into shipping boxes with some packing pillows. Needless to say, drives from them often arrived with physical damage. I actually tried to get WD to cut them off until they improved their packaging. Since I don't buy drives from that vendor any more (I get 'em from WD and other manufacturers in bulk, direct) I don't know if they've gotten better. eBay sellers are often even worse at packaging drives than that vendor.

Unfortunately, if the drive failure is due to shipping damage, the other drives that arrived in that same order are likely suspect. That's just one of the reasons for another drive to fail during a rebuild operation.

Regarding manufacturer packaging requirements for RMAs, I find that saying "I sent it back to you in the same packaging you used to send it to me" is sufficient to get the RMA processed. That assumes you have the manufacturer's packaging, of course - I have bunches of manufacturer 20-pack, 5-pack and single packaging that I use for RMAs.
