Degraded Volume, discuss before I replace disk

Discussion in 'FreeBSD and FreeNAS' started by JayG30, Jul 28, 2015.

  1. JayG30

    JayG30 Active Member

    I'm cross-posting this with the FreeNAS forums.
    Hoping someone here might know something about this.

    Today I noticed one of my FreeNAS servers was in a degraded state. I found out a bit late, it seems, because my email moved the messages to "clutter" (sigh). Anyway, I'm just trying to determine if anyone might see something other than a disk issue.

    When I logged into the web GUI (and initially in the zpool status shown below) the disk had a few hundred write errors showing. (Mind you, I've seen something like this when a disk falls out of the ZFS RAID while data is being written: it keeps trying for a while before it realizes the disk isn't available.)

    In dmesg it shows:
    Code:
    (da5:mps0:0:13:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 555 command timeout cm 0xffffff8000b02718 ccb 0xfffffe004101f000
    (noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xffffff8000b02718
    (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 93 50 00 00 40 00 length 32768 SMID 337 terminated ioc 804b scsi 0 state c xfer 0
    (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 93 10 00 00 40 00 length 32768 SMID 363 terminated ioc 804b scsi 0 state c xfer 0
    (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 92 d0 00 00 40 00 length 32768 SMID 841 terminated ioc 804b scsi 0 state c xfer 0
    (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 92 90 00 00 40 00 length 32768 SMID 220 terminated ioc 804b scsi 0 state c xfer 0
    (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 92 50 00 00 40 00 length 32768 SMID 748 terminated ioc 804b scsi 0 state c xfer 0
    (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 92 10 00 00 40 00 length 32768 SMID 321 terminated ioc 804b scsi 0 state c xfer 0
    (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 91 d0 00 00 40 00 length 32768 SMID 515 terminated ioc 804b scsi 0 state c xfer 0
    (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 91 90 00 00 40 00 length 32768 SMID 745 terminated ioc 804b scsi 0 state c xfer 0
    (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 91 50 00 00 40 00 length 32768 SMID 868 terminated ioc 804b scsi 0 state c xfer 0
    (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 8e a0 00 00 40 00 length 32768 SMID 632 terminated ioc 804b scsi 0 state c xfer 0
    (da5:mps0:0:13:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 466 terminated ioc 804b scsi 0 state c xfer 0
    mps0: IOCStatus = 0x4b while resetting device 0xf
    (da5:mps0:0:13:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
    (da5:mps0:0:13:0): CAM status: Command timeout
    (da5:mps0:0:13:0): Retrying command
    da5 at mps0 bus 0 scbus0 target 13 lun 0
    da5: <ATA TOSHIBA MG03ACA3 FL1A> s/n            53K7K7JPF detached
    (da5:mps0:0:13:0): Periph destroyed
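For anyone reading the WRITE(10) lines in the dmesg output above, here is a minimal Python sketch of how the CDB bytes can be decoded. It follows the standard 10-byte READ/WRITE layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8). The function name `decode_rw10` and the 512-byte block size are assumptions made for illustration, not something taken from FreeNAS or the mps driver.

```python
def decode_rw10(cdb_hex, block_size=512):
    """Decode a 10-byte SCSI READ(10)/WRITE(10) CDB given as hex text.

    block_size=512 is an assumption; it is consistent with the
    'length 32768' reported next to each CDB in the log.
    """
    cdb = bytes.fromhex(cdb_hex)  # spaces between bytes are accepted
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    names = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    if cdb[0] not in names:
        raise ValueError(f"opcode 0x{cdb[0]:02x} is not READ(10)/WRITE(10)")
    lba = int.from_bytes(cdb[2:6], "big")     # starting logical block
    blocks = int.from_bytes(cdb[7:9], "big")  # number of blocks to transfer
    return {
        "op": names[cdb[0]],
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,
    }


info = decode_rw10("2a 00 10 74 93 50 00 00 40 00")
print(info)
```

For the first WRITE(10) line this gives 64 blocks starting at LBA 276075344, and 64 x 512 = 32768 bytes, which agrees with the "length 32768" in the log. The LBAs in the following lines step down by 0x40 (64 blocks) each, so these look like adjacent 32 KiB writes that were all terminated when the device was reset.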
    The volume initially showed the disk as unavailable:
    Code:
    [root@freenas] ~# zpool status -v store
      pool: store
     state: DEGRADED
    status: One or more devices could not be opened.  Sufficient replicas exist for
            the pool to continue functioning in a degraded state.
    action: Attach the missing device and online it using 'zpool online'.
       see: http://illumos.org/msg/ZFS-8000-2Q
      scan: scrub repaired 0 in 0h26m with 0 errors on Sun Jul 19 00:26:39 2015
    config:
            NAME                                            STATE     READ WRITE CKSUM
            store                                           DEGRADED     0     0     0
              raidz2-0                                      DEGRADED     0     0     0
                gptid/1c383e96-d315-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
                gptid/90b50eaf-d315-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
                gptid/284a6fc3-d316-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
                gptid/c66e0391-d317-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
                gptid/14a02475-d318-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
                559548462891584750                          UNAVAIL      3   246     0  was /dev/gptid/5178ef38-d319-11e4-98c7-0cc47a335ac4
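Since the alert emails got lost in this case, a small script that flags unhealthy rows in `zpool status` output might be a useful extra check. The following is only a sketch: the function name `problem_devices` is hypothetical, the sample text is a shortened copy of the output above, and the regular expression assumes plain integer error counters (it will skip rows where ZFS abbreviates large counts, such as `1.2K`).

```python
import re

# name, STATE, READ, WRITE, CKSUM columns of the config table
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")


def problem_devices(status_text):
    """Return config rows that are not ONLINE or have nonzero error counters."""
    rows = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, status text, or a row with abbreviated counters
        name, state = m.group(1), m.group(2)
        read, write, cksum = (int(m.group(i)) for i in (3, 4, 5))
        if state != "ONLINE" or read or write or cksum:
            rows.append({"name": name, "state": state,
                         "read": read, "write": write, "cksum": cksum})
    return rows


SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        store                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/1c383e96-d315-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            559548462891584750                          UNAVAIL      3   246     0  was /dev/gptid/5178ef38-d319-11e4-98c7-0cc47a335ac4
"""

for row in problem_devices(SAMPLE):
    print(row["name"], row["state"], row["read"], row["write"], row["cksum"])
```

On the sample this reports the pool and the raidz2 vdev (both DEGRADED) plus the UNAVAIL disk with its 3 read and 246 write errors, while the healthy ONLINE disk is left out.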
    I rebooted the server but there was no change. So I had someone on site remove the disk for me, first to check the S/N and second to see if I could online it and have it rebuild itself. After removing it, the status of the disk changed to "removed". Subsequent reboots of the server have made the volume show as "resilvering", but the disk never came online, even after trying to force it online with the zpool online command. It seems the disk initially shows as "unavailable" after reboot and during resilvering, but the disk is now back to showing "removed".

    Furthermore, I can't even see the disk in smartctl. It just seems like it is being removed per the dmesg shown above, "(da5:mps0:0:13:0): Periph destroyed". I had hoped to check the smartctl readings, but can't since the disk isn't showing up at all.

    My gut says the disk went bad. I filed an RMA for it and will go down to check it tomorrow. But perhaps someone might have an idea.


    Some more information that I'm sure people will be looking for:
    The disks are all the same make/model.
    The server was put together in late March/early April.
    The server was stress tested using the scripts jgreco posted in the freenas forums somewhere. Had no issues.
    This is the first real issue I've had with it.

    Specifications;
    Code:
    Case: SuperMicro CSE-826E16-R1200LPB
    Backplane: BPN-SAS2-826EL1
    Motherboard: SUPERMICRO MBD-X10SL7-F-O
    HBA: onboard LSI 2308 (firmware P16, as recommended by FreeNAS)
    CPU: Intel Xeon E3-1231V3
    RAM: Crucial CT2KIT102472BD160B (2 x 8GB)
    HDD: 6 x Toshiba MG03ACA300 3TB Enterprise SATA
    Norco SFF-8087 reverse breakout cable
    The errors at the top of dmesg look a lot like the ones in THIS thread.
     
    #1
    Last edited: Aug 7, 2017
  2. MiniKnight

    MiniKnight Well-Known Member

    I'd replace the disk if you don't have confidence in it. I'd imagine your data is worth more than the $130 for a new drive.
     
    #2
  3. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    First off, now is one of the times to make double, triple, quadruple sure that you have a recent backup and that it's restorable.

    Secondly, if the array has thrown out the drive (or possibly port, but that's another story) already and it's not being seen by the OS, it's either already toast or is being lowered into the toaster as we speak.

    Thirdly, RMA'd drives (in my experience here in the UK) are almost always refurbs rather than being brand new (which is why if a disc fails in the first year I always push for a refund instead of a replacement). Every minute counts for rebuilds on such large arrays so personally I'd have bought a new drive straight away and kept the RMA on the shelf as a cold-standby replacement.
     
    #3
  4. JayG30

    JayG30 Active Member

    Thanks guys. I'm not too concerned to be honest.
    This is a 2nd tier backup device and is set up in a RAIDZ2. A lot would have to go wrong all at once for me to lose anything.
    I just wanted to make sure nobody saw anything strange, like a backplane/expander/etc issue that I wasn't seeing.

    I'm going down now to check it (make sure someone didn't just pop the disk out and not understand how to reinsert it) and the RMA with Toshiba is already underway. The warranty on it is active until 2019 via THIS link. Depending on how long it takes for them to ship me a replacement I might buy one to put in and keep the spare on the shelf when it eventually arrives.

    Packing these HDDs for warranty replacement looks like a ton of fun with all the conditions they use to get out of providing it, lol.
     
    #4
  5. JayG30

    JayG30 Active Member

    Well, I took the disk out of the server and plugged it into my laptop using one of those multi-format USB adapter things I have.
    As the disk started up I heard a LOT of noise (and it doesn't go away), which obviously was the first bad sign. It booted, but since it was originally set up in FreeNAS, Windows can't see it. Loaded up magic partition to see if I could wipe it out. The disk shows only ~746GB (3TB disk). Obviously something wrong. But I was able to format that 746GB of space and use CrystalDiskInfo to check SMART info.
    Health Status: Caution
    Reallocated Sectors Count: 100 100 50
    Current Pending Sector Count: 100 100 0

    So it looks like the disk went bad. Only 6553 hours it seems, not good.
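A side note on the ~746GB figure: it may not be a symptom of the failure at all. One possible explanation, which nobody in the thread confirmed, is that the USB adapter only handles 32-bit sector addresses, so the sector count wraps at 2^32. The short Python sketch below works through that arithmetic using the "# of Sectors" value from the CrystalDiskInfo report; the 512-byte sector size and the helper name `capacity` are assumptions.

```python
SECTOR_SIZE = 512            # assumed logical sector size in bytes
TOTAL_SECTORS = 5860533168   # "# of Sectors" from the CrystalDiskInfo report


def capacity(sectors, sector_size=SECTOR_SIZE):
    """Return (decimal GB, binary GiB) for a given sector count."""
    size_bytes = sectors * sector_size
    return size_bytes / 1e9, size_bytes / 2**30


full_gb, full_gib = capacity(TOTAL_SECTORS)

# What a 32-bit sector counter would report: the count wraps at 2**32.
wrapped_sectors = TOTAL_SECTORS % 2**32
wrapped_gb, wrapped_gib = capacity(wrapped_sectors)

print(f"full disk:    {full_gb:.1f} GB  ({full_gib:.1f} GiB)")
print(f"32-bit wrap:  {wrapped_gb:.1f} GB  ({wrapped_gib:.1f} GiB)")
```

The wrapped value comes out at roughly 801.6 GB, or about 746.5 GiB, which lines up with both the ~746GB shown by the partition tool and the 801.5 entry in the "Disk Size" line of the report. That would point at the adapter rather than the disk for this particular symptom; the SMART values are a separate matter.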

    Code:
    ----------------------------------------------------------------------------
    CrystalDiskInfo 6.5.2 (C) 2008-2015 hiyohiyo
                                    Crystal Dew World : http://crystalmark.info/
    ----------------------------------------------------------------------------
    -- Disk List ---------------------------------------------------------------
     (2) TOSHIBA MG03ACA300 : 3000.5 GB [1/X/X, jm1] (V=152D, P=2338)
    ----------------------------------------------------------------------------
     (2) TOSHIBA MG03ACA300
    ----------------------------------------------------------------------------
           Enclosure : TOSHIBA MG03ACA300 USB Device (V=152D, P=2338, jm1)
               Model : TOSHIBA MG03ACA300
            Firmware : FL1A
       Serial Number : 53K7K7JPF
           Disk Size : 3000.5 GB (8.4/137.4/3000.5/801.5)
         Buffer Size : Unknown
         Queue Depth : 32
        # of Sectors : 5860533168
       Rotation Rate : 7200 RPM
           Interface : USB (Serial ATA)
       Major Version : ATA8-ACS
       Minor Version : ----
       Transfer Mode : SATA/150 | SATA/600
      Power On Hours : 6553 hours
      Power On Count : 45 count
         Temperature : 37 C (98 F)
       Health Status : Caution
            Features : S.M.A.R.T., APM, 48bit LBA, NCQ
           APM Level : 0080h [ON]
           AAM Level : ----
    
    -- S.M.A.R.T. --------------------------------------------------------------
    ID Cur Wor Thr RawValues(6) Attribute Name
    01 _99 _99 _50 000000000000 Read Error Rate
    02 100 100 _50 000000000000 Throughput Performance
    03 100 100 __1 000000002E6B Spin-Up Time
    04 100 100 __0 00000000002D Start/Stop Count
    05 100 100 _50 000000001466 Reallocated Sectors Count
    07 100 _99 _50 000000000000 Seek Error Rate
    08 100 100 _50 000000000000 Seek Time Performance
    09 _84 _84 __0 000000001999 Power-On Hours
    0A 100 100 _30 000000000000 Spin Retry Count
    0C 100 100 __0 00000000002D Power Cycle Count
    BF 100 100 __0 000000000002 G-Sense Error Rate
    C0 100 100 __0 000000000024 Power-off Retract Count
    C1 100 100 __0 00000000003E Load/Unload Cycle Count
    C2 100 100 __0 002D000F0025 Temperature
    C4 100 100 __0 0000000002BD Reallocation Event Count
    C5 100 100 __0 000000000013 Current Pending Sector Count
    C6 100 100 __0 000000000000 Uncorrectable Sector Count
    C7 200 200 __0 000000000000 UltraDMA CRC Error Count
    DC 100 100 __0 000000000000 Disk Shift
    DE _84 _84 __0 000000001983 Loaded Hours
    DF 100 100 __0 000000000000 Load/Unload Retry Count
    E0 100 100 __0 000000000000 Load Friction
    E2 100 100 __0 000000000069 Load 'In'-time
    F0 __1 __1 __1 000000000010 Head Flying Hours
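The raw values in the table above are hexadecimal, so the interesting ones are easier to judge once converted. Here is a minimal Python sketch that parses rows in this CrystalDiskInfo layout and converts the raw column to decimal. The function name `parse_smart` is hypothetical, and treating the raw field as one plain integer is an assumption that suits simple counters (05, 09, C4, C5) but not packed fields such as Temperature (C2).

```python
import re

# ID, current, worst, threshold (padded with "_"), 12-hex-digit raw value, name
SMART_ROW = re.compile(
    r"^([0-9A-F]{2})\s+_*(\d+)\s+_*(\d+)\s+_*(\d+)\s+([0-9A-F]{12})\s+(.+)$"
)


def parse_smart(text):
    """Map attribute ID -> dict with normalized values and decimal raw value."""
    attrs = {}
    for line in text.splitlines():
        m = SMART_ROW.match(line.strip())
        if not m:
            continue  # header or unrelated line
        attr_id, cur, wor, thr, raw_hex, name = m.groups()
        attrs[attr_id] = {
            "name": name,
            "current": int(cur),
            "worst": int(wor),
            "threshold": int(thr),
            "raw": int(raw_hex, 16),  # assumption: raw is a single integer
        }
    return attrs


SAMPLE = """\
ID Cur Wor Thr RawValues(6) Attribute Name
05 100 100 _50 000000001466 Reallocated Sectors Count
09 _84 _84 __0 000000001999 Power-On Hours
C4 100 100 __0 0000000002BD Reallocation Event Count
C5 100 100 __0 000000000013 Current Pending Sector Count
"""

for attr_id, a in parse_smart(SAMPLE).items():
    print(attr_id, a["name"], a["raw"])
```

Converted, the raw values are 5222 reallocated sectors, 701 reallocation events and 19 pending sectors, with 6553 power-on hours (matching the "Power On Hours" line in the report). The normalized "100 100 50" values still sit above the threshold, so the raw counts are where the damage shows.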
    
     
    #5
  6. Terry Kennedy

    Terry Kennedy Well-Known Member

    The good news is you have a RAIDZ2, so you can lose one more drive and still access all of your data. The bad news is that since a replace operation is more stressful to the drives than normal I/O, another drive/drives can fail during the rebuild. That's one reason I never suggest doing a pull/reinsert/rebuild on any type of array - if the array software / controller detects the re-inserted drive as good and starts a rebuild, the rebuild may fail due to the same drive dropping out again. But you'll have stressed the other drives in the array for no purpose. Take the drive out and exercise / test it for an extended period before trying to re-use it (or discover it is bad, as yours is).

    A common cause for DOA and low-hour drive failures is mishandling during shipping - either from the manufacturer to the distributor and reseller, or from the reseller to you. The last is the most common, as the manufacturers do a pretty good job of shipping bulk drives. Not to name names, but one popular vendor was (is?) well-known for breaking up bulk packs and just dumping naked antistatic-bagged drives into shipping boxes with some packing pillows. Needless to say, drives from them often arrived with physical damage. I actually tried to get WD to cut them off until they improved their packaging. Since I don't buy drives from that vendor any more (I get 'em from WD and other manufacturers in bulk, direct) I don't know if they've gotten better. eBay sellers are often even worse at packaging drives than that vendor.

    Unfortunately, if the drive failure is due to shipping damage, the other drives that arrived in that same order are likely suspect. That's just one of the reasons for another drive to fail during a rebuild operation.

    Regarding manufacturer packaging requirements for RMAs, I find that saying "I sent it back to you in the same packaging you used to send it to me" is sufficient to get the RMA processed. That assumes you have the manufacturer's packaging, of course - I have bunches of manufacturer 20-pack, 5-pack and single packaging that I use for RMAs.
     
    #6