ZFS striped mirror set degraded. Disk really bad? How to check?

bp_968

New Member
Dec 23, 2012
45
0
0
Here is what it's reporting. And btw, I discovered this thanks to a neat little app on my Android phone called "ZFS Monitor". It's awesome!




NAME                  STATE     READ WRITE CKSUM
raptor10k-RAID10      DEGRADED     0     0     0
  mirror-0            ONLINE       0     0     0
    c7t14d0           ONLINE       0     0     0
    c7t15d0           ONLINE       0     0     0
  mirror-1            ONLINE       0     0     0
    c7t16d0           ONLINE       0     0     0
    c7t17d0           ONLINE       0     0     0
  mirror-2            ONLINE       0     0     0
    c7t18d0           ONLINE       0     0     0
    c7t19d0           ONLINE       0     0     0
  mirror-3            DEGRADED     0     0     0
    c7t20d0           ONLINE       0     0     0
    c7t21d0           FAULTED      0     0     0  too many errors

How do I check the disk and make sure it's OK or not OK before I tell ZFS to "get back to work"?

I guess this will prompt me to go ahead and pick up a spare 300GB Raptor to keep around. (I'll probably put it in something else and use it until/if it's ever needed. I'm OK without it sitting as a hot spare.)

Thanks!!
 

Mike

Member
May 29, 2012
482
16
18
EU
Although it's not foolproof, the SMART data may give you some clues: smartctl -a /dev/...?.../c7t21d0
You may also want to give badblocks a run; 300 GB is not a whole lot to scan and rebuild anyway.
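
A minimal sketch of the SMART check suggested above, assuming smartmontools is installed and the disk is an ATA drive (SCSI/SAS drives word the health line differently). The helper name smart_health is hypothetical, the device path is left as a variable because it depends on the system, and the -d sat,12 type is taken from the smart_type column shown further down the thread.

```shell
# Hypothetical helper: read `smartctl -H` output on stdin and print the
# overall verdict (the text after "test result:"), or UNKNOWN if the
# health line is not present.
smart_health() {
  awk -F': *' '/overall-health self-assessment test result/ {print $2; found=1}
               END {if (!found) print "UNKNOWN"}'
}

# Usage (DISK is an assumption; substitute the raw device path on your system):
#   smartctl -H -d sat,12 "$DISK" | smart_health
#   smartctl -a -d sat,12 "$DISK"    # full attribute dump, as suggested above
```

A PASSED verdict is weak evidence on its own: in the table later in this thread the faulted disk still reports PASSED, which fits the "not foolproof" caveat.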
 

gea

Well-Known Member
Dec 31, 2010
3,163
1,195
113
DE
If ZFS reports a problem, there is a problem (thanks to checksums on all data).
This is quite unique. Hardly any other system can detect problems this early, if at all.

"Too many errors" usually means that the disk has failed completely. Insert a spare disk and
do a disk replace (faulted -> new). I would check the failed disk with a disk test tool from WD.

Also:
- it is not wise to run many two-way mirrors without a hot spare
- most disk failures are not predicted by SMART tests
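
The two suggestions above (replace the faulted disk, keep a hot spare) can be sketched as a dry run that only prints the zpool commands for review. The function name is hypothetical, and c7t22d0 and c7t23d0 are made-up device names standing in for whatever the new disks show up as.

```shell
# Print, without running them, the zpool commands for replacing a faulted
# disk with a new one and for adding a hot spare to the pool.
# Arguments: pool name, faulted device, new device, spare device.
plan_replace_and_spare() {
  pool=$1 faulted=$2 new=$3 spare=$4
  echo "zpool replace $pool $faulted $new"
  echo "zpool add $pool spare $spare"
  echo "zpool status $pool"
}

# c7t22d0 and c7t23d0 are hypothetical names for the new disks.
plan_replace_and_spare raptor10k-RAID10 c7t21d0 c7t22d0 c7t23d0
```

Printing first and pasting the commands afterwards makes it easier to double-check the device names before anything touches the pool.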
 

bp_968

I did a scrub and it found zero errors. I'll check SMART too. Here is another output:
- with Smartinfo if available (works mostly only with SAS Controller)
id                     diskcap  pool              vdev    state    error            smart_model              smart_type  smart_health  temp   smart_sn        smart_check
c4t5000C50046F1A085d0  2000 GB  RAIDz2            raidz   ONLINE   S:4 H:0 T:0      ST2000DL003-9VT166       sat,12      PASSED        28 °C  6YD234N1        short long abort log
c4t5000C50046F2D8C1d0  2000 GB  RAIDz2            raidz   ONLINE   S:4 H:0 T:0      ST2000DL003-9VT166       sat,12      PASSED        30 °C  6YD2434J        short long abort log
c4t5000CCA221D3C439d0  2000 GB  RAIDz2            raidz   ONLINE   S:4 H:0 T:0      Hitachi HDS722020ALA330  sat,12      PASSED        35 °C  JK1130YAHDGZNT  short long abort log
c4t5000CCA221D3CAAEd0  2000 GB  RAIDz2            raidz   ONLINE   S:4 H:0 T:0      Hitachi HDS722020ALA330  sat,12      PASSED        35 °C  JK1130YAHDJPZT  short long abort log
c4t50014EE6007472CCd0  2000 GB  RAIDz2            raidz   ONLINE   S:4 H:0 T:0      WDC WD20EARS-00MVWB0     sat,12      PASSED        30 °C  WDWMAZA1193578  short long abort log
c4t50014EE655C9C2DCd0  2000 GB  RAIDz2            raidz   ONLINE   S:4 H:0 T:0      WDC WD20EARS-00MVWB0     sat,12      PASSED        31 °C  WDWMAZA0979429  short long abort log
c4t50024E90044F3EDCd0  2000 GB  RAIDz2            raidz   ONLINE   S:4 H:0 T:0      SAMSUNG HD204UI          sat,12      PASSED        29 °C  S2H7JD6ZB00622  short long abort log
c4t50024E90044F3FAEd0  2000 GB  RAIDz2            raidz   ONLINE   S:4 H:0 T:0      SAMSUNG HD204UI          sat,12      PASSED        27 °C  S2H7JD6ZB00624  short long abort log
c6t0d0                 32.2 GB  rpool             basic   ONLINE   S:0 H:0 T:0      -                        -           -             -      n.a.            short long abort log
c7t14d0                300 GB   raptor10k-RAID10  mirror  ONLINE   S:4 H:3 T:0      WL300GLSA16100           sat,12      PASSED        36 °C  LP007105        short long abort log
c7t15d0                300 GB   raptor10k-RAID10  mirror  ONLINE   S:4 H:3 T:0      WL300GLSA16100           sat,12      PASSED        35 °C  LP007072        short long abort log
c7t16d0                300 GB   raptor10k-RAID10  mirror  ONLINE   S:4 H:3 T:0      WL300GLSA16100           sat,12      PASSED        35 °C  LP018886        short long abort log
c7t17d0                300 GB   raptor10k-RAID10  mirror  ONLINE   S:4 H:3 T:0      WL300GLSA16100           sat,12      PASSED        35 °C  LP018961        short long abort log
c7t18d0                300 GB   raptor10k-RAID10  mirror  ONLINE   S:4 H:3 T:0      WL300GLSA16100           sat,12      PASSED        36 °C  LP009085        short long abort log
c7t19d0                300 GB   raptor10k-RAID10  mirror  ONLINE   S:4 H:3 T:0      WL300GLSA16100           sat,12      PASSED        34 °C  LP018960        short long abort log
c7t20d0                300 GB   raptor10k-RAID10  mirror  ONLINE   S:4 H:3 T:0      WL300GLSA16100           sat,12      PASSED        35 °C  LP004010        short long abort log
c7t21d0                300 GB   raptor10k-RAID10  mirror  FAULTED  S:4 H:475 T:175  WL300GLSA16100           sat,12      PASSED        35 °C  LP009633        short long abort log
 

gea

c7t21d0 300 GB raptor10k-RAID10 mirror FAULTED S:4 H:475 T:175 WL300GLSA16100 sat,12 PASSED 35 °C LP009633 short long abort log
A scrub tests your pool, and your pool is OK, but a disk has failed completely.
You can try a short or long SMART test, but I would remove the disk and check it (and optionally repair it) with the WD Raptor diagnostic tool.

For WD Raptor, see
WD Support / Downloads / SATA & SAS / WD VelociRaptor
 

bp_968

So I can get these two bits of information, but I can't put them together to figure out which drive "failed". This is starting to get annoying. You'd figure it would be a quick Google search to find a command to match a failed drive to an actual *REAL PHYSICAL DRIVE*, but apparently it's not. Not for many tonight anyway.


ben@OI_SAN1:~# iostat -Er | grep -i vendor | sort | uniq
Vendor: ATA ,Product: Hitachi HDS72202 ,Revision: A28A ,Serial No: JK1130YAHDGZNT
Vendor: ATA ,Product: Hitachi HDS72202 ,Revision: A28A ,Serial No: JK1130YAHDJPZT
Vendor: ATA ,Product: SAMSUNG HD204UI ,Revision: 0001 ,Serial No: S2H7JD6ZB00622
Vendor: ATA ,Product: SAMSUNG HD204UI ,Revision: 0001 ,Serial No: S2H7JD6ZB00624
Vendor: ATA ,Product: ST2000DL003-9VT1 ,Revision: CC3C ,Serial No: 6YD234N1
Vendor: ATA ,Product: ST2000DL003-9VT1 ,Revision: CC3C ,Serial No: 6YD2434J
Vendor: ATA ,Product: WDC WD20EARS-00M ,Revision: AB51 ,Serial No: WD-WMAZA0979429
Vendor: ATA ,Product: WDC WD20EARS-00M ,Revision: AB51 ,Serial No: WD-WMAZA1193578
Vendor: ATA ,Product: WL300GLSA16100 ,Revision: 4V01 ,Serial No: LP004010
Vendor: ATA ,Product: WL300GLSA16100 ,Revision: 4V01 ,Serial No: LP007072
Vendor: ATA ,Product: WL300GLSA16100 ,Revision: 4V01 ,Serial No: LP009085
Vendor: ATA ,Product: WL300GLSA16100 ,Revision: 4V01 ,Serial No: LP009633
Vendor: ATA ,Product: WL300GLSA16100 ,Revision: 4V01 ,Serial No: LP018886
Vendor: ATA ,Product: WL300GLSA16100 ,Revision: 4V01 ,Serial No: LP018960
Vendor: ATA ,Product: WL300GLSA16100 ,Revision: 4V01 ,Serial No: LP018961
Vendor: ATA ,Product: WL300GLSA16100 ,Revision: 4V09 ,Serial No: LP007105
Vendor: NECVMWar ,Product: VMware IDE CDR10 ,Revision: 1.00 ,Serial No:
Vendor: VMware ,Product: Virtual disk ,Revision: 1.0 ,Serial No: 6000c290cff5bd3
ben@OI_SAN1:~# iostat -En | grep -i vendor | sort | uniq
Vendor: ATA Product: Hitachi HDS72202 Revision: A28A Serial No: JK1130YAHDGZNT
Vendor: ATA Product: Hitachi HDS72202 Revision: A28A Serial No: JK1130YAHDJPZT
Vendor: ATA Product: SAMSUNG HD204UI Revision: 0001 Serial No: S2H7JD6ZB00622
Vendor: ATA Product: SAMSUNG HD204UI Revision: 0001 Serial No: S2H7JD6ZB00624
Vendor: ATA Product: ST2000DL003-9VT1 Revision: CC3C Serial No: 6YD234N1
Vendor: ATA Product: ST2000DL003-9VT1 Revision: CC3C Serial No: 6YD2434J
Vendor: ATA Product: WDC WD20EARS-00M Revision: AB51 Serial No: WD-WMAZA0979429
Vendor: ATA Product: WDC WD20EARS-00M Revision: AB51 Serial No: WD-WMAZA1193578
Vendor: ATA Product: WL300GLSA16100 Revision: 4V01 Serial No: LP004010
Vendor: ATA Product: WL300GLSA16100 Revision: 4V01 Serial No: LP007072
Vendor: ATA Product: WL300GLSA16100 Revision: 4V01 Serial No: LP009085
Vendor: ATA Product: WL300GLSA16100 Revision: 4V01 Serial No: LP009633
Vendor: ATA Product: WL300GLSA16100 Revision: 4V01 Serial No: LP018886
Vendor: ATA Product: WL300GLSA16100 Revision: 4V01 Serial No: LP018960
Vendor: ATA Product: WL300GLSA16100 Revision: 4V01 Serial No: LP018961
Vendor: ATA Product: WL300GLSA16100 Revision: 4V09 Serial No: LP007105
Vendor: NECVMWar Product: VMware IDE CDR10 Revision: 1.00 Serial No:
Vendor: VMware Product: Virtual disk Revision: 1.0 Serial No: 6000c290cff5bd3


Or I can get:

NAME                  STATE     READ WRITE CKSUM
raptor10k-RAID10      DEGRADED     0     0     0
  mirror-0            ONLINE       0     0     0
    c7t14d0           ONLINE       0     0     0
    c7t15d0           ONLINE       0     0     0
  mirror-1            ONLINE       0     0     0
    c7t16d0           ONLINE       0     0     0
    c7t17d0           ONLINE       0     0     0
  mirror-2            ONLINE       0     0     0
    c7t18d0           ONLINE       0     0     0
    c7t19d0           ONLINE       0     0     0
  mirror-3            DEGRADED     0     0     0
    c7t20d0           ONLINE       0     0     0
    c7t21d0           FAULTED      0     0     0  too many errors

-----------------------------------------------

So I know c7t21d0 is acting badly. Of course I have no actual way to figure out who the hell c7t21d0 is, so go figure. :(
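
One way to join the two outputs, sketched under an assumption about the usual Solaris/illumos record layout: each device's block in unfiltered `iostat -En` output begins with a line starting with its cXtYdZ name, and the Vendor line that follows ends with the serial number. Piping through `grep -i vendor` throws the name line away, which is why the two lists above cannot be matched. The helper name dev_serials is hypothetical.

```shell
# Hypothetical helper: read full `iostat -En` output on stdin and print
# "device serial" pairs. The device name is remembered from the first line
# of each record; the serial is the last field of the "Serial No:" line
# ("-" if the serial is empty).
dev_serials() {
  awk '/^c[0-9]+t[0-9A-Fa-f]+d[0-9]+/ {dev = $1}
       /Serial No:/ && dev != "" {
         sn = ($NF == "No:") ? "-" : $NF
         print dev, sn
         dev = ""
       }'
}

# Usage on the live system:
#   iostat -En | dev_serials | grep '^c7t21d0 '
```

The SMART table posted earlier in the thread already lists serial LP009633 next to c7t21d0, and the same serial appears in the iostat list above, so the serial printed on the drive label is probably the most reliable way to pick the right physical disk.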
 

dba

Moderator
Feb 20, 2012
1,477
184
63
San Francisco Bay Area, California, USA
As someone with 56 SSD drives attached to one server, I feel your pain. I have resorted to adding drives one at a time and maintaining a file (actually a OneNote page) that maps Solaris device names to physical disk locations.

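Perhaps you can use dd to stream data from the failed drive and then look for the drive light that shows activity: dd if=/dev/rdsk/c7t21d0p0 of=/dev/null (on Solaris-derived systems the disk nodes live under /dev/dsk and /dev/rdsk rather than directly under /dev; adjust the slice or partition suffix to your setup)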
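
A bounded variant of the dd idea, as a sketch: reading a fixed number of blocks keeps the command from running over the whole disk. The helper name blink_disk and the default block count are hypothetical, and the device path in the usage line is an assumption about how the disk is exposed on this system.

```shell
# Hypothetical helper: read a bounded amount of data from a device so that
# its activity LED lights up. $1 = device path, $2 = number of 1 MiB blocks
# to read (defaults to 2048, i.e. about 2 GiB).
blink_disk() {
  dd if="$1" of=/dev/null bs=1024k count="${2:-2048}" 2>/dev/null
}

# Usage (the path is an assumption; substitute the raw device on your system):
#   blink_disk /dev/rdsk/c7t21d0p0
```

If the faulted disk does not answer reads at all, the same helper can be pointed at the healthy disks instead, leaving the dark one as the suspect.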
 

bp_968

I read that the device name is based on the card/port/drive. So if it was the last drive number in the list, like it was, it's likely that it's on the last controller card, on the last SAS port, plugged into the last SATA cable... right? So I yanked it. <LOL> It still says the device is there, so I'm guessing that the LSI 3081i (I think) probably isn't hot-swap friendly. I'm rebooting to see what comes back up. Nothing on that array is important, so no big deal at the moment.
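
For reference, a small sketch that splits a Solaris-style cXtYdZ name into controller, target, and disk (LUN) numbers. The helper name split_ctd is hypothetical. Note that these are logical numbers assigned by the OS and driver, so the highest target number is not guaranteed to sit on the physically last port or cable.

```shell
# Hypothetical helper: split a cXtYdZ device name into its three parts.
# The target may be a plain number (c7t21d0) or a WWN-style hex string
# (c4t5000C50046F1A085d0); any trailing slice suffix is ignored.
split_ctd() {
  echo "$1" | sed -n 's/^c\([0-9]*\)t\([0-9A-Fa-f]*\)d\([0-9]*\).*$/controller=\1 target=\2 disk=\3/p'
}

split_ctd c7t21d0   # prints: controller=7 target=21 disk=0
```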

The more annoying deal is that the eBay seller who sells those drives raised his prices $10 a drive over what I paid. We will see if he takes the same price I paid 4 months ago. If not, I may well just drop that whole array of Raptors, sell them, and build it into something else anyway (the 8-drive RAID10 Raptor array was mainly for fun. It's becoming less fun...).

I'm honestly considering picking up a couple of 3081E cards (they're cheap!) and turning a spare Rosewill RSV-4000 case I have (4U, 12 drive bays) into a "DAS". I could do that for $120 in cards and cables (and a power supply), skip those 2.5" drives totally, stuff it with more 3.5" drives, and build another RAIDz2 or z3 array. More space and fast "enough".

Lots of ideas, no money.. lol
 

bp_968

I pulled c7t16d0, apparently. ;) It was in a different mirror, so now that RAID10 has taken 2 "hits". lol, I'm impressed. I'm going to have to dig around and see if it makes sense (it shows me pulling drive 3, counting from top to bottom, and I thought I pulled either 4 or 8 based on how I have it cabled...).
 

bp_968

I ended up (on purpose) swapping the two drives to different bays to make sure there isn't a backplane or cable issue. It resilvered c7t16d0 without issues, and then I did a "zpool offline" on the "bad" disk, put it back into the slot c7t16d0 used to be in, and then "cleared" it and "replaced" it. It's showing good right now. I'll let it run for a bit and see where it goes. I am also probably going to order a spare 300GB Raptor to hang onto.
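
The steps described above can be sketched as a dry run that prints the corresponding zpool commands in order. This is a reconstruction under assumptions, not the exact commands that were typed: the function name plan_reseat is hypothetical, and it assumes the device keeps the name c7t21d0 after being moved.

```shell
# Print, for review, zpool commands matching the offline / reseat / clear /
# replace sequence. $1 = pool name, $2 = device taken out and put back.
plan_reseat() {
  echo "zpool offline $1 $2"
  echo "# ...physically move or reseat the disk here..."
  echo "zpool clear $1 $2"
  echo "zpool replace $1 $2"
  echo "zpool status $1"
}

plan_reseat raptor10k-RAID10 c7t21d0
```

With a single device argument, zpool replace rebuilds onto the same device name, and the final status call is there to watch the resilver.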