Why does ZFS hate me and what am I doing wrong?

Dssguy1 · Apr 20, 2016

Hey folks,
I have been piecing together a server/enclosure setup for my home backup and media serving needs but have had nothing but trouble with FreeNAS (specifically ZFS). I put the post in this forum because I am assuming it has something to do with my LSI 9207 / SA120 / R720, since it just happens out of the blue.

Basic overview of my setup:
Dell R720 server, 32GB ram, Dual Xeon, internal PERC H710P disabled, no drives loaded in server.
LSI 9207-8E flashed with 20IT firmware because Freenas was complaining about version 20 drivers not matching my 19IT firmware.
Brand new Lenovo SA120 with the following drives:
5 HGST 3TB SAS Ultrastars (HUS723030ALS640) - I created a Raid Z1 volume
3 HGST 2TB SAS Ultrastars (HUS723020ALS640) - I created a Raid Z1 volume
1 HGST 8TB He8 SATA (HUH728080ALE601) - Didn't do anything with this drive, just sitting blank in the enclosure
1 HGST 4TB Coolspin SATA (H3IK40003254SP) - Didn't do anything with this drive, just sitting blank in the enclosure

I have the enclosure connected to the R720 through a single Mini Sas 8088 cable on the primary (bottom) IO board.

Here is where the problem arises. I made the two different ZFS volumes and started filling them with content. The five 3TB drives were being used as media storage and I was uploading Blu-ray copies to them for several days. That volume was about 60% full. The 2TB drives were for pictures and I had filled that volume to about 30%.

About the only things I did past the default FreeNAS 9.10 stable install were to set up a Plex media server and click the autotune checkbox. I also accepted the latest stable update a couple of days ago. I also messed with CIFS, of course, to get Windows sharing working.

About a day after having everything setup and loaded with content I go to upload another few Blu-Ray copies and can't see the Media Volume in Windows. I log into the server and see that there is a critical error saying "The volume MediaVolume (ZFS) state is UNKNOWN".

So, hearing how awesome ZFS is about not losing your data, I go to Storage and Volumes and look at what is going on. I just see the MediaVolume showing as not found. (I can get exact wording if that helps).

- OK, weird, let's reboot and see if that fixes it... Nope.
- Let's shut down the server and the enclosure and see if that fixes it... Nope. I started up the enclosure first.
- Let's add another SAS cable to the mix and connect it to the secondary IO board and the other slot in the 9207 and see if that helps... Nope.
- Let's try to IMPORT the volume back in with Import Volume... Not encrypted, second menu, no volume options. So that isn't working.
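
For reference, the GUI import has a command-line counterpart: `zpool import` with no arguments scans the attached disks and lists pools that could be imported, without importing anything. A minimal sketch of summarizing that listing, assuming the usual labelled `pool:` and `state:` lines; the helper name `importable_pools` is made up for this example.

```shell
# Print "name state" for every pool block found in `zpool import` output on stdin.
# Assumes the usual "pool:" / "state:" labelled lines; adjust if your output differs.
importable_pools() {
    awk '$1 == "pool:"  { name = $2 }
         $1 == "state:" { print name, $2 }'
}

# On the FreeNAS box (scans attached disks, imports nothing):
#   zpool import | importable_pools
```

If the missing pool does not show up in that listing at all, ZFS is not finding its labels on the disks it can currently see.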

Funny thing is, FREENAS sees all the disks in the view disks menu. It is more than happy for me to create a new pool with the disks if I just want to start from scratch. It will let me detach the pool but I am afraid to do that because the screen turns RED!!!

I read up on GPT corruption and decided to check with GPART list (I think) and all drives came up with STATE: OK.

Side note, this is the second time in a week this has happened to me out of the blue. First time I was just playing around and creating a volume and a couple days later it was gone. I hadn't even put anything on it yet.

So I haven't really lost any data, I was just putting data on the drives to get comfy with the setup but so far, I am very uncomfy with ZFS and FreeNas.

Can anybody tell me what I am doing wrong or where to look for answers on why a pool would just disappear? I am more than happy to include logs from around that time frame if I could just find them (sorry, new to FreeBSD so I might need a little help with finding them).

I took a video with everything that is said on startup of the server but I don't know if anyone actually wants to view it. If so, let me know and I will put it on YouTube.

Dssguy1 · Apr 20, 2016

I'm including the video link, I saw a TON of SAS related errors that fly by so quick and are so complicated I can't understand them at all. Hopefully somebody else is more savvy than me and has an easy answer to this debacle.

The errors start around the 1 minute mark and also around the 2 minute mark. I think this report out was when I had the redundant 8088 cables going to the SA120. I did notice FreeNas was seeing redundant connections to the 5 3TB drives in one of the GUI screens but not the 3 2TB drives or the SATA drives.

unwind-protect · Apr 20, 2016

Can't really read the error messages.

You need to start by pulling SMART info off the drives (or error messages when trying to get SMART).

Dssguy1 · Apr 20, 2016

unwind-protect said:
Can't really read the error messages.

You need to start by pulling SMART info off the drives (or error messages when trying to get SMART).

I was able to read a lot of the messages by just stopping the video, or even putting it in slower speed playback from the youtube settings. I think I did poke around in the SMART settings and don't remember there being any errors. If you give me a command to run, I will do it and report back.

Dssguy1 · Apr 20, 2016

This is a clear shot of the middle of the SCSI mass output. I'm sure the numbers are changing from the many screens flying by but this is the general gist:

This is the end of the SCSI Barf report from the first iteration of errors.

unwind-protect · Apr 20, 2016

That is an ordinary read error.

If SMART says the drive is fine then you have a controller or cable or cage problem.

Dssguy1 · Apr 20, 2016

I have a short 1ft cable that came with the brand new SA120, I will use that and see if any of the errors go away. Can you give me some commands to check the Smart status of the drives so I can post the output just to be sure?

Dssguy1 · Apr 20, 2016

Here is why I thought checking with GPART was worthwhile but it said everything was fine:

unwind-protect · Apr 20, 2016

smartctl -a /dev/da0

etc
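
To save typing the command once per disk, a small loop can collect every report in one go. This is only a sketch: the function name `smart_report` and the `SMARTCTL` override variable are made up for this example, and the device names in the usage comment are assumptions about how the disks enumerate.

```shell
# Run `smartctl -a` against each device named on the command line,
# printing a header line before each report so they are easy to tell apart.
# SMARTCTL is a hypothetical override for the smartctl binary (defaults to "smartctl").
smart_report() {
    for dev in "$@"; do
        echo "===== $dev ====="
        ${SMARTCTL:-smartctl} -a "$dev"
    done
}

# e.g. smart_report /dev/da0 /dev/da1 /dev/da2 /dev/da3 /dev/da4 > smart.txt
```

Redirecting to a file gives something that can be attached to a post as text.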

j_h_o · Apr 20, 2016

What firmware is on your SA120? Are you running V1008? (I think that's the latest, IIRC)

Ensure you flash both I/O controllers on the enclosure.

Dssguy1 · Apr 20, 2016

j_h_o said:
What firmware is on your SA120? Are you running V1008? (I think that's the latest, IIRC)

I see they just updated that firmware. I will install it now! - LOL NVM, that was 4-15-2015. I already had Firmware V1008.

So far I have moved around the 5 drives that make up the Media Volume to different slots in the SA120 and moved the 5 drives that make up the Media Volume over to my MD1200. I get the exact same errors. I have tried another 3' SAS cable and same errors. I even tried a little brand new 1.5' cable that came with the SA120, same errors!

So the last thing it could be if it is truly a communications error, is the raid card? I am moving the card around into different slots of my R720 to see if that helps.

Anybody have any new advice while I try different slots? I am running official 20.00.07.00 firmware from LSI on my card, just FYI.

pricklypunter · Apr 20, 2016

Moving stuff around will likely only complicate matters if data retrieval is your goal. If you can still remember exactly where everything was originally, you should move it all back. I suspect you either have some flaky disks, a bad controller or possibly a damaged/intermittent backplane issue. Oh, and pull your SATA disks out before testing again.

Dssguy1 · Apr 20, 2016

I know where they go in the original setup but I thought I read that ZFS was smart enough to not need the disks in the same spot to work. Is that wrong?

How can it be a backplane issue if it gives the exact same errors in two completely different enclosures?

pricklypunter · Apr 20, 2016

Sorry, I missed where you stated that you moved stuff over to the MD1200, but I see it now that I have re-read the thread. In that case the disks may have been corrupted originally in the SA120. Simply moving them to the other chassis will obviously show the same or similar results now that they are corrupted, if indeed that is what happened. However, I think it's only one of a few possible things it could be.
Yes, you are correct, ZFS will not care where exactly the disks are physically located in the chassis, provided they are being addressed by unique names.

Dssguy1 · Apr 21, 2016

pricklypunter said:
Oh, and pull your SATA disks out before testing again

Yes, I pulled out everything but that volume to see if that would make a difference. So right now my 5 3tb drives are in a MD1200 and I actually put my smaller 3 2tb drive volume back in the SA120 and connected them both to the R720 through the LSI9207 and the smaller volume showed up, no problem. So at least that volume is still fully functional.

I did set myself up for remote management so I can test some stuff while at work. I ran the smartctl -a /dev/daX on all my drives and some of the results were easy to read. Some were super long and very confusing. I will post the output from the 5 3tb SAS drives to see if you guys see anything weird.

Thanks so much for your help so far folks! I am learning a lot by messing with this setup.

Dssguy1 · Apr 21, 2016

Ok, so I see LOTS of write errors corrected on all of the 3TB drives. Below is an example. What could be causing this!!

Here is a small Excel sheet I made to put it in a more compact form. Each tab has a picture of the "Smartctl -a" output.

Hard Drive Problem.xlsx
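
The same figures can also be pulled out as text instead of screenshots. A rough sketch, assuming the eight-column "Error counter log" table that `smartctl -a` prints for SAS drives (row name, fast ECC, delayed ECC, rereads/rewrites, total corrected, correction algorithm invocations, gigabytes processed, total uncorrected); the function name `error_totals` is made up for this example.

```shell
# Reads `smartctl -a` output for a SAS drive on stdin and prints, for each
# read/write/verify row, the row name plus total corrected and total uncorrected errors.
# Assumes the 8-column "Error counter log" layout; check it against your own output.
error_totals() {
    awk '$1 ~ /^(read|write|verify):$/ && NF == 8 {
             print $1, "corrected=" $5, "uncorrected=" $8
         }'
}

# e.g. smartctl -a /dev/da0 | error_totals
```

Putting the corrected and uncorrected totals side by side makes it easier to compare the five drives with each other.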

Another thing to mention. These drives are HGST but had a SUN label on them. Perhaps made especially for SUN. Would that cause any issues? Possibly weird firmware that doesn't play nice with my hardware?

Dssguy1 · Apr 21, 2016

Sorry to add another comment but where can I find my log files from Freenas? It keeps zipping them up at 100k and putting them somewhere. I would be more than happy to include the latest one so you could clearly see the SCSI errors instead of watching my video!

EDIT:
I think I found logs in var/log but which file should I be trying to look at for startup errors?
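
On FreeBSD the usual place for kernel and driver messages is `/var/log/messages`, and `dmesg` prints the kernel message buffer. A sketch of filtering either one down to lines that look like SAS trouble, assuming the HBA is handled by the mps(4) driver; the patterns and the function name `scsi_errors` are guesses made for this example, not a definitive list.

```shell
# Keep only lines that look like CAM/SCSI trouble from the mps(4) HBA driver.
# The patterns are guesses at what is relevant; widen or narrow them as needed.
scsi_errors() {
    grep -E 'mps[0-9]+|CAM status|SCSI sense'
}

# e.g. scsi_errors < /var/log/messages
#      dmesg | scsi_errors
```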

unwind-protect · Apr 21, 2016

The log doesn't specifically say but I think these are on-medium ECC errors (as opposed to cable/controller/communication checksum errors). I see no reason to assume that these drives are not simply dead.

One test that you should do is a complete read test. Just read off the raw device. No filesystem, no mount, no writes (so that it doesn't get a chance to reallocate bad blocks).
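
One way to do such a read-only pass is with `dd`, sending the data to `/dev/null`. A minimal sketch, assuming the disk shows up as `/dev/da0`; the function name `read_test` is made up for this example.

```shell
# Read a device (or any file) from start to finish and discard the data.
# Nothing is written to the source; dd exits non-zero if it cannot read it.
read_test() {
    dd if="$1" of=/dev/null bs=1048576
}

# e.g. read_test /dev/da0
```

Expect this to take hours on a 3TB drive, and keep an eye on the console or the system log for SCSI errors while it runs.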

Dssguy1 · Apr 21, 2016

Hmmm, I wish it was that simple but I ran all 5 of these disks through HDDScan (Surface, Read, Write, Verify) tasks and HDTune tests only a week ago. All the tests took 2+ days on each drive and all passed (or so said the software). Is there anything else that can cause these errors? All five of the drives have a TON of them. Makes me think something else is wrong. If they were failing at this alarming rate, how can they hold any data?

I'm not trying to dispute your claim. I just want to make sure this assessment is correct because I bought them on EBay for $70 a piece and I am going to have to return them if they really are garbage!

Is there a better sub-forum to post this question to get more ideas? I really wasn't sure what could have been the issue so I picked one that seemed likely.

ttabbal · Apr 21, 2016

From the output, it looks like SMART level data... It's saying the write errors were corrected via ECC, so test software looking for data coming back invalid will not detect it.

You might try a SMART long test, I don't see any in the log. Perhaps a long test would find a failure that the short test is missing.
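
A sketch of how that could look with smartctl: `-t long` starts the extended self-test, which runs in the background on the drive, and `-l selftest` shows the self-test log afterwards. The function names and the `SMARTCTL` override variable are made up for this example, and the device names are assumptions.

```shell
# SMARTCTL is a hypothetical override for the smartctl binary (defaults to "smartctl").

# Start an extended (long) self-test on each named device.
start_long_tests() {
    for dev in "$@"; do
        ${SMARTCTL:-smartctl} -t long "$dev"
    done
}

# Print the self-test log for each named device once the tests have had time to finish.
show_selftest_logs() {
    for dev in "$@"; do
        ${SMARTCTL:-smartctl} -l selftest "$dev"
    done
}

# e.g. start_long_tests /dev/da0 /dev/da1
#      show_selftest_logs /dev/da0 /dev/da1
```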

There's a sub-forum for HDDs and SSDs, but I bet most people check here too.
