Help! OmniOS Major Severity Error!

ZzBloopzZ

Member
Jan 7, 2013
91
13
8
Hello,

My fileserver had been working flawlessly until yesterday. I was streaming movies and everything kept skipping; this has never happened before.

I logged in to the VM and got the error below. I restarted the server, and it is still slow! It seems the network connection is running very slowly, or perhaps keeps dropping.

The VM is on a Crucial M4 SSD. The error says something about a memory dump, and the OmniOS VM disk is already full since it was only 16GB. I have been meaning to re-create the VM from scratch and make it bigger this time, but have been lazy. Should I just create a new VM from scratch with a 30GB disk this time around? Would that fix the issue?

I have no experience in this. Appreciate any guidance.

Error:

http://i.imgur.com/0TLVQOG.png



Update: The issue is now resolved. The problem was that my OS drive had become full, so I had to create a new VM. Before shutting down the existing VM, I took a screenshot of the user accounts and their UIDs and exported the pool. I then created a new VM and installed the latest OmniOS and napp-it. In napp-it I re-created the user accounts with the same UIDs as before, disabled and re-enabled the SMB service, imported the pool, and re-created the jobs. Voila, everything is good as new now.

Interestingly, the network share has never been so snappy. I wonder if it's due to optimizations in VMware 5.5 vs. 5.1 or latest OmniOS?
 

gea

Well-Known Member
Dec 31, 2010
2,500
842
113
DE
I would assume that your OS disk is simply full.
Check if you can destroy some bootsnaps.

I would create a new VM with a 20-40GB disk.
Save/restore the files in /var/web-gui/_log if you want to restore jobs or appliance groups.
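For anyone who wants to script the save/restore of that folder, here is a minimal sketch in Python. It only packs a directory into a gzip tarball and unpacks it again; the function names and the archive path are made up for illustration, and it assumes Python is available on the box. Copying the folder by hand works just as well.

```python
import tarfile
from pathlib import Path


def backup_dir(src: str, archive: str) -> list[str]:
    """Pack the directory `src` into a gzip tarball and return the member names."""
    src_path = Path(src)
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the directory name, e.g. "_log/...".
        tar.add(src_path, arcname=src_path.name)
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()


def restore_dir(archive: str, dest_parent: str) -> None:
    """Unpack the tarball below `dest_parent` (e.g. /var/web-gui on the new VM)."""
    Path(dest_parent).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest_parent)


if __name__ == "__main__":
    # Hypothetical archive location; keep it off the OS disk that gets rebuilt.
    print(backup_dir("/var/web-gui/_log", "/tmp/napp-it-log.tgz"))
```

Restoring on the new VM would then be `restore_dir("/tmp/napp-it-log.tgz", "/var/web-gui")`, after copying the archive across.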
 

ZzBloopzZ

You are right. Only 460MB free on the OS disk, after deleting some older bootsnaps.
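A check like the following would have caught this earlier. It is only a sketch using Python's standard library; the 2 GiB threshold is an arbitrary assumption, not a napp-it or OmniOS recommendation.

```python
import shutil


def free_space_report(path: str = "/", min_free_bytes: int = 2 * 1024**3) -> tuple[int, bool]:
    """Return (free bytes on the filesystem holding `path`, whether that meets the threshold)."""
    usage = shutil.disk_usage(path)
    return usage.free, usage.free >= min_free_bytes


if __name__ == "__main__":
    free, ok = free_space_report("/")
    print(f"{free / 1024**2:.0f} MiB free on / -> {'ok' if ok else 'LOW, clean up bootsnaps'}")
```

Run from cron or by hand, this prints a warning once the root filesystem drops below the chosen threshold.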

I will create new 40GB VM so that this doesn't happen again. The only job I have is auto scrub. Can I just re-create that or should I still save files from /var/web-gui/_log?

Quick Questions:

1) Should I update to VMware 5.5U2? I am currently using 5.1. I remember reading about some issues with 5.5 a while back, which is why I did not update. This server only handles file serving at home. I am not running any other VMs.

2) Should I export pool before creating new VM?

Thanks!

Edit: I read around, looks like 5.5U2 is stable now and recommended.
 

MatrixMJK

Member
Aug 21, 2014
70
27
18
52
I think he means...You should, but do not have to. (Not to put words in his mouth)

It is cleaner to import from an exported pool than to force an import. I have done it many times both ways, but prefer to export the pool if I can.
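To make the difference concrete, here is a small sketch that only builds the command lines and does not run anything. `zpool export`, `zpool import` and the `-f` (force) flag are standard ZFS commands; the pool name `tank` and the function names are placeholders.

```python
def zpool_export_argv(pool: str) -> list[str]:
    """Command to run on the OLD VM before shutting it down."""
    return ["zpool", "export", pool]


def zpool_import_argv(pool: str, was_exported: bool) -> list[str]:
    """Command to run on the NEW VM; force the import only if the pool was never exported."""
    argv = ["zpool", "import"]
    if not was_exported:
        argv.append("-f")  # pool still looks "in use" by the old system
    argv.append(pool)
    return argv


if __name__ == "__main__":
    print(" ".join(zpool_export_argv("tank")))
    print(" ".join(zpool_import_argv("tank", was_exported=True)))
    print(" ".join(zpool_import_argv("tank", was_exported=False)))
```

The clean path is export on the old system, then a plain import on the new one; the forced import is the fallback when the old VM is already gone.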
 

ZzBloopzZ

The server started getting slow again yesterday. I logged in and got the following message:



Any suggestions? Is this a hardware failure??

Edit: I just found out that my roommate is working on a Red Hat certification and has been messing around with a new server he put in. I was having weird DHCP issues with my main computer. When we shut down his server, the DHCP/networking issues on the main computer went away.

Come to think of it, this started around the same time I had issues with the OmniOS ZFS server. Do you think all of this is related? Can networking issues cause a kernel panic and these other errors? I figured that, since it has a static IP and the network keeps cutting in and out, that could be the cause? Just a thought anyhow.

Bump, videos still skipping... and that roommate's Red Hat server has been off for the last 24 hours. Grrr...
 

ZzBloopzZ

The server started getting slow again yesterday. I logged in to napp-it and noticed this. I do not believe I had this last week.



Any suggestions? Does this mean the drive is bad?
 

gea

It means that the drive had some hard errors, for example bad sectors.
These errors are reported by iostat at the driver level.
If they increase, ZFS will take the disk offline some day, but it can also
happen that the numbers increase without any reaction from ZFS.
You will notice this as a speed degradation, as ZFS must wait in each case
until the disk reads the data.

What I would do:
- Replace the disk with a spare disk, or remove it (pool: degraded state)
- Do a low-level check with a test tool from the disk manufacturer

It is very probable that this will repair some sectors and you can re-use the disk.
 

ZzBloopzZ

Last night, I disconnected and then reconnected all power and SATA cables for the hard drives. I also used a different power supply cable and a different port on the APC. Everything was running smooth as butter.

I then ran a scrub overnight. In the morning it was at 50% with no errors in the pool and an estimated 7 hours remaining. I checked 8+ hours later and noticed it was only at 61% total. Strange, so I checked the pool, and now there are some iostat errors for that same drive again:
S:0 H:4 T:8. I ran a quick SMART test on that drive and it failed. Weird, I thought napp-it would alert me if a drive is bad, at least on a SMART error?
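For reference, here is a small sketch that reads a counter string like the one above. It assumes S, H and T stand for the soft, hard and transport error counts that iostat reports per device; the rule "any hard or transport error is worth a look" is just a rule of thumb of this sketch, not something ZFS or napp-it enforces.

```python
import re

LABELS = {"S": "soft", "H": "hard", "T": "transport"}


def parse_error_counters(text: str) -> dict[str, int]:
    """Turn 'S:0 H:4 T:8' into {'soft': 0, 'hard': 4, 'transport': 8}."""
    return {LABELS[key]: int(value) for key, value in re.findall(r"\b([SHT]):(\d+)", text)}


def needs_attention(counters: dict[str, int]) -> bool:
    """Rule of thumb (an assumption): any hard or transport error deserves a closer look."""
    return counters.get("hard", 0) > 0 or counters.get("transport", 0) > 0


if __name__ == "__main__":
    counters = parse_error_counters("S:0 H:4 T:8")
    print(counters, "check the disk" if needs_attention(counters) else "looks fine")
```

Watching whether the numbers keep growing between checks is more telling than any single reading.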

Anyways, at least now I know the issue for sure. Time to replace the drive. I do have an identical spare that I bought on purpose in case a drive in the pool went bad. What exactly do I need to do to replace that drive? I have copied down the serial number of the drive from the SMART report. The pool is raidz2.

Thanks!

Edit: To replace the disk, should I do this?

1) Disk > Replace Disk > Select failed disk and click Replace button
2) Shut down server
3) Remove defective drive, and then replace with spare drive
4) Power back on and then it should detect and auto re-build correct?
 

gea

Edit: To replace the disk, should I do this?

1) Disk > Replace Disk > Select failed disk and click Replace button
2) Shut down server
3) Remove defective drive, and then replace with spare drive
4) Power back on and then it should detect and auto re-build correct?
With hot-plug capable hardware, you can hot-insert and remove disks,
so 1) is enough.

btw
napp-it alerts only on ZFS pool errors.
 

ZzBloopzZ

I have a standard computer case, so I will have to unplug the drives one by one until I find the one with the bad drive's serial number. Sadly, I will have to shut down the system for that. Wish I could afford a fancier system. :c(
 

ZzBloopzZ

I shut down the system and connected the spare drive to a free SATA cable on the Mini SAS to SATA connector, which is connected to an M1015. I powered on the system, logged in to napp-it and performed Disk > Spare > selected the bad disk on the left and replaced it with the new spare on the right. It is now resilvering (15.5 hours remaining).

Questions:

1) Once complete, do I shut down the computer, take out the bad drive and put this newly resilvered drive in its place?
2) When I do the swap, can I re-use the original SATA cable that was connected to the defective drive for the spare? The SATA cable currently on the spare that is being resilvered does not reach the top of the case where the defective drive is located. I temporarily placed the spare on the floor of the case while it resilvers because I have no more empty drive bays.

Thank you Gea for your support and efforts. I am still learning all about this and enjoying it. Truly appreciate you!
 

gea

1. If you do a disk >> replace, the new disk replaces the faulted disk, which you can then remove.

Other option: If you add a disk as a hotspare, it replaces a faulted disk automatically but remains a hotspare. You should then replace the faulted disk with a new disk. The hotspare is then available again.

2. Your disk identification is the WWN. This is a unique disk ID, similar to the MAC address of a NIC, and
therefore independent of the HBA port or SATA connector.
 

ZzBloopzZ

I understand perfectly. Thank you kindly. I am still an amateur at all of this but willing to learn, and having quite a bit of fun in all honesty. Nice to have new challenges. :c)

Another question. My pool is 10x 3TB in raidz2. Now, the spare has been resilvering for ~27.5 hours. Full status:
9.82T scanned out of 19.8T at 121M/s, 24h1m to go.

However, now I can hear the defective drive making a weird vibration sound. It seems its health is deteriorating fast. Hopefully it will resilver properly in the next few days, as the estimated time keeps increasing. Originally it said 12 hours remaining, then after 12 hours it said 14 hours remaining, and now it's at 24 hours.

My question is: since this defective drive appears to be running extremely slowly, should I have pulled the bad drive out at the start and swapped in the spare? Would the resilver have been quicker with no defective drive in the pool slowing everything down?
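The estimate is just (total minus scanned) divided by the current scan rate, so it climbs whenever the rate drops. A rough sketch of that arithmetic in Python, assuming the K/M/G/T suffixes in the status line are powers of 1024 (the function names are only for illustration):

```python
import re

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(size: str) -> float:
    """Convert a string like '19.8T' or '121M' to bytes (binary units assumed)."""
    return float(size[:-1]) * UNITS[size[-1]]


def resilver_eta(status: str) -> tuple[int, int]:
    """Return (hours, minutes) remaining, computed from a scan progress line."""
    match = re.search(
        r"([\d.]+[KMGT]) scanned out of ([\d.]+[KMGT]) at ([\d.]+[KMGT])/s", status
    )
    if match is None:
        raise ValueError("not a scan progress line")
    scanned, total, rate = (to_bytes(group) for group in match.groups())
    minutes = int((total - scanned) / rate // 60)
    return divmod(minutes, 60)


if __name__ == "__main__":
    hours, minutes = resilver_eta("9.82T scanned out of 19.8T at 121M/s, 24h1m to go.")
    print(f"{hours}h{minutes}m to go")
```

Fed the status line above, this reproduces the reported 24h1m, which suggests the growing estimate comes purely from the falling scan rate.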

Thank You.
 

gea

If a semi dead disk slows down the whole pool, remove it, insert a new disk and do a
disk replace "removed" >> new
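On the command line, the equivalent of that menu action is presumably `zpool replace <pool> <old> <new>`; that napp-it wraps exactly this command is an assumption here. The sketch below only builds and prints the command so the device names can be double-checked before running anything. The pool and disk names are placeholders, not taken from this system.

```python
import shlex


def zpool_replace_argv(pool: str, old_disk: str, new_disk: str) -> list[str]:
    """Build (but do not run) the command that swaps old_disk for new_disk in the pool."""
    return ["zpool", "replace", pool, old_disk, new_disk]


if __name__ == "__main__":
    # Placeholder names: look up the real ones with `zpool status` first.
    argv = zpool_replace_argv("tank", "c1t0d0", "c1t5d0")
    print("would run:", shlex.join(argv))
```

Checking the printed line against the output of `zpool status` before executing it avoids replacing the wrong disk.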
 

ZzBloopzZ

I just checked the progress and it is now saying:

9.95T scanned out of 19.8T at 76.2M/s, 37h39m to go
970G resilvered, 50.24% done

The defective drive is spitting errors left and right and getting louder in noise. Strangely, it now shows BOTH the defective drive and the spare drive as "resilvering". Is that normal? Before it showed just the spare drive as resilvering.

Is it too late, or can I somehow cancel the resilver? I would like to shut down the computer, pull out the defective drive, re-wipe the spare and then place it in the defective drive's place and start resilver process all over again if possible. I think it would go faster that way since this nearly dead drive is slowing everything down too much.

Thank you again for your assistance. I would be at a complete loss without you.
 

ZzBloopzZ

Day and a half later:

10.2T scanned out of 19.8T at 457K/s, (scan is slow, no estimated time)
998G resilvered, 51.70% done

Jeeez. This will take WEEKS at this rate. Something is up with that drive for sure. Wish there was a way to cancel, take out the bad drive and then resilver. :c(