How to build an OpenSolaris-derived ZFS Storage Server

gea

Well-Known Member
Dec 31, 2010
2,472
834
113
DE
Basics

ZFS is a revolutionary filesystem with nearly unlimited capacity and superior data security thanks to copy-on-write, RAID-Z1-3 without the RAID 5/6 write-hole problem, an online filecheck/refresh feature (scrub) and the ability to create nearly unlimited snapshots without delay or initial space consumption. ZFS boot-environment snapshots are the way to go back to former OS states. ZFS is stable and used in enterprise storage systems.

Features like deduplication, online encryption (from ZFS v31), triple-parity RAID and hybrid storage with SSD read/write cache drives are state of the art and simply included ZFS properties. Volume, RAID and storage management are part of every ZFS system and handled with just two commands, zfs and zpool. ZFS is now part not only of Solaris-derived systems but also available on BSD, OSX and Linux under the umbrella of OpenZFS.
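As a sketch of how compact this two-command management is - the pool, filesystem and disk names below are examples only (list your own disks first with the format command):

```shell
# Create a double-parity (RAID-Z2) pool from six disks.
# c1t0d0 ... c1t5d0 are placeholder Solaris device names.
zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0

# Create a filesystem on the pool and enable compression
zfs create tank/data
zfs set compression=lz4 tank/data

# Check pool health and layout
zpool status tank
```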

But Solaris-derived systems are more than that. ZFS is not just a filesystem add-on. Sun developed a complete enterprise operating system with a unique integration of ZFS and its services: a real AD-capable, Windows-ACL-compatible and fast SMB server and a fast NFS server, each enabled as a ZFS property. Comstar, the included iSCSI framework, is fast and flexible, usable for complex SAN configurations. With Crossbow, the virtual switch framework, you can build complex virtual network switches in software, and DTrace helps to analyse the system. Service management is done via the svcadm tool. Lightweight virtualisation can be done on application level with Zones (Solaris Zones, LX/Linux containers) or with KVM. All these features are developed and supported by Sun (now Oracle) or Illumos, perfectly integrated into the OS with the same handling and without compatibility problems between them - the main reason why I prefer ZFS on Solaris-derived systems.
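For example, serving a filesystem over SMB or NFS is just a ZFS property plus the corresponding SMF service (the filesystem name is an example):

```shell
# Share a filesystem via the kernel SMB server and via NFS
zfs set sharesmb=on tank/data
zfs set sharenfs=on tank/data

# Services are controlled with svcadm/svcs, e.g. the SMB server:
svcadm enable smb/server
svcs smb/server        # verify the service is online
```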

Since Oracle bought Sun and closed the OpenSolaris project, the following operating systems are based on (the last free) OpenSolaris, which was forked into Illumos:

1. some commercial options

- Oracle Solaris 11

the fastest and most feature-rich ZFS server at the moment, and the only one with encryption.
I support it with my free napp-it Web-GUI.

- NexentaStor, an enterprise storage appliance (based on Illumos)

2. some free options

- OpenIndiana Hipster (the free successor of OpenSolaris), based on Illumos
always in a dev state, usable for desktop or server use. I support it with my napp-it Web-GUI.
download http://openindiana.org

The Illumos Project
is a fork of the OpenSolaris kernel with the kernel/OS functions and some base tools. Illumos is intended to be completely free and open source. It is not a distribution but the common upstream of the main distributions NexentaCore, OpenIndiana, OmniOS and SmartOS.

3. Use cases:

Although there is a desktop option with OpenIndiana, Solaris was developed by Sun to be an enterprise server OS with stability and performance in first place, best for:

NAS
-Windows SMB fileserver (AD, ACL compatible, snaps via Previous Versions)

SAN
-NFS and FC/ iSCSI Storage

Web
-AMP Stack (Apache, mySQL, PHP)

Backup and Archive
-Snapshots, online file checks with data refresh (scrub), checksums

Such systems can be used as appliances and managed remotely via browser and Web-GUI. They run on real hardware or virtualized, best on ESXi with PCI passthrough of the SAS controller and its disks.


4. Hardware:

See my build examples
http://www.napp-it.org/doc/downloads/napp-it_build_examples.pdf


5. Manual ZFS server installation

Download the ISO or USB image, boot from it and install the OS to the boot drive.
Use the whole boot disk; installation is easy.
See http://napp-it.org/doc/downloads/napp-it.pdf

You can also install the OS as a virtualized SAN.
(All-In-One, Virtual Server + SAN + virtual network switch in a box)
see http://napp-it.org/doc/downloads/napp-in-one.pdf

6. After the OS setup, set up the storage appliance (CLI as root)

wget -O - www.napp-it.org/nappit | perl
You can now manage your NAS-appliance via http://ip:81

That's all: install and set up your ZFS server, ready to use, in about 30 minutes.

7. napp-it To Go
As an alternative to the manual setup, you can use preconfigured images: either a template for ESXi, or system images that you can clone to a new SATA boot SSD.

Gea
 

PigLover

Moderator
Jan 26, 2011
2,954
1,262
113
Great writeup, very helpful. I've been planning to do some experimenting and the parts I have fit your writeup perfectly. Two questions:

I love SuperMicro MBs and IPMI. Found this tonight: http://communities.vmware.com/thread/280988. Have you had any similar problems using IPMI with esxi?

The CPU I have available is an L3426. Is this going to be enough CPU to do an All-in-one like you describe?
 

Patrick

Administrator
Staff member
Dec 21, 2010
11,884
4,845
113
If you use the Realtek for IPMI 2.0, then assign another NIC for the VMware management, that fixes that problem.
 

fblittle

New Member
Apr 5, 2011
19
0
1
Gridley, CA (near Sacramento)
I have set up a Solaris 11 NAS server with SMB shares. napp-it has truncated all my passwords to 8 characters in length. This was a problem until I discovered, using LastPass, that it had recorded the new passwords and showed that they were all 8 characters long. Is this normal? Some of my passwords are longer. I am using the console on a Win7 machine, so Windows shouldn't be the limitation. I can't find any documentation on this. Other than that, I love napp-it. It makes it so easy to administer Solaris, I almost feel guilty using it.
 

gea

Well-Known Member
Dec 31, 2010
2,472
834
113
DE
(quoting fblittle's password question above)
The current napp-it nightly allows setting passwords of up to 16 characters.
But if you use Nexenta, only the first 8 characters are used
(the reason I had set the password form to 8 characters max).

OpenIndiana and SE11 are ok with longer passwords.

Gea
 

gea

Well-Known Member
Dec 31, 2010
2,472
834
113
DE
Why use ZFS?

ZFS is software RAID + RAID management + dynamic volume management (storage virtualisation).
If you compare it to traditional RAID 5/6, look at the following:

1.
RAID 5/6 write-hole problem: if a problem occurs during a write, you have partly updated data. You can reduce the risk with a battery-backed controller, but you can still end up with a damaged filesystem. You then need at least an offline filecheck, which can last days on large storage.

You can only solve this problem with a copy-on-write filesystem.


2.
Bad-block problem
You do not need a whole-drive failure. A single bad block (producing a read delay or error) can set a disk to failed, and two bad blocks can kill a RAID 5 if they happen on two disks. A good RAID controller can handle this, at the price of an offline filecheck that can last days.

You can only solve this problem on the filesystem level, with a self-healing filesystem that can handle a lot of bad blocks and repair them on the fly without failing a disk on timeout.


3.
Silent errors, or disk errors due to cabling or driver problems
At a statistical rate you encounter data errors by chance, or errors occur due to cabling or driver problems. Only some of them can be detected by an offline filecheck, which can last days.

You can only solve this problem on the filesystem level, with data checksums paired with regular online checks of data integrity.
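On ZFS this regular online check is the scrub; a minimal sketch (the pool name is an example):

```shell
# Start an online integrity check: all data is read and verified
# against its checksums; errors are repaired from redundancy, online.
zpool scrub tank

# Watch progress and see repaired or unrecoverable errors
zpool status -v tank
```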


4.
Data validity
If you, someone else or a virus accidentally deletes or modifies data, you usually need a backup. If you discover the problem only later, your backup is wrong, as is your second backup.

You need versioning, or a file history that may go back for weeks, months or years.
The only stable way to do this is snapshots on a copy-on-write filesystem.
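With ZFS such a file history costs nothing until data changes; a sketch with example filesystem, snapshot and file names:

```shell
# Take a snapshot: instant, no initial space consumption
zfs snapshot tank/data@before-update

# List snapshots and the space they hold
zfs list -t snapshot

# Restore a single file from the hidden snapshot directory ...
cp /tank/data/.zfs/snapshot/before-update/report.doc /tank/data/

# ... or roll the whole filesystem back to the snapshot
zfs rollback tank/data@before-update
```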


5.
Data is growing, and hardware dies some day
In my own environment I see 50% data growth per year. I need the option to increase capacity up to the petabyte range without rebuilding RAIDs or extending RAID stripes (you know: lasts days...). I also cannot allow data access to depend on a RAID controller of one special brand; RAID must be controller-independent.

This can only be solved with software RAID and pooled storage with dynamic filesystem sizes (storage virtualisation).
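With ZFS you grow a pool by adding another vdev; every filesystem on the pool sees the new capacity immediately (disk and pool names are placeholders):

```shell
# Add a second RAID-Z2 vdev to the existing pool:
# no rebuild, no stripe extension, capacity grows at once
zpool add tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0

# Instead of fixed partitions, limit a filesystem with a quota
zfs set quota=10T tank/data
```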


6.
Performance
Every NAS can deliver pure disk or RAID performance, but pure disk performance is bad.

You need RAM caching, additional SSD caching or dedicated log devices for fast and secure sync writes.
These features are part of ZFS. With Solaris, any otherwise unused RAM is used for caching; no RAM stays idle.
This is not because ZFS is RAM-hungry - your RAM is simply used to increase performance.
If pure disk performance is enough, ZFS is stable with 1-2 GB RAM.
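Cache (L2ARC) and log (ZIL/SLOG) devices are added to a pool the same way as data disks; a sketch with placeholder device and filesystem names:

```shell
# Add an SSD as L2ARC read cache
zpool add tank cache c3t0d0

# Add a mirrored, supercap-protected SSD pair as dedicated log device
zpool add tank log mirror c3t1d0 c3t2d0

# Force secure sync writes on a filesystem that serves VMs
zfs set sync=always tank/vm
```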


Summary:
All of these are daily problems that affect data security, data availability or data validity. They get huge with large storage in the multi-terabyte area, and they are the reason for the need of modern filesystems: ZFS as the first and most stable option, Btrfs and ReFS some day. Older filesystems like ext, hfs+, ntfs or xfs cannot handle these problems at all, or not in a comparable way.

Stay or go with ZFS: the newest and only option on Solaris/Illumos, where development is done, a stable alternative on BSD, and more and more an option on Linux.
 

gea

Well-Known Member
Dec 31, 2010
2,472
834
113
DE
Affordable high-end storage

You can build a nice, cheap ZFS-based Solarish home fileserver on an HP Microserver N36/40/54,
or a mid-size All-In-One or fileserver based on a SuperMicro X10SL7-F with the included LSI SAS 2308 HBA.


You can also build real high-performance ZFS storage or All-In-Ones at quite minimal cost.
One of the current best offers (mid 2013) is the following config that I built for some tests.


An appliance based on a SuperMicro X9SRH-7TF
with 2 x 10 GbE and a SAS 2308 in IT mode (= LSI HBA 9207) onboard, about 500 $/€


CPU: Xeon E5-2620, 2 GHz, 6 cores (optionally 4 or 8 cores)
RAM: 32 GB ECC RAM (4 x 8 GB, max 256 GB)


Step 1:
flash onboard LSI 2308 to IT mode
http://www.napp-it.org/doc/manuals/flash_x9srh-7tf_it.pdf

Step 2:
BIOS settings
- enable VT-d (Advanced - Chipset - Northbridge - I/O)

Step 3:
Install ESXi 5.1 build 799733 onto a 50 GB SATA SSD

Step 4:
enable pass-through for the 2308 (ESXi vSphere: advanced settings), reboot

Step 5:
upload the OmniOS ISO to the local datastore on the 50 GB SSD (use the ESXi file browser)

Step 6:
create an OmniOS SAN VM (20 GB) with an e1000 NIC
add the LSI 2308 as a PCI device
CD = bootup ISO, stable from May

boot and setup

step7:
setup network according to:
napp-it // webbased ZFS NAS/SAN appliance for OmniOS, OpenIndiana and Solaris

Step 8:
install napp-it via: wget -O - www.napp-it.org/nappit | perl

Step 9:
start napp-it via http://ip:81
- create a pool (RAID-Z2 from 6 x SanDisk Extreme II 480 GB); add a high-speed SSD with supercap (SLC or Intel S3700) or a DRAM-based ZIL (ZeusRAM)
- create a filesystem /tank/vm
- share the filesystem via NFS
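If you prefer the CLI over the napp-it menus, step 9 amounts to something like the following (device names are placeholders for the six SSDs and the ZIL device):

```shell
# Create the RAID-Z2 pool from the six data SSDs
zpool create tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0

# Add the supercap SSD or ZeusRAM as dedicated log device
zpool add tank log c2t6d0

# Create the VM filesystem and share it via NFS
zfs create tank/vm
zfs set sharenfs=on tank/vm
```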

Step 10:
connect the filesystem /tank/vm in ESXi as an NFS datastore

Step 11:
- upload a Win8 ISO to the NFS datastore
- create a new Win8 VM on the NFS datastore on OmniOS, with the DVD drive connected to the Win8 ISO
- set up Win8
- Win8 ok


Next steps:
- update ESXi to 5.1U1 build 1065491: ok
- update OmniOS to the newest stable build b281e50: ok
- install the VMware tools, set up vmxnet3 (http://napp-it.org/doc/ESXi-OmniOS_Installation_HOWTO_en.pdf): ok


my "desktop testbed"
 

mrkrad

Well-Known Member
Oct 13, 2012
1,244
52
48
Can you tell us what happens when an array has a misbehaving consumer drive? Say it has a bunch of bad sectors. It decides go into 180 second deep cycle recovery.

1. Does the volume mounted on the drive lag at all? if so how long?
2. How long does it try before giving up? A few hours? minutes? 8 seconds?
3. Sometimes drives take a while spin up, does this impact the share?
4. Modern SAS drives have PI (protection information) - SAS drives with this feature can provide both error detection and partial good reads. Does ZFS support this? A drive may be able to say: this sector is good, this sector is good, this sector is known bad but here is what I've got for data. (This allows some controllers to reconstruct the bad sectors first while the sectors that are not bad are still read from the damaged drive, to reduce multi-disk rebuild failures.)
5. Does this support clustering? Two raid controllers to a sas expanders to dual ported drives for no single point of failure
6. Is the failover (iSCSI, NFS) good enough for ESXi? It is rather picky about timing.
7. How's the VAAI support? Any tips? the main thing is thin-reclamation
8. What is a good hardware setup for a D2D storage (inline dedupe,compress optional)?
9. Is the smb fully compatible with windows? what happens with permissions and metadata?
10. How bad is it to use a raid controller? Can you use hardware raid and build a giant unprotected volume?
11. I ask this because PI and one other feature make the LSI MegaRAID do checksum-verified reads (same as ZFS) in hardware. I'm guessing this might give your VM a bit more CPU for dedupe/compress.

12. If you had a DL180 G6, would you just throw in say 72-96 GB of ECC RAM, a cheap 5520 CPU (or two sockets of cheap 5520s), two LSI controllers and a 10 GbE NIC? If you had 6 Gbps to 12 SATA drives, what is a good balance of SSD? SLC is expensive and very small (X25-E) and quite slow compared to modern MLC/TLC. What's a good balance per buck?

Can you alter the use of the SSDs, i.e. 512 GB of TLC 840 for read-only cache and perhaps a RAID-1 of 100 GB old-school Samsung SLC drives for write-back/ZIL? Does the SLC RAID-1 for the ZIL/log need to be a particular speed?


I don't have an IT-mode controller, but I could use the P420/1GB FBWC HP Smart Array and just do RAID-0 of single disks to simulate JBOD.
 

gea

Well-Known Member
Dec 31, 2010
2,472
834
113
DE
(quoting mrkrad's questions above)
A lot of questions; I will try to answer.

1.
A consumer drive is OK with ZFS if you can accept a delay on errors until the drive responds. This can happen on enterprise disks as well, but not as often. If ZFS detects too many errors, the pool goes to a degraded state (error: too many errors) with the disk offlined. A hot spare can replace the faulted disk immediately.
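A hot spare is added per pool; a minimal sketch with placeholder names (on Solarish systems the fault-management service can then activate it automatically when a disk faults):

```shell
# Attach a hot spare to the pool
zpool add tank spare c1t7d0

# Check only pools with problems; a faulted disk shows up here
zpool status -x
```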

2. For exact times you should ask at IRC #zfs.
But I had an ESXi timeout on an NFS datastore after 140 s while the pool stayed intact with a few error messages in the system log. In any case, you do not need those special TLER disks like with hardware RAID.

3. The share is accessible whenever the disks are up.

4. I do not know, but disk-internal error mechanisms are independent of ZFS. ZFS checks validity not on the disk level but end-to-end, in RAM.

5. Solaris supports multipathing on SAS disks

6. I do not use multipathing; maybe another person can answer, but I would expect it to be similar to other solutions.

7. VAAI is not a ZFS feature. You can use NexentaStor. They support VAAI

8. Too many options without knowing use cases or finances.

9. The Solaris CIFS server is the only SMB server outside Windows that supports Windows ACLs, Windows SIDs, Windows Previous Versions and Windows share-level ACLs. SAMBA does not.
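On Solarish systems these Windows-style (NFSv4) ACLs are visible and settable from the shell; a sketch with an example user and path:

```shell
# Show the Windows-compatible ACL of a shared folder (Solaris ls)
ls -V /tank/data

# Grant a user full access with file/dir inheritance, as Windows would
/usr/bin/chmod A+user:alice:full_set:file_inherit/dir_inherit:allow /tank/data
```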

10. Very bad. You lose the self-healing features, which means ZFS can detect errors but cannot repair them. You also lose data security on sync writes due to the controller cache. ZFS needs full control of the disks.

11 Do not know

12. Even with consumer SSDs you can build stable storage with RAID-Z2/3. You may need enough hot-spare disks, and you should replace the SSDs after 3 years or so. If you need a ZIL, use the fastest one you can afford, but it must have a supercap (e.g. ZeusRAM, Intel S3700 200 GB or a fast SLC) or it is worthless.

For heavy write workloads, consider the Intel S3700 as well. For sync writes, always add a very fast ZIL to consumer SSDs to reduce small writes; see http://www.napp-it.org/doc/manuals/benchmarks.pdf for some basic behaviours.

Using slow or old SSDs as a ZIL gives you worse performance than no dedicated ZIL at all.
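A practical consequence: a dedicated log device can be tried out and removed again without harming the pool, so benchmarking before committing is cheap (device name is a placeholder):

```shell
# Add a candidate SSD as dedicated log device
zpool add tank log c4t0d0

# ... benchmark sync writes; if the device turns out too slow,
# remove it again and fall back to the pool-internal ZIL
zpool remove tank c4t0d0
```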
 

mrkrad

Well-Known Member
Oct 13, 2012
1,244
52
48
Hmm, I think you would get a "datastore lost connectivity" if latency got anywhere near 1 second! Ouch!

Can you tune Solaris so it will drop the drive in less than a quarter of a second?
 

Boris

Member
May 16, 2015
75
11
8
Could someone please help me?
Which CPU will be better for a ZFS storage server (RAID-Z2 or RAID-Z3 with 8, 12 or 16 HDDs): a Socket 2011 Xeon with 8 cores at 2.2 GHz, or one with 4 cores at 3.7 GHz?
 

PigLover

Moderator
Jan 26, 2011
2,954
1,262
113
Could someone please help me?
Which CPU will be better for a ZFS storage server (RAID-Z2 or RAID-Z3 with 8, 12 or 16 HDDs): a Socket 2011 Xeon with 8 cores at 2.2 GHz, or one with 4 cores at 3.7 GHz?
Just for the filesystem processing and raid calcs? They are both overkill. Massive overkill.
 

Boris

Member
May 16, 2015
75
11
8
And one more question: does a hardware LSI controller matter at all? Controller BBU and RAM cache are all useless, right?
Is an LSI 2208 or similar, used in software mode, good enough?
 

gea

Well-Known Member
Dec 31, 2010
2,472
834
113
DE
An LSI 9211 / IBM M1015 (2008 chip) flashed with IT firmware,
or the newer LSI 9207 (2308 chip, IT mode per default), is perfect.

The LSI 2208 is a bad choice for ZFS, as it is meant for hardware RAID.
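Crossflashing a 2008/2308 controller to IT mode is done with LSI's sas2flash tool from a DOS/EFI boot stick; roughly as below. The firmware file names depend on your exact card - treat them as placeholders and follow a flashing guide for your model (e.g. the X9SRH-7TF PDF linked earlier in this thread) exactly, since erasing the flash on the wrong controller can brick it.

```shell
# List controllers and their current firmware
sas2flash -listall

# Erase the flash, then write the IT firmware and (optional) boot BIOS
sas2flash -o -e 6
sas2flash -o -f 2118it.bin -b mptsas2.rom
```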