ZFS build for VM storage. SSD or HDD pool advice.


legen

Active Member
Mar 6, 2013
213
39
28
Sweden
Hello!

We are looking to build our first shared network storage (or SAN) for a couple of C6100 machines. We will be using OmniOS + napp-it as a first attempt. Later we plan to add a hot-spare server with ZFS replication. This has grown into a larger hobby project, so as always money is an issue :)

After reading a lot of benchmarks (e.g. https://calomel.org/zfs_raid_speed_capacity.html), guides etc., I am still unsure how best to build what we want.

The alternatives are an all-SSD pool or an HDD pool + ZIL and L2ARC. We want at least raid-z2 and good performance for our VMs. The SAN is connected through a 10 GbE connection. The VMs are a mix of Windows and Linux. They will primarily be running game servers (some might be quite I/O intensive).

Of course a full SSD pool with e.g. 4 or 6 Samsung 512 GB Pro SSD drives would be best, but that won't give us much space for VM storage. With an SSD pool we might have to use dedup; with ordinary HDD storage we can avoid that.

In more detail, we plan the HDD setup like this:
6x WD Red 3 TB in RAID-Z2 (might add 1 hot spare too)
2x Samsung 256 GB Pro SSD for L2ARC (possibly mirrored, we already have two of these)
2x Crucial M500 SSDs for ZIL (mirrored, have power-loss capacitors)

What do you guys think: can the HDD alternative really provide enough performance for, say, 20+ VMs? Can it even compare to the SSD alternative?
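
To put rough numbers on this, here is the back-of-envelope math I have been doing (Python, with rule-of-thumb per-disk IOPS figures that are assumptions rather than measurements):

Code:
# Back-of-envelope comparison of the two layouts, using rule-of-thumb
# numbers (~100 random IOPS per spindle, tens of thousands per SATA SSD).

def raidz_usable_tb(disks, parity, size_tb):
    # Usable space before filesystem overhead.
    return (disks - parity) * size_tb

def pool_random_iops(vdevs, iops_per_vdev):
    # Random I/O scales with the number of vdevs; a raidz vdev delivers
    # roughly the random IOPS of a single member disk.
    return vdevs * iops_per_vdev

hdd_tb   = raidz_usable_tb(6, 2, 3.0)    # 6x 3 TB WD Red in raidz2
hdd_iops = pool_random_iops(1, 100)

ssd_tb   = raidz_usable_tb(6, 2, 0.5)    # 6x 512 GB Samsung Pro in raidz2
ssd_iops = pool_random_iops(1, 40_000)

print(f"HDD raidz2: ~{hdd_tb:.0f} TB usable, ~{hdd_iops} random IOPS")
print(f"SSD raidz2: ~{ssd_tb:.0f} TB usable, ~{ssd_iops} random IOPS")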
 

gea

Well-Known Member
Dec 31, 2010
3,141
1,182
113
DE
The problems:

- A Raid-Z vdev has the same random I/O performance as a single disk. If you have VMs with a lot of small concurrent writes this will be quite slow. This is the reason you usually use mirrors with spindles, or Raid-Z only when using SSD-only pools.

If your workload is mostly read, you may be able to ignore this if you have a lot of RAM (ARC read cache) or a slower L2ARC on SSD, but I would not expect acceptable performance with up to 20 VMs.

- If you enable secure sync write, you need a dedicated ZIL (SLOG). It does not need to be a mirror (in case of a failure the on-pool ZIL is used), but it should have a supercap. I would use an Intel S3700 as it is one of the best regarding write performance.

In your case, I would
- use as much RAM as possible (and do an ARC check to see if an extra L2ARC is needed)
- think about a pool of 3x 2-way mirrors with the WD Reds, or accept the low Z2 performance
SSD-only pools are much faster than any pool with spindles, but always stay below an 80% fill rate (50% for top performance)

- use the 256 GB SSDs for an extra high-speed pool (a mirror; possibly buy some more for a Raid-Z2 SSD pool) and do backups to the WD pool.
- avoid dedup in any case, as it eats the RAM you need for performance.

- if you later add a second hot-spare/backup server, put the spindles there and keep the SSD-only pool on your main machine.
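
For the ARC check, something like this quick script gives the numbers on OmniOS (a rough sketch using the illumos kstat interface; a persistently low hit ratio while the ARC sits at its maximum is a hint that more RAM or an L2ARC would help):

Code:
# Read the global ZFS ARC statistics via kstat(1M) on OmniOS/illumos
# and print the ARC size and hit ratio.
import subprocess

def arcstats():
    out = subprocess.check_output(["kstat", "-p", "zfs:0:arcstats"], text=True)
    stats = {}
    for line in out.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0].split(":")[-1]] = parts[1]
    return stats

s = arcstats()
hits, misses = int(s["hits"]), int(s["misses"])
size_gib, max_gib = int(s["size"]) / 2**30, int(s["c_max"]) / 2**30
print(f"ARC size: {size_gib:.1f} GiB of {max_gib:.1f} GiB max")
print(f"ARC hit ratio: {100 * hits / (hits + misses):.1f} %")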
 

dba

Moderator
Feb 20, 2012
1,477
184
63
San Francisco Bay Area, California, USA
Twenty VMs? Not using SSD seems almost criminal! Would it work to give each VM two volumes, one from an SSD pool and the other from a non-SSD pool?

And as usual I have to put in a plug for my own favorite VM architecture for a lab or small network: locally attached non-RAID SSD drives for VM storage plus block-level incremental replication every few minutes to a separate server. Locally attached SSD means great IOPS, which VMs love. Giving up RAID reduces costs dramatically, while the very frequent replication provides protection against disaster with a maximum data loss window of only a few minutes. Using replication as the DR strategy also means that you automatically maintain very frequent snapshots on the replication server, protecting against human error as well.
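
For the curious, the replication side of that architecture is just a snapshot-and-incremental-send loop. A minimal sketch (the dataset names, target host and five-minute interval are placeholders, and the initial full send is left out):

Code:
# Snapshot the VM dataset every few minutes and replicate the delta
# with zfs send -i piped into zfs receive on the backup box.
import subprocess, time

SRC  = "tank/vms"                # local dataset holding the VM images
DST  = "backup/vms"              # dataset on the replication server
HOST = "replica.example.local"   # replication target (placeholder)

def snap(name):
    subprocess.run(["zfs", "snapshot", f"{SRC}@{name}"], check=True)

def replicate(prev, cur):
    send = subprocess.Popen(["zfs", "send", "-i", f"{SRC}@{prev}", f"{SRC}@{cur}"],
                            stdout=subprocess.PIPE)
    subprocess.run(["ssh", HOST, "zfs", "receive", "-F", DST],
                   stdin=send.stdout, check=True)
    send.wait()

prev = "repl-0"
snap(prev)   # first run: send this snapshot in full to seed the target
while True:
    time.sleep(300)
    cur = f"repl-{int(time.time())}"
    snap(cur)
    replicate(prev, cur)   # old snapshots are kept as a history on both sides
    prev = cur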
 

legen

Active Member
Mar 6, 2013
213
39
28
Sweden
First I just wanted to say thanks for the great advice from you two; I know that you are both very skilled in these areas.

The goal
What I am trying to do is provide VM storage for a number of C6100 machines. Almost all servers will use the same files (i.e. game server files) and host game servers. The game files can be quite big, and therefore we need plenty of storage.


The problems:

- A Raid-Z vdev has the same random I/O performance as a single disk. If you have VMs with a lot of small concurrent writes this will be quite slow. This is the reason you usually use mirrors with spindles, or Raid-Z only when using SSD-only pools.

If your workload is mostly read, you may be able to ignore this if you have a lot of RAM (ARC read cache) or a slower L2ARC on SSD, but I would not expect acceptable performance with up to 20 VMs.
- I know about the raid-z performance, but I was hoping the ZIL and L2ARC would improve it so that performance would be "near" SSD speeds.

I will not have many small concurrent writes to the pool (no big database running here).
The server hosting this will have the following specifications,
  • Min 64 GB DDR3 ECC RAM (planning to get up to 128 GB if needed)
  • Probably building on the X8DTH-6F motherboard
  • 2x L5630 or L5639
With this hardware I doubt we will need an L2ARC right away :). We might add one if we add more pools.
- If you enable secure sync write, you need a dedicated ZIL (SLOG). It does not need to be a mirror (in case of a failure the on-pool ZIL is used), but it should have a supercap. I would use an Intel S3700 as it is one of the best regarding write performance.
In your case, I would
- use as much RAM as possible (and do an ARC check to see if an extra L2ARC is needed)
- think about a pool of 3x 2-way mirrors with the WD Reds, or accept the low Z2 performance
SSD-only pools are much faster than any pool with spindles, but always stay below an 80% fill rate (50% for top performance)
- I did not know that mirroring the ZIL was unnecessary. I read here, in Nex7's Blog: ZFS Intent Log, that
Code:
 (optional but strongly preferred) Get a second dedicated log device (of the exact same type as the first), and when creating the log vdev, specify it as a mirror of the two. This will protect you from nasty edge cases.
We might get the S3700 or the Crucial depending on our needs and the cost (yeah, the Intel is better).

- I'm a little against the 3x 2-way mirror solution, since if two drives in the same mirror fail we lose the pool. That's a bigger risk than I want to take.
- use the 256 GB SSDs for an extra high-speed pool (a mirror; possibly buy some more for a Raid-Z2 SSD pool) and do backups to the WD pool.
- avoid dedup in any case, as it eats the RAM you need for performance.
- if you later add a second hot-spare/backup server, put the spindles there and keep the SSD-only pool on your main machine.
Separating the pools might be a good idea. We also have 2x 512 GB Samsung Pro SSDs.

The main issue with the SSDs is the price per gigabyte. With the HDD alternative we get ~10 TB of storage in RAID-Z2. With the same number of SSDs and a low fill rate we only get ~1.5 TB of storage, for a much higher price tag.
I was hoping that by using a big ARC, an L2ARC and a ZIL one could get really good performance even from the slow spindles.

But based on your answers I guess I am hoping for too much :(

Oh, and I would personally like to thank you for your awesome napp-it software. We are also basing our SAN build on the recommended hardware from your website :)

Twenty VMs? Not using SSD seems almost criminal! Would it work to give each VM two volumes, one from an SSD pool and the other from a non-SSD pool?
And as usual I have to put in a plug for my own favorite VM architecture for a lab or small network: locally attached non-RAID SSD drives for VM storage plus block-level incremental replication every few minutes to a separate server. Locally attached SSD means great IOPS, which VMs love. Giving up RAID reduces costs dramatically, while the very frequent replication provides protection against disaster with a maximum data loss window of only a few minutes. Using replication as the DR strategy also means that you automatically maintain very frequent snapshots on the replication server, protecting against human error as well.
We are currently running 2x C6100s. Each node will run XenServer or Proxmox with a bunch of VMs. We will probably have 2-3 VMs per node.

Hehe, I'm not a criminal; I'm just too cheap to buy the SSDs when I thought the other alternative would be sufficient :).

I have to investigate if the two volume approach will work with what we want to do.

We are actually using your favorite architecture right now, with SSD drives in each node. But now that we aim to build bigger, we need a SAN for XenMotion / live migration / HA etc.


Could one alternative be to use 6x Samsung 512 GB SSDs in raid-z2 with dedup on? With our amount of RAM that might be a possibility. Since most of our files will be the same on all C6100 nodes, dedup might save lots of space. Might this be a good alternative given our space/performance requirements?
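
As a sanity check on the RAM question, here is a rough estimate using the commonly quoted figure of ~320 bytes of in-core dedup table per unique block (the amount of unique data is a guess on my part):

Code:
# Very rough dedup-table (DDT) RAM estimate for the planned SSD pool.
DDT_BYTES_PER_BLOCK = 320        # commonly quoted in-core DDT entry size
RECORDSIZE = 128 * 1024          # default ZFS recordsize for filesystems

unique_data_tib = 1.5            # guess: unique data left after dedup
unique_blocks = unique_data_tib * 2**40 / RECORDSIZE
ddt_ram_gib = unique_blocks * DDT_BYTES_PER_BLOCK / 2**30

print(f"~{unique_blocks/1e6:.0f} M unique blocks -> "
      f"~{ddt_ram_gib:.1f} GiB of DDT to keep in RAM")
# Looks feasible with 64-128 GiB of RAM, but note that small volblocks
# (e.g. 8K zvols for iSCSI) multiply the block count, and the DDT, accordingly.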
 

dba

Moderator
Feb 20, 2012
1,477
184
63
San Francisco Bay Area, California, USA
...
Could one alternative be to use 6x Samsung 512 GB SSDs in raid-z2 with dedup on? With our amount of RAM that might be a possibility. Since most of our files will be the same on all C6100 nodes, dedup might save lots of space. Might this be a good alternative given our space/performance requirements?
I think that it's worth a try, and whatever the results, it would make a great contribution to this and other forums. My uneducated guess is that it'd be a great solution. With just six half-terabyte drives, I think I'd go RAIDZ and not RAIDZ2. Failure rates are very low and rebuild times are really short with SSD drives, so I wouldn't think you'd need the extra parity drive.

10GbE or Infiniband for the SAN network?
 

legen

Active Member
Mar 6, 2013
213
39
28
Sweden
I think that it's worth a try, and whatever the results, it would make a great contribution to this and other forums. My uneducated guess is that it'd be a great solution. With just six half-terabyte drives, I think I'd go RAIDZ and not RAIDZ2. Failure rates are very low and rebuild times are really short with SSD drives, so I wouldn't think you'd need the extra parity drive.

10GbE or Infiniband for the SAN network?
The SSD alternative seems to be the best way to support the number of VMs we are aiming at while giving good performance. The cost should be about $2300 for this; I have to discuss it with my partner :).

We will begin with a simple 10 GbE setup. The SAN will have one 10 GbE NIC to a switch that has 2 SFP+ ports. All nodes will use LACP with 2 or 4 1 GbE NICs to the same switch. We think this will be sufficient to begin with. Later we plan on getting a Voltaire 4036 switch and doing QDR InfiniBand, but that's further into the future :)
 
Jan 4, 2014
89
13
8
Are you planning for iSCSI or NFS?
iSCSI doesn't play well with LACP.
Not sure about Xen, but VMware most definitely does not.

You'd be better off creating 2 separate paths (or IP segments) for storage, utilizing VLANs to separate the two.
Ideally 2 individual switches, which also gives you redundancy.
 

legen

Active Member
Mar 6, 2013
213
39
28
Sweden
Are you planning for iSCSI or NFS?
iSCSI doesn't play well with LACP.
Not sure about Xen, but VMware most definitely does not.

You'd be better off creating 2 separate paths (or IP segments) for storage, utilizing VLANs to separate the two.
Ideally 2 individual switches, which also gives you redundancy.
I was planning for iSCSI, but with your advice I might change that to NFS. We will probably try both before we decide :)

We will look into the networking a bit later, trying to focus on one piece at a time.

I'm going to add a build log for the whole thing later, hopefully as soon as tomorrow.
 

gea

Well-Known Member
Dec 31, 2010
3,141
1,182
113
DE
Performance wise there should not be much difference between iSCSI and NFS.
A huge advantage of NFS shared storage: you can connect via NFS or SMB (+ access to snaps via Windows "Previous Versions") from other machines. This makes backup/clone/restore/copy of offline VMs or from ZFS snaps a lot easier.
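
On the OmniOS side this is mostly a matter of share properties on the filesystem, roughly like this (a sketch; "tank/vms" is a placeholder dataset name):

Code:
# Share the same dataset over NFS (for the hypervisors) and SMB (for
# browsing/backup from Windows) using the ZFS share properties.
import subprocess

DS = "tank/vms"

def zfs_set(prop, value):
    subprocess.run(["zfs", "set", f"{prop}={value}", DS], check=True)

zfs_set("sharenfs", "on")         # NFS export for the VM hosts
zfs_set("sharesmb", "name=vms")   # SMB share; ZFS snaps appear as
                                  # "Previous Versions" on Windows clients
zfs_set("snapdir", "visible")     # optional: expose .zfs/snapshot over NFS too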
 

legen

Active Member
Mar 6, 2013
213
39
28
Sweden
Performance wise there should not be much difference between iSCSI and NFS.
A huge advantage of NFS shared storage: you can connect via NFS or SMB (+ access to snaps via Windows "Previous Versions") from other machines. This makes backup/clone/restore/copy of offline VMs or from ZFS snaps a lot easier.
The ability to also access the storage as an ordinary NFS/CIFS share is indeed a big advantage for us. We have not yet decided on our backup procedure for the VMs.

We have decided to go with the SSD pool alternative. We will go with either 6x or 8x 512 GB Samsung Pro SSDs in raid-z or raid-z2. The hardware will be what I wrote about previously in this thread:
  • Min 64 GB DDR3 ECC RAM (planning to get up to 128 GB if needed)
  • Probably building on the X8DTH-6F motherboard
  • 2x L5639 (probably two of these since dedup and compression will eat a lot of CPU)
We will run with dedup and lzjb compression on the SSD pool. Later we plan to add the WD Red HDD pool as a second pool in the machine for some slow secondary storage :)
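
Concretely, the pool setup should be something like the following (a sketch; the cXtYdZ device names are placeholders for whatever the disks show up as):

Code:
# Create the SSD pool as raidz2 and enable the discussed properties.
import subprocess

DISKS = ["c1t0d0", "c1t1d0", "c1t2d0", "c1t3d0", "c1t4d0", "c1t5d0"]

subprocess.run(["zpool", "create", "ssdpool", "raidz2", *DISKS], check=True)
subprocess.run(["zfs", "set", "compression=lzjb", "ssdpool"], check=True)
subprocess.run(["zfs", "set", "dedup=on", "ssdpool"], check=True)
# Keep an eye on fill rate and dedup ratio afterwards:
#   zpool list -o name,size,allocated,capacity,dedupratio ssdpool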

I will get back with a build log later tonight.
 

methos

New Member
Dec 19, 2013
20
0
1
Canton, OH
legen -

I just wrapped up a trip to a datacenter that "centered" around upgrading our SuperMicro box with 62 GB RAM and (4) Intel S3700 SSDs, and dropping in an LSI 9211-8i controller running the other (8) 600 GB SAS HS disks in the shelf. Installed OmniOS, and I am in the midst of configuring my mirrored rpool - remotely. Since I have (2) SSDs to spare, I may do what gea suggested and use one for the ZIL.
 

legen

Active Member
Mar 6, 2013
213
39
28
Sweden
It's included in OmniOS since the 151006 stable release (current is 151008 stable)
ReleaseNotes
Ah, my googling skills failed me here. Good thing I won't have to go bloody :)

legen -

I just wrapped up a trip to a datacenter that "centered" around upgrading our SuperMicro box with 62 GB RAM and (4) Intel S3700 SSDs, and dropping in an LSI 9211-8i controller running the other (8) 600 GB SAS HS disks in the shelf. Installed OmniOS, and I am in the midst of configuring my mirrored rpool - remotely. Since I have (2) SSDs to spare, I may do what gea suggested and use one for the ZIL.
How does that setup perform? I'm a little unsure which statement from gea you are referring to?
 

ColdCanuck

Member
Jul 23, 2013
38
3
8
Halifax NS
I would like to add a few points and amplify what others have said:

- Raidz acts at the speed of a single disk for random I/O loads; for sequential loads it approaches N, where N is the number of data disks in the vdev

- you can easily stripe Raidz vdevs (analogous to RAID60), which will give random I/O a boost. M stripes give an M-times improvement in both random and sequential performance

- a SLOG (a separate log device for the ZIL) will ONLY HELP SYNCHRONOUS writes. It does nothing for any other workload (a quick way to verify this on your own pool is sketched at the end of this post). That said, NFS uses synchronous writes by default.

- dedup has a lot of potential downsides, especially on spinning rust. These are usually not apparent to the new user and become much worse with age. There be dragons there; best to stay away unless you relish lots of "learning experiences". In any event, providing enough hard drives to get the performance usually means you have lots of capacity anyway.

- there is no substitute for testing under your own load. As others have said, you can get a feel for how "localized" your data is, how much ARC (RAM cache) to provide and whether an L2ARC would help.
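
To make the SLOG point concrete: a quick way to see the difference on your own pool is to compare buffered writes with O_DSYNC writes to a file on the dataset under test (a rough sketch; the path is a placeholder and the absolute numbers depend heavily on the log device):

Code:
# Compare buffered vs synchronous (O_DSYNC) 4K writes. Only the second
# number is affected by adding a fast SLOG.
import os, time

PATH = "/tank/test/syncbench.dat"   # file on the dataset under test
COUNT, BLOCK = 2000, b"\0" * 4096

def bench(extra_flags):
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | extra_flags, 0o600)
    start = time.perf_counter()
    for _ in range(COUNT):
        os.write(fd, BLOCK)
    os.close(fd)
    return COUNT / (time.perf_counter() - start)

print(f"buffered 4K writes: {bench(0):10.0f} ops/s")
print(f"O_DSYNC  4K writes: {bench(os.O_DSYNC):10.0f} ops/s")
os.unlink(PATH)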
 

dswartz

Active Member
Jul 14, 2011
610
79
28
I know vSphere as an NFS client forces sync writes; I haven't seen anything that claims all NFS clients do so by default?
 

legen

Active Member
Mar 6, 2013
213
39
28
Sweden
I would like to add a few points and amplify what others have said:

- Raidz acts at the speed of a single disk for random I/O loads; for sequential loads it approaches N, where N is the number of data disks in the vdev

- you can easily stripe Raidz vdevs (analogous to RAID60), which will give random I/O a boost. M stripes give an M-times improvement in both random and sequential performance

- a SLOG (a separate log device for the ZIL) will ONLY HELP SYNCHRONOUS writes. It does nothing for any other workload. That said, NFS uses synchronous writes by default.

- dedup has a lot of potential downsides, especially on spinning rust. These are usually not apparent to the new user and become much worse with age. There be dragons there; best to stay away unless you relish lots of "learning experiences". In any event, providing enough hard drives to get the performance usually means you have lots of capacity anyway.

- there is no substitute for testing under your own load. As others have said, you can get a feel for how "localized" your data is, how much ARC (RAM cache) to provide and whether an L2ARC would help.
What potential downsides are there with dedup? The dedup feature has been part of ZFS for a long time; can one not count it as stable? :)

Hopefully we can choose not to use dedup, instead of being forced into it by space demands.
 

dswartz

Active Member
Jul 14, 2011
610
79
28
dedup is stable in the sense of not being buggy and crashing. Nonetheless, there are serious performance ramifications that can make it problematic to use. To wit: the dedup tables can take a lot of storage. If you don't have enough RAM, they get written to disk. So when you go to delete a file or snapshot or whatever, ZFS may need to do dozens (or even hundreds) of disk reads/writes to delete a single file or snapshot. There are horror stories of people deleting a filesystem or snapshot and having it take (literally) days to complete. And the use cases for it are very limited. In any event, coldcanuck didn't say dedup is not stable, just problematic in terms of performance...
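
If you do want to gauge it up front, zdb can simulate dedup on data that is already in a pool and report the would-be dedup ratio plus a table histogram, which gives a feel for both the space savings and the DDT size (a sketch; "ssdpool" is a placeholder and the scan can take a while on a big pool):

Code:
# Simulate dedup on an existing pool without enabling it.
import subprocess

out = subprocess.check_output(["zdb", "-S", "ssdpool"], text=True)
print(out.splitlines()[-1])   # summary line with the simulated dedup ratio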
 

legen

Active Member
Mar 6, 2013
213
39
28
Sweden
dedup is stable in the sense of not being buggy and crashing. Nonetheless, there are serious performance ramifications that can make it problematic to use. To wit: the dedup tables can take a lot of storage. If you don't have enough RAM, they get written to disk. So when you go to delete a file or snapshot or whatever, ZFS may need to do dozens (or even hundreds) of disk reads/writes to delete a single file or snapshot. There are horror stories of people deleting a filesystem or snapshot and having it take (literally) days to complete. And the use cases for it are very limited. In any event, coldcanuck didn't say dedup is not stable, just problematic in terms of performance...
Ah yes, I read one of those over here: Dedupe – be careful! » ZFS Build. It looks like one must understand the mechanics in more detail to avoid these problems. RAM seems to be the common factor.