ZFS Advice for new Setup


humbleThC

Member
Nov 7, 2016
99
9
8
48
I'm going by what your most recent update is with your layout. That's all.

If you're using hardware raid why are the samsung drives in a vdev at all then?
Sorry if my Excel layout was confusing; I just threw it together... I'm not using any form of hardware RAID. (All my LSI adapters are flashed to IT mode.)

In my layout, I have the (4) Samsung Evo 850s in a 'performance' pool 1 = RAID1/0 [all by themselves]
- Two separate mirror vdevs for a RAID1/0 equivalent.

The (10) Hitachi HDDs are split up into two RAIDZ1 (4+1) groups = pool 0 = 'capacity'
- Again, two separate RAIDZ1 vdevs for a RAID5+0 equivalent.

And then there are (4) Intel S3710s, which I'm debating using as SLOG/L2ARC for the HDD pool; if that doesn't do anything substantial, I'll just build another separate RAID1/0 pool from the Intels and have (3) separate dedicated pools with like drives, evenly split across controllers for performance/redundancy.
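If it helps to see the layout above as commands, here is a minimal sketch; the pool names match my plan, but the illumos-style device names (c#t#d#) are placeholders, not my real controller targets.

Code:
# 'performance' pool: two 2-way mirrors of the Samsung 850 EVOs (RAID1/0 equivalent)
zpool create performance \
  mirror c1t0d0 c2t0d0 \
  mirror c1t1d0 c2t1d0

# 'capacity' pool: two RAIDZ1 (4+1) vdevs of the Hitachi HDDs (RAID5+0 equivalent)
zpool create capacity \
  raidz1 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 \
  raidz1 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0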

TBH I'm not sure I'll stay on ZFS if I can't use the Intels to accelerate the real-life workload of the HDDs, because that's part of the reason I'm testing it to begin with, but that's another discussion :)
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,625
2,043
113
You're going to see a big increase with the SSDs for SLOG, but not as big as using them separately -- so I guess it just depends on what you're after exactly :) and you'll find out in testing. BUT, be sure to test them properly so the SLOG devices actually come into play ;)

I don't think you're going to see it for L2ARC, but maybe; I don't know your workload or dataset size, etc...
 

gea

Well-Known Member
Dec 31, 2010
3,141
1,182
113
DE
TBH I'm not sure I'll stay on ZFS if I can't use the Intels to accelerate the real-life workload of the HDDs, because that's part of the reason I'm testing it to begin with, but that's another discussion :)
Acceleration of an HDD pool with L2ARC/Slog depends on the exact workload and your expectations.
Without them:

You have a huge RAM-based ARC read cache that is part of ZFS and uses nearly all free RAM. It caches not files but data blocks, on a last-accessed / most-accessed / prefetch basis. This will cache all metadata and small random reads, so nearly all your random reads are served from it; check arcstat to verify. If the cache hit rate is too low (say less than 60%), think about more RAM or add a slower L2ARC, but I am quite sure that in a single-user scenario this will not be the case.
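A quick way to check that hit rate on illumos/OmniOS is sketched below; the arcstat script name varies by release, and the /etc/system tunable at the end is an assumption based on current illumos naming, so treat it as a starting point rather than a recipe.

Code:
# Raw kernel counters for the ARC (hits, misses, ...)
kstat -p zfs:0:arcstats | egrep 'hits|misses'

# Or watch it live if the arcstat script is installed (arcstat or arcstat.pl)
arcstat 5

# Optional (assumed tunable name): let the L2ARC also cache prefetched/sequential
# reads, e.g. streamed media, by adding this line to /etc/system and rebooting:
#   set zfs:l2arc_noprefetch=0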

As the SSD-based L2ARC is much slower than the ARC, and as it consumes RAM to work, you want it only if your RAM is too small for your cache needs and you cannot add more (mainly on mainboards that are already maxed out), or if you want to cache a sequential workload (e.g. videos viewed/streamed by several users), which can be enabled for the L2ARC.

On writes, ZFS uses several gigabytes of fast RAM for caching and to turn small random writes into one large sequential write. In regular filer use, all writes go through this write cache and are very fast. An Slog will not improve this, as it is not used. A ZFS pool can reach several GByte/s without problem (40G network = around 4 GByte/s).

Only if you need transactionally safe write behaviour, similar to but better than a hardware RAID controller with its smaller and slower cache + BBU, do you use sync write. Since your regular pool is weak at the small random writes that result when every committed data block must be written immediately, you can massively increase sync write performance by logging these small writes not to your data pool but to a dedicated Slog, up to a level where you can keep up with the data coming in over your network. On a 1G network this means around 100 MB/s of regular cached writes plus sync write logging. This is far above anything reachable by a disk-based pool, even in a massive RAID-10 config, but no problem when you add a good SSD as an Slog. With 10G you need around 1 GB/s, with an Slog size of around 8 GByte. Even the best NVMe as an Slog cannot really reach this overall write value.
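As a rough sketch of how that looks at the CLI (pool, dataset and device names are placeholders):

Code:
# Mirrored Slog for the HDD pool
zpool add capacity log mirror c5t0d0 c5t1d0

# Force sync writes on the dataset under test, so the Slog is actually exercised
zfs set sync=always capacity/vmstore

# A plain async filer share keeps the default behaviour
zfs set sync=standard capacity/media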

If you use an SSD-only pool, your regular cached writes can be faster, and without a single-SSD Slog the logging can be spread over the whole pool, which means all writes are improved, sync or normal. On reads, the pool is faster than a single L2ARC device.

The result is:
- You can improve a disk-based pool within technical limits with RAM, L2ARC and Slog
- Real performance = SSD-only pools (far above any acceleration options for disk-based ones)

This is why everyone suggests two pools, one for capacity and one for performance.
As disk capacity has reached a physical limit while SSDs with 3D technology can keep advancing, in a few years everything will probably be SSD-only.
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
Received the (4) Intel S3710-400Gs. And sure enough, (1) of them was an S3610-400G instead...
A 3hr drive across 2 states and a successful exchange with the eBay vendor fixed that... (didn't want to wait 3-5 more days... GAH!)
Can't say I'm not determined (if nothing else!)

It's fair to say I totally understand the 'average utilization' aspect of my storage, where:
- Yes, I could use the Intel SSDs as cache/log for the HDD pool and get "some" improvement.
- No, it will never match the raw performance of the SSDs used independently in their own pool.
- Therefore, from both a capacity and a performance standpoint, it makes total sense to keep everything separate.

But again, my ultimate goal is to not care about the 1.6TB raw of Intel SSDs, but rather to use them in a purposeful way to accelerate the 40TB raw of HDDs (which will be 80-144TB eventually, ~3-5 yrs out). Assuming the SSDs are fast enough to nearly saturate my 32Gbit IB pipe (after IPoIB overhead), and subtracting more overhead for the final presentation protocol (iSCSI/NFS/SMB etc.), I'm thinking 2GByte/s is a good realistic target.

Again, assuming the SSDs can push 500MB/s each, and I'm starting with (4), I'd love to see a synthetic benchmark of the 'ideal' configuration saturate the bus, just to do it. If that means a RAID0 stripe with sync (ZIL) disabled, just for the sake of the benchmark, so be it. My goal is to determine exactly how adding the features of ZFS one by one affects performance. I.e., what does RAID1/0 look like versus RAID0, is it 1/2? Is that cause/effect expected, etc.? Then add sync back, without a dedicated SLOG, and test thoroughly again. Then add the SSDs as dedicated SLOG devices, one by one, with and without mirrors.
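A rough sketch of that step-by-step test matrix is below. Device and pool names are placeholders, each pass assumes the previous test pool was destroyed first, and sync=disabled is strictly for the throwaway benchmark pass, never for real data.

Code:
# Pass 1: pure 4-way stripe, sync disabled (synthetic upper bound only)
zpool create perftest c1t0d0 c2t0d0 c1t1d0 c2t1d0
zfs set sync=disabled perftest

# Pass 2: striped mirrors (RAID1/0 equivalent), default sync, same benchmarks
zpool destroy perftest
zpool create perftest mirror c1t0d0 c2t0d0 mirror c1t1d0 c2t1d0

# Pass 3: back on the HDD pool, forced sync writes without and then with a dedicated SLOG
zfs set sync=always capacity
zpool add capacity log c1t0d0                 # single SLOG device
zpool remove capacity c1t0d0                  # swap it out...
zpool add capacity log mirror c1t0d0 c2t0d0   # ...for a mirrored SLOG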

If/when I find a sweet spot where most of my active data is served from SSD, but I still get to leverage the entire HDD pool capacity, I'll be a very happy camper :)
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
Config #1 - 100% Separate Pools, per unique Disk Type.
2x RAIDz1 (4+1) on HDDs
2x Mirrors (1+1) on SSDs

upload_2017-1-12_13-27-0.png

Initial Benchmarks using Napp-It default built-in benchmarks
Bonnie++
upload_2017-1-12_13-27-37.png

Iozone
upload_2017-1-12_13-27-48.png

FileBench – (fileserver.f) – 60sec

upload_2017-1-12_13-27-53.png

Just messing around.
 

whitey

Moderator
Jun 30, 2014
2,766
868
113
41
Hope you have 12Gbps HBAs and are planning on RAID-0 to get anywhere close to that. Not running the numbers, just looking at it, I don't think you're gonna get there with synthetic testing, let alone real world. Ready to see some numbers for sure!

The image is small and my eyes suck apparently, so if you've already achieved this, congrats; if not, it's back to the drawing board (and open up that pocketbook a bit more) or a 'good nuff' scenario.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,625
2,043
113
To reiterate:

- SLOG helps on sync writes; it's not for caching anything, and it ONLY helps on sync writes. To test its performance, be sure to set sync=always (see the sketch after this list).
- ARC (adaptive replacement cache) is in RAM - This will accelerate READing. Add as much RAM as possible.
- L2ARC - If you can't add more RAM, or your workload could benefit from TBs of cache (and you have the RAM to index it), L2ARC may be for you. Keep in mind it will consume some of your RAM, which is why you should add more RAM before adding L2ARC.
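A minimal sketch of how to confirm the SLOG actually comes into play during a test; the pool and dataset names are placeholders.

Code:
# Force sync writes on the test dataset so the SLOG is exercised
zfs set sync=always capacity/bench

# Watch per-vdev activity while the benchmark runs; the log device should show
# write traffic. If it stays idle, the test isn't issuing sync writes at all.
zpool iostat -v capacity 5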

2GB/s from 4x SSDs is an unrealistic target even in RAID0 on bare metal during continuous real-life workloads, 99.99% of the time. You're talking about getting the absolute maximum performance those 4x SSDs are rated for by the manufacturer during a sequential, non-random, non-mixed workload. That is unrealistic in pretty much all use cases, and certainly in all ZFS workloads.

And that's the best case. Now assume you're using them for SLOG and L2ARC, with a mixed workload doing sequential, random, read, write, etc.... Give them a few hours or days, depending on workload, to settle into steady state and watch it drop even more. There are boatloads of information on this on this forum and others, and I'm not going to repeat it again ;) but just know that what you're wanting to squeeze out of those 4x SSDs is simply not possible, at all.

Load that thing up with 256GB DDR3 if you want more read cache, and go NVMe for SLOG and a 12Gb/s HBA for the SSDs if you want the absolute best performance for the buck right now! In fact, just buy my 24x SLC SAS SSD JBOD for your 'speedy' storage, and let me finish upgrading to all SAS3 SSDs, ha ha. :)
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
To reiterate:

- SLOG helps on sync writes; it's not for caching anything, and it ONLY helps on sync writes. To test its performance, be sure to set sync=always.
- ARC (adaptive replacement cache) is in RAM - This will accelerate READing. Add as much RAM as possible.
- L2ARC - If you can't add more RAM, or your workload could benefit from TBs of cache (and you have the RAM to index it), L2ARC may be for you. Keep in mind it will consume some of your RAM, which is why you should add more RAM before adding L2ARC.

2GB/s from 4x SSDs is an unrealistic target even in RAID0 on bare metal during continuous real-life workloads, 99.99% of the time. You're talking about getting the absolute maximum performance those 4x SSDs are rated for by the manufacturer during a sequential, non-random, non-mixed workload. That is unrealistic in pretty much all use cases, and certainly in all ZFS workloads.

And that's the best case. Now assume you're using them for SLOG and L2ARC, with a mixed workload doing sequential, random, read, write, etc.... Give them a few hours or days, depending on workload, to settle into steady state and watch it drop even more. There are boatloads of information on this on this forum and others, and I'm not going to repeat it again ;) but just know that what you're wanting to squeeze out of those 4x SSDs is simply not possible, at all.

Load that thing up with 256GB DDR3 if you want more read cache, and go NVMe for SLOG and a 12Gb/s HBA for the SSDs if you want the absolute best performance for the buck right now! In fact, just buy my 24x SLC SAS SSD JBOD for your 'speedy' storage, and let me finish upgrading to all SAS3 SSDs, ha ha. :)
Hehe, I hear ya. And I really do appreciate all the amazing advice this forum/thread has provided.

I'm sure you are all right, and when my benchmarking is complete, I will be disappointed (only because of my own false assumptions about how ZFS works), but educated nonetheless.
 

whitey

Moderator
Jun 30, 2014
2,766
868
113
41
You'll still have a solid ZFS system; you just may need a CPU upgrade/memory bump over the lifecycle of your dataset, especially at the scale you're talking about.
 

gea

Well-Known Member
Dec 31, 2010
3,141
1,182
113
DE
'ZFS is a next-gen filesystem' or 'ZFS is the last word in filesystems'.
You may have heard these sentences, and they may be the reason to try ZFS. They are really true if you look at data security or at how ZFS can handle huge petabyte arrays. But they are completely wrong regarding raw performance: ZFS is NOT a high-performance filesystem (the fastest is ZFS on Oracle Solaris). Compare it to an old FAT filesystem, where you can write a video to an empty disk and the data just fills the disk track by track; this is fast when you re-read the data. A copy-on-write filesystem spreads data over the disk, and you fall back to a more IOPS-limited workload. Read 'How slow is ZFS with low RAM or readcache disabled and slow disks?' and you see the consequences: 3 MB/s read without cache on an old WD Green, up to 300 MB/s on a very fast SSD.

All the funky and really sophisticated cache and log technologies in ZFS are there to give performance despite the data security you want and need, but that has its price. A simple low-RAM ZFS NAS is slower than one with FAT32 or ext4, but it can be faster with all the options that ZFS provides.
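If you want to reproduce that cache/no-cache comparison on your own pool, a minimal sketch (dataset name is a placeholder):

Code:
# Bypass the ARC for one test dataset to see raw, uncached read performance
zfs set primarycache=none capacity/bench
# ...re-run the read benchmark, then restore normal caching...
zfs set primarycache=all capacity/bench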
 

Rand__

Well-Known Member
Mar 6, 2014
6,626
1,767
113
Just to give you an impression (a sneak preview from my current tests; sequential read):
10 slow disks (Toshiba 3TB) in a 5x2 mirror, no Slog (NFS-attached VM): 5.5 MB/s
10 slow disks (Toshiba 3TB) in a 5x2 mirror, S3700/200GB Slog (NFS-attached VM): 231.45 MB/s
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
Finished the 1st round of benchmarks against each of the HDDs and SSDs in separate pools to get a baseline.

Just started running benchmarks (against the pool itself) in my "hypothetical best case scenario of a design" where:
(4) Intel S3710 DC 400GBs, carved up into partitions (95% = 376GB / 5% = 16GB)
Using the 1st partition of each SSD as L2ARC (376GB x 4 = 1.5TB of striped read cache)
Using the 2nd partition of each SSD as ZIL/SLOG (16GB x 4 = 32GB usable as mirrored pairs / RAID1/0)
Too early to tell for sure, but it looks like the SSDs are actually slowing down the HDDs in this use case. (You were right, gea; not that I ever doubted you or anyone, I just had to see it for myself)...
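For reference, attaching the partitions described above looks roughly like this; the slice names are hypothetical (s0 = large L2ARC slice, s1 = small SLOG slice).

Code:
# L2ARC: the large slice of each SSD (cache devices are always striped)
zpool add capacity cache c5t0d0s0 c5t1d0s0 c5t2d0s0 c5t3d0s0

# SLOG: the small slice of each SSD as two mirrored log vdevs
zpool add capacity log mirror c5t0d0s1 c5t1d0s1 mirror c5t2d0s1 c5t3d0s1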

For any workload the HDDs already preferred (most of them), the SSDs had little to no effect, or even a slightly negative effect.
For the small random sync writes, it helped, but only to about 1/2 to 1/4 of the native speed of running the SSDs directly as their own pool.

I'm still compiling my initial results, and haven't even tested remote benchmarking yet, so a lot more to come soon™

Part of what I need to do is learn how to benchmark ZFS properly, so I'm getting valid results, and understanding those results would be nice too. Are the built-in napp-it benchmarks considered valid, and does it come with all the internal benchmark tools required?
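For running the same tools outside the napp-it GUI, here is a hedged sketch; paths and sizes are placeholders, and the exact flags napp-it itself uses may differ.

Code:
# iozone throughput run: 4 threads, 8GB files, 128k records, write/rewrite (-i 0)
# and read/reread (-i 1), with fsync included in the timing (-e)
iozone -e -i 0 -i 1 -t 4 -s 8g -r 128k \
  -F /capacity/bench/f1 /capacity/bench/f2 /capacity/bench/f3 /capacity/bench/f4

# bonnie++ with a working set well above RAM size (value in MB) so the ARC can't hide the disks
bonnie++ -d /capacity/bench -s 65536 -u root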

upload_2017-1-12_17-15-2.png
FileBench – (fileserver.f) – 60sec
2980004 ops, 49662.460 ops/s, (4515/9030 r/w), 1202.7mb/s, 359us cpu/op, 3.4ms latency

Compare to the Hitachi HDD alone results:

FileBench – (fileserver.f) – 60sec
3377475 ops, 56286.788 ops/s, (5117/10234 r/w), 1364.2mb/s, 415us cpu/op, 3.0ms latency
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,625
2,043
113
You should be doing the test how you plan to utilize the system.

I.e., if you're going to be using it for hosting a handful of VMs, then simulate that usage for a few hours. Or if you're going to use it from one system only over the network, simulate that, etc...
 

mpogr

Active Member
Jul 14, 2016
115
95
28
53
Just to stress what @T_Minus mentioned, here's an example of the real world use. I have 4 VMware Datastores:
1. Main SSD (2 x Intel S3610 400GB in RAID 1) for performance-demanding VMs.
2. Main HDD (6 x WD RE 4TB in RAID-10 + a hot spare, 16GB SLOG and 240GB L2ARC) for all other VMs.
These are on a bare metal Xeon 1270 v2 box with 32GB of ECC RAM running CentOS + ZoL. Both Datastores are exposed to ESXis via SRP over FDR Infiniband (primary path) and iSCSI over 2 x 1 Gbps Ethernet (secondary path).
3. First backup (6 x WD Red 3TB in RAID-10 + cold spare, 16GB SLOG and 240GB L2ARC) - Solaris 11.3 VM with 4 vCPUs and 8GB of RAM on an ESXi host 1.
4. Second backup (6x WD Red 3TB in RAIDZ2, 16GB SLOG and 240GB L2ARC) - Solaris 11.3 VM with 4 vCPUs and 8GB of RAM on ESXi host 2.
Both of these are exposed to ESXis via iSCSI over FDR IPoIB (primary path) and over 2 x 1 Gbps Ethernet (secondary path).

Last night, I decided to move (via Storage VMotion) a bunch of VM mirrored drives from Datastore #4 to Datastore #3 above. BTW, the reason for the original evacuation of these drives was memory failure of the #3 underlying physical host and the need to take it offline for several days (something you should always keep in mind can happen).

While I was running this VMotion, the typical data transfer rate was ~50MB/s. Is it good or bad? It looks quite slow, right?

In fact, at the same time, those VMs I was transferring were live and very responsive. I kept working with them during the VMotion process without noticing any slowdown. Also, both #3 and #4 have CIFS servers which were being used for live video streaming and torrent activity while the VMotion was being carried out (I have transmission-daemon running directly on the Solaris OS serving #4). Everyone was happy, and the VMotion finished in due course without causing any major trouble for anyone else using the same pretty complex environment.

I think this is what you should be focusing on rather than trying to squeeze the last bit of performance running artificial benchmarks...
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
True, I'm just a ZFS noob, so I'm starting with the one thing I can control: synthetic benchmarks. Plus, a baseline of how the configurations differ from each other will give me some insight into the overhead/functionality of ZFS itself. Also, I'll have an impossible time understanding the impact of the various protocols with and without RDMA/SRP if I don't have a baseline for what the pools/filesystems can do locally. I very much plan to put ZFS through its full paces before I go production, or move on to another storage-focused OS.

My 'final real-life use case' benchmarks will include:
1) CIFS directly from my Windows 10 primary workstation. This will give me best-of-breed SMB3/RDMA support on the client side.
I'll map a drive and use CrystalMark, IOmeter, etc.

2) ESX datastore mapping to NFS/iSCSI
Windows VM local disk benchmarks
On my Win10 desktop I have 48GB RAM; using SoftPerfect RAM Disk, I carved out a 20GB RAM drive on DDR3-1600 (9-9-9-10).
This should saturate the PCI bus with 5-6GB/s of sustained bandwidth (local tests against the RAM drive confirm ~5800 MB/s write / ~6800 MB/s read).
Initial Windows File Explorer copy tests look like:
550 MB/s write
750 MB/s read
against a large-block sequential workload over CIFS, i.e. copying 4x 4GB movies (16GB transfer size).

Here are the default out-of-the-box results for CIFS against the Hitachi pool with the Intel S3710 SSDs as L2ARC/SLOG.
upload_2017-1-13_8-27-36.png

None of the numbers are 4 digits, so I'm not terribly excited yet :(
These are just my rough-draft, barely-started testing results, so don't shoot the messenger; I plan to get more serious and scientific this morning :)
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
Does anyone know if Solarish-compatible drivers exist for:
Amazon.com: SEDNA - PCI Express 4 Port ( 4E ) USB 3.0 Adapter - With Low Profile Bracket - (NEC / Renesas uPD720201 chipset): Computers & Accessories

I grabbed one for the SuperMicro NAS to add USB 3.0 support to the server, because I wanted to move my Seagate 8TB USB drives from my main desktop over to the NAS directly. It doesn't look like OmniOS comes with drivers for that particular USB controller, and that's OK. Just curious if it's possible to add them.

(It was cheap, and I can use it elsewhere, so really no biggy)
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
Looks like if I want SMB3 with RDMA and multi-channel support for CIFS, I'm stuck using a Windows 2012/2016-based server.
Improve Performance of a File Server with SMB Direct

Which is why my CIFS connection right now between Windows 10 and OmniOS will never be as fast as it was on Windows Server :( or have any chance of saturating the IB pipe.

It's not a deal breaker, just an observation. 600MB/s isn't terrible for single-socket transfers, and I care more about how well multiple streams perform than about a single high-bandwidth stream from my desktop. But I was getting 1.2GB/s via CIFS in my previous Win2016 Storage Spaces implementation, with (2) Samsung Evo 850 SSDs as write cache in front of 5x of the Hitachis, when extracting a 2GB+ RAR across the network, for example.

Knowing that Windows does CIFS better than most, NFS worse than everyone, and iSCSI OK,
the real question is going to be: does OmniOS/ZFS do NFS/iSCSI so much better that it's worth taking the CIFS hit?
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,625
2,043
113
You should define what you're going after when you say "so much better".

ZFS does a lot of things "better" but you have a performance penalty... if data integrity is important ZFS is most def. "better".
 

gea

Well-Known Member
Dec 31, 2010
3,141
1,182
113
DE
For SMB and >=10G you need some tuning on Solarish and Windows, as the base settings are optimized for 1G; see
http://napp-it.org/doc/downloads/performance_smb2.pdf

Currently Solarish supports SMB 2.1, with the Solaris kernel-based SMB server on Oracle Solaris a little faster than the free Open-ZFS forks around Illumos. Nexenta adds SMB 3 to its commercial storage server; it may take some time until they upstream this to Illumos.

If you really need SMB 3 features, Windows may be faster and more stable than SAMBA with SMB3, or faster than the Solaris SMB 2.1 server regarding file sharing, but I have doubts regarding its storage and ReFS.

Regarding USB3:
this is included in Oracle Solaris and nearly ready in Illumos/OmniOS.

I always prefer NFS over iSCSI for ESXi. Similar performance with the corresponding sync/writeback setting, and much easier to handle for copy/move/clone/backup, as you can use SMB and Windows 'Previous Versions' for VM management.
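A minimal sketch of that NFS path; the dataset, IP address and datastore names are placeholders.

Code:
# OmniOS side: a filesystem for ESXi, shared over NFS
zfs create capacity/vmstore
zfs set sharenfs=on capacity/vmstore
# the default sync=standard already honours ESXi's sync writes; an Slog pays off here
zfs get sync capacity/vmstore

# ESXi side: mount it as a datastore
esxcli storage nfs add -H 192.168.1.10 -s /capacity/vmstore -v zfs-vmstore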
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
For SMB and >=10G you need some tuning on Solarish and Windows, as the base settings are optimized for 1G; see
http://napp-it.org/doc/downloads/performance_smb2.pdf

Currently Solarish supports SMB 2.1, with the Solaris kernel-based SMB server on Oracle Solaris a little faster than the free Open-ZFS forks around Illumos. Nexenta adds SMB 3 to its commercial storage server; it may take some time until they upstream this to Illumos.

If you really need SMB 3 features, Windows may be faster and more stable than SAMBA with SMB3, or faster than the Solaris SMB 2.1 server regarding file sharing, but I have doubts regarding its storage and ReFS.

Regarding USB3:
this is included in Oracle Solaris and nearly ready in Illumos/OmniOS.

I always prefer NFS over iSCSI for ESXi. Similar performance with the corresponding sync/writeback setting, and much easier to handle for copy/move/clone/backup, as you can use SMB and Windows 'Previous Versions' for VM management.
Exactly what I needed for SMB 2.1 tuning, thanks!

It's exciting to hear that SMB3 might be coming to OmniOS in the future. It's not a deal breaker for me that the CIFS portion (which would generally only be used for async, large sequential work) is somewhat limited by SMB 2.1 (i.e. the single-threaded workload of me extracting/moving incoming FTP/torrent data around from my desktop via the NAS: photo archive, movies/TV, ISOs, etc.).

I'll 'just wait longer' on the USB3 as well, keep the PCI card in here, and just not put any devices on it until it's driver-qualified.

I always prefer NFS over iSCSI for ESXi as well. If your "single volume / single share" presentation is fast enough for everyone, all the time, then there is a benefit in sharing the free space. Additionally, from a storage-efficiency perspective, it would allow compression/dedupe to leverage all the block space. (That's kinda why I started on Windows 2012/2016 and tried CIFS/NFS from the same 'jumbo pool / jumbo file system'. When NFS puked, I reverted to iSCSI and it was OK, only because I was able to get stable SRP drivers from Windows to ESX.)

What was interesting over iSCSI is that I was seeing about 500MB/s max read/write per VM per datastore. So I just carved out (8) 1TB thin iSCSI datastores and used Storage DRS to balance the VMs based on IO requirements. This effectively gave me (8) separate IO streams to work with and brought the overall bandwidth up from 500MB/s to 1-2GB/s.

The same thing can apply to SMB 2.1 vs SMB3 with RDMA.
SMB3 with RDMA is nice because it multi-channels IO, even on a single IO stream, thus usually maximizing your network pipe for all workloads.
SMB 2.1 doesn't have that, so instead of using Windows copy, use Robocopy with as many threads as you want as a sort of band-aid to get the overall bandwidth you have available (assuming you are copying more than one file).

The question comes back to this: with ZFS and NFS, if you do a single mount point/datastore, you may only get a single disk queue's worth of bandwidth out of the network pipe (without substantial NFS/TCP buffer tweaking).

However, I suppose I could create several filesystems on the ZFS pool, mount them as separate datastores in ESX, and still use Storage DRS, as sketched below. (And my 30TB of free space will be seen 8 times, so it'll appear my ESX lab has 240TB.) :)
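A minimal sketch of that multi-datastore idea; the pool and filesystem names and the count are placeholders, and each filesystem would be added in ESXi as its own NFS datastore so Storage DRS can spread the IO.

Code:
# Eight child filesystems, each exported over NFS as its own datastore
for n in 1 2 3 4 5 6 7 8; do
  zfs create capacity/vmds$n
  zfs set sharenfs=on capacity/vmds$n
done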

DOH, looks like I need to purchase the Pro license if I want to auto-tune, as the tuning guide points to a module that no longer looks free and is now called 'advanced'. I can get the values out of the screenshot and apply them manually (I think).
 