High performance AIO

ehfortin

Member
Nov 1, 2015
56
5
8
50
I'm currently evaluating the possibility of creating a NAS/SAN (iSCSI/NFS) to share between 3 lab servers (all HP ML310e Gen8 v2) with an upcoming 10 Gbps network (ConnectX-2 NICs connected to a shared D-Link DGS-1510-28X). As I would like to have some services "always-on", my goal is to keep VMware on the 4th server and run the NAS/SAN in a VM beside the virtual FW and AD. This "storage unit" would only be for VMware as I already have some slow storage for everything else.

So, my first idea was to use ZFS, as it allows the use of flash to increase read and write speed (L2ARC and ZIL), or to go all-flash with just about any offering that does iSCSI and NFS. Before even getting there, I wanted to do some testing on a single Samsung 850 EVO in various contexts. I have read numerous times that disk performance is not affected much by the hypervisor, but I wanted to confirm how my setup handles this. I started by finding a test that I could easily reproduce with this disk. I found a test on benchmarkreviews.com that was done with IOMeter on various disks, and the 850 EVO was there, so I decided to compare what they got (Kingston HyperX Savage SSD Benchmark Performance Review) with what I'm seeing.

On their testing with 100% random, 50/50 read/write, 4KB, 32QD, they are seeing about 86K IOPS on their Asus motherboard and i7-2600 CPU with Windows 7.

I downloaded their exact IOMeter .icf file and ran it on my ML310e, which has 18GB of RAM, an i3-4150 and an HP H220 HBA (LSI 2308 in IT mode), on Windows 2016 TP3. I'm getting just a little short of 42K IOPS after 120 seconds (same test as theirs). I've done numerous tests with this 850 EVO in the last few weeks, so it may not be in an optimal state, but it is still kind of slow at half the speed.

Second step was to try this in a VM. I booted the exact same Windows install in VMware (RDM is nice for this) and tried adding the SSD both as a VMDK and as an RDM. In both cases I'm getting the same performance of about 26,500 IOPS for the same test. I can confirm that VMDK and RDM seem to provide the same level of performance, but I can't say the same when comparing the same test in a VM vs. on native hardware: 42K vs 26.5K IOPS is not the same. Latency and CPU usage also increase in a VM (1.2 ms vs 0.7 ms, and 52% CPU vs 21% on native hardware). BTW, I'm on ESXi 6U1 and the H220 is officially supported by VMware and HP.

Is there any tuning to do at the ESXi level to maximize the performance we can get from an SSD and a good HBA? I'm planning on installing the H220 in one of the ML310e servers that has a Xeon E3-1230v3, to pass the HBA through to the OS and see the impact, but I didn't want to dedicate a small Xeon to that task. I've seen QNAP and Synology units that are able to nearly saturate a 10 Gbps link while using an Intel Atom or Celeron, so there's certainly something that can be done with an i3.

Any idea or comment? Don't hesitate to ask if you need more details. Thank you.

ehfortin
 

gea

Well-Known Member
Dec 31, 2010
2,585
878
113
DE
Your tuning options are, for example:
- use an HBA in pass-through mode (real hardware access) instead of RDM or vmfs
- reduce latency for your storage VM, https://www.vmware.com/files/pdf/techpaper/VMW-Tuning-Latency-Sensitive-Workloads.pdf

I doubt that the 850 Evo is a solid base for a high-performance solution compared with enterprise SSDs, and I do not believe those high IOPS values. Even with the best enterprise SSDs, like an Intel S3700, you will hardly achieve more than 40,000 IOPS under constant write load.
 

ehfortin

Your tuning options are, for example:
- use an HBA in pass-through mode (real hardware access) instead of RDM or vmfs
- reduce latency for your storage VM, https://www.vmware.com/files/pdf/techpaper/VMW-Tuning-Latency-Sensitive-Workloads.pdf

I doubt that the 850 Evo is a solid base for a high-performance solution compared with enterprise SSDs, and I do not believe those high IOPS values. Even with the best enterprise SSDs, like an Intel S3700, you will hardly achieve more than 40,000 IOPS under constant write load.
I'll read the paper you suggested. The 850 EVO is not my first choice, but I already have a few of them and I think they allow me to do some valid testing of the various possible setups, as long as I keep running the same test each time. It may not be the best drive, but it still provides some insight. I plan to do the same test from a remote VM over an iSCSI target once the 10 Gbps network is in place, to get an idea of the impact of this setup on the same storage.

Based on your experience, would I gain anything using all-flash (like 2 or 4 SSDs), or would it provide about the same performance as, let's say, 4x 10K SAS plus a good ZIL and a good-sized L2ARC? I understand that a ZIL is only useful for synchronous writes, so if I serve NFS to the ESX hosts, my understanding is that most of the writes will be sync, so the ZIL would do its magic, right?
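To make that concrete, here is a minimal sketch of how sync behavior and a dedicated log device are usually handled under ZFS; the pool name `tank`, dataset `tank/vmstore`, and device names are all hypothetical:

```shell
# ZFS controls sync behavior per dataset via the "sync" property
# ("standard" by default: honor the client's sync requests). ESXi's
# NFS client requests sync writes, which is where an SLOG helps.
zfs get sync tank/vmstore

# Attach a mirrored SLOG pair so sync writes land on fast flash
# instead of the on-pool ZIL (device names are placeholders):
zpool add tank log mirror c1t4d0 c1t5d0
zpool status tank    # the new devices appear under a "logs" section
```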

Thank you for your comments.

ehfortin
 

ehfortin

Very interesting. I left the server alone for a few hours and restarted the test. On the first iteration, it finished the two-minute test at 46K IOPS. I restarted it and it ended at 36K IOPS. So, from what I can see, this confirms a review I saw a few days ago showing that consumer SSDs are all over the place (performance varies a lot during testing), while the enterprise SSDs they tested were pretty much stable for the duration of the test (I think they tested for a few hours).

I would say that using this kind of disk in a home lab should give relatively good performance, as it will handle IO bursts easily. However, if somebody has a task (or combined tasks) that runs for more than a few minutes, it seems it will get slower and slower up to a point. As mentioned in the previous post, if it stabilizes at 10K IOPS, that's still a lot faster than a few SATA drives in any type of RAID.

I'll plan on moving the HBA to a server that can do VT-d to see if I get better results. Based on current results, after leaving the SSD alone for some time, I'm now getting the same numbers as on raw hardware, so I expect I should get similar results there. I'll let you know, for those who may be interested.

ehfortin
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,090
1,616
113
CA
If you're building an "AIO" you're not going to have it doing nothing though, right? You're going to have numerous VMs always doing something, which means the drive will always have read and write tasks going, and never "idle" time to clean-up/garbage collect... obviously this is dependent upon your workloads of the VMs but I don't think loading up 1 VM and doing a benchmark is really the same as running an "all in one" with numerous VMs.

That 100% read or write workload, even at lower IOPS, would be a "best case", since the VMs would be reading and writing at the same time, which could degrade performance to the point that it performs LESS than an array of spinning hard drives.

Why not load up IOMeter in a couple VMs that you plan to run and run some tests and in another VM benchmark the perf. you get? Just an idea, not exactly sure how to set this up to be as 'real world' as possible ;)
 

ehfortin

100% with you on the first paragraph. I was just saying that adding a disk like the 850 EVO would help for light usage. I don't know how people are using their home lab but I assume it may be enough for some.

I can easily do some testing with numerous VMs, either manually by starting IOMeter in each VM or by having all workers under the control of the same IOMeter host. Then I can apply various loads to each VM and see how the central storage host handles that. I'll think about a scenario and let you know how it goes with the same disk.

For the long run, I'm looking at newer SSDs like the Intel S3610 and other similar drives. Being on a budget, I can't justify something like the S3710, which is faster and offers more endurance, but at a price. I'm sure endurance is not an issue anyway between the two. The S3510 would probably be fine for endurance as well, but it is just too slow for writes, being optimized for reads. I can compromise, but only up to a point. I still want a relatively high-performance storage server that can keep up with the VMs running on a total of 4 hosts. That's not that much, but it's still 4x what one host would usually do.
 

ehfortin

The S3610 is also the SSD that I prefer with my current setups, as a good compromise between high quality and price.
Do you create all-flash pools, or do you always use those as ZIL and L2ARC under ZFS? I think it could be interesting to put 4x 300 GB 10K SAS in RAID10 with a mirror of S3610s, partitioned with 16GB or so for the ZIL and the remainder as L2ARC. Any comment? I know it would be preferable to dedicate an SSD to each task, but budget is not unlimited. I've not been able to find the 100GB S3610, so using 2x 200 GB just for the ZIL is kind of expensive.
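A sketch of that layout in ZFS commands, with hypothetical pool and slice names (on Solaris-family systems the slices would be carved out with format/parted first):

```shell
# 4x 300 GB 10K SAS as two mirrored pairs (RAID10 equivalent):
zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0

# ~16 GB slice of each S3610 as a mirrored SLOG:
zpool add tank log mirror c2t0d0s0 c2t1d0s0

# Remaining slices as L2ARC; cache devices cannot be mirrored,
# so the two slices are simply used side by side:
zpool add tank cache c2t0d0s1 c2t1d0s1
```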
 

gea

I use them for flash-only pools without a dedicated SLOG, as the S3610 has power-loss protection and is as fast or faster with the on-pool ZIL than with a dedicated SLOG on a single additional S3610.

Using 10K SAS disks is really no alternative.
A single 10K disk delivers about 100 IOPS, while the S3610 sustains up to 40,000 IOPS even under load. You would need a RAID-0 array of 400 SAS disks to achieve the same IOPS.
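The 400-disk figure follows directly from the numbers above:

```shell
# Back-of-envelope: how many ~100 IOPS 10K SAS disks match one
# S3610 at ~40,000 IOPS (figures taken from the post above)?
SSD_IOPS=40000
SAS_IOPS=100
echo "$((SSD_IOPS / SAS_IOPS)) disks"   # prints "400 disks"
```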

Using your 200GB drives as a mirror, or a RAID-Z(1-3) with some more SSDs, seems more attractive for good VM performance.
 

ehfortin

Using 10K SAS disks is really no alternative.
A single 10K disk delivers about 100 IOPS, while the S3610 sustains up to 40,000 IOPS even under load. You would need a RAID-0 array of 400 SAS disks to achieve the same IOPS.

Using your 200GB drives as a mirror, or a RAID-Z(1-3) with some more SSDs, seems more attractive for good VM performance.
I'm well aware that SAS doesn't compare to SSDs. However, as ZFS and numerous commercial storage units combine capacity drives (SAS, NL-SAS) and SSDs to achieve performance at a lower cost, I was trying to figure out whether that was a plausible combination at a smaller scale.

Do you use ZFS deduplication? I know it requires more memory, but as I only plan to have VMs on this pool, if I go with a mirror of, let's say, 2x 400/480 GB S3610s, dedup would only use a few GB of RAM and would probably allow me to fit 3-4 times the number of VMs in the same physical space. This sounds interesting and could alleviate the need to mix SAS/NL-SAS with SSDs, or to buy many SSDs, both options being expensive in the end.
 

T_Minus

2x SSDs is not a pool of mirrors; it's basically similar to hardware RAID1.

You should really start at 4 drives / 2 mirrored vdevs in the pool :)

Also, I think the reason commercial setups get away with 24 SAS drives plus SSD or NVMe cache drives is that they likely also have very large amounts of RAM. This is just my thought though.
 

gea

Well-Known Member
Dec 31, 2010
2,585
878
113
DE
About dedup:
With large pools I would avoid it in any case, because the RAM requirement is too high and you want that RAM as a read cache for performance. But if you build a pool from a single mirror of, say, 2x 480 GB SSDs with similar VMs on the datastore, it may be a good option, as the extra RAM requirement of about 2GB is quite low.
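The ~2GB figure can be sanity-checked with a common rule of thumb (roughly 320 bytes of dedup table per unique block; the ~64 KiB average block size is an assumption that varies by workload):

```shell
# Rough dedup-table (DDT) RAM estimate for ~480 GB of unique data:
POOL_BYTES=$((480 * 1024 * 1024 * 1024))
BLOCK_SIZE=65536                      # assumed ~64 KiB average block
DDT_ENTRY=320                         # rule-of-thumb bytes per DDT entry
BLOCKS=$((POOL_BYTES / BLOCK_SIZE))
echo "$((BLOCKS * DDT_ENTRY / 1048576)) MB"   # ~2400 MB, in line with the ~2GB above
```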

A RAID-10 gives 960 GB and doubles the IOPS of a mirror, but as you are already at 40,000 IOPS this seems unnecessary IOPS-wise; it does, however, also double sequential throughput. A RAID-Z1 of 4 disks increases sequential performance as well, with a capacity of 1440 GB, so it is attractive from a cost perspective. From a safety perspective, RAID-Z2 is the next step.

Many options, each with its own advantages.
The only option with nearly no advantages, if you buy new, is 10K or 15K SAS disks, as they are expensive as well.
 

ehfortin

About dedup:
With large pools I would avoid it in any case, because the RAM requirement is too high and you want that RAM as a read cache for performance. But if you build a pool from a single mirror of, say, 2x 480 GB SSDs with similar VMs on the datastore, it may be a good option, as the extra RAM requirement of about 2GB is quite low.

Many options, each with its own advantages.
The only option with nearly no advantages, if you buy new, is 10K or 15K SAS disks, as they are expensive as well.
I already have 3x 10K SAS, so I was trying to see if I could use them in the setup for some time, combined with the 850 EVO as SLOG and L2ARC. I'll test that performance on native hardware, but as you said, if I'm going to buy disks, it's probably wiser to go with the S3610 all the way and forget about the SAS.

I've done more testing with OmniOS in a VM and I'm getting the same kind of performance as with Windows 2012 R2 (in a VM as well). I just installed Solaris 11.3 on native hardware and will do some additional tests from there and see how it goes. If performance is fine, I may install VirtualBox on it to host the "always-on" services that I need. It won't be as flexible as having everything under VMware, as I won't be able to move around the VMs that reside on the storage server, but I should be able to live with it.

I'll let you know once I have some results with Solaris 11.3.

Thank you.
 

ehfortin

I did numerous other tests and nothing is conclusive at this moment. Performance is very much all over the place, and certainly not aligned with what @gea published with napp-it. I'll align my setup with what was published and see what kind of results I get compared to that document. I'm still awaiting my 10 Gbps switch and my SFP+ DAC cables, so testing on 1 Gbps is not the best thing when playing with SSDs. I will restart this once I get everything in place.

Talking about dedup, I'm having an issue with both ZFS and Starwind. I create an iSCSI LUN on each and put identical VMs on them. From the storage point of view, I'm using less than what a single VM requires. However, from the vCenter point of view, I'm using the actual capacity of "x" VMs, and if I clone or migrate something there, it will stop at 0 bytes free even if the backend is far from full. On Starwind, I can create a virtual disk that is bigger than the actual space, so I can guesstimate what I should get with dedup and start from there; however, the bigger the virtual disk, the more memory is reserved. On ZFS, I have to create a volume to assign to iSCSI, and from the ZFS point of view a volume is reserved space, so it actually reserves physical capacity. So I can't create a volume bigger than the pool, which translates into vCenter not being able to use more space than what was reserved. Still, the backend is far from full.

How can we give some space to vCenter and have it behave according to what ZFS or Starwind reports, not what vCenter thinks it should be? I figure it could probably work with NFS, but up to now my tests show that an NFS datastore is killing my SSD zpool, while using the same pool over iSCSI is a lot "lighter" and gives better performance. In a sense this was expected, as NFS does mostly sync writes, but since everybody suggests putting a quick SSD as SLOG with ZFS, I was expecting the SSD to help NFS a lot, which is far from what I'm seeing on my limited SSD pool (still only a single 850 EVO up to now).

Is there any way to make vCenter work with, and benefit from, a volume that has deduplication, instead of treating it as a standard volume?

Thanks.
 

ehfortin

Member
Nov 1, 2015
56
5
8
50
To summarize my tests up to now: either you create a volume (works with ZFS as well as Starwind) with a size larger than the actual physical capacity, and hope for the best as dedup/compression kicks in so that you won't run out of space; or, if you don't want to get into that, you use NFS, which will present the remaining free space as it is, whatever the dedup/compression ratio. So, as I already have too many other tasks to keep me busy, I'll use NFS from now on for my lab storage with inline dedup. Cool.
 

ehfortin

That sounds awesome :)
Certainly hope it will be, once 10 Gbps is in place and enterprise SSDs start replacing my 850 EVOs. 10 Gbps should be installed this weekend (awaiting the switch). I'll do some testing on this setup before ordering any enterprise SSDs. Actually, I'm even wondering if I should get a PCIe NVMe card to start with for the first 800 GB or so (it should give a very big boost for random workloads, even used as an RDM, as I expect cheaper ones may not work in passthrough mode with Solaris 11.3). Then I could later use it as L2ARC/SLOG for an SSD RAID-Z pool when I need to add capacity.

Lots of options exist, and it is hard to pick a direction when you can't test the products in your environment. So it can be a wild (and expensive) guess.
 

ehfortin

As published in another thread, I'm now running OmniOS/napp-it on ESXi. I dedicated my H220 (LSI 9205) to that VM and assigned a 10 Gbps vNIC to the VM for NFS. I'm doing some testing on a pool of 2+2 250 GB SSDs (850 EVO) and I'm seeing strange results in some tests. I wanted to try the same IOMeter .icf as in my first post. The results are very bad and don't make sense. While it is running in a Windows 10 VM that sits on the same host as OmniOS, I'm getting about 1,500 IOPS (6 MB/sec). However, while doing this and looking at "zpool iostat" on that pool, I'm seeing a total bandwidth of about 40 MB/sec for about 100 operations/sec as reported by iostat. Why is a tiny 3-5 MB/sec, as reported by IOMeter, generating about 10-12 times more bandwidth as reported by the ZFS pool?

I'm also doing sequential testing with AJA. From the same Windows 10 VM that resides on the same host as OmniOS, I'm getting 123 MB/sec write and 782 MB/sec read on a 4GB file. So this tells me the 10 Gbps pipe is working, but that the pool is not giving the expected results when writing to it. The fact that consumer SSDs are not really able to sustain a load was one of my concerns. My first impression is that using multiple of these drives together does not help much, as you always work at the performance level of the worst disk in each mirror, and the longer you keep the load up, the worse the results get. As an example, the 16GB file test with AJA gives a write result of 50 MB/sec while the read stays at 685 MB/sec.

With that kind of result, it confirms that I'm moving to enterprise SSDs. However, I'm still hesitating between a single larger drive like the Samsung SM863 960 GB and another layout. I understand a single disk is not safe, but I have daily backups of all the VMs that will be on that datastore. For about 20-50% more money, I could get 4x 200/240 GB (same Samsung line, or Intel S3610), but I would only have 400 GB (RAID-10 pool) or 600 GB (RAID-Z). The RAID-10 would give better random performance, while the RAID-Z would give exactly the same thing as the single SM863 but with half the capacity at a 50% premium, price-wise.

Logically, the single SM863 960GB is the better option, even though the next step up will be more expensive, unless I create a new datastore beside it configured in a different way.

So, if any of you have an idea why the ZFS pool is showing a lot more bandwidth than what IOMeter is generating, and how it can be "fixed", that would be great. Any comment on the next acquisition is also welcome.
 
  • Like
Reactions: Biren78

gea

I suppose you have enabled sync write.
This means that with an on-pool ZIL, all data must be written twice to the same pool: once as a fast sequential write via the RAM-based write cache, and once as sync-write logging.

You can compare with sync=disabled. That should give values similar to your earlier tests, as well as much better write values.
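A quick way to run that comparison, assuming the NFS dataset is called `tank/nfs` (name is hypothetical):

```shell
# Temporarily disable sync acknowledgement (benchmarking only: a crash
# can lose the last few seconds of writes while this is set):
zfs set sync=disabled tank/nfs

# ... re-run the IOMeter / AJA tests here ...

# Restore the default behavior afterwards:
zfs set sync=standard tank/nfs
zfs get sync tank/nfs
```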