High performance AIO


ehfortin

Member
Nov 1, 2015
56
5
8
53
I suppose you have enabled sync write.
This means that with an on-pool ZIL, all data must be written twice to the same pool: once as a fast sequential write through the RAM-based write cache, and once as a sync write to the log.

You can compare with sync=disabled. This should give similar read values and better write values.
You are right. Sync write was enabled. Can you tell me if disabling it is roughly equivalent to what a SOHO NAS would do? I understand it means losing some data in case of a failure, but it seems to me that's about the same risk as pretty much any computer/server that is not ZFS based, is that correct? If so, I have to remember that this is a lab, so if it increases performance by a lot, it may be worth it in that context.

I've done a lot of testing this afternoon, always using the same iometer test (4KB, 100% random, 50% read/50% write). With no dedup and no compression, my RAIDZ (3x 850 EVO) sustains 135 MB/sec (33,000 IOPS). However, as soon as I activate dedup, it drops after some time to a ridiculous 20 MB/sec, which I don't understand since I gave the server 32 GB of RAM and my pool only has 470 GB usable, with a 12 GB file for the test. My expectation was that dedup would reduce performance a little, but since the DDT is fully in memory and the CPU is far from maxed out (E3-1220v3), random writes to the RAIDZ should occur at the same rate it was able to sustain without dedup. Am I misunderstanding how it works?
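
For reference, a minimal sketch of how to set up that comparison and peek at the dedup table (the dataset name tank/test is a placeholder for whatever the pool/dataset is actually called):

Code:
# check what the dataset is currently set to
zfs get sync,dedup,compression tank/test

# switch sync off for the comparison run...
zfs set sync=disabled tank/test

# ...and restore the default afterwards
zfs set sync=standard tank/test

# with dedup active, show how many DDT entries exist and how much
# space they take on disk and in core
zpool status -D tank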
 

Biren78

Active Member
Jan 16, 2013
550
94
28
@ehfortin from what I've been reading, older ZFS dedupe was not great. Oracle has improved it, but you need a newer Solaris version for that.
 

gea

Well-Known Member
Dec 31, 2010
3,161
1,195
113
DE
You are right. Sync write was enabled. Can you tell me if disabling it is roughly equivalent to what a SOHO NAS would do? I understand it means losing some data in case of a failure, but it seems to me that's about the same risk as pretty much any computer/server that is not ZFS based, is that correct? If so, I have to remember that this is a lab, so if it increases performance by a lot, it may be worth it in that context.

I've done a lot of testing this afternoon, always using the same iometer test (4KB, 100% random, 50% read/50% write). With no dedup and no compression, my RAIDZ (3x 850 EVO) sustains 135 MB/sec (33,000 IOPS). However, as soon as I activate dedup, it drops after some time to a ridiculous 20 MB/sec, which I don't understand since I gave the server 32 GB of RAM and my pool only has 470 GB usable, with a 12 GB file for the test. My expectation was that dedup would reduce performance a little, but since the DDT is fully in memory and the CPU is far from maxed out (E3-1220v3), random writes to the RAIDZ should occur at the same rate it was able to sustain without dedup. Am I misunderstanding how it works?
Every OS, controller or disk uses a write cache to improve performance. On a power outage you can lose the last writes, somewhere in the upper-megabyte to lower-gigabyte range. The main problem with such a loss: it can happen that some data modifications are on disk but not all (transactions), or that the affected metadata is not updated, resulting in a corrupted filesystem. ZFS itself is not affected thanks to its Copy-on-Write behaviour, which keeps the filesystem always intact. This is why you do not need sync for a regular filer, but only for databases or for older filesystems stored on ZFS, e.g. via ESXi.

With hardware RAID you can use a cache + BBU to protect against this. With ZFS you achieve power-loss-safe behavior (a committed write is really on disk) with sync write. This is safer than hardware RAID + BBU (no write-hole problem), and with an Slog it is even faster. But you are right, this is a general storage problem, not a ZFS problem; the ZFS solution is just superior.
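
As an illustration of the Slog route, a minimal sketch (the device and dataset names are placeholders, not a hardware recommendation):

Code:
# add a dedicated log device (Slog) so sync writes no longer land on the data vdevs twice
zpool add tank log c0t5d0

# or mirror the Slog for safety
zpool add tank log mirror c0t5d0 c0t6d0

# force sync semantics on a dataset that backs VMs, e.g. an NFS datastore for ESXi
zfs set sync=always tank/vmstore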

Regarding realtime dedup:
You need several GB of RAM per TB of data for the dedup tables. There are efforts to reduce this, which may result in a lower dedup rate as well. The other effect is higher latency, as you must process the dedup table on every read/write. You must weigh this against the performance improvement from doing fewer reads/writes.
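
To put rough numbers on the RAM requirement, a sketch using standard tools (the pool name tank is a placeholder; the ~320 bytes per in-core DDT entry is the commonly quoted rule of thumb, so treat the arithmetic as an estimate):

Code:
# simulate dedup on an existing pool and print the projected DDT histogram and ratio
zdb -S tank

# on a pool that already has dedup enabled, print the actual DDT statistics
zdb -DD tank

# rough estimate: entries x ~320 bytes in core
# e.g. ~8M unique 128K blocks (~1 TB of data) -> roughly 2.5 GB of RAM for the DDT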
 

ehfortin

Member
Nov 1, 2015
56
5
8
53
@ehfortin from what I've been reading, older ZFS dedupe was not great. Oracle has improved it, but you need a newer Solaris version for that.
Actually, these tests were on Solaris 11.3 on a physical HP ML310e Gen8 v2 with an E3-1220v3 and 32 GB of RAM. It should have had everything it needed to perform better than that. I plan to do the same testing with OmniOS to see if it reacts the same way.

One thing is for sure: I tested the same load against Windows 2012 R2 over NFS on the same hardware and it was crap. However, if I run the same IOmeter load directly on the Windows server, performance is as good as it should be.

I didn't think it would be this complicated to get performance out of storage...
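
In case it helps anyone chasing the same gap, a rough way to check whether sync writes over NFS are the limiting factor (the dataset name tank/nfsshare is hypothetical):

Code:
# see whether the shared dataset honours sync requests from the client
zfs get sync tank/nfsshare

# watch per-vdev activity while the NFS load runs; heavy log/double writes point to sync
zpool iostat -v tank 5

# for a quick A/B test only: disable sync on the share, rerun, then restore it
zfs set sync=disabled tank/nfsshare
zfs set sync=standard tank/nfsshare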
 

ehfortin

Member
Nov 1, 2015
56
5
8
53
I recreated the same test on OmniOS in a VM (AIO) and I'm getting results that are a lot more usable. I'll have to push the machine harder, but so far I have sustained 150 MB/sec with the same IOmeter load, and AJA is generating 269/331 MB/sec sequential R/W, both loads with dedup and compression activated. The pool is configured the same way (RAIDZ, 3x 850 EVO). Right now I'm running the iometer load while migrating a VM to the ZFS pool, and both are operating as if nothing else were running. Very impressive when everything works as it's supposed to.

I will continue to test with this setup and report my findings. I hope it is a keeper.
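
For reference, a minimal sketch of how such a pool could be laid out on OmniOS (disk and dataset names are placeholders, not the exact device IDs used here):

Code:
# three-disk RAIDZ, one dataset for the VMs
zpool create tank raidz c2t0d0 c2t1d0 c2t2d0
zfs create tank/vmstore

# enable compression and dedup on the dataset
zfs set compression=lz4 tank/vmstore
zfs set dedup=on tank/vmstore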
 

ehfortin

Member
Nov 1, 2015
56
5
8
53
As I don't have much time today for performance testing, I decided to migrate about 20 VMs to the RAIDZ set with compression and dedup. The first 10 VMs had a dedup ratio near 5X and were only using 19 GB on disk, which is the size of a single VM (they are actually 50 GB each, but compressed they use 19 GB on disk). Then, after starting the other 10 VMs (also nearly identical), the dedup ratio started to decrease. This was kind of curious as they are about the same. I will have to retest this as it doesn't make a lot of sense; the dedup ratio should have continued to increase as I was adding new copies of roughly the same thing. Is there something more to ZFS dedup than pure identical-block dedup? For example, is there a limit to the number of times a block can be referenced?

I now have about 50 VMs migrated; the dedup ratio is displayed as 2.66X while 185 GB are allocated. If I do a "du -ms" on the filesystem, I get 508 GB, which is in line with what zfs list is calculating. It is a great technology if you have RAM, CPU power and SSDs.
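
For anyone wanting to compare the same numbers on their own pool, a small sketch of the properties involved (the dataset/pool names are placeholders):

Code:
# logical size (before compression/dedup) vs. space actually consumed by the dataset
zfs get used,logicalused,compressratio tank/vmstore

# pool-wide allocation and dedup ratio, as reported by zpool
zpool list -o name,size,alloc,free,dedupratio tank

# detailed DDT histogram, including how many times blocks are referenced
zdb -DD tank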