SSD performance - issues again

Discussion in 'Solaris, Nexenta, OpenIndiana, and napp-it' started by Rand__, Apr 30, 2018.

  1. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,096
    Likes Received:
    674
    There is no physical barrier. This is more an example of the 80:20 rule, which means you only need to spend 20% of the effort to reach 80% of the maximum. If you want to reach the maximum, you need to spend the remaining 80% of the effort on the missing 20%.

    I see it like this:
    The reason is the ZFS filesystem with its goal of ultimate data security, which means more data and more data processing due to checksums. Copy-on-Write increases the amount of data that must be written, as all writes are done ZFS-blockwise and there is no in-file data update; this also adds fragmentation. As data in a ZFS raid is spread quite evenly over the whole pool, there is hardly ever a purely sequential data stream, so disk iops becomes a limiting factor. Without these security-related features a filesystem can be faster.

    And even with one of the very best devices like a DC P3700 you are limited by traditional flash, which means you must erase an SSD block before you can write a page with a ZFS datablock, plus the need for trim and garbage collection. At 3 GB/s and more you are also in regions where you must care about internal performance limits regarding RAM, CPU and PCIe bandwidth, so this requires fine tuning in a lot of places. This is probably the reason why a genuine Solaris is faster than Open-ZFS.

    The only way out is a technological jump like Optane, which is not limited by all the flash restrictions as it can read/write any cell directly, similar to RAM. Optane can give up to 500k iops and ultra-low latency down to 10 µs, without degradation over time and without the need for trim or garbage collection to keep performance high. This is 3-5x better than a P3700 and the reason why it can double the result of a P3700 pool.

    In the end you must also accept that a benchmark is a synthetic test, designed to check performance in a way that limits the effects of RAM and caching, and it is exactly RAM and caching that make ZFS fast despite the higher security approach. On real workloads more than 80% of all random reads are delivered from RAM, which makes pool performance far less relevant, and all small random writes go to RAM as well (besides sync writes). Sync write performance is probably the only benchmark value that relates directly to real-world performance; the other benchmarks are more a hint that performance is as expected, or a way to decide whether a tuning or modification is helpful or not.
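    As a rough illustration (pool and filesystem names below are placeholders, not from this thread): you can see what the sync path costs on a given pool by toggling the sync property on a test filesystem and rerunning the benchmark:

        # force every write to be a sync write (worst case, like an ESXi/NFS datastore)
        zfs set sync=always tank/bench

        # back to the default (only writes requested as sync go through the ZIL/Slog)
        zfs set sync=standard tank/bench

        # ignore sync requests entirely - fast but unsafe, for comparison only
        zfs set sync=disabled tank/bench

    The gap between sync=always and sync=disabled shows how much of your write performance is limited by the log path rather than by the pool itself.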
     
    #21
  2. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,225
    Likes Received:
    441
    I totally agree - sync write is the only thing that I really look at, as this is the one that impacts VM write speeds... that's why it's frustrating to see that both the 3-mirror and the 7-mirror pools only reach 360 MB/s ;)

    So what are the big vendors using for high speed sync writes? Or what did they use before Optane? What write speeds do they get?
     
    #22
  3. i386

    i386 Well-Known Member

    Joined:
    Mar 18, 2016
    Messages:
    1,592
    Likes Received:
    378
    There are special/custom devices like Flashtec NVRAM that are used in high-end storage appliances.
    These devices are DRAM-based (+ NAND for backup in case of power loss) and can do 200k+ 4k random IOPS @ QD1/1 thread ._.
     
    #23
  4. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,225
    Likes Received:
    441
    Hm, sounds interesting, albeit most likely not in my price range ;)

    Single Pair of P3700's
    upload_2018-5-7_10-32-1.png
     
    #24
  5. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,096
    Likes Received:
    674
    Optane can be used as a data disk and as an Slog performance booster for slower pools.
    Besides Optane there are other fast devices for sync logging, partly based on DRAM or fast flash.

    Most vendors show you only sequential values, latency and iops, not the much lower sync write performance.
    If you want to read about sync write on other high-security filers with a CoW filesystem (similar to ZFS),
    you may google for comments on NetApp, a leading storage vendor, e.g.

    "netapp nfs sync write performance"

    The only hint in the specs is latency + 4k random write iops, as there is a relation between them and sync write performance
    (and latency does not scale with the number of disks/vdevs, unlike iops and sequential performance).

    See the NetApp all-flash arrays:
    Flash Array – Reduce SSD Latency with All-Flash EF-Series | NetApp
    Their latency is between 100 µs and 800 µs with up to 1M iops.

    If you compare:
    The latency of a single DC S3700 is around 50 µs with 36k iops,
    a P3500 in comparison is at 35k iops,
    the latency of a single DC P3700 is 20-30 µs with 175k iops.

    Optane 900P/P4800X latency is around 10 µs with 500k iops.
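    As a rough back-of-the-envelope (my own numbers, not from any spec sheet): with a single outstanding commit (QD1) a log device can acknowledge at most 1/latency sync writes per second, ignoring ZFS and network overhead, so

        30 µs latency -> ~33k commits/s -> ~130 MB/s at 4k sync writes
        10 µs latency -> ~100k commits/s -> ~400 MB/s at 4k sync writes

    Larger blocks and more outstanding writes raise these numbers, but latency stays the dominant factor.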
     
    #25
  6. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,225
    Likes Received:
    441
    Very interesting, thanks.

    So basically there are three options currently for a high-speed VM filer...

    1. Stay with ZFS and go Optane
    2. Stay with ZFS but move to async writes
    3. Leave ZFS
     
    #26
  7. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,096
    Likes Received:
    674
    Leaving ZFS while keeping sync write/Slog mostly means using a hardware raid controller with cache and BBU/flash protection.
    I have never seen comparable sync write values with this, so I doubt it will help.

    Using a filesystem with a lower security level, without CoW and checksums, is an option, but who wants that?
     
    #27
  8. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,225
    Likes Received:
    441
    Well 'want' is relative.
    I *want* good performance with ZFS but that seems to be difficult;)
    The question is how I can get that - throwing a ton of SSDs at it doesn't help, as we see. I could sell all of them off and buy two 960GB 905Ps (if I make enough money), double/triple the write speed and quarter the available capacity, but that's not really sounding too great...

    What about putting a 900P (or a pair) as Slog in front of the SSD array? You said that usually wouldn't be needed on SSDs, but Optane is way ahead on sync write, is it not?
     
    #28
  9. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,225
    Likes Received:
    441
    So I finally was able to run an Optane bench (on a pair of new 900P 480GB drives).
    Quite nice for a single mirror.
    upload_2018-5-7_21-0-58.png

    Wonder whether that scales up...
     
    #29
  10. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,096
    Likes Received:
    674
    I have made benchmarks on OmniOS and Solaris with 4 x 900P, see http://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf

    About DC P3700 + Optane Slog:
    Your sync write performance (300-400 MB/s) is good enough for 10G sync write. If you want more, an Optane Slog can make sense, even with a DC P3700 pool. The boost is not as big as with a slower pool, but I would expect an improvement.
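    For reference, adding (and removing) an Slog on the CLI is a one-liner; napp-it has a menu for this as well. Pool and device names below are placeholders, not from this thread:

        # add a single Optane as Slog
        zpool add tank log c3t0d0

        # or a mirrored Slog pair
        zpool add tank log mirror c3t0d0 c4t0d0

        # remove the log device again if it does not help
        zpool remove tank c3t0d0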
     
    #30
  11. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,225
    Likes Received:
    441
    Yes, that's really a good reference document, thanks for that :)

    And I was aiming at 40/100 GbE - not really expecting to fill those up, but I have those cards and would like to utilize them if possible ;)
     
    #31
  12. _alex

    _alex Active Member

    Joined:
    Jan 28, 2016
    Messages:
    873
    Likes Received:
    94
    why not break the mirror and stripe them to get a first idea?
     
    #32
  13. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,225
    Likes Received:
    441
    That's not so easy in ZFS, I believe. Although I never tried ;)

    Here are some results with 7 mirrors and a 900P as Slog...

    7 drives
    nappitI_7.PNG
    7 drives + Optane as slog
    nappitI_7_slog.PNG
     
    #33
  14. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,096
    Likes Received:
    674
    At least around 30% faster

    By the way:
    Raid-0 in ZFS is ultra easy.
    In menu Pools: select two or more disks and vdev type = basic, or
    create a pool from one basic disk as a vdev and add more basic disks as vdevs.
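    On the CLI the equivalent would look roughly like this (disk names are placeholders); note that a pool of basic vdevs has no redundancy:

        # create a stripe (raid-0) from two disks
        zpool create testpool c1t0d0 c2t0d0

        # grow the stripe by adding another basic vdev
        zpool add testpool c3t0d0

        # or turn an existing 2-way mirror into a stripe:
        # detach one side, then add it back as its own vdev
        zpool detach tank c2t0d0
        zpool add tank c2t0d0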
     
    #34
  15. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,225
    Likes Received:
    441
    Ah, never thought to use a basic vdev :p thanks :)

    And yes, with an Optane Slog the SSD pool is usable. Still horrible performance for the amount of hardware involved, but at least it's something ;)
     
    #35
    gea likes this.
  16. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,096
    Likes Received:
    674
    Indeed horrible,
    around 1 million read iops and around 500 MB/s (filebench) to 1000 MB/s (dd) sync write performance...
     
    #36
  17. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,225
    Likes Received:
    441
    So I created a new VM (no Slog yet) and ran the benchmark on it:

    nappitJ_virtual_9.PNG


    Then I created an NFS share, moved a Windows VM to it, and...

    upload_2018-5-14_15-25-12.png


    Not what I was hoping for ;)
     
    #37
  18. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,096
    Likes Received:
    674
    Can you add details about

    Server
    - vnic type (slow e1000 or fast vmxnet3)
    - vmxnet3 settings/buffers
    - tcp buffers
    - NFS buffers and servers

    Windows client
    - vnic type
    - tcp settings, especially interrupt throttling (should be set to off)

    Background:
    the defaults are optimised for a 1G network and to limit RAM/CPU use.
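    As an illustration only (the napp-it tuning guide has the recommended values), these are the kind of knobs on the OmniOS side; the numbers are placeholders:

        # check / raise TCP buffer sizes
        ipadm show-prop -p max_buf,send_buf,recv_buf tcp
        ipadm set-prop -p max_buf=4194304 tcp
        ipadm set-prop -p send_buf=1048576 tcp
        ipadm set-prop -p recv_buf=1048576 tcp

        # check / raise the number of NFS server threads
        sharectl get nfs
        sharectl set -p servers=1024 nfs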
     
    #38
  19. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,225
    Likes Received:
    441
    Ok, I reset the napp-it box to be sure it's ok and updated to the latest and greatest (OmniOS/napp-it).

    Using ESXi 6.5U1, vmxnet3, a 2658 v3 with 4 cores, 48GB RAM, no tuning on napp-it.
    The VM is located on the NFS share and writes to its local (virtual) disk, i.e. the physical NIC is not in play; same CPU, 2 cores, 4GB RAM.


    I will read the tuning guide:)
     
    #39
  20. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,225
    Likes Received:
    441
    So I added the Slog again...

    upload_2018-5-14_22-32-12.png

    Not helping ...

    upload_2018-5-14_22-26-46.png


    Still haven't checked the tuning guide though ;)
     
    #40