Intel Optane (32G/800P/900P) for ZFS pools and Slog on Solaris and OmniOS

gea

Well-Known Member
Dec 31, 2010
2,472
834
113
DE
I have got a pair of Intel Optane 800P and updated my pdf with benchmarks
at http://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf

Some important results, especially for sync write vs async write performance.
Sync write performance is the most important value for databases, VM storage or even a filer when you want
to ensure a write behaviour where a crash during writes does not result in a loss
of committed writes.

Benchmark with a Filebench singlestreamwrite

Single Optane 32G (basic pool) on OmniOS
Code:
Fb4 singlestreamwrite.f   sync=always   sync=disabled
                          213.4 MB/s    403.4 MB/s
Not bad, but far below the 900P on sync values and below both the 800P and the 900P on async values

Single Optane 800P-118 (basic pool) on OmniOS

Code:
Fb4 singlestreamwrite.f   sync=always   sync=disabled
                          202.8 MB/s    689.8 MB/s

Single 900P-280 (basic pool) on OmniOS

Code:
Fb4 singlestreamwrite.f   sync=always   sync=disabled
                          674.4 MB/s    1944.5 MB/s
A single 900P is roughly 3x as fast as the 800P

Dual 800P-118 (Raid-0) on OmniOS

Code:
Fb4 singlestreamwrite.f   sync=always   sync=disabled
                          304.6 MB/s    1076.3 MB/s
A single 900P is much faster than a Raid-0 of two 800P


Dual 800P-118 (Raid-0) on Solaris 11.4
Code:
Fb4 singlestreamwrite.f   sync=always   sync=disabled
                          459.8 MB/s    1376.1 MB/s
Solaris is faster than OmniOS/OpenZFS

Dual 900P-280 (Raid-0) on OmniOS
Code:
Fb4 singlestreamwrite.f   sync=always   sync=disabled
                          824.2 MB/s    1708.4 MB/s

Dual 900P-280 (Raid-0) on Solaris 11.4
Code:
Fb4 singlestreamwrite.f   sync=always   sync=disabled
                          938.2 MB/s    2882.2 MB/s
Solaris is faster than OmniOS/OpenZFS

16 x SSD Sandisk Extreme Pro 960 in Raid-0 without Slog
Code:
Fb4 singlestreamwrite.f   sync=always   sync=disabled
                          69.2 MB/s     1759.7 MB/s
69 MB/s sync write performance is good enough for a 1G network but far below the
async value of 1759 MB/s

16 x SSD Sandisk Extreme Pro 960 in Raid-0 with an Optane 800P Slog
Code:
Fb4 singlestreamwrite.f   sync=always   sync=disabled
                          346.4 MB/s    2123.5 MB/s
There is a hefty improvement with an Optane Slog, even for a massive SSD Raid-0 pool,
as the SSDs (even the Sandisk Extreme) are much slower than a single Optane 800P

16 x SSD Sandisk Extreme Pro 960 in Raid-0 with an Optane 900P Slog
Code:
Fb4 singlestreamwrite.f   sync=always   sync=disabled
                          348.8 MB/s    2338.1 MB/s
The difference from 800P to 900P is minimal when used as an Slog for an SSD pool.
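
To make the gap between sync and async writes easier to compare across setups, the figures above can be expressed as "sync as a share of async". This is only a small sketch that reuses some of the numbers posted in this thread; the dictionary labels and the function name are made up for illustration.

```python
# Sync vs. async throughput in MB/s, copied from the Filebench results above.
RESULTS = {
    "Optane 32G, OmniOS": (213.4, 403.4),
    "800P-118, OmniOS": (202.8, 689.8),
    "900P-280, OmniOS": (674.4, 1944.5),
    "16x SSD Raid-0, no Slog": (69.2, 1759.7),
    "16x SSD Raid-0 + 800P Slog": (346.4, 2123.5),
    "16x SSD Raid-0 + 900P Slog": (348.8, 2338.1),
}


def sync_share(sync_mbs, async_mbs):
    """Return sync throughput as a percentage of async throughput."""
    return 100.0 * sync_mbs / async_mbs


for name, (sync_mbs, async_mbs) in RESULTS.items():
    print(f"{name:28s} {sync_share(sync_mbs, async_mbs):5.1f} %")
```

The closer this share gets to 100 %, the less a pool pays for secure sync writes.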


Overall:
If you need high sync write values:
1.) use an Optane 800P 58/118 GB (mind the 365 TBW endurance)
2.) use a better Optane 900P/905P
or the enterprise P4800X (not faster but with guaranteed powerloss protection)
3.) there is no option three

Intel Optane is a game-changing technology
 

Aluminum

Active Member
Sep 7, 2012
431
45
28
Thank you for providing a link I can shove in people's faces when I say even just ~$50 for a little m.2 card is like nitro for ZFS.
 

gea

You can mirror Slogs. This keeps full performance even when one dies and avoids data loss if a crash during a write coincides with a damaged Slog. An Slog failure at any other time is not critical, as ZFS then reverts to the slower on-pool ZIL for logging. Think of a mirrored Slog like a hardware RAID with cache and two battery units.

If you simply add more than one Slog to a pool, ZFS load-balances between them, so each has to handle only part of the load, which results in better performance.
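
As a rough sketch of the two variants, the helper below only builds the `zpool add ... log ...` command line and does not execute anything. The pool name `tank` and the device names are hypothetical placeholders, and the helper function itself is made up for illustration; check the device names on your own system before running any such command.

```python
def slog_add_command(pool, devices, mirror=False):
    """Build the argv for adding log devices to a pool (does not run it)."""
    if not devices:
        raise ValueError("at least one log device is required")
    if mirror and len(devices) < 2:
        raise ValueError("a mirror needs at least two devices")
    argv = ["zpool", "add", pool, "log"]
    if mirror:
        # "log mirror a b" creates one mirrored log vdev
        argv.append("mirror")
    # without "mirror", each device becomes its own log vdev (load balancing)
    return argv + list(devices)


# Hypothetical pool and device names, for illustration only.
print(" ".join(slog_add_command("tank", ["c5t0d0", "c6t0d0"], mirror=True)))
print(" ".join(slog_add_command("tank", ["c5t0d0", "c6t0d0"])))
```

The mirrored form trades the load-balancing gain for protection against an Slog failure during a crash.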

16 x SSD Sandisk Extreme Pro 960 in Raid-0 with two Optane 800P Slog (load balancing)
Code:
Fb4 singlestreamwrite.f   sync=always   sync=disabled
                          476.0 MB/s    2380.3 MB/s
compared to one 800P as Slog:
346 MB/s -> 476 MB/s sequential sync write, around 37% better


16 x SSD Sandisk Extreme Pro 960 in Raid-0 with two Optane 800P Slog + 900P Slog
Code:
Fb4 singlestreamwrite.f   sync=always   sync=disabled
                          544.0 MB/s    2354.5 MB/s
The third Optane Slog gives another, smaller improvement (although the 900P is faster):
476 MB/s -> 544 MB/s, another 14%
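
The relative gains can be recomputed directly from the sync=always figures of the three Slog setups in this thread (346.4, 476.0 and 544.0 MB/s). A minimal sketch, with a function name chosen only for illustration:

```python
def gain_percent(before, after):
    """Relative improvement from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# sync=always results in MB/s for one, two and three Slog devices
one_slog, two_slogs, three_slogs = 346.4, 476.0, 544.0

print(f"1 -> 2 Slogs: {gain_percent(one_slog, two_slogs):.1f} %")
print(f"2 -> 3 Slogs: {gain_percent(two_slogs, three_slogs):.1f} %")
```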

Especially with the quite cheap M.2 Optane 800P-58, it seems a good idea to use more than one for Slog load balancing to improve performance. This also eases their limited write endurance of only 365 TBW, as each must take only half of the write load.
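
The endurance argument is simple arithmetic. The sketch below assumes that log writes are spread evenly across all Slog devices and uses a purely hypothetical load of 0.5 TB of sync writes per day; only the 365 TBW rating comes from this thread.

```python
def years_until_tbw(tbw, daily_writes_tb, n_slogs=1):
    """Years until each Slog reaches its rated TBW, assuming an even split."""
    if n_slogs < 1 or daily_writes_tb <= 0:
        raise ValueError("need n_slogs >= 1 and daily_writes_tb > 0")
    per_device_daily = daily_writes_tb / n_slogs
    return tbw / per_device_daily / 365.0


TBW_800P = 365.0   # rated endurance in TB written, as quoted above
DAILY_TB = 0.5     # hypothetical sync write volume per day

for n in (1, 2, 3):
    print(f"{n} x 800P: {years_until_tbw(TBW_800P, DAILY_TB, n):.1f} years")
```

Under these assumptions, every additional Slog stretches the time to reach the TBW rating proportionally.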

Cheap PCI-e boards for two or four M.2 drives seem a good add-on if the mainboard supports bifurcation,
e.g. the Supermicro AOC-SLG3-2M2

Overall
Best low-cost Slog for lab use: such a card with 2 x 800P-58 as Slog, optionally together with an onboard M.2. This gives extremely good sync write values, and the low write endurance of the cheaper Optane is no longer a problem. This is also an option if you want to build a cheap but ultra-high-performance Raid-Z1 pool with good write endurance from some of them.

Anyone seen a cheap 4 x M.2 PCI-E 2x or 4x adapter?
 

SlickNetAaron

Member
Apr 30, 2016
50
13
8
40
This is really interesting!

Thoughts on why using multiple Slog devices gives only a marginal improvement over a single Slog?


 

gea

Read the nice blog post from nex7 (Nex7's Blog: ZFS Intent Log).

Sync write: "it is single queue depth random sync write with cache flush"

Two Slogs are like a narrow corridor with two doors at the end. The two doors slightly increase the number of people who can pass through the corridor, but they will not double it.

With the 800P you get a real improvement in throughput (roughly 37% in the test above, from 346 MB/s to 476 MB/s, which is not too bad) and you double their write endurance, as every 800P must write only half of the data. The second aspect is more important; the performance increase is a nice add-on.
 

Rand__

Well-Known Member
Mar 6, 2014
4,428
863
113
Would this also work/have a positive effect with multiple slices of the same 900P, i.e. passing through 2-3 virtual disks of a 280GB Optane?
 

gea

I have added a benchmark series with multiple Slog partitions from the same Optane 900P.
Result: there is no performance improvement (unlike the gain from a second 800P), and
as there is no improvement regarding endurance either, this is not recommended.

This is a barebone setup

11 x HGST Ultrastar 7K4000 2TB in Raid-0 without Slog
Code:
Fb4 singlestreamwrite.f   sync=always   sync=disabled
                          36.0 MB/s     1211.5 MB/s

11 x HGST Ultrastar 7K4000 2TB in Raid-0 with one Slog partition from a 900P
Code:
Fb4 singlestreamwrite.f   sync=always   sync=disabled
                          539.6 MB/s    1325.9 MB/s
The improvement from an Optane Slog is dramatic (factor 15)


11 x HGST Ultrastar 7K4000 2TB in Raid-0 with two Slog partitions from a 900P
Code:
Fb4 singlestreamwrite.f   sync=always   sync=disabled
                          547.4 MB/s    1380.7 MB/s
not worth the effort


11 x HGST Ultrastar 7K4000 2TB in Raid-0 with three Slog partitions from a 900P
Code:
Fb4 singlestreamwrite.f   sync=always   sync=disabled
                          495.4 MB/s    1319.5 MB/s
not worth the effort, even slightly slower
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
6,974
1,561
113
CA
16 x SSD Sandisk Extreme Pro 960 in Raid-0 with an Optane 900P Slog = 348 MB/s

Yet

11 x HGST Ultrastar 7K4000 2TB in Raid-0 with one Slog partition from a 900P = 540 MB/s


Your test shows that 16 SSDs in RAID0 are actually significantly SLOWER than 11 2TB spinning HDDs.

How is that possible?

Why are you not testing with 16x S3500 or S3610 or S3710 or other enterprise SSDs instead of consumer "Extreme" drives? These results seem incorrect.
 

gea

Simply because I do not have them lying around. I try to use the same equipment for a test sequence, but mostly I can use it only for a short time.

The question for the last test was whether multiple Slog partitions on an Optane are helpful. For this question a slower pool regarding iops seems more relevant than an ultra fast one as I do not need absolute values but a trend.

And yes, all my tests have shown that (at least for sequential sync writes) a disk-based pool with an Optane as Slog can give very high sync write values, in this case about 40% of the async value (where disk-based pools are very good). This is because the disk pool sees only large async writes, while all the critical small sync writes go to the Optane. This can be even faster than on SSD-based pools, where you have flash-related limits on writes.

btw
The Sandisk Extreme was, besides the Samsung Pro, one of the best desktop SSDs two years ago. If you compare this with the benchmarks in https://forums.servethehome.com/index.php?threads/ssd-performance-issues-again.19625/page-2 where 7 mirrors of DC S3700 reach 477 MB/s sync write with an Optane Slog, the Sandisk value is as expected.

For a test that checks pure iops without cache effects, the results will be different. But sequential sync write values are not too bad for quantifying a system, given the superior read/write RAM caches in ZFS.
 

T_Minus

You can get a single high-quality consumer SSD, install it in your desktop, and transfer sequentially faster than you can with 16 SSDs in RAID0 on ZFS. Not only that, the spinning HDD pool with fewer drives is faster too, with the same SLOG device.

Something is not right: either the test is severely flawed, or those SSDs in fact can't do what they claim sequentially.
 

gea

This is at least what the Filebench singlestreamwrite sync workload shows me. Sequential sync write values scale very badly with the number of pool disks. It is the Slog that determines the overall value, less the pool itself, as there are no small random writes to the pool.

A test mainly for random reads without RAM cache may give completely different results, as then iops are what matters, and there an SSD is far better than a disk. But the question here is the Slog and the performance degradation compared to async writes.
 

T_Minus

gea said:
"This is at least what the Filebench singlestreamwrite sync workload shows me. Sequential sync write values scale very badly with the number of pool disks. It is the Slog that determines the overall value, less the pool itself, as there are no small random writes to the pool.

A test mainly for random reads without RAM cache may give completely different results, as then iops are what matters, and there an SSD is far better than a disk. But the question here is Slog."

Exactly, and that still seems like incorrect results.

Both the HDD and SSD pool had Optane 900P SLOG device for your test, and yet the spinning HDD performed better.

To me that's a huge issue with either ZFS, OS or the benchmark itself.
 

gea

Yes, at least with this particular Filebench workload that I have commonly used in recent tests (any other workload gives different results).

It's not about the absolute value, e.g. whether sync write reaches 300 MB/s or 400 MB/s with SSDs or disks compared to 1-2 GB/s async. It's the trend: sync write performance depends mostly on Slog quality, not pool quality, with very good write results on disk-based pools.

If you have an Optane around, you can try any other sync workload on a pool, e.g. sync dd vs async dd, with and without an Optane Slog, to see if the trend is different (dd write values should be slightly higher or different from Filebench singlestreamwrite, but the trend should be the same).
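
In the same spirit as a sync dd vs async dd comparison, a very rough sketch in Python could look like the one below. It is not Filebench and not dd: it simply writes random data to a file and, in the sync case, calls `os.fsync` after every block. The function name, the block size and the total size are arbitrary choices for illustration, and the demo writes to a temporary directory; to test a pool you would point the path at a filesystem on that pool and use far larger sizes.

```python
import os
import tempfile
import time


def write_throughput(path, total_bytes, block_size, sync_each_block):
    """Write total_bytes to path and return MB/s; optionally fsync every block."""
    block = os.urandom(block_size)  # random data, so compression does not skew results
    written = 0
    start = time.perf_counter()
    with open(path, "wb", buffering=0) as f:
        while written < total_bytes:
            written += f.write(block)
            if sync_each_block:
                os.fsync(f.fileno())  # force this block to stable storage
    elapsed = max(time.perf_counter() - start, 1e-9)
    return written / (1024 * 1024) / elapsed


# Small demo run in a temporary directory (sizes are far too small for a real benchmark).
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "testfile.bin")
    size, bs = 8 * 1024 * 1024, 128 * 1024
    sync_rate = write_throughput(target, size, bs, sync_each_block=True)
    async_rate = write_throughput(target, size, bs, sync_each_block=False)
    print(f"fsync per block: {sync_rate:8.1f} MB/s")
    print(f"buffered       : {async_rate:8.1f} MB/s")
```

Running it once with and once without an Slog on the pool should show whether the trend matches the Filebench numbers; the absolute values will differ.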
 

Aluminum

gea said:
"Anyone seen a cheap 4 x M.2 PCI-E 2x or 4x adapter?"

Asus and Asrock make retail versions, easy to find online for just under $100.

Single-slot, full-height quad M.2 with sizes up to 22110 and a heat "shroud", fan, lights, etc. Motherboard and BIOS absolutely must support x4x4x4x4 bifurcation on the x16 slots though. You could use U.2 drives with adapters if you really wanted, probably cheaper than trying to source an OEM server U.2 cage and proprietary expansion card.

I have two for playing with on my Threadrippers; knowing my buying habits, I will have plenty of spare NVMe SSDs to use eventually.