My Daily Naive Question: Stripe Width...


RobertFontaine

Active Member
Dec 17, 2015
Winterpeg, Canuckistan
Does the following make sense?
What constraints am I missing?

Working backwards from bandwidth:
QDR InfiniBand: 40 Gb/s,
roughly 3,400 MB/s unidirectional.
Consumer-grade SSD sequential r/w (Samsung 850 EVO):
~500 MB/s

4K random r/w:
~90K IOPS

I never know whether a K is 1024 or 1000. In my world 4K is 4,096 bytes and 90K IOPS is 90,000, so 90,000 × 4,096 B ≈ 3.6864 × 10^8 B/s ÷ 10^6 ≈ 369 MB/s - call it 350 MB/s give or take.

If a ZFS striped-mirror pool were efficient and I was targeting transactions of 4K or larger, I would saturate a QDR connection with a striped pool about 7 or 8 SSDs wide.

So:
if I am constrained to QDR InfiniBand (NFS over RDMA),
and I accept 4K as a normal transaction size,
then creating a single pool of striped mirrored SSD vdevs much wider than 8 would be wasteful.
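
As a sanity check on that arithmetic, a minimal back-of-envelope sketch in Python (the per-drive figures are the rough numbers quoted above; mirroring is ignored, so this counts stripe members, and ZFS overhead is not modelled):

import math

QDR_MB_S = 3400        # usable unidirectional QDR bandwidth, as above
SEQ_MB_S = 500         # 850 EVO sequential read/write
RAND_IOPS = 90_000     # 850 EVO 4K random IOPS (datasheet burst figure)
BLOCK = 4096           # 4 KiB transaction size

rand_mb_s = RAND_IOPS * BLOCK / 1e6   # ~369 MB/s per drive at 4K random
print(f"4K random per drive: ~{rand_mb_s:.0f} MB/s")
print(f"drives to fill QDR (sequential): {math.ceil(QDR_MB_S / SEQ_MB_S)}")
print(f"drives to fill QDR (4K random) : {math.ceil(QDR_MB_S / rand_mb_s)}")

That lands at roughly 7 drives for sequential and about 10 for pure 4K random, the same ballpark as the 7-8 figure above before any real-world losses.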

Thanks
Robert
 

Naeblis

Active Member
Oct 22, 2015
Folsom, CA
While I don't have a one-to-one comparison today, I might be able to shed some light. With one FDR14 port, these are the speeds I am getting. Mind you, these are NVMe drives and it was my first ever test.


I should have benchmarks for roughly 30 SSDs with RDMA over IB tomorrow.

I just reran with one connection; however, the other end is only a mirror of two drives.



-----------------------------------------------------------------------
CrystalDiskMark 5.0.2 x64 (C) 2007-2015 hiyohiyo
Crystal Dew World : Crystal Dew World
-----------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

Sequential Read (Q= 32,T= 1) : 4134.836 MB/s
Sequential Write (Q= 32,T= 1) : 1218.326 MB/s
Random Read 4KiB (Q= 32,T=18) : 1356.832 MB/s [331257.8 IOPS]
Random Write 4KiB (Q= 32,T=18) : 1064.630 MB/s [259919.4 IOPS]
Sequential Read (T= 1) : 1039.517 MB/s
Sequential Write (T= 1) : 1014.716 MB/s
Random Read 4KiB (Q= 1,T= 1) : 20.202 MB/s [ 4932.1 IOPS]
Random Write 4KiB (Q= 1,T= 1) : 31.706 MB/s [ 7740.7 IOPS]

Test : 4096 MiB [Y: 5.8% (62.6/1075.9 GiB)] (x3) [Interval=5 sec]
Date : 2015/12/31 20:00:00
OS : Windows Server 2012 R2 Datacenter (Full installation) [6.3 Build 9600] (x64)
 

dba

Moderator
Feb 20, 2012
San Francisco Bay Area, California, USA
RobertFontaine said:
Does the following make sense? What constraints am I missing? [...] then creating a single pool of striped mirrored SSD vdevs much wider than 8 would be wasteful.
In a perfect world, perhaps. But in the real world, drives that run at 90K IOPS on average will jitter and put out far lower IOPS for brief periods. With RAID, the frequency of the dips increases with the number of drives in the RAID.
So in the real world you are going to need more drives than in theory to saturate your network, and you are going to need to make sure that those drives can deliver VERY consistent IOPS, or else you'll find that your 4K throughput stops increasing or even drops as you add drives...
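
To put a number on that, here is a toy simulation under an assumed jitter model (the dip probability and dip IOPS are invented for illustration, not measurements). It models the pessimistic case where, in each interval, the whole stripe is gated by its slowest drive; the chance that at least one drive is in a dip grows with width.

import random
import statistics

NOMINAL_IOPS = 90_000     # per-drive 4K IOPS when behaving
DIP_IOPS = 5_000          # per-drive 4K IOPS during a GC "dip" (assumed)
DIP_PROBABILITY = 0.02    # assumed: 2% of intervals each drive is dipping
INTERVALS = 10_000

def pool_iops(width: int) -> float:
    """Mean pool IOPS when every interval runs at the slowest drive's speed."""
    totals = []
    for _ in range(INTERVALS):
        slowest = min(
            DIP_IOPS if random.random() < DIP_PROBABILITY else NOMINAL_IOPS
            for _ in range(width)
        )
        totals.append(slowest * width)    # all drives throttled to the slowest
    return statistics.mean(totals)

for width in (2, 4, 8, 16):
    ideal = width * NOMINAL_IOPS
    print(f"{width:2d} drives: ~{pool_iops(width):>9,.0f} IOPS (ideal {ideal:,})")

Even with only a 2% per-drive dip rate, a 16-wide stripe spends over a quarter of its intervals throttled in this model, which is why consistent drives matter more than peak IOPS.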
 

RobertFontaine

Active Member
Dec 17, 2015
Winterpeg, Canuckistan
Thanks, that's interesting... the numbers are starting to have more context.

I suppose the next steps might be to:

a) load a RAM disk and measure how much of the latency is network vs. how much is storage (a rough measurement sketch follows below);
b) create a RAM-disk SLOG / L2ARC and see if there are any opportunities to reduce latency with faster lookups. Zeus, or even 8 GB of 3D XPoint later this year, might tighten the lookups a bit;
c) for faster straight reads, some kind of dual-channel FDR teaming could buy up to ~7,500 MB/s until 100 Gb IB/Ethernet is cheap (could be waiting a long time - no use cases outside a data centre / between data centres).

Beyond improving lookup time and tuning for the appropriate transaction size, is there anything that can be done to reduce switching time other than building out a full-blown distributed file system? I suspect a few nanoseconds of switch time, with an overlap per device, means that a lot of the jitter is a function of basic physics and can't be beaten without parallel pipelines or smaller pools with more channels. These kinds of numbers already assume that we have physically split reads from writes into two separate pools/channels/servers.

Are the optimal stripe widths for ZFS mirrored SATA, SAS, and NVMe SSDs well understood and published somewhere yet, or is it still a black art? Lustre probably makes a lot more sense for heavy metal, but some of us have to race what we bring.
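
A minimal sketch of step (a), assuming a Linux client and hypothetical mount points for a RAM-backed export and the SSD-backed export; it times 4 KiB random reads with O_DIRECT so the client page cache doesn't hide the network round trip:

import mmap
import os
import random
import statistics
import time

BLOCK = 4096      # 4 KiB, matching the transaction size discussed above
SAMPLES = 2000

def read_latencies_us(path):
    """Time random 4 KiB reads from a large pre-made test file, bypassing the page cache."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)   # O_DIRECT is Linux-only
    size = os.fstat(fd).st_size
    buf = mmap.mmap(-1, BLOCK)                      # page-aligned buffer, required by O_DIRECT
    lats = []
    for _ in range(SAMPLES):
        offset = random.randrange(size // BLOCK) * BLOCK
        t0 = time.perf_counter()
        os.preadv(fd, [buf], offset)
        lats.append((time.perf_counter() - t0) * 1e6)
    os.close(fd)
    return lats

# Hypothetical mount points: one NFS/RDMA export backed by a RAM disk, one by the SSD pool.
for label, path in (("ram-backed", "/mnt/nfs_ram/test.bin"),
                    ("ssd-backed", "/mnt/nfs_ssd/test.bin")):
    lat = read_latencies_us(path)
    print(f"{label}: median {statistics.median(lat):.0f} us, "
          f"p99 {sorted(lat)[int(0.99 * len(lat))]:.0f} us")

The RAM-backed median approximates the network plus NFS/RDMA overhead floor; the gap between the two runs is roughly the storage-side contribution.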

Budget for the storage node this year will almost certainly be constrained to something like an SC216 chassis: 16 × 256 GB SATA SSDs in striped mirrors, 2 × something quick for SLOG, 2 × something quick for L2ARC, boot drives, and a dual QDR card, plus or minus a bit.

I won't have huge datasets this year (staying around 50% capacity - a terabyte or so should be OK for now).
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
You're still not going to get anywhere near the performance you think from those SSDs. With 16 × 256 GB consumer drives, that's steady-state performance of <5K IOPS each, and then you add ZFS on top of that...
Reference: Experience with consumer drives & this The Samsung SSD 850 EVO mSATA/M.2 Review

Compare to the S3700 200GB, which is well above 10,000 each.
The Intel SSD DC S3700 (200GB) Review

S3500 is another choice... I just got some in to compare, priced MUCH cheaper.

You would still need a really fast SLOG to get the most performance out of an S3700 pool of mirrors, too.

Just something to think about.

If you don't need capacity and want performance, then NVMe makes the most sense.
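
A back-of-envelope aggregation of those steady-state figures (the per-drive numbers are the rough values quoted in this thread, not my own benchmarks; mirroring, which roughly halves usable write IOPS, is ignored):

DRIVES = 16
BLOCK = 4096

configs = {
    "850 EVO 256GB, steady state (~5K IOPS as quoted above)": 5_000,
    "850 EVO 256GB, datasheet burst (~90K IOPS)": 90_000,
    "S3700 200GB, steady state (>10K IOPS as quoted above)": 10_000,
}

for name, per_drive in configs.items():
    total = DRIVES * per_drive
    print(f"{name}: {total:,} IOPS pool-wide ≈ {total * BLOCK / 1e6:,.0f} MB/s at 4K")

At steady state, 16 consumer drives come nowhere near the ~3,400 MB/s QDR figure from earlier in the thread, while the burst numbers suggest they would - which is exactly the gap being pointed out here.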
 

RobertFontaine

Active Member
Dec 17, 2015
Winterpeg, Canuckistan
I go back and forth between hot-rodding local NVMe drives in RAID 0 as a fast scratch disk and going with a much slower but more reliable storage server.

You may be right... NVMe is about $1/GB, SATA SSD is about 50 cents. So a terabyte of NVMe is around $1,000 Canadian, and a cheap spinny-disk ZFS storage server is very cheap, $500

vs

4 TB × $0.50/GB = $2K in SSDs, plus a $750 server, say $3K all in. So twice the price and less performance.

Doesn't seem like a brilliant idea if I look at it like that.
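
For what it's worth, the same comparison as a toy calculation using the rough end-of-2015 Canadian-dollar figures above (assumed prices from this post, not current market data):

def build_cost(capacity_tb, price_per_gb, platform_cost):
    """Rough build cost: drives at a flat $/GB plus the host/server hardware."""
    return capacity_tb * 1000 * price_per_gb + platform_cost

nvme_scratch = build_cost(1, 1.00, 0)      # 1 TB local NVMe, reuses the existing box
sata_zfs     = build_cost(4, 0.50, 750)    # 4 TB SATA SSD pool plus a storage server

print(f"1 TB local NVMe scratch : ~${nvme_scratch:,.0f} CAD")
print(f"4 TB SATA SSD ZFS server: ~${sata_zfs:,.0f} CAD")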
 

dba

Moderator
Feb 20, 2012
San Francisco Bay Area, California, USA
T_Minus said:
You're still not going to get anywhere near the performance you think from those SSDs... [...] If you don't need capacity and want performance, then NVMe makes the most sense.
I completely agree that NVMe is the way to go if your budget allows for it.

If you are going to use consumer-grade SSD drives, however, there is a very simple way to make most of them perform like Enterprise drives, with lower and more consistent latency under load, higher IOPS, and no GC "pauses": Overprovisioning. I have a 72-drive SATA RAID using Samsung 840 and 850 Pro drives. The drives are 128GB formatted to 100GB. With that small change, they perform splendidly. Without the overprovisioning, the system would have been a dud.

Refer to the graph here:
Samsung SSD 850 Pro (128GB, 256GB & 1TB) Review: Enter the 3D Era
Click on the Samsung 840 Pro 256GB button and look at the latency graph. It is dismal, with frequent dips below 1,000 IOPS and even some dips down to 100 IOPS. 24 of these in a RAID would be pathetic. Now click on the 206GB button - the same drive but with overprovisioning. IOPS goes WAY up, and the consistency looks fantastic. Newer drives like the 850 Pro behave better with default OP, but nonetheless it is still VERY highly recommended to bake in more OP when using consumer drives in a server... which is why enterprise drives come in 200/400/800 GB and not 256/512/1024 GB.
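
A quick sketch of the overprovisioning arithmetic implied here (spare area as a fraction of raw NAND, using the formatted sizes mentioned in this thread):

def op_percent(raw_gb, formatted_gb):
    """Spare area the controller keeps in reserve, as a percentage of raw NAND."""
    return (raw_gb - formatted_gb) / raw_gb * 100

examples = [
    ("840/850 Pro 128GB formatted to 100GB (as above)", 128, 100),
    ("840 Pro 256GB formatted to 206GB (as above)",     256, 206),
    ("typical enterprise sizing: 256GB of NAND sold as 200GB", 256, 200),
]

for label, raw, fmt in examples:
    print(f"{label}: ~{op_percent(raw, fmt):.0f}% spare area")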

But, just to repeat, if I were to build the same system today, it would be NVMe.