This will be a work in progress, with the intent to eventually implement this on the recommended 4 nodes (here is a link to the current recommendations for Storage Spaces Direct hardware requirements). The plan is to give Storage Spaces Direct a trial, then set up ScaleIO on the same hardware and compare the results. I may not be able to post my ScaleIO results due to some limitations with EMC's ULA, but I will reach out to them when I get to that point. I am starting with two nodes simply because of cost; I did not want to purchase 4 full systems just to find out that performance was not where I wanted it. If performance scaling is acceptable, I will purchase the 1 extra node needed for ScaleIO (minimum of 3 nodes) or the 2 extra nodes needed for Storage Spaces Direct (S2D) (minimum of 4 nodes).
**Hardware used to test per node**
- 2x E5-2670
- 128GB DDR3 ECC registered 1600MHz
- 4x 400GB Intel 750 NVMe drives
- 2x Mellanox ConnectX-3 40/56Gb VPI dual-port cards
- SuperMicro X9DRH-iF
***Spoiler: Large IO is easy for S2D***
(1024KB blocks, 32 threads, 1 outstanding IO at 29.6 GB/s)
Day #1 (Trial using 2 nodes)
**1st test: single node, single NVMe drive** - to see the speed of a single drive. For the sake of all of these tests, small random IO will be the only thing I list, since sequential large IO is very easy for S2D but small IO scaling will be the issue. I use DiskSpd 2.0.15 for Windows for testing, and it can really push the IO.
./diskspd.exe -c100G -d10 -r -w0 -t32 -o32 -b4K -h -L D:\testfile.dat
417,670 random 4k read iops
./diskspd.exe -c100G -d10 -r -w100 -t32 -o32 -b4K -h -L D:\testfile.dat
250,728 random 4k write iops
CPU during both reads and writes, to show that DiskSpd can fully utilize all the cores of both CPUs.
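The utilization figures here come from screenshots; as a rough alternative, the same per-core numbers can be logged from PowerShell during a run with the standard Processor counter set (the interval and sample count below are arbitrary, not what I used):
Code:
# Sample per-core CPU utilization 10 times, 2 seconds apart, while DiskSpd is running.
Get-Counter -Counter '\Processor(*)\% Processor Time' -SampleInterval 2 -MaxSamples 10 |
    ForEach-Object { $_.CounterSamples | Select-Object InstanceName, CookedValue }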
**2nd test: single node, 4x NVMe drives, 4-column simple space w/ default interleave**
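For reference, here is a rough sketch of how a 4-column simple space like this can be created from the four NVMe drives; the pool and virtual disk names are placeholders, not necessarily what I used:
Code:
# Pool the four NVMe drives and carve out a simple (non-resilient) space with 4 columns
# and the default interleave; "NVMePool" and "Simple4Col" are placeholder names.
$nvme = Get-PhysicalDisk -CanPool $true
New-StoragePool -FriendlyName "NVMePool" -StorageSubSystemFriendlyName "Windows Storage*" -PhysicalDisks $nvme
New-VirtualDisk -StoragePoolFriendlyName "NVMePool" -FriendlyName "Simple4Col" -ResiliencySettingName Simple -NumberOfColumns 4 -UseMaximumSize
Then the same single-file test as before, now against the striped space: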
./diskspd.exe -c100G -d10 -r -w0 -t32 -o32 -b4K -h -L D:\testfile.dat
326,857 random 4k read iops - 4x NVMe 4-column simple space
First issue to solve: why did this happen? To dig deeper I wanted to play with thread count, since I have experienced this issue once before with socket 1366 nodes, where NUMA placement turned out to be the problem. Observe what happens around a thread count of 14 using this script, which runs the benchmark through increasing thread counts:
Code:
# Run the same DiskSpd test at thread counts from 1 to 32 and print one summary line per run.
# -w0 = 100% reads (the read numbers quoted in this post), -r -b4k = random 4K, -o32 = 32 outstanding IOs per thread.
1..32 | % {
    $param = "-t $_"
    $result = C:\Diskspd-v2.0.15\amd64fre\diskspd.exe -c100G -d10 -w0 -r -b4k $param -o32 -h -L D:\testfile.dat
    # Grab the "total:" summary line and the "avg." CPU line from DiskSpd's output.
    foreach ($line in $result) { if ($line -like "total:*") { $total = $line; break } }
    foreach ($line in $result) { if ($line -like "avg.*") { $avg = $line; break } }
    # Fields after "total:" are pipe-delimited: bytes | I/Os | MB/s | IO/s | avg latency.
    $mbps = $total.Split("|")[2].Trim()
    $iops = $total.Split("|")[3].Trim()
    $latency = $total.Split("|")[4].Trim()
    $cpu = $avg.Split("|")[1].Trim()
    "Param $param, $iops iops, $mbps MB/sec, $latency ms, $cpu CPU"
}
712,465 random 4k read iops with the 4x NVMe 4-column simple space. This is better than 326,857, but only ~42% of what 4 drives should deliver.
So let's go back and check what 4 separate NVMe drives can do.
./diskspd.exe -c100G -d10 -r -w0 -t8 -o8 -b4K -h -L d:\test.dat e:\test.dat f:\test.dat g:\test.dat
Even worse. Let's try this test using just one CPU.
./diskspd.exe -c100G -d10 -r -w0 -t4 -o8 -b4K -h -L -n d:\test.dat e:\test.dat f:\test.dat g:\test.dat (-n disables DiskSpd's default CPU affinity; I also dropped the thread count from 8 to 4 so the run would only hit one CPU, and ran it twice since the first run lands on CPU 0 and the next run on CPU 1)
983,278 random 4k read iops over 4 independent NVMe drives. This is getting better, but I will have to do some reading to figure out why this is happening; I am not having any scaling problems with my dual E5-2680 v2 setup. I will post images when I move the drives back into it to re-verify this problem.
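If the NUMA hop turns out to be the culprit, DiskSpd's -a affinity list (if this build supports it) is another way to test it: instead of -n plus repeat runs, the threads can be pinned to specific CPUs so each socket is targeted explicitly. A sketch, assuming socket 0 owns logical CPUs 0-15 and socket 1 owns 16-31 (the numbering may differ on other systems):
Code:
# Pin 4 threads to the first four cores of socket 0, then repeat on socket 1.
./diskspd.exe -c100G -d10 -r -w0 -t4 -o8 -b4K -h -L -a0,1,2,3 d:\test.dat e:\test.dat f:\test.dat g:\test.dat
./diskspd.exe -c100G -d10 -r -w0 -t4 -o8 -b4K -h -L -a16,17,18,19 d:\test.dat e:\test.dat f:\test.dat g:\test.dat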
./diskspd.exe -c100G -d10 -r -w100 -t8 -o8 -b4K -h -L d:\test.dat e:\test.dat f:\test.dat g:\test.dat
Still need to figure out what is going on with these systems and hopefully get them close to the 1.5-2 million read iops that I am under the impression they can do.
**3rd test: two-node hyperconverged cluster, mirrored 4-column NVMe**
**12/15/2015: since I spent a lot of time today troubleshooting, I only have a few shots of these benchmarks; more will be added soon**
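For context, here is a rough sketch of how a two-node cluster, the S2D pool, and a mirrored CSV can be stood up; the cluster name, node names, and volume size are placeholders rather than my exact values, and on this tech-preview build extra steps (validation, a witness, manual pool creation) may also be needed:
Code:
# Build the cluster, enable Storage Spaces Direct, and create a mirrored CSVFS volume.
New-Cluster -Name S2DTest -Node node1,node2 -NoStorage
Enable-ClusterStorageSpacesDirect
New-Volume -StoragePoolFriendlyName "S2D*" -FriendlyName "Volume1" -FileSystem CSVFS_ReFS -ResiliencySettingName Mirror -Size 1TB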
Ran this increasing thread count script for reads on cluster shared volume 1
Code:
# Same thread-count sweep as above, pointed at the cluster shared volume (-w0 = 100% random 4K reads).
1..32 | % {
    $param = "-t $_"
    $result = C:\Diskspd-v2.0.15\amd64fre\diskspd.exe -c100G -d10 -w0 -r -b4k $param -o32 -h -L C:\ClusterStorage\Volume1\testfile.dat
    foreach ($line in $result) { if ($line -like "total:*") { $total = $line; break } }
    foreach ($line in $result) { if ($line -like "avg.*") { $avg = $line; break } }
    $mbps = $total.Split("|")[2].Trim()
    $iops = $total.Split("|")[3].Trim()
    $latency = $total.Split("|")[4].Trim()
    $cpu = $avg.Split("|")[1].Trim()
    "Param $param, $iops iops, $mbps MB/sec, $latency ms, $cpu CPU"
}
312,513 random 4k read iops from the node that does not own the cluster shared volume. Since I am seeing single-system limits of 650,000-675,000 iops, running a mirror set and getting about half of that isn't too bad, but we are now getting random read speeds of less than a single NVMe drive. I have no doubt that it's going to take some tweaking to fully maximize the speeds of these systems, but all things considered, not a bad first attempt.
A note to consider: the second node, which was the owner node for the cluster shared volume, was at 100% CPU utilization during this test, while, as the image above shows, the node running DiskSpd was only seeing 46.40% utilization. When I changed the CSV owner over to the node running DiskSpd, that node was then utilizing 100% CPU and the other node was only at 46%.
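One way to check and flip which node owns (coordinates) the CSV is shown below; the swap described above can also be done from Failover Cluster Manager, and the CSV and node names here are placeholders:
Code:
# Show which node currently owns each cluster shared volume, then move ownership.
Get-ClusterSharedVolume | Select-Object Name, OwnerNode
Get-ClusterSharedVolume "Cluster Virtual Disk (Volume1)" | Move-ClusterSharedVolume -Node node1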
There will be more to come, and I have no doubt I can pull more speed from this setup (it may take different hardware to hit my goals, i.e. BIOS updates, v2 CPUs, etc.; not sure yet), but there is one saving grace, and that is large IO, which was pretty impressive:
./diskspd.exe -c1000G -d10 -w0 -t32 -o1 -b1024K -h -L C:\ClusterStorage\Volume1\test.dat
Yep, that's 29.618 gigabytes per second of read speed, which is faster than my 8 drives in the dual E5-2680 v2 box running the same command, by about 6 gigabytes per second.