Decision to make: SAN Replacement with Direct Attached Storage

Benten93

Member
Nov 16, 2015
Hi guys,

Right now I am looking into converting my SAN setup (see below) to a DAS setup for my two-host VMware cluster.

The specs of the SAN right now:

Chassis: Intertech 4U 20bay
CPU: Intel Xeon E5-1630 V3 (4x 3.7 GHz + HT)
Mainboard: Supermicro X10SRH-CLN4F
RAM: 2x 32GB Samsung DDR4-2133 (64GB)
Networking: 2x Intel X520-DA2
Software: FreeNAS 11

Disks:
8x 5TB WD RED in Mirrored vDevs (aka RAID10)
6x 480GB Samsung PM863a in mirrored vDevs (aka RAID10)
1x SSD internal for OS

VMware Cluster:

2x HP DL380p G8 each with:

CPU: 2x E5-2690 v2
RAM: 256GB DDR3
Networking: 10Gbit 2-Port SFP+
no hard drives, but enough room for HBAs

The setup is running fine right now, but I notice weak I/O performance in my VMs on the iSCSI volumes, even when the SSDs are only 10-20% busy.
The reason I am not putting the storage into the VMware hosts and using vSAN is the licensing cost.
I am running about 40 VMs on the cluster, ranging from AD, Exchange and minor MySQL/MSSQL databases to some game and test machines.
Given the network latency of about 0.25 ms, the maximum at QD1 should be around 4000 IOPS (1 s / 0.25 ms), but I can't tell whether that is the bottleneck that pulls the VMs down so much.
All in all, I am reaching out to you now, hoping to get some hints on whether it is a good idea to replace my SAN with a DAS solution like an HP D2600 or similar.
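(For anyone checking my math, here is a rough back-of-the-envelope sketch; the 0.25 ms is just my measured round-trip time and the queue depths are only examples:)

$ awk 'BEGIN {
    rtt_ms = 0.25                        # measured host-to-SAN round trip
    for (qd = 1; qd <= 32; qd *= 2)      # one in-flight IO per queue slot, no other overhead assumed
        printf "QD %2d: ~%6d IOPS\n", qd, qd * 1000 / rtt_ms
}'
QD  1: ~  4000 IOPS
QD  2: ~  8000 IOPS
QD  4: ~ 16000 IOPS
QD  8: ~ 32000 IOPS
QD 16: ~ 64000 IOPS
QD 32: ~128000 IOPS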

If you need any further information... no problem!
 

Benten93

Member
Nov 16, 2015
Running 10G...?
Latency shouldn't be a problem on your SAN config....
Latency is 0.25 ms (ping time), so the maximum IOPS with QD1 is like I mentioned above, isn't it?
With a higher QD it's definitely better, but I don't know why my VMs get such bad responsiveness...
So my guess was the SAN "overhead".

EDIT: To add, each host has one dedicated link directly to the SAN and another connected via a 10G switch:

SAN_L1 -- Host1_L1
SAN_L2 -- Host2_L1
SAN_L3 -- SWITCH -- Host1_L2
SAN_L4 -- SWITCH -- Host2_L2

The paths in VMware are configured as round robin with an IOPS change frequency of 1!
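(For reference, this is roughly how that setting can be checked or changed on the ESXi hosts; the naa. device ID is just a placeholder for one of the iSCSI LUNs:)

$ esxcli storage nmp device list                 # shows each LUN, its PSP and current path config
$ esxcli storage nmp psp roundrobin deviceconfig get --device=naa.xxxxxxxxxxxxxxxx
$ esxcli storage nmp psp roundrobin deviceconfig set --device=naa.xxxxxxxxxxxxxxxx --type=iops --iops=1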
 

wildchild

Active Member
Feb 4, 2014
Pretty much running the same config as you are; the difference is my HDD pool size (12x 4TB HGST SATA disks), 2x Intel DC S3700 as SLOG, and a Samsung 840 Pro 256GB over-provisioned to 70%.
Memory is 128GB, plus Intel 10G NICs.

SAN OS is OmniOS.
VMware 6 on 2x HP DL360 G6, 96GB each.

Not seeing any of your latencies, so either you're having FreeNAS iSCSI issues or VMware 10G driver errors.
 

wildchild

Active Member
Feb 4, 2014
What ping time do you have host to SAN?
Why would you want to set the IOPS limit to 1 on 10G?
That would be limiting your throughput.
Even on 1Gb, 10 would be more appropriate.

Experiment a bit with that.

I have multipathing too, running through 2 separate 10G IP subnets on 2 L2 switch fabrics (Ubiquiti UniFi 10G switches, running their alpha firmware).

Vmkping times vary between 0.201 and 0.280 ms under full load.
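(If you want to reproduce that measurement on ESXi, something along these lines works; the vmkernel interface and the SAN IP are placeholders for your own:)

$ vmkping -I vmk1 -c 100 <san-iscsi-ip>   # run while the benchmark is generating load, so the links are not idle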
 

Benten93

Member
Nov 16, 2015
Why would you want to set the IOPS limit to 1 on 10G?
That would be limiting your throughput.
Even on 1Gb, 10 would be more appropriate.

Experiment a bit with that.

I have multipathing too, running through 2 separate 10G IP subnets on 2 L2 switch fabrics (Ubiquiti UniFi 10G switches, running their alpha firmware).

Vmkping times vary between 0.201 and 0.280 ms under full load.
I tested that some time ago; if I have the test results nearby, I will post them.
For best performance, would direct attached via SAS be better?
 

wildchild

Active Member
Feb 4, 2014
Maybe, depends on what is needed; if you have a disk behaving badly, a DAS wouldn't give you better performance either.
Point being, I think there's something off with your config, or maybe with the FreeBSD you're running.

Try an OpenIndiana or Oracle Solaris live CD and test once more.

Occasionally, even performing a low-level format on your SSDs may improve things a lot.
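(For these SATA SSDs that usually means an ATA secure erase rather than a true low-level format; a rough sketch from a Linux live system, with /dev/sdX as a placeholder and the disk pulled out of the pool first. This wipes the drive completely:)

$ hdparm -I /dev/sdX | grep -i frozen                     # must report "not frozen"; suspend/resume or replug if it is
$ hdparm --user-master u --security-set-pass p /dev/sdX   # set a temporary security password ("p")
$ hdparm --user-master u --security-erase p /dev/sdX      # issue the ATA SECURITY ERASE; destroys all data
$ hdparm -I /dev/sdX | grep -i enabled                    # security should read "not enabled" again afterwards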
 

Benten93

Member
Nov 16, 2015
Maybe, depends on what is needed; if you have a disk behaving badly, a DAS wouldn't give you better performance either.
Point being, I think there's something off with your config, or maybe with the FreeBSD you're running.

Try an OpenIndiana or Oracle Solaris live CD and test once more.

Occasionally, even performing a low-level format on your SSDs may improve things a lot.
Thanks for your answer!
Today I tried CentOS 7 on that SAN and I don't know what could be wrong... I got similar results.
I am not sure what I should expect of my SAN performance.
 

wildchild

Active Member
Feb 4, 2014
Now, as OmniOS/OpenIndiana/Solaris are the birthplace of ZFS, I would suggest using them (maybe combined with gea's excellent napp-it web GUI) to baseline your setup and then work backwards from there to find what could be off, because then you would have a known-to-work-well base OS.
If ZFS is still off, a good look at your pool(s) and their config would be needed.

Have you tried a full erase yet (check Thomas-Krenn AG for a how-to)?
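(A few pool settings worth checking while baselining; "tank" is just a placeholder pool name:)

$ zpool status -v tank                          # layout, errors, scrub/resilver state
$ zdb -C tank | grep ashift                     # ashift 12 (4K sectors) is what you want on these SSDs
$ zfs get sync,compression,recordsize tank      # sync=always pushes every write through the ZIL and costs a lot of IOPS without a fast SLOG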
 

niekbergboer

Active Member
Jun 21, 2016
Latency is 0.25 ms (ping time), so the maximum IOPS with QD1 is like I mentioned above, isn't it?
The latency of a once-per-second ping is not indicative of the latency you'll see on a busy system; that one ping has to wait for all power-saving modes (hosts, VMs) to exit.

Try pinging 100 times per second, and you'll see far lower times.

Case in point (pinging a Proxmox VE peer from another peer):
$ ping -c 10 -n 192.168.13.11
PING 192.168.13.11 (192.168.13.11) 56(84) bytes of data.
64 bytes from 192.168.13.11: icmp_seq=1 ttl=64 time=0.215 ms
64 bytes from 192.168.13.11: icmp_seq=2 ttl=64 time=0.295 ms
64 bytes from 192.168.13.11: icmp_seq=3 ttl=64 time=0.251 ms
64 bytes from 192.168.13.11: icmp_seq=4 ttl=64 time=0.152 ms
64 bytes from 192.168.13.11: icmp_seq=5 ttl=64 time=0.360 ms
64 bytes from 192.168.13.11: icmp_seq=6 ttl=64 time=0.406 ms
64 bytes from 192.168.13.11: icmp_seq=7 ttl=64 time=0.147 ms
64 bytes from 192.168.13.11: icmp_seq=8 ttl=64 time=0.170 ms
64 bytes from 192.168.13.11: icmp_seq=9 ttl=64 time=0.354 ms
64 bytes from 192.168.13.11: icmp_seq=10 ttl=64 time=0.348 ms

--- 192.168.13.11 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9206ms
rtt min/avg/max/mdev = 0.147/0.269/0.406/0.093 ms


and many times per second:

$ sudo ping -A -c 100 -n 192.168.13.11
PING 192.168.13.11 (192.168.13.11) 56(84) bytes of data.
64 bytes from 192.168.13.11: icmp_seq=1 ttl=64 time=0.073 ms
64 bytes from 192.168.13.11: icmp_seq=2 ttl=64 time=0.086 ms
64 bytes from 192.168.13.11: icmp_seq=3 ttl=64 time=0.049 ms
64 bytes from 192.168.13.11: icmp_seq=4 ttl=64 time=0.039 ms
64 bytes from 192.168.13.11: icmp_seq=5 ttl=64 time=0.031 ms
64 bytes from 192.168.13.11: icmp_seq=6 ttl=64 time=0.052 ms
64 bytes from 192.168.13.11: icmp_seq=7 ttl=64 time=0.053 ms
64 bytes from 192.168.13.11: icmp_seq=8 ttl=64 time=0.041 ms
64 bytes from 192.168.13.11: icmp_seq=9 ttl=64 time=0.056 ms
[...]
64 bytes from 192.168.13.11: icmp_seq=97 ttl=64 time=0.024 ms
64 bytes from 192.168.13.11: icmp_seq=98 ttl=64 time=0.024 ms
64 bytes from 192.168.13.11: icmp_seq=99 ttl=64 time=0.023 ms
64 bytes from 192.168.13.11: icmp_seq=100 ttl=64 time=0.025 ms

--- 192.168.13.11 ping statistics ---
100 packets transmitted, 100 received, 0% packet loss, time 3ms
rtt min/avg/max/mdev = 0.022/0.027/0.086/0.010 ms, ipg/ewma 0.037/0.024 ms
 

Benten93

Member
Nov 16, 2015
Thanks for your suggestions!

I ran two test benchmarks, one for each datastore. Comparing them, I noticed nearly identical 4K QD1 results (2000-3000 IOPS read, 4000-5000 IOPS write).
Regarding 4K QD32 on the HDD datastore (111MB/s read, 20MB/s write), one can clearly see the caching effect of the SAN's RAM.
Looking at the 4K performance on the SSD datastore (50k IOPS read, 27k IOPS write), I was expecting far more.

I am in the process of setting up an interim storage box so I can tinker around with that SAN box a little more.

Does anyone have a number or info on how many IOPS a 10G iSCSI connection can deliver (theoretical maximum)?
Later I will try the same benchmarks again and watch the esxtop statistics. Maybe there is a hidden misconfiguration...
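(Rough sketch of what I plan to look at; the bandwidth number is just the line-rate ceiling for 4K transfers and ignores all protocol overhead:)

$ awk 'BEGIN { printf "~%d 4K IOPS at 10G line rate\n", 10e9 / 8 / 4096 }'   # ~305k; real iSCSI stays well below this
$ esxtop    # press "u" for the device view, "d" for the adapter view:
            #   DAVG/cmd = latency at the device/SAN, KAVG/cmd = time spent in the ESXi storage stack,
            #   GAVG/cmd = DAVG + KAVG, i.e. what the guest actually sees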
 

Benten93

Member
Nov 16, 2015
Now, as OmniOS/OpenIndiana/Solaris are the birthplace of ZFS, I would suggest using them (maybe combined with gea's excellent napp-it web GUI) to baseline your setup and then work backwards from there to find what could be off, because then you would have a known-to-work-well base OS.
If ZFS is still off, a good look at your pool(s) and their config would be needed.

Have you tried a full erase yet (check Thomas-Krenn AG for a how-to)?
I tried the latest napp-it version on my SAN box yesterday, and after quite some testing with different setups (hardware side), I could not achieve any better 4K performance results than before.
I left my pools detached and hooked up 4 more of the Samsung SSDs to test with. I had erased them when I took them out of use 4 weeks ago.

What I can't understand is that when I test the performance with CrystalDiskMark or ATTO, the SSDs are doing more or less nothing, while the CPU peaks at about 40-50% during the test.

Any help is really welcome!