VSAN 3Node - Real numbers?

CaptainPoundSand

Active Member
Mar 16, 2019
122
85
28
Ashburn, VA
What are realistic read/write numbers from vSAN? I know this is a *big* "it depends".

My setup
3x HP DL325 (EPYC 7402P, 64GB) - each host with 1x NVMe PM1725b, namespace split into 5 (600GB cache + 590GB capacity namespaces)

LINK - VMware vSAN + NVMe namespace magic: Split 1 SSD into 24 devices for great storage performance
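For anyone curious about the arithmetic behind that split: carving one drive into a cache namespace plus several capacity namespaces is just fixed-size partitioning of the LBA range. A rough sketch, assuming a 3.2 TB drive and 512-byte logical blocks (both assumptions, not confirmed in the post):

```python
# Sketch of splitting one NVMe SSD into vSAN cache + capacity namespaces.
# Drive size and LBA size are assumptions (3.2 TB, 512-byte blocks).

LBA_SIZE = 512                 # bytes per logical block (assumed)
DRIVE_GB = 3200                # advertised drive size, decimal GB (assumed)

def gb_to_lbas(gb: int) -> int:
    """Convert a decimal-GB namespace size to a logical-block count."""
    return gb * 1_000_000_000 // LBA_SIZE

cache_gb = 600
capacity_gb = 590
n_capacity = 4                 # 1 cache + 4 capacity = the 5 namespaces above

used_gb = cache_gb + n_capacity * capacity_gb
assert used_gb <= DRIVE_GB, "namespaces exceed drive capacity"

print("cache namespace:", gb_to_lbas(cache_gb), "LBAs")
print("each capacity namespace:", gb_to_lbas(capacity_gb), "LBAs")
print("unallocated:", DRIVE_GB - used_gb, "GB")
```

The block counts are the kind of values you would pass to `nvme create-ns --nsze/--ncap` when actually creating the namespaces.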

Since this is a 3-node cluster, my understanding is I have FTT=1 (I can only lose 1 node, but then I have no rebuild ability) - this makes it 2x mirrored data components + 1x witness?
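That reading is right. A toy model of FTT=1 placement and its capacity cost (numbers are illustrative, not the real vSAN placement algorithm):

```python
# Toy model of vSAN FTT=1 (RAID-1) on a 3-node cluster: two data
# replicas on different hosts plus a witness component on a third.
# Capacity figures are illustrative assumptions.

def place_ftt1(hosts):
    """Return (replica_hosts, witness_host) for an FTT=1 object."""
    assert len(hosts) >= 3, "FTT=1 needs 2 replica hosts + 1 witness host"
    return hosts[:2], hosts[2]

def usable_gb(raw_per_host_gb, n_hosts, ftt=1):
    """Mirroring stores (ftt + 1) copies of every byte."""
    return raw_per_host_gb * n_hosts // (ftt + 1)

replicas, witness = place_ftt1(["esxi1", "esxi2", "esxi3"])
print(replicas, witness)      # ['esxi1', 'esxi2'] esxi3
print(usable_gb(2960, 3))     # 4440 - raw capacity is roughly halved
```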

So, if I'm running a 10GbE network and I have a VM doing writes - is getting 500 MB/s good? Does this VM have to write to 2 other nodes for its mirror (FTT=1) - so the write bandwidth is split between 2 nodes - or am I completely off?

2x 500 MB/s streams = 1 GB/s (which roughly maxes out the 10GbE network).

I think if I can figure out LACP on ESXi (no luck so far) I could get to 1 GB/s writes (roughly doubled).

I'm thinking the way to scale correctly is to go to a 40GbE switch, and maybe a 4th node, to get the most out of these speeds?
 

Rand__

Well-Known Member
Mar 6, 2014
4,575
910
113
500 MB/s is a relative number for a test - how many threads and what QD did you run this with?
Or was that a VM Storage vMotion?
 

CaptainPoundSand

Rand__ said:
> 500 MB/s is a relative number for a test - how many threads and what QD did you run this with?
> Or was that a VM Storage vMotion?
I'm embarrassed to admit it was with CrystalDiskMark (lol) - I just couldn't get HCIBench to work (nor LACP on the vSAN dvSwitch).

I've done some testing with GlusterFS on 3 nodes and it worked much the same way (spread the load across the nodes) - which is limited to 1 GB/s over 10GbE from my desktop to the ESXi cluster.


desktop <--10GbE--+-- esxi1 - 500 MB/s
                  +-- esxi2 - 500 MB/s
                  +-- esxi3 - (witness)

Would it scale like this?

desktop <--20GbE--+-- esxi1 - 1000 MB/s
                  +-- esxi2 - 1000 MB/s
                  +-- esxi3 - (witness)

desktop <--40GbE--+-- esxi1 - 2000 MB/s
                  +-- esxi2 - 2000 MB/s
                  +-- esxi3 - (witness)
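Roughly, yes - if the client link is the bottleneck and every front-end write turns into two replica streams. A back-of-envelope sketch (the 0.9 protocol-efficiency factor is an assumption, not a measured value):

```python
# Write ceiling when one client stream is mirrored to two replica hosts
# over a shared uplink. 'efficiency' approximates TCP/protocol overhead
# and is an assumption.

def per_replica_mb_s(link_gbit: float, replicas: int = 2,
                     efficiency: float = 0.9) -> float:
    wire_mb_s = link_gbit * 1000 / 8      # Gbit/s -> decimal MB/s
    return wire_mb_s * efficiency / replicas

for link in (10, 20, 40):
    print(f"{link} GbE -> ~{per_replica_mb_s(link):.0f} MB/s per replica")
```

So the diagrams above are about right in shape; the real ceiling also depends on the cache drives and on whether one replica happens to be local to the VM.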
 

Rand__

Well, LACP will not join the bandwidth into a single 20GbE link, but it would enable two 10G links - which you can only utilize if you have at least 2 concurrent flows.
Switching to a 40G line might work, but in reality that's a 4x10G line aggregated at the NIC level, and I never know whether it actually combines them into 40G for a single connection (properly).
You could get 25G - that's a single 25G line ;)

I have dabbled with vSAN before, up to 6 disk groups (in 6 hosts, or 2 groups per host), and it only scales up with multiple connections - otherwise write speed is always that of a single cache disk. And even that is not using 100% of the disk's capability, due to the multi-user optimization.

Which value was it, from which version of CDM?
 

CaptainPoundSand

Rand__ said:
> Well, LACP will not join the bandwidth into a single 20GbE link, but it would enable two 10G links - which you can only utilize if you have at least 2 concurrent flows.
> Switching to a 40G line might work, but in reality that's a 4x10G line aggregated at the NIC level, and I never know whether it actually combines them into 40G for a single connection (properly).
> You could get 25G - that's a single 25G line ;)
>
> I have dabbled with vSAN before, up to 6 disk groups (in 6 hosts, or 2 groups per host), and it only scales up with multiple connections - otherwise write speed is always that of a single cache disk.
>
> Which value was it, from which version of CDM?
Didn't even think about the 4x10GbE lanes - though it does make sense. I was going to try the following:

3x Chelsio T62100-LP-CR - 40/50/100GbE NICs
1x Arista DCS-7050QX - 40GbE switch
(Err - but I put an offer on a DCS-7060CX-32S 100GbE switch)

No more Mellanox anything!!

My cache and capacity drives for testing right now are PM1725b's - specs are 3500/3100 MB/s read/write - so I'm thinking that's probably not my bottleneck.. (Yeah yeah, it's no Optane.)


The only reason I wanted to do vSAN was to stop dropping my NFS datastore on FreeNAS, which housed my VMs, every time I was doing something on it... WAF had a lot to do with this move.. Trying not to blow the cost of a small car on this project for "fun".

 

Rand__

I have a few of the T62100's but have not used them with vSAN yet, only Windows-to-FreeNAS testing. I am not sure whether they fall back to 4x10 on a 40G switch, but they might - so I'm not sure they actually offer a benefit for single connections.

You're not happy with MLX? Surprising, I rather like their stuff.

I run 900p's as cache drives and I am not getting 500 MB/s VM writes, but my goal is always 1T/QD1, so not the common use case;
however, vSAN is *not* utilizing the theoretical performance of the Optanes at all.

I am currently planning to move my VM datastore off of vSAN back to a ZFS filer for better performance ;)
I'll most likely keep vSAN as a secondary backup location though, for when I have to do work on the ZFS filer (in addition to a ZFS-to-ZFS backup).
Still in the planning phase though...
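On the 1T/QD1 point: with one outstanding I/O, throughput is purely latency-bound, which is why a networked write path caps out far below the drive's spec sheet. A hedged model (the latency figures are illustrative assumptions, not measurements):

```python
# QD1 throughput model: one I/O in flight at a time, so
# MB/s = block_size / round_trip_latency. Latencies are illustrative.

def qd1_mb_s(block_kb: float, latency_us: float) -> float:
    """Throughput of a strictly serial I/O stream (binary MB per second)."""
    return (block_kb / 1024) / (latency_us / 1_000_000)

print(f"local NVMe 4K @  20 us: {qd1_mb_s(4, 20):7.1f} MB/s")
print(f"networked  4K @ 200 us: {qd1_mb_s(4, 200):7.1f} MB/s")
```

Ten times the write latency means one tenth the QD1 throughput, regardless of how fast the cache drive itself is.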
 

CaptainPoundSand

Rand__ said:
> I have a few of the T62100's but have not used them with vSAN yet, only Windows-to-FreeNAS testing. I am not sure whether they fall back to 4x10 on a 40G switch, so not sure they actually offer a benefit for single connections.
>
> You're not happy with MLX? Surprising, I rather like their stuff.
>
> I run 900p's as cache drives and I am not getting 500 MB/s VM writes, but my goal is always 1T/QD1, so not the common use case; however, vSAN is *not* utilizing the theoretical performance of the Optanes at all.
>
> I am currently planning to move my VM datastore off of vSAN back to a ZFS filer for better performance ;)
I've been in the planning phase constantly for 2 years now.. lol - I can't seem to settle on anything.

Maybe the new MLX stuff - but getting IB to work with network managers (or making it run as Ethernet) is more than I really want to learn - sometimes plug-and-play is worth a lot.

You're saying that with Optane as cache you don't even achieve 500 MB/s - or do you get more than that?

I know vSAN won't be as fast as a dedicated FreeNAS box - but I want some redundancy, and to bring the WAF up.

Just ordered a 100GbE 32-port switch, some Chelsio cards - and some 7.68TB NVMe capacity drives. What could go wrong :D
 

Rand__

lol, what was that about not spending a car's worth on this? ;)

I mean, I can't say anything - I have 100G too (not set up yet), just missing the large NVMe drives (but no real need for large, I just need speeed ;))

And it all depends on the # of threads and QD - that's why I have been asking for it - and especially it will depend on your use case in general (# of users, # of VMs, whether you're looking for aggregated or individual performance, etc.)
 

CaptainPoundSand

Rand__ said:
> lol, what was that about not spending a car's worth on this? ;)
>
> And it all depends on the # of threads and QD - that's why I have been asking for it - especially it will depend on your use case in general (# of users, VMs, aggregated vs. individual performance, etc.)
I already have a newer car.... so... This is just to run Windows AD and Pi-hole - maybe run my Plex and use up the space on the 8TB NVMe drives I just got. LOL

This is more for fun I suppose - a proof of concept - so I know how to outdo my customers in the local data centers I do work for.. LOL. Though I have a customer running 10x 100GbE DIA - so my POC is still small potatoes.
 

Rand__

If you let me know which CDM version you run, I can create a comparison run for you - or search my older threads for older results :)
 

vangoose

Active Member
May 21, 2019
268
70
28
Canada
You make me rethink my new storage server configuration, with namespaces for ZFS.

Instead of a single 6.4TB disk, I can divide it into a few namespaces and use raidz or mirror. The HGST SN260 supports 128 namespaces.
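The usable-space math for those layouts is straightforward; a quick sketch (namespace counts and sizes are illustrative). One caveat worth stating: raidz or mirror built from namespaces of a single SSD protects against namespace-level corruption at best, not against the physical drive failing:

```python
# Usable capacity of ZFS layouts built from equal-size namespaces.
# Namespace sizes are illustrative; the parity rules are standard.

def usable_tb(n_ns: int, ns_tb: float, layout: str) -> float:
    if layout == "mirror":           # n-way mirror keeps one copy's worth
        return ns_tb
    if layout.startswith("raidz"):   # raidz1/2/3 lose 1/2/3 namespaces to parity
        parity = int(layout[-1])
        assert n_ns > parity, "need more namespaces than parity devices"
        return (n_ns - parity) * ns_tb
    raise ValueError(f"unknown layout: {layout}")

# e.g. a 6.4 TB disk carved into 8 x 0.8 TB namespaces:
print(round(usable_tb(8, 0.8, "raidz1"), 2))   # 5.6
print(usable_tb(2, 3.2, "mirror"))             # 3.2
```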
 

Rand__

Just make sure your application can utilize vdev-based scaling (i.e. multiple threads), and of course ensure the redundancy groups are on different physical devices (just for other readers - I am sure you are aware).
 

CaptainPoundSand

upload_2020-1-24_22-20-1.png
Fresh install of 6.7u3 on all 3 nodes - looks like your Optanes edge out my 4K numbers. Curious what my numbers will look like once the 100GbE switch comes in - I'll have 1x PM1725b and 7.68TB Intel NVMe drives - I want to try multiple disk groups using NVMe namespaces, with the PM1725b providing a 600GB cache namespace for each disk group...
 

Rand__

With a 1G test size you might be hitting the cache pretty hard.

I am not sure 100G will help much here; most of the results are not even close to the 10G limit, so even if that is raised to 25G+ there is little gain to be expected. Performance improvements on the network side would most likely come from lower latency or RDMA.
RDMA is the thing I am looking into currently, but it seems ESXi does not run its native traffic via RDMA-enabled NICs yet (https://forums.servethehome.com/index.php?threads/esxi-rdma-enablement.26983/), but I hope that will come in the future.

Running multiple disk groups should help in theory (if you set the striping accordingly), and with 128 namespaces you can certainly go beyond the 6 physical devices I used - but you need to make sure you retain properly distributed redundancy, else all failure domains might end up on a single device, and when that goes you're done for.
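That redundancy caveat can be expressed as a simple check: map every component's namespace back to its physical device and require that an object's replicas land on distinct devices. A sketch with hypothetical device and namespace names:

```python
# Sketch: verify that mirrored components never share a physical SSD,
# even when the hypervisor sees many "disks" that are really namespaces
# of one device. Device/namespace names are hypothetical.

ns_to_device = {
    "nvme0n1": "ssd-A", "nvme0n2": "ssd-A",   # two namespaces, one SSD
    "nvme1n1": "ssd-B",
    "nvme2n1": "ssd-C",
}

def replicas_are_safe(component_namespaces) -> bool:
    """True if all replicas of an object sit on distinct physical devices."""
    devices = [ns_to_device[ns] for ns in component_namespaces]
    return len(set(devices)) == len(devices)

print(replicas_are_safe(["nvme0n1", "nvme1n1"]))  # True: different SSDs
print(replicas_are_safe(["nvme0n1", "nvme0n2"]))  # False: same SSD
```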
 

Rand__

Not sure you're still interested, but I actually managed to make vSAN work faster for a change...

upload_2020-3-29_23-6-36.png

Same 4-node cluster as before; this time not local VM performance but a vMotion to and from my new FreeNAS box (NFS).
So vSAN *can* do it - it just usually doesn't want to ;)
 
Reactions: Patrick (Like)

Rand__

Another data point for the interested reader -
I managed to get 2 more P4800X's for an acceptable price, so I moved the 4-node cluster to these now (with P3600/P4510 2TB's as capacity drives).

upload_2020-4-9_18-5-25.png
Note this is a 1G test size, accidentally - will redo with 8G.

upload_2020-4-9_18-8-14.png

And for comparison, the new ZFS filer (non-redundant, so an unfair comparison) via NFS:
upload_2020-4-9_18-14-53.png
 

Dark

Active Member
Mar 9, 2019
155
77
28
Not happy to see this thread. Just deployed a relatively large vSAN cluster and we're seeing abysmal write speeds.
7 hosts, each with 14x PM883 3.84TB capacity disks and 2x P4800X 750GB cache drives.

CrystalDiskMark shows under 1 GB/s on the SEQ4K test.

Screenshot to follow.

There must be something wrong here. I removed the disk groups and used only PM883 disks for both capacity and flash, and saw nearly identical numbers.

Tests were done on a Windows 10 VM - the only VM on the vSAN. Used ATTO and CDM.