iSCSI + MPIO + ESXi6


BSDguy

Member
All these ideas and thoughts are really interesting - so many thanks again! Geez, you guys are awesome and so helpful.

Right, I'm going to ignore jumbo frames for now!

*I* can't answer that; my home lab is in a full rack in the basement and very much full of fans. Someone else here could answer that better. I might suggest getting another switch, even if just a loaner, to confirm the issue before you worry about anything else. Get one with fans and noise and confirm it all works fine. THEN see what you can get to minimize the noise. I'm not sure how well they work, but something like the Quanta LB4M or IBM Blade G8000s are around $100 or so. They'd be a good place to try, but see what work or others in a local user group might have that you could try for a weekend. It would validate that your configs are correct and that it's the hardware.
Unfortunately I live in an apartment, so my gear simply has to be quiet/practical/fanless (if possible). It all lives in my lounge! I could ask our comms guys at work to borrow a switch but the chances of them letting me take it home are slim!

What "problems"? First, don't get hung up on Atto throughput benchmarks. If you want IOPS, then focus on that. You don't CARE (or shouldn't) how long it takes a file to copy inside the VM. You care how long it takes to clone the VM so you can try something, how long svMotion takes, how long snapshot commits take, etc. That's all generally going to be IOPS based. So don't go chasing mythical numbers if they don't help your end goal. Which, for most home labs is "how much productive time can I get out of this in an evening" or "how many times can I try this over and over again in a night, to learn the most, end to end".
It's not just ATTO benchmarks; storage vMotion is sloooooooow. I'm going to be investing hours of my time in this lab every week, so I don't want to be twiddling my thumbs while something vMotions, clones or whatever. The file copy isn't that serious, but doing VMware "stuff" is. I've had slow VMware storage at home before and it drove me nuts with all the time you waste; that's why I want to get it right this time ;-)

I've only ever used FC on NetApps, and have no idea what home built SAN solutions like StarWind are available for it. There are going to be a lot fewer options, especially in the Windows world, as most people just go iSCSI. More to that point, if you're going to get FC cards and such, why not just go point to point with 10GbE and still stay iSCSI, but switchless? In either case you're likely going to want 4 dual port cards - 1 for each host to have 2 ports, and 2 for the "SAN" so that it can actually GIVE 2x10GbE to each host. It will set you back a little more than FC, but now you're future ready to drop in 10GbE switches and expand. FC experience is "good" but it's a dying art.

Actually in light of the above comment, and how your crossover solution worked - you could just pick up another NC364T for the "SAN" and just do that. That's certainly another way to do it. It starts falling down (with any solution) as soon as you want to start having a 3rd node or have a second SAN node for HA, etc.

That's an interesting thought! I never thought to do crossover cables for BOTH hosts. Can that work?? ie:


SAN - 2 x quad NICs - let's call them Quad-A and Quad-B
Server1 - quad NIC run crossover cables to Quad-A
Server2 - quad NIC run crossover cables to Quad-B

Will this setup work ok with iSCSI/MPIO/Round Robin and ESXi? Also will HA/DRS work? Will the shared storage be ok with this kind of setup? I won't see duplicate drives/datastores etc?

I'm only concerned with getting two nodes up and running right now. ;-)

My two new Supermicro servers have dual Intel 10Gb NICs onboard, but the prices for a dual port 10Gb NIC for my SAN server make me cringe when I look online...
 

BSDguy

Member
I can probably write up a short guide on building a linux-based FC target - nothing too fancy, but a short overview. It can't be that hard to do.

One other option you have, seeing the results you got when you removed the switch, would be to just stick with 1Gb ethernet and a lot of crossover cables. Stick a second quad-port NIC into your iSCSI target box if you want 4gbps to each node, or just do 2gbps to each node with what you have.
That's very kind of you! I know nothing about Linux IO but I am starting to lean towards using crossover cables to bypass the switch to resolve my performance problems, so I may not need FC. Let's see what happens...

I like the idea of a second quad NIC in the iSCSI target server. Hmmm this is getting interesting....8 x 1Gb NICs = 1GB/s of bandwidth in total! Another quad NIC is cheap too. I think 500MB/s per host is plenty for my needs.
 

NetWise

Active Member
Too many layers to quote... :)

First, don't underestimate what the comms guy will agree to. It's not a gift and it's got an end date - you need it literally for a night to vet that it works. Your peers/management should be excited to see you are taking an interest in self-learning, and hopefully contributing to it. You're asking to borrow a switch, not for a $4,000 week-long course, after all :)

svMotion is always slow. vMotion and svMotion are designed to be the lowest priority tasks at all costs, so as to avoid causing user-facing performance issues. They should never impact actual production in any way. So even with really fast stuff, it's often going to be slower than you think. You shouldn't be moving storage around willy-nilly, and there's a reason Storage DRS has blackout time periods and such to only allow svMotion in certain time windows.

I wholeheartedly agree about being driven nuts by slow storage while learning. Depending on what you're learning at the time, I'm going to suggest avoiding the SAN entirely. A SAN is great for learning vSphere and clusters and storage. But if what you're using the lab for is to learn about... upgrading from Windows Server 2008R2 to 2016 and swapping out DCs, and upgrading your SQL box or your Root Enterprise CA... then that's all "VM" level stuff. Put it on a local SSD in one box and go as fast as you can. Use PowerCLI to move it to the SAN at 2AM or something, or use something like Veeam to replicate it to the SAN or the other node. Shared storage is great for clustering and HA - but it's seldom the "faster" answer. Its value comes in being shared and centralized and consolidating the cost.

As for whether you can do 2 nodes the way you describe with HA/DRS/etc and will it work? Yes. It would work just as well if it was a dual port SAS enclosure with SAS cables and no networking at all. All that really matters is that both hosts see the SAME LUN with the same presentation, and that the SAN side agrees to allow multiple host initiators to access it. As long as that's done, you can do point to point all day long. The only real reason one puts FC or Ethernet switches in between is that after 2-3 hosts you start needing a much bigger SAN to accommodate all the IO coming in/out of it - you can't keep adding another 2-4 port NIC/HBA forever, especially as many low-mid range units often only have 1-2 expansion slots. The other thing to consider is that in 'theory', this is faster. In the other scenario, you'd likely have 2-N hosts with 4x1GbE each connecting to a SAN with 4x1GbE - and sooner or later, if you spin up something on hosts 2 through X at the same time, the 4x1GbE (or 2x 8Gbit FC, etc) in your SAN is going to be your bottleneck. Again though, we don't often go to SANs for the fastest solution possible - we go so we can do it with consolidated disks, an increased # of spindles, ease of maintenance, etc.
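For reference, the ESXi side of the MPIO piece is the same whether you're point to point or going through a switch. A rough sketch of the CLI version - the adapter, vmkernel and device names here are just placeholders, so check yours with esxcli iscsi adapter list and esxcli storage core device list:

# Bind both iSCSI vmkernel ports to the software iSCSI adapter (vmhba33, vmk1/vmk2 are example names)
esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk1
esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk2

# Set the path selection policy on the shared LUN to Round Robin (device ID is a placeholder)
esxcli storage nmp device set --device=naa.xxxxxxxxxxxxxxxx --psp=VMW_PSP_RR

# Optional lab tweak: rotate paths every IO instead of the default 1000
esxcli storage nmp psp roundrobin deviceconfig set --device=naa.xxxxxxxxxxxxxxxx --type=iops --iops=1

Run the same on both hosts, and each should then show multiple active (I/O) paths to the same naa ID.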

Your next post about the number of connections suggests you keep getting hung up on "total bandwidth". I'd really focus on the quality of your 4KB IO and IOPS, and the latency you get at that level. You may even want to run your Atto benchmarks from 512 bytes to 64KB only, to limit how shiny that bottom row looks, as you're not going to hit that. Focus very much on the 4KB line. 4KB * 5000 IOPS = 20,000KB/s, or about 20MB/s. You don't need a lot of throughput/bandwidth to push massive IOPS. I regularly get 5000-10000 IOPS on a very low end Nimble array with just 2x1GbE. Make sure you're chasing the right number!
 

whitey

Moderator
Stay away from FC unless ya wanna become a serious FC SAN junkie - dying art indeed. SMH, 10GbE...10GbE...10GbE. Just scored a Juniper EX3300 PoE+ 24 port for $300 today (it also had two 10G LR optics); I already have the non-PoE model driving high rates of throughput and my vSphere env is solid, so it may be an option for you. I can't quite hunt down the buffer size for the EX3300 but it must be solid. I HAVE seen 2960/3750's fall on their face as well LOL

EDIT: This may be helpful for switch comparisons as well, and it's been vetted by folks here.

packet buffers
 

NetWise

Active Member
Found that packet buffers link earlier, great resource. The problem can be in the way vendors present the info. The SG300-28, for example, lists an 8Mb buffer per switch - and it's not clear if that's 8 megabit, or MB written wrong, or...

But you can very, very clearly see why, for example, a Cisco 2960G is NOT a good switch! (For iSCSI - it's great as an access switch)

What's frustrating is that the difference between a bad and a good iSCSI switch (based on this criteria) is like $3 worth of RAM :( If 4-8MB is enough, how much could it possibly cost to go from 1MB to 4MB? On a switch that might have cost $4K out the door originally, that's disappointing.
 

whitey

Moderator
I got a ProCurve 2910al switch that has a 6MB buffer that I'd let go dirt cheap... just saying. PM me if interested; it's been sitting in a box for a good year now and I only used it for 3 months when I received it NIB. Needs a good home and may do the trick (AKA it's an access layer switch, but it worked for me to drive a substantial vSphere env w/ NFS/iSCSI at 10G speeds). A lot of the higher-end switches have 8-12MB buffers, it seems.

PM me if interested.
 

NetWise

Active Member
The 2910 will have fans though, I think, yeah? Might rule it out for him.

However, there are a few companies on eBay that have figured out what fans work and sell "quiet" versions - 1x Quiet Replacement Fan for Nidec B34955 on HP Procurve 2900-24G P/N J9049A for example. Might be worth a shot. I bought some for my Dell PowerConnect 6248's that I never installed, so I don't know if they work. Might be worth me revisiting that.

Don't, however, assume that the 1800/1900 series without fans will do iSCSI well. I have some from client upgrades that... fell on their face. (No experience with the 2800/2900 other than knowing from 3rd parties that they're considerably better...)
 

TuxDude

Well-Known Member
I'd almost take you up on that switch offer whitey, except that I need more than 20 1G ports on it. Something with 40ish 1G ports and 2 or 4 10G would be perfect (hmm... maybe it's time for me to get on the LB4M bandwagon). As it is, I'm still running everything on the HP 2824 that I bought from you a little over a year ago, which I've run out of ports on.
 

whitey

Moderator
Funny, that ole' 2824 is a BEAST, right? She takes a lickin' and keeps on kickin'! Gotta LUV quality-made switches. Yeah, the 2910al has 24 ports plus 4 10G ports IF you add on the J9008A modules, which everyone seems to wanna try to highway-rob ya on over the last 6 months or so, I've noticed.
 

BSDguy

Member
Bit of an update. I ordered another quad gigabit NIC last night which should arrive next week. I've already connected my one ESXi host to the SAN using a direct connection with the one quad NIC I have (ie: bypass switch).

I thought I'd give jumbo frames a try - what have I got to lose on a direct connection? So I set the MTU on the iSCSI virtual switch, the vmkernel adapters and the quad NICs in the SAN box to 9000. I can ping between the ESXi host and the SAN with an 8972 packet size, so I assume it's all working? The VMs are still running and I can still run benchmarks.
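For anyone curious, the esxcli equivalent of those MTU changes is roughly the below - vSwitch1, vmk1 and vmk2 are just example names, so substitute your own:

# Bump the iSCSI vSwitch and its vmkernel ports to MTU 9000
esxcli network vswitch standard set --vswitch-name=vSwitch1 --mtu=9000
esxcli network ip interface set --interface-name=vmk1 --mtu=9000
esxcli network ip interface set --interface-name=vmk2 --mtu=9000

# Double-check the MTU actually took
esxcli network ip interface list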

Your next post about the number of connections suggests you keep getting hung up on "total bandwidth". I'd really focus on the quality of your 4KB IO and IOPS, and the latency you get at that level. You may even want to run your Atto benchmarks from 512 bytes to 64KB only, to limit how shiny that bottom row looks, as you're not going to hit that. Focus very much on the 4KB line. 4KB * 5000 IOPS = 20,000KB/s, or about 20MB/s. You don't need a lot of throughput/bandwidth to push massive IOPS. I regularly get 5000-10000 IOPS on a very low end Nimble array with just 2x1GbE. Make sure you're chasing the right number!
I think I need to understand IOPS better. I've read a bit about it but still find it a bit confusing! With jumbo frames enabled for the iSCSI network (the one with the ESXi host connected directly to the SAN) I ran ATTO from 512B to 64KB only (like you suggested) and here are the results:

ATTO ran on boot drive (C:) which is on one datastore:

upload_2016-11-3_19-10-42.png

ATTO ran on data drive (T:) which is on another datastore:

upload_2016-11-3_19-11-56.png


Are these decent numbers? Also, the StarWind console shows the IOPS per device, but it jumps around quite a bit; when I ran the above ATTO benchmark I saw it hit 3,000 IOPS (and briefly 10,000). Is this good?

Stay away from FC unless ya wanna become a serious FC SAN junkie - dying art indeed. SMH, 10GbE...10GbE...10GbE. Just scored a Juniper EX3300 PoE+ 24 port for $300 today (it also had two 10G LR optics); I already have the non-PoE model driving high rates of throughput and my vSphere env is solid, so it may be an option for you. I can't quite hunt down the buffer size for the EX3300 but it must be solid. I HAVE seen 2960/3750's fall on their face as well LOL
Assuming all goes well with my direct connect option using 1Gb NICs I'll be giving FC a miss for now. Very tempted with 10Gb but my SAN server doesn't have a 10Gb NIC and the Intel ones are pricey.

Found that packet buffers link earlier, great resource. The problem can be in the way vendors present the info. The SG300-28, for example, lists an 8Mb buffer per switch - and it's not clear if that's 8 megabit, or MB written wrong, or...

But you can very, very clearly see why, for example, a Cisco 2960G is NOT a good switch! (For iSCSI - it's great as an access switch)

What's frustrating is that the difference between a bad and a good iSCSI switch (based on this criteria) is like $3 worth of RAM :( If 4-8MB is enough, how much could it possibly cost to go from 1MB to 4MB? On a switch that might have cost $4K out the door originally, that's disappointing.
I sure learnt a lot from this. I had NO idea that packet/port buffers were so important for iSCSI traffic! It got me wondering: does VMware's VSAN performance suffer if you use the wrong switch? I was going to purchase a 10Gb Netgear 8 port switch for VSAN traffic only (and maybe iSCSI!) so I wonder how well that switch handles storage traffic.
 


Marsh

Moderator
SAN - 2 x quad NICs - let's call them Quad-A and Quad-B
Server1 - quad NIC run crossover cables to Quad-A
Server2 - quad NIC run crossover cables to Quad-B
Forgive me if I'm not understanding, and I haven't read the entire thread.
What I did was install a cheap $25 dual port Mellanox 10G SFP+ card in the "SAN" host.
Two ESXi hosts, each with a dual port 10G card; connect each ESXi host to the "SAN" via a direct point-to-point SFP+ cable. There is also a direct SFP+ connection between the two ESXi hosts for Live Migration.
 

NetWise

Active Member
Forgive me if I'm not understanding, and I haven't read the entire thread.
What I did was install a cheap $25 dual port Mellanox 10G SFP+ card in the "SAN" host.
Two ESXi hosts, each with a dual port 10G card; connect each ESXi host to the "SAN" via a direct point-to-point SFP+ cable. There is also a direct SFP+ connection between the two ESXi hosts for Live Migration.
That totally works. We'd discussed going switchless 1GbE, 10GbE, and 4/8Gbit FC, direct to SAN, point to point. Just a different variant.

I _think_ the OP has opted for the 1GbE solution as that allows him to expand into 1GbE switching and "do it right" (eh, right is the wrong word, maybe... more like business production), rather than going SFP+/IB/etc. Your way is just as effective, and probably ultimately faster, albeit without the redundancy to the SAN and multiple links - the former being a "business downtime protection" issue which likely doesn't apply here, and the latter being a "learning/labbing MPIO, Round Robin, etc" exercise which has value beyond the actual throughput.
 

NetWise

Active Member
Bit of an update. I ordered another quad gigabit NIC last night which should arrive next week. I've already connected my one ESXi host to the SAN using a direct connection with the one quad NIC I have (ie: bypass switch).

I thought I'd give jumbo frames a try - what have I got to lose on a direct connection? So I set the MTU on the iSCSI virtual switch, the vmkernel adapters and the quad NICs in the SAN box to 9000. I can ping between the ESXi host and the SAN with an 8972 packet size, so I assume it's all working? The VMs are still running and I can still run benchmarks.
If you can do a ping, specifying a jumbo frame packet size, and you don't get fragmentation (ping REMOTE_HOSTNAME -f -l 8972 -- don't forget your -f), then you should be good, yes.
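You can also sanity check it from the ESXi side with vmkping - something like the below, where the IP and vmk name are just examples:

# -d sets don't-fragment, -s 8972 = 9000 minus 28 bytes of IP/ICMP headers
vmkping -d -s 8972 -I vmk1 192.168.10.20

If the 8972-byte ping fails while a normal-size one works, something in the path isn't actually at MTU 9000.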


I think I need to understand IOPS better. I've read a bit about it but still find it a bit confusing!
A bus full of 50 passengers doing 55mph has a "speed" of 55mph, but doesn't accelerate, move, behave, or handle well. This is akin to your throughput benchmark: the most the pipe will carry, with a full load.
A freeway full of 50 crotch-rocket sport bikes doing 55mph _also_ has a "speed" of 55mph. But each one is individually faster, more nimble, can do what it needs, accelerates in an instant, etc. That's more akin to your IOPS - which is the number of operations you can do in a second, vs how much data the operations can move. If you're dealing with 4KB blocks on a SAN vs 1MB Word documents, this makes a giant difference - that 1MB Word document may take 256x 4KB IOs to read from disk, but it's only one file and only 1MB.

IOPS vs throughput is the reason guys will put two SATA disks into a QNAP, put 3 hosts on them with 40 VMs, try to run a business, and wonder why things are "slow", because "I should be getting 125MB/sec" - and they should. But your average 7.2K SATA disk is going to do about 80-100 IOPS, and THAT is the important number unless it's just "bulk storage".
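To put some rough back-of-the-envelope numbers on that: 100 IOPS of 4KB random IO is only about 400KB/s, and hitting that 125MB/sec figure with 4KB IOs would take somewhere around 31,000 IOPS - which is why the QNAP feels fine copying big files and falls over the moment a few dozen VMs start doing small random IO.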

With jumbo frames enabled for the iSCSI network (the one with the ESXi host connected directly to the SAN) I ran ATTO from 512B to 64KB only (like you suggested) and here are the results:

ATTO ran on boot drive (C:) which is on one datastore:

View attachment 3749

ATTO ran on data drive (T:) which is on another datastore:

View attachment 3751


Are these decent numbers? Also, the StarWind console shows the IOPS per device, but it jumps around quite a bit; when I ran the above ATTO benchmark I saw it hit 3,000 IOPS (and briefly 10,000). Is this good?
It looks consistent and healthy. It's got better 4KB speeds than many setups I've seen in the past, so it looks decent. 3000-10000 IOPS is pretty healthy, and the more important number. Remember also that virtualization in general, and VMware more specifically, is geared to having many hosts running many VMs all getting similar performance. So it's less about one amazingly fast VM than it is knowing your data center is going to have a consistent user experience, be it with 1, 10, or 100 VMs. It's tough to get a good window on the performance from just one VM. When you look at performance whitepapers, such as from Dell where they test their SANs for performance numbers and such, you'll note they often show that there were something like 4x hosts of X spec and 20x VMs of Y spec on each, ALL running the benchmark at the same time, with BenchmarkSpec, etc.

Assuming all goes well with my direct connect option using 1Gb NICs I'll be giving FC a miss for now. Very tempted with 10Gb but my SAN server doesn't have a 10Gb NIC and the Intel ones are pricey.
You can use 2x 1GbE until your other card comes in, you should be just fine for performance. Then all you have to do is cable up the extra pairs later.
As another guy posted, you could use Mellanox or other cards, and not have to use Intel ones. 10GbE can be done on the cheap, especially if point to point with DAC cabling.

I sure learnt a lot from this. I had NO idea that packet/port buffers were so important for iSCSI traffic! It got me wondering: does VMware's VSAN performance suffer if you use the wrong switch? I was going to purchase a 10Gb Netgear 8 port switch for VSAN traffic only (and maybe iSCSI!) so I wonder how well that switch handles storage traffic.
We all learn it sooner or later - better to learn in a lab than in production, with the boss breathing down your neck :)

VSAN is a bit of a different beast. There, it wants to copy the data from the local host where the IO is, but it can do so slightly buffered. It's capable of running on 1GbE as well, though that's not recommended for a business with high loads. Home lab, no problem.
 

Marsh

Moderator
Once the bare metal ESXi host infrastructure is sorted, then I play with virtual networking as well as virtual ESXi and a vSAN cluster.
It is much simpler to add virtual network cards and virtual switches to one's heart's content.
Later you could expand to a Proxmox HA cluster or a Windows HA cluster.
 

BSDguy

Member
Forgive me if I'm not understanding, and I haven't read the entire thread.
What I did was install a cheap $25 dual port Mellanox 10G SFP+ card in the "SAN" host.
Two ESXi hosts, each with a dual port 10G card; connect each ESXi host to the "SAN" via a direct point-to-point SFP+ cable. There is also a direct SFP+ connection between the two ESXi hosts for Live Migration.
That's a really good idea! This whole time I have been thinking of RJ45 for 10Gb (for some reason). I'm not familiar with Mellanox - how do they compare to Intel NICs? I had a quick look on eBay and couldn't find a dual port Mellanox 10G SFP+ for $25, but they seem to be around the $60 or so mark? Are QLogic 10Gb NICs worth looking into as well? This is an option I will definitely look into once I have given the quad 1Gb NICs a spin with "direct connect". Thanks for the suggestion.

I _think_ the OP has opted for the 1GbE solution as that allows him to expand into 1GbE switching and "do it right" (eh, right is the wrong word, maybe... more like business production), rather than going SFP+/IB/etc. Your way is just as effective, and probably ultimately faster, albeit without the redundancy to the SAN and multiple links - the former being a "business downtime protection" issue which likely doesn't apply here, and the latter being a "learning/labbing MPIO, Round Robin, etc" exercise which has value beyond the actual throughput.
Heh, yes. I like the idea of Round Robin and the redundancy and performance it brings. It's quite useful in a lab when you need to move cables around: you don't need to power anything down, just disconnect/reconnect one cable (or up to 3 on a quad NIC) at a time and everything keeps working...great!

I am keen on 10Gb and direct connect now that SFP+ has been brought to my attention.

A bus full of 50 passengers doing 55mph has a "speed" of 55mph, but doesn't accelerate, move, behave, or handle well. This is akin to your throughput benchmark: the most the pipe will carry, with a full load.
A freeway full of 50 crotch-rocket sport bikes doing 55mph _also_ has a "speed" of 55mph. But each one is individually faster, more nimble, can do what it needs, accelerates in an instant, etc. That's more akin to your IOPS - which is the number of operations you can do in a second, vs how much data the operations can move. If you're dealing with 4KB blocks on a SAN vs 1MB Word documents, this makes a giant difference - that 1MB Word document may take 256x 4KB IOs to read from disk, but it's only one file and only 1MB.

IOPS vs throughput is the reason guys will put two SATA disks into a QNAP, put 3 hosts on them with 40 VMs, try to run a business, and wonder why things are "slow", because "I should be getting 125MB/sec" - and they should. But your average 7.2K SATA disk is going to do about 80-100 IOPS, and THAT is the important number unless it's just "bulk storage".
I appreciate the explanation, that's very helpful. Cringing at your QNAP example...

It looks consistent and healthy. It's got better 4KB speeds than many setups I've seen in the past, so it looks decent. 3000-10000 IOPS is pretty healthy, and the more important number. Remember also that virtualization in general, and VMware more specifically, is geared to having many hosts running many VMs all getting similar performance. So it's less about one amazingly fast VM than it is knowing your data center is going to have a consistent user experience, be it with 1, 10, or 100 VMs. It's tough to get a good window on the performance from just one VM. When you look at performance whitepapers, such as from Dell where they test their SANs for performance numbers and such, you'll note they often show that there were something like 4x hosts of X spec and 20x VMs of Y spec on each, ALL running the benchmark at the same time, with BenchmarkSpec, etc.
Thanks for the feedback. Yeah, I need to change my way of thinking about storage, speeds, IOPS and virtualisation in general. I'm getting there with the home lab I am currently building. I've learnt loads so far and I haven't even gotten started on the VMware side yet (well, not much). I promised myself I would spend as long as I needed putting a solid foundation in place (ie: VLANs, firewall rules, cabling, labelling everything etc) before moving onto the "juicy" stuff like clustering/HA/DRS. It's taken me a month just to rack the kit, label it and get the storage built so far.

If I go the 10Gb route, let's say I put a dual port 10Gb card in the SAN but a single port 10Gb NIC in an ESXi host, would I see quite an improvement over the quad 1Gb NICs I am currently using? I can only imagine what it's like with dual 10Gb ports in each server with Round Robin/MPIO!!

Many thanks again for all your valuable and excellent replies/posts.
 

Marsh

Moderator
If you want cheap and fast IOPS, look for a $125 FusionIO 640GB PCIe SSD. An Ubuntu VM boots in 3-5 seconds.
HP 600478-001 640GB Fusion io Drive PCIe Flash MLC SSD Accelerator 600282-B21 | eBay

Spend $125 and prepare to have your mind blown by how many IOPS you get for this cheap.

CrystalDiskMark results (Crystal Dew World):
-----------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

Sequential Read (Q= 32,T= 1) : 833.245 MB/s
Sequential Write (Q= 32,T= 1) : 610.949 MB/s
Random Read 4KiB (Q= 32,T= 1) : 254.429 MB/s [ 62116.5 IOPS]
Random Write 4KiB (Q= 32,T= 1) : 597.167 MB/s [145792.7 IOPS]
Sequential Read (T= 1) : 743.424 MB/s
Sequential Write (T= 1) : 604.149 MB/s
Random Read 4KiB (Q= 1,T= 1) : 68.431 MB/s [ 16706.8 IOPS]
Random Write 4KiB (Q= 1,T= 1) : 179.677 MB/s [ 43866.5 IOPS]

Test : 4096 MiB [E: 0.0% (0.2/595.9 GiB)] (x5) [Interval=5 sec]
Date : 2016/04/28 13:20:44
OS : Windows Server 2012 R2 Datacenter (Full installation) [6.3 Build 9600] (x64)
 

TuxDude

Well-Known Member
I think I need to understand IOPS better. I've read a bit about it but still find it a bit confusing!
Let's put it another way...

IOPS means Input/Output Operations Per Second, though modern storage can really do more than just reads and writes (SCSI reservations, advanced commands like those used by VAAI, TRIM/UNMAP operations, etc.) and so IOPS can also be thought of as just the number of commands per second that the target is doing. Every one of these commands takes time to complete, and sometimes everything has to stop and wait until a command completes (sync vs async - a sync write will wait/block until the storage confirms the data was successfully written). There is also no pre-determined fixed size for a single IO - every app and every OS has its own preferences in that regard. 4KB IOs are quite common, as many filesystems default to 4KB blocks - but then usually somewhere in the OS's storage path adjacent IOs will be combined into a single larger one before being sent to the disk. If you add up the size of all of the IOs done in a given amount of time, and divide by the amount of time, you get bandwidth.

Now for spinning disks, IOPS are everything. With all the moving parts in there a lot of things limit how many commands you can do, and except in the case of purely sequential IO the disk spends most of its time waiting for the heads to move to the right track, or waiting for the platters to rotate until the right sector is under the heads. Doing a bunch of math on the physical characteristics of various drives gets you to the standard numbers of around 80 IOPS for a 7K RPM disk, increasing to around 200 IOPS for a 15K RPM disk. So if you're doing 4K reads against a standard desktop drive and it isn't sequential (either you have a fragmented drive, or you are doing multiple things at the same time, or many other reasons), you can plan for a bit over 300KB/s of bandwidth. Yes, spinning disks suck for random IO, hence we have SSDs.

For SSDs, IOPS have a far smaller impact. With most drives having a bit of DRAM cache in them, and having no penalty regardless of whether the IO is sequential or not, they typically end up limited by how fast the SSD's controller can write to NAND (which can be impacted by garbage collection and other things, but let's not get too complicated). For the most part, you can just not worry about IOPS on SSDs unless you are doing a DAMN LOT of them and start hitting the limits of how fast your HBA or SSD's controller can process commands.

And lastly - from the network point of view, IOPS are mostly irrelevant. None of the ethernet gear knows anything about SCSI commands or anything else; it just sees packets with source/destination addresses and doesn't look inside. And network packets are almost always far smaller than storage IOs, so there's probably not even much correlation between them. Standard ethernet packets can only hold 1.5KB, so even 4K IOs have to be fragmented into at least 3 packets each. Jumbo frames will get you up to 9KB in a single packet, which can mean a lot less overhead (both in the packet-processing itself, as well as in being able to get IOs up to 8KB into a single packet), but if you're doing 1MB IOs they'll still have to be split into a damn lot of little packets.
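As a rough example: a 4KB IO over standard 1500-byte frames ends up as 3 packets (roughly 1460 bytes of payload each once the headers are taken out), but fits in a single jumbo frame; a 1MB IO works out to roughly 720 standard frames versus around 120 jumbo frames.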


And all of that is why benchmarks that just measure bandwidth are mostly useless. Using a tool like iometer or fio will give you full control to test whatever type of workload you want. Want to see some big bandwidth numbers - configure them to do 1MB sequential reads, with a queue of 32 or more, and for bonus points restrict the size of the area being tested to something that will fit in cache - with such a config I could make a single 7K SATA drive saturate its 6Gbps SATA3 link, and a 7K SAS3 drive would saturate a 10GbE iSCSI connection. Of course that's with a totally unrealistic workload. If you want to know how a VMware farm will likely perform, set up your test more along the lines of 4K or 8K IOs, a queue depth of 16 or 32, 100% random and 60% reads. You won't see much bandwidth being moved around anymore, but you will get an idea of how your storage will actually perform with a more realistic type of workload. And in those results, the bandwidth doesn't even matter, latency does. If the latency in that test is high, then you need to reduce the queue depth (and accordingly, know that if you start doing too much at once things will slow down). On the other hand, if latency is still very good at QD32 you can start increasing the load - bigger queues and/or more VMs running the benchmark at the same time - and then you know that even if you suddenly throw a huge extra workload at your storage, the existing VMs are likely to not notice any difference.
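If you want to try that with fio, a starting point might look something like the below, run inside a Linux VM sitting on the datastore you want to test (the file path, size and runtime are just examples, and on Windows you'd swap libaio for windowsaio):

# 4K random, 60% read / 40% write, QD32 - watch the latency numbers, not the MB/s
fio --name=vmlab --filename=/mnt/test/fio.dat --size=10G \
    --ioengine=libaio --direct=1 --rw=randrw --rwmixread=60 \
    --bs=4k --iodepth=32 --runtime=120 --time_based --group_reporting

Then scale it up - bigger queues, more VMs running it at once - and watch where the latency starts to climb.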
 

BSDguy

Member
Let's put it another way...
I just saw your great explanation of IOPS, thank you so much! I'll need to think about this and re-read it a few times for it to sink in!