FastPath is very much misunderstood.
1. When enabled, the controller raises its queue depth to something like 975.
2. In an ESXi VM environment, the LSI Parallel and BusLogic Parallel virtual adapters run at QD 1, LSI SAS at 16 (or 32), and PVSCSI at 32 (but tunable up to 254).
3. You must also tweak the scheduler quantum and the outstanding-request depth for ingress, and use SIOC if permitted (see the sketch after this list). The fair-sharing algorithm on ESXi is completely stupid - for instance, a VM with a latency-sensitive (Java) timer will tick at 1042/sec while a regular VM runs at 84/sec, and the sharing algorithm was based on world changes (similar to ticks). The scheduler also penalizes you for seeks that are more than 2000 sectors apart, on the assumption that you are using hard drives, which you are not. The maximum for that setting is something like 2 million, but the linear distance doesn't really count for much.
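A minimal sketch of those host-side knobs, assuming ESXi 5.5 or later (where the outstanding-request depth became a per-device setting; on older builds it is the global Disk.SchedNumReqOutstanding advanced option). The device ID naa.xxx is a placeholder - verify defaults and maximums on your own build before changing anything:

```
# Scheduler quantum: consecutive requests issued per world before switching
# (default 8, max 64)
esxcli system settings advanced set -o /Disk/SchedQuantum -i 64

# Seek-locality window in sectors (default 2000, max 2000000); on SSDs,
# raising it stops random I/O from being treated as an expensive long seek
esxcli system settings advanced set -o /Disk/SectorMaxDiff -i 2000000

# Per-device outstanding requests (DSNRO); capped by the adapter queue depth
esxcli storage core device set -d naa.xxx -O 64
```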
Just by tweaking the driver to PVSCSI at QD 64 with a disk quantum of 64, you can see immediate gains in benchmarks that use QD 64 as part of their testing, which results in heavy queue-depth performance. Without this you'll never get to QD 64, due to inherent limits built into the OS.
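On the guest side, VMware documents raising the PVSCSI queue depth through a registry value on Windows (on Linux it's the vmw_pvscsi module parameters cmd_per_lun and ring_pages). A sketch for a Windows guest - MaxQueueDepth=64 matches the QD 64 above, 254 is the documented ceiling, and a reboot is required:

```
REM Grow the PVSCSI request ring and the per-device queue depth
REG ADD HKLM\SYSTEM\CurrentControlSet\services\pvscsi\Parameters\Device /v DriverParameter /t REG_SZ /d "RequestRingPages=32,MaxQueueDepth=64"
```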
It does make you wonder whether a bare-metal OS needs the same tweaking to perform. Buffering (easy to remove, or to tune for performance) and system caching all play into the mix here.
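For comparison, a minimal sketch of the equivalent bare-metal knobs on Linux, assuming /dev/sda is the SSD in question (the elevator name varies by kernel - "noop" on older kernels, "none" on blk-mq):

```
# Per-device queue depth at the SCSI layer
echo 64 > /sys/block/sda/device/queue_depth

# Use a minimal elevator so the kernel stops reordering for seek locality
echo noop > /sys/block/sda/queue/scheduler
```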
Most people will find they cannot sustain a high queue depth for any period of time, except during massive ETL jobs.
Something I've been wondering is whether defragmentation (since TRIM is not an option) would help. Given that a large database could have 6 million extents if auto-grown (log files too), there is a natural order of extra work to manage this at both the hypervisor and VM level. A lot of people use thin provisioning and auto-grow by default - they are the SQL Server defaults.
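Before bothering with a defrag pass, it's worth putting a number on the fragmentation. Sysinternals contig has an analyze-only mode that reports fragment counts per file; a sketch, with D:\SQL as a placeholder path for the data and log files:

```
REM -a analyzes only: reports fragments per file without moving anything
contig -a D:\SQL\*.mdf
contig -a D:\SQL\*.ldf
```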
I was only able to reach QD 1 a few times without altering my server: optimize for ad-hoc queries, forced parameterization, MAXDOP 1 (per-query MAXDOP 4), lock pages in RAM with 32 GB max server memory, and enabling the Resource Governor with no classifier script, allowing 49% of RAM per query (!! reduces tempdb usage by 80% for me !!).
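Most of those settings are scriptable; a sketch in T-SQL via sqlcmd, where 32768 MB and 49% mirror the numbers above and MyDb is a placeholder (lock pages in memory itself is granted through Windows local security policy, not T-SQL). The bigger per-query memory grant is what keeps sorts and hashes out of tempdb:

```
REM Instance-level knobs
sqlcmd -Q "EXEC sp_configure 'show advanced options', 1; RECONFIGURE; EXEC sp_configure 'optimize for ad hoc workloads', 1; EXEC sp_configure 'max degree of parallelism', 1; EXEC sp_configure 'max server memory (MB)', 32768; RECONFIGURE;"
REM Forced parameterization is per database
sqlcmd -Q "ALTER DATABASE [MyDb] SET PARAMETERIZATION FORCED;"
REM Allow up to 49% of query memory per request
sqlcmd -Q "ALTER WORKLOAD GROUP [default] WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 49); ALTER RESOURCE GOVERNOR RECONFIGURE;"
```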
I feel defragmentation must be considered! If you are dealing with shared worlds, the work to seek across 100 fragments could cost you a couple of context switches, and because of its [lack of consideration] for the application/SSD FastPath case, ESXi will not let you stretch your legs out.
I'm going to run some benchmarks now that I have the LSI 2308 in the DL360e and see if it performs any differently (better). The DL360e is a cheap server: it comes with a B120i SATA RAID controller on the motherboard [optional 512MB FBWC] and a B320i RAID controller on the riser [the LSI 2308 uses the same cache, so I think only one can be active]. The riser has dual SAS connectors, while the B120i on the motherboard has one SAS connector with only 6 SATA ports enabled. The B320i has both ports enabled for 8 drives, but you have to install the KEY to allow SAS; the key comes with the server if you buy it with the LSI riser board. Lastly, you can throw in a P420/1GB FBWC controller, but like most full-featured RAID controllers, the extra junk adds overhead.
The P420 only performs with SSDs when the cache is set 100% write, 0% read. If you try to disable acceleration entirely, it sucks. Benchmark testing of this is inadequate. In theory, if you use read-ahead you might increase throughput (remember - defragment the drive!) at the cost of latency, but in essence you are forcing a high QD, and until you reach QD 32 per drive in your RAID you are not being penalized much.
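The cache split and per-LD acceleration are scriptable through HP's hpssacli (later renamed ssacli); a sketch, assuming the controller sits in slot 0 and logical drive 1 is the SSD array:

```
# 0% read / 100% write, per the results above
hpssacli ctrl slot=0 modify cacheratio=0/100

# Toggle the array accelerator (controller caching) for one logical drive
hpssacli ctrl slot=0 logicaldrive 1 modify arrayaccelerator=enable
```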
I don't have the FBWC for the B120i/B320i so I can't test that.
Also, HP authenticates drives: if you choose to use their system of status lights, any drive that doesn't match HP's signature gets booted. There are ways around this - some folks sell pir8 sleds now - but IMO if you are going to go that route, just live without the drive lights, label your drives with their serial/WWN, and rely on agents to tell you what has failed.
HP is rather smart in that, in simple mode, it defaults to picking RAID 1+0 pairs that sit on separate cables and are optimal to the physical layout, for both horizontal and vertical RAID. LSI, not so much: you have to force the drives onto separate cable pairs. There is something to this, as I had a double-drive failure on a RAID-10 because I left drives 0/1 as one arm of the RAID-10 span; I should have mixed them up. I should also have benchmarked whether it is faster to have all drives split evenly across the arms.
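If you'd rather not trust simple mode, the pairing can be made explicit on the HP side too; a hedged sketch with hpssacli, where the port:box:bay IDs are placeholders - the intent is that each mirror pair spans the 1I and 2I cables:

```
# RAID 1+0 with mirror pairs deliberately split across the two ports
hpssacli ctrl slot=0 create type=ld drives=1I:1:1,2I:1:5,1I:1:2,2I:1:6 raid=1+0
```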
LSI recommends RAID-0 only for FastPath performance, which is really unacceptable given that their controllers love to reset with the 840/840 PRO (unsupported, remember) and drop drives like flies.
You can tell you're using an LSI controller with HP when it has "SMARTer" options: drop drives on regular failure, or drop drives on "SMART error". I've tried SMARTer on this B320i.
It will be interesting to compare the B120i to the B320i and see whether the on-motherboard Intel C600 chipset SATA is faster than the LSI.
If you put the B320i into IT mode, I suspect it will give its big brother, the P420/1GB FBWC, a serious run for its money.
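For the record, crossflashing an LSI 2308 to IT firmware is normally done with LSI's sas2flash utility from a DOS/EFI shell; a sketch only - the firmware image name is a placeholder, and erasing flash on a vendor-branded board like the B320i is strictly at your own risk:

```
# List adapters, erase the existing image, then flash IT firmware + boot ROM
sas2flash -listall
sas2flash -o -e 6
sas2flash -o -f 2308it.bin -b mptsas2.rom
```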
This DL360e was the cheapest model out there. I think it was like $999 with a simple E5-2403 quad core [no frills], with the B320i and SAS enabler key, real rails, and no front video port dongle [grr].