NVMe vs SAS - real-world 'homelab' findings

Dreece

int 21h
Although the web generally discusses this subject in favour of NVMe, after spending some time doing countless rounds of benchmarking (really not fun!) I've come to a conclusion of my own, and would love to hear other opinions/findings on this.

a) NVMe drives only perform well with dedicated CPU resources; their performance drops considerably in shared-CPU environments, i.e. virtualisation.
b) SAS SSDs behind a quality hardware RAID controller have read latency on par with NVMe. Write latency is NVMe's forte, but with write-back caching enabled on a hardware RAID controller with onboard DDR4, write latency is on par with NVMe too.

So, in a non-shared environment, i.e. a dedicated NVMe storage server/cluster, NVMe truly shines.
In a shared environment, however, SAS SSDs on a hardware RAID controller perform just as well.

Now, IOPS is supposedly where NVMe really pulls ahead, but with around 400K IOPS from typical modern SAS SSDs, I really can't see that being a bottleneck in a general 10-20 VM environment.
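To put that in perspective, here's a back-of-the-envelope IOPS budget. The 400K figure is the SAS SSD spec mentioned above; the per-VM demand is a purely hypothetical assumption for illustration, not a measurement:

```python
# Rough per-VM IOPS budget for a 400K IOPS SAS SSD array.
array_iops = 400_000          # typical modern SAS SSD, per the post above
assumed_iops_per_vm = 5_000   # hypothetical steady-state demand per VM

for vms in (10, 20):
    budget = array_iops // vms
    headroom = budget / assumed_iops_per_vm
    print(f"{vms} VMs: {budget:,} IOPS each ({headroom:.0f}x the assumed demand)")
```

Even at the 20-VM end, each VM gets a 20K IOPS slice, so under that assumption the drives are unlikely to be the bottleneck.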

I'm just putting this out there for debate really.

My conclusion thus far: NVMe is more CPU-hungry than SAS SSDs behind hardware RAID, and NVMe doesn't crunch out its advertised numbers in a heavily shared server. This is exactly where NVMe-oF kicks in, and my tests in that arena with RDMA have shown very impressive performance (Proxmox to Windows clients, SPDK target to StarWind NVMe-oF initiators).
 
Reactions: T_Minus

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,046
1,583
113
CA
Please quantify your environment beyond 'RAID controller'. I've never seen a SAS or SATA SSD perform as well as or better than NVMe in a local shared environment, but then I don't run hardware RAID, and the cache will run out at some point.
It sounds like it may be worth a test to see. I do know hardware RAID outperforms ZFS, but it also lacks the features that are the reason we/I use ZFS.

"400K IOPS on typical modern SAS SSDs"

Which SAS SSD are you able to get 400,000 IOPS from?

Even on small systems like an E3 v3 with 32GB RAM, a 280GB Optane gives me smoking performance for a handful of VMs, performance that 4x SATA can't compete with, or bare metal with a database. This is an ultra-low-power system relative to its performance!

Regarding NVMe needing CPU: yes, yes, yes... there are posts I made about this 4 or 5 years ago where I was hitting CPU limitations on big (at the time) servers.

With the price of E5 V3/V4 CPUs now and 2TB NVMe at $250, I still believe this is the way to go. If you need higher-performing NVMe you can step up to $300-400 for 2-3.8TB; still unbelievable price/performance.

Something VERY IMPORTANT here that was left off is COST

Simply put, in the last 5 years I've been able to acquire NVMe capacity at a much lower price than SAS3 and SATA SSD.
That is not taking performance into consideration, only enterprise SSD vs enterprise NVMe.
If you consider performance-class NVMe vs performance-class SSD, then the cost/saving ratio of NVMe is off the charts.
But simply put, in $ per GB, NVMe has blown SATA/SAS SSD away for cost savings.
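A quick back-of-the-envelope on the NVMe price points quoted above (a sketch only; 1TB taken as 1,000GB for simplicity):

```python
# $/GB for the NVMe price points quoted in this thread.
def dollars_per_gb(price_usd: float, capacity_tb: float) -> float:
    return price_usd / (capacity_tb * 1000)  # 1 TB taken as 1,000 GB

print(f"$250 / 2TB NVMe:   ${dollars_per_gb(250, 2.0):.3f}/GB")   # -> $0.125/GB
print(f"$400 / 3.8TB NVMe: ${dollars_per_gb(400, 3.8):.3f}/GB")   # -> $0.105/GB
```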


Please update us with which CPU(s) you were using and how much RAM you were using on your test systems and what you were seeing as far as CPU Load, RAM Usage, etc.
 
Reactions: Aluminat and Dreece

Dreece

int 21h
NVMe setup used for testing: 3.2TB PM1725a x8 PCIe cards and Intel P4610 4TB drives.
SAS setup: a Broadcom 9460-16i RAID controller with 4GB cache, with two attached PM1643 SSDs (400K-IOPS SAS drives) in RAID 0.
Server running dual Platinum 8160s with 256GB RAM.

CPU usage on the NVMe drives pins every core of processor 0 at 100% at certain points of the benchmark and sits there for the duration; when not all cores are maxed out, it simply maxes out core 0.
CPU usage on the SAS setup barely cracks 54% on a single core for most tests; only on the high-queue-depth, multi-threaded test do I see it momentarily flash to max, then drop back to around 50%.

I didn't bother saving any screenshots, as this testing wasn't really designed for a review; I used ATTO and CrystalDiskMark for the benches.
I do have a screenshot of a Crystal disk test that I shared earlier with a colleague.

Terminals_s5nfpLSt00.png

The hardware SAS SSD RAID is set to direct I/O (no read caching) but with write-back caching enabled. Please note this is by no means a 'scientific test', but in general NVMe under heavy load is far hungrier on CPU than hardware RAID. If others would like to run some tests on their labs and pull some numbers together too, I'm all ears.
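For anyone wanting to collect comparable numbers, here's a minimal sketch (assuming Linux and the standard library only; the tests above were eyeballed from Windows Task Manager) that samples overall CPU utilisation from /proc/stat while a benchmark runs:

```python
import time

def cpu_times():
    # First line of /proc/stat is the aggregate "cpu" row of jiffy counters.
    with open("/proc/stat") as f:
        values = [int(v) for v in f.readline().split()[1:]]
    idle = values[3] + values[4]  # idle + iowait jiffies
    return idle, sum(values)

def cpu_percent(interval=1.0):
    # Busy percentage across all cores over the sampling interval.
    idle1, total1 = cpu_times()
    time.sleep(interval)
    idle2, total2 = cpu_times()
    busy = (total2 - total1) - (idle2 - idle1)
    return 100.0 * busy / (total2 - total1)

if __name__ == "__main__":
    # Sample a few times while the benchmark runs in another window.
    for _ in range(3):
        print(f"{cpu_percent(0.5):.1f}% busy")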

For now I've decided to stick with NVMe over fabrics, so the drives have a dedicated server all to themselves.
 
Last edited:

Dreece

int 21h
The writes are going to be high because, as said in the first post, I use write-back caching. But there is no read caching going on; in both cases reads come straight from the drives. I can even see the tiny little green lights flashing away on the two drives (yes, I too had my doubts!).
 

Whaaat

Member
Jan 31, 2020
44
11
8
Is this MPIO in action? How is it possible for two 12Gbps SAS drives in RAID 0 to give such high numbers?
 

Dreece

int 21h
I wasn't really looking at the sequential numbers, to be frank, as it's an unrealistic benchmark; I only threw that picture up because it was the only one I had, relating to another discussion.

Samsung claim the PM1643s reach speeds of up to 2,000MB/s themselves, though I don't understand how that can be via a 12G connection. I haven't enabled any kind of MPIO whatsoever, unless it is being engaged automatically? I'm all ears for checking up on MPIO in Windows; I haven't a clue where to look for that.

 

Whaaat

Member
Samsung claim the PM1643s reach speeds of up to 2,000MB/s themselves, though I don't understand how that can be via a 12G connection.
Those Samsung numbers are for both ports active simultaneously on each drive. The CDM results reflect either Windows or controller caching effectiveness.
 

Dreece

int 21h
Those Samsung numbers are for both ports active simultaneously on each drive. The CDM results reflect either Windows or controller caching effectiveness.
I'll do another quick test using ATTO; that at least has a direct-I/O flag, so it can effectively cancel out Windows caching.
The RAID controller configuration is already set to direct I/O. I'd rather leave it like that because I only need the write-back caching facility and don't want read caching for the kind of workload this array will be doing.
 

Dreece

int 21h
Terminals_GDU3qEtyV2.png

I'm presuming the above is self-explanatory. As expected, because of the fixed write-back caching, writes hit 3GB/s and sit there for the duration, whereas the direct-I/O read hits a peak of around 3.4GB/s and then starts dropping rapidly. If there were caching on the controller side, or even the Windows side, the read speed would hit a peak and hold it to the end.

But raw performance wasn't the point of this debate. My concern was CPU usage, as my findings showed that NVMe, when hit ambitiously, takes a toll on the CPU its PCIe lanes are bound to.
 

Dreece

int 21h
I guess it all comes down to use case. In most cases it shouldn't be a problem; benchmarks saturate the load to the max, so they're a little on the unrealistic side. Still, I'd love to hear what others are seeing on their server CPUs while NVMe drives are hit with artificial loads like bench tests.
 

T_Minus

Build. Break. Fix. Repeat
Yes, a decent percentage of the CPU will be used when taxing NVMe. I don't remember seeing heavy utilisation when testing only a 1GB file, though, which seems a bit strange; usually it's sustained random read/write that does it, IIRC.

I wouldn't use CDM to benchmark multiple NVMe drives or SSDs, though. Additionally, there's really nothing worth discussing/comparing when testing a 1GB file on a single NVMe, multiple NVMe, or multiple enterprise SAS or SATA drives. You need to test much larger files to get outside any potential cache onboard or on the controller.

I would go for a 12GB file test and ideally use Iometer with various configurations (check the spec sheet for what they used too).
 
Reactions: Dreece

Dreece

int 21h
I think this debate somehow turned from CPU and latency to bandwidth because of the screenshot I threw up; I should have avoided that. Nobody is touching NVMe performance outside of Optane; although recent SAS drives are pretty stout, they're a far cry from NVMe's outright bandwidth superiority.

The primary discussion is really around CPU hogging. In my setup, the RAID controller has less of a hit on CPU resources than the NVMe drives. I'm going to swap out the Windows boot drive and run some tests on bare-bones Linux too, to see whether the CPU-hit differential between NVMe and the RAID controller is the same there.

In summary, avoiding any hit on client-side CPU resources by relying on RDMA to offload across the fabric is the way forward, so I've got my notebook and pencil out and am mapping a whole new strategy for a fairly simple homelab: one server filled to the brim with NVMe drives and a very large SAS SSD pool for archive data.

Recently I got rid of all spinners apart from a cluster of eight 12TB drives set up as hardware RAID 10 for regular backups. That was probably the best feeling ever. I've also dropped ZFS and all other software RAID setups; nothing wrong with them per se, I just elected to go back to hardware RAID all round.
 
Reactions: itronin

T_Minus

Build. Break. Fix. Repeat
I think it's a worthy discussion in all regards :)

" NVMEs outright bandwidth superiority. "

Yes, but technically NVMe should be superior at low-latency / low-QD IOPS too. Since you're not seeing that, I think the RAID controller is coming into play, and I'm curious to see your results with a larger test file and/or the Linux tests too.

Another thought is to do a one NVMe vs one SAS SSD comparison with no RAID controller, but I don't think that was your purpose ;)
It should, however, show the latency and IOPS differences of raw drives without software or hardware RAID in front.

Look forward to any other tests you do :)
 
Reactions: Whaaat and itronin

i386

Well-Known Member
Mar 18, 2016
2,048
542
113
31
Germany
Open the CrystalDiskMark directory, go to CdmResource\DiskSpd, open cmd.exe and run diskspd64.exe with these parameters:
Code:
diskspd -b4K -c20G -d120 -L -o8 -r -Sh -t4 -w20 testfile.dat
This creates a 20GByte test file and runs (aligned) 4KByte random reads & writes for 120 seconds, with a queue depth of 8 per thread, I/O flagged to bypass all caching (-Sh), 4 threads, and an 80% read / 20% write workload.
Code:
Total IO
thread |       bytes     |     I/Os     |    MiB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
-----------------------------------------------------------------------------------------------------
     0 |     24589053952 |      6003187 |     195.41 |   50024.20 |    0.140 |     0.194 | testfile.dat (20GiB)
     1 |     25200467968 |      6152458 |     200.27 |   51268.06 |    0.136 |     0.198 | testfile.dat (20GiB)
     2 |     26748911616 |      6530496 |     212.57 |   54418.23 |    0.128 |     0.193 | testfile.dat (20GiB)
     3 |     26932502528 |      6575318 |     214.03 |   54791.73 |    0.128 |     0.133 | testfile.dat (20GiB)
-----------------------------------------------------------------------------------------------------
total:      103470936064 |     25261459 |     822.27 |  210502.22 |    0.133 |     0.181

Read IO
thread |       bytes     |     I/Os     |    MiB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
-----------------------------------------------------------------------------------------------------
     0 |     19673640960 |      4803135 |     156.34 |   40024.23 |    0.139 |     0.197 | testfile.dat (20GiB)
     1 |     20159467520 |      4921745 |     160.21 |   41012.60 |    0.136 |     0.193 | testfile.dat (20GiB)
     2 |     21400768512 |      5224797 |     170.07 |   43537.92 |    0.128 |     0.191 | testfile.dat (20GiB)
     3 |     21554036736 |      5262216 |     171.29 |   43849.73 |    0.127 |     0.136 | testfile.dat (20GiB)
-----------------------------------------------------------------------------------------------------
total:       82787913728 |     20211893 |     657.91 |  168424.49 |    0.132 |     0.180

Write IO
thread |       bytes     |     I/Os     |    MiB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
-----------------------------------------------------------------------------------------------------
     0 |      4915412992 |      1200052 |      39.06 |    9999.96 |    0.141 |     0.181 | testfile.dat (20GiB)
     1 |      5041000448 |      1230713 |      40.06 |   10255.46 |    0.138 |     0.215 | testfile.dat (20GiB)
     2 |      5348143104 |      1305699 |      42.50 |   10880.31 |    0.130 |     0.199 | testfile.dat (20GiB)
     3 |      5378465792 |      1313102 |      42.74 |   10942.00 |    0.129 |     0.119 | testfile.dat (20GiB)
-----------------------------------------------------------------------------------------------------
total:       20683022336 |      5049566 |     164.37 |   42077.73 |    0.134 |     0.182
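A quick sanity check on the totals above: with `-w20` the run should be about 80% reads by IOPS, which the reported totals confirm.

```python
# Totals copied from the diskspd run above (I/O per second).
total_iops = 210502.22
read_iops  = 168424.49
write_iops = 42077.73

read_share = read_iops / total_iops
print(f"read share: {read_share:.1%}")  # -> read share: 80.0%
```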
 
Reactions: T_Minus

Dreece

int 21h
Using the above command I get an average latency of 0.343ms on reads and 0.191ms on writes via the SAS RAID controller. Not XPoint territory on the read, but definitely on par with typical NVMe NAND, and the write, thanks to write-back caching, is scraping at the heels of XPoint.

It is kind of pointless to throw XPoint numbers around anyway. Only someone with very little data is going to run an all-XPoint storage tier; it's great for a reasonably sized database or a cache, but otherwise practically unrealistic for large storage pools.

I'm building up the dedicated SSD storage server today; once I have it completed, I'll do some comparisons between the SAS and NVMe pools across the fabric and call it a day.
 

Rand__

Well-Known Member
Mar 6, 2014
4,592
912
113
So, I had typed up the lengthy reply below before re-reading your initial question and seeing that you are primarily concerned about CPU utilisation. I left it in because it seemed a waste to delete it; maybe it's interesting despite being unfinished (i.e. no numbers etc.).

Regarding your primary concern of CPU usage: I think that's again a question of the use case. NVMe will most likely provide better performance per CPU cycle than SAS.
But what you're doing in your example is taking highly optimised SAS infrastructure and comparing it to generic, CPU-based NVMe. Of course ages-old SAS is more efficient at processing data (the required routines are baked into silicon on your RAID controller) than relatively young NVMe. You're also looking at a best-case scenario for SAS: a very high-IOPS drive, reads, and no scaling issues (try that with 8 or 16 drives).

Whether one or the other is better is as always totally use case dependent.
It might be perfectly fine in a company storage system to hog 50% of a low-power multi-core CPU when you've got 28 cores serving 24 NVMe drives that provide more total performance than your two 100G links can carry.

The important thing is to understand what your requirements are and what is the right tool for the job.

If you are fine with SAS drives, their advantages and limitations then there is nothing/nobody that should keep you from it.

Personally, I'm looking to migrate from 12 SS300s to 4 PM1725a's at this time (ZFS filer), since I'm hoping for lower power consumption. I ran a bunch of fio tests (which also report CPU utilisation) but haven't beautified the results, so if you want the raw logs I'm happy to provide them.

-----------------------------------------------------------
In the end the question is based on what primary resource you want to build your system.

Your system can be bound on CPU, available space, max performance, drive slots and of course it can be cost bound (and others o/c like power consumption).

What is important for you will be based on the actual use case (company: minimum performance for as many users as possible, or better performance for fewer users [prime tier]; homelab: maybe the best performance & space you can get for a certain amount of cash), and of course on what you already have.

The next question is how things scale (if you want/need them to scale): by faster CPUs, more users/VMs (#jobs), or more drives. That depends on the hardware in use and, again, on your use case.

Of course I can't tell you how things work in your use case, but in mine (homelab with few users and high performance requirements) it turned out that the single item that improved total performance the most was individual disk performance.
That is because I have a low number of users/QD/jobs, so there is nothing for the system to actually scale on, which might or might not be the same in your use case. (And no, I am not counting the 20-odd VMs running here, since they run in the background; when I'm not interactively using one of them, I don't care if it takes a minute longer to do something. But again, that's use-case dependent; I don't run 10 databases with heavy I/O after all.)

So how is that related to your question?
Well, what you're doing is optimising single-disk performance by using a RAID controller to aggregate more than one SSD into a single drive, which you then compare to a single NVMe drive.
What you see is (the expected result) that reads (which can be distributed) scale with the number of devices, while writes do not.
 
Reactions: Dreece and T_Minus

Dreece

int 21h
It is all relevant; I think it is very helpful to cover more of the implications and potential use cases, as any discussion where people's own experiences span a wider range of use cases can be helpful to a larger audience.

The update thus far: last night I completed my new storage server, fully loaded with all my data and now serving all my VMs to both the Proxmox host and the Hyper-V host, all flying like a top fueller launching on every request.

CPU usage is no longer a concern whatsoever; going over fabrics via RDMA takes that concern away completely, and throughput is well and truly next-generation now because of the dedicated, isolated resources.

To be honest, I can discern no difference between the SAS and NVMe pools at the clients in my setup; the only negative is that the NVMe pool generates a considerable amount of heat compared to the large SAS pool.

From the few bench tests thus far, the NVMe pool can handle more simultaneous requests while delivering higher throughput at low latency. In contrast, SAS performance starts dropping as parallel requests increase; not abysmal by any means, but I envisage it greatly impacting a low-latency, high-throughput environment, whereas NVMe keeps up with the load, albeit at the cost of more power and heat. That comes with the territory of greater bandwidth, irrespective of tech.
 
Last edited: