M.2 SSD - PCIe Version Effect on IOPS


UnknownPommes

Active Member
Aug 28, 2022
I am currently looking for drives for an upcoming Proxmox server build.
It is going to be mostly random read/write and not a lot of sequential.
One option would be to go with "normal" M.2 SSDs on a PCIe riser.
The server is PCIe Gen3, and I am now thinking about just going with Gen4 drives,
because the cost is not much different and I would have them in case I upgrade to a newer platform in the future.

I am aware that they will obviously be limited by the Gen3 PCIe interface for sequential speeds.
But what about random speed / IOPS? Most of the Gen3 SSDs I found were around 350k IOPS, while Gen4 drives were around 700k.
Would a Gen4 SSD on a Gen3 board still achieve those IOPS (or at least better ones than the Gen3 SSDs, say 500k or so)?

My back-of-the-envelope math was the following:
PCIe Gen3 x4 bandwidth ~ 3500 MB/s = 3,500,000 KB/s
4K IOs ~ 4 KB each
3,500,000 / 4 = 875,000

Am I wrong, or would PCIe Gen3 x4 (theoretically) be able to do close to 900k IOPS at 4K block size?
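In Python, the same back-of-the-envelope calculation (assuming ~3500 MB/s of usable Gen3 x4 bandwidth, and ignoring protocol overhead, NAND latency and queue depth):

```python
# Rough upper bound on 4K IOPS over a PCIe Gen3 x4 link.
# The bandwidth figure is an assumption, not a measurement.
LINK_BANDWIDTH_MB_S = 3500   # usable PCIe 3.0 x4 throughput, ~3.5 GB/s
IO_SIZE_KB = 4               # 4K random I/O

iops_ceiling = (LINK_BANDWIDTH_MB_S * 1000) / IO_SIZE_KB
print(f"Theoretical ceiling: {iops_ceiling:,.0f} IOPS")   # ~875,000
```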

Also, in case I upgrade to a Gen4 board in the future, I would definitely have faster drives if I go with Gen4 now.
And yes, I am aware of the low endurance / TBW and missing PLP on some consumer SSDs, but let's leave that out for now.

What are your thoughts? :)
 

nexox

Well-Known Member
May 3, 2023
I haven't looked up any specs, but I presume that with double the bandwidth PCIe 4.0 also gives you about half the latency of 3.0, and that will start to become important to the IOPS throughput at some point, I'm guessing at around 75% of the maximum bandwidth.

You also haven't mentioned whether you're looking at reads or writes - PCIe (every version) is full duplex, so the 32Gbps of 3.0 x4 means reads and writes can hit ~4GB/s concurrently. Naturally within the SSD there is some contention, so adding writes decreases read IOPS; quality drives usually give a 70/30 or 80/20 IOPS spec in addition to full-read and full-write numbers so you can estimate how your workload will fare.
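As a rough illustration (not anything from a datasheet), one crude way to estimate a mixed workload when you only have the full-read and full-write numbers is to weight the two by per-I/O cost; the published 70/30 spec, where it exists, is the better number:

```python
# Crude mixed-workload estimate from full-read and full-write IOPS specs.
# Treats each I/O as occupying 1/IOPS of drive time; a real 70/30 datasheet
# figure, when published, is more trustworthy than this model.
def mixed_iops(read_iops: float, write_iops: float, read_fraction: float) -> float:
    return 1.0 / (read_fraction / read_iops + (1.0 - read_fraction) / write_iops)

# Hypothetical spec-sheet values, for illustration only.
print(f"{mixed_iops(700_000, 300_000, 0.70):,.0f} IOPS at 70/30 read/write")  # 500,000
```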

Of course, if you're doing that many random writes, a consumer-grade SSD is going to fill its various levels of write cache rather quickly, and then you'll get not-so-good performance until the drive gets a break to flush those caches out to slower NAND.
 

UnknownPommes

Active Member
Aug 28, 2022
Thanks. Yeah, I am still not sure whether I will go with M.2 anyway. The main thing I am concerned about with U.2 is the power draw: they are all listed at something like 15-20W, and if I have 4-5 of them that is more than the rest of the server (the build is focused on power efficiency). M.2, by comparison, is way lower, most drives under 1W.
SATA, on the other hand, is too slow.

Btw, I am looking to get ~6-8TB of total usable storage, 400k-500k read/write IOPS, sequential speeds in the 4-5GB/s range, and one-drive fault tolerance (probably RAIDZ1).
I would go with the lowest-power solution that checks those boxes.
Available connectivity on the server: PCIe x16 Gen3 and 10+ SATA ports.
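A rough sanity check of a candidate layout against those targets (drive size and wattage figures are just assumptions for now):

```python
# Check a candidate RAIDZ1 pool against the stated capacity and power targets.
# Drive size and per-drive wattage are placeholder assumptions.
def raidz1_usable_tb(drive_tb: float, n_drives: int) -> float:
    return drive_tb * (n_drives - 1)   # one drive's worth of capacity goes to parity

def pool_power_w(per_drive_w: float, n_drives: int) -> float:
    return per_drive_w * n_drives

n, size_tb = 4, 2.0
print(f"usable capacity: {raidz1_usable_tb(size_tb, n):.1f} TB")   # 6.0 TB
print(f"idle @ 5 W/drive:  {pool_power_w(5, n):.0f} W")            # 20 W
print(f"load @ 15 W/drive: {pool_power_w(15, n):.0f} W")           # 60 W
```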
 

nexox

Well-Known Member
May 3, 2023
U.2 drives will only draw that much power when they're under heavy load; idle is usually in the 4-5W range. Any M.2 drive that's going to perform anywhere close to what you need will also draw considerably more than 1W under that kind of load (some models exceed 8W), and you may need more of them to maintain performance over longer periods of time.
 

UnknownPommes

Active Member
Aug 28, 2022
OK, yeah, the thing is I haven't really found many idle numbers for them. I actually have two Intel P4600s lying around but can't test them because I don't have the PCIe-to-U.2 adapter card, and I would also need two more to reach the capacity I need.
Can you recommend any particular models for the server?
 

nexox

Well-Known Member
May 3, 2023
The p4600 is pretty good, the Micron 9300 Max would also probably hit the performance numbers you want, though I don't see idle power specified for those.
 

UnknownPommes

Active Member
Aug 28, 2022
Have a look at this thread; if you plan on using enterprise M.2, they can draw up to 10 watts each under load.
Yup, thanks, already noticed that.
I am probably going to go with 2TB Seagate FireCuda 530s though.
They are far faster than the U.2 drives I looked at, they still offer 2.5PB of endurance each (since I will get 4 drives in RAIDZ1, this should give me a bit over 7PB of writes across the entire pool), and they will probably be around 120€ or less new on Black Friday.
I am also saving another 100-120€, since I only need one PCIe-to-M.2 card instead of four SFF-8639 cables and a PCIe to 4x SFF-8643 card.
And if I upgrade my platform to PCIe 4.0 in the future, I would get an additional speed bump out of the pool.
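The pool endurance math I am assuming there (which ignores any write amplification):

```python
# Back-of-the-envelope pool endurance for 4x 2.5 PB-TBW drives in RAIDZ1.
# Assumes writes spread evenly and one drive's worth of capacity holds parity;
# ignores write amplification, which will lower the real number.
DRIVE_TBW_PB = 2.5
N_DRIVES = 4
data_drives = N_DRIVES - 1                        # RAIDZ1: one parity drive
pool_endurance_pb = DRIVE_TBW_PB * data_drives
print(f"~{pool_endurance_pb:.1f} PB of user writes")   # ~7.5 PB
```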
 

UnknownPommes

Active Member
Aug 28, 2022
Understand that consumer speed #s are best case, enterprise speed #s are guarantees.
Yeah, I am aware of that, but in terms of official specs we are talking 617.5k/238k read/write 4K IOPS for my P4600 vs 1000k/1000k for the FireCuda. I am already running a bunch of those Seagate FireCudas in some of my other systems and also have two P4600s (but only two, so not really an option for this project). From my tests with CrystalDiskMark on the same settings, the P4600s were around 600k/240k while the FireCuda was around 850k/800k (granted, one is Gen3 and the other is Gen4). But compared to the old MTE220s I am using, they are both super fast.
 

nexox

Well-Known Member
May 3, 2023
That 1000k IOPS spec only holds until the DRAM cache fills up, which could be just a few seconds of writes at that rate; then you usually have a larger SLC NAND write cache, which runs slower and may take a few minutes to exhaust. If you're going to use them for a real high-write-throughput workload, both levels of write cache will fill up quickly and they will run slower than those Intel drives, plus you'll likely kill them long before the rated lifespan, because consumer drives assume mostly sequential writes when they test for that.
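To put rough numbers on it (cache sizes and write rate here are illustrative guesses, not FireCuda 530 specs):

```python
# How long a finite write cache lasts at a sustained write rate.
# Cache sizes and throughput are assumptions; only the order of magnitude matters.
def seconds_to_fill(cache_gb: float, write_rate_mb_s: float) -> float:
    return cache_gb * 1000 / write_rate_mb_s

print(f"DRAM-sized cache, 2 GB @ 2000 MB/s: {seconds_to_fill(2, 2000):.0f} s")    # ~1 s
print(f"SLC cache, 200 GB @ 2000 MB/s:      {seconds_to_fill(200, 2000):.0f} s")  # ~100 s
```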

There's not much point benchmarking a brand new SSD, because they're very fast until most of the NAND is used, at which point they suffer a huge hit to write performance, which just gets worse over longer periods of time when NAND cells begin to fail. Enterprise drives come with way more spare NAND and things like PLP which help maintain performance both when the drive initially fills up and after the NAND starts to age a bit.

Also from what I understand RAIDZ is pretty slow for random writes, part of that is probably an indication that it causes some amount of write amplification, so the 7PB number could be rather optimistic, depending on your workload.

Given how rapidly SSDs have been dropping in price recently it doesn't make much sense to consider future needs, buy what you need now and expect much better deals in the future.
 

Tech Junky

Active Member
Oct 26, 2023
I agree here. Also, M.2 doesn't make sense for a production environment, as you will get denied on warranty claims and have to buy new disks out of pocket anyway.

M.2 drives also run hot under load and will require active cooling if you intend to run things at full throttle. Besides the cache-depletion issue, they will bottleneck.

Endurance will not multiply w/ drives since you're mirroring things with raid.

I think a dedicated flash-cache-type drive that handles these 4K uses and then hands off to the other drives for longer-term storage would ideally be of benefit. Tiering will reduce the overall wear and tear; alternatively, bulk up the RAM and use a RAM disk for processing the data before storing it.

I recently switched from HDDs to SSDs and tried a couple of different options, like a 5x M.2 RAID 10, and eventually jumped ship to U.3 NVMe instead. The thing to keep a close eye on is the positioning of the U.x drives, as they vary depending on intended use: some are read-focused, some write-focused, some mixed-use.

KIOXIA CD8-R Series NVMe SSD, 2.5-inch (KCD8XRUG15T3)

I ended up going with one of these after somehow killing 2 Micron drives. One lasted a couple of hours post-install and the other less than a week.

Anyway... once you get away from the crappy consumer drives, things work better, and you don't get hit with the idiot tax for cheap devices. Even some of the other options at lower capacities make more sense, as the 8TB U.x option ($400) is 50% less than the M.2 version ($800). Putting the drive inside a 2.5" case also drops the temps through passive dissipation, and it improves further with even a modest amount of airflow across the case. My idle temp hovers around 40C and under load not much higher, but then again I'm not typically sustaining load on it.
 

UnknownPommes

Active Member
Aug 28, 2022
Yeah, the thing is, this is for a hypervisor / VM disks, not for a cache / super-heavy-usage drive.
Endurance will not multiply w/ drives since you're mirroring things with raid.
As far as I know, this depends on the RAID type. On something like RAID 1, where everything is mirrored, it obviously doesn't, but for striped / parity layouts like RAID 0/5/6 it at least partially should. For RAID 0 it should simply multiply with the number of drives, since the data is split up and spread across the drives evenly, and for RAID 5/6 it should multiply by the number of non-parity drives, so with 4 disks in RAID 5 (one parity) I should get approximately 3x the endurance of a single drive.

Regarding the cache, I couldn't really find much information about the cache on the Intel DC P4xxx SSDs (like the P4600). Do they even have a cache, or do they really sustain all that speed directly to the drive with no cache?
Also, as far as I understand it, the DRAM cache on most SSDs is usually used as some sort of lookup table and not to store actual files that are written to the drive, correct?
Regarding the benchmarks I did, I am aware that empty SSDs generally perform better than full ones, but since both were empty, shouldn't this even out? Also, I don't think caching played too much of a role, since I tested with the 128GB option, which I think should be a lot larger than the caches can handle, right?
How large is the SLC cache on the FireCuda 530 actually? I didn't really find anything in the datasheet.
Thanks for all the help.
 

nexox

Well-Known Member
May 3, 2023
4 disks in RAID 5 (one parity) I should get approximately 3x the endurance of a single drive.
You're leaving out a certain amount of write amplification for parity RAID: every time you write 4K of data you also have to re-write another 4K of parity (though many implementations will write the entire parity stripe), so for random writes you get half of the total endurance at best, and from what I understand RAIDZ writes a lot more than the minimum in this case.
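Rough numbers under that assumption (a flat 2x write amplification for small random writes; real RAIDZ behaviour may be worse):

```python
# Effective pool endurance when every small random write also rewrites parity.
# The 2x write amplification factor is an assumption; RAIDZ can amplify more.
DRIVE_TBW_PB = 2.5
N_DRIVES = 4
WRITE_AMPLIFICATION = 2.0          # 4K data + 4K parity per logical 4K write

total_nand_writes_pb = DRIVE_TBW_PB * N_DRIVES
effective_user_writes_pb = total_nand_writes_pb / WRITE_AMPLIFICATION
print(f"~{effective_user_writes_pb:.1f} PB of random user writes")   # ~5 PB vs the naive ~7.5 PB
```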


Regarding the cache, I couldn't really find much information about the cache on the Intel DC P4xxx SSDs (like the P4600). Do they even have a cache, or do they really sustain all that speed directly to the drive with no cache?
Enterprise drives tend not to use DRAM for write caching, because server applications tend to sync writes often enough that the drive would have to write everything to NAND all the time anyway and not gain any performance. I imagine anyone that knows exactly how those Intel drives use their DRAM is under NDA and can't share the info, but benchmarks don't show any sign of significant write caching - they do support that performance pretty much direct to flash.
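A sketch of the sync-heavy pattern I mean, where every write is flushed, so a volatile DRAM write cache can't absorb much (the path and sizes here are arbitrary):

```python
import os

# Simulate a sync-heavy workload: every 4K write is followed by fsync, so the
# drive must commit it to NAND (or PLP-protected buffers) before acknowledging.
fd = os.open("/tmp/sync-test.bin", os.O_WRONLY | os.O_CREAT, 0o644)
block = os.urandom(4096)
for _ in range(1000):
    os.write(fd, block)
    os.fsync(fd)          # force the data past any volatile cache
os.close(fd)
```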


Also, as far as I understand it, the DRAM cache on most SSDs is usually used as some sort of lookup table and not to store actual files that are written to the drive, correct?
For enterprise drives, yeah, they tend to store the block mapping table, small amounts of user data that has to be moved around for background maintenance, and maybe some read cache (which is again not super useful for enterprise workloads because they have system memory for read cache.)


but since both were empty, shouldn't this even out?
Not really, not across drives designed so differently. Enterprise drives are specifically made to do consistent IO over long periods of time, desktop drives are primarily designed to win trivial benchmarks, because that's what their respective buyers want.

How large is the SLC cache on the FireCuda 530 actually? I didn't really find anything in the datasheet.
Modern consumer SSDs tend to use some kind of variable caching strategy, TLC chips can be run in SLC mode when conditions are suitable. You really just have to hammer the disk with random writes, chart the speed vs quantity of data written, then look for the performance steps, that will show you about how large each cache is. Realistically you want to test with your actual workload, or a close simulation, and you should fill the drive up with random data immediately before you kick off the benchmark, so the drive is closer to steady state operating conditions.
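A rough way to spot those steps from logged samples (the data format, numbers and threshold below are just placeholders):

```python
# Locate write-cache tier boundaries from (GB written, MB/s) samples logged
# during a sustained random-write test. Real data will be noisier and may
# need smoothing before a simple drop detector like this works well.
samples = [(1, 5000), (5, 5000), (10, 2500), (50, 2500), (80, 900), (120, 900)]

DROP_THRESHOLD = 0.3   # flag anything that loses more than 30% of throughput
for (gb_a, mbps_a), (gb_b, mbps_b) in zip(samples, samples[1:]):
    if mbps_b < mbps_a * (1 - DROP_THRESHOLD):
        print(f"cache tier likely exhausted near {gb_b} GB written "
              f"({mbps_a} -> {mbps_b} MB/s)")
```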
 

Tech Junky

Active Member
Oct 26, 2023
You'll see a difference.

When I was testing TB enclosures I tried 4-5 different drives, and for that purpose a new SN770 without DRAM worked better than the SN850 that does have it. However, internal use may vary a bit more based on how much RAM the system has to compensate; with 32GB I didn't see any degradation in speed.

When I was initially looking into the U.x drives, Intel didn't really do much other than have inventory, and their specs didn't have much appeal. Micron looked good until the drives failed after such a short lifespan, which was a blessing in that I could return them for a refund instead of an RMA. Ruling those two out only leaves a couple of other options: Kioxia and Samsung. Samsung for me is a no-go with their higher prices and no real performance difference.

Seagate.... SMH... never again will they be in any of my systems.

Caching for me at this point isn't a concern, since the drive hits 6.5GB/s in throughput and I put most things into write-through mode direct to disk instead of OS caching and then passing it on. The nice thing about some of these drives is that there is no need to RAID them to get the speed you want, unlike dealing with spinners. That was kind of the point of switching to SSD in the first place, besides it being lighter in weight when moving the system; loading up a case with drives adds considerable weight to deal with.

Anyway... I went with the 16TB (15.36) option, as that's what my prior setup had in RAID. Figuring the prior setup was in use 24/7 for 8 years and the drives could probably still go another 10+ years according to their SMART data, the SSD should be good for a decade, which makes the cost a bit more palatable.