What storage approach and drives would you go for?

nabsltd

Well-Known Member
Jan 26, 2022
721
511
93
How much do I have to pay in CPU performance to handle the I/O demands of 12x61TB or 24x30TB NVMe drives*?
There is no CPU made that can handle 24x NVMe drives at full speed and still keep compute at 100%. The I/O will heat the CPU up enough that it gets throttled. But since you won't be creating the 200GB/sec of data it would take to require 12 NVMe drives to absorb it, it really doesn't matter.

My objective is not to saturate the network, but to not have it slow me down.
It won't. You won't be writing at 100Gbps 24/7/365. You will be writing in bursts, which should easily be absorbed by a much smaller amount of NVMe and then flushed to SATA/SAS SSD or spinning disk.

Using a properly configured clustered storage solution, you could have 1000TB+ of storage, most of it spinning disk, and it would handle write speeds of 5-7GB/sec 24/7/365. If you went with SAS SSD as the final storage instead, you could likely handle writes at 30GB/sec 24/7/365.
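
To put rough numbers on that (my assumptions for illustration, not measurements: a 100Gbps line-rate burst and a ~6GB/sec drain rate to the slow tier), here is a quick back-of-the-envelope sketch of how much NVMe landing space a burst actually needs; with these made-up numbers, a one-hour burst needs roughly 23TB, so a single large NVMe drive already covers it:

Code:
// Back-of-the-envelope cache sizing. The 100Gbps ingest rate, 6GB/s drain
// rate and 60-minute burst length are assumptions for illustration only.
#include <cstdio>

int main() {
    const double ingest_gb_per_s = 100.0 / 8.0;  // 100Gbps line rate ~= 12.5 GB/s
    const double drain_gb_per_s  = 6.0;          // assumed flush rate to the slow tier
    const double burst_seconds   = 60.0 * 60.0;  // assumed worst-case burst length

    // While the burst lasts, the cache fills at (ingest - drain).
    const double growth_gb_per_s = ingest_gb_per_s - drain_gb_per_s;
    const double needed_tb = growth_gb_per_s * burst_seconds / 1000.0;

    std::printf("Cache grows at %.1f GB/s; a 1-hour burst needs roughly %.0f TB of NVMe.\n",
                growth_gb_per_s, needed_tb);
    return 0;
}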
 

ca3y6

Well-Known Member
Apr 3, 2021
391
305
63
Using a properly configured clustered storage solution, you could have 1000TB+ of storage, most of it spinning disk, and it would handle write speeds of 5-7GB/sec 24/7/365. If you went with SAS SSD as the final storage instead, you could likely handle writes at 30GB/sec 24/7/365.
That being said, most solutions are really bad at that. ZFS doesn't even have anything that can properly be called an SSD write cache. Windows Storage Spaces will only accept an SSD cache in front of HDDs, not a mirrored NVMe cache in front of parity SAS SSDs, and even with HDDs it will not bother to cache sequential writes.

Not sure how LVM caching fares; my only experience with it is through Synology/DSM, which sucks equally hard: no caching for sequential writes. Then you have PrimoCache on Windows, but it is not power-loss safe, so you are playing Russian roulette with your data; you might as well use a RAM cache. Still on Windows, StableBit DrivePool will at least redirect sequential writes to an SSD cache, but it's not block-level, so it can't deal with files that are always locked (think VM virtual disks) or files larger than the space available on the cache drives.

So I am sure there is some good solution somewhere, but not in the major products.
 
  • Like
Reactions: Dennisjr13

kapone

Well-Known Member
May 23, 2015
1,392
825
113
That being said, most solutions are really bad at that. ZFS doesn't even have anything that can properly be called an SSD write cache. Windows Storage Spaces will only accept an SSD cache in front of HDDs, not a mirrored NVMe cache in front of parity SAS SSDs, and even with HDDs it will not bother to cache sequential writes. Not sure how LVM caching fares; my only experience with it is through Synology/DSM, which sucks equally hard: no caching for sequential writes. Then you have PrimoCache on Windows, but it is not power-loss safe, so you are playing Russian roulette with your data; you might as well use a RAM cache. Still on Windows, StableBit DrivePool will at least redirect sequential writes to an SSD cache, but it's not block-level, so it can't deal with files that are always locked (think VM virtual disks) or files larger than the space available on the cache drives. So I am sure there is some good solution somewhere, but not in the major products.
Tangentially...Adaptec cards...with maxCache.

They'll take an array of SSDs and use it as a read/write cache on an array of HDDs. And that's before the onboard memory cache...all power-loss protected. Now, how well it does on sequential writes is up for debate, but it does work. I'm using it.

All completely transparent to the OS and file system on top.
 

kapone

Well-Known Member
May 23, 2015
1,392
825
113
though I presume you are limited to SAS SSDs, i.e. no NVMe caching?
That's correct. Even worse, the SSDs have to be on the same card as the HDDs, can't have HDDs on one card and SSDs on another.

Edit: Actually I take that back. I haven't tried this with any of the newer tri-mode cards, so maybe it's possible?
 

ca3y6

Well-Known Member
Apr 3, 2021
391
305
63
Yeah, but everything I read about tri-mode says it's little more than NVMe over SAS. And as you said, everything on one card might work for some, but for crazy weirdos like me who build 24+ SSD arrays it probably won't work; you need a software solution. What is frustrating is that the computing power is there, it's just that no major software solution will use it.
 

kapone

Well-Known Member
May 23, 2015
1,392
825
113
Yeah, but everything I read about tri-mode says it's little more than NVMe over SAS. And as you said, everything on one card might work for some, but for crazy weirdos like me who build 24+ SSD arrays it probably won't work; you need a software solution. What is frustrating is that the computing power is there, it's just that no major software solution will use it.
I think the reason is fairly simple. It's quite difficult to do this in a general purpose way. Like Adaptec's implementation, which works up to a point, but doesn't scale beyond a single card.

I suspect as flash storage gets cheaper, this is gonna become a moot point. Hell, Microsoft did an experiment with lasers and storing data on crystals, and they got some mind-boggling numbers if I recall. I think they were able to store like 100TB on a crystal the size of a postage stamp...
 

ca3y6

Well-Known Member
Apr 3, 2021
391
305
63
I agree, but every CPU manufactured in the last 10 years can XOR data at tens of GB/s; I just don't understand why all those solutions are that slow. Parity calculation isn't the reason. I suspect this was all built and tested on HDDs with (maybe) slow SSDs in front of them, but not in the modern world of 10GB/s PCIe 5.0 NVMe SSDs. There must be super-inefficient, over-engineered code everywhere.
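
For what it's worth, here is a crude single-threaded sketch (mine, nothing to do with any of the products above) to sanity-check that XOR claim; it ends up memory-bandwidth bound, which is exactly the point that parity math is not the bottleneck:

Code:
// Crude XOR (RAID-parity style) throughput check. Build with optimizations,
// e.g. g++ -O3 xor_bench.cpp; expect a memory-bandwidth-bound result.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const size_t words = (1ull << 30) / sizeof(uint64_t);   // 1 GiB per buffer
    std::vector<uint64_t> a(words, 0x1234567890abcdefULL);
    std::vector<uint64_t> b(words, 0xfedcba9876543210ULL);
    std::vector<uint64_t> parity(words, 0);

    const int reps = 10;
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        for (size_t i = 0; i < words; ++i)
            parity[i] = a[i] ^ b[i] ^ static_cast<uint64_t>(r);  // the parity kernel
    const auto t1 = std::chrono::steady_clock::now();

    const double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("XORed %d GiB in %.2f s -> %.1f GiB/s\n", reps, secs, reps / secs);
    return 0;
}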

As a .NET developer, it is striking to see the performance difference between modern versions of .NET Core and .NET 4.8. The difference is that the team actually spent time on performance optimization, killing unnecessary memory allocations, and the impact is brutal and visible to the naked eye.
 

kapone

Well-Known Member
May 23, 2015
1,392
825
113
Oh absolutely. My main backend codebase (for my business) is .net9 (now), run on Linux containers and it’s blazing fast compared to what .net used to be.

And C# is just… amazing.
 
  • Like
Reactions: ca3y6

homeserver78

Member
Nov 7, 2023
94
57
18
Sweden
Note that the drives alone will require at least 10KW to run, plus about the same amount of power in cooling. That's assuming you go with 100TB+ drives. At 30TB, it will be closer to 40KW.
How do you figure that 12 drives à 25 W (max) will draw 40 kW? (Or did you think that the 134 PB was the total amount of storage needed? Then the figures make sense. It was not though, it was the total write endurance if going with the 122 TB drives, as you can see in the post you replied to.)

(Also a nitpick: K is the symbol for kelvin (temperature). The kilo prefix is lower-case k.)
 
  • Like
Reactions: SnJ9MX

joerambo

New Member
Aug 30, 2023
27
8
3
Intel/Solidigm 5336 and 5316 are not exactly the fastest drives, because they are QLC.

In fact, the 5316 is slower than a good SATA SSD at random 4K writes, and that might become a problem: if you have multiple threads writing many files that need syncing, metadata updates, etc., each of those writes will eat into performance.

Where they shine is huge flash capacity and decent read performance: bulk write-once, read-many storage that provides orders of magnitude more read IOPS than spinning rust of comparable capacity would.
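
One common way to soften the random-write weakness (a generic illustration from me, not something joerambo suggested) is to batch small writes in memory and push them to the drive as large sequential writes with a single fsync per chunk, roughly like this:

Code:
// Minimal write-batching sketch: many small appends are coalesced into one
// large sequential write plus one fsync, which QLC handles far better than a
// stream of synced 4K writes. Error handling is omitted and records are
// assumed to be smaller than the chunk size.
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <vector>

class BatchedWriter {
    int fd_;
    std::vector<char> buf_;
    size_t used_ = 0;
public:
    explicit BatchedWriter(const char* path, size_t chunk = 64u << 20)  // 64 MiB, arbitrary
        : fd_(::open(path, O_WRONLY | O_CREAT | O_APPEND, 0644)), buf_(chunk) {}

    void append(const void* data, size_t len) {
        if (used_ + len > buf_.size()) flush();      // spill when the chunk is full
        std::memcpy(buf_.data() + used_, data, len);
        used_ += len;
    }

    void flush() {                                   // one big sequential write, one fsync
        if (used_ == 0) return;
        ::write(fd_, buf_.data(), used_);
        ::fsync(fd_);
        used_ = 0;
    }

    ~BatchedWriter() { flush(); ::close(fd_); }
};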

We do have multiple ~110TB arrays built out of 30TB 5316s. Having that much storage directly on CPU lanes has worked wonders for us, all for ~11k euros of storage per array. Sadly, prices have changed since the last flash glut.
 

nabsltd

Well-Known Member
Jan 26, 2022
721
511
93
(Or did you think that the 134 PB was the total amount of storage needed?
I think I just gave up on trying to parse the rambling "I've got this super-secret project and I don't want you to have details that allow you to duplicate it, but I do want you to help me design it" to see that the data was being generated and then tossed after each "project" finished, so only one set of drives was needed.
 

fatherboard

Member
Jun 15, 2025
37
1
8
Intel/Solidigm 5336 and 5316 are not exactly the fastest drives, because they are QLC.

In fact, the 5316 is slower than a good SATA SSD at random 4K writes, and that might become a problem: if you have multiple threads writing many files that need syncing, metadata updates, etc., each of those writes will eat into performance.

Where they shine is huge flash capacity and decent read performance: bulk write-once, read-many storage that provides orders of magnitude more read IOPS than spinning rust of comparable capacity would.
→ "Write-once, read-many" is exactly what Content 2 is, so it will get U.2 QLC; no need for TLC.


We do have multiple ~110TB arrays built out of 30TB 5316s. Having that much storage directly on CPU lanes has worked wonders for us.
→ Great to hear. This is exactly the wonderland I’m currently building, but just about 10 times bigger. Could you please expand more on “has worked wonders for us”?
 

fatherboard

Member
Jun 15, 2025
37
1
8
Some suggest a layered approach where the NVMe drives get the incoming data and then hand it over to slower, cheaper drives, because there is time between writes.
If I understand it right, NVMe drives haven't been tested for this yet; it was done with slower SSDs.
Assuming there is no ready-made, off-the-shelf hardware solution today, would a simpler approach work just as well?

Line up layers of drive arrays: Fastest-Small, Fast-Mid-size, Slow-Large.
  1. Fastest-Small: expensive NVMes are the first to receive the written data.
  2. Fast-Mid-size: a small pool of written files, a sort of temporary place that relieves the Fastest tier and makes room in it; not designed to hold all the files at once, just e.g. 50 files.
  3. Slow-Large: the final destination; all the files end up here.

Use a C++ program to identify and move files (a rough sketch of what I mean is at the end of this post):
For write: Fastest → Fast-Mid-size → Slow-Large
For read: Fastest ← Fast-Mid-size ← Slow-Large

The problem is that it will impact:
- RAM, which I would rather leave for the data being processed
- CPU, which I would rather have doing HPC than busy checking and managing file transfers.

Does this make sense?
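
For what it's worth, here is a minimal sketch of what that mover program could look like; the mount points and the 80% high-water mark are placeholders I made up, and a real version would also need to skip files that are still being written:

Code:
// Demote the oldest files from the fast tier to the next tier once the fast
// tier passes a high-water mark. Paths and the 80% threshold are made-up
// placeholders; run it periodically (e.g. from cron or a timer thread).
#include <algorithm>
#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

void demote_oldest(const fs::path& fast_tier, const fs::path& next_tier,
                   double high_water = 0.80) {
    const auto usage = [&] {
        const auto s = fs::space(fast_tier);
        return 1.0 - static_cast<double>(s.available) / static_cast<double>(s.capacity);
    };
    if (usage() < high_water) return;                 // fast tier still has room

    // Gather regular files, oldest first.
    std::vector<fs::directory_entry> files;
    for (const auto& e : fs::directory_iterator(fast_tier))
        if (e.is_regular_file()) files.push_back(e);
    std::sort(files.begin(), files.end(),
              [](const fs::directory_entry& a, const fs::directory_entry& b) {
                  return a.last_write_time() < b.last_write_time();
              });

    // Copy down and delete until we are back under the watermark.
    for (const auto& e : files) {
        fs::copy_file(e.path(), next_tier / e.path().filename(),
                      fs::copy_options::overwrite_existing);
        fs::remove(e.path());
        if (usage() < high_water) break;
    }
}

int main() {
    demote_oldest("/mnt/tier0_nvme", "/mnt/tier1_mid");   // hypothetical mount points
    return 0;
}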
 
Last edited:
  • Haha
Reactions: itronin

fatherboard

Member
Jun 15, 2025
37
1
8
Why not?

It's just:
250a + 250b + 250c = 750 files in total,
written sequentially over 24 hours x 30 days = 720 hours → about 1 file per hour on average.
The first files could take tens of minutes and the last files several hours, but I average to keep it simple.

1 guy, 1 project, 1 file per hour, why not? I haven't invented anything, I've copied the approach quoted below and made it manual (with C++).
Let's not forget: it's one guy, it's not meant for any wider deployment, no company, no boss, no stress; if it works, great, and if not, no big deal.

You won't be writing at 100Gbps 24/7/365. You will be writing in bursts, which should easily be absorbed by a much smaller amount of NVMe and then flushed to SATA/SAS SSD or spinning disk.

Using a properly configured clustered storage solution, you could have 1000TB+ of storage, most of it spinning disk, and it would handle write speeds of 5-7GB/sec 24/7/365. If you went with SAS SSD as the final storage instead, you could likely handle writes at 30GB/sec 24/7/365.
 

kapone

Well-Known Member
May 23, 2015
1,392
825
113
Why not?

It's just:
250a + 250b + 250c = 750 files in total,
written sequentially over 24 hours x 30 days = 720 hours → about 1 file per hour on average.
The first files could take tens of minutes and the last files several hours, but I average to keep it simple.

1 guy, 1 project, 1 file per hour, why not? I haven't invented anything, I've copied the approach quoted below and made it manual (with C++).
Let's not forget: it's one guy, it's not meant for any wider deployment, no company, no boss, no stress; if it works, great, and if not, no big deal.
I think it's been mentioned before: you're playing with things you know very little (or nothing) about.

I suggest hiring a consultant.
 

pimposh

hardware pimp
Nov 19, 2022
397
227
43
Processing
Step 1: only Content 1 is being worked on (all the files). Generate 1a, read 1a to generate 1b, then read 1a to generate 1c.
Step 2: only Content 2 is being worked on (just a few files)
Step 3: All content 1 + a selected few from Content 2 are being worked on at the same time.

Users: for now just one, myself.
Step 1: uploaded video gets recoded into 720p / FHD / 4K
Step 2: meta/thumbnails are processed
Step 3: user captions are added

OnlyFaps2!! Is the answer?

The only odd thing is that there will be just one user watching it. Is that really the case?

Sorry just could not resist xD