Tiered storage sizing recommendations


nabsltd

Well-Known Member
Jan 26, 2022
422
284
63
In a system with about 15TB of data, growing at about 0.5TB/year, with about 500GB of that data being the "working set" at any given time but the actual active data changing as users do different things, and about 20-50GB of essentially "wired" data (metadata, indexes, etc.) that is always part of the working set, what would be a good size for the "fast" tier (likely NVMe) when the slow tier is spinning rust?

I'm looking at enterprise NVMe, but the chassis is a 2U without U.2 support, so I probably have to go with M.2. Enterprise M.2 is much more expensive per GB than U.2, so I'd like to keep the size as small as possible but still give a decent cache hit ratio. If I have to, I could find space in the chassis for U.2, but I'd probably only do that if I needed 4TB or more of fast tier.

Although I am very likely going to use Linux for this, any sizing pointers based on experience with other environments would be appreciated.

Thanks.
 

nexox

Well-Known Member
May 3, 2023
673
278
63
When the working set changes, is that mostly sequential reads from the slow array? Same question for the writes of new data; and if it's not very sequential, are you planning to run the fast tier in writeback or writethrough mode? If your slow-tier access is mostly sequential, then you probably don't need much more than the 500GB working set for the fast tier, except that's awfully small for a modern SSD, so a 1TB mirror is probably a safe bet. If you can pin the wired data in RAM (I'm pretty sure there's a way or three to do this on Linux, but I don't recall any details), that will take some IO load off the SSDs.

Unfortunately this sort of thing is strongly workload dependent, so you kind of need to run it and see how it works for you, or just go the easy route and spend more money to make sure you definitely have enough capacity.
 

nabsltd

Well-Known Member
Jan 26, 2022
422
284
63
Unfortunately this sort of thing is strongly workload dependent, so you kind of need to run it and see how it works for you, or just go the easy route and spend more money to make sure you definitely have enough capacity.
RAM is 96GB, so I should be able to do some cache tuning to lean more towards MFU instead of MRU. Unfortunately, I don't think I can (or even want to) do it at the application level, since things like DB indexes and temp tables could starve out other RAM if I pick the wrong numbers, and I'd rather not tune every app. Linux kernel tuning would hopefully be good enough, and a one-place-to-change solution.

For me, it's the fact that enterprise M.2 drives bigger than about 3.84TB either don't exist or are insanely expensive. And I'd prefer not to even go to 3.84TB, since 1.92TB drives are almost exactly half the price. So, I was thinking of 2x 1.92TB in a mirror. If that falls into the "more than I need" category, that's OK, because it's quite affordable. I could also get another pair at a later date and still end up at the same price as starting with 2x 3.84TB.

But, if I hit the "gotta have more than 4TB usable", then it's U.2 drives, which I can't just duct tape to the case because of heatsinks. I'd like to know that now, so I can plan on some kind of mounting system.
 
Last edited:
  • Like
Reactions: nexox

nexox

Well-Known Member
May 3, 2023
673
278
63
A proper database engine should be able to manage its own memory usage for metadata, but that also sounds like random IO. I would hope that 1.92TB usable would be sufficient, unless you have four or more entirely different 500GB working sets that must all shuffle around without pauses.
 

CyklonDX

Well-Known Member
Nov 8, 2022
848
279
63
Single-server tiered storage is a bit troublesome, as it leaves you with a limited amount of customization and storage.
It's also important what kind of application you plan on using.
(You will either have to use paid software to do tiered storage, like NetApp, or run with ZFS if you plan on low-level tiered storage.)


In one setup using elasticsearch/opensearch, I have 6 servers:

T1 Super Fast Hist
2x 1U boxes, 256G RAM, with 2x SATA SSDs for the OS and write-intensive NVMe in software RAID10 (capacity 8T)

T2 Fast Hist
2x 2U boxes, 256G RAM, with 2x SATA SSDs for the OS and 20 (+2 spare) write-intensive SAS SSDs (capacity 32T)

T3 Old Hist
2x 2U boxes, 768G RAM, with 2x SATA SSDs for the OS and 12x 3.5" SAS HDDs (capacity 144T)

The distribution of indices is based on tags. Every day, as the index for each data feed is created, it gets a tag assigned based on its creation date. An additional task runs weekly (over the weekend), looks up the tags, and changes them from T1 to T2 and from T2 to T3 depending on how old the index is. This causes the data to move down the tiers: the old data is less likely to be accessed by customers, and since that tier is slower it needs more memory - that way, if more than one customer starts looking at, say, historical 'stocks' data, it won't kill the array.
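A minimal sketch of that weekly job (not our actual task - the Elasticsearch endpoint, the node attribute name "tier", and the age thresholds below are placeholders):

```python
#!/usr/bin/env python3
"""Sketch of a weekly re-tiering task: read each index's creation date,
compute its age, and rewrite its allocation tag so Elasticsearch relocates
the shards onto matching nodes. Endpoint, attribute name, and thresholds
are assumptions, not anyone's real configuration."""
from datetime import datetime, timezone
import requests

ES = "http://localhost:9200"        # assumed Elasticsearch endpoint
NOW = datetime.now(timezone.utc)

def target_tier(age_days: int) -> str:
    # Assumed thresholds: under a week on T1, under ~3 months on T2, rest on T3.
    if age_days < 7:
        return "t1"
    if age_days < 90:
        return "t2"
    return "t3"

# creation_date comes back as epoch milliseconds in the index settings.
settings = requests.get(f"{ES}/_all/_settings/index.creation_date").json()
for index, body in settings.items():
    if index.startswith("."):       # skip system indices
        continue
    created_ms = int(body["settings"]["index"]["creation_date"])
    age_days = (NOW - datetime.fromtimestamp(created_ms / 1000, timezone.utc)).days
    # Changing the allocation tag makes ES move the shards to nodes whose
    # node.attr.tier matches the new value.
    requests.put(
        f"{ES}/{index}/_settings",
        json={"index.routing.allocation.require.tier": target_tier(age_days)},
    )
```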



In more conventional setups, let's say ZFS or software tiered RAID, you will be much better off with SAS SSDs than U.2.
You will always use them for writes, so endurance will matter.
U.2 endurance isn't that great (topping out at around ~10PBW), while SAS SSDs hit around 8-35PBW for a lower or the same price.
(Not to mention that servers with U.2 backplanes will be more limited and more expensive.)
Performance-wise, a SAS SSD can hit 1950MB/s read and 1550MB/s write; and sure, U.2 disks can be faster (but are we really in the same $ space? U.2 drives that are much faster and have decent endurance are going to cost you $2-5k, e.g. the P5800X 800G), while a SAS SSD can cost below $1k and sometimes has 2-3x the endurance.

*Endurance is very important if you write a lot of data - more important than an extra couple of GB/s.
 

nabsltd

Well-Known Member
Jan 26, 2022
422
284
63
In more conventional setups, let's say ZFS or software tiered RAID, you will be much better off with SAS SSDs than U.2.
You will always use them for writes, so endurance will matter.
ZFS doesn't support true tiered storage.

As for endurance, even with a rotation of the "hot" data every day, that would only be 500GB written per day. On a 1.92TB drive with 3 DWPD over 5 years, that means I'll have to think about maybe starting to worry about the SSD 30 years from now. I suspect that by then, I'll be able to cheaply replace it with a 20TB drive with the same 3 DWPD to then last until I'm long dead.
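A quick sketch of that arithmetic, using only the numbers already in this thread:

```python
# Back-of-the-envelope endurance check for a 1.92TB, 3 DWPD (5-year) drive
# seeing ~500GB of writes per day. Inputs are the numbers from this thread,
# not a specific spec sheet.
capacity_tb = 1.92
dwpd = 3
warranty_years = 5
writes_per_day_tb = 0.5                 # ~500GB of "hot" data rewritten daily

rated_writes_tb = capacity_tb * dwpd * 365 * warranty_years   # ~10,500 TB (~10.5 PB)
years_to_wear_out = rated_writes_tb / (writes_per_day_tb * 365)

print(f"Rated endurance: ~{rated_writes_tb / 1000:.1f} PBW")
print(f"Years at 0.5 TB/day: ~{years_to_wear_out:.0f}")       # roughly 57 years
```

So if anything, 30 years is the conservative figure.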

You shrug off a 10PB total write drive, when that would allow you to re-write your entire "Old Hist" drives 70 times.
 
  • Like
Reactions: nexox

CyklonDX

Well-Known Member
Nov 8, 2022
848
279
63
Indeed, ZFS isn't great at that; a better solution would be GlusterFS or something else (maybe a paid solution) if you plan on running a single server.

As an example, the U.2 1.92T Samsung PM983 has only 2.7PBW of endurance; new, it potentially costs around $400
(from ServerSupply - from a vendor like Dell probably around $1.4k a pop).

A 1.92T U.2 rated at 3 DWPD over 5 years works out to roughly 10.5PBW of endurance, and such a disk will cost you around $1k+ a pop.
(The Micron 9200 MAX has 8.8PBW for the 1.6TB model ~ you can sometimes get it for a great price, but new they typically go for $800.)
(Make sure you check the endurance for your exact size and model number from the manufacturer - sellers/sites often write the spec of the higher-tier product on the lower-tier one out of laziness.)

If you wrote 500G a day you would get about 0.18PBW a year. The question is whether that's worth it if you used the PM983,
as there's performance degradation that becomes visible once you chew through some of your endurance.
Obviously it depends on the memory they used, the controller, etc., and on whether you use the disk's cache or do direct writes that bypass the disk cache, as databases do.

Obviously, if your budget allows it, 2x P5800X would do wonders; the 1.6T model has around 292PBW for some $4-11k, depending on the exact model (as there are, I believe, 3 models with the same name but different endurance) and where you buy it.
 

DaveLTX

Active Member
Dec 5, 2021
169
40
28
Indeed, ZFS isn't great at that; a better solution would be GlusterFS or something else (maybe a paid solution) if you plan on running a single server.

As an example, the U.2 1.92T Samsung PM983 has only 2.7PBW of endurance; new, it potentially costs around $400
(from ServerSupply - from a vendor like Dell probably around $1.4k a pop).

A 1.92T U.2 rated at 3 DWPD over 5 years works out to roughly 10.5PBW of endurance, and such a disk will cost you around $1k+ a pop.
(The Micron 9200 MAX has 8.8PBW for the 1.6TB model ~ you can sometimes get it for a great price, but new they typically go for $800.)
(Make sure you check the endurance for your exact size and model number from the manufacturer - sellers/sites often write the spec of the higher-tier product on the lower-tier one out of laziness.)
Actually, many, many drives that say 3DWPD are 1.92T drives overprovisioned down to 1.6T (and so on), and there are ways to overprovision any 1.92T drive yourself if the 1.6T DC-focused variant is not available (Samsung especially; not sure about the others).
 

CyklonDX

Well-Known Member
Nov 8, 2022
848
279
63
If you managed to get a PM983 (2.7PBW) overprovisioned down to 1T (from 1.92T), you would probably get around 5.4PBW of endurance - roughly 3 DWPD over 5 years if you count it against the reduced 1TB capacity, but still only about half of the ~10.5PBW a native 3 DWPD 1.92T drive is rated for.

As I recall (I would need to buy and open them to be 100% sure):
the U.2 PM983 1.92T uses 4x 500G memory chips
the U.2 Micron 9200 MAX 1.6T has 8x 480G memory chips
the U.2 Micron 9300 PRO 3.84T has 16x memory chips
 
Last edited:

nabsltd

Well-Known Member
Jan 26, 2022
422
284
63
A 1.92T U.2 rated at 3 DWPD over 5 years works out to roughly 10.5PBW of endurance, and such a disk will cost you around $1k+ a pop.
I have no idea where you are buying your disks that they cost that much.

You can get an Intel (Solidigm) DC P4610 at 6.4TB for $640. That's 36.54 PBW.

I also really don't see why you are advocating the P5800X to anybody with less than 200TB total storage. The 160TB/day endurance will be completely wasted.

With a proper tiered storage system that used NVMe for all intake and copied it automatically to spinning disk at leisure, almost everybody could get by with an endurance rating of around 1% of their total storage per day (and many with far less than that). Then set up some NVMe (with 3 DWPD endurance) sized to about 10% of total storage as a cache for reads, and you get the equivalent of an all-NVMe storage system for a fraction of the cost.
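Applied to the 15TB in this thread, that rule of thumb works out to something like this (just the percentages above; purely illustrative):

```python
# Rough sizing from the rule of thumb above: read cache ~10% of total
# capacity, intake endurance ~1% of total capacity written per day.
# Applied to this thread's 15TB; the percentages are the only inputs.
total_tb = 15.0

read_cache_tb = total_tb * 0.10         # ~1.5 TB of NVMe read cache
intake_writes_tb_day = total_tb * 0.01  # ~0.15 TB/day landing on the intake NVMe

print(f"Read cache: ~{read_cache_tb:.1f} TB")
print(f"Intake writes: ~{intake_writes_tb_day * 1000:.0f} GB/day")
```

Which is another way of saying the 2x 1.92TB mirror discussed earlier has headroom for both roles.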

This is what I would do if I was starting from scratch (and I did similar with gluster and 6x 4U SuperMicro chassis in the era before NVMe...a total of 480TB raw HDD and 24TB raw SSD, easily saturating the 40Gbps link to a single client), but I have one box with the chassis and motherboard already in place, and I have limited PCIe slots left, so I'm trying to do this the best way I can but without losing my current investment.

@nexox has given me the sense that I can do what I want using M.2 drives in PCIe cards and still be close to the "all NVMe" speed most of the time. The key now will be finding a truly automatic tiered storage system that doesn't have ongoing or per-TB costs. I can handle a one-time license (like Windows Server 2022 and using Windows storage spaces, although reviews on performance aren't great), but I don't want anything that is subscription or costs more if I replace the drives with bigger ones.
 

nexox

Well-Known Member
May 3, 2023
673
278
63
It's another decision where the outcome depends heavily on the specific workload, but it sounds like you might be able to get away with a simple block cache layer. Unfortunately my tiered storage experience was implemented at the application level, and I haven't actually experimented with the various Linux block cache options lately, so I have no useful suggestions on which way to go.
 

gea

Well-Known Member
Dec 31, 2010
3,163
1,195
113
DE
ZFS doesn't support true tiered storage.
This is true for tiered storage based on new/hot data vs old/cold data.

ZFS has a different approach, as it can "tier" data based on physical data structure, e.g. file size. This allows performance-critical small files (e.g. office files or metadata) to sit on fast "tier 1" storage while larger files (e.g. media files) stay on "tier 2".

While this cannot fully replace tiering based on moving data, it has the advantage that it works without the data movement that hurts storage performance.

How to set it up in ZFS:
1. Calculate the amount of small files vs. large files, where "small" means up to 128K.
2. Add one or more special vdev mirrors large enough to hold the smaller files.
3. Set the small blocksize (special_small_blocks) as the threshold. Smaller files are stored on the special vdev "tier", larger ones on spinning rust.
4. Set a recordsize larger than the small blocksize, e.g. 256K-1M.

As recordsize and small blocksize can be set per filesystem, this is very flexible.
Check the special vdev fill rate from time to time to see whether you need a larger or an additional special vdev, as it grows with your pool.
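A minimal sketch of steps 2-4 driven from Python (the pool name, dataset name, and NVMe device paths are placeholders - substitute your own, and size the special vdev from your step 1 numbers):

```python
#!/usr/bin/env python3
"""Sketch of the special-vdev setup described above. Pool/dataset names and
device paths are made-up examples, not anyone's real configuration."""
import subprocess

POOL = "tank"                                       # hypothetical pool name
DATASET = "tank/data"                               # hypothetical filesystem
SPECIAL_MIRROR = ["/dev/nvme0n1", "/dev/nvme1n1"]   # hypothetical NVMe pair

def run(*cmd):
    # Echo and run a zpool/zfs command, raising if it fails.
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Step 2: add a mirrored special vdev to hold metadata and small blocks.
run("zpool", "add", POOL, "special", "mirror", *SPECIAL_MIRROR)

# Steps 3 and 4: blocks up to 128K land on the special vdev, while the larger
# recordsize keeps big files as large records on the spinning rust.
run("zfs", "set", "special_small_blocks=128K", DATASET)
run("zfs", "set", "recordsize=1M", DATASET)
```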
 

zachj

Active Member
Apr 17, 2019
159
104
43
I for one think you’re barking up the wrong tree looking at true tiered storage at all for only 15tb.

hardware is a lot cheaper than man hours.

do the math on how much time it's going to take you and your peers to implement what you're talking about and multiply that by your average cost per head per hour - I bet you'll spend several times more just on the labor (let alone software licenses and support) than you'd spend on a simple single-tier all-flash array.

if you're hell-bent on having a combination of rust and flash, then have you considered just using SQL table partitioning and multiple filegroups? You can put logs and indexes on NVMe, you can put stale historical data on rust, you can put data on SATA/SAS flash - whatever arrangement you want. Doing it this way takes advantage of features that already exist in SQL and could be done in a few hours by any competent DBA, and on the storage side you just have two or three totally separate pools of storage that back slow/medium/fast LUNs presented to the server. If you're using something like ZFS already then yes, you should probably put a good SLOG device in your array.
 
Last edited:
  • Like
Reactions: tsteine

CyklonDX

Well-Known Member
Nov 8, 2022
848
279
63
You can get an Intel (Solidigm) DC P4610 at 6.4TB for $640. That's 36.54 PBW.
They do not sell it on the website. The one on newegg that is for $640 is being sold by some bait 'n switch company.
Another one from Amazon, under $500, was being sold by another deceptive company selling used products.


The cheapest I could actually find was a used one with over 95% endurance left, which costs around $950.

eBay sellers sell used ones for around $1.1k.
New sells for around $1.4k.

It's still interesting that it's so cheap ($1.1-1.4k; I would expect this disk to be priced at ~$2k).


(I also spotted something interesting: the ones branded as Solidigm P4610 seem to have a slightly different spec than the Intel ones - I couldn't find a proper spec sheet for the Solidigm part.)

One made by Intel:
SSDPE2KE064T8T

One made by Solidigm (and it's quite a bit cheaper - half the price):
SSDPE2KE064T801
 
Last edited:

nabsltd

Well-Known Member
Jan 26, 2022
422
284
63
The one on newegg that is for $640 is being sold by some bait 'n switch company.
You need to take the ratings about companies on places like the Newegg marketplace with a grain of salt. If you don't know who did the rating, you have no idea whether their claim is anywhere near accurate. I have friends who work in the "tear down the datacenter and sell the old stuff" business who have stopped selling through places like Newegg and eBay because too many people don't understand what they are buying. Why do you think every SAS hard drive selling on eBay now says something like "Not for Home Use"?

That said, there are a ton of these particular drives on eBay brand new (i.e., less than 100 power-on hours) for $640 or less. Most of them are drives that were part of some purchase for a project that then never happened. Or there are companies who buy systems with redundancy, hot backup systems, cold backup systems, and then spare parts that never get used because the first 3 layers never all fail. This results in things like drives with zero power-on hours (essentially new, but technically sold once), drives with a few hundred power-on hours and 50TB written (out of PB of rated endurance), and then the actual used drives that still have 80-90% of their lifetime left.

Selling that zero power on hour drive as "new" isn't really wrong...sure "open box" is a better description, but with an SSD that has no accessories to lose in the "open box", is there really a difference? Many of these resellers take all those drives out of clamshell plastic packaging and put them in smaller anti-static bags so they take up less space in inventory and are easier to ship. Then, some user orders one, sees it isn't in a pretty box, and posts a review that the drive wasn't "new" before checking the power on hours. That review often stays forever, even if the user later realizes their mistake.

I bought that same drive brand new, in original Intel packaging, for $300 from a Newegg marketplace vendor. I have also purchased used drives with thousands of power-on hours and PB written, just like the one review of NetworkWholesale:
purchased Refurbished SSD: powered on 6+ years and tens of millions read/write
Unlike that user, I understand that "tens of millions read/write" is not much compared to the 10PB endurance rating of the drives I bought.

Yes, if you buy drives like this truly brand new when they were first released, they cost $2K or more. But, many companies pretty much throw this stuff away after 3 years of use, just because they don't have space for a new rack or server, but can put a 15.36TB drive in the same bay that the 6.4TB filled. And, 99% of drives purchased are far, far above the endurance spec that was really required. So, you can get really good endurance for cheap.
 
  • Like
Reactions: CyklonDX and nexox

nabsltd

Well-Known Member
Jan 26, 2022
422
284
63
I for one think you’re barking up the wrong tree looking at true tiered storage at all for only 15tb.

hardware is a lot cheaper than man hours.
True, but part of it is the learning.

Websites like servethehome got me to understand that there were alternatives to throwing large bags of cash at Dell, HP, IBM, Cisco, etc. I don't have hundreds of TB, but I've worked places that do, and where my knowledge of cheaper alternatives saved 2-3x my annual salary for a single 3-month project. Being able to apply the knowledge from what I do with my 15TB of data to larger projects is where the man hours go.
 
  • Like
Reactions: mrpasc

Pakna

Member
May 7, 2019
50
3
8
Re: write-focused drives, it looks like the Solidigm D7-P5810 is the one to watch for: a NAND drive, 50 DWPD, PCIe 4.0 interface, U.2, 800 GB capacity. Though, aside from the faster interface, I really don't see the appeal over the Optane 905P, which has larger capacity, better endurance, lower latency, and half the price.
 

nabsltd

Well-Known Member
Jan 26, 2022
422
284
63
Re: write-focused drives, it looks like the Solidigm D7-P5810 is the one to watch for: a NAND drive, 50 DWPD, PCIe 4.0 interface, U.2, 800 GB capacity. Though, aside from the faster interface, I really don't see the appeal over the Optane 905P, which has larger capacity, better endurance, lower latency, and half the price.
The 905P has only 10DWPD spec (assuming the usual 5 years for the TBW value). Both drives have TBW specs, and the 960GB Optane 905P is 17.52PB, with the 800GB D7-P5810 at 73PB, so the Optane doesn't have better endurance. It's definitely better at price and availability, though.
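The conversion is easy to sanity-check (the PBW figures quoted above, with the usual 5-year warranty window assumed for both drives):

```python
# DWPD = total rated writes / (capacity * 365 days * warranty years).
# PBW figures as quoted above; a 5-year warranty is assumed for both drives.
def dwpd(pbw: float, capacity_tb: float, years: int = 5) -> float:
    return (pbw * 1000) / (capacity_tb * 365 * years)

print(f"Optane 905P 960GB: {dwpd(17.52, 0.96):.0f} DWPD")   # ~10
print(f"D7-P5810 800GB:    {dwpd(73.0, 0.80):.0f} DWPD")    # ~50
```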
 
  • Like
Reactions: nexox

DaveLTX

Active Member
Dec 5, 2021
169
40
28
If you managed to get a PM983 (2.7PBW) overprovisioned down to 1T (from 1.92T), you would probably get around 5.4PBW of endurance - roughly 3 DWPD over 5 years if you count it against the reduced 1TB capacity, but still only about half of the ~10.5PBW a native 3 DWPD 1.92T drive is rated for.

As I recall (I would need to buy and open them to be 100% sure):
the U.2 PM983 1.92T uses 4x 500G memory chips
the U.2 Micron 9200 MAX 1.6T has 8x 480G memory chips
the U.2 Micron 9300 PRO 3.84T has 16x memory chips
I just checked. You don't need to overprovision that much.
1.92T 1DWPD drives have about 15% overprovisioning.
3DWPD drives have 37% overprovisioning. You don't need to cut the drive in half to get 3 DWPD!
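That lines up with the raw-NAND arithmetic, assuming (and this is an assumption) that both capacity points ship with roughly 2 TiB of physical flash:

```python
# Overprovisioning = spare NAND relative to user capacity. A quick check of
# the 15% / 37% figures above, assuming both formats carry ~2 TiB of raw NAND.
raw_tb = 2048 * 1.073741824 / 1000      # 2 TiB of NAND expressed in TB (~2.2 TB)

for user_tb in (1.92, 1.60):
    op = (raw_tb - user_tb) / user_tb
    print(f"{user_tb:.2f} TB usable -> ~{op * 100:.0f}% overprovisioning")
# ~15% for the 1.92 TB format, ~37% for the 1.6 TB format
```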