high capacity, high throughput storage platform

Discussion in 'Linux Admins, Storage and Virtualization' started by aero, Mar 26, 2018.

  1. aero

    aero Active Member

    Joined:
    Apr 27, 2016
    Messages:
    310
    Likes Received:
    53
    Hello,

    I'm trying to get some ideas, from those that may have already conquered similar challenges, on building a high capacity storage system that is also capable of high throughput.

    The intended usage is for this storage server to house long term data, typically massive files (maybe ~200GB per day in 1 daily file). Periodically we will have a need for a bunch of compute nodes to read and process these files (each compute node processing a different daily file), probably accessing them via NFS. I need to be able to parallelize this as much as needed to maximize processing throughput.

    I envision the storage server having 40GbE connectivity, and I want to be able to fully saturate the connection, not be bottlenecked by disk IO.

    Due to the high capacity required (120TB) I'd like to see if it's possible to do with spinning disks.

    Recap of requirements:
    - capable of sustaining 40Gb/s throughput to multiple concurrent readers that are reading large files sequentially
    - 120TB of space
    - high redundancy (can't do raid0...thinking raid6/raidz2)
    - single chassis solution if possible
    - easily expandable
    - low cost

    I'm currently looking at 45- or 60-bay chassis, and trying to determine if ZFS can perform well enough.
    In theory, 2 ZFS raidz2 vdevs of 15 disks each would more than meet the performance goals for purely sequential reads (a rough back-of-envelope is sketched below), but I'm having trouble finding real-world performance numbers to back that up.

    I am also contemplating hardware raid controllers with raid6 or raid60 configurations.
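
    As a sanity check on that raidz2 layout, here's the back-of-envelope I'm working from, using purely assumed numbers (~200 MB/s sequential per 7200 rpm disk, with only the data disks of each vdev counting toward streaming reads):

        # Back-of-envelope sequential-read estimate for the proposed raidz2 layout.
        # Assumed, not measured: ~200 MB/s sequential per 7200 rpm disk.
        PER_DISK_MBPS = 200
        TARGET_MBPS = 5000  # 40GbE is roughly 5 GB/s

        def raidz2_sequential_mbps(vdevs, width):
            """Best case: only the data disks (width minus 2 parity) stream data."""
            return vdevs * (width - 2) * PER_DISK_MBPS

        est = raidz2_sequential_mbps(vdevs=2, width=15)
        print(f"2 x 15-wide raidz2: ~{est} MB/s vs {TARGET_MBPS} MB/s target")
        # ~5200 MB/s on paper, but only for purely sequential, uncontended reads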
     
    #1
  2. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    1,110
    Likes Received:
    371
    40Gb/s is 5GB/s... I'd say that's not going to be remotely easy to attain with platter drives and still hard to achieve with platter+SSDs. If you've got more than one user/node accessing files sequentially at the same time as other users/nodes then it's as close to random IO as makes no odds from the storage array's perspective.

    200GB/day doesn't sound like it's anywhere near that level of performance though...? Over 10GbE, assuming you can read the whole file at wire speed, you'd be able to read that entire file in less than three minutes which is considerably less time than a day, so is there a requirement that the nodes need to read this file within 1min of it being created/placed on the server, or is this just a Nice To Have? Would all client nodes be reading the same file, or multiple different 200GB files (this would have a big impact on any RAM and SSD caching you might implement)? Do the client nodes read the entire file and e.g. cache it into local RAM for processing, or will they be doing a whole load of random reads/writes to these files as they process them?
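
    (For reference, the wire-speed arithmetic behind that figure, ignoring protocol overhead:)

        # Time to read one 200 GB daily file at NIC wire speed (protocol overhead ignored).
        FILE_GB = 200
        for nic_gbps in (10, 40):
            gbytes_per_s = nic_gbps / 8           # 10GbE ~1.25 GB/s, 40GbE ~5 GB/s
            seconds = FILE_GB / gbytes_per_s
            print(f"{nic_gbps}GbE: ~{seconds:.0f} s (~{seconds/60:.1f} min)")
        # 10GbE -> ~160 s (about 2.7 min); 40GbE -> ~40 s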

    I'm not a ZFS expert but I'd shy away from using raidz2 and think more of the RAID10/mirrored vdev setup, so that you've got better random IO capability as well as saner rebuild times (rebuild times will vary wildly depending on several factors, but as an example the ~25TB raidz2 array we've got at work takes at least 12hrs to rebuild after throwing a shoe).

    (disclaimer: I am a rabid RAID10 fanboy)
     
    #2
  3. aero

    aero Active Member

    Joined:
    Apr 27, 2016
    Messages:
    310
    Likes Received:
    53
    A common scenario would be to read perhaps a couple months worth of daily 200GB files; client nodes reading multiple different files. The IO pattern of each node will be to sequentially read entire files and process them locally; no small random read/write to the source filesystem.

    Because of this, I see no way to benefit from ram or ssd caching.
     
    #3
  4. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,338
    Likes Received:
    784
    5 GB/s is not too easy, especially at low cost, if it's reachable at all under continuous load.

    Let's calculate.
    For a pure sequential workload you can count on 150-250 MB/s per disk, say 200 MB/s.
    For 5000 MB/s you need 25 disks in a RAID-0-like setup, in the best-case scenario.

    But there is no purely sequential workload besides copying a single video to an empty disk.
    With a higher fill rate, more users, or data spread over a RAID array, IOPS limit the throughput. Worst case: reading 4K data blocks that are fully scattered over the disk. When a single disk delivers around 80 IOPS, you are at 80 x 4K = 320 KB/s per disk. Your real values are between these extremes and depend on workload and pool IOPS.

    The IOPS of a RAID scale with the number of arrays, or vdevs in ZFS. On a RAID-1, read IOPS are 2x a single disk while write IOPS are like a single disk. On a RAID 5/6/Z1/Z2, IOPS are also like a single disk.
    A RAID 50/60, or a ZFS pool of 2 x Z1/Z2 vdevs, has the IOPS of two disks. The reason is that you must position the heads of every disk in the array to read/write data.

    A solution must offer enough sequential performance plus enough IOPS for your workload. I doubt you will reach 5 GB/s in a mixed workload, but the best chances come from a setup like:

    - 50 disks in a multi RAID-10 setup (25 mirror vdevs in ZFS)
    This would offer around 5 GB/s sequential write and 10 GB/s sequential read capacity.

    Such a config will offer 25 x 80 = 2000 write IOPS and 4000 read IOPS, as ZFS can read from both mirror halves simultaneously.
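
    A rough sketch of those scaling rules, using the same assumed per-disk numbers (200 MB/s and ~80 IOPS; real pools will differ):

        # Rough model of the throughput/IOPS scaling rules above.
        # Assumptions, not benchmarks: 200 MB/s and ~80 IOPS per 7200 rpm disk.
        PER_DISK_MBPS, PER_DISK_IOPS = 200, 80

        def mirror_pool(n_disks):
            vdevs = n_disks // 2
            return {
                "seq_write_MBps": vdevs * PER_DISK_MBPS,   # one copy per mirror pair
                "seq_read_MBps": n_disks * PER_DISK_MBPS,  # both halves serve reads
                "write_iops": vdevs * PER_DISK_IOPS,
                "read_iops": 2 * vdevs * PER_DISK_IOPS,
            }

        def raidz2_pool(vdevs, width):
            return {
                "seq_read_MBps": vdevs * (width - 2) * PER_DISK_MBPS,
                "iops": vdevs * PER_DISK_IOPS,             # ~one disk of IOPS per vdev
            }

        print(mirror_pool(50))      # ~5000 MB/s write, ~10000 MB/s read, 2000/4000 IOPS
        print(raidz2_pool(2, 15))   # ~5200 MB/s sequential read, but only ~160 IOPS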

    For your wanted capacity of 120TB you can use 7200 rpm disks of at least 3TB each, e.g. 4TB HGST.
    You can use a 60-bay HGST or SuperMicro toploader case. Maybe 20 x 6TB HGST He6 are fast enough if you can reduce your sequential demand.

    The fastest ZFS OS is Oracle Solaris (11.3 or 11.4 beta) with native ZFS.
    If you want to check what's possible, start with this and then optionally try an Open-ZFS alternative like OmniOS (a free Solaris fork) or a BSD-based solution.

    As you need an expander for that many disks, use a 12G SAS HBA (e.g. 2 x LSI 9300-8i with 4 x miniSAS to the backplane). Plan for at least 64 GB RAM, better more, as you need the RAM for caching metadata, small random reads, and small writes.
     
    #4
    Last edited: Mar 26, 2018
  5. cesmith9999

    cesmith9999 Well-Known Member

    Joined:
    Mar 26, 2013
    Messages:
    1,120
    Likes Received:
    337
    You can have any two of this list: capacity, throughput, low cost.

    You mention low cost... what is your budget? That is usually the single largest driver in design.

    Easily expandable usually means not a single-chassis design; usually you want more slots or an expansion JBOD.

    Chris
     
    #5
  6. aero

    aero Active Member

    Joined:
    Apr 27, 2016
    Messages:
    310
    Likes Received:
    53
    What I'm hoping is that there is some optimal number of clients performing large block reads such that the workload is closer to true sequential than 4k random.
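
    As a toy model (illustrative numbers only: ~10 ms average seek, ~200 MB/s media rate), the per-disk throughput depends far more on the size of each individual read than on the number of interleaved streams:

        # Toy model: effective per-disk throughput when sequential streams interleave.
        # Illustrative assumptions only: ~10 ms average seek, ~200 MB/s media rate.
        SEEK_S = 0.010
        SEQ_MBPS = 200

        def effective_mbps(chunk_mb):
            """Each chunk costs one seek plus its transfer time."""
            return chunk_mb / (SEEK_S + chunk_mb / SEQ_MBPS)

        for chunk in (0.004, 1, 16, 128):
            print(f"{chunk:>7} MB reads: ~{effective_mbps(chunk):6.1f} MB/s per disk")
        # 4K reads collapse to well under 1 MB/s; 16-128 MB reads stay near 200 MB/s,
        # however many clients take turns, as long as each individual read is large.
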
    Cost-wise, I'm trying to keep it under $20k.

    For expansion I envision a few possibilities:
    - a large enough single chassis that I start out half-populated, and add more vdevs as necessary
    - JBODs attached to a server instead of an all-in-one, so I can add more JBODs to expand
    - a scale-out architecture layered on top, like Lustre/Gluster
     
    #6
  7. MiniKnight

    MiniKnight Well-Known Member

    Joined:
    Mar 30, 2012
    Messages:
    2,958
    Likes Received:
    867
    You're also going to have to do kernel tuning to get 40Gb speeds. Most stuff works at 10GbE and usually 25GbE, but at 40GbE there's so much data moving through the CPU that vanilla installs break.
     
    #7
    JDM and Patrick like this.
  8. aero

    aero Active Member

    Joined:
    Apr 27, 2016
    Messages:
    310
    Likes Received:
    53
    Will I get better performance from the same disk layouts, but utilizing HW RAID controllers + LVM + ext4, compared to ZFS?
     
    #8
  9. mrkrad

    mrkrad Well-Known Member

    Joined:
    Oct 13, 2012
    Messages:
    1,243
    Likes Received:
    51
    I've found newer-generation hardware (i.e. DL380 G10/G9) can sustain far more IOPS than trying to utilize older-generation gear.

    I just moved from all G6/G7, consolidating 12 servers down to 4 DL360 G9s, and now my VMs actually push 10/20 Gbps of data rather than humming along at 2.5/5/8 Gbps max.

    I guess using 5-10 year old hardware doesn't quite cut it anymore!

    FWIW I don't get near 40Gb/s in disk I/O from 8 Samsung SM863a's in RAID-5 (reading!), so you are going to need a helluva RAID controller or efficient software RAID to push that much linear bandwidth!
     
    #9
  10. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    1,110
    Likes Received:
    371
    In my experience, you get better scalability, economy and recovery by using SW RAID through one or more HBAs.

    Put it this way - a single HW RAID controller is limited to, say, 2GB of cache and, say again, 8 SAS ports. A high-end file server might have 256GB of RAM entirely usable as cache, and several HBAs/SAS expanders, and CPUs from the last decade have been able to deal with this stuff without breaking a sweat - even when you chuck SSDs into the mix.

    I say this as someone who religiously uses mdadm+LVM+ext4 at home - the "overheads" of ZFS are generally quite overstated. ZFS only needs gobloads of RAM if you're enabling every feature under the sun - dedupe being the big one in terms of RAM which, given your stated workload, doesn't sound like it would be remotely beneficial to you. The CPU overhead from checksums should be below that of a single thread on most modern systems, although parity calculations at ~5GB/s would likely be very taxing - but then if you go for ZFS RAID10 you've got no parity calculations to worry about.
     
    #10
    JDM and T_Minus like this.
  11. pricklypunter

    pricklypunter Well-Known Member

    Joined:
    Nov 10, 2015
    Messages:
    1,553
    Likes Received:
    442
    That will be one helluva system when you get it put together :)
     
    #11
  12. Evan

    Evan Well-Known Member

    Joined:
    Jan 6, 2016
    Messages:
    2,931
    Likes Received:
    460
    I know a way to get that throughput and decent random, but it's a 180 x 3TB solution (230TB or so usable after mirrors and spares).
    I can't see any way to get the 40Gb performance for $20k.

    Looking at others' suggestions, the HGST He6/He8 would be a decent drive option at higher capacity, but you may just get better bang for the buck with compression and deduplication on 80TB of 2TB SSDs?? With careful shopping it may be possible...
     
    #12
  13. MiniKnight

    MiniKnight Well-Known Member

    Joined:
    Mar 30, 2012
    Messages:
    2,958
    Likes Received:
    867
    Large sequential reads are hard to cache too. Any more info on whether the reads are spread out over the 120TB, or is it more like there's 3-4TB that's "hot" and the rest is archive?
     
    #13
  14. aero

    aero Active Member

    Joined:
    Apr 27, 2016
    Messages:
    310
    Likes Received:
    53
    Yeah, it would be very hard to get any benefit from read cache unless it was "primed" in some fashion. The extra difficulty is that I would consider probably the most recent 12TB or 24TB "hot"... so a lot...
    If it were just a couple TB then priming an L2ARC would be pretty easy and inexpensive.
     
    #14
  15. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    1,110
    Likes Received:
    371
    If you think read cache will be of precious little benefit - and you really do need to keep 12-24TB "hot" - then you're better off putting your money into making your spindles fast, I think. But for that you're talking about way more than 60 bays and you're straying more into full-on SAN territory. What hardware are you currently using for this workload and what analysis have you done on it so far?

    The SANs I've got at work can just about manage 6GB/s over fibre, but you're talking several systems doing the pulling there, as well as shelves and shelves of SAS, SSD and big oojmaflip RAM caches in the header nodes spread over multiple racks.

    A chassis kitted out with 45 x 4TB SSDs in RAID6 would give you the 120TB of space plus extra (since you wouldn't use a single RAID6, rather multiple arrays joined together or multiple raidz2 vdevs), and on paper at least it would be able to keep a 40Gb pipe filled, but it's a million miles away from what I'd call low cost... and I don't think 45 x 4TB platter disks would have a hope in hell of sustaining a mostly-random workload.
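
    For the capacity side at least, a quick sketch of what 45 x 4TB split into raidz2 vdevs would yield (illustrative widths only; ignores ZFS overhead and the usual keep-some-free guidance):

        # Usable capacity for 45 x 4TB drives split into raidz2 vdevs.
        DRIVE_TB, BAYS = 4, 45

        def layout(width):
            vdevs = BAYS // width
            usable = vdevs * (width - 2) * DRIVE_TB
            return vdevs, usable, BAYS - vdevs * width

        for width in (9, 11, 15):
            vdevs, usable, spare = layout(width)
            print(f"{vdevs} x {width}-wide raidz2: ~{usable} TB usable, {spare} spare bays")
        # 5 x 9-wide -> ~140 TB; 4 x 11-wide -> ~144 TB (1 spare); 3 x 15-wide -> ~156 TB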

    As per previous comments, does the data respond well to compression?
     
    #15
  16. aero

    aero Active Member

    Joined:
    Apr 27, 2016
    Messages:
    310
    Likes Received:
    53
    I'm currently *not* running this workload, it's not ready yet. I also have no system available to even store a fraction of the data at this point.

    The data is highly compressible...I'm seeing roughly 6:1 compression ratio with gzip, and the storage figures I've given so far are actually assuming the data is already compressed.
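
    (For what it's worth, a quick way I'd sanity-check that ratio on a sample file with Python's standard gzip module; the filename is just a placeholder:)

        # Quick compression-ratio check on a sample file (filename is a placeholder).
        import gzip, os, shutil

        src = "sample_daily_file.dat"
        dst = src + ".gz"

        with open(src, "rb") as fin, gzip.open(dst, "wb", compresslevel=6) as fout:
            shutil.copyfileobj(fin, fout)

        print(f"gzip ratio: {os.path.getsize(src) / os.path.getsize(dst):.1f}:1")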
     
    #16
  17. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    1,110
    Likes Received:
    371
    So the 200GB files are already compressed...? I assume the clients already have a process for decompressing them, then? In which case filesystem compression won't help you any and you're just looking for raw storage and IO.

    If there's no pre-existing system, I assume the "fills a 40Gb pipe" is just a nice-to-have rather than an absolute requirement?
     
    #17
  18. aero

    aero Active Member

    Joined:
    Apr 27, 2016
    Messages:
    310
    Likes Received:
    53
    It's a nice-to-have, not a strict requirement...just more=better.
     
    #18
  19. Evan

    Evan Well-Known Member

    Joined:
    Jan 6, 2016
    Messages:
    2,931
    Likes Received:
    460
    In 2018 I couldn't buy anything but all-flash for production data (micro, as in a few-disk local data store, or massive scale, as in PBs of SAN), except in rare cases like this one where the workload may not make sense for flash. Just the same, can you start with, say, 24 x 2.5" in 2U filled with 2TB or greater SSDs, and grow as you need later?
     
    #19
  20. K D

    K D Well-Known Member

    Joined:
    Dec 24, 2016
    Messages:
    1,415
    Likes Received:
    301
    The OP mentions "low cost" as a requirement, but having a sense of what the budget is would be of help. $10,000 vs $100,000 makes a difference.
     
    #20