Anyone doing big (24 disk) SSD arrays?


matt_garman

Active Member
Feb 7, 2011
226
61
28
Anyone out there have any experience with really big (say 24 disk) SSD arrays? With SSD prices on a downward trend, I would expect to see more ideas like the Dirt Cheap Data Warehouse (DCDW) showing up... but I'm either looking in the wrong places, or no one is really doing this. Or they're doing it but not talking about it!

We currently have a big iron filer backing a research/simulation platform. It's 99.9% reads, effectively a WORM workload. The reads are sequential, but there are hundreds running in parallel, across roughly 30 TB of data.

The DCDW got me thinking that we can probably build our own consumer-SSD-based system for considerably less than the big-name vendors are charging. I'm thinking something along the lines of a 24x 2.5" chassis (e.g. Supermicro SC216) and filling that with 2TB Samsung EVO drives. I haven't yet decided on which CPU(s), motherboard, how much RAM, or even the right HBA/RAID controller to use (suggestions welcome!). I'm actually thinking I might be able to get away with software RAID. Two 12-disk RAID-6 arrays striped with RAID-0 (RAID-60) seems like a good starting point for experimentation.
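As a rough sizing sketch for that RAID-60 layout (assuming roughly 1.82 TiB of formatted capacity per nominal 2TB drive; nothing here is measured, just back-of-the-envelope):

```python
# Back-of-the-envelope sizing for the proposed RAID-60 layout:
# two 12-drive RAID-6 groups, striped together with RAID-0 on top.
# Assumes ~1.82 TiB of formatted capacity per nominal 2TB SSD.

DRIVE_TIB = 1.82        # approx. formatted TiB per 2TB drive (assumption)
GROUP_SIZE = 12         # drives per RAID-6 group
GROUPS = 2              # RAID-6 groups striped together
PARITY_PER_GROUP = 2    # RAID-6 spends two drives' worth of parity per group

raw = DRIVE_TIB * GROUP_SIZE * GROUPS
usable = DRIVE_TIB * (GROUP_SIZE - PARITY_PER_GROUP) * GROUPS

print(f"raw capacity:    {raw:5.1f} TiB")              # ~43.7 TiB
print(f"usable capacity: {usable:5.1f} TiB")           # ~36.4 TiB
print(f"parity overhead: {1 - usable / raw:.0%}")      # ~17%
print("fault tolerance: any two drives per 12-drive group")
```

If software RAID works out, I'd expect that to be two md RAID-6 devices with an md RAID-0 (or LVM stripe) over them, but I haven't validated any of that yet.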

If such a system could support a decent number of clients (say 10+ compute nodes), then I could just build a bunch of these DCDW-type systems and replicate the data across them. Kind of a poor man's cluster.

Just throwing this out there to see if anyone has any experience with anything similar they're willing to share. Any thoughts on gotchas or potential pitfalls or ideas in general are also welcome!

Thanks!
 
  • Like
Reactions: Chuckleb

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,751
2,129
113
I would wait unless you need it NOW. I feel we're at a tipping point where capacity may jump a lot soon, in which case NVMe will be even more affordable, and instead of 24 drives you may only need 1 or 2 NVMe drives... less space, heat, and cost.

I personally have never needed high capacity, so for me 8 or 16 SATA SSDs vs. 2-4 NVMe drives is already not cost effective... I believe NVMe will catch up to larger capacity needs very soon. I've been doing the SSD shuffle myself.
 

Patrick

Administrator
Staff member
Dec 21, 2010
12,533
5,854
113
Yeah, I started building a cluster for an all-SSD Ceph setup using 1TB drives. Totally doable. It is in the datacenter; I just need to get a few minutes to actually set it up on the software/config side.
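For anyone sizing a similar build, the usable-capacity math is worth keeping in mind. A minimal sketch, assuming a triple-replicated pool (the common default) and a hypothetical node/drive count:

```python
# Usable-capacity sketch for a replicated all-SSD Ceph pool.
# Node and drive counts are hypothetical placeholders; a replication
# factor of 3 (the common default for replicated pools) is assumed.

NODES = 4              # hypothetical number of OSD hosts
DRIVES_PER_NODE = 8    # hypothetical 1TB SSDs per host
DRIVE_TB = 1.0
REPLICATION = 3        # size=3 replicated pool (assumption)

raw_tb = NODES * DRIVES_PER_NODE * DRIVE_TB
usable_tb = raw_tb / REPLICATION

print(f"raw:    {raw_tb:.0f} TB")                                             # 32 TB
print(f"usable: {usable_tb:.1f} TB (before leaving headroom for rebalancing)") # ~10.7 TB
```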
 
  • Like
Reactions: Martin Jørgensen

dba

Moderator
Feb 20, 2012
1,477
184
63
San Francisco Bay Area, California, USA
Got your email Matt. My DCDW is humming along very well, and quite a few people have contacted me wanting to build similar rigs. Some have gotten back in touch to say that they actually did it. So far I haven't heard of anyone who built a similar system and didn't like the results.

SATA SSDs + HBAs are still a viable way to get high-performance storage on the cheap. That said, NVMe is definitely the future. 20 x 2TB SATA SSDs plus three excellent HBAs cost around $11K. The same amount of storage as NVMe is roughly $30K, but will provide much lower latency, even more throughput, and probably more real-world IOPS with much lower CPU utilization. Is that worth roughly 3x the cost? Yes for many, but not for everyone.
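To put rough numbers on that trade-off (per-TB figures derived from the ballpark costs above; these are illustrations, not quotes):

```python
# Ballpark $/TB comparison using the figures above.
# Prices are rough assumptions for illustration, not vendor quotes.

sata_cost = 11_000    # ~20 x 2TB SATA SSDs plus three HBAs (figure above)
nvme_cost = 30_000    # the same ~40TB raw as NVMe (figure above)
raw_tb = 20 * 2       # 20 x 2TB drives

print(f"SATA: ${sata_cost / raw_tb:,.0f} per raw TB")   # ~$275/TB
print(f"NVMe: ${nvme_cost / raw_tb:,.0f} per raw TB")   # ~$750/TB
print(f"NVMe premium: ~{nvme_cost / sata_cost:.1f}x")   # ~2.7x
```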

 
  • Like
Reactions: T_Minus

Patrick

Administrator
Staff member
Dec 21, 2010
12,533
5,854
113
I was going to put this in a post for this weekend but here are projections from Intel:

[Attached image: Intel projections for PCIe, SAS, and SATA SSDs]
 
  • Like
Reactions: T_Minus

Diavuno

Active Member
I was consulting on a similar heavy-read project about three years ago.
The guy who hired me had a single 72-drive chassis full of 1TB 7200 RPM WD Blacks.

We found that the RAID cards had a hard time with that number of drives. By far the most effective route was putting 6 drives in a RAID 5, mirroring those groups, then software-spanning the results into a goofy "RAID 1500".
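If I'm remembering the grouping right, the capacity math on that layout worked out roughly like this (a sketch, not exact figures from the build):

```python
# Rough capacity math for the nested "RAID 1500" layout described above:
# 6-drive RAID 5 groups, mirrored in pairs, then software-spanned.
# Drive count matches the 72 x 1TB chassis; exact grouping is from memory.

TOTAL_DRIVES = 72
DRIVE_TB = 1.0
R5_GROUP = 6                                     # drives per RAID 5 group

r5_groups = TOTAL_DRIVES // R5_GROUP             # 12 groups
usable_per_group = (R5_GROUP - 1) * DRIVE_TB     # 5 TB after parity
mirrored_sets = r5_groups // 2                   # 6 sets after pairing
spanned_tb = mirrored_sets * usable_per_group

print(f"usable: {spanned_tb:.0f} TB of {TOTAL_DRIVES * DRIVE_TB:.0f} TB raw")  # 30 of 72
```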

We've stayed in touch and I've consulted with him a few more times... eventually he did decide to upgrade to SSDs, after he finished testing a storage-pool-based box with a bunch of Samsung 845DCs about a year ago. He went with something off the shelf, from one of the big guys if I recall.

You may want to consider off the shelf or breaking it down to a few clusters.
 

matt_garman

Active Member
Feb 7, 2011
226
61
28
I would wait unless you need it NOW. I feel we're at a tipping point where capacity may jump a lot soon, in which case NVMe will be even more affordable, and instead of 24 drives you may only need 1 or 2 NVMe drives... less space, heat, and cost.
I agree with all that, but we need to at least have a definitive plan in place in a month or two. I think the beauty of the SSD approach now is that the fundamental concept scales pretty well. I can build an SSD system now as a stop-gap to buy some time, and maybe when I build the next system, NVMe will be a viable option. Or maybe it is now...


Got your email Matt. My DCDW is humming along very well, and quite a few people have contacted me wanting to build similar rigs. Some have gotten back in touch to say that they actually did it. So far I haven't heard of anyone who built a similar system and didn't like the results.
Got your reply, thank you! The above is very encouraging.


SATA SSDs + HBAs are still a viable way to get high-performance storage on the cheap. That said, NVMe is definitely the future. 20 x 2TB SATA SSDs plus three excellent HBAs cost around $11K. The same amount of storage as NVMe is roughly $30K, but will provide much lower latency, even more throughput, and probably more real-world IOPS with much lower CPU utilization. Is that worth roughly 3x the cost? Yes for many, but not for everyone.
Where are you sourcing 2TB SSDs? I was looking at the Samsung 2TB EVO drives; Newegg and Amazon have them for $750 each, so 20 of them would be $15K.

Also, may I ask which HBAs you're suggesting?

Likewise, what NVMe devices do you have in mind? I agree, they are the way forward, and at least from my currently superficial understanding, actually simplify things.


We found that the RAID cards had a hard time with that number of drives. By far the most effective route was putting 6 drives in a RAID 5, mirroring those groups, then software-spanning the results into a goofy "RAID 1500".
We've done some preliminary internal testing with small spinning-drive arrays (just because it's quick and easy, as we have the hardware on hand). We've found that, at least with spinning drives, arrays don't scale much past 5 or 6 drives, which is consistent with what you're saying. We also found some spinning-disk RAID benchmarks online that are consistent with this observation. We believe this is due to the IOPS limitations of spinning drives. If that's true, then SSDs ought to scale better in RAID.
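The back-of-the-envelope IOPS math supports that theory (the per-device numbers below are rough rule-of-thumb figures, not measurements from our hardware):

```python
# Rough aggregate random-read IOPS, HDD array vs. SSD array.
# Per-device figures are rule-of-thumb numbers, not measurements.

HDD_IOPS = 150       # ~7200 RPM SATA drive, random reads (rough)
SSD_IOPS = 90_000    # consumer SATA SSD, random reads (rough)

for drives in (6, 12, 24):
    print(f"{drives:2d} drives:  HDD ~{drives * HDD_IOPS:>9,} IOPS   "
          f"SSD ~{drives * SSD_IOPS:>9,} IOPS")
# Even if the controller caps out well below the theoretical SSD total,
# the SSD array has orders of magnitude more IOPS headroom to scale into.
```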


You may want to consider off the shelf or breaking it down to a few clusters.
Indeed. Our current data set is about 30 TB and grows at a rate of about 50 GB/day, which works out to roughly 13 TB/year. Some of that data is redundant, though, and the oldest data can be rolled off the system. The data set can also be easily and cleanly split in half. So, going the 24x 2TB route, that's 48 TB of raw capacity per server: two of those systems (48 drives total) would store the whole data set and provide plenty of room for future expansion.
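For planning purposes, the runway math looks something like this (assuming the data grows on business days only, roughly 260 a year, which is how the ~13 TB/year figure falls out, and assuming the RAID-60 layout from my first post):

```python
# Rough headroom estimate for the two-system (48-drive) configuration above.
# Assumes ~13 TB/year growth (50 GB per business day, ~260 days/year) and
# nominal RAID-60 capacity of 20 data drives per 24-drive system.

DRIVE_TB = 2.0
DATA_DRIVES_PER_SYSTEM = 20          # 24 drives minus 2 parity per 12-disk group
SYSTEMS = 2

usable_tb = DRIVE_TB * DATA_DRIVES_PER_SYSTEM * SYSTEMS   # ~80 TB nominal
current_tb = 30.0
growth_tb_per_year = 0.050 * 260                          # ~13 TB/year

headroom_years = (usable_tb - current_tb) / growth_tb_per_year
print(f"usable (nominal): {usable_tb:.0f} TB")
print(f"headroom: ~{headroom_years:.1f} years before we have to roll data off")  # ~3.8
```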

The intent is that one of these systems (48 drives total) would serve only a subset of our compute farm. We'd just build another identical (or better) system for the next batch of compute nodes. As we add compute nodes, we build these servers as needed, replicating the data between them. So each storage server is an independent entity, holding the exact same data set as its peers. As a whole, they are what I call a "poor man's cluster": clearly not a formal cluster with its inherent fanciness and complexity, just some servers coerced into a cluster-like collection through replication scripts and failover scripts for the client nodes.
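As a concrete sketch of what the replication side could look like (hostnames and paths below are made-up placeholders; this just pushes the mostly read-only data set from a primary copy to each peer with rsync):

```python
#!/usr/bin/env python3
"""Minimal replication sketch: mirror a mostly read-only data set from a
primary storage server to its identical peers using rsync.
Hostnames and paths are hypothetical placeholders."""

import subprocess
import sys

PEERS = ["ssd-store-02", "ssd-store-03"]   # hypothetical peer hostnames
SRC = "/data/research/"                    # hypothetical data set root
DST = "/data/research/"

def replicate(peer: str) -> int:
    """Mirror SRC to one peer over SSH; returns rsync's exit code."""
    cmd = [
        "rsync", "-a", "--delete",   # archive mode, remove files deleted upstream
        "--partial", "-v",           # resume interrupted transfers, be verbose
        SRC, f"{peer}:{DST}",
    ]
    print(f"replicating to {peer} ...")
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    failures = [peer for peer in PEERS if replicate(peer) != 0]
    if failures:
        print("replication failed for: " + ", ".join(failures), file=sys.stderr)
        sys.exit(1)
```

Client-side failover would be a similar small script that remounts from a surviving peer, but that part is still just an idea.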

Thanks for all the helpful replies everyone!
 
  • Like
Reactions: Chuckleb

Naeblis

Active Member
Oct 22, 2015
168
123
43
Folsom, CA
On a 24-port Intel server, here are the issues I had:
1) Throughput on RAID cards / backplanes is an issue: 12 drives at 6 Gb/s is the wall I hit per card. With the 16-port card, using 3 SSDs on each port (i.e. one wasted drive bay per port), out of 24 slots I get 18; the other 6 SSDs I put in a 5.25" cage on top of the power supply.
1a) Two or more RAID cards are needed.

2) I could not stomach the price hit of using RAID 10; all that SSD space wasted. I went with NVMe journal/cache drives instead. That means another PCIe card or two.

3) How do you get the I/O off the SAN? In my case I have 4 PCIe slots used, and the remaining 2 slots (dual-port 10G NICs) don't make the grade. I am looking into dual-port 40G cards (rough numbers sketched below).
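Rough numbers behind points 1 and 3 (per-drive and per-link figures are ballpark assumptions, not measurements from this box):

```python
# Ballpark bandwidth math behind points 1 and 3 above.
# Per-drive and per-link figures are rough assumptions.

SSD_MB_S = 500    # sustained sequential per SATA SSD (rough)
DRIVES = 18       # usable bays in the layout above

array_gb_s = DRIVES * SSD_MB_S / 1000
print(f"aggregate SSD bandwidth: ~{array_gb_s:.0f} GB/s")   # ~9 GB/s

# Network side: how much of that can actually leave the box?
for name, gbit in [("2 x dual-port 10GbE", 4 * 10), ("2 x dual-port 40GbE", 4 * 40)]:
    print(f"{name}: {gbit} Gb/s ≈ {gbit / 8:.0f} GB/s")
# The 10GbE option tops out around 5 GB/s; 40G links are needed to get
# anywhere near what 18+ SATA SSDs can deliver.
```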
 
  • Like
Reactions: Chuntzu

Patrick

Administrator
Staff member
Dec 21, 2010
12,533
5,854
113
On a 24-port Intel server, here are the issues I had:
1) Throughput on RAID cards / backplanes is an issue: 12 drives at 6 Gb/s is the wall I hit per card. With the 16-port card, using 3 SSDs on each port (i.e. one wasted drive bay per port), out of 24 slots I get 18; the other 6 SSDs I put in a 5.25" cage on top of the power supply.
1a) Two or more RAID cards are needed.

2) I could not stomach the price hit of using RAID 10; all that SSD space wasted. I went with NVMe journal/cache drives instead. That means another PCIe card or two.

3) How do you get the I/O off the SAN? In my case I have 4 PCIe slots used, and the remaining 2 slots (dual-port 10G NICs) don't make the grade. I am looking into dual-port 40G cards.
What are you using for this OS/ storage wise? E.g. is this a ZFS box, Linux/ Windows RAID, HW RAID on an OS? Gluster, Ceph? Other?
 

Naeblis

Active Member
Oct 22, 2015
168
123
43
Folsom, CA
What are you using for this OS/ storage wise? E.g. is this a ZFS box, Linux/ Windows RAID, HW RAID on an OS? Gluster, Ceph? Other?
The OS is Windows. We have progressed through multiple RAID cards to get to the 16-port Adaptec.
We have tried both HW RAID and just having the RAID card present the drives to the OS.

Now we are scrapping the Windows OS and Storage Spaces to go to a Linux build similar to "Building a high performance SSD SAN - Part 1" on smcleod dot net. Windows iSCSI seems to have too much overhead and does not scale well.

I still don't have the Mellanox cards in (got that recommendation in the deals section) to try the SMB/RDMA route.

A side point: RAID expanders, even when not expanding the number of ports, drop performance by roughly 25%.

Thom
 

Chuntzu

Active Member
Jun 30, 2013
383
98
28
I will have to post my experiences in detail in a separate post, along with some networking experiments (possibly in yet another post), but here's the Cliffs Notes version: with Windows Server 2012 R2 Storage Spaces and 48 SSDs, I have scaled to just over 2 million 4K IOPS, read and write, along with just under 24 GB/s sequential reads and writes, using dual E5-2680 v2s. I have only tested 4 x 40Gb InfiniBand and 4 x 40Gb Ethernet to multiple nodes using SMB with RDMA, and have saturated all 160 Gb of traffic. I have also hit just about the same speeds using 8 NVMe drives. I really like what Microsoft has done with the RDMA network stack.
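As a quick sanity check on those numbers against the network capacity (just arithmetic, treating the 4K random and sequential figures separately):

```python
# Quick sanity check of the figures above against 160 Gb/s of network links.

iops = 2_000_000
block_kb = 4
random_gb_s = iops * block_kb / 1_000_000   # ~8 GB/s at 2M 4K IOPS
seq_gb_s = 24                               # reported sequential throughput
network_gb_s = 160 / 8                      # 4 x 40Gb links ≈ 20 GB/s

print(f"2M x 4K IOPS    ≈ {random_gb_s:.0f} GB/s ({random_gb_s * 8:.0f} Gb/s)")
print(f"sequential      = {seq_gb_s} GB/s ({seq_gb_s * 8} Gb/s)")
print(f"network ceiling ≈ {network_gb_s:.0f} GB/s (160 Gb/s)")
# The random-IO figure fits within the links; the 24 GB/s sequential figure
# is slightly above what 160 Gb can carry, i.e. the network is the limit.
```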
 

PnoT

Active Member
Mar 1, 2015
650
162
43
Texas
I will have to post my experiences in detail in a separate post, along with some networking experiments (possibly in yet another post), but here's the Cliffs Notes version: with Windows Server 2012 R2 Storage Spaces and 48 SSDs, I have scaled to just over 2 million 4K IOPS, read and write, along with just under 24 GB/s sequential reads and writes, using dual E5-2680 v2s. I have only tested 4 x 40Gb InfiniBand and 4 x 40Gb Ethernet to multiple nodes using SMB with RDMA, and have saturated all 160 Gb of traffic. I have also hit just about the same speeds using 8 NVMe drives. I really like what Microsoft has done with the RDMA network stack.
I'd definitely be interested in your findings, as I've been in the midst of revamping our storage and Hyper-V environment. Our Dell Compellent just can't keep up with what we're trying to do, the fact that the product doesn't have dedupe of some sort is depressing, and when tier 3 (the spinners) gets hammered, the flash tiers (1 & 2) slow to a crawl.

I've pitched, and am starting to put together, an IB network leveraging RDMA/SMB and consumer-grade SSDs, for now and during testing.
 
  • Like
Reactions: Chuntzu

Naeblis

Active Member
Oct 22, 2015
168
123
43
Folsom, CA
I'd definitely be interested in your findings, as I've been in the midst of revamping our storage and Hyper-V environment. Our Dell Compellent just can't keep up with what we're trying to do, the fact that the product doesn't have dedupe of some sort is depressing, and when tier 3 (the spinners) gets hammered, the flash tiers (1 & 2) slow to a crawl.

I've pitched, and am starting to put together, an IB network leveraging RDMA/SMB and consumer-grade SSDs, for now and during testing.

We are setting that up now: NVMe + SSD + a remote LUN of HDDs for the third tier. The RDMA/SMB part has only thrown more chaos into the fire.
 
  • Like
Reactions: Chuntzu

TuxDude

Well-Known Member
Sep 17, 2011
616
338
63
I'd definitely be interested in your findings, as I've been in the midst of revamping our storage and Hyper-V environment. Our Dell Compellent just can't keep up with what we're trying to do, the fact that the product doesn't have dedupe of some sort is depressing, and when tier 3 (the spinners) gets hammered, the flash tiers (1 & 2) slow to a crawl.

I've pitched, and am starting to put together, an IB network leveraging RDMA/SMB and consumer-grade SSDs, for now and during testing.
I'm curious as to what you're doing to it that the Compellent can't keep up; if I remember right, an SC8000 controller pair should be capable of around 300,000 IOPS with enough disks behind it.

And I haven't heard anything about dedupe, but online compression for all tier-3 data is coming soon.
 

PnoT

Active Member
Mar 1, 2015
650
162
43
Texas
I'm curious as to what you're doing to it that the Compellent can't keep up; if I remember right, an SC8000 controller pair should be capable of around 300,000 IOPS with enough disks behind it.

And I haven't heard anything about dedupe, but online compression for all tier-3 data is coming soon.
There's a bunch of SQL and VM traffic eating away at our tier 3, and when the IOPS of the spinners reach their maximum, the entire Compellent starts to crawl, T1/T2 included. Even when data is on T1 (RAID 10, 1.6TB SSDs) or T2 (RAID 5, 1.6TB SSDs), I can typically pull only about 1.5 GB/sec; if T3 is being hammered, that drops to about 80 Mb/sec and sometimes dips to 20 Mb/sec. We've gone round and round with support, but they haven't given us a single idea of what's happening other than stating it's "background processing" that's slowing everything down. Like, WTF does that mean...
 

cesmith9999

Well-Known Member
Mar 26, 2013
1,431
483
83
I'm curious as to what you're doing to it that the Compellent can't keep up; if I remember right, an SC8000 controller pair should be capable of around 300,000 IOPS with enough disks behind it.

And I haven't heard anything about dedupe, but online compression for all tier-3 data is coming soon.
Every quote I saw from Compellent shorted us on disks, opting for larger SATA disks and less SAS/SSD.

With SSDs becoming larger soon, it is relatively pointless to have inline dedupe happen on anything other than SSDs. While it is a nice wish-list item, on spinning disks there is the "I must swing the heads to get at the data" problem...

Chris