Utterly Absurd Quad Xeon E5 Supermicro Server with 48 SSD drives


dba

Moderator
Feb 20, 2012
1,477
184
63
San Francisco Bay Area, California, USA
Warning: While interesting, this build won't be very practical for most folks. The build will also be - and this pains me greatly - made up mostly of parts for which I paid full price. Oh the shame!

Build Name: UberDB
Operating System/ Storage Platform: Windows and Solaris 11.1 - we'll see which one wins the database benchmarking war
CPU: Quad Xeon E5-4640 engineering sample revision C1
Motherboard: Supermicro X9QR7-TF-JBOD
Chassis: Supermicro SuperServer 4047R-7JRFT. I'll also use one of my Supermicro SC216 chassis that has been converted to JBOD.
Drives: 48 Samsung 120GB SSD drives, 16 1TB spinning rust drives, 1 120GB SSD boot drive. Will expand further, up to 72 drives, if the server can make good use of the extra disk IO.
RAM: 256GB DDR3-1600 ECC. This isn't much RAM, but with really fast disk IO I have not seen any benefits from going bigger in my usage scenarios.
Add-in Cards: 4x LSI 9300-8i, 2x LSI SAS2308 on the motherboard, 2x Intel 10GbE ports on the motherboard, 1x Mellanox dual-port QDR Infiniband. I will add up to 4 LSI 9207-8e cards if necessary.
Power Supply: 3x Supermicro 1,620 watt
Other Bits: Oracle ASM, Oracle 12c

Usage Profile: An all-SSD Oracle analytical database. A single "Data Warehouse" query can easily use all available IO, and quite a bit of CPU, so almost no investment in speed will go un-utilized on a daily basis.

More Information: My favorite server ever is the HP DL585 G7. It's absurdly fast, and I built it, thanks to eBay, for remarkably little money. True, my 16 CPU Dell c6145 cluster was faster for many (but not all) database queries, but the HP wins for overall productivity. Now, however, it's time to see if I can do better.

The only real bottleneck with the HP DL is IO. That may sound funny given my signature, but with a data warehouse there is no such thing as too much IO. The DL585 G7 has an insane 11 PCIe slots, all of which are x16 or x8, and sports a full four IO chips, but those PCIe slots are PCIe2. I'm hoping that PCIe3, directly wired to the CPUs as in the Xeon E5, can do better. If it can't then I'm out some serious coin.
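For reference on how much the slots themselves change things, here is a quick back-of-the-envelope sketch, assuming nominal lane rates and counting only the line-encoding overhead:

Code:
# Per-slot bandwidth, PCIe2 vs PCIe3, x8 slots. Nominal rates only; real-world
# protocol overhead will shave a bit more off both numbers.
PCIE2_LANE_GBIT = 5.0 * 8 / 10      # 5 GT/s, 8b/10b encoding  -> 4.0 Gbit/s usable per lane
PCIE3_LANE_GBIT = 8.0 * 128 / 130   # 8 GT/s, 128b/130b encoding -> ~7.9 Gbit/s usable per lane

def slot_gb_per_sec(lane_gbit, lanes=8):
    return lane_gbit * lanes / 8.0   # gigabits -> gigabytes

print("PCIe2 x8: ~%.1f GB/s" % slot_gb_per_sec(PCIE2_LANE_GBIT))   # ~4.0 GB/s
print("PCIe3 x8: ~%.1f GB/s" % slot_gb_per_sec(PCIE3_LANE_GBIT))   # ~7.9 GB/s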



The Supermicro SuperServer 4047R-7JRFT sounds expensive at $3,100 new. It's not as bad as it seems. You get a quad Xeon E5 motherboard, two embedded LSI SAS2308 disk controllers, a pair of Intel 10GbE ports, some massive power supplies, and 48 2.5" drive bays with sleds. Buy two 24-bay JBOD disk chassis and you'll spend $1,600 even if you buy used. Add another $500 for two LSI cards and $200 more for 10GbE and you are up to $2,300 already. That means you get a chassis, motherboard, and power supplies for the equivalent of $800. Not so bad.

The CPUs were $400 each. Not much of a deal compared to AMD 6128s and 6172s, or Xeon L5520s and L5639s, but a lot less than retail. This, the Mellanox card, and RAM are the items for which I was able to rely on eBay. The SSD drives were bought new, but long ago.

The four new LSI 9300-8i cards were $1,148 total. I could have saved a bit of money without compromising performance by using 9207-8i cards, but the 12G SAS might be useful in the future. The LSI 9207-8e cards are recycled from another server, as are the Mellanox card and the RAM.
 
Last edited:

dba

Moderator
Feb 20, 2012
1,477
184
63
San Francisco Bay Area, California, USA
The Supermicro appears to have set a record while doing some SiSoft Sandra benchmarks. Cache bandwidth on this monster averages 850 gigabytes per second across the whole test range and peaks at 2.61 terabytes per second.

While the CPU caches scream, main memory bandwidth is actually quite pathetic: 80GB/s, perhaps due to the odd Supermicro motherboard, which has eight DIMM slots each for two of the processors and four each for the other two. Four circa-2009 AMD 6128s can do better than this! I'll have to see if making the memory symmetric, or some other tweaks, can improve the results. My general rule of thumb is that you'll never get more disk IO than 1/4 of your memory IO, so this could be a big problem.

Update: Some tweaks and I'm at 88GB/s, while I need 100GB/s minimum and expected 120GB/s.
Update: Additional tweaks, including a BIOS update and making the memory entirely symmetrical, and I'm up to 97GB/s, which will have to be good enough.
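To put numbers on that 1/4 rule of thumb against this build (the ~500MB/s per SATA SSD figure below is an assumption, not a measurement):

Code:
# "Never more disk IO than 1/4 of memory IO" checked against the measured
# memory bandwidth and the 48-SSD array. Per-SSD read rate is assumed.
MEM_BW_GBS   = 97      # after the BIOS update and symmetric DIMMs (80 before tuning)
SSD_COUNT    = 48
SSD_READ_GBS = 0.5     # assumed ~500 MB/s sequential read per SATA SSD

disk_io_ceiling = MEM_BW_GBS / 4.0             # ~24 GB/s allowed by the rule of thumb
raw_ssd_reads   = SSD_COUNT * SSD_READ_GBS     # 24 GB/s of raw aggregate SSD reads

print("memory-implied ceiling: ~%.0f GB/s" % disk_io_ceiling)
print("raw SSD aggregate:      ~%.0f GB/s" % raw_ssd_reads)
# At 80 GB/s the ceiling (~20 GB/s) sat below the SSDs; at 97 GB/s they roughly match.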

Memory transactional throughput is seriously bad at 1MTPS. I think a BIOS update is in order.

Update: I can't improve my results here; dual Xeon E5 rigs easily double my results. This may be an unavoidable Intel quad-CPU penalty.
 
Last edited:

PigLover

Moderator
Jan 26, 2011
3,186
1,545
113
It'd probably pain you even more if you hadn't found ES E5-4640s to light it up with. Retail C2s run almost 10x what you probably paid for those. And then the cost of 48 Samsung 120s might not feel so awful :)

I'd absolutely love to hear your experiences with this. I'm really interested in any problems you encounter with the E5-46xx NUMA architecture. This architecture exists for thread-heavy virtualization workloads where you can have some hope of forcing memory locality. Having 1/3 of your non-local memory accesses bounce across two QPI hops on the single-legged QPI ring might cause some havoc for database work. The same question goes for IO: every PCIe slot is 3.0 and tied directly to a CPU (good), but unless your application can bind processes and memory correctly you only have a 25% chance of having the IO tied to the local CPU, and an equal 25% chance that it is not only non-local but two QPI hops away.
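A toy sketch of where those 25% figures come from, assuming a plain four-node ring and an unpinned process that is equally likely to land on any socket (a simplification, not a measurement):

Code:
# Hop counts on an assumed 4-socket QPI ring (0-1-2-3-0) when a process lands
# on a random socket but its IO device hangs off socket 0.
from collections import Counter

SOCKETS = 4

def qpi_hops(src, dst):
    d = abs(src - dst)
    return min(d, SOCKETS - d)   # shortest path around the ring

dist = Counter(qpi_hops(socket, 0) for socket in range(SOCKETS))
for hops, count in sorted(dist.items()):
    print("%d hop(s): %d/%d of placements" % (hops, count, SOCKETS))
# 0 hops: 1/4 (local), 1 hop: 2/4, 2 hops: 1/4; the 25%/25% cases above.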

If these theoretical issues turn out to be non-problems then there really may be no reason for the E7 product line to exist at all.
 
Last edited:

Patrick

Administrator
Staff member
Dec 21, 2010
12,516
5,811
113
dba - we need pictures :) Let me know if you need STH to host.
 

Patrick

Administrator
Staff member
Dec 21, 2010
12,516
5,811
113
dba said:
During the holiday I will take some and send them your way for hosting. I also plan to take some snaps of the DL585 G7 and a Dell C6145 at the same time.
Saw one of the HP DL585 in the WP CMS (that is a good place to host BTW)
 

matt_garman

Active Member
Feb 7, 2011
212
41
28
Still waiting on the pics :)

dba, a while ago, I saw your "dirt cheap data warehouse" on your website, and now the similarly-themed UberDB.

It never dawned on me until I saw your work how useful consumer SSDs are for ultra-high throughput, read-mostly workloads... and we happen to have this exact requirement where I work. You said, "this build won't be very practical for most folks", but I've been thinking that the basic idea might actually be quite practical for me. I was wondering if you (or anyone else) might weigh in on the applicability of your ideas to my situation:

Basically, we have a WORM-like big data store: about 22 TB of data that is used for "scientific discovery". (It's not true WORM: a small amount of data (~50 GB) is batch-added daily, and old data is rolled off as we near capacity.) This store is NFS-mounted (read only) by a farm of compute nodes that continually read in the data and do some analysis. This is not a database, just a file system with a bunch of data files averaging around 400 MB in size.

For the last three years, we've been served by an expensive appliance from one of the big storage vendors. It's performed quite well, although as we've added compute nodes over time, I think we're starting to hit a wall (it is upgradeable, but at appalling costs). Our support/warranty period is almost up, and the renewal costs are similarly outrageous... in fact, a quick back-of-the-envelope estimation suggests that I can build something like your dirt cheap data warehouse for approximately the cost of the one-year support extension. Not to mention, the appliance is huge and power-hungry (96 15k SAS drives).

Just to put some numbers to this: the current system serves data at around 2.7 GB/sec. That's gigabytes, not gigabits, just to be clear... IOW, we're saturating almost three 10GbE links. IOPS hit around 50k/second; this isn't random DB-style IO, but dozens of compute nodes, each with dozens of jobs, simultaneously doing sequential reads of the data files.
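A quick bit of arithmetic on those figures, using only the numbers above (the ~95% usable-link efficiency is an assumption):

Code:
# 2.7 GB/s at ~50k IOPS of sequential reads over ~400 MB files.
THROUGHPUT_GBS = 2.7
IOPS           = 50000
FILE_SIZE_MB   = 400

avg_io_kb        = THROUGHPUT_GBS * 1e9 / IOPS / 1024   # ~53 KB per read request
usable_10gbe_gbs = 1.25 * 0.95                          # ~1.2 GB/s usable per 10GbE link (assumed)
links_saturated  = THROUGHPUT_GBS / usable_10gbe_gbs    # ~2.3 links' worth of traffic
reads_per_file   = FILE_SIZE_MB * 1024 / avg_io_kb      # thousands of requests per 400 MB file

print("~%.0f KB per request, ~%.1f x 10GbE links, ~%.0f requests per file"
      % (avg_io_kb, links_saturated, reads_per_file))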

I was thinking along the lines of the 24x 2.5" bay SuperMicro case, with a raid0 stripe of eight 3-way mirrors. Linux software raid can actually do this; I was thinking of the raid10, f3 layout. With 1TB SSDs, that gives nearly 8 TB of space. Assume 500 MB/s read performance per SSD, and I think I could easily get well over 3 GB/sec read throughput. I'd actually need three of these systems in place to match the capacity of our current system (the data is easy to spread around multiple systems), furthering the potential throughput.
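A rough capacity and throughput sketch for that layout, using the same assumptions (1TB per SSD, ~500 MB/s reads each) plus an arbitrary fudge factor for md/filesystem/NFS overhead:

Code:
# 24 SSDs in md raid10 with 3 far copies of every block (layout f3).
DRIVES       = 24
COPIES       = 3       # raid10,f3 stores three copies of each block
SSD_TB       = 1.0
SSD_READ_MBS = 500
EFFICIENCY   = 0.5     # pessimistic assumed factor for md/fs/NFS overhead

usable_tb  = DRIVES * SSD_TB / COPIES          # ~8 TB usable
ideal_gbs  = DRIVES * SSD_READ_MBS / 1000.0    # reads can spread across all 24 drives
likely_gbs = ideal_gbs * EFFICIENCY

print("usable: ~%.0f TB, ideal reads: ~%.0f GB/s, with overhead: ~%.0f GB/s"
      % (usable_tb, ideal_gbs, likely_gbs))    # ~8 TB, ~12 GB/s, ~6 GB/s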

It almost seems too easy---what am I missing?
 

dba

Moderator
Feb 20, 2012
1,477
184
63
San Francisco Bay Area, California, USA
Hi Matt,

I'm going slowly with the build due to other deliverables and the fact that I'm having trouble getting the needed six LSI 9300-8i HBAs, with my original order of "in stock" cards eventually cancelled. The pictures and more details will, however, arrive eventually.

You need massive IO, so you are one of the few who would benefit from a variation on this type of architecture. Actually, the DCDW architecture is perfect for what you are doing. Send me a PM if you'd like to discuss your project in more detail. I can offer a bit of free advice, and might be able to consult if you need a bit more.

Basically I'll say this: Acting as a file server with any of the RDMA protocols, my current dirt cheap data warehouse (DCDW) is able to deliver ~8GB/s to clients over a network. The new hardware will be able to do even better, hopefully much better. On the face of it, assembling a server like this is trivially easy. It turns out, however, that there are a few dozen small mis-steps that, in aggregate, can easily drag your performance down to more like 1GB/s. Here are some of the issues:

* Linux IO is awful for large-scale SSD RAID with default settings, and I was unable to fully fix it with tuning. Windows was ~50% better, and Solaris 11.1 ended up delivering 2x better maximum throughput compared to Linux. Install Linux and lose 1/2 of your performance. I was surprised too!
* SSD jitter is a major problem.
* Chipsets and BIOS aren't tuned for massive SSD RAID, and that can bite you. The same basic disk hardware can deliver over 200% better throughput on some motherboards than on others, even if those motherboards look identical on the spec sheet. The issue appears to be brief periods of higher latency on specific PCIe slots, a non-issue with a few disks but a huge deal with dozens of SSDs in RAID.
* Software RAID implementations range from dismal to awesome. Much testing is required, and a hybrid of hardware and software RAID is sometimes necessary to work around defects in the RAID implementations. Mdadm works great for basic RAID, but I have never tried it in a massive SSD RAID. Tom's Hardware did with 24x Intel S3700, and the results look merely average, but the real-world performance is still unknown.
* You may not need much RAM. With so many fast SSDs, my system performs as well with 64GB of RAM as it does with 384GB.
* Networking quickly becomes a bottleneck. You'll need bandwidth, low latency, and very good network stacks, preferably with RDMA. 10GbE is good, IB is better.
* Many of the strip/stripe size rules from hard drives don't apply to SSD drives.
* The ONLY way you'll be able to make maximum use of your hardware is to build up the system slowly, extensively testing and characterizing each and every addition. If not, you'll spin the whole thing up, discover that performance is disappointing, and have no idea why. At each step, performance has to be "much better than necessary", because when all of the hardware, software, and network components are chained together, each robbing a bit of performance, you can ill afford any major bottlenecks.

By the way, with a very very good RAID implementation, plan on no more than ~350MB/s per drive in read throughput in a large RAID.
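A sketch of what that kind of staged build-out budget looks like, using the ~350MB/s figure (the drive counts are just example checkpoints):

Code:
# Per-stage read-throughput budget for an incremental build-out. If a checkpoint
# measures well under its budget, stop and find the bottleneck before adding drives.
PER_DRIVE_MBS = 350   # realistic per-SSD read rate inside a large RAID

for drives in (8, 16, 24, 48):
    budget_gbs = drives * PER_DRIVE_MBS / 1000.0
    print("%2d drives: expect ~%.1f GB/s from the array" % (drives, budget_gbs))
# 8: 2.8, 16: 5.6, 24: 8.4, 48: 16.8 GB/s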
 
Last edited:

Chuckleb

Moderator
Mar 5, 2013
1,017
331
83
Minnesota
Another question that I have is: how many clients are accessing it simultaneously? We use Lustre in order to satisfy the number of connections; a single storage node will cap out at some point, I assume, no matter what IB or other cards are in the system. But I could be wrong.
 

mobilenvidia

Moderator
Sep 25, 2011
1,956
212
63
New Zealand
dba said:
I'm going slowly with the build due to other deliverables and the fact that I'm having trouble getting the needed six LSI 9300-8i HBAs, with my original order of "in stock" cards eventually cancelled. The pictures and more details will, however, arrive eventually.
Alternative:

IBM N2215 SAS/SATA HBA specifications

The N2215 SAS/SATA HBA has the following features and specifications:
* LSI SAS3008 12 Gbps controller
* PCI low profile, half-length - MD2 form factor
* PCI Express 3.0 x8 host interface
* Eight internal 12 Gbps SAS/SATA ports (support for 12, 6, or 3 Gbps SAS speeds and 6 or 3 Gbps SATA speeds)
* Up to 12 Gbps throughput per port
* Two internal x4 HD Mini-SAS connectors (SFF-8643)
* Non-RAID (JBOD mode) support for SAS and SATA HDDs and SSDs (RAID not supported)
* Optimized for SSD performance
* High-performance IOPS LSI Fusion-MPT architecture
* Advanced power management support
* Support for SSP, SMP, STP, and SATA protocols
* End-to-End CRC with Advanced Error Reporting
* T-10 Protection Model for early detection of and recovery from data corruption
* Spread Spectrum Clocking for EMI reductions

 
Last edited:

capn_pineapple

Active Member
Aug 28, 2013
356
80
28
This is way more powerful and way less expensive than anything we're running at work... Kinda depressing really.

Also interested in an update.
 

Diavuno

Active Member
If I recall correctly, some of the old X8DT boards had dual IO hubs. So instead of a dual-socket setup where each processor has one link to a single shared IO hub, both sockets had their own individual hubs. I remember someone mentioning that they got nearly 2x the throughput... But those are only dual socket, so you don't get the quad-socket round-robin penalty!
 

dba

Moderator
Feb 20, 2012
1,477
184
63
San Francisco Bay Area, California, USA
Any updates on this behemoth?
No real updates. It just runs and runs and runs. I have about 8TB worth of data loaded (with compression), and I regularly perform 1TB queries at over 24,000MB/s. Of course almost everything is indexed, so actual performance is much higher than that. In fact, unless the row counts get into the hundreds of millions, I just don't even think about query performance.
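For scale, a trivial sketch of what that rate means for a full 1TB pass (decimal units assumed):

Code:
QUERY_TB = 1.0
SCAN_MBS = 24000      # quoted scan rate

seconds = QUERY_TB * 1e6 / SCAN_MBS   # 1 TB = 1,000,000 MB
print("~%.0f seconds per 1 TB scan" % seconds)   # ~42 seconds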
 
Last edited:

ehorn

Active Member
Jun 21, 2012
342
52
28
pics or it didn't happen!!!

jk :)

Next time you have that thing out of the rack give us some eye candy man!