24 SSD's + Supermicro X9/96GB/10GBE- am I crazy?

the spyder

Member
Apr 25, 2013
79
8
8
At my current job, we are no strangers to ZFS, OI/Nexenta, and Napp-It; we have several petabytes of ZFS-based storage for archival and processing systems. As I get ready to start replacing our aging VM servers, I realized their local storage would be an incredible pain to deal with as we migrate to new hardware next year. In the meantime, I have had no way to easily manage backups or VM transfers for maintenance due to the nonexistent 4.1 licensing. To help deal with this temporarily, I purchased 24 Intel 520 240GB SSDs with the intent to create a fast, inexpensive NFS host. I've built plenty of spindle-based versions, but never a completely SSD-based one. This will run 25-30 very low IO VMs and be directly connected via 10GbE to each VM host with existing copper/Intel NICs.

Has anyone else run all SSDs?
How did you split your pools up? Raid 10? Raid Z2?
Did you use Cache or ZIL drives? I have two 480GB drives available for Cache if I want and could use two of the 240's for ZIL.


I can't wait to iperf/bonnie test this. I realize my main limiting factor will be the single SFF connection to the backplane/MB. But compared to the 24x 320/500GB WD 2.5" notebook drives in RAID 5 in each of our current VM servers, this should be 10x better.
 

gea

Well-Known Member
Dec 31, 2010
2,485
837
113
DE
Yes, all my current ESXi datastores are SSD only

I mostly use Raid-Z2 (one or two vdevs, similar to Raid-60); you do not really need mirrors
(SSDs deliver a few thousand I/Os per second compared to a few hundred for spindles, so mirrors are only needed with spindles)

You should use a dedicated ZIL to avoid massive small writes on your pool (or disable sync).
Ideally, the ZIL device should be faster than your pool SSDs; prefer a DRAM-based ZeusRAM if possible (expensive)

If possible, stay below a 60% fill rate, or performance drops.
Optionally tune IP settings (jumbo frames)
Insert as much RAM as possible. Check ARC statistics to decide whether to buy more RAM or an SSD cache (L2ARC)
 

the spyder

Member
Apr 25, 2013
79
8
8
Thanks Gea,

I was planning on 2x 10 disk Raidz2 pools, two 24GB Intel 311 ZIL drives, and mirrored 60GB OS disks. My concern is that the 311's are SATA II and the 520's are SATA III.
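For a rough sense of what that layout yields, here is a back-of-the-envelope sketch. It assumes two 10-disk RAID-Z2 groups of 240GB drives with two parity disks each, and it ignores ZFS metadata overhead and the GB/GiB difference, so real usable space will be somewhat lower. The function name is made up for illustration; the 60% figure is gea's fill-rate suggestion from earlier in the thread.

```python
def raidz_usable_gb(vdevs, disks_per_vdev, parity, disk_gb):
    """Approximate usable capacity: data disks per group times disk size."""
    return vdevs * (disks_per_vdev - parity) * disk_gb

# Assumed layout: 2 groups x 10 disks, RAID-Z2 (2 parity disks), 240GB drives.
usable = raidz_usable_gb(vdevs=2, disks_per_vdev=10, parity=2, disk_gb=240)

# gea's suggested ceiling of 60% fill before performance drops.
comfortable = usable * 0.60

print(f"approx. usable: {usable} GB, at 60% fill: {comfortable:.0f} GB")
```

Under these assumptions, 16 of the 20 drives hold data, and a little over 2TB of that would be filled before reaching the 60% mark.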
 

dba

Moderator
Feb 20, 2012
1,478
181
63
San Francisco Bay Area, California, USA
I have a 56 drive wide SSD array for database use that is insanely fast and has been 100% reliable. I use Oracle ASM, which is a bit like a RAID1E implementation, which in turn performs similarly to RAID10.

With a wide SSD array, you may not need a ZIL to get good write performance.


dba

Moderator
Feb 20, 2012
1,478
181
63
San Francisco Bay Area, California, USA
Wait, did you say that the entire array will have a single SAS SFF-808x connection between the disk backplane and the motherboard? If so, then you'll be flush with IOPS but your throughput will be severely throttled compared to the potential of those Intel drives. In fact, with 30 VMs running, each could potentially have less throughput than a single laptop drive. Your 2GB/s might be enough for your particular VMs, but if it is, then why buy so many SSDs?

From your photos you may be using a Supermicro 216 chassis for those SSD drives. If so, then you could always swap out for a non-expander backplane to get 6x the throughput.
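The arithmetic behind the single-cable concern can be sketched as follows. This is an estimate under stated assumptions, not a measurement: it assumes a four-lane SFF-8087/8088 connector, a 6Gb/s line rate per lane, and 8b/10b encoding. Real-world throughput lands below this theoretical figure, which is consistent with the roughly 2GB/s quoted above. The constant and function names are illustrative only.

```python
LANES_PER_CONNECTOR = 4     # an SFF-8087/8088 connector carries four SAS lanes
LINE_RATE_GBPS = 6.0        # assumed 6Gb/s per lane (SAS2 / SATA III)
ENCODING_EFFICIENCY = 0.8   # 8b/10b encoding: 8 data bits per 10 line bits

def connector_mb_per_s(connectors=1):
    """Theoretical payload bandwidth in MB/s for N four-lane connectors."""
    per_lane = LINE_RATE_GBPS * 1000 / 8 * ENCODING_EFFICIENCY
    return connectors * LANES_PER_CONNECTOR * per_lane

single = connector_mb_per_s(1)   # one cable through an expander backplane
per_vm = single / 30             # evenly shared by 30 busy VMs
direct = connector_mb_per_s(6)   # six cables to a non-expander 24-bay backplane

print(f"single cable: ~{single:.0f} MB/s, per VM: ~{per_vm:.0f} MB/s")
print(f"six cables:   ~{direct:.0f} MB/s")
```

With 24 bays at four drives per connector, a non-expander backplane needs six cables, which is where the 6x figure comes from.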
 

the spyder

Member
Apr 25, 2013
79
8
8
Thankfully they shipped with the non-expander backplane; there was some confusion with the vendor/SM. I will be using 3 LSI 9211-8i cards, which we already have in house.
 

the spyder

Member
Apr 25, 2013
79
8
8
And they shipped me the wrong chassis. A new backplane is on the way.

Here are the initial results.

2x 60GB OS drives (Mirror boot disk is not working, will address later this week.)
1x 240GB ZIL (All I had in stock ATM)
1x 240GB Spare
20x 240GB in 2x10 Raidz2

Initially I was only seeing ~1.1GB/s write; adding the ZIL drive bumped me to an amazing 1.6GB/s write.


Is there a standard gauntlet of tests people would like to see?
 

dba

Moderator
Feb 20, 2012
1,478
181
63
San Francisco Bay Area, California, USA
...
Is there a standard gauntlet of tests people would like to see?
This is the first large all-SSD ZFS storage server that I have seen, so I am very interested. I'd love to see it connected to one or more client machines via a fast network (IB or several 10GbE connections) and then tested using IOMeter. I'd like to see high queue depth (QD=32) 4kb random read and write IOPS (small files use case), 8kb read and write IOPS (database use case), and 1MB random read throughput (data warehouse or large file use case) tests.

Also, I'd really like to see what happens when more ZIL drives are added to the setup - as many as needed to maximize write speed.

This would make a great main site post - we should be seeing more and more organizations adopting a similar architecture.
 

the spyder

Member
Apr 25, 2013
79
8
8
I currently have two 10GbE links to two more identical boxes, minus the SSDs. They will both have 24x 3TB spindles for local storage. I do have a spare Qlogic 40Gb switch + cards, but I do not know if I will have time.

I am planning on installing Omni, FreeNas, and OpenIndiana just for kicks and running the same series of tests.

 

dba

Moderator
Feb 20, 2012
1,478
181
63
San Francisco Bay Area, California, USA
Here is some completely unsolicited advice:

There are many things that can go wrong at the high end of the performance spectrum. When I work on bleeding-edge systems, I always benchmark at every stage. Almost every build finds at least one point at which something goes subtly wrong and I lose an unexpectedly large chunk of the expected performance. By benchmarking each subsystem, I always know when and where the problem started, which makes it dramatically easier to fix.

For example, I have come to expect 7,700 MB/s of file system read speed from 20 SATA3 SSD drives in RAID. You are getting 2,730. That's very fast, but somewhere along the line you "lost" 5,000 MB/s. I wonder if one or more drives are weak, a BIOS setting is off, PCIe interrupts are misbehaving, an OS parameter is misconfigured, or there is a defect or misconfiguration in ZFS or Napp-IT. When you later add networking to the setup, you'll have another pile of possible points of misbehavior.
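To put the gap in per-drive terms, a quick calculation using only the two figures quoted above (7,700 MB/s expected, 2,730 MB/s measured, 20 drives). The variable names are illustrative.

```python
expected_mb_s = 7700   # expected read speed for 20 SATA3 SSDs, as quoted above
measured_mb_s = 2730   # measured read speed, as quoted above
drives = 20

per_drive_expected = expected_mb_s / drives
per_drive_measured = measured_mb_s / drives
shortfall = expected_mb_s - measured_mb_s
fraction_achieved = measured_mb_s / expected_mb_s

print(f"per drive: {per_drive_measured:.1f} of {per_drive_expected:.1f} MB/s")
print(f"shortfall: {shortfall} MB/s ({fraction_achieved:.0%} of expected)")
```

Each drive is delivering roughly a third of what it would in the reference setup, which points at a shared bottleneck rather than one weak drive.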



mrkrad

Well-Known Member
Oct 13, 2012
1,244
52
48
Can you run a battery of tests, such as 24 to 48 VMs of CDM or AS SSD at the same time, showing linear increase in I/O in random and random QD32 (not interested in sequential)?

I'm thinking of either using my lefthand VSA to do JBOD (no raid) with 5 ssd's and network raid-1 or 5 using the cheap BROCADE $75 nic's for 4 x 10gbe per lefthand.

The VSA is SRM and VEEAM compatible along with some VAAI primitives.

What I'm really getting at is can you tell us about the failover? For example, if you lose a box, how long does it take ESXi to recover? Does it do a log dump or stun? With 10GbE NICs how hard is it to set up VLAN flow control? iSCSI to no_drop and say vMotion/LAN to lesser priority.

Also can you give us an idea on the time to rebuild with raid-Z and the write amplification?

Strip and stripe size versus ESXi 1MB block size and, say, NTFS 64KB clusters.

Mainly two points:
How graceful is failure? How fast is return to service should you lose 1, 2, 3, or 4 drives? How does this scale latency-wise with a couple dozen VMs?

thanks! I'd love to follow your footsteps but I want to see what the real world is like.

Latency is very important - since random i/o is #1, then graceful failover and finally backup/vaai.

Also how well does thin reclamation work with vaai ?
 

the spyder

Member
Apr 25, 2013
79
8
8
dba,
I have not necessarily lost 5GB/s, as I know where it is going (or rather not going). I should have stated more clearly that this is on the single SAS cable through the expander backplane. The incorrect chassis was shipped and a new backplane is on the way. This test was just for "fun" per se. It's going to be another week until the replacement gets here. Even with a 10GbE connection, the speed of the array would not be my limiting factor.

As far as the tests go, time will be the largest factor. I have a ton of projects at work and there is already a demand for the spindle based storage. I will do the best I can :). I would like input on creating a standard set of benchmarks, as I can test this across several platforms via work.
 

dba

Moderator
Feb 20, 2012
1,478
181
63
San Francisco Bay Area, California, USA
Got it. I remembered from earlier that you were aiming for the non-expander backplane. I didn't know that the tests were using the expander version. Sounds like even better speed is coming!

If you plan to do any IOMeter testing, I can share my configs. Long-term tests can take hours each, but you can get a good idea of your performance from a ten minute test.
 

the spyder

Member
Apr 25, 2013
79
8
8
Well, I am a bit disappointed. It looks like I have quite a bit of tweaking to do before I will see above 3GB/s read/2GB/s write. Monday we changed the backplane and installed two 9211-8i controllers. Combined with the onboard 2308, this gave us 6 ports for the non-expander backplane. Initially, I left the zpool config alone and re-ran the dd bench with the same parameters: identical results to before, 2.7/1.6. Next I ran some different block sizes, file sizes, etc., but I never saw more than a 100-200MB/s change in speed. After reading through dba's setup, I went ahead and set up a 20 disk Raid 10 in 4 disk sets. This again gave nearly identical results. Realizing I left the 9211's with IR firmware, I flashed them to IT. This finally broke the 3GB/s read/2GB/s write mark. Last night I decided to move away from dd bench and evaluate other testing options. Hopefully today will yield better results. If anyone has block/file size recommendations, I am all ears.

I would love to fill the system with dedicated controllers for each port on the non-expander backplane, but due to the 10Gb NICs, I cannot. Based on the PCIe/9211 HBA/SAS 6Gb/s/Intel 520 speed tests, I will max out the 9211's x8 PCIe 2.0 connection first. I need to do some more research, but I would love to find some tests on exactly what sustained throughput the 9211 can handle.

*Edit* I am seeing some sites claiming ~20% loss to overhead on the 9211's, giving 3GB/s maximum. I should still be able to max out 3 controllers, which brings me back to waiting to test with something other than dd.
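The ~3GB/s ceiling can be reproduced with a quick estimate. This sketch assumes PCIe 2.0 at 5 GT/s per lane with 8b/10b encoding (about 500 MB/s per lane, per direction) and takes the ~20% overhead figure above at face value; the names are illustrative, and the result is a theoretical bound rather than a measured number.

```python
PCIE2_LANE_MB_S = 500  # PCIe 2.0: 5 GT/s with 8b/10b encoding, per lane, per direction

def hba_ceiling_mb_s(lanes=8, overhead=0.20):
    """Estimated sustained ceiling for one HBA after protocol overhead."""
    return lanes * PCIE2_LANE_MB_S * (1 - overhead)

one_hba = hba_ceiling_mb_s()   # a single x8 PCIe 2.0 card
three_hbas = 3 * one_hba       # three controllers, assuming no other bottleneck

print(f"one HBA: ~{one_hba:.0f} MB/s, three HBAs: ~{three_hbas:.0f} MB/s")
```

If the estimate holds, three controllers together have far more headroom than the 3GB/s read seen so far, so the limit is likely elsewhere (the benchmark tool, the pool layout, or the OS).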
 

dba

Moderator
Feb 20, 2012
1,478
181
63
San Francisco Bay Area, California, USA
...I need to do some more research, but I would love to find some tests on exactly what sustained throughput the 9211 can handle.
The 9211, like all of the LSISAS2008 cards that I have tested, is good for between 2,500 and 2,700MB/s worth of 1MB random reads against 8 SSD drives, tested using IOMeter against unformatted volumes. This is when accessing test volumes large enough to greatly overwhelm any caches, not the tiny 256MB volumes that some benchmarks use. Using an unformatted volume, which IOMeter supports on Windows, ensures that no OS caching can happen; I've seen a pair of spinning disks seem to pull >4GB/s when evaluated using the "wrong" test configuration.

I recommend standardizing on IOMeter for your threaded IO testing. By the way, the "threaded" part is important. You'll only see maximum throughput from your storage when you throw many threads at it. In my tests, the peak comes at a queue depth between 16 and 256 depending on the specific system, with 32 being a good number to standardize on.
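To illustrate why the "threaded" part matters, here is a minimal sketch of a multi-threaded random-read test. It is not IOMeter and not a substitute for it: the function name and defaults are invented for this example, it uses one blocking reader thread per outstanding request as a rough stand-in for queue depth, it relies on os.pread (Unix only), and it reads through the OS page cache, so on a small or recently written file it will report exactly the kind of inflated numbers warned about above.

```python
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor

def random_read_throughput(path, block_size=1024 * 1024, workers=32,
                           reads_per_worker=64):
    """Issue block-aligned random reads from `workers` threads.

    Returns (total_bytes_read, megabytes_per_second).
    """
    blocks = os.path.getsize(path) // block_size
    if blocks == 0:
        raise ValueError("file is smaller than one block")
    fd = os.open(path, os.O_RDONLY)

    def worker(seed):
        rng = random.Random(seed)  # per-thread generator, reproducible offsets
        done = 0
        for _ in range(reads_per_worker):
            offset = rng.randrange(blocks) * block_size
            # pread takes an explicit offset, so threads can share one descriptor
            done += len(os.pread(fd, block_size, offset))
        return done

    try:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            total = sum(pool.map(worker, range(workers)))
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return total, total / elapsed / 1e6
```

Running it with workers=1 and then workers=32 against the same large file (far larger than RAM, to defeat caching) should show the effect of concurrency on a wide array.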
 

dba

Moderator
Feb 20, 2012
1,478
181
63
San Francisco Bay Area, California, USA
I do my basic tests with the RAID/HBA cards installed into a Windows box running either 2008R2 or 2012. In this setup, IOMeter runs on the Windows box itself. You can also run the IOMeter load generator (Dynamo executable) on Linux, though you then need a Windows box to run the IOMeter UI that controls the load generator. See: HowTo:iometer - Greg Porter's Wiki for example.

To test the HBA and disk throughput (as opposed to the RAID implementation), I test the SSD drives as JBOD. In IOMeter, use control-click to select all eight drives. IOMeter will issue IO to all drives in parallel. Once you are satisfied, you can re-test the multi-disk RAID volume and compare the results to the raw drive tests.


How was your test system setup? Windows installed on a test system across IB? Directly on the system?
 

the spyder

Member
Apr 25, 2013
79
8
8
The new 9207-8i cards arrived and were installed last week. Server 2k8R2 and IOmeter are ready for testing- but our older primary VM host crashed and I have not gotten to play with this since. After that was fixed, I found 4 dead drives in our primary NAS pair (replicated). Hopefully later this week I can get some work done :D


DBA- message me your config settings.