24 SSDs + Supermicro X9/96GB/10GbE - am I crazy?

dba

Moderator
Feb 20, 2012
1,478
181
63
San Francisco Bay Area, California, USA
The new 9207-8i cards arrived and were installed last week. Server 2k8R2 and IOMeter are ready for testing, but our older primary VM host crashed and I have not gotten to play with this since. After that was fixed, I found four dead drives in our primary NAS pair (replicated). Hopefully later this week I can get some work done :D


DBA- message me your config settings.

I believe that you mean my IOMeter settings. Here they are:

IOMeter configuration for SSD-oriented testing:
Run the IOMeter UI and configure it as shown below.

Test Setup tab:
All default values except:
"Number of Workers to Spawn Automatically" = 1, or optionally 1 per physical CPU socket (not the default, which is one worker per CPU core). Try both options and see what you get.
"Ramp Up Time" - 0 seconds is fine if you are testing the HBA itself. If you are testing overall system performance, 60 seconds or up to 60 minutes is appropriate - this gives the system time to get into "steady state" before you start collecting metrics.
"Run Time" - I use five minutes for quick card/HBA tests and 30 minutes to 5 hours for system performance tests.

Results Display tab:
"Results Since" = Start of Test, since we want the overall results, not a snapshot.
"Update Frequency" = 10 seconds. It's not critical.

Access Specifications tab:
Each test will have its own configuration here - see below.

Network Targets tab:
You won't be testing the network, so there is no need to touch this tab.

Disk Targets tab:
"Maximum Disk Size" = 24,000,000 sectors, which is roughly 12GB per disk assuming 512-byte sectors. Set this large enough that the data won't fit into your RAID card cache - or your OS memory, if you are testing a filesystem.
"# of Outstanding I/Os" = 32. This is a good average for an LSI HBA. Most systems will show maximum performance somewhere between 16 and 256.
"Write IO Data Pattern" = Pseudo Random.
Note: Control-click to select more than one disk to test at once.
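The "Maximum Disk Size" figure can be sanity-checked with a little arithmetic - a sketch assuming 512-byte sectors and a 12GB per-disk target, both taken from the settings above:

```shell
# Convert a target per-disk test size into an IOMeter "Maximum Disk
# Size" sector count, assuming 512-byte sectors.
target_gb=12
sector_bytes=512
sectors=$(( target_gb * 1000 * 1000 * 1000 / sector_bytes ))
echo "$sectors sectors"
```

That prints 23,437,500 sectors; the round 24,000,000 in the settings above works out to about 12.3GB per disk, comfortably past most cache sizes.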


Access Specifications:

Read IOPS Test:
4KB, 100% of specification, 100% read, 100% random for SSD testing. All other settings default.

Read Throughput test:
1MB, 100% of specification, 100% read, 100% random. Sometimes you can see greater throughput by testing 2MB or 4MB transfers, but my use case calls for 1MB transfers, so that's what I use.

RAID Read test:
64KB (or whatever your RAID chunk size is), 100% of specification, 100% read, 100% random.

RAID Write test:
64KB (or whatever your RAID chunk size is), 100% of specification, 100% write, 100% random.


Configure IOMeter as above with the exception of the access specifications, which vary per test. For each test, configure an access specification and then run that test. You should create additional test specifications to match your anticipated disk usage, including mixed read/write tests.

In my methodology, I first test the system as JBOD, which gives me a good idea of my raw maximum throughput. For this, I connect many SSD drives as separate volumes, leaving them initialized but unformatted (to avoid OS caching) and then control-click all of them in the Disk Targets tab so that I'm testing on all of the SSD drives in parallel but not testing any RAID implementation.
Once you are satisfied with your raw throughput, you can re-test with the disks in whatever RAID setup you have chosen to see how well your RAID implementation is performing.
 
Last edited:

the spyder

Member
Apr 25, 2013
79
8
8
Thanks dba.

Here are the quick results from 8 Intel 520 240GB drives on an LSI 9207-8i:

4k Read IOPS: 231,000
1MB Read TP: 4.5GB/s
64k RAID Read: 2.2GB/s
64k RAID Write: 2.2GB/s

Still have some tweaking to do, but this is looking promising.
 

the spyder

Member
Apr 25, 2013
79
8
8
I ran 30-minute tests yesterday and two 5-hour tests last night. Today I doubled the disks to 16 and re-ran the quick tests. This is the closest to my real-world environment I will see: with RAIDZ2 and two 10-disk sets, I end up with 16 data drives. I will test with 24 later this week.

4k Read IOPS: 606,000
1MB Read TP: 7.2GB/s
64k RAID Read: 6.5GB/s
64k RAID Write: 3.5GB/s
 

the spyder

Member
Apr 25, 2013
79
8
8
24 Intel 520 240GB drives

4k Read IOPS: 860,000
1MB Read TP: 11.1GB/s
64k RAID Read: 10.6GB/s
64k RAID Write: 5.6GB/s
 

dba

Moderator
Feb 20, 2012
1,478
181
63
San Francisco Bay Area, California, USA
Nice!

It will be VERY interesting to see how well ZFS is able to deliver throughput and IOPS after you have set up the RAID. My understanding of RAIDZ2 is that each vdev delivers roughly the IOPS of its slowest drive, ignoring caching. I wonder if you'll end up wanting four smaller RAIDZ2 pools instead of two large ones. RAID10 would also be quite appealing for VMs. The argument in favor of RAIDZ2 is protection during the overly-long resilver of an array built out of large multi-terabyte drives. With smaller and much faster SSD drives, that argument is far weaker.
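That rule of thumb can be put into rough numbers - a back-of-the-envelope model, not a measurement: assume each RAIDZ vdev delivers about one drive's worth of random IOPS, where the per-SSD figure of 40,000 below is a hypothetical placeholder:

```shell
# Rough random-IOPS model: pool IOPS ~= (number of vdevs) x (per-drive IOPS).
# drive_iops is a hypothetical per-SSD figure, not a measured value.
drive_iops=40000
for vdevs in 2 4 12; do
  echo "$vdevs vdevs: ~$(( vdevs * drive_iops )) IOPS"
done
```

By this model, twelve mirror vdevs would deliver six times the random IOPS of two RAIDZ2 vdevs - which is why RAID10 looks appealing for VM workloads.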

24 Intel 520 240GB drives

4k Read IOPS: 860,000
1MB Read TP: 11.1GB/s
64k RAID Read: 10.6GB/s
64k RAID Write: 5.6GB/s
 
Last edited:

the spyder

Member
Apr 25, 2013
79
8
8
I am running out of time to test. Today I reinstalled Solaris/Napp-It and created a huge RAID10 pool of all 24 drives. I am again stuck with the question of how to test this, as the built-in benchmarks won't touch the speed available. I will spend some time this evening researching what I can do. In the meantime, I have procured a Mellanox switch and three 40Gb/s HBAs for some VM thrash testing. By request, I will have two hosts and 40 VMs spun up, half CentOS, half Win7. The CentOS machines will run a thrash script, randomly reading and writing based on average use. I have not yet decided what to do with the Windows machines.
 

gigatexal

I'm here to learn
Nov 25, 2012
2,746
524
113
Portland, Oregon
alexandarnarayan.com
And they shipped me the wrong chassis. A new backplane is on the way.

Here are the initial results.

2x 60GB OS drives (mirrored boot is not working; will address later this week)
1x 240GB ZIL (all I had in stock ATM)
1x 240GB spare
20x 240GB in 2x 10-disk RAIDZ2

Initially I was only seeing ~1.1GB/s write - adding the ZIL drive bumped me to an amazing 1.6GB/s write.


Is there a standard gauntlet of tests people would like to see?
FWIW, my array of 20 spindles hit 1200MB/s writes in ZFS using dd and /dev/zero. There has to be a bottleneck somewhere.
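The dd test being described usually looks something like this sketch - TESTFILE is a placeholder path, not the actual pool from this thread:

```shell
# Sequential-write test in the style described above. TESTFILE should
# point at a file on the pool under test; /tmp here is only a stand-in.
# conv=fdatasync makes dd flush to stable storage before reporting a
# rate. count is kept small for illustration - for a real run, write
# far more data than your RAM so caches can't flatter the number.
TESTFILE=/tmp/ddtest.bin
dd if=/dev/zero of="$TESTFILE" bs=1M count=64 conv=fdatasync
```

Note that /dev/zero overstates throughput on datasets with compression enabled, since zeros compress to almost nothing; incompressible input gives a more honest number there.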

Lol, disregard - you're hitting huge speeds.
 
Last edited:

the spyder

Member
Apr 25, 2013
79
8
8
I have until the end of this week to wrap this up.

1) Solaris 11, Omni, and FreeNAS are all going to be tested.
2) My two other 2U servers are set up with ESXi and a mix of 30 Windows/CentOS machines.
3) I have both 10Gb and 40Gb IB ready for testing.

It's going to be a loooong week. :)
 

mrkrad

Well-Known Member
Oct 13, 2012
1,244
52
48
I have found that RAID-1 scales far better with random I/O than RAID-10. Linear I/O scales perfectly with RAID-10, but random stinks.

Also, you have to realize that if you make one big filesystem, ESXi will give the target a queue depth of 32.

Likewise, one big ESXi VMFS volume is one target: 32 QD shared across many VMs. No good.

One option, since I needed large space, was to create extents and span RAID-1 sets - naturally VMware (like ext3) spreads out the load, so while it's not a true RAID-10, each extent gets its own 32 queue depth.

Another nasty thing with ESXi is that it never uses all VMDq cores unless it thinks it will need them - case in point: svMotion/vMotion (DAS to DAS) uses two cores, maybe a little of 4 cores tops, and starts to stall around 3-4Gbps. Of course, if you ran VMs and other vMotions it would scale up, but it doesn't even try to use all cores to ramp up speed aggressively.

I'd really love to try out IB, but networking isn't my best skill, so I am going with two 10GbE NICs per server and just doing simple networking (LAN frontend and backend) - given the cost of networking ($$$$), it gets really expensive to outfit two NICs to every server.

I already have about $2,500 in two 24-port 10GbE switches and $75 per Emulex dual-port NIC - and cabling isn't cheap either (SFP+ and fiber, or passive DAC, can easily reach $35 per meter!!) per port.

I'd like to see what you get as far as utilization with IB and ESXi clients.

With a standard vSwitch, one VM will never "trunk"; even with a dvSwitch it will barely trunk (especially with non-Cisco switches). With the Nexus 1000V switch you can get source/dest IP hash, but that doesn't really scale either if you have one VM per host and one file server.

I'd like to hear what you find out about how to scale to reach peak potential output.

I chose the Emulex NICs since they are the Cadillac of NICs: one driver across many chipset versions, CIM plugin, VMware-flashable, vCenter plug-in, no DAC/SFP+ brand limitation, Windows 7 support. The only thing close is the Intel, which we all know costs far more than $75 each.


Let me know how IB 40Gb works out, man.

What I found with svMotion/vMotion without shared storage: it goes like a bat out of hell, then pauses for a minute (checks/thin provisioning), then goes again, then pauses, then goes again. So with ESXi 5.1 and a small VM, say 80GB, 4x gigabit is just as fast, since much of the time is spent processing/syncing/handing off.

My main goal is not storage but the ability to pull backups from Veeam fast. I need 30 minutes to restore a 250GB VMDK. RAID-5 works well since I'm doing host-based deduplication (read: slow), but RAID-5 is far faster than RAID-10 for this.

I'd be glad to lend you my LeftHand VSA licensing since I'm not using it, if you would like to try it. I've got 10 ESXi MAC addresses (manual zone). You can set up DEV/TEST/PROD on 1 to 10 servers. It is truly enterprise in the way ESXi connects to each instance, for each LUN, for each NIC. If you want to compare performance, features, and stability: four 10GbE NICs (or four virtual functions) will create four connections per target LUN, which can scale up performance greatly.
 

the spyder

Member
Apr 25, 2013
79
8
8
I am afraid my testing is being cut short; I knew it could happen, but such is life. We did fire up 42 virtual machines, 21 CentOS and 21 Win7, and ran a thrash script/SiSoft for several hours this week. We were maxing out the CPU and RAM the entire time. It's sufficiently burnt in and will be heading to production later this week.
On our next system I will try to schedule more testing time. Until then:
 

adrian

New Member
Jul 26, 2013
10
0
1
Hello....

I've been watching your threads both here and on hardforum for a very long time, and I'm very curious about your progress with this 24-SSD pool.

I have a similar setup, with a head node (with an LSI 9206-16e) based on OpenIndiana/napp-it and a Supermicro JBOD with 8/16 SSDs. The backing network for this storage is built on ConnectX-2 InfiniBand HCAs and a Mellanox 4036 switch.

The IO and throughput performance on the head node directly is superb, and it scales very similarly to your benchmarks. I tested multiple times in both Solaris and Windows Server environments.

My setup provides storage for some ESXi hosts I have, and on those hosts I run some storage-performance-hungry VMs.
Because the Mellanox VMware ESX drivers (currently 1.8.1) only provide IPoIB or SRP to ESXi hosts, I did a lot of testing, and I want to know what your results were (and your setup).

What I found is that SRP is not very stable, and I ran into some issues with it. Also, IPoIB for ESXi is limited to datagram mode due to a driver limitation.

Can you please tell me what your setup was and how your performance was?

Thanks a lot...
 

iSolutions

New Member
Feb 17, 2014
1
0
0
Guys, we also want to use 24x 512GB SSDs for our local storage, but we have one question: what is the best way to build a RAID10 array?
We already tried 3x LSI HBA controllers, 8 SSDs per controller with soft RAID, for testing, but we got bad results with that configuration.
We are using a Supermicro X9DRi-F mainboard.
 

gea

Well-Known Member
Dec 31, 2010
2,485
837
113
DE
Guys, we also want to use 24x 512GB SSDs for our local storage, but we have one question: what is the best way to build a RAID10 array?
We already tried 3x LSI HBA controllers, 8 SSDs per controller with soft RAID, for testing, but we got bad results with that configuration. We are using a Supermicro X9DRi-F mainboard.
If you are not CPU, PCI, or RAM limited, sequential performance scales with the number of data disks, while IO performance scales with the number of vdevs. If you use mirrors, you can double those values for reads.

This is the reason mirrors are used for IO-sensitive workloads like ESXi datastores or databases. But that is mainly the case with spindles. If you use SSDs, you have 100x better IO and no problem with fragmentation. This is the reason I would go with RAID-Z on SSD-only pools, to increase capacity.

You should build some test pools and do your own benchmarks, either locally, e.g. with Bonnie, or remotely via iSCSI and something like CrystalDiskMark, with iSCSI writeback enabled and larger blocksizes.

If your use case is mostly filer-type, I would use a 2x 10-SSD RAID-Z2 pool. If you use it as an ESXi datastore, I would prefer a 4x 6-disk RAID-Z2 setup. Best IO performance is with 12x mirrors.

If you use consumer SSDs, you should not go above say 80% fillrate, or write performance in a RAID setup may go down. With consumer SSDs you may also consider a dedicated write-optimized ZIL SSD to reduce small writes. For benchmarks I would disable sync write and compress, and avoid dedup entirely. You can compare values with LZ4 compress; it may give you better values.
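For concreteness, the datastore-oriented layout described above might be created like this - a sketch only, not a command verified on this hardware; the pool name "tank" and the c0tXd0 device names are invented placeholders for your actual devices:

```shell
# Sketch: pool name "tank" and the c0tXd0 device names are placeholders
# for your actual devices.
# ESXi-datastore-oriented layout: 4 x 6-disk RAID-Z2 across 24 SSDs,
# giving 16 data disks and four vdevs' worth of random IOPS.
zpool create tank \
  raidz2 c0t0d0  c0t1d0  c0t2d0  c0t3d0  c0t4d0  c0t5d0 \
  raidz2 c0t6d0  c0t7d0  c0t8d0  c0t9d0  c0t10d0 c0t11d0 \
  raidz2 c0t12d0 c0t13d0 c0t14d0 c0t15d0 c0t16d0 c0t17d0 \
  raidz2 c0t18d0 c0t19d0 c0t20d0 c0t21d0 c0t22d0 c0t23d0

# Filer-oriented alternative: two "raidz2" lines of ten disks each.
# Best IO of all: twelve 2-disk "mirror" vdevs, at half the capacity.
```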
 

jingjing

Banned
Nov 23, 2013
22
0
0
If possible, stay below 60% fillrate or performance drops.
Optionally, tune IP settings (jumbo frames).
Install as much RAM as possible. Check ARC statistics to decide whether to buy more RAM or an SSD for L2ARC caching.
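On Solaris-family systems, one way to act on that advice is to read the ARC hit/miss counters and compute a hit rate - a sketch, with the kstat call shown as a comment (counter names as on illumos) and placeholder numbers standing in for its output:

```shell
# Read the ARC counters, e.g.:
#   kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses
# The values below are placeholders standing in for that output.
hits=900000
misses=100000
echo "ARC hit rate: $(( 100 * hits / (hits + misses) ))%"
```

A persistently low hit rate on a busy pool suggests that more RAM, or an L2ARC device, would pay off.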