ScaleIO Build and Meh Performance, any ideas?

frogtech

Well-Known Member
Jan 4, 2016
I have a 6-node cluster of servers that are all configured nearly identically. I've installed ScaleIO 2.0 and I'm getting pretty substandard performance, or at least lower than I would expect from a solid-state storage pool.

Each node is basically an X9DRD-7LN4F-JBOD board with an LSI 2308 in IT mode, 2x E5-2670 chips, and 64GB of RAM.

The primary SAN network is 10 gigabit, on a Quanta LB6M switch. Each node has a dual-port ConnectX-3 EN card. I've configured jumbo frames at the switch and NICs and confirmed them with pings to the appropriate IPs.

So with my 6 SSDs together in a storage pool and added to Server 2012 Failover Clustering as a Cluster Shared Volume, my 'test' is to install a VM onto the solid-state pool. I figured that at a minimum I could get up to the write speed of one disk during an install, which is about ~480 MB/s. Unfortunately, when I start installing an OS to a VM I get at best around 130 MB/s of write bandwidth, with each node contributing about 15-20 MB/s. Pretty weak stuff.

I've referred to EMC's performance tuning guide, but I don't think they've released an updated version for 2.0 yet, so the guide for 1.32 is outdated in a few ways. First, the "num_of_io_buffers" command option seems to have been removed from the SCLI and I can't find it in the GUI, so I assume it maps to the SDS/MDM/SDC performance setting in the GUI, which I've set to "High" for all nodes/services.

Second, the guide says to configure the SDS cfg file with special parameters if it's running on flash-based storage; when I did that, the SDSes could no longer connect to the MDMs, so that obviously doesn't work. There's also a recommendation to remove AckDelay from the tcpip settings in the registry, but that no longer applies to Server 2012 R2. The last recommendation is to change the values of some DWORDs in the 'scini' registry key, which weren't there for me, so I added them, but I'm not sure it made any difference.

I'm hoping someone out there has a similar setup or has troubleshot similar performance issues before and can point me in the right direction. My next step is to remove the Hyper-V virtual network adapters and multiplexor drivers and see if those are adding overhead latency somewhere, because I'm really not sure what else to set or do. When I remove an SSD from the pool/cluster and test it with something like CrystalDiskMark, I get the speeds advertised by the manufacturer. The SSDs in question are 960GB SanDisk CloudSpeed Ascends.

Thanks for any insight.
 

Jun Sato

New Member
Apr 28, 2016
I read your post. I run a SQL Server architecture consulting business, so I deal with a lot of SANs. I recently built a ScaleIO test system for us to get a good understanding of what it is and how it performs. There are some pictures at the company site.



It's a similar setup; mine is 5 nodes, and I'm getting 150,000 IOPS at 4K random and, in terms of transfer rate, 1800 MB/s (14.4 Gbit/s) on a sequential-ish SQL workload, so your system definitely has more headroom.
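(A quick unit check on those figures, done in shell arithmetic since the shell has no floats — MB/s here taken as decimal megabytes:)

```shell
#!/bin/sh
# Sanity-check the conversion above: MB/s to Gbit/s (decimal units).
mbps=1800
gbit_tenths=$((mbps * 8 / 100))   # tenths of a Gbit/s, avoiding floats
echo "${mbps} MB/s = $((gbit_tenths / 10)).$((gbit_tenths % 10)) Gbit/s"
```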

My spec per node
CPU: Mixed AMD CPUs. Some AMD Athlon X2's and FX-4300s.
Memory: 4GB DDR3 ECC DIMM x 2 (=8GB per node)
M/B: some old ASUS AMD gaming board, recycled...
Interconnect: Mellanox ConnectX-2 VPI at InfiniBand QDR (40Gbit/s line rate, 32Gbit/s usable) Qty: 2
ScaleIO Version: 2.0
OS: Ubuntu 14.04

So I have 3 things for you:

1. 10GbE, especially copper, is known for high latency and high heat... not the best choice for ScaleIO. Go with lower-latency InfiniBand; the cards are dirt cheap on eBay.

2. Try Ubuntu as the OS instead of Windows. I'd say this even as a mostly-Windows guy (I'm a SQL Server pro by trade) who didn't know chmod. It's much easier to tune on Linux, and I learned a few things along the way; Ubuntu makes it easy, IMHO more so than RedHat/CentOS. For example, in Windows you can't easily set the TX queue length. In Linux, what you set is visible. In Windows, you modify the registry and hope it took effect, with no way to verify, so you're left in the dark; there's no way to be scientific about the tuning process.
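To illustrate the visibility point: on Linux the queue length sits right in sysfs, so you can read it back after setting it. A minimal sketch — the interface name is an example (`lo` is just a safe default for demonstration):

```shell
#!/bin/sh
# Read the transmit queue length of an interface; substitute your
# 10GbE or IPoIB interface name for the default below.
IFACE=${IFACE:-lo}
qlen=$(cat /sys/class/net/"$IFACE"/tx_queue_len)
echo "$IFACE txqueuelen = $qlen"
# Raising it is one visible, verifiable command (root required), e.g.:
#   ip link set dev eth2 txqueuelen 10000
```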

3. Do a node-to-node transfer test. Get on each node and see what kind of bandwidth you get between the nodes; this points you to the right place to focus. If node-to-node communication sucks, that's where you want to dig. If it's OK, focus on ScaleIO itself. You want 15~20 Gbit/s of transfer and 50~80 microseconds of latency.
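One way to sketch that sweep is with qperf (the tool used for the measurements later in this thread); the hostnames below are placeholders, and a plain `qperf` server process must already be running on each peer:

```shell
#!/bin/sh
# Node-to-node bandwidth/latency sweep sketch. Start `qperf` (no args,
# server mode) on every peer first. Hostnames are placeholders.
NODES="node02 node03 node04 node05 node06"
for host in $NODES; do
    echo "== $host =="
    if command -v qperf >/dev/null 2>&1; then
        qperf "$host" tcp_bw tcp_lat   # streaming bandwidth + round-trip latency
    else
        echo "qperf not installed; skipping"
    fi
done
```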


From the specs, you have really powerful nodes — easily 5 times more capable than each of our humble Athlon nodes on 5-year-old motherboards — but raw CPU power isn't really what the nodes need. So for clients we recommend spending the budget on low-latency interconnects and the maximum number of nodes financially/physically possible, rather than beefing up each node.

SATO Database Architects, LLC owner, Jun Sato.
 

Jun Sato

New Member
Apr 28, 2016
With your setup, you should get something like this... for reference, we have:

Qty 2 Intel 750 NVMe 400GB on PCIe slots

and

Qty 2 "Striker" 480GB SSDs from Mushkin Enhanced

...per node, for a total of 20 SSDs.

Oh....and I have couple more recommendations for you:

4. Right-size everything.

SSD size? Keep it under 500GB — you'd rather have 2x 500GB than 1x 1000GB. There aren't many SATA SSD controllers that can handle over 500GB well, unless you pay for enterprise PCIe stuff.

SSD count? Maximize.

Balance the PCIe lanes sending data to the fabric against the PCIe lanes going to the SSDs.

Node count? Maximize.

DIMM count? MINIMIZE. Sometimes you get less performance with more RAM. Populate the motherboard with the bare minimum necessary to achieve the dual-channel or quad-channel operation the CPU wants, with the fewest DIMMs possible. Unless you have spinner HDDs, you don't want read cache anyway, so there really isn't much node RAM can do for ScaleIO in an SSD world. Think of each node as merely an enclosure for your SSDs.

and last but not least...

5. Power options. Disable C-states, enable HPC mode in the BIOS, and disable power management for PCIe and/or the chipset; it's all in your BIOS setup. Basically, power management assumes a user is at the computer and ramps the CPU up as the user starts a task — otherwise the CPU goes to sleep to save power. That hurts ScaleIO quite badly, because a read/write request arrives at the SDS while the CPU is sleeping, and it has to wake the CPU every single time. From the SDS's perspective there's no knowing when the next request will come, so the CPU has to stay awake the whole time. This is a really big factor, and your pitfall may be here. So go into each node and absolutely ensure it's running at full speed the whole time. Task Manager shows the CPU speed in GHz... make sure the GHz stays high and doesn't bounce around like it usually does.
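On a Linux node this is easy to verify from userspace; a sketch (on Windows the rough equivalent is selecting the High Performance power plan with powercfg):

```shell
#!/bin/sh
# Check that no core is being throttled by a power-saving governor.
# After disabling C-states / enabling HPC mode, these should all
# read "performance" on a Linux node.
found=0
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    [ -e "$gov" ] || continue
    found=1
    printf '%s: %s\n' "$gov" "$(cat "$gov")"
done
[ "$found" -eq 1 ] || echo "no cpufreq interface exposed (common inside VMs)"
```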



 

frogtech

Well-Known Member
Jan 4, 2016
I'm using passive SFP+ cables for the 10 gig, not Base-T copper. A quick question though, before I head to bed and read through the rest of your post (which is superb, by the way): I do have an InfiniBand switch on hand, but how do you utilize it with ScaleIO? Is it just IP over InfiniBand, business as usual? My switch has the subnet manager built in.

I have spinning disks for bulk storage and media, but those pools have an SSD flash cache device added anyway.
 

Jun Sato

New Member
Apr 28, 2016
Yes, you use IPoIB. If your switch has a subnet manager built in, you're good to go. I grabbed ConnectX-2 cards off eBay for $40-ish. In our use case each HCA isn't going to saturate PCI Express 2.0 at 8 lanes, so I didn't see a point in paying for Gen3 InfiniBand — and everything you need is downloadable from Mellanox. They even have Linux distro-specific OFED builds, so you can get point-and-click stuff for Ubuntu 14.04, for example... add the power of apt-get and it's dead easy. If you're missing some package, the messages will say so, and all you have to do is

#apt-get install missingstuff

and it will go out to the internet, grab everything, and install and configure it for you. Linux has come a long way.
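For reference, once OFED is installed and the subnet manager is up, bringing up IPoIB is ordinary interface configuration. A sketch — the interface name and address are made up, and connected mode is what enables the large MTU:

```shell
#!/bin/sh
# IPoIB bring-up sketch (run as root). ib0 and 10.0.1.11 are examples.
echo connected > /sys/class/net/ib0/mode   # switch from datagram mode
ip link set ib0 mtu 65520                  # connected mode permits ~64K MTU
ip addr add 10.0.1.11/24 dev ib0
ip link set ib0 up
```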

On the spinner/SSD mix: if you have spinners, your 64 gigs of RAM will come in handy for them. But here's my thinking from operational experience. If you've got slow-to-rebalance spinners in the mix, you kind of miss out on ScaleIO's beauty. One node goes out, and the cluster keeps pumping while it rebalances/rebuilds and the users won't notice a thing — which is only possible with SSDs. An HDD rebuild is noticeable, and it leaves your ScaleIO cluster in a degraded state for a long period of time, which invites secondary issues that wouldn't otherwise happen.

So: I have a movie archive at home, and I decided I don't want terabytes of files on spinners in ScaleIO, although that's entirely doable. They live happier on the home theater PC on a 12TB RAID-0 (two 6TB drives) that's backed up nightly to an external USB drive. That's a power-efficient setup for home, because only two drives need to be up 24/7 and the backup target needs power only while the backup is in progress. Besides, it separates the concerns — the family won't complain if ScaleIO is broken, and the home theater PC is self-sufficient. There's a beauty in that.

Good night.
 

frogtech

Well-Known Member
Jan 4, 2016
Well, I was attracted to the idea of ScaleIO coming from a mix of hardware/controller-based RAID 6 and 10. I wanted a bulk storage method divorced from the traditional RAID platform, something mostly platform- and hardware-agnostic. And I didn't want to use ZFS, since that ties me to a *nix platform and is basically just hardware-agnostic RAID. I just expected the performance to be a lot better. Rebuilding either a ScaleIO spinner pool or a RAID 5/6 array, the performance hit would probably be noticeable either way. But I saw okay-ish rebuild/rebalance speeds while troubleshooting a BSOD'ing node — around ~550 MB/s rebuild, which I thought was cool, and that was without limiting rebuild I/O.

I do like your perspective, and it has made me reconsider some things, such as segregating my media platform entirely onto a single box with much larger drives. As it stands, I have 4x 2TB hard drives in each server making up the spinner pool. I was hoping ScaleIO could be something great that would give me better performance out of the spinners, but I haven't had much time to play with it. The thing is, I serve Plex to multiple friends and family, and rebuilding the platform on something like Ubuntu just to test whether the platform is the issue isn't practical: it means taking Plex down, telling everyone, spending days rebuilding, re-transferring the 10TB of media all over again, etc. I wanted to see the beauty of ScaleIO with HDDs because I know not every firm or business can afford to load hyperconverged systems with SSDs.

But alas, if I want a completely flexible home lab, I'm considering dumping all of the 2TB drives in favor of 6-8TB drives in a single box, backed up nightly as I'm doing now.

Anyway, I did want to address your mention of CPU power profiles. Yes, in the BIOS I do have all the typical things turned on, like C-states and C1E, and the CPU profile is set to "maximum energy efficiency". I need to go through each node's BIOS one by one and turn them all back up to maximum performance. I'll try that and report results.

But I may just end up selling my 836 3.5" chassis in favor of Supermicro 216 24x 2.5" chassis and loading them with SSDs over the years as deals come and go.
 

Jun Sato

New Member
Apr 28, 2016
Moving to SSD is probably the way to go; I think it's just a matter of when, and of course your mileage may vary.

As for maximum performance — actually, don't discard classic RAID just yet. At my day job I see an endless pile of companies that don't spend their money in the most efficient manner, and there's a big misunderstanding about what storage performance is.

A SAN like ScaleIO is great for scalability and performance — when the performance in question is concurrency. When your mission is to make a transactional database go faster, local RAID-10 beats ScaleIO setups so completely it's not even funny.

Let me give you an example. When 15 Intel DC-series PCI Express SSDs — super fast stuff — were combined in ScaleIO, a SQL Server benchmark that basically measures how many inserts it can do in a minute inserted 356,784 rows. The same SQL host, with all the ScaleIO nodes turned off and lesser SATA SSDs on the on-board controller in RAID-10 (an 8-spindle config), inserted 1,245,875 rows. So: lesser SSDs, fewer drives, four times the performance.

And it's not because ScaleIO was put together in a suboptimal way. It's precisely what we expected.

ScaleIO brags about its IOPS figures. IOPS is a measure of concurrency in the real world — how many people can hit the storage without slowing it down — so it shouldn't be used as a measure of single-threaded performance. It's relevant when you're wondering whether you can host 100 virtual machines or 1,000 virtual machines on the storage.
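To put a number on that: with a single outstanding I/O, throughput is capped by latency alone, no matter what the SAN's aggregate IOPS figure says. A back-of-envelope sketch (the 500-microsecond per-I/O latency is illustrative):

```shell
#!/bin/sh
# Single-threaded, queue-depth-1 ceiling: IOPS = 1 / latency.
latency_us=500                        # illustrative per-I/O round trip
max_iops=$((1000000 / latency_us))
echo "one thread, QD1: at ${latency_us}us per I/O, max ${max_iops} IOPS"
```

A million-IOPS SAN and a 2,000-IOPS SAN look identical to that one thread; only lower latency moves the needle.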

A comedy that frequently unfolds in IT departments goes something like this:

DB guy: We need faster storage.
IT head: Okay, we have a 500,000 IOPS SAN. Let's get a 2,000,000 IOPS SAN so it's four times faster!!!

Of course, without a latency improvement, nothing's going to change after the quarter-million-dollar buy.

That's like saying,
"We lost a race. A four-cylinder engine isn't enough!"

"Well, let's get a 6-liter V-8"

...and of course, the guy shows up at the race track with a 6-liter V8... a 6-liter V8 BUS that's good at hauling 50 people. You don't show up to a race in a bus. It's common sense.

But in IT, people lack that common sense and do silly things like running SQL on a SAN...


anyway, rant over! good luck with your setup!
 

Paul Roland

Member
Jul 10, 2015
Hi, in the same boat here.
Similar-ish specs: ScaleIO configured on 3 M610 blades with 2x E5645, 64GB RAM, and a SAS HBA in IT mode.
I've set up a dedicated 10G network (57711 NICs + M8024 switch) just for this purpose. Dedicated switches, dedicated network cards, jumbo frames, tuning optimizations, txqueuelen, etc. CentOS 7.2.
These 3 nodes have 2x Samsung SM863 enterprise SSDs each. Single-SSD performance is 83K IOPS, 510 MB/s linear, 500us latency.
Latency over the network is 19us, throughput a flat 1.24 GB/s.
And now the part that sucks. With thin volumes I get very good performance: 900 MB/s and 90K IOPS — impressive.
But as soon as I put some data on the volume, like a Windows installation, performance drops to 50K IOPS and... 200 MB/s linear (at best), with a more than reasonable latency of 500us.
The IOPS are still acceptable, but that linear performance I can get from two 10K SAS drives in RAID-1.
If I start with a thick volume I get that performance from the start: 50K IOPS and 180 MB/s (sometimes as low as 120).
I watched that famous YouTube video with the ScaleIO guys and noticed they created/used thin volumes. My guess is that once data is added, ScaleIO loses performance fetching blocks from across the cluster rather than favoring local copies.
RAM cache doesn't help (quite a bit slower, actually), and I don't think InfiniBand will do better because it's all in the TCP stack (IPoIB); at least on Ethernet I'm offloading here and there.

Here are some metrics I've gathered with fio, running: fio --name=read --filename=/dev/whatever --direct=1 --rw=read --bs=4KB --iodepth=32 --norandommap --ioengine=libaio --group_reporting --runtime 30

Latency 1st:

[root@node01 ~]# qperf somehost tcp_bw tcp_lat udp_bw udp_lat -oo msg_size:1K
tcp_bw:
bw = 1.24 GB/sec
tcp_lat:
latency = 19.5 us
udp_bw:
send_bw = 1.15 GB/sec
recv_bw = 1.15 GB/sec
udp_lat:
latency = 18.8 us
[root@node01 ~]#

Single SSD:

read : io=19656MB, bw=335450KB/s, iops=83862, runt= 60001msec
lat (usec) : 250=0.02%, 500=99.96%, 750=0.02%, 1000=0.01%

Scaleio Thin

read : io=9988.8MB, bw=340937KB/s, iops=85234, runt= 30001msec
lat (usec) : 100=0.01%, 250=13.13%, 500=70.78%, 750=15.77%, 1000=0.28%

Scaleio Thick

read : io=6876.2MB, bw=234694KB/s, iops=58673, runt= 30001msec
lat (usec) : 250=0.22%, 500=43.06%, 750=51.56%, 1000=4.59%
 

Paul Roland

Member
Jul 10, 2015
frogtech, there is something fishy here. I am using an H200 in IT mode as well.
The problem with it is I'm getting 300 points in AS SSD; with an H700, for example, I'm getting 900...
I think this matters, since it's the only thing we have in common other than the crap performance.
 

frogtech

Well-Known Member
Jan 4, 2016
Do you have the H700 flashed to IT mode, or are you using the standard Dell/LSI firmware? How do you present the disks to the operating system — is each disk an individual single-drive RAID-0 virtual disk?
 

Paul Roland

Member
Jul 10, 2015
No, I have the H200, not the H700, flashed to IT mode. It works as well: if I don't create an array, the disks are presented to the OS as sda and sdb. No difference.
I said I also have an H700, where I get better performance, and I would have tried the H700 for ScaleIO, but I won't bother anymore because I created an array of ramdrives and I'm getting only slightly better results: the same 50K IOPS but 320 MB/s linear. That is with ramdrives... which in my case do 600K IOPS, 3 GB/s, and 26us latency each. Quite a good simulation.
I don't know how to do that in Windows; in Linux I do modprobe brd rd_size=102400000 and then create a 20G volume (so it won't fill up the nonexistent 100G of RAM).
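(As a sanity check on that modprobe line — brd's rd_size parameter is in KiB, so 102400000 really is roughly the "100G" of RAM that doesn't exist:)

```shell
#!/bin/sh
# brd takes rd_size in KiB; convert the value above to GiB.
rd_size_kib=102400000
gib=$((rd_size_kib / 1024 / 1024))
echo "rd_size=${rd_size_kib} KiB is about ${gib} GiB per ramdisk"
```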
I appreciate that ScaleIO is free for testing; otherwise this, at say $2k per TB, would have a different effect.
I still think the issue is that ScaleIO tries to fetch blocks from all over the place, which is expected — but 160 MB/s over 10G is not.
 

Michael Saunders

New Member
Aug 29, 2016
I'm curious — you mention doing a node-to-node network test; what method do you recommend for that? We're troubleshooting ScaleIO performance on our system at the moment, and while I believe the issue is internal to ScaleIO (we're working with EMC on it), I'd like to be able to rule out the 10G network side of things.

Thanks,