RAID 30 terrible sequential read performance

Discussion in 'RAID Controllers and Host Bus Adapters' started by Railgun, Jul 28, 2018.

  1. Railgun

    Railgun New Member

    Joined:
    Jul 28, 2018
    Messages:
    8
    Likes Received:
    1
    Firstly, long time lurker, first time poster.

    Secondly, yes, you read the title correctly: RAID 30. Let me preface the meat of this post by saying I don't want to entertain the "why did you do this" and "you should've done that" discussion. I'll deal with it later when I can temporarily offload and possibly re-do the array.

    What I would like to do is ask those vastly smarter than I where this bottleneck may be.

    I have a hyper-v server running on an Asrock EP2C602-4L/D16 dual E5-2670s, 64GB memory. HBA is an Areca 1882-ix-24 with 4GB cache.

    Storage is as mentioned, a RAID 30 with 8x WD RED 3TB disks, 4k block size.

    Network is all Ubiquiti Unifi (for the relevant path anyway) on the same flat network, and jumbo frames are enabled everywhere.

    I have a VM running Plex on this server, with said array directly presented, and given all CPU cores, and dynamically allocated RAM, of which it's currently consuming a bit over 6GB (installed a desktop temporarily).

    Storage format is xfs (which was another good idea at the time I suppose).

    Storage is all movies at the moment; 1:1 blu-ray rips so all multi-gig file sizes, and should all be sequential reads/writes.

    I've checked the FS, and all is good.

    So, to the issue. Write speed is not great, about 50MBps. I'm using nmon to monitor disk and network performance on the VM, and writes are very spiky...a lot of inactivity, then a quick 400MB burst, rinse and repeat. Disk busy time is pretty low.

    Read speed is horrendous. I can pull about 6GB at 35-50MBps, but then it drops to 1MBps for reasons unknown. Disk busy time is 1% at most.

    I can't imagine this is a disk issue, but more a controller issue. I've been playing around with the controller settings; buffer threshold, read ahead, AV streams, etc...nothing is making a dent one way or another.

    The built-in benchmark tool will only use file samples up to 1GB, so it's all cache there...reads showing 322MBps and writes 3.4GBps! Needless to say, that result is skewed a lot.

    I'm at a loss on how to further check that the HW and/or config is good. This is a somewhat technical exercise, hence the ask not to go into woulda/coulda/shoulda.

    If there's more information required, please let me know and I'll provide it. Else, is RAID 3(0) just that crap? Thought the whole point was for sequential loads.


    TIA!
     
    #1
  2. fsck

    fsck Member

    Joined:
    Oct 10, 2013
    Messages:
    51
    Likes Received:
    12
    Read speed is shit because you need to read the full set of drives.

    I'm willing to bet that SoC speed is your limiting factor. You need to crunch a parity for every byte instead of every block like you would with raid 5. So it's I/O intensive.

    If you give me the problem of building a raid 30 controller, I'd probably do it in hardware on an fpga.

    Raid 5 and 6 can be done exclusively with XOR (pun intended). It's almost certainly hardware accelerated. I doubt they did any optimization for raid 30, as that costs silicon space.
     
    #2
    cactus and leebo_28 like this.
  3. Railgun

    Railgun New Member

    Joined:
    Jul 28, 2018
    Messages:
    8
    Likes Received:
    1
    Thanks for that. I wouldn’t have expected it to be that shit TBH.

    I suppose that answers my question then in what I need to do. In theory it would work. In practice, as you say, they decided to half ass it.
     
    #3
  4. mstone

    mstone Active Member

    Joined:
    Mar 11, 2015
    Messages:
    449
    Likes Received:
    104
    In theory it's a lousy raid level that nobody uses because it's terrible for 21st century hardware. It's byte oriented, which is exceeded only by RAID 2 (bit oriented) for being irrelevant on things you buy today. You can't do simultaneous I/O to multiple disks, which means that unless your entire workload is sequential (including metadata updates, etc.) your performance is bound by seeking every single disk for every non-sequential read.

    And for that horrible cost, the benefit is...nothing over RAID 5/6 on modern hardware. In the best case your read speed in RAID 3 is disk bandwidth * (n-1) disks; in RAID 5/6 it's approximately BW*n. For sequential writes using nv cache, both RAID 3 and RAID 5 are BW*(n-1). Except: consider what happens when writing single bytes on any modern storage medium (like, for example, a 4k disk), then recall that RAID 3 is byte oriented, then facepalm. Parity, readahead, and non-volatile caching aren't the issue they were 30 years ago.
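    As a back-of-the-envelope illustration of those N-1 vs N figures (all numbers here are assumptions for illustration: ~150MB/s sustained per spindle is a guess for a 3TB Red, and the 8 disks are taken as two 4-disk spans):

    ```shell
    # Best-case sequential read throughput, per the BW*(n-1) vs BW*n rule above.
    # All figures are assumptions for illustration, not measurements.
    per_disk=150    # MB/s sustained per spindle (guess for a 3TB WD Red)
    span_disks=4    # disks per span
    spans=2         # RAID 30 / RAID 50 here = two spans striped together

    raid30=$(( spans * (span_disks - 1) * per_disk ))  # parity spindle serves no data
    raid50=$(( spans * span_disks * per_disk ))        # rotating parity: every spindle serves data

    echo "RAID 30 best case: ~${raid30} MB/s"   # ~900 MB/s
    echo "RAID 50 best case: ~${raid50} MB/s"   # ~1200 MB/s
    ```

    The point being that even the handicapped RAID 30 figure is nowhere near the observed 1-50MBps, so raw spindle bandwidth isn't what's limiting here; it's the per-byte overhead described above.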

    Side note: back when RAID 3 was a thing, it used synchronized disks in dedicated hardware to limit the latency reading across the spindles. I'm not aware of any modern hardware that actually does that.
     
    #4
  5. Jeggs101

    Jeggs101 Well-Known Member

    Joined:
    Dec 29, 2010
    Messages:
    1,406
    Likes Received:
    198
    I'd agree on a different RAID level. RAID has been around for decades. There's a reason 0/1/5/6/10/50/60 are well supported on everything.
     
    #5
  6. Railgun

    Railgun New Member

    Joined:
    Jul 28, 2018
    Messages:
    8
    Likes Received:
    1
    Thanks folks. So while on paper it suggests that sequential transfers will benefit, which technically they do, in practice it sucks.

    Call me convinced. I'd hoped that this controller was a) beefier than it is in this instance, and b1) they'd either NOT support it if it can't hack it, or b2) they'd indicate that it sucks if you use it.

    And this is why I'm a network engineer, not a dba or the like.
     
    #6
    dswartz likes this.
  7. mstone

    mstone Active Member

    Joined:
    Mar 11, 2015
    Messages:
    449
    Likes Received:
    104
    The only reason I can think of that any modern controller would list RAID 3 support is that their marketing people didn't want to look worse than other products listing RAID 3 support. The engineers just hope nobody will try to use it. :D

    As far as the controller being "beefy", you won't find a better one. This isn't a controller issue. If you found one that actually performs well at RAID 3, it's really doing RAID 4 and lying.
     
    #7
  8. Railgun

    Railgun New Member

    Joined:
    Jul 28, 2018
    Messages:
    8
    Likes Received:
    1
    I’m making a lot of assumptions based on what I’m reading here, so the beefiness was in reference to fsck’s comment. As there’s no way to measure that, it’s all speculation at best anyway.

    And yeah...marketing teams trump all. C’est la vie.
     
    #8
  9. mstone

    mstone Active Member

    Joined:
    Mar 11, 2015
    Messages:
    449
    Likes Received:
    104
    xfs is a great choice.

    Disk busy time in this context is meaningless; it's just a measure of how fast the controller is responding and has nothing to do with the disks. Likewise, the writes are going into the controller cache, not the disks, and that's a typical pattern (stuff gets cached on the host, then squirted out to the controller). The thing that is unusual is to see such large writes on a periodic basis; if there were significant backpressure from the RAID controller I'd expect to see a much lower level of continuous writing. What's the source of the writes? If it's coming direct from the blu-ray then this is all completely normal, because the writer isn't generating data any faster than it's being written (you won't ever see more than 50-60MB/s from BD). A simple way to test pure sequential write speed is something like "dd if=/dev/zero of=file.name bs=64k oflag=direct count=819200", with different block sizes being better or worse on different systems (but also requiring different counts, which are in units of blocks; the preceding is 50GB--or you can drop the count= and just hit ctrl-c when you're bored).

    I doubt that; the numbers are too low for a CPU parity bottleneck in anything modern. Even if it were true, it would only affect writes, not reads. I'd lean more toward it being an issue with moving single bytes through the whole chain: controller, drives, cache, etc. Performance might be better with an enterprise SAS drive with different optimizations: a WD Red isn't really designed for low latency small block I/O, and I'd be curious what the service times look like from the HDs. A typical optimization is for the internal HD controller to never read less than a full track, then serve subsequent requests from that track out of the read cache (because the cost of reading a track is approximately the same as reading a block). But even from cache there's overhead in processing the request & response. Alternatively, if things are being coalesced between the disk & raid controller, you've got to unpack each coalesced request and reorder the bytes. This is where I think the real bottleneck is: either moving bytes around in RAM or sending byte requests to the disk. The way to really improve performance would be to coalesce the byte operations into much larger block operations...but then you've got RAID 4/5. :)
     
    #9
    Last edited: Jul 31, 2018
  10. Railgun

    Railgun New Member

    Joined:
    Jul 28, 2018
    Messages:
    8
    Likes Received:
    1
    In this context, a simple file copy via scp from a Win10 box. It's the only way I know to get a direct copy to an xfs file system from a Win box. I do the rips locally on my desktop, and simply copy them over via this method. I was copying it back from the server to do something else with one of the files when I discovered this issue.

    I've cobbled together enough spare disks to create a pool big enough to copy off this array and rebuild it as a 5. Though before I do, I'll look into doing more testing for academic purposes, if nothing else.

    I'm open to suggestions on how to measure said service times. This would also have the added benefit of helping me in a work capacity, seeing as I'm also doubling as a sysadmin who's trying to determine the source of some storage issues, but that's another story.
     
    #10
  11. mrkrad

    mrkrad Well-Known Member

    Joined:
    Oct 13, 2012
    Messages:
    1,226
    Likes Received:
    48
    raid-50 is faster!
     
    #11
  12. mstone

    mstone Active Member

    Joined:
    Mar 11, 2015
    Messages:
    449
    Likes Received:
    104
    At some point you can look at samba to expose the filesystem as an smbfs share you can mount as a drive from the windows machine. You should definitely test locally, as testing via scp makes it hard to distinguish disk problems from network problems.

    See the dd command line above for a quick & simple write test. For a read test, when that's done run "dd if=file.name of=/dev/null bs=64k".

    My reference to service times was purely a matter of curiosity; that's information that's only available to the RAID controller, and most don't expose that level of detail. More generally, you can use the iostat tool from the sysstat package as "iostat -x 1" to see things like average request size, service time, etc., which can give you a much better understanding of what's happening than just looking at the bandwidth--just remember that with hardware RAID you're seeing statistics to/from the RAID controller cache, which isn't the same as getting all the way to the disk.
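    Pulling the commands from this thread together, a minimal local test session might look like the following (run on the VM against the array mount point; `testfile` and the sizes are placeholders, and the dd operands are one reasonable choice, not the only one):

    ```shell
    # Sequential write test: 64MiB here just to illustrate; scale count up
    # well past the controller's 4GB cache (e.g. count=819200 for 50GiB)
    # before trusting the MB/s figure. conv=fdatasync forces a flush so the
    # timing includes getting the data to the array, not just to host RAM.
    dd if=/dev/zero of=testfile bs=64k count=1024 conv=fdatasync

    # Sequential read test of the file just written.
    dd if=testfile of=/dev/null bs=64k

    # Three one-second samples of per-device request sizes and service times
    # (from the sysstat package, if installed). With hardware RAID these
    # stats describe traffic to/from the controller cache, not the spindles.
    command -v iostat >/dev/null && iostat -x 1 3

    rm -f testfile
    ```

    Running the dd commands while iostat samples in a second terminal lets you correlate the reported MB/s with average request size, which is where a byte-oriented RAID level would show up as pathologically small requests.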
     
    #12
  13. Railgun

    Railgun New Member

    Joined:
    Jul 28, 2018
    Messages:
    8
    Likes Received:
    1
    Welp...it's not a storage issue. Well...it's probably not anyway. It's perhaps a server issue on the network side (it's NEVER the network!)

    It's not really a network issue per se, but I've not yet determined the cause.

    In short, the RAID 50 array is exhibiting the same behavior. Taking a trace, normal latency for these transfers is about 1.5ms, measured from the last ACK to the server to the next couple of frames from the server. Occasionally I see a zero window from my desktop...it recovers immediately, and back on we go.

    At some point, and for no particular reason, the latency skyrockets to about 16ms from the ack to the next few frames.

    Now, I've tested this from two different sources on the same VM...as mentioned, I have the array exposed directly to the VM, as well as a separate SSD that I also have directly exposed. I've copied to three different targets, two on the same machine. Of those two, one is the aforementioned R5 array (4x WD Black 2TB) and one is an SSD...albeit an older SM841. The other is a mix of drives in a 6TB pool on a separate Win10 box. All exhibit the same behavior at some point.

    I'm not convinced it's a network issue per se in the context of switch config, but I will make some adjustments to vet that out as well by connecting everything to the same local switch. All things being equal, it seems to be a server-side thing, but that's TBD.

    This trace could all be a red herring...and I'm still digging around, but it's frustrating to say the least.
     
    #13
  14. Railgun

    Railgun New Member

    Joined:
    Jul 28, 2018
    Messages:
    8
    Likes Received:
    1
    I've come to the conclusion that SCP sucks.

    I've tested pushing from the VM in question to my PC...25MB/s solid. Then I realized I had samba installed on the VM, but had never actually configured the share. Did so, accessed it from the PC in question and moved the file over...pulling from the VM...100+MB/s all day long.

    The last time I'd tried this (more than likely incorrectly), I wasn't able to access the array, it being XFS-formatted. At least, I'd read about countless issues natively accessing an XFS filesystem from Windows.

    So...I've destroyed my array for no reason. But not the end of the world. Chalk it up to lessons learned.
     
    #14
  15. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    672
    Likes Received:
    233
    It does - don't ever use it for benching! I'd second the prior comments about testing the disc IO locally (be it with dd or fio or a dozen other utilities) so as to keep the network out of the equation until you're happy with the internal performance.

    You don't actually access the array as XFS - your windows machine talks CIFS to samba, which talks I/O to the linux kernel, which then talks XFS to the discs. You can have any file format underpinning the discs behind samba as long as the OS running samba understands it.

    As an aside... other than lockstep spindles not really being a thing any more, isn't RAID 3 (and 4 along with it) frequently also bottlenecked by having a dedicated parity drive rather than distributed parity a la RAID 5|6? At some point in spindle scaling you reach the limit of access to your parity drive and performance plateaus.
     
    #15
    Last edited: Aug 1, 2018
  16. Railgun

    Railgun New Member

    Joined:
    Jul 28, 2018
    Messages:
    8
    Likes Received:
    1
    Thanks. As mentioned, I'm not a storage guru by any stretch. Just a PEBKAC all day long. :oops:

    And I wasn't using SCP to bench per se, rather just the mechanism to do the transfers in my normal workflow.
     
    #16
  17. mstone

    mstone Active Member

    Joined:
    Mar 11, 2015
    Messages:
    449
    Likes Received:
    104
    The parity drive in raid 3 isn't generally more or less utilized than the other drives; unless you're literally writing a single byte, you'll write to all the disks simultaneously. For raid 4 it's much more likely that you'll be writing to a single block, which requires a read of both the data & parity blocks, then writing both; for raid 4 small random writes your performance is roughly the same as writing to a single drive, with no benefit from the additional spindles. For raid 5/6, you can seek and write to multiple drives simultaneously so your write rate goes up. Rotating parity has a more consistent effect on read performance: on raid 3/4 it's on the order of N-1 drives, for raid 5/6 it's closer to N drives.
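    The raid 4 small-write cycle described above can be sketched numerically (the 4-ops-per-write figure follows from the read-modify-write sequence; the write counts are purely illustrative):

    ```shell
    # Each small (partial-stripe) write costs: read old data + read old parity,
    # then write new data + write new parity = 4 disk operations.
    user_writes=1000
    total_ops=$(( user_writes * 4 ))    # the classic 4x small-write penalty
    parity_ops=$(( user_writes * 2 ))   # 1 read + 1 write of parity per user write

    echo "${user_writes} small writes -> ${total_ops} disk operations"
    echo "RAID 4: the single parity disk absorbs all ${parity_ops} parity ops"
    echo "RAID 5/6: those ${parity_ops} parity ops rotate across every spindle"
    ```

    Which is the quantitative version of the point above: the penalty itself is the same for raid 4 and raid 5, but distributing the parity ops is what lets raid 5/6 scale with spindle count.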
     
    #17