Raid - SSD - Optimization

Tayschrenn · Sep 22, 2020

Setting up an all SSD server using Micron's 5210 Ion 7.68's - using an AOC-S3108L-H8iR - and I'm getting absolute trash for write performance using DiskSpd/CrystalDisk 'real' world performance.

For starters, in JBOD I'm getting perfectly acceptable results with the same test:
------------------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

[Read]
Sequential 1MiB (Q= 1, T= 1): 317.481 MB/s [ 302.8 IOPS] < 3301.36 us>
Random 4KiB (Q= 1, T= 1): 21.717 MB/s [ 5302.0 IOPS] < 188.23 us>

[Write]
Sequential 1MiB (Q= 1, T= 1): 338.690 MB/s [ 323.0 IOPS] < 3093.16 us>
Random 4KiB (Q= 1, T= 1): 61.783 MB/s [ 15083.7 IOPS] < 65.93 us>

[Mix] Read 70%/Write 30%
Sequential 1MiB (Q= 1, T= 1): 295.500 MB/s [ 281.8 IOPS] < 3541.71 us>
Random 4KiB (Q= 1, T= 1): 20.417 MB/s [ 4984.6 IOPS] < 200.04 us>

Profile: Real
Test: 64 GiB (x2) <0Fill> [Interval: 5 sec] <DefaultAffinity=DISABLED>
OS: Windows Server 2019 [10.0 Build 17763] (x64)
------------------------------------------------------------------------------
But when I try it in any other configuration - Storage Spaces was tried, but that's never done good on Parity - other than Raid-0, it is absolute trash.

An example: Raid5+0 with a pair of 12-disk raid-5's - no read ahead / write through configuration
------------------------------------------------------------------------------
[Read]
Sequential 1MiB (Q= 1, T= 1): 757.525 MB/s [ 722.4 IOPS] < 1383.17 us>
Random 4KiB (Q= 1, T= 1): 20.895 MB/s [ 5101.3 IOPS] < 195.63 us>

[Write]
Sequential 1MiB (Q= 1, T= 1): 109.678 MB/s [ 104.6 IOPS] < 9549.54 us>
Random 4KiB (Q= 1, T= 1): 11.595 MB/s [ 2830.8 IOPS] < 352.57 us>

[Mix] Read 70%/Write 30%
Sequential 1MiB (Q= 1, T= 1): 405.627 MB/s [ 386.8 IOPS] < 2582.32 us>
Random 4KiB (Q= 1, T= 1): 15.446 MB/s [ 3771.0 IOPS] < 264.59 us>

Profile: Real
Test: 64 GiB (x2) <0Fill> [Interval: 5 sec] <DefaultAffinity=DISABLED>
OS: Windows Server 2019 [10.0 Build 17763] (x64)
------------------------------------------------------------------------------
So the only thing I can think of is to use a raid-0 with a hot spare and trust in the SSD Protection feature that Avago/LSI/MegaRaid cards have, which is supposed to swap in a good SSD into a raid-0 before utter failure.

I've tried raid-5 FastPath as well, with no better results. That really boggles my mind as FastPath in a Mirror works great. I understand Parity calculations lower performance, but I can get a better sequential write using a raid-50 of NL-SAS 7200 rpm HDDs with more capacity, in a 2U system. The random IOPS aren't as good it's true, but there's no reason I can think of that the parity calculation on any modern RAID card should impact performance that much.

I've used raid-5 on SAS-SSDs with Dell servers using basically the same cards - same config of write through / no read ahead - and get much better performance despite only using 6xSSD, this is 24 - the amount of additional write paths should scale.

Thoughts? Am I expecting too high performance on a 64GiB file in Parity?

Use case: Veeam Repository (thus my use of large file size) - in real world use would it use more than a Q1/T1 for writes?

MBastian · Sep 22, 2020

Are you sure that that the Raid wasn't still initializing (!= Raid0)

i386 · Sep 22, 2020

I don't know how to interprete these numbers...
Are the numbers from "raid 5+0" the hardware raid? If yes what stripe size did you use to create the array?

Tayschrenn said:
So the only thing I can think of is to use a raid-0 with a hot spare and trust in the SSD Protection feature that Avago/LSI/MegaRaid cards have, which is supposed to swap in a good SSD into a raid-0 before utter failure.

Stop.
Raid 0 offers no redundancy or parity. Losing a single device will destroy the entire array.

Tayschrenn · Sep 22, 2020

Yes it was done with a fast initialization, stripe size was 256k which is pretty standard - on both the Raid-5 and Raid-5+0 (Raid 50)

Technically as a backup server, it's not a huge deal if i lose the array - I follow the 3-2-1 rule, so all I'd be losing would be at most, 6 days worth of backups (weekly fulls to tape) and even then, 80% of my data also has SAN snapshots I can recover from.

So Raid-0 is doable, more so with the LSI SSD Guard - that's designed to at the slightest indication of issues, take a drive out and put the hotspare in.

I was hoping it was just an odd setting issue with SSDs and parity I've not seen before. I'm half tempted to try using the 2gb of RAID memory instead of write through, but everything I've done in the past has clearly shown that SSD Write Through is faster than using the raid cache.

EffrafaxOfWug · Sep 23, 2020

i386 said:
Stop.
Raid 0 offers no redundancy or parity. Losing a single device will destroy the entire array.

Tayschrenn said:
Technically as a backup server, it's not a huge deal if i lose the array - I follow the 3-2-1 rule, so all I'd be losing would be at most, 6 days worth of backups (weekly fulls to tape) and even then, 80% of my data also has SAN snapshots I can recover from.

To paraphrase an old adage;

There are two types of people in this world: those who have lost an array on their backup server in the middle of a restore, and those who haven't lost an array on their backup server in the middle of a restore yet.

Even if you're sticking with 3-2-1, RAID0 is multiplying your risk for no good reason - if a "proper" RAID is too costly you're much better off using JBOD if at all possible.

Tayschrenn · Sep 23, 2020

EffrafaxOfWug said:
To paraphrase an old adage;

There are two types of people in this world: those who have lost an array on their backup server in the middle of a restore, and those who haven't lost an array on their backup server in the middle of a restore yet.

Even if you're sticking with 3-2-1, RAID0 is multiplying your risk for no good reason - if a "proper" RAID is too costly you're much better off using JBOD if at all possible.

Not really a question of cost, I'm talking performance and what should be expected of 24 Enterprise SATA SSD in a modern array. JBOD does no better - I've tested a mirror-accelerated parity with this - without swapping in an NVMe or pair of SAS/Write Optimized SSD for the Mirror (Or in ZFS I could use a write cache NVMe I suppose)

But that would then require swapping to Ubunutu. Not that I'm against it - Veeam actually has a new XFS/ZFS guide for a secure linux repo that's awesome - but completely changes the build, and makes what we bought less ideal.

At the end of the day, I don't understand from a technical perspective why on a raid-5 of 24xSSD Enterprise drives, a Single Queue of a 512Kb/1Mb block size file write (25Gb or 64Gb both tested) would be 1/3 the speed of what you can get when writing to a single SSD in JBOD. Parity calculations should not cause that much grief.

Heck I even tested for shits and giggles and got a better result (290MB/s) on a MAP using the same system; aka a single mirror using Windows > 24 disk raid-5 using Avago/LSI - that seems highly improbable unless something is really amiss that I'm, well - missing.

Dreece · Sep 23, 2020

Frustrating indeed.

My guess it's those QLC drives, IIRC they have pretty poor write iops around the 4000 or so mark, maybe a buffering issue on the drives themselves which causes havoc for parity read/writes, possible a phone-call to Micron could shed some light on it, ie they may have a different firmware for you to try out... Micron are pretty good like that.

Another consideration, based on personal experience > maybe one or more of those drives isn't playing ball nicely and is dragging the array speed down on writes.

On random writes those QLC drives lose endurance ridiculously quick, to the point that they just don't seem economical outside of mostly large sequential block storage, but this is a whole different issue and one which doesn't affect your use case for them.

Personally, and this is in no way intended to cause offence, I wouldn't touch those drives even for backup, and even if they were given to me for free.

BoredSysadmin · Sep 24, 2020

Dreece said:
Frustrating indeed.

My guess it's those QLC drives, IIRC they have pretty poor write iops around the 4000 or so mark, maybe a buffering issue on the drives themselves which causes havoc for parity read/writes, possible a phone-call to Micron could shed some light on it, ie they may have a different firmware for you to try out... Micron are pretty good like that.

Another consideration, based on personal experience > maybe one or more of those drives isn't playing ball nicely and is dragging the array speed down on writes.

On random writes those QLC drives lose endurance ridiculously quick, to the point that they just don't seem economical outside of mostly large sequential block storage, but this is a whole different issue and one which doesn't affect your use case for them.

Personally, and this is in no way intended to cause offence, I wouldn't touch those drives even for backup, and even if they were given to me for free.

these are QLC drivers indeed, but they do offer PLP so maybe no everything lost.
here are few good reads:

Storage Spaces Direct troubleshooting

Learn how to troubleshoot your Storage Spaces Direct deployment by confirming the make and model of your SSD, inspecting for faulty drives, and more.

docs.microsoft.com

Troubleshooting performance issues on your windows storage. Storage Spaces Direct – jtpedersen.com IT made simple

EffrafaxOfWug · Sep 24, 2020

Tayschrenn said:
Not really a question of cost, I'm talking performance and what should be expected of 24 Enterprise SATA SSD in a modern array. JBOD does no better - I've tested a mirror-accelerated parity with this - without swapping in an NVMe or pair of SAS/Write Optimized SSD for the Mirror (Or in ZFS I could use a write cache NVMe I suppose)

At the end of the day, I don't understand from a technical perspective why on a raid-5 of 24xSSD Enterprise drives, a Single Queue of a 512Kb/1Mb block size file write (25Gb or 64Gb both tested) would be 1/3 the speed of what you can get when writing to a single SSD in JBOD. Parity calculations should not cause that much grief.

Missed that you were using the 7.5TB Micron 5210's (and I assume under windows as well?); this was something we evaluated as well given the substantial price decrease compared to the 5200/5300 drives of the same size... suffice to say we went with the TLC models due to concerns about speed and endurance. The 5210s are definitely not as fast in sequential or random workloads.

That might not be the case for you - we're softraid through and through and you're using a hardware RAID controller which can be a significant bottleneck in of itself depending on the drives and the workload. I assume if you're using 24 drives on an 8-port adapter there's some expanders in play also?

Personally if it were me I'd always go the *nix+softraid route if at all possible but of course that can be a whole other can of worms if you're not familiar. If you have an HBA and/or a bootable linux distro to hand it can be useful for at least comparing different setups though.

i386 · Sep 24, 2020

Tayschrenn said:
At the end of the day, I don't understand from a technical perspective why on a raid-5 of 24xSSD Enterprise drives, a Single Queue of a 512Kb/1Mb block size file write (25Gb or 64Gb both tested) would be 1/3 the speed of what you can get when writing to a single SSD in JBOD.

This is pretty simplified and not 100% accurate, but I hope it makes some things clearer

In a jbod configuration every io is send to the drive directly and the ssd controller optimizes the requests.
In a raid configuration the io is mapped to the size of your stripesize by the raid controller; every io becomes a request with a size of 256KB to the actual ssd.
With random io in the benchmarks your controller sends io requests with the size of 256KB to different ssds. In the best case 64 4KB blocks are on the same stripe and would benefit from one actual 256KB read of/write to the ssd.

3nodeproblem · Oct 30, 2020

Didn't see this thread before I already ranted a bit about my 5210 IONs here.
But yeah, sequential write is abysmal, I get about 50MiB/s sustained. msecli says the disk doesn't support overprovisioning, but I'm attempting to just partition part of the drive now and see if it makes any difference at all. I may just try to contact Micron about it.

EffrafaxOfWug · Oct 30, 2020

3nodeproblem said:
But yeah, sequential write is abysmal, I get about 50MiB/s sustained. msecli says the disk doesn't support overprovisioning, but I'm attempting to just partition part of the drive now and see if it makes any difference at all.

If our tests were anything to go by, overprovisioning won't help any. QLC is seemingly just slow; without using the "pseudo-SLC" hacks that most consumer drives use (which work OK for desktop workloads but don't do diddly for 24/7 "enterprise" usage) writes eventually just fall off a cliff.

I'm not sure what prices are like where you are, but here the 5210s are about 70% of the cost of 5300s of the same size but I still don't think that's enough of a price differential to excuse the performance penalty, not to mention the potential endurance/longevity issues.

Tayschrenn · Oct 31, 2020

So, a couple things:
Using RAID and the onboard 2gb cache, things are... actually pretty damn decent. I can easily write 1GB/s sustained and given the bottleneck of it being a hyper-v backup even from an all flash array, I cannot hit that anyways.

It's possible that jbod would work better in linux, but in Storage spaces, it was trash on JBOD (I put the controller in a true jbod mode) - and I'm likely one of the more versed Storage Spaces engineers out there (I run half a dozen refs mirror accelerated parity and mirror/mirror hybrid builds) to the point that I found a bug Microsoft has yet to fix on it regarding destage and clones.

BUT

I've also been seeing some weird drops where the virtual disk just totally shuts down and throws a bunch of errors. Specifically I was able to replicate the issue when doing a large (9TiB)robocopy with /J so bypassing windows cache.

SuperMicroCorp (via Amax the vendor), Micron, and Avago/LSI have all been reviewing logs and trying to figure out what's going on.

They're now sending me a brand new chassis with a new backplane, and new controller. Should arrive next week to test again.

Still, it really isn't bad on performance with using the 2gb cache on the card. In fact it's suspicious on how the performance doesn't degrade vs when in JBOD. it really makes me wonder.

This is mostly sequential, but even for things like roll ups of backups, synthetic fulls, reverse incremental, the performance is way way better than what you can get with 48+ NL-SAS drives and the cost is only slightly more than that sort of build.

I'll report back on how the replacement hardware does.

Search

Raid - SSD - Optimization

Tayschrenn

New Member

MBastian

Active Member

i386

Well-Known Member

Tayschrenn

New Member

EffrafaxOfWug

Radioactive Member

Tayschrenn

New Member

Dreece

Active Member

BoredSysadmin

Not affiliated with Maxell

Storage Spaces Direct troubleshooting

EffrafaxOfWug

Radioactive Member

i386

Well-Known Member

3nodeproblem

Member

EffrafaxOfWug

Radioactive Member

Tayschrenn

New Member