FreeBSD pre-prod test: S3700 horrible SLOG performance


RedHeadedStepChild

New Member
Jan 22, 2019
5
0
1
Long time lurker, first time poster, yadda yadda. I'll do my best to get all the pertinent data in here, sorry in advance for the length.

I've built a test rig to evaluate using ZFS as a backup and archival storage target for those items which do not make sense to store in one of our various enterprise SANs or our hyperconverged environments. The primary goal is capacity, secondarily performance, and finally some semblance of resilience.

The current build is a collection of lab hardware, not necessarily indicative of the final design goal BUT enough to test some of our workload targets:

HP DL380 gen 7 chassis
Dual socket 6c/12t 2.2GHz Xeon v3's
144GB DDR3 1066MHz ECC memory
Dual LSI 9200's (HP rebranded), each with dual 6Gbps Mini-SAS external ports
HP i420 SmartArray card, BBU and caching disabled, 8 x SATA interface (four per controller)
Booting from a mirrored pair of 32GB Samsung USB3 keys for now
Four HP SAS drive trays, can't remember the model, each one holds Qty 12 x 3.5" SAS drives (each tray connected by one 6Gbps SAS cable)
Qty 48 2TB 7200rpm HP SAS drives strewn about the four trays
Qty 2 Intel S3710 400GB SATA drives, each configured as a standalone single-drive RAID0 device in the SmartArray card, since the controller doesn't support JBOD natively. Thus, both drives are exposed individually, even though I lose SMART. Yeah, I know, it's not indicative of final. Write caching is disabled on them.

Oh, and running FreeNAS 11.2u1 (the current build as of this writing).

Built the main pool out of eight RaidZ2 vdevs of 6 drives each (keeping with the recommended 2^n + 2 layout) and did NOT configure a SLOG yet. I kept all the ZFS defaults of record size = 128k, lz4 compression, no dedupe, atime on, sync=default.
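For reference, the CLI equivalent of that layout is roughly the following (I actually built the pool through the FreeNAS UI, and da0 through da47 here are placeholder device names, not the real ones on this box):

# sketch only -- eight 6-wide RAIDZ2 vdevs in one pool
zpool create tank \
  raidz2 da0 da1 da2 da3 da4 da5 \
  raidz2 da6 da7 da8 da9 da10 da11 \
  raidz2 da12 da13 da14 da15 da16 da17 \
  raidz2 da18 da19 da20 da21 da22 da23 \
  raidz2 da24 da25 da26 da27 da28 da29 \
  raidz2 da30 da31 da32 da33 da34 da35 \
  raidz2 da36 da37 da38 da39 da40 da41 \
  raidz2 da42 da43 da44 da45 da46 da47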

From the shell interface of FreeNAS itself (I'm not remote) I ran a simple dd if=/dev/zero of=/mnt/tank/blahddfile bs=4k count=102400

Performance was great, picked up something like 2.8Gbytes/sec. Between the RAM and compressing zeroes being really easy, this isn't a realistic figure. But hey, it's a data point.

Went back to storage config, disabled compression, disabled atime, and ran the test again. Now we're seeing 800Mbytes/sec -- which is more in line with what I expected to see. Obviously streaming zeros still isn't indicative of anything truly useful, but I'm really just sanity testing for now.

Then I added the pair of s3710's as a mirrored SLOG device and set the pool to sync=always. I re-ran the DD command. And I waited. And I waited.

And I waited.



WTF, is it broken? I Ctrl-C'd the task, and it barfed out 13Mbytes/sec.
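For the record, the SLOG step above in CLI terms is roughly this (a sketch; gpt/slog0 and gpt/slog1 are placeholders for whatever labels the two S3710's actually got):

zpool add tank log mirror gpt/slog0 gpt/slog1
zfs set sync=always tank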

Used the command line to rip out the SLOG from the pool, used dd to zero the SSDs, built new partitions, then built a new pool with just the two SSDs in a mirrored set and mounted it. This new pool again uses the defaults: record size = 128k, sync=default, lz4, atime on, no dedupe. Reran the dd command against the new SSD pool, and BOOM: 2.8Gbytes/sec. Yay, RAM and compression.
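Roughly, those steps from the CLI (device names are placeholders, and the log vdev name is whatever zpool status actually reports):

zpool remove tank mirror-8                      # detach the mirrored log vdev
dd if=/dev/zero of=/dev/da48 bs=1m count=100    # clobber the old labels on each SSD
dd if=/dev/zero of=/dev/da49 bs=1m count=100
zpool create ssdpool mirror da48 da49           # SSD-only mirror, default settings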

Turned off compression, turned off atime, ran the test again. 550Mbytes/sec.

Forced sync=always. Reran the test. 330Mbytes/sec. Let's be honest: these S3710's aren't Optanes, and 330Mbytes/sec is absolutely OK in my book. And since I'm forcing sync on a pair of mirrored drives without a SLOG, I'm incurring double writes on these two drives to hit that 330Mbytes/sec number. Honestly, this is probably better than I deserve on these drives.

Ok, the drives are good. Maybe it was just a gremlin in the original pool config somehow? I blew away the SSD pool, re-added the drives back into the main TANK pool as mirrored SLOG, retested with sync=always.

Can you guess? Did you guess 13Mbytes/sec? I didn't, but that's what it was.

What.
In.
The.
Eff.

Recap: I used the same pair of S3710's as a mirrored pool on their own, with sync forced, and they'll knock down 330Mbytes/sec. I literally use the same drives, in the same mirror configuration, but now as a SLOG for the larger pool, and they wallow in at 13Mbytes/sec. If I rip out the SLOG entirely and test the main pool, it will knock down >800Mbytes/sec.

Something is broken. Thoughts?
 

marcoi

Well-Known Member
Apr 6, 2013
1,532
288
83
Gotha Florida
What does zpool iostat -v poolname 1 show is happening while you're running your benchmarks?

Also, have you tried fio from the local shell in FreeNAS? You need to cd to the pool's mount point, e.g. /mnt/tank1, to run the command below.

fio --output=128K_Seq_Write.txt --name=seqwrite --write_bw_log=128K_Seq_Write_sec_by_sec.csv --filename=nvme0n1p1 --rw=write --direct=1 --blocksize=128k --norandommap --numjobs=8 --randrepeat=0 --size=4G --runtime=600 --group_reporting --iodepth=128
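If it's specifically the sync=always path you want to see, a variant along these lines (just a sketch; point --directory at your pool's mount point) makes every write an O_SYNC write at queue depth 1:

fio --name=syncwrite --directory=/mnt/tank1 --rw=write --blocksize=128k \
    --size=4G --numjobs=1 --iodepth=1 --sync=1 --runtime=120 --group_reporting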
 

RedHeadedStepChild

New Member
Jan 22, 2019
5
0
1
zpool iostat shows almost nothing happening when those drives are in the SLOG -- 16 write I/Os and about 1.45KB/sec of bandwidth per drive. When they're in their own standalone mirror pool, they knock down tens of thousands of write I/Os and several hundred MB/sec of bandwidth per drive.

I'm going to completely wipe this installation and rebuild it. It occurred to me last night that I first built this test rig on the 11.2 BETA, and at that time I was using only two drive trays in a SAS multipath configuration. I still have some drives which report "broken paths" even though I've completely wiped and reloaded those pool and drive configs. That shouldn't affect the SSDs in the slightest, but now I'm wondering if this is just cruft from the 11.2 BETA -> 11.2 RELEASE -> 11.2u1 upgrade path combined with my various pool config tests / wipes / rebuilds (repeated about six times).

Shouldn't take more than an hour to completely rebuild it from scratch. I'll report back when I'm done.
 

RedHeadedStepChild

New Member
Jan 22, 2019
5
0
1
No luck.

Wiped all the drives, cleanly reinstalled FreeNAS 11.2u1 from newly created media, and then rebuilt the pool as before: 8 vdevs of 6 drives, two S3710's as SLOG, and for fun I tossed in another pair of S3710's for L2ARC. Took all the defaults except turning compression and atime off, then piped /dev/zero to a flat file and picked up something like 800Mbytes/sec. Same result as last time.
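(The L2ARC pair went in the same way as the log devices, just as cache devices; with placeholder labels it's simply:)

zpool add tank cache gpt/l2arc0 gpt/l2arc1    # cache devices can't be mirrored; they stripe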

Then I flipped sync=always, retested, and again ... 13Mbytes/sec.

Blew out the pool, created a new pool with literally two spindle drives in mirror, and the two S3710's as the SLOG mirror.

Sync off = about 100Mbytes/sec.
Sync on = 13Mbytes/sec.

Blew out the pool, created a new pool with literally JUST the two S3710's in a mirror (same ones I just deleted from mirrored SLOG duty.)

Sync off = about 500Mbytes/sec
Sync on = about 300Mbytes/sec.

What the hell is broken here?
 

marcoi

Well-Known Member
Apr 6, 2013
1,532
288
83
Gotha Florida
Did you try the fio command I listed above?

I also usually add the log device using the legacy GUI: Storage > Volume Manager, then manual setup.

Maybe try that method, add only one drive as log, and then test again.
 
  • Like
Reactions: Patriot

herby

Active Member
Aug 18, 2013
187
53
28
Out of curiosity, why are you turning off compression? I think ZFS is faster with lz4 than uncompressed, not that I'd expect it to cause your massive drop in performance.
 

Rand__

Well-Known Member
Mar 6, 2014
6,626
1,767
113
Probably just to test the actual pool performance, as opposed to the CPU's ability to compress lots of zeroes into a single one ;)
 
  • Like
Reactions: herby

Patriot

Moderator
Apr 18, 2011
1,450
789
113
ffs stop using DD for performance tests
Use fio; sync is absolute worst-case performance and I am not surprised it is shit.
If you want performance... stop limiting your drive to Q1 performance.

Also ./hpssacli controller slot=0 modify hbamode=on forced
So that you aren't running zfs on top of controller striping.... I can only imagine how randomly ****y the performance could be with that.
 
  • Like
Reactions: nerdalertdk

zxv

The more I C, the less I see.
Sep 10, 2017
156
57
28
Thus, both drives are exposed individually, even though I lose SMART.
Just a heads up....
On HP RAID controllers and single-drive RAID0 devices, one doesn't necessarily lose SMART.
Syntax varies with driver. For the cciss driver, it's possible to query the drives using:
smartctl -a -d cciss,<volume> /dev/cciss/<device>
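On FreeBSD/FreeNAS the Smart Array driver is ciss(4), so (if I have the device node right) it would be along the lines of:

smartctl -a -d cciss,0 /dev/ciss0    # 0 = index of the disk behind the controller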
 

Patriot

Moderator
Apr 18, 2011
1,450
789
113
Just a heads up....
On HP RAID controllers and single-drive RAID0 devices, one doesn't necessarily lose SMART.
Syntax varies with driver. For the cciss driver, it's possible to query the drives using:
smartctl -a -d cciss,<volume> /dev/cciss/<device>
Oh... I forgot that FreeBSD never ported the hpsa SCSI driver and is still trying to use the old block driver for new cards... any performance anomalies could just be driver-related, from using something that was never tested or intended for use with SSDs. It might not even work in HBA mode.
 

RedHeadedStepChild

New Member
Jan 22, 2019
5
0
1
Did you try the fio command I listed above?
Yup!

I've been moving forward on identifying how I want the pool configured. At the moment I'm running 5 x RaidZ3's of 9 disks, then 2 x S3710's as SLOG, and 2 x S3710's as cache.

Compression OFF, sync = standard: 4321MiB/s, 34.6k IOPS.
Compression OFF, sync = always, SLOG = mirror: 10.2MiB/s, 82 IOPS.
Compression OFF, sync = always, SLOG = single (per your suggestion): 23.9MiB/s, 214 IOPS.

I've built all of this using the command line; GUI typically just slows me down -- especially when building ZFS pools out of 52 disks!

That is the correct sync performance you are seeing, nothing wrong.
Tell me more about your perspective here. This isn't my first ZFS build, nor my first FreeNAS build. At home, I have a crap test rig with (among other things) a pair of OCZ Agility 3 240GB drives (seven years old, SandForce 2200-series controllers, reclaimed from well-abused old laptops) as mirrored SLOG, and they manage 65MiB/s with sync=always -- five times what these S3710's are doing as SLOG here. "Nothing wrong" seems an unlikely statement given the other examples from my various ZFS implementations, past and current. If you have different experience to share, I'm very interested to hear it!

ffs stop using DD for performance tests
You might consider the principle of charity here; simply because you do not agree with or understand my method or perspective does not invalidate it. None of my prior statements gave specific weight to dd-based testing; however, it IS a useful emulation of certain streaming-write workloads which are applicable to me.

The performance of this device in streaming-write work is abysmal, and given the hardware, the configuration, and my experience building other ZFS rigs, I find this performance to be very far out of the ordinary. My goal here is to find out whether others have seen similar issues and how they resolved them.
Use fio; sync is absolute worst-case performance and I am not surprised it is shit.
I agree sync writes are the absolute worst case, yet my need for them still persists because several of my workloads will be sync constrained. Since FIO has shown even worse numbers than DD, do you have any recommendations on other items worth checking?

If you want performance... stop limiting your drive to Q1 performance.
I find this a strange statement; SLOG is all about Qd=1 workload performance. A SLOG wants very low latency storage with exceptional Qd=1 behavior, which is not necessarily reflected in the high throughput numbers of modern consumer NVMe drives. This is also why Optane drives make fantastic SLOG devices, even with lower overall throughput numbers.
Also ./hpssacli controller slot=0 modify hbamode=on forced
So that you aren't running zfs on top of controller striping.... I can only imagine how randomly ****y the performance could be with that.
What is the name of the package which includes this hpssacli tool? I do not have this binary on my system. Also, can you be more specific about what "striping" you think might be happening? The very small amount of documentation available seems to indicate this controller treats a single drive in RAID0 as functionally an HBA access model rather than any sort of "striping" method.

However, the principle of charity suggests I take your suggestion at its most positive face value, so let us assume striping is happening and therefore "taxing" my overall SSD performance in both latency and bandwidth. Fortunately for me, the possible performance implications of the HP i420 storage controller are absolutely something which had occurred to me. I tested this hypothesis and documented it in my first post: I built a new ZFS pool using only the two S3710 drives as a single mirrored vdev. I documented significant and consistent performance (>300MiB/s) with sync=always against this mirrored vdev configuration. Then, without a reboot or any controller manipulation, I destroyed that SSD-only mirror zpool, re-added those same drives as SLOG to the larger zpool, and reran the sync=always tests -- with the same 13MiB/s results.

If it exists, this hypothetical single-device striping behavior of the controller may indeed be taxing my SSD performance. Yet this "tax" doesn't explain the 300MiB/s -> 13MiB/s write performance drop between the mirrored zpool of my two S3710's by themselves with sync=always and the exact same pair of drives in a mirrored SLOG vdev with sync=always.

Just a heads up....
On HP RAID controllers and single-drive RAID0 devices, one doesn't necessarily lose SMART.
Syntax varies with driver. For the cciss driver, it's possible to query the drives using:
smartctl -a -d cciss,<volume> /dev/cciss/<device>
You are absolutely correct; my statement was in the context of the "inbuilt" SMART reporting versus the cron-able CLI method you describe -- I know others have been quite successful with that method, too. Thank you for clarifying this point for future readers of the thread, as it's absolutely valid data.

Back to the PoC...

My continuing collection of data suggests I'll probably be happy with four vdevs of 11 disks in Z3. This rig is a little low on disk bandwidth because these 2TB drives are old and generally slow. The new rig will be using EXOS 12TB SED drives; I'll be using the SED integration in the current FreeNAS distro so that if/when a disk bites the dust, I can just toss it and not care about the data on it. I'll mount them in perhaps a 12Gb SAS SuperMicro 44-drive 4U chassis, or maybe a BackBlaze Pod 5, or perhaps a 45Drives.com chassis with a hot spare. I'll connect the main pool drives with a pair of 12Gb SAS PCIe 3.0 LSI 9308's or something incredibly similar. A SLOG pair of 375GB P4800X's along with an L2ARC pair of Sammy 960 Pro's should round it out. I'll toss in a mirrored pair of 300GB 10k drives for boot, just to say I did :) This will all get powered by one of the midrange, high-clock Xeons with somewhere between 256 and 512GB of ECC load-reduced DIMMs. The extra memory will be there to cover several workloads where dedup has shown itself to make sense in my PoC testing against copies of production data sets.

For continued testing, I spent $150 and bought a pair of those cheap 32GB Optane NVMe drives and a pair of PCIe riser cards to install them. Gonna give those a whirl in the next day or three and see if my SLOG performance changes. As a reference point, I have the exact same pair of Optane 32GB drives in the same pair of PCIe risers at home in my "main" ZFS rig. They act as the mirrored SLOG fronting my eight EXOS 10TB drives (four mirrored vdevs), augmented by a pair of OCZ Agility 3 (yeah, remember those? lol!) L2ARC drives. Those Optanes may not have big NVMe-class bandwidth, but they completely chew straight through a pair of LAG'd 1GbE interfaces with sync=always forced.

If the Optanes don't perform to my expectation on the PoC rig, I'll know something else is up.
 

Patriot

Moderator
Apr 18, 2011
1,450
789
113
The stripe size of the RAID0 volume underneath ZFS may be causing misaligned writes, and the stripe size may be different in the mirror. Different block sizes could potentially affect performance drastically in a block driver that was abandoned before SSDs.
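One quick way to see what the RAID0 volume is actually presenting to the OS, assuming diskinfo plays nicely with the ciss-backed volume (placeholder device name; I don't know which daX the SSDs landed on):

diskinfo -v /dev/da48    # reports sectorsize, stripesize and stripeoffset for the volume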

I do not know of hpssacli existing for FreeBSD; it is normally found as an RPM package for RHEL and SLES.
I have also been looking for an HPSA port that might give better, more consistent performance in general.

Newer firmware with Smart Path enabled might also help considerably. At release I think the card was only capable of 200k IOPS; by EOL it was hitting 450k IOPS.

An offline tool could be used for changing modes if you wanted to try the fake HBA mode rather than RAID0: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_b8cacaa71541492d8ee070fe99
This would bypass the RAID stack of the controller and should remove some bottlenecks.

My disdain for dd is based on years of performance engineering experience and the massive inconsistency of dd performance.
It's not a personal slight against you to warn you away from inferior methods of measurement. It is not an opinion; it is frustration based on extensive experience hunting down performance anomalies. There is an off chance that dd behaves better on FreeBSD than Linux, but in my testing it has produced results that are highly inconsistent with other synthetic benchmarks and deviate from real-world performance.

And I understand the importance of Q1 performance, but limiting yourself to it is going to be quite painful for overall performance with that many drives. Hopefully you are only using sync=always for isolation purposes.

Testing fio at a queue depth of 128 with sync set is still going to be Q1 performance. The only reason it was faster was block alignment and perhaps the underlying stripe size. You have a lot of extra variables in here that could be causing issues.

Since this in no way matches your end config... if you have a spare h220... I would drop that in to hang the SSDs off of.
P420 is great in linux... but well, unsupported in freebsd with unknown performance caveats, some of which you may be encountering.
 

RedHeadedStepChild

New Member
Jan 22, 2019
5
0
1
Thank you, Patriot, for the incredibly helpful reply. I now have a much better understanding of the concerns surrounding the P420 controller. I certainly hadn't considered the effect of ashift on SLOG devices. Is it even possible to manage ashift on individual vdevs inside a zpool? I've never even tried...
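(From a quick look, it seems the lever on FreeBSD is the vfs.zfs.min_auto_ashift sysctl, which applies at the moment a vdev is added -- something like the following, though I haven't actually tried it on this box yet, and the labels are placeholders:)

sysctl vfs.zfs.min_auto_ashift=12              # force at least 4K alignment for vdevs added after this
zpool add tank log mirror gpt/slog0 gpt/slog1
zdb -C tank | grep ashift                      # check what each vdev actually ended up with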

Regardless, this is good news as it strongly suggests my cheapo Optane drive SLOG test should "solve" the performance challenge for all the obvious reasons. If they do, then I'm perfectly happy with life.

And yeah, Qd=1 will SUUUUUUUCK on this array with a fat stack of drives, but it's something I need to be able to solve in a reasonable way for a specific subset of use cases. Hence my interest in a well-performing SLOG, because I know I can't avoid sync writes on this box. I'm not promising those users full saturation of my LAG'd pair of prod-facing 10GbE's in the new rig, but I'd like to provide them something better than gigabit speed.

I'll update the thread when the Optanes roll in.
 

zxv

The more I C, the less I see.
Sep 10, 2017
156
57
28
P420 is great in linux... but well, unsupported in freebsd with unknown performance caveats, some of which you may be encountering.
I find this really interesting because, though it's perhaps just an anecdote, I've used HP P4xx and P8xx controllers for years with ZFS on FreeBSD, Solarish, and more recently Linux, and have not seen these performance issues, though I'm sure I never looked at the kinds of loads you are talking about.

Regardless, I agree that the HBA controllers are the path of least resistance.

I'm about to do a proof of concept myself, with a DL380g8 with P822 and H241 controllers.
I'd prefer to run OmniOS for ZFS, and I'm certain the controllers work well on it, but I had trouble getting good network performance at 40GbE with Mellanox ConnectX-3 cards on Solarish the last time I looked at it.
My focus is not on the SLOG yet, so I'm not sure I can contribute to this specific thread, but I could look at differences between controllers.

For benchmarking I'm wondering whether to just use the Phoronix Test Suite, or something else, as a reference scenario for filesystem performance.
 

Patriot

Moderator
Apr 18, 2011
1,450
789
113
I find this really interesting because, though it's perhaps just an anecdote, I've used HP P4xx and P8xx controllers for years with ZFS on FreeBSD, Solarish, and more recently Linux, and have not seen these performance issues, though I'm sure I never looked at the kinds of loads you are talking about.

Regardless, I agree that the HBA controllers are the path of least resistance.

I'm about to do a proof of concept myself, with a DL380g8 with P822 and H241 controllers.
I'd prefer to run OmniOS for ZFS, and I'm certain the controllers work well on it, but I had trouble getting good network performance at 40GbE with Mellanox ConnectX-3 cards on Solarish the last time I looked at it.
My focus is not on the SLOG yet, so I'm not sure I can contribute to this specific thread, but I could look at differences between controllers.

For benchmarking I'm wondering whether to just use the Phoronix Test Suite, or something else, as a reference scenario for filesystem performance.
I know the guy who most likely was responsible for dinking around with the FreeBSD driver for the SA controllers... but it was never officially supported and the cciss branch is long EOL'd.

It's not that it can't work; it's that it has no active support team or development.
H241 is an HBA with full SMART passthrough... so that one has a fair chance of working... if you can convince the cciss driver to accept it.
And the P822 is still listed under the cciss driver.
The RAID stack is not being used, so performance shouldn't be a concern on the H241. It's unclear whether the P822 has a bypass mode or not... I know the P420 supports the pseudo-HBA mode but I don't know if the 822 does.
 

zxv

The more I C, the less I see.
Sep 10, 2017
156
57
28
Yes, the P822s can indeed be configured for 'hbamode'.