abysmal zfs raidz2 throughput


aero

Active Member
Apr 27, 2016
Edit: the title was wrong, I meant ZFS, not NFS. The I/O is local, tested with dd and iozone.



This is my first go with ZFS (usually an mdraid + LVM kind of guy), and I'm running into some performance issues.

I've configured a pool consisting of 1 vdev of 8 disks (4 TB SAS, 7.2K RPM) in raidz2.

I'm expecting sequential read speeds of roughly 200 MB/s × 8 ≈ 1.6 GB/s, and write speeds of 200 MB/s × 6 data disks ≈ 1.2 GB/s.

However, in reality I'm getting reads of ~550 MB/s and writes of ~220 MB/s.
I disabled sync writes and set the recordsize to 1M, which had only a slightly positive effect.
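For reference, the tuning described above as concrete commands; a minimal sketch, assuming a pool named tank (the pool name is an assumption, not from the post):

```shell
# Assumed pool name "tank"; substitute your own pool/dataset.
zfs set sync=disabled tank    # drop sync-write semantics (unsafe for apps that rely on fsync)
zfs set recordsize=1M tank    # large records favor sequential throughput
zfs get sync,recordsize tank  # verify both settings took effect
```

Note that recordsize only applies to newly written files, so test files need to be re-created after changing it.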

Any suggestions?
 

BLinux

cat lover server enthusiast
Jul 7, 2016
artofserver.com
yeah.. that's pretty bad. I've used older 3TB HUS723030ALS640 HDDs in an 8x3TB RAIDZ2 and got about 1 GB/s reads and 600-700 MB/s writes.

Can you share more about the hardware setup? OS? Version of ZFS? Also, what was your iozone testing command?
 

gea

Well-Known Member
Dec 31, 2010
DE
If you compare ZFS behaviour with older raid/filesystem stacks, you must consider the following.

On an older raid/filesystem:
- any data corruption in the chain OS - driver - controller - cable - backplane - disk (and back) cannot be detected. ZFS adds checksums to data and metadata to detect such problems and repair them from redundancy on access. This results in more data being read/written and a higher CPU load. That is the price of security.

- a crash during a write can result in a corrupted raid and/or a corrupted filesystem, because a raid updates its disks sequentially and every data update requires at least a data write plus a metadata update. This is called the "write hole" phenomenon. ZFS adds Copy on Write to guarantee atomic/transactional writes: data+metadata updates and raid stripes are either written completely to all disks or discarded. The price for this is more fragmentation, especially if the pool is not empty.

- on a ZFS raid, data is spread over the whole pool. This means that even on a sequential workload like dd, effective performance is limited more by IOPS than by the pure sequential performance of a disk. Since a single Z2 vdev has the IOPS of a single disk (every disk must be positioned for every I/O), the effective raw IOPS of the pool is between 50 and 100.

ZFS adds superior RAM-based read and write caches to compensate for this. With a lot of RAM it is even possible to overcompensate: nearly all reads can be served from RAM, and the write cache transforms many small, slow random writes into one large sequential write. But this depends on RAM.

So in effect:
- low read and write values in ZFS are mostly due to pools with low IOPS, like yours, and/or too little RAM.
- as an overall rule of thumb, if the pure sequential speed of a disk is around 200 MB/s, the value you can expect with ZFS (when the workload is not served purely from RAM) is closer to 150 MB/s or less per disk.

As your 8-disk Z2 pool has 6 data disks, you should only really start to worry when your read/write values fall below, say, 500 MB/s. In that case you should first increase RAM and then check for other problems such as a weak disk, cable, or backplane, or try another HBA.

- How much RAM do you have?
- Is your pool empty?
- Which controller, and in which mode (e.g. AHCI vs IDE, or HBA mode)?
 

aero

Active Member
Apr 27, 2016
iozone -a -s 16G -y 64k
Ubuntu 16.04 with kernel 4.4.0-34
ZFS version 0.6.5.6-0ubuntu21

The disks are in a Dell MD1000 connected to an LSI 9200-8e.
Server is a dual E5-2670 with 128GB RAM, and not heavily utilized.
 

BLinux

cat lover server enthusiast
Jul 7, 2016
artofserver.com
aero said:
iozone -a -s 16G -y 64k
Ubuntu 16.04 with kernel 4.4.0-34
ZFS version 0.6.5.6-0ubuntu21

The disks are in a Dell MD1000 connected to an LSI 9200-8e.
Server is a dual E5-2670 with 128GB RAM, and not heavily utilized.
Hmmmm... you are using auto mode with iozone. When the record sizes are small and you have a 1M recordsize in ZFS, you are going to see performance variations across the range auto mode covers. You may be better off sticking to tests 0, 1, and 2 with a fixed record size that matches your ZFS recordsize. That would at least give a fixed starting point for the conversation. Otherwise, I can't tell which part of the auto-mode run produced the numbers you stated above.
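A pinned invocation along those lines might look like the sketch below (the exact flag combination is one possible choice, not from the thread):

```shell
# Tests 0 (write/rewrite), 1 (read/reread), 2 (random read/write)
# with a fixed 1M record size to match the ZFS recordsize.
# -e includes flush (fsync) in the timing; -+u reports CPU utilization.
iozone -i 0 -i 1 -i 2 -s 16G -r 1M -e -+u
```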

The MD1000 uses a SAS-1 expander. How are you hooked up? A single 4-lane cable? Dual cables in split mode? What speeds are the HDDs linking at? It's also possible you are bottlenecked at the expander chip; try decreasing the IOPS demand and see if your throughput goes up.

Run iozone with the -+u option to include some info about CPU utilization.

You could also try a newer version of ZoL, as 0.6.x is pretty old.
 

aero

Active Member
Apr 27, 2016
gea and BLinux, thanks for your input. I've got a few things to dig deeper on.

Gea, I get what you're saying, but even with more realistic performance expectations, my setup still falls far short of them. I think I have either a hardware or a configuration problem of some sort. For my use case I'm primarily concerned with sequential throughput, so I might switch to mirrored vdevs as a compromise; before I do that, though, I want to make sure nothing is wrong.

When I watch iostat, the throughput of each disk rarely exceeds 25 MB/s during write tests, with no read activity at that time.

I've actually limited the ARC to only 8GB of RAM because my sequential workloads won't benefit much from it, and I don't want ZFS eating all the memory on the box.

I'm using the MD1000 in split mode with a single 4-lane cable, all 8 drives in slots 8-15 off one controller. 4 lanes of SAS-1 should be 4 × 375 MB/s = 1500 MB/s max theoretical, so that shouldn't be a bottleneck for 8 spinners that aren't even in raid0. The LSI card is PCIe 2.0 x8, which at 500 MB/s per lane also shouldn't be a factor.
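One nit on that budget: SAS-1 uses 8b/10b encoding, so the usable payload rate is closer to 300 MB/s per lane than the 375 MB/s raw line rate. A quick back-of-envelope check (the 300/375/500 MB/s per-lane figures are the standard spec numbers, not measurements from this system):

```shell
# Link budgets in MB/s
sas1_raw=$(( 4 * 375 ))     # 4 SAS-1 lanes, raw line rate
sas1_usable=$(( 4 * 300 ))  # after 8b/10b encoding overhead
pcie2_x8=$(( 8 * 500 ))     # PCIe 2.0 x8 at 500 MB/s per lane
echo "raw=${sas1_raw} usable=${sas1_usable} pcie=${pcie2_x8}"
```

Either way, both SAS figures sit well above the throughput actually measured, so the 4-lane link itself still shouldn't be the limit.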

The numbers I provided were from tests 0 and 1, at whichever record size gave the highest value (I forget which it was, and foolishly didn't save the output).

My next steps before reporting back will be...
- check 9200-8e firmware and upgrade if necessary
- verify negotiated disk link speed
- upgrade zfs version
- re-run tests
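For the firmware and link-speed checks, a couple of commands worth trying (sas2flash is LSI/Broadcom's flash utility and may need to be installed separately; the expected link rate is an assumption based on the SAS-1 expander):

```shell
# HBA firmware/BIOS versions for the 9200-8e
sas2flash -listall

# Negotiated link rate of each SAS phy; expect "3.0 Gbit" behind a SAS-1 expander
grep . /sys/class/sas_phy/*/negotiated_linkrate
```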
 

aero

Active Member
Apr 27, 2016
Thanks for pointing me to that doc... good to see the various benchmarks.
 

BLinux

cat lover server enthusiast
Jul 7, 2016
artofserver.com
aero said:
my next steps before reporting back will be...
- check 9200-8e firmware and upgrade if necessary
- verify negotiated disk link speed
- upgrade zfs version
- re-run tests
Here are a couple more things I would consider doing:

1) Get a baseline of what your benchmark methodology can handle. Since you're not running this with multiple threads (which I might suggest at some point, but at the level of performance you're dealing with it's probably not a factor right now), it is good to know how fast the benchmark can go on a single thread. I do this by mounting /test as tmpfs and running the benchmark in there.

2) Try to find out whether this bottleneck is hardware or software (ZFS). For example, try Linux mdraid RAID0 across those 8 drives (or maybe 6, for a better comparison against the raidz2). mdraid is simple: no checksums, no parity calculation, nothing special. Just create an 8xHDD RAID0 without initialization and run your benchmark there. If you see a bottleneck there as well, then the issue is probably somewhere in the hardware, not in ZFS.

3) To aid #2, you can also cable the drives directly to the HBA (or another similar HBA). It doesn't have to be fancy; the drives could all just sit on a workbench for testing purposes, as long as you have cables.
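A sketch of steps 1 and 2, assuming the eight disks show up as /dev/sdt through /dev/sdaa (the device names are assumptions; double-check with lsblk first, since mdadm --create destroys data on those disks):

```shell
# 1) tmpfs baseline: how fast can the benchmark itself go on this box?
mkdir -p /test
mount -t tmpfs -o size=20G tmpfs /test
cd /test && iozone -i 0 -i 1 -s 16G -r 1M

# 2) plain RAID0 across the same disks: no parity, no checksums
mdadm --create /dev/md0 --level=0 --raid-devices=8 \
    /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy /dev/sdz /dev/sdaa
mkfs.xfs /dev/md0 && mount /dev/md0 /mnt
cd /mnt && iozone -i 0 -i 1 -s 16G -r 1M
```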

Just some more thoughts...

While running the benchmark, I would also open up 'top' or 'htop' and see what's hitting the CPU.
 

aero

Active Member
Apr 27, 2016
- The low performance doesn't appear to be ZFS-specific. For instance, I put all 8 drives into an md raid0, but it can only push a max of 520 MB/s read and 520 MB/s write. As a comparison, on a separate 4-disk md raid10 (in a separate DAS, no expander) I'm seeing 600 MB/s reads from just 4 disks, and 270 MB/s writes (2 effective disks).

- It's not specific to the 9200-8e card. I updated to the latest BIOS and firmware, which had no effect. I then moved the MD1000 over to another card, a 9211-16e; no effect.

- Oh, and the link to the MD1000 is negotiated at full SAS-1 speed, 3 Gbit/s × 4 channels.

Next steps...

- I wish I had another cable to attach the second MD1000 controller and split the disks between them to see if that helps... will order one.
- Try to free up slots in my other DAS to see how the disks perform without the MD1000.
 

aero

Active Member
Apr 27, 2016
I had a better idea: I created a raid0 with just 5 of the 8 drives, and the per-disk throughput shot up from 65 MB/s to 105 MB/s, so I hit the same apparent ~525 MB/s ceiling. From various internet sources this looks to be a common shortcoming of the MD1000 (and also the MD3000).
I suspect max performance, ~1 GB/s, will be achievable using both MD1000 controllers in split mode. I'll test this tomorrow when I get the appropriate cable.
 

aero

Active Member
Apr 27, 2016
My suspicions were proven correct: for some reason a single MD1000 I/O module limits throughput, and splitting the disks between both I/O modules doubles performance. I was able to get 1.1 GB/s for both sequential reads and writes (md raid0).

Now I'm back to testing ZFS and still banging my head against a wall. The 8 disks are once again in a raidz2:
- reads are fantastic: 1.0 GB/s
- writes are super bad: 153 MB/s

Here's a look at iostat during a write. What's with the %util at ~100%?

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util

sdt 0.00 0.00 0.00 108.00 0.00 17.84 338.33 0.99 9.15 0.00 9.15 9.19 99.20
sdu 0.00 0.00 0.00 109.00 0.00 18.18 341.50 0.99 9.10 0.00 9.10 9.06 98.80
sdv 0.00 0.00 0.00 109.00 0.00 18.18 341.50 1.00 9.21 0.00 9.21 9.17 100.00
sdw 0.00 0.00 0.00 108.00 0.00 18.01 341.49 0.99 9.15 0.00 9.15 9.19 99.20
sdy 0.00 0.00 0.00 109.00 0.00 18.18 341.50 1.00 9.14 0.00 9.14 9.14 99.60
sdz 0.00 0.00 0.00 109.00 0.00 18.18 341.50 0.99 9.10 0.00 9.10 9.10 99.20
sdaa 0.00 0.00 0.00 108.00 0.00 18.01 341.50 1.00 9.22 0.00 9.22 9.30 100.40
sdx 0.00 0.00 0.00 109.00 0.00 18.18 341.50 0.99 9.03 0.00 9.03 9.10 99.20
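Summing the wMB/s column of that snapshot gives the aggregate raw disk bandwidth, parity writes included. It's a single sampling interval, so only a rough cross-check, but it lines up with the low benchmark figure:

```shell
# Add up field 7 (wMB/s) across the eight member disks from the snapshot above.
total=$(awk '{ t += $7 } END { printf "%.2f", t }' <<'EOF'
sdt 0.00 0.00 0.00 108.00 0.00 17.84 338.33 0.99 9.15 0.00 9.15 9.19 99.20
sdu 0.00 0.00 0.00 109.00 0.00 18.18 341.50 0.99 9.10 0.00 9.10 9.06 98.80
sdv 0.00 0.00 0.00 109.00 0.00 18.18 341.50 1.00 9.21 0.00 9.21 9.17 100.00
sdw 0.00 0.00 0.00 108.00 0.00 18.01 341.49 0.99 9.15 0.00 9.15 9.19 99.20
sdy 0.00 0.00 0.00 109.00 0.00 18.18 341.50 1.00 9.14 0.00 9.14 9.14 99.60
sdz 0.00 0.00 0.00 109.00 0.00 18.18 341.50 0.99 9.10 0.00 9.10 9.10 99.20
sdaa 0.00 0.00 0.00 108.00 0.00 18.01 341.50 1.00 9.22 0.00 9.22 9.30 100.40
sdx 0.00 0.00 0.00 109.00 0.00 18.18 341.50 0.99 9.03 0.00 9.03 9.10 99.20
EOF
)
echo "aggregate raw write bandwidth: ${total} MB/s"  # prints 144.76
```

That is roughly 145 MB/s of raw writes across all 8 disks, with each disk near 100% utilized at a queue depth of ~1.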
 

aero

Active Member
Apr 27, 2016
Found it... I noticed that avgqu-sz was stuck at 1, so I tweaked these two parameters:

zfs_vdev_async_write_max_active from 10 to 20
zfs_vdev_async_write_min_active from 1 to 20
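On ZFS on Linux these are module parameters; a sketch of applying them at runtime (paths as on a ZoL install, run as root):

```shell
# Raise the per-vdev async write queue depth (runtime change, not persistent)
echo 20 > /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
echo 20 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active

# To persist across reboots, set them as module options in /etc/modprobe.d/zfs.conf:
#   options zfs zfs_vdev_async_write_min_active=20 zfs_vdev_async_write_max_active=20
```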

check out these sweet numbers...

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdt 0.00 0.00 0.00 685.00 0.00 113.89 340.51 20.04 29.38 0.00 29.38 1.47 100.40
sdu 0.00 0.00 0.00 732.00 0.00 121.56 340.11 20.05 27.64 0.00 27.64 1.37 100.40
sdv 0.00 0.00 0.00 737.00 0.00 122.39 340.11 20.05 27.58 0.00 27.58 1.36 100.40
sdw 0.00 0.00 0.00 707.00 0.00 117.23 339.57 20.05 28.46 0.00 28.46 1.42 100.40
sdy 0.00 0.00 0.00 638.00 0.00 106.05 340.43 20.12 32.23 0.00 32.23 1.58 100.80
sdz 0.00 0.00 0.00 690.00 0.00 114.56 340.02 20.14 29.06 0.00 29.06 1.46 100.80
sdaa 0.00 0.00 0.00 738.00 0.00 122.89 341.04 20.12 27.37 0.00 27.37 1.37 100.80
sdx 0.00 0.00 0.00 702.00 0.00 116.39 339.56 20.14 28.25 0.00 28.25 1.44 100.80


Up around 700 MB/s on sequential writes.