Struggling reads on a mirrored setup...

levifig

Member
Nov 27, 2017
50
13
8
levifig.com
Here's the overview:

SETUP

Proxmox 5.4-5, ZoL 0.7.13-pve1~bpo1
4x 2-way mirrored vdevs
atime=off,xattr=sa,sync=disabled,compression=off,recordsize=1M

WRITE
# dd if=/dev/zero of=bench.bin bs=1M count=120000
120000+0 records in
120000+0 records out
125829120000 bytes (126 GB, 117 GiB) copied, 305.644 s, 412 MB/s
READ
# dd if=bench.bin of=/dev/null bs=1M count=120000
120000+0 records in
120000+0 records out
125829120000 bytes (126 GB, 117 GiB) copied, 307.467 s, 409 MB/s
POOL LAYOUT
Code:
  pool: storage
 state: ONLINE
  scan: none requested
config:
        NAME                                   STATE     READ WRITE CKSUM
        storage                                ONLINE       0     0     0
          mirror-0                             ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_7SGJ7E3C  ONLINE       0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_7JKZ5TSC  ONLINE       0     0     0
          mirror-1                             ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_7SGBY56C  ONLINE       0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_7JKWPLYC  ONLINE       0     0     0
          mirror-2                             ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_7SGKYW3C  ONLINE       0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_7SH0YT6G  ONLINE       0     0     0
          mirror-3                             ONLINE       0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_7SH0W5HG  ONLINE       0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_7SH0WKNG  ONLINE       0     0     0
I was expecting to saturate my 10G network on sequential reads... No idea what's going on! During the tests, reads and writes are evenly distributed across all disks. zpool iostat shows ~850MB/s total bandwidth during writes across all 8 drives (which makes sense: about 100-120MB/s per drive), but reads total only 400-450MB/s, at about 50-60MB/s per drive...

I'm incredibly puzzled! :X
Any help and ideas would be greatly appreciated!

Thank you in advance!

--LF

PS: I ran bonnie++ tests as well, with very similar results.
 

MiniKnight

Well-Known Member
Mar 30, 2012
3,018
925
113
NYC
What about if you run two tests at once? I know this one sounds odd, but I need to make sure it's not something upstream that's serialized.

What happens if you turn on compression?

This looks like a tuning issue for sure but it's late here and I'm going to give you the wrong advice when I'm this tired.
 

levifig

Member
Nov 27, 2017
50
13
8
levifig.com
First of all, thank you for taking a stab at it.

I just tried running 3 dd read tests at once. Here is the output of iostat:

Code:
Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              53.00        33.39         0.00        166          0
sdb              62.20        48.69         0.00        243          0
sdd              88.80        76.45         0.00        382          0
sdc              56.00        37.62         0.00        188          0
sde              61.00        46.90         0.00        234          0
sdg              68.00        49.53         0.00        247          0
sdf              48.80        33.41         0.00        167          0
sdh              65.40        46.04         0.00        230          0
That MB/s read speed PER drive is WAAAAYYYY too slow! :(

Having said that, all 3 tests ended with read speeds around 380-385MB/s... which would mean the pool was actually reading at 3x that (since I was running 3 tests at once). Would the fact that all three were reading the same file, and that it was written from /dev/zero, affect the result? If so, the speed should've been WAY higher in all 3 anyway... *sigh*
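
One way to take the ARC and the Linux page cache out of the equation between runs (a sketch — `storage` is the pool/dataset name from the layout above, and both commands need root):

```shell
# Limit ARC caching to metadata only for the dataset under test,
# so repeated reads actually hit the disks (remember to revert!)
zfs set primarycache=metadata storage

# Drop the Linux page cache between runs
sync; echo 3 > /proc/sys/vm/drop_caches

# Revert when done benchmarking
zfs set primarycache=all storage
```

With `primarycache=metadata` set, re-reading the same file from multiple dd processes should show real per-drive load instead of cache hits.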

Well, then I tried with compression turned on and...

WRITE
# dd if=/dev/zero of=bench.bin bs=1M count=120000
120000+0 records in
120000+0 records out
125829120000 bytes (126 GB, 117 GiB) copied, 44.8684 s, 2.8 GB/s
READ
# dd if=bench.bin of=/dev/null bs=1M count=120000
120000+0 records in
120000+0 records out
125829120000 bytes (126 GB, 117 GiB) copied, 25.0219 s, 5.0 GB/s
Again, this is with compression turned on, so this dd bench is basically useless.
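
For what it's worth, dd with incompressible data sidesteps this: zeroes compress to almost nothing, so with compression=on you're mostly benchmarking the CPU. A sketch (file name and sizes are just placeholders — scale the file well past RAM for a real test, and note /dev/urandom itself can bottleneck the write, so only the read-back number means much):

```shell
# Generate incompressible test data so compression can't skew the numbers
dd if=/dev/urandom of=bench-rand.bin bs=1M count=64 conv=fsync

# Read it back (drop caches first, per above, to avoid ARC/page-cache hits)
dd if=bench-rand.bin of=/dev/null bs=1M
```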

On to bonnie++ with compression on:


I'm actually blown away with that CPU usage and latency! But that's weird, because looking at my Proxmox host, CPU usage never got that high (this is a dual 10-core/20-thread (40-thread total) server):


I'm beyond confused, honestly! Also, btw, writing to the network drives via SMB has been BLAZING fast, basically maxing out my 10Gb network! I see write speeds > 900MB/s for files > 5-6GB.

Here are some more results while copying a ~17GB file from/to the NAS (compression=on):

READ
Code:
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
storage     16.6T  12.4T    466      0   352M      0
storage     16.6T  12.4T    463      0   350M      0
storage     16.6T  12.4T    437      0   327M      0
storage     16.6T  12.4T    381      0   268M      0
storage     16.6T  12.4T    407      0   295M      0
storage     16.6T  12.4T    440      0   332M      0
storage     16.6T  12.4T    450      0   334M      0
storage     16.6T  12.4T    445      0   330M      0
storage     16.6T  12.4T    441      0   330M      0
storage     16.6T  12.4T    445      0   328M      0
WRITE
Code:
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
storage     16.6T  12.4T     14   1014  1.85M   886M
storage     16.6T  12.4T      8  1.13K  1.12M  1.07G
storage     16.6T  12.4T      3  1.16K   486K  1.09G
storage     16.6T  12.4T     80    613  10.0M   483M
storage     16.6T  12.4T     15    987  1.90M   797M
storage     16.6T  12.4T     10  1.13K  1.32M  1.07G
storage     16.6T  12.4T     11  1.08K  1.50M  1.02G
storage     16.6T  12.4T      9  1.10K  1.20M  1.04G
storage     16.6T  12.4T     11  1.04K  1.43M  1003M
storage     16.6T  12.4T      5  1.10K   742K  1.04G
READ


WRITE


(interesting dips... but overall great performance)

Thanks again, in advance, for the help!

--LF
 

levifig

Member
Nov 27, 2017
50
13
8
levifig.com
Here's a benchmark from a Windows 10 machine, via the 10Gbps network, to an iSCSI vol* on that pool:

* compression=on,sync=disabled,volblocksize=64K




This makes sense! That's the expected performance! But I don't get it AT ALL when reading a file from that same iSCSI volume from that same machine! Here's an example:

COPY


RE-COPY
(using cache, I'm guessing)

(max'ed out at around 600MB/s for a split second, initially stable at around 450MB/s)

I could be getting something REALLY wrong about how all of this is supposed to be working... :p

Thank you for your time and help.

--LF
 

CaptainPoundSand

Active Member
Mar 16, 2019
134
92
28
Ashburn, VA
I can say you are not alone - I've just been going through the same thing, but with 24x SSD and 12x 4TB Ironwolf. I've settled on a Supermicro with a BPN-SAS3-216A-N4, 3x 9300-8i, and an Optane SLOG for NFS to get me that extra 100/200 for writes. For 10Gb, I didn't see much change from 12x mirror vdevs to a single raidz3 vdev, or anything in between.
 

azev

Active Member
Jan 18, 2013
757
226
43
My experience with FreeNAS/ZFS is that it's almost impossible to get each individual drive to run at its maximum speed. I have probably the same setup as CaptainPoundSand: 24x 800GB HGST 12Gb SAS SSDs (set up as 12x mirror vdevs), and I've rarely seen each drive run at more than 200MB/s read or write, even during benchmarks. I feel that ZFS performance doesn't really scale with the number of drives you have. I've tried following many different custom tuning setups I found online, but none of them improve performance by much at all.
If anyone has the magic button to squeeze more performance out of my setup, I'm all ears :)
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,259
1,711
113
CA
My experience with FreeNAS/ZFS is that it's almost impossible to get each individual drive to run at its maximum speed. I have probably the same setup as CaptainPoundSand: 24x 800GB HGST 12Gb SAS SSDs (set up as 12x mirror vdevs), and I've rarely seen each drive run at more than 200MB/s read or write, even during benchmarks. I feel that ZFS performance doesn't really scale with the number of drives you have. I've tried following many different custom tuning setups I found online, but none of them improve performance by much at all.
If anyone has the magic button to squeeze more performance out of my setup, I'm all ears :)
As always stated in these threads ZFS is not about the best performance... but I may have some ideas :D

Steady state is never "maximum performance"... you're setting yourself up for failure if you expect to get anywhere near max SSD #s, in addition to this you have overhead in the hardware and software to account for, networking and any limitations your software/protocol/firmware etc may have as well.

What's CPU/RAM usage on the file server during the benchmark?
What about the other system's CPU/RAM as well, if it's over the network?

Older article: Using file copy to measure storage performance – Why it’s not a good idea and what you should do instead
Still not the best way to test I'm sure :)
 

levifig

Member
Nov 27, 2017
50
13
8
levifig.com
I can say you are not alone - I've just been going through the same thing, but with 24x SSD and 12x 4TB Ironwolf. I've settled on a Supermicro with a BPN-SAS3-216A-N4, 3x 9300-8i, and an Optane SLOG for NFS to get me that extra 100/200 for writes. For 10Gb, I didn't see much change from 12x mirror vdevs to a single raidz3 vdev, or anything in between.
My experience with FreeNAS/ZFS is that it's almost impossible to get each individual drive to run at its maximum speed. I have probably the same setup as CaptainPoundSand: 24x 800GB HGST 12Gb SAS SSDs (set up as 12x mirror vdevs), and I've rarely seen each drive run at more than 200MB/s read or write, even during benchmarks. I feel that ZFS performance doesn't really scale with the number of drives you have. I've tried following many different custom tuning setups I found online, but none of them improve performance by much at all.
If anyone has the magic button to squeeze more performance out of my setup, I'm all ears :)
Interesting to hear... With my setup, I feel like I should be closer to 800MB/s+ reads from the platters (i.e. not cache)... Theoretically, sequential reads should approach the combined speed of all 8 drives. These drives do around 200MB/s sequential reads, so even 8x100MB/s (accounting for overhead, etc.) seems like a reasonable expectation. I literally switched from RAIDZ2 to this layout to improve performance. It did improve write speeds BY A LOT (especially reducing I/O delay during random/small-file writes, for sure), but I'm still struggling with reads... :\

As always stated in these threads ZFS is not about the best performance... but I may have some ideas :D

Steady state is never "maximum performance"... you're setting yourself up for failure if you expect to get anywhere near max SSD #s, in addition to this you have overhead in the hardware and software to account for, networking and any limitations your software/protocol/firmware etc may have as well.

What's CPU/RAM usage on the file server during the benchmark?
What about the other system's CPU/RAM as well, if it's over the network?
In my post above I mentioned some of that data. I've run benchmarks locally and over the network, and the numbers match up, as per all the screenshots above... :)

As I've said, a pool of 4x 2-way mirror vdevs should theoretically read sequentially at about 8x the speed of a single drive. I know that's all theoretical, but I still expected that reading a single LARGE file would get me more than 400MB/s.
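
Spelled out, the back-of-envelope math (using the rough per-drive figure above):

```shell
# 4 mirror vdevs = 8 drives; ZFS can read from both sides of a mirror,
# so derate each ~200MB/s drive to a conservative ~100MB/s
drives=8
per_drive_mb=100
echo "expected: $(( drives * per_drive_mb )) MB/s sequential read"
# observed: ~400-450MB/s, i.e. roughly half of even the derated figure
```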

Now, just for the sake of information, the "client" system, the one on the network, is a Proxmox system, using a Windows 10 VM (which iperf3 benches at 9.4Gbps to the NAS in question), running on an i9-7900X system, 64GB RAM (24GB to the VM), on NVMe drives (that bench at > 1800MB/s writes, 2800MB/s reads inside the VM).

As shown in my last post before this one, the Windows benchmarks show exactly the kind of performance I expected. But when I copy a file, the performance is pretty different...

Thank you all for your willingness to help!

--LF
 

levifig

Member
Nov 27, 2017
50
13
8
levifig.com
More interestingness... Just pulled up zpool status and a scrub was underway. Here are the 2 spinner pools I have:

2-disk stripe
Code:
 pool: temp
 state: ONLINE
  scan: scrub in progress since Sun May 12 00:24:04 2019
        3.64T scanned out of 16.3T at 251M/s, 14h41m to go
        0B repaired, 22.31% done
config:

        NAME                                  STATE     READ WRITE CKSUM
        temp                                  ONLINE       0     0     0
          ata-WDC_WD100EMAZ-00WJTA0_2YH6137D  ONLINE       0     0     0
          ata-WDC_WD100EMAZ-00WJTA0_2YHDBL9D  ONLINE       0     0     0

errors: No known data errors
8-disk 2-way mirror stripe
Code:
 pool: storage
 state: ONLINE
  scan: scrub in progress since Sun May 12 00:24:03 2019
        3.43T scanned out of 17.0T at 236M/s, 16h40m to go
        0B repaired, 20.20% done
config:

        NAME                                   STATE     READ WRITE CKSUM
        storage                                ONLINE       0     0     0
          mirror-0                             ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_7SGJ7E3C  ONLINE       0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_7JKZ5TSC  ONLINE       0     0     0
          mirror-1                             ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_7SGBY56C  ONLINE       0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_7JKWPLYC  ONLINE       0     0     0
          mirror-2                             ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_7SGKYW3C  ONLINE       0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_7SH0YT6G  ONLINE       0     0     0
          mirror-3                             ONLINE       0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_7SH0W5HG  ONLINE       0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_7SH0WKNG  ONLINE       0     0     0

errors: No known data errors
Super interesting that the 2-disk stripe is scrubbing (slightly) faster than the 8-disk pool of mirrored vdevs (effectively a 4-way stripe).

Just thought I'd add that...

PS: Found this thread in the FreeBSD forums that seems to be about exactly the same issue... ¯\_(ツ)_/¯

--LF