Disappointing ZFS read performance on 2 x 6 RaidZ2 and quest for bottleneck(s)

Wariole

New Member
Mar 2, 2020
Hi!
After spending a lot of time reading valuable sources such as STH, I spent the last 3 months slowly incubating my new file server. I am unfortunately quite disappointed by the performance of the resulting system.
I apologize for the long post, but I will try to give all relevant info that experts around might need to help me tweak my setup.

Server configuration
  • OS: OpenMediaVault 5.3.5-1 with Proxmox Kernel (Debian GNU/Linux, with Linux 5.3.18-2-pve), booting from a USB 3.1 stick (Samsung MUF-128AB), using the OMV flashmemory plugin.
  • Motherboard: ASRockRack X470D4U2-2T (BIOS 3.30, BMC FW 1.70) custom fitted into a Rackable S3012 chassis
  • CPU: AMD Ryzen 7 3700X
  • RAM: 2 x 32GB ECC Samsung M391A4G43MB1-CTD DDR4-2666@3200MT/s
  • HBA: LSI 9201-16i using 3 SAS connections to the 3 backplanes of the S3012. The HBA is mounted on PCIE slot #6 of the MB (PCIE 3.0 x 16) with an old PCIE 2.0 riser and flashed to IT mode with FW 20.00.07.00.
  • ZFS pool: 12 x 3TB HDD (WD Red) in 2 vdevs of 6x RaidZ2 (ZFS version 0.8.3-pve1)
  • NIC: 10GbE embedded Intel X550 (server) -> Mikrotik CR305-1G-4S+IN / S+RJ10 SFP+ modules -> Sonnet Solo10G thunderbolt 3 (to iMac or PC, using Cat6a S/FTP cabling)
  • Power Supply: Corsair SF750, 750 Watt, 80Plus Platinum

Use case and expectations

Overkill home office server, able to fully take advantage of the 10GbE LAN (+ 1Gb Internet)
  • Documents & Media file server: over SMB for Mac / Windows clients and over NFS for Kodi/Emby
  • Direct editing of huge RAW photos / movies from iMac over SMB
  • Time Machine backups for Mac clients over SMB
  • Various docker containers : portainer, unifi-controller, embyserver (with software transcoding), logitech-media-server, heimdall, rutorrent (through Wireguard), letsencrypt, etc.
  • (TBD: RSync or ZFS send to a local OpenMediaVault backup server + encrypted cloud syncing to Dropbox / Google Drive + homelab virtualization with Cockpit)
I expected to get at least 900MB/s sequential read and 600MB/s write over SMB, but I currently get only 400MB/s for both read and write. For the moment, I am mostly concerned by the local sequential read speed of around 650MB/s that I get from the pool (after a lot of tweaking); I will address the low SMB performance afterwards.
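As a back-of-envelope sanity check (my own arithmetic, decimal units): the 900MB/s target fits under the 10GbE line rate, so the wire itself should not be the hard limit.

```shell
# 10GbE line rate in MB/s, ignoring TCP/SMB protocol overhead
line_rate=$((10 * 1000 / 8))
echo "${line_rate} MB/s line rate"
# Realistic goodput after overhead is roughly 1.1-1.2 GB/s,
# so ~900 MB/s over SMB is ambitious but plausible.
```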


Performance tuning so far
ZFS
main pool (/tank):
Code:
ashift = 12
recordsize = 128K
compression = lz4
atime = off
xattr = sa
sync = disabled (the server is protected by a UPS, monitored with NUT)
dataset used for performance tests (/tank/media):
Code:
recordsize = 1M
other attributes are inherited from /tank
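For reference, per-dataset overrides like the above are applied and verified along these lines (a sketch using the dataset names from this post):

```shell
# Set the large recordsize only on the media dataset; other properties inherit from tank
zfs set recordsize=1M tank/media
# Verify which properties are local vs inherited
zfs get -r recordsize,compression,atime,xattr,sync tank
```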
/etc/modprobe.d/zfs.conf:
Code:
options zfs zfs_arc_max=53687091200 (I don’t need much RAM for non-ZFS tasks)
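Quick arithmetic check on that value, plus (as a sketch) where the live setting can be read back on ZFS on Linux:

```shell
# 53687091200 bytes is exactly 50 GiB out of the 64 GiB installed
arc_bytes=$((50 * 1024 * 1024 * 1024))
echo "$arc_bytes"
# After a reboot, the module parameter can be read back:
#   cat /sys/module/zfs/parameters/zfs_arc_max
```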
SMB options (just to solve some issues with MacOS Catalina)
Code:
min protocol = SMB2
usershare path =

NFS options (local share for Kodi / Emby, used to stream UHD content to an Android TV set)
Code:
ro,subtree_check,insecure,crossmnt
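Spelled out as a full /etc/exports line, that would look something like this (the client subnet is a placeholder, not from my actual config):

```shell
# /etc/exports -- illustrative only; 192.168.1.0/24 is a hypothetical client range
/tank/media  192.168.1.0/24(ro,subtree_check,insecure,crossmnt)
# Apply without restarting the NFS server: exportfs -ra
```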

Speed tests

Bonnie++ (with compression = off on /tank/media)
Code:
# bonnie++ -u root -r 1024 -s 64G -d /tank/media -f -b -n 1 -c 4
gets 820MB/s write, 320MB/s rewrite, 677MB/s read:
Code:
Version  1.98       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
gringotts    64G::4            820m  68  320m  29            677m  35 656.5  49
Latency                       43974us     975ms               296ms     450ms
Version  1.98       ------Sequential Create------ --------Random Create--------
gringotts           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  1 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency                59us       9us     428us      35us      71us      72us
1.98,1.98,gringotts,4,1583180288,64G,,8192,5,,,839945,68,327742,29,,,693055,35,656.5,49,1,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,,43974us,975ms,,296ms,450ms,59us,9us,428us,35us,71us,72us
Note 1: the write performance is quite good here due to the use of sync=disabled and the increased ARC size.
Note 2: with compression = lz4, I get 1.7GB/s write, 1.5GB/s rewrite, 3.4GB/s read with 100% CPU, but this is not a realistic scenario with my mostly incompressible data.

Sample output of # zpool iostat -v 10 during bonnie++ read phase:
Code:
                                                capacity     operations     bandwidth
pool                                          alloc   free   read  write   read  write
--------------------------------------------  -----  -----  -----  -----  -----  -----
tank                                          15.6T  16.9T  2.44K    144   654M  1.14M
  raidz2                                      7.80T  8.45T  1.22K     67   326M   576K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N1NV74PJ      -      -      0     11  5.20K  98.0K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N4NNRF1H      -      -    312     10  81.5M  98.4K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N4TKLVF6      -      -    314     10  81.5M  91.6K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7PJANV7      -      -    313     10  81.5M  93.2K
    ata-WDC_WD30EFRX-68N32N0_WD-WCC7K6YCCJS3      -      -    304     11  81.5M  95.6K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N6AT49D5      -      -      0     12  1.60K  99.2K
  raidz2                                      7.77T  8.48T  1.22K     76   328M   589K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N1XDHAKL      -      -      0     12  3.60K  97.2K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N4TKL3SA      -      -    313     12  82.1M   101K
    ata-WDC_WD30EFRX-68N32N0_WD-WCC7K5PKN5R0      -      -    310     13  82.1M   102K
    ata-WDC_WD30EFRX-68N32N0_WD-WCC7K1SHENDA      -      -    310     12  82.1M  99.2K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N5VPU63H      -      -    313     12  82.1M  94.8K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N6SJP7R9      -      -      0     12  5.20K  94.0K
--------------------------------------------  -----  -----  -----  -----  -----  -----
Note: during bonnie++ write phase, I get between 84 to 107MB/s write speed on each drive.
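The iostat sample is internally consistent with the ~650MB/s pool figure; the odd part is that only 4 of the 6 drives per vdev are streaming (back-of-envelope, rounding the per-drive rate to ~82MB/s):

```shell
# Back-of-envelope from the zpool iostat sample above
active=4; per_drive=82; vdevs=2        # 2 drives per vdev sit nearly idle
per_vdev=$((active * per_drive))
total=$((per_vdev * vdevs))
echo "${per_vdev} MB/s per vdev, ${total} MB/s total"
```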

I think I should get 25 to 50% more read speed than this...

iperf3 from client to server over 10GbE
Code:
% iperf3 -c 10.0.0.16 -t 30 -V
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-30.00  sec  32.8 GBytes  9.40 Gbits/sec                  sender
[  5]   0.00-30.00  sec  32.8 GBytes  9.40 Gbits/sec                  receiver
CPU Utilization: local/sender 98.7% (0.6%u/98.1%s), remote/receiver 31.4% (0.7%u/30.7%s)
rcv_tcp_congestion cubic
Code:
% iperf3 -c 10.0.0.16 -t 30 -V -R
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  31.9 GBytes  9.12 Gbits/sec  869             sender
[  5]   0.00-30.00  sec  31.9 GBytes  9.12 Gbits/sec                  receiver
snd_tcp_congestion cubic
Note 1: I see some retransmits from server to client here, which is probably due to a cabling issue. I will investigate this, but I believe it is marginal.
Note 2: MTU size is kept to default 1500 as it seems the Mikrotik SFP+ modules I have don’t support jumbo frames (S+RJ10 rev. 1). This may explain the gap between local and SMB speeds... I will try a direct link with jumbo frames enabled to see the difference.
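For the direct-link test, jumbo frames could be enabled along these lines (the interface name is a placeholder; every hop in the path has to agree on the MTU):

```shell
# Raise the MTU on the server NIC (interface name is hypothetical)
ip link set dev enp1s0f0 mtu 9000
# Verify the path passes 9000-byte frames unfragmented:
# 8972 = 9000 - 20 (IP header) - 8 (ICMP header)
ping -M do -s 8972 10.0.0.16
```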

Speed Test over SMB / 10GbE (target is /tank/media)
2020-03-02 - DiskSpeedTest.png
Write: 392MB/s, Read: 393MB/s
I get similar results when manually moving large files around (to and from a RAM disk on the client), with a nice speed boost when moving the same file twice due to the ZFS ARC.


I don't think my expectations are unrealistic, and I would be grateful if someone could point me in the right direction (to 10GbE Nirvana ;-)
Thank you for reading, and for your most welcome insights.

Additional information
Code:
Pool status (zpool status):

  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 0 days 06:11:44 with 0 errors on Tue Feb 11 21:56:46 2020
config:

    NAME                                          STATE     READ WRITE CKSUM
    tank                                          ONLINE       0     0     0
     raidz2-0                                    ONLINE       0     0     0
       ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N1NV74PJ  ONLINE       0     0     0
       ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N4NNRF1H  ONLINE       0     0     0
       ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N4TKLVF6  ONLINE       0     0     0
       ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7PJANV7  ONLINE       0     0     0
       ata-WDC_WD30EFRX-68N32N0_WD-WCC7K6YCCJS3  ONLINE       0     0     0
       ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N6AT49D5  ONLINE       0     0     0
     raidz2-1                                    ONLINE       0     0     0
       ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N1XDHAKL  ONLINE       0     0     0
       ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N4TKL3SA  ONLINE       0     0     0
       ata-WDC_WD30EFRX-68N32N0_WD-WCC7K5PKN5R0  ONLINE       0     0     0
       ata-WDC_WD30EFRX-68N32N0_WD-WCC7K1SHENDA  ONLINE       0     0     0
       ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N5VPU63H  ONLINE       0     0     0
       ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N6SJP7R9  ONLINE       0     0     0

errors: No known data errors

Pool details (zpool get all):

NAME  PROPERTY                       VALUE                          SOURCE
tank  size                           32.5T                          -
tank  capacity                       47%                            -
tank  altroot                        -                              default
tank  health                         ONLINE                         -
tank  guid                           7470709504985343392            -
tank  version                        -                              default
tank  bootfs                         -                              default
tank  delegation                     on                             default
tank  autoreplace                    off                            default
tank  cachefile                      -                              default
tank  failmode                       wait                           default
tank  listsnapshots                  off                            default
tank  autoexpand                     off                            default
tank  dedupditto                     0                              default
tank  dedupratio                     1.00x                          -
tank  free                           17.0T                          -
tank  allocated                      15.5T                          -
tank  readonly                       off                            -
tank  ashift                         12                             local
tank  comment                        -                              default
tank  expandsize                     -                              -
tank  freeing                        0                              -
tank  fragmentation                  2%                             -
tank  leaked                         0                              -
tank  multihost                      off                            default
tank  checkpoint                     -                              -
tank  load_guid                      11594792279076986393           -
tank  autotrim                       off                            default
tank  feature@async_destroy          enabled                        local
tank  feature@empty_bpobj            active                         local
tank  feature@lz4_compress           active                         local
tank  feature@multi_vdev_crash_dump  enabled                        local
tank  feature@spacemap_histogram     active                         local
tank  feature@enabled_txg            active                         local
tank  feature@hole_birth             active                         local
tank  feature@extensible_dataset     active                         local
tank  feature@embedded_data          active                         local
tank  feature@bookmarks              enabled                        local
tank  feature@filesystem_limits      enabled                        local
tank  feature@large_blocks           active                         local
tank  feature@large_dnode            enabled                        local
tank  feature@sha512                 enabled                        local
tank  feature@skein                  enabled                        local
tank  feature@edonr                  enabled                        local
tank  feature@userobj_accounting     active                         local
tank  feature@encryption             enabled                        local
tank  feature@project_quota          active                         local
tank  feature@device_removal         enabled                        local
tank  feature@obsolete_counts        enabled                        local
tank  feature@zpool_checkpoint       enabled                        local
tank  feature@spacemap_v2            active                         local
tank  feature@allocation_classes     enabled                        local
tank  feature@resilver_defer         enabled                        local
tank  feature@bookmark_v2            enabled                        local

Pool filesystem details (zfs get all):

NAME  PROPERTY              VALUE                                 SOURCE
tank  type                  filesystem                            -
tank  creation              Sun Nov 17 12:18 2019                 -
tank  used                  10.3T                                 -
tank  available             10.7T                                 -
tank  referenced            37.1G                                 -
tank  compressratio         1.01x                                 -
tank  mounted               yes                                   -
tank  quota                 none                                  default
tank  reservation           none                                  default
tank  recordsize            128K                                  default
tank  mountpoint            /tank                                 default
tank  sharenfs              off                                   default
tank  checksum              on                                    default
tank  compression           lz4                                   local
tank  atime                 off                                   local
tank  devices               on                                    default
tank  exec                  on                                    default
tank  setuid                on                                    default
tank  readonly              off                                   default
tank  zoned                 off                                   default
tank  snapdir               hidden                                default
tank  aclinherit            restricted                            default
tank  createtxg             1                                     -
tank  canmount              on                                    default
tank  xattr                 sa                                    local
tank  copies                1                                     default
tank  version               5                                     -
tank  utf8only              off                                   -
tank  normalization         none                                  -
tank  casesensitivity       sensitive                             -
tank  vscan                 off                                   default
tank  nbmand                off                                   default
tank  sharesmb              off                                   default
tank  refquota              none                                  default
tank  refreservation        none                                  default
tank  guid                  6829761628355240197                   -
tank  primarycache          all                                   default
tank  secondarycache        all                                   default
tank  usedbysnapshots       0B                                    -
tank  usedbydataset         37.1G                                 -
tank  usedbychildren        10.3T                                 -
tank  usedbyrefreservation  0B                                    -
tank  logbias               latency                               default
tank  objsetid              51                                    -
tank  dedup                 off                                   default
tank  mlslabel              none                                  default
tank  sync                  disabled                              local
tank  dnodesize             legacy                                default
tank  refcompressratio      1.02x                                 -
tank  written               37.1G                                 -
tank  logicalused           10.5T                                 -
tank  logicalreferenced     37.9G                                 -
tank  volmode               default                               default
tank  filesystem_limit      none                                  default
tank  snapshot_limit        none                                  default
tank  filesystem_count      none                                  default
tank  snapshot_count        none                                  default
tank  snapdev               hidden                                default
tank  acltype               off                                   default
tank  context               none                                  default
tank  fscontext             none                                  default
tank  defcontext            none                                  default
tank  rootcontext           none                                  default
tank  relatime              off                                   default
tank  redundant_metadata    all                                   default
tank  overlay               off                                   default
tank  encryption            off                                   default
tank  keylocation           none                                  default
tank  keyformat             none                                  default
tank  pbkdf2iters           0                                     default
tank  special_small_blocks  0                                     default
tank  omvzfsplugin:uuid     753ded0e-697f-4d52-aec2-274b5c3f852d  local
 

acquacow

Well-Known Member
Feb 15, 2017
My SMB was bottlenecked at core performance until I enabled smb multithread.
 

Spartacus

Well-Known Member
May 27, 2019
Austin, TX
Your calculations are not off; however, 900/600 would be the expectation for 7200 rpm drives.
I feel 700/500 +/- 50 is more realistic with the 5400 rpm drives you have, so the "local sequential read speed of around 650MB/s that I get from the pool" may not be out of place (per @i386 's note of the drive bottleneck).

@acquacow doesn't have a bad idea: what does the CPU look like when executing a full write?
However, the first thing that comes to mind for the 400/400 issue is: are you getting full throughput over the 10G?
Try iperf and see what speeds you get; I've seen weird issues only getting half speed with my own 10G connection with Mellanox Windows drivers.
 

Wariole

New Member
Mar 2, 2020
Thank you all very much for your answers.

My SMB was bottlenecked at core performance until I enabled smb multithread.
I assume you are suggesting multichannel; even though I am not sure how this would benefit a single 10GbE connection, I tried anyway by adding this to smb.conf:
Code:
server multi channel support = yes
aio read size = 1
aio write size = 1
Unfortunately, this doesn't seem to make any difference for my read/write speed (tested from my iMac with SMB v3.02). The server is running Samba 4.9.5-Debian.
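On the Mac side, what was actually negotiated per share can be inspected with smbutil (output details vary by macOS version):

```shell
# macOS: per-share SMB negotiation details (dialect, signing, capabilities)
smbutil statshares -a
```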

3TB drives? I think that's your bottleneck.
Thank you for pointing this out.
I decided to check a single drive performance by benchmarking an old spare drive of the same 3TB WD Red model, connected to an integrated SATA port of the motherboard and formatted with EXT4.
I get 137MB/s write, 144MB/s read, which is much better than the 82MB/s read per drive that I get with my 2 x 6 RaidZ2 pool (reported by zpool iostat)…
Code:
# bonnie++ -u root -r 1024 -s 64G -d . -f -b -n 1 -c 4
Version  1.98       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
gringotts    64G::4            137m   8 69.3m   4            144m   5 255.0   7
Latency                         206ms    2027ms               111ms     281ms
Version  1.98       ------Sequential Create------ --------Random Create--------
gringotts           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  1  1024   0 +++++ +++  1024   0  1024   0 +++++ +++  1024   0
Latency               208ms      63us   58967us   88274us      11us   57931us
1.98,1.98,gringotts,4,1583244197,64G,,8192,5,,,140432,8,70975,4,,,147910,5,255.0,7,1,,,,,43,0,+++++,+++,44,0,43,0,+++++,+++,43,0,,206ms,2027ms,,111ms,281ms,208ms,63us,58967us,88274us,11us,57931us
It makes me believe the 3TB drives might not be my bottleneck. Please correct me if I am wrong.
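A rough ceiling based on that single-drive number, under the idealized assumption that RAIDZ2 streaming reads come from the 4 data columns of each 6-disk vdev:

```shell
# Idealized sequential-read ceiling for 2 x (6-disk RAIDZ2)
vdevs=2; data_per_vdev=4; single_drive=144   # MB/s from the spare-drive bench
ideal=$((vdevs * data_per_vdev * single_drive))
echo "${ideal} MB/s ideal ceiling"
# vs ~650 MB/s observed, i.e. per active drive:
observed=650; active=8
echo "$((observed / active)) MB/s per active drive"
```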

@acquacow doesn't have a bad idea, whats the CPU look like when executing a full write.
Something like this while executing a big write from SMB client:
Code:
top - 18:28:43 up  1:57,  2 users,  load average: 0.45, 0.26, 0.28
Tasks: 455 total,   1 running, 452 sleeping,   0 stopped,   2 zombie
%Cpu0  :  0.3 us,  4.0 sy,  0.0 ni, 95.3 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu1  :  1.3 us,  5.3 sy,  0.0 ni, 93.1 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu2  :  0.3 us,  3.6 sy,  0.0 ni, 73.8 id, 21.5 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu3  :  0.7 us,  4.7 sy,  0.0 ni, 94.3 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu4  :  1.0 us,  9.6 sy,  0.0 ni, 83.7 id,  4.3 wa,  0.0 hi,  1.3 si,  0.0 st
%Cpu5  :  0.3 us,  5.3 sy,  0.0 ni, 94.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu6  :  0.7 us,  5.7 sy,  0.0 ni, 92.6 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu7  :  0.0 us,  7.7 sy,  0.0 ni, 92.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu8  :  0.0 us,  5.0 sy,  0.0 ni, 94.1 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu9  :  0.0 us,  4.0 sy,  0.0 ni, 87.0 id,  8.7 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu10 :  0.3 us,  5.9 sy,  0.0 ni, 93.1 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu11 :  0.7 us,  2.7 sy,  0.0 ni, 96.0 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu12 :  0.3 us,  6.6 sy,  0.0 ni, 92.7 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu13 :  2.0 us,  8.1 sy,  0.0 ni, 86.9 id,  0.0 wa,  0.0 hi,  3.0 si,  0.0 st
%Cpu14 :  0.7 us,  7.9 sy,  0.0 ni, 90.4 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu15 :  0.7 us,  9.4 sy,  0.0 ni, 87.9 id,  0.0 wa,  0.0 hi,  2.0 si,  0.0 st
MiB Mem :  64314.8 total,  47388.5 free,  16328.7 used,    597.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  47266.1 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                       
27774 ym        20   0 8412596  21672  14280 S  43.9   0.0   4:03.95 smbd                                                                                                                           
 1242 root       1 -19       0      0      0 S   3.7   0.0   0:18.09 z_wr_iss                                                                                                                       
 1248 root       1 -19       0      0      0 S   3.7   0.0   0:18.05 z_wr_iss                                                                                                                       
 1241 root       1 -19       0      0      0 S   3.3   0.0   0:18.25 z_wr_iss                                                                                                                       
 1243 root       1 -19       0      0      0 S   3.3   0.0   0:18.14 z_wr_iss                                                                                                                       
 1244 root       1 -19       0      0      0 S   3.3   0.0   0:18.27 z_wr_iss                                                                                                                       
 1245 root       1 -19       0      0      0 S   3.3   0.0   0:18.12 z_wr_iss                                                                                                                       
 1246 root       1 -19       0      0      0 S   3.3   0.0   0:18.12 z_wr_iss                                                                                                                       
 1247 root       1 -19       0      0      0 S   3.3   0.0   0:18.26 z_wr_iss                                                                                                                       
 1249 root       1 -19       0      0      0 S   3.3   0.0   0:18.23 z_wr_iss                                                                                                                       
 1250 root       1 -19       0      0      0 S   3.3   0.0   0:17.98 z_wr_iss                                                                                                                       
 1251 root       1 -19       0      0      0 S   3.3   0.0   0:18.32 z_wr_iss                                                                                                                       
 1252 root       1 -19       0      0      0 S   3.3   0.0   0:18.09 z_wr_iss
However, the first thing that comes to mind for the 400/400 issue is: are you getting full throughput over the 10G?
Try iperf and see what speeds you get; I've seen weird issues only getting half speed with my own 10G connection with Mellanox Windows drivers.
I get 9.40Gb/s and 9.12Gb/s with iperf3.
This seems to be a pure SMB issue: today I mounted a tmpfs RAM disk on the server and got the same 400/400 results.
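For anyone wanting to repeat the RAM-disk test, a tmpfs-backed share can be set up roughly like this (mount point and size are placeholders, not from my actual setup):

```shell
# Take the pool out of the equation by backing an SMB share with RAM
mkdir -p /mnt/ramtest
mount -t tmpfs -o size=16g tmpfs /mnt/ramtest
# Then point a temporary [ramtest] share in smb.conf at /mnt/ramtest
```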

So, I need to fix what I think is a suboptimal read speed from my pool + the SMB issue...
 

XeonLab

New Member
Aug 14, 2016
Is the server already in production? If not, you could try different VDEV combinations (mirrors/stripes) to see if that makes any difference. Even better if you have a couple of spare SSDs to test a full flash pool.
 

Wariole

New Member
Mar 2, 2020
Maybe try to export and mount as NFS and test again?
I just tried that and got slightly worse performance over NFS (which in my experience is usual with macOS): 360MB/s write and 320MB/s read.

It could be the smb client (MAC?)
This is what I suspected, so I tried with my Windows laptop today, and I am even more confused now...
This is a speed test on the same /tank/media dataset over SMB from Windows 10:
2020-03-03 - CrystalDiskMark tank.png
and then testing on a ramdisk on the same server over SMB from Windows 10:
2020-03-03 - CrystalDiskMark ramdisk.png
I am having a hard time understanding those results, especially the lower read speed from the ramdisk...

Is the server already in production? If not, you could try different VDEV combinations (mirrors/stripes) to see if that makes any difference. Even better if you have a couple of spare SSDs to test a full flash pool.
It is already in production and close to 50% full, so it is not easy to try different combinations again.
I did some trials when I created the pool, but it was using a much slower CPU/RAM/MB, and my bottleneck was clearly the CPU, so I decided to go for the current 3700X build.
I have 3 spare HDDs of the same 3TB WD Red and just a couple of 240GB Intel 535 SSD. The S3012 chassis is full but I have 6 free SATA ports on the motherboard.
What kind of test do you suggest?
 

Wariole

New Member
Mar 2, 2020
As an experiment, I disabled SMT on the 3700X and tried the same Bonnie++ benchmark (with compression = off):
Code:
# bonnie++ -u root -r 1024 -s 64G -d /tank/media -f -b -n 1 -c 4
Version  1.98       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
gringotts    64G::4            896m  71  336m  31            745m  38 710.3  53
Latency                       58809us     892ms               169ms     412ms
Version  1.98       ------Sequential Create------ --------Random Create--------
gringotts           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  1 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency                64us       2us     288us      30us      11us      72us
1.98,1.98,gringotts,4,1583255260,64G,,8192,5,,,917718,71,344559,31,,,763001,38,710.3,53,1,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,,58809us,892ms,,169ms,412ms,64us,2us,288us,30us,11us,72us
So:
  • with SMT on: 820MB/s write, 320MB/s rewrite, 677MB/s read
  • with SMT off: 896MB/s write, 336MB/s rewrite, 745MB/s read
Sample output of # zpool iostat -v 10 during bonnie++ read phase (with SMT off):
Code:
                                                capacity     operations     bandwidth
pool                                          alloc   free   read  write   read  write
--------------------------------------------  -----  -----  -----  -----  -----  -----
tank                                          15.6T  16.9T  2.89K    113   773M   815K
  raidz2                                      7.80T  8.45T  1.42K     42   379M   278K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N1NV74PJ      -      -    212      6  56.5M  44.8K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N4NNRF1H      -      -    360      6  94.8M  45.2K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N4TKLVF6      -      -    148      7  38.2M  46.4K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7PJANV7      -      -    147      7  38.2M  46.8K
    ata-WDC_WD30EFRX-68N32N0_WD-WCC7K6YCCJS3      -      -    361      6  94.8M  48.0K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N6AT49D5      -      -    220      7  56.5M  47.2K
  raidz2                                      7.77T  8.48T  1.47K     71   395M   537K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N1XDHAKL      -      -      0     11  9.20K  88.0K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N4TKL3SA      -      -    379     14  98.7M   104K
    ata-WDC_WD30EFRX-68N32N0_WD-WCC7K5PKN5R0      -      -    375     11  98.6M  86.8K
    ata-WDC_WD30EFRX-68N32N0_WD-WCC7K1SHENDA      -      -    375     10  98.6M  85.6K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N5VPU63H      -      -    378     11  98.6M  87.6K
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N6SJP7R9      -      -      0     11  7.20K  85.2K
--------------------------------------------  -----  -----  -----  -----  -----  -----
Now I have 10 drives out of 12 doing reads instead of 8 previously, and the top single drive read speed has also increased significantly.
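For what it's worth, on recent kernels SMT can also be toggled at runtime instead of in the BIOS, which makes this kind of A/B test quicker (requires root):

```shell
# Linux >= 4.19: query and toggle SMT without a reboot
cat /sys/devices/system/cpu/smt/control    # "on" / "off" / "notsupported"
echo off > /sys/devices/system/cpu/smt/control
```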
 

XeonLab

New Member
Aug 14, 2016
...
It is already in production and close to 50% full, so it is not easy to try different combinations again.
I did some trials when I created the pool, but it was using a much slower CPU/RAM/MB, and my bottleneck was clearly the CPU, so I decided to go for the current 3700X build.
I have 3 spare HDDs of the same 3TB WD Red and just a couple of 240GB Intel 535 SSD. The S3012 chassis is full but I have 6 free SATA ports on the motherboard.
What kind of test do you suggest?
Create a 2- or 3-disk striped ZFS pool with the SSDs and test its performance; you should be able to saturate the 10GbE link even with two drives. If you can, then you know SMB and the network are fine.

And you probably know this already, but be safe when playing with production hardware: I'd export the HDD pool and disconnect the drives before creating any new test pools, if the downtime allows. Then you could even use the LSI HBA instead of onboard SATA, which would otherwise just be another unnecessary variable in the test.

PS. Some folks here might have better knowledge, but use ashift=13 with SSDs, as they have an 8K sector size.
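For anyone following along, the export + test-pool steps above can be sketched like this (a minimal sketch only; the pool and device names are the ones from this thread, and the export/destroy steps assume the downtime is acceptable):

```shell
# Export the production pool first so nothing can touch it by accident
zpool export tank

# Throwaway 2-SSD stripe with 8K sectors (ashift=13)
zpool create -o ashift=13 intelssdstripe \
    /dev/disk/by-id/ata-INTEL_SSDSC2BW240H6_CVTR5180021B240CGN \
    /dev/disk/by-id/ata-INTEL_SSDSC2BW240H6_CVTR5180014L240CGN

# Match the main pool's dataset properties before benchmarking
zfs set recordsize=1M intelssdstripe
zfs set compression=off intelssdstripe
zfs set atime=off intelssdstripe
zfs set xattr=sa intelssdstripe
zfs set sync=disabled intelssdstripe

# Tear down afterwards and re-import the production pool
zpool destroy intelssdstripe
zpool import tank
```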
 
  • Like
Reactions: Wariole

i386

Well-Known Member
Mar 18, 2016
2,460
675
113
31
Germany
I get 137MB/s write, 144MB/s read,
These are much higher than what I would expect from sub-7200 rpm 3 TB drives.
I had 3 TB WD Greens and they had peak reads/writes of 110 MB/s and average reads/writes of ~65 MB/s.
Are you sure that you're not measuring cached data?
 

Wariole

New Member
Mar 2, 2020
8
0
1
I performed some additional digging per your suggestions:

ZFS performance troubleshooting

Following @XeonLab advice (thank you very much for the export and ashift=13 reminder), I created a 2 SSD ZFS stripe, first connected to onboard SATA, and then connected to the LSI HBA:
Code:
ashift = 13
recordsize = 1M
compression = off
atime = off
xattr = sa
sync = disabled
Code:
NAME                                          STATE     READ WRITE CKSUM
intelssdstripe                                ONLINE       0     0     0
  ata-INTEL_SSDSC2BW240H6_CVTR5180021B240CGN  ONLINE       0     0     0
  ata-INTEL_SSDSC2BW240H6_CVTR5180014L240CGN  ONLINE       0     0     0
Bonnie++ results with onboard SATA: 1GB/s write, 503MB/s rewrite, 988MB/s read (exactly as expected).
Bonnie++ results with 2 disks on the same backplane / LSI HBA: 516MB/s write, 258MB/s rewrite, 522MB/s read.
-> it seems my bandwidth is being capped somewhere… so I moved one of the SSDs to a separate backplane -> same results. I even ordered an SFF-8087 to 4x SATA cable to check whether something is going on with my backplanes (should be able to test it in 2 days).

After banging my head for a while, I opened the LSI Bios and discovered the 2 SSDs are negotiating to 3Gb/s (SATA2) instead of 6Gb/s (SATA3) that both the SSDs and LSI 9201-16i are supposed to support. It is something that I will investigate, but I think it shouldn't cause a performance bottleneck with my 3TB WD Red drives...

While I was there, I plugged my single 3TB drive, formatted as EXT4, into the LSI HBA and got the same 136MB/s write / 144MB/s read speeds I got when it was connected to the onboard SATA.
I erased the drive and created a single drive ZFS pool with the same properties as my main pool and got 144MB/s write, 136MB/s read speed:
Code:
# bonnie++ -u root -r 1024 -s 64G -d /single3tb -f -b -n 1 -c 4
Version  1.98       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
gringotts    64G::4            144m  14 56.1m   5            136m   5 241.4  17
Latency                       47398us    2075ms              2058ms    1250ms
Version  1.98       ------Sequential Create------ --------Random Create--------
gringotts           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  1 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               287us       4us     366us      99us       5us      58us
1.98,1.98,gringotts,4,1583354843,64G,,8192,5,,,147403,14,57470,5,,,139057,5,241.4,17,1,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,,47398us,2075ms,,2058ms,1250ms,287us,4us,366us,99us,5us,58us
Those numbers seem about right if there is some data on the drive, based on the STH testing chart here (couldn't find the actual test post):
https://www.servethehome.com/wp-content/uploads/2019/01/Easystore-WD-10TB-White-Label-Parkdale.jpg

The 3 TB WD Red tested empty at about 149/146 MB/s (sequential)
Thank you very much @Spartacus for the confirmation (and useful chart).

I still don't see where my ZFS bottleneck is (< 100MB/s single drive read speed inside my 2 x 6 RaidZ2 pool)...​


Poor SMB speed over 10GbE

After creating the ZFS stripe with 2 SSDs (over onboard SATA, with 1GB/s read & write bonnie++ speeds), I performed a CrystalDiskMark test over SMB. I got 284MB/s sequential read and 1049MB/s sequential write over my current network, and similar results through a direct link between the 10GbE NICs...
So I decided to surrender to the jumbo frames craziness and finally got 1181/1198MB/s (which is obviously coming from ARC, but that's perfectly fine). macOS gives me 916/918MB/s with the same setup.

MTU over 9000.jpg

The issue is that I really don't want to mess around with jumbo frames across my network, and my Mikrotik SFP+ modules don't even support them. So I need to find a way to get similar speeds without jumbo frames. Any help much appreciated :)
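For what it's worth, a back-of-envelope check suggests jumbo frames shouldn't be needed just to beat 284MB/s. Assuming plain TCP/IPv4 with no options, and counting Ethernet header, FCS, preamble and inter-frame gap as 38 bytes of per-frame overhead:

```shell
# Theoretical TCP payload rate on 10GbE (1250 MB/s raw line rate):
#   payload/frame    = MTU - 40   (IPv4 + TCP headers, no options)
#   wire bytes/frame = MTU + 38   (Ethernet header + FCS + preamble + IFG)
awk 'BEGIN { mtu = 1500; printf "MTU %d: %.0f MB/s\n", mtu, 1250 * (mtu - 40) / (mtu + 38) }'
awk 'BEGIN { mtu = 9000; printf "MTU %d: %.0f MB/s\n", mtu, 1250 * (mtu - 40) / (mtu + 38) }'
# -> MTU 1500: 1187 MB/s, MTU 9000: 1239 MB/s
```

So standard 1500-byte frames already allow ~1187MB/s in theory; getting only 284MB/s reads looks like per-packet CPU/interrupt cost or SMB behavior, not a raw bandwidth limit.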
 
Last edited:

Wariole

New Member
Mar 2, 2020
8
0
1
Following @XeonLab advice (thank you very much for the export and ashift=13 reminder), I created a 2 SSD ZFS stripe, first connected to onboard SATA, and then connected to the LSI HBA:
[...]
Bonnie++ results with onboard SATA: 1GB/s write, 503MB/s rewrite, 988MB/s read (exactly as expected).
Bonnie++ results with 2 disks on the same backplane / LSI HBA: 516MB/s write, 258MB/s rewrite, 522MB/s read.
-> it seems my bandwidth is being capped somewhere… so I moved one of the SSDs to a separate backplane -> same results. I even ordered an SFF-8087 to 4x SATA cable to check whether something is going on with my backplanes (should be able to test it in 2 days).​
I tried the same SSD stripe today, connected to the HBA with an SFF-8087 to 4x SATA cable, and got 528MB/s write, 522MB/s read.
So my HBA is probably limiting the SATA link speed to 3Gb/s per lane.
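The ~520MB/s ceiling is consistent with two SATA2 links: SATA uses 8b/10b encoding, so every payload byte costs 10 bits on the wire:

```shell
# SATA line rate / 10 = usable payload bandwidth per lane (8b/10b encoding)
awk 'BEGIN { printf "SATA2: %d MB/s per lane\n", 3000 / 10 }'   # -> 300
awk 'BEGIN { printf "SATA3: %d MB/s per lane\n", 6000 / 10 }'   # -> 600
```

Two striped SSDs at SATA2 give a ~600MB/s ceiling, which minus protocol overhead lines up with the measured 516-528MB/s. A single WD Red at ~145MB/s is nowhere near the 300MB/s per-lane limit, though, so this alone shouldn't slow the HDD pool.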

I checked the HBA PCIe speed with # lspci -vv:
Code:
2b:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor] (rev 02)
    Subsystem: LSI Logic / Symbios Logic SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor] (SAS 9201-16i)
    Control: I/O+ Mem+ BusMaster+ SpecCycle-
[...]
        LnkCap:    Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
        LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
[...]
    Kernel driver in use: mpt3sas
    Kernel modules: mpt3sas
It seems good to me, and still doesn't explain the suboptimal read speed with my main pool...
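The lspci output can be sanity-checked the same way: PCIe 2.0 (5GT/s) also uses 8b/10b encoding, so each lane carries 500MB/s of payload per direction:

```shell
# PCIe 2.0: 5 GT/s with 8b/10b encoding -> 500 MB/s per lane per direction
awk 'BEGIN { printf "x8 link: %d MB/s each way\n", 8 * 5000 / 10 }'   # -> 4000
```

~4GB/s each way is far more than 12 spinning disks can deliver, so the HBA's PCIe link really isn't the bottleneck here.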
 

XeonLab

New Member
Aug 14, 2016
26
8
3
...
It seems good to me, and still doesn't explain the suboptimal read speed with my main pool...
One thing you could still try is to create a single VDEV RAIDZ1 pool with those 3 spare drives and see if it performs as expected.

By the way, did you benchmark or burn-in individual disks before creating the pool? If not, at least do it with those spare drives before going forward.
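For reference, a common burn-in recipe for spare drives looks like this (a sketch only; /dev/sdX is a placeholder, and badblocks -w is destructive, so never point it at a disk holding data):

```shell
# 1. SMART long self-test (non-destructive), then inspect the result
smartctl -t long /dev/sdX
smartctl -a /dev/sdX

# 2. DESTRUCTIVE 4-pattern write+verify pass over the whole disk
badblocks -wsv -b 4096 /dev/sdX

# 3. Re-check SMART attributes for reallocated / pending sectors afterwards
smartctl -A /dev/sdX
```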
 

Wariole

New Member
Mar 2, 2020
8
0
1
I found a solution for the ZFS read speed, thanks to this openzfs issue posted on GitHub.
I had to increase zfetch_max_distance to the maximum value of 2147483648 (after trying many values up from the default 8388608). Here are the results:
Code:
# echo 2147483648 >> /sys/module/zfs/parameters/zfetch_max_distance
# bonnie++ -u root -r 1024 -s 64G -d /tank/media -f -b -n 1 -c 4
Version  1.98       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
gringotts    64G::4            850m  70  358m  42            1.4g  53 641.8  48
Latency                       24399us     936ms              2036ms     365ms
Version  1.98       ------Sequential Create------ --------Random Create--------
gringotts           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  1 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency                46us       4us     431us      36us       2us      42us
1.98,1.98,gringotts,4,1583538515,64G,,8192,5,,,870088,70,366971,42,,,1484196,53,641.8,48,1,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,,24399us,936ms,,2036ms,365ms,46us,4us,431us,36us,2us,42us
Since the 1.4GB/s read speed reported by bonnie++ is obviously too good to be true, I ran a dd test (compression is off):
Code:
# dd if=/dev/zero of=zero bs=10M
57524879360 bytes (58 GB, 54 GiB) copied, 65.9013 s, 873 MB/s
Code:
(after a reboot):
# dd if=/tank/media/zero of=/dev/null bs=10M
57524879360 bytes (58 GB, 54 GiB) copied, 60.6991 s, 948 MB/s
So:
  • initial benchmark: 820MB/s write, 320MB/s rewrite, 677MB/s read
  • with SMT off: 896MB/s write, 336MB/s rewrite, 745MB/s read
  • with SMT off and zfetch_max_distance=2147483648: 850MB/s write, 358MB/s rewrite, 948MB/s read
+40% for the read speed... Not too bad!

Even so, the extreme zfetch_max_distance value looks like a temporary workaround for something wrong in the current ZoL version, and it is probably hurting IOPS. But at least it confirms my expectations were reasonable :)
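Note that the echo into /sys is lost on reboot. On a Debian-based system like OMV with the Proxmox kernel, the usual way to make a ZFS module tunable stick is a modprobe.d entry (a sketch, assuming the zfs module is loaded from the initramfs):

```shell
# Persist the tunable across reboots
echo "options zfs zfetch_max_distance=2147483648" > /etc/modprobe.d/zfs.conf
update-initramfs -u
```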

Now I just need to find a way to make SMB multi channel work with a single 10GbE link or swallow the jumbo pill...
2020-03-06 - DiskSpeedTest with jumbo.png

One thing you could still try is to create a single VDEV RAIDZ1 pool with those 3 spare drives and see if it performs as expected.

By the way, did you benchmark or burn-in individual disks before creating the pool? If not, at least do it with those spare drives before going forward.
For the last part: all drives were either recycled from my previous NAS or bought second-hand. I benchmarked one of the (used) spare drives at 144/136MB/s, which confirmed something was off with the read speed from my ZFS pool.
Thanks again for your sound advice!
 
Last edited: