AiO: ESXi, OmniOS, Napp-it on Ryzen 3rd gen

netforceatg

New Member
Jun 2, 2015
18
1
3
36
First off, thank you all for your contributions here on the forum. I'm not an active poster, but I often read here as I'm really into various computer and network topics.

For the last couple of years I have run an AiO: ESXi with 4-5 VMs, one of them OmniOS with napp-it for ZFS, as well as a few other machines (no heavy loads). I often download to temporary storage on the SAS drives through a virtual Win2019 machine, and then transfer the files to the RAIDZ2 array. Hardware:
- Asrock D1541D4U-2T8R (D-1541, SAS3008, X550 10Gb onboard)
- 512GB SATA SSD for ESXi and virtual machines, mounted directly in ESXi
- X-Case RM424 Pro-EX V2 24-bay (SAS expander, 2x4 port, 12Gb/s; assumed LSI chipset expander)
- 128GB ECC RAM at 2400MHz, 80GB given to OmniOS/ZFS. All CPU cores passed through to OmniOS
- HBA passthrough: 3x 900GB HUSMM1680ASS (stripe) and RAIDZ2 of 11x Seagate ST8000VN0022
- ESXi 6.7U2, OmniOS v11 r151032p, napp-it 19.12a1, Win2019 DC

However, I've never been really satisfied with the ZFS performance. Running a local mc file transfer on OmniOS from the SAS array to the RAIDZ2 array gives 500-700MB/s (large 20GB file transfer), and the CPU looks close to maxed out in the ESXi overview. Mounting the disks over Samba in the virtual Win2019 machine (vmxnet3 driver) gives around 220-240MB/s from the SAS array to the RAIDZ2 array.
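One quick way to separate raw pool throughput from SMB/VM overhead is a plain dd sequential write inside the storage VM. A minimal sketch follows; the target path and size are placeholders (point TARGET at the pool mountpoint for a real run), and note that with lz4 compression enabled /dev/zero compresses away and inflates the number, so test on a dataset with compression off.

```shell
# Rough sequential-write check, run inside the OmniOS VM.
# TARGET and SIZE_MB are placeholders: for a realistic run use e.g.
# TARGET=/tank1/ddtest.bin SIZE_MB=20480 (20GB) on a compression-off dataset.
TARGET=${TARGET:-/tmp/ddtest.bin}
SIZE_MB=${SIZE_MB:-256}

start=$(date +%s)
dd if=/dev/zero of="$TARGET" bs=1048576 count="$SIZE_MB" 2>/dev/null
sync
end=$(date +%s)

secs=$(( end - start )); [ "$secs" -gt 0 ] || secs=1   # avoid divide-by-zero
mbps=$(( SIZE_MB / secs ))
echo "wrote ${SIZE_MB} MB in ${secs}s: ~${mbps} MB/s"
rm -f "$TARGET"
```

Comparing this number against the mc and SMB figures shows how much of the gap is the pool itself versus the network/VM path.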

Now for my sanity check. I'm considering upgrading the hardware, keeping the disks, chassis and RAM, and replacing the aging D-1541 motherboard with this:
- X570 motherboard (considering the Asus X570-ACE or a Gigabyte board, to get more PCIe/NVMe ports)
- AMD Ryzen 3700X (seems to be the sweet spot between price and performance)
- Add an HBA, SAS3008 (buy one of the 9300-8i cards from eBay)
Additional considerations:
- Add an NVMe PCIe 4.0 2TB (like the Corsair MP600) for VMs, possibly replacing the need for the SAS array
- Maybe add a graphics card that can be passed through to the Win2019 VM, and get CALs for 3-4 remote users. Recommendations?
- Not installing a 10Gb NIC for now, as most transfers are local on the host

Other:
Might wait until Samsung releases their 980 NVMe PCIe 4.0 instead of the Corsair MP600. I'm uncertain whether this should be mounted in OmniOS and then shared over the network to ESXi for VMs, or just attached directly to the ESXi host itself?
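If the NVMe ends up passed through to OmniOS, sharing it back to ESXi over NFS is the usual AiO pattern. A minimal sketch of what that would look like; the pool/dataset names and IP address here are placeholders, not taken from the thread:

```shell
# On OmniOS: create a dataset on the NVMe pool and export it over NFS.
# "nvmepool/vms" and the IP are hypothetical examples.
zfs create nvmepool/vms
zfs set sharenfs=on nvmepool/vms

# On the ESXi host: mount that export as a datastore.
esxcli storage nfs add -H 192.168.1.10 -s /nvmepool/vms -v nvme_nfs
```

The trade-off: the ZFS route gets snapshots and checksums for the VMs, while direct attach to ESXi avoids the loopback NFS overhead.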

What do you guys think?
 
  • Like
Reactions: StevenDTX

netforceatg

At first look the Asus X570-ACE seemed good, but there appear to be some limitations on NVMe drives and PCIe slots. I've looked at a few more motherboards, and the Gigabyte X570 AORUS ULTRA looks to be a great contender in the same price range, with more features / PCIe lanes exposed.

I'm planning to populate it as follows:
(CPU) PCIe x16 slot, running x8: NVIDIA Quadro 4000, passthrough to Win2019 for RDP, basic use (future HW upgrade)
(CPU) PCIe x16 slot, running x8: LSI SAS3008 HBA, passthrough to OmniOS
(PCH) PCIe x16 slot, running x4: free for future use
(PCH) PCIe x1 slot: Radeon 5450 for the ESXi host
(PCH) PCIe x1 slot: Mellanox ConnectX-3 MCX311

(CPU) NVMe M2A: MP600 2TB PCIe 4.0 x4 SSD
(PCH) NVMe M2B: future-proofing, or a Samsung 970 EVO for misc
(PCH) NVMe M2C: future-proofing

A few caveats with this, which I'm fine with:
- I can only use 4 of 6 SATA ports if both the NVMe B and C slots are populated
- I do not need WiFi, but the board has it
- The Mellanox card will only run at around 8Gbit/s due to the x1 slot, but that's fine; I can move it to the x4 slot later
- With so much hanging off the X570 PCH, I will probably see degraded performance if the NVMe drives and all PCIe slots are saturated at the same time

One thing I am worried about is this board's IOMMU support and the ability to pass PCIe cards through to VMware virtual machines.
 

netforceatg

I ended up converting this rig from the old Xeon D-1541 motherboard to an ASUS ROG STRIX X570-F with a Ryzen 3900X. I've been running ESXi and napp-it on it since April and am very happy with it. I had to get a SAS3008 PCIe card, as my old board had this onboard. The reason for my upgrade was that the CPU was maxing out during file transfers, so I hoped a Ryzen 3900X would sort that out :)

However, I'm not happy with the OmniOS/ZFS/napp-it read/write performance on either of my pools, which was my main reason for upgrading. I have two data pools in addition to the OS disk: one RAIDZ3 with 11 Seagate 8TB drives (ST8000VN0022-2EL) and one NVMe (Samsung PM981/PM963), both passed through to the OmniOS virtual machine.

Running ESXi 7.0U1 on the host (64GB ECC RAM) with an OmniOS r151036 / napp-it 18.12w6 VM, with 24GB memory and 8 cores allocated.

When using Midnight Commander to copy a file (29GB) from the Samsung NVMe to the RAIDZ3 pool, I get around 400-500MB/s and host CPU utilization increases from 12% to approx. 70-75%. When I use a virtual machine on the same host, copying from the NVMe to the RAIDZ3 over SMB, I get 100-250MB/s.

I would expect much higher transfer rates, especially locally on the OmniOS virtual machine. Any advice? It almost seems like I'm running out of CPU horsepower, which I cannot understand...
 

netforceatg

Thanks for the support. I've created a .pdf with this data: ZFS-testing-diagnose-rev1.pdf

I struggle a bit to understand this data, but it seems both the RAIDZ3 pool and the SAS stripe array perform OK, while the Samsung NVMe PM963 does not perform well (the datasheet lists 2000MB/s read and 1200MB/s write), so the test results seem very low. The last page also shows about 560MB/s transferring from the SAS stripe array to the RAIDZ3.

Thoughts? Any idea on how to improve?

Most important to me is NVMe to RAIDZ3, so I'm considering buying a PCIe 4.0 NVMe (like the WD SN850) and/or adding more RAM to increase this throughput.
 

J-san

Member
Nov 27, 2014
68
43
18
41
Vancouver, BC
Try reducing the number of cores assigned to your OmniOS VM to 4 to see if that helps.

Have you tried measuring the performance of the NVMe by not passing it through to OmniOS, but instead creating a VM datastore on it in ESXi, creating a vmdk under it, and adding that to your OmniOS? That might help you isolate whether it's an OmniOS NVMe setting or a driver-related issue.

One other thing: are all your pools connected through the following?
- X-Case RM424 Pro-EX V2 24-bay (SAS expander, 2x4 port, 12Gb/s; assumed LSI chipset expander)

If so, you could try directly connecting a drive (or two) with a breakout cable to test the speed straight from your HBA, and rule the expander out as a bottleneck.
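The per-drive comparison above can be done with a raw dd read against each device. A hedged sketch; real OmniOS device paths look like /dev/rdsk/c0t...d0p0 (list yours with `iostat -En` or `format`), and /dev/zero stands in for DEVS here so the sketch runs anywhere:

```shell
# Raw sequential-read check per device, to compare HBA-direct vs expander.
# raw_read_mbps DEV MB  -> prints approximate MB/s read from DEV.
raw_read_mbps() {
  dev=$1; mb=$2
  t0=$(date +%s)
  dd if="$dev" of=/dev/null bs=1048576 count="$mb" 2>/dev/null
  t1=$(date +%s)
  d=$(( t1 - t0 )); [ "$d" -gt 0 ] || d=1   # avoid divide-by-zero
  echo $(( mb / d ))
}

# DEVS is a placeholder list; substitute real /dev/rdsk/... paths.
for dev in ${DEVS:-/dev/zero}; do
  echo "$dev: $(raw_read_mbps "$dev" 256) MB/s"
done
```

Running this once with a drive behind the expander and once with the same drive on a breakout cable makes the expander's contribution obvious.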
 

netforceatg

Thanks for the input. Over the past 7 months I've tried various changes, and also had 2 failures of the 8TB disks (which were very easy to manage and replace without any data loss, thanks to napp-it and ZFS).

Overall I've gained some more write/read throughput, but I'm still miles off what I expect from an 11-disk 8TB RAIDZ3 array (only 912MB/s write and 512MB/s read), which is my main issue. Is my expectation wrong when testing bonnie locally on the OmniOS virtual machine, or is something seriously wrong here?

Any suggestion on how I can move forward to reach higher speeds on this pool?

What I would hope to gain:
- Increase tank1 write speed to 1500MB/s or above, and read speed to 1000MB/s or above, from these 11x 8TB disks
- Increase NVMe read speed enough to saturate the write speed of the tank1 (RAIDZ3) pool

My own observations and considerations:
- Potential misconfiguration on my side, since I'm getting low speeds on both the SAS3 (SSD) and HDD pools. Tempted to recreate the SAS pool from scratch just to test whether this changes performance
- I've considered whether the SAS3008 controller is bad, but I had the same issue with the Xeon D-1541 motherboard, which had the same controller onboard
- Compatibility issues between the SAS3008 controller and the PMC-Sierra backplane/expander chip
- Power supply issues? Or one or more bad 8TB drives causing poor pool performance
- Add a PCIe x4 NVMe (like a 980 Pro or WD SN850) to gain read speed on a single disk (this is used for temporary storage of videos until they are transferred to the RAIDZ3 array), but this is pointless until I sort out the 11-disk RAIDZ3 write/read speed
- Was planning to pay for napp-it, but am now considering trying TrueNAS and migrating (and also changing to / testing a dual-Xeon platform)

What I have tested:
I've played around with the number of vCPUs assigned to the OmniOS guest. The best performance seems to be with 8 vCPUs (I suspect this is effectively 4 cores, as ESXi sees 24 vCPUs due to SMT since my 3900X has 12 cores).

Instead of passing the NVMe drive through to OmniOS, I've mounted it through a VM datastore in ESXi. This boosted NVMe performance quite a bit, especially writes to the NVMe.

Both pools (3x 900GB HUSMM1680ASS stripe and RAIDZ3 of 11x Seagate ST8000VN0022) are connected through the backplane and expander. I found out that this X-Case RM424 Pro-EX V2 24-bay case is actually produced by Gooxi and uses PMC-Sierra expander chips (I believe PM8044), not LSI. I've reached out to both Gooxi and X-Case, but have not heard back on whether they have updated firmware for the expander.

I've tried connecting the 3x900GB SAS array directly to the controller: minimal performance change (less than 5-10% improvement). I'm unable to do this for the 11x Seagate drives, as the LSI SAS3008 controller only has 8 ports...

Other changes, which I believe have contributed to some improvement:
- Updated to ESXi 7.0U2, OmniOS r151038g, napp-it 19.12b17
- BIOS changes: enabled 4G decoding; disabled C-states, C1 declaration and ACPI CST; changed power supply idle control to typical current idle
- BIOS update to the latest 4002 (improved AGESA) and moved memory to slots A2 + B2 (increased memory bandwidth)


 

gea

Well-Known Member
Dec 31, 2010
2,673
914
113
DE
Your results are about as expected for a single-vdev Z3 pool. ZFS performance depends on raw disk performance (mainly iops), pool layout, CPU and RAM. A pool with a single Z3 vdev mainly lacks iops, as you can only count on around 300 iops per vdev (same as a single disk).
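The expectation above can be put into rough numbers. A back-of-envelope sketch, where the ~180 MB/s per-disk streaming figure is an assumption for an 8TB 7200rpm drive (the iops figure follows the ~300-per-vdev rule of thumb stated above):

```shell
# Back-of-envelope ceiling for an 11-disk RAIDZ3 vdev.
# Assumption: ~180 MB/s sequential per disk (typical 8TB 7200rpm drive).
disks=11
parity=3
per_disk_mbps=180

data_disks=$(( disks - parity ))          # 8 disks carry data
seq_mbps=$(( data_disks * per_disk_mbps ))

echo "sequential ceiling: ~${seq_mbps} MB/s across ${data_disks} data disks"
echo "random iops: ~300 for the whole pool (one vdev behaves like one disk)"
```

So sequential streaming can theoretically exceed 1400 MB/s, but any workload with random I/O collapses toward single-disk iops, which is why layout (more vdevs, mirrors) matters more than CPU here.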

To test the pool with a mix of sequential and random workloads, run menu Pool > Benchmark. This is a series of filebench runs with sync enabled vs. disabled. You can compare with my latest benchmarks in https://napp-it.org/doc/downloads/epyc_performance.pdf, which tests a Xeon Silver against a faster Epyc system.

Code:
151036      async           sync            sync+slog
Xeon Silver
Disk Pool   930-1170MB/s    48-50MB/s       779-1195MB/s
Flash pool  1112-1388MB/s   430-613MB/s     -
Optane pool 1600-2000MB/s   550-750MB/s     -

AMD Epyc
Disk Pool   1178-1197 MB/s  51-55 MB/s      908-956 MB/s
Flash pool  1591-1877 MB/s  943-1134 MB/s   -
Optane pool 3339-3707 MB/s  1445-1471 MB/s  -
As you can see, your performance goal is reachable with ZFS, but it requires very fast disks and/or a very fast system. For system tuning, the main option is another pool layout; a multi-mirror is faster than Z3. For VM usage, a smaller ZFS recsize, e.g. 32k or 64k, can improve performance. A persistent L2ARC (in 151038) can also help a little, as can a special vdev mirror for small I/O and metadata, but do not expect too much.
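The tuning options just mentioned map to a handful of ZFS commands. A hedged sketch; the pool/dataset names (tank1, tank1/vms) and device identifiers are placeholders, not from this setup:

```shell
# Smaller recordsize for a VM dataset (applies only to newly written blocks):
zfs set recordsize=64k tank1/vms

# Multi-mirror layout instead of a single Z3 vdev (requires a new pool;
# device names d1..d6 are placeholders):
# zpool create tank2 mirror d1 d2 mirror d3 d4 mirror d5 d6

# Special vdev mirror for metadata and small blocks (ssd1/ssd2 placeholders):
# zpool add tank1 special mirror ssd1 ssd2
# zfs set special_small_blocks=64k tank1
```

Note that recordsize and special_small_blocks only affect data written after the change; existing blocks keep their layout until rewritten.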

In all my tests, Oracle Solaris with native ZFS was the fastest ZFS server, but as there are no free updates, I do not recommend it. Regarding Open-ZFS performance, FreeBSD, Illumos and Linux are not too different. Illumos offers a quite native ZFS environment, which limits resource needs a little and gives the best "it just works" experience, plus other specials like the kernel-based SMB server.

I do not expect much difference on the disk pool if you avoid or replace the expander. Using desktop SSDs in a server is also not really helpful, as their sustained write performance is mostly bad.

From the mainboard, BIOS or system side, mainly RAM is relevant.

For VM storage, you should enable sync to avoid corrupted guests after a crash. You may then need a good Slog for performance.
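The sync-plus-Slog recommendation above looks like this in practice. A sketch under assumptions: "tank1/vms" and the log device names are placeholders for whatever dataset holds the VMs and whatever fast device (e.g. an Optane) serves as Slog:

```shell
# Force sync writes on the VM dataset so the RAM write cache is protected:
zfs set sync=always tank1/vms

# Add a mirrored Slog so those sync writes land on fast media
# (device names are placeholders; a single log device also works):
# zpool add tank1 log mirror c2t0d0 c2t1d0
```

With sync=always and no Slog, every commit goes through the in-pool ZIL on the spinning disks, which is where the large sync-vs-async gap in the benchmark table comes from.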
 

netforceatg

Thanks gea,

Good to hear. I'm not using the ZFS/napp-it guest to share storage to any VMs (or to ESXi). The virtual machines have their own directly attached storage (I've decided to keep it like this until I see the value/performance of moving them over to the ZFS guest).

I was under the impression that a Ryzen 3900X as CPU would suffice, and then some, for maximizing the disk throughput of my spinning HDDs (11x Seagate IronWolf 8TB in RAIDZ3, about 150-210MB/s per disk, totaling around 1500MB/s). I can also scale the RAM up to 48GB (or buy more if it really delivers the performance I'm looking for), but I found it easier to keep it low for benchmarking, to avoid needing very large test files.

I will disable sync on this pool and do some further benchmarking. Is it filebench_sequential you recommend, and any special settings besides the defaults?

The main reason I would consider a desktop NVMe PCIe Gen4 is that I intend to transfer large files (video, 10-100GB) from this 'temporary drive' over to the 11-disk HDD Z3 pool every now and then. Today I'm using the 3 SAS SSDs for this. But what would be the recommended upgrade to achieve high transfer speed from a temporary drive/pool (below 1.5-2TB) to the large-capacity Z3 pool?
 

gea

The main reason for ZFS is not performance but data security and unlimited snaps. ZFS can be faster with enough RAM as read/write cache, but to secure the RAM-based write cache you should enable sync. To avoid a heavy performance degradation, you should then add an Slog.

For benchmarks in napp-it, use the menu Pool > Benchmark. These run several sequential and random tests with sync enabled vs. disabled.

For single large sequential transfers, a desktop NVMe is OK.
 

netforceatg

OK, so I've done a few more benchmarks, first with 16GB RAM, then with 42GB RAM. RAIDZ3 write with sync disabled increased from 1238MB/s to 1647MB/s.

Still, just using mc and copying from an NVMe drive or from the 3x900GB SAS stripe pool to this RAIDZ3 gives only about half that, around 800MB/s (all disks running with sync disabled).

One of the drives in my main 11x8TB RAIDZ3 pool failed, so I ran the tests with only 10 disks. I will replace the failed disk tomorrow. I don't know how much this affects the results?

Is there any way to get beyond this speed from either the NVMe disk or the 3x SAS3 pool to the RAIDZ3 array?

I see the overall ESXi load go up from 10% CPU to 50-80% on the physical host (AMD 3900X). I've considered whether this means I'm close to what the CPU can handle, but I'm unsure, since I still have 20-50% headroom.

Do I understand correctly that an Optane as Slog will never get me beyond the speeds I see with sync disabled?

What will likely improve this setup the most:
- Replace the PM961 and the 3x900GB SAS stripe drives with a WD 2TB SN850 as my read drive (I regularly export data from this to the tank1 HDD pool)
- Add 64GB more memory? (I had 128GB, with 96GB reserved for the ZFS VM, on my old D-1541 platform, but speeds were still bad)
- Upgrade to the next-gen Ryzen 5900X?
- Add an Optane?

I also have a dual-CPU motherboard from an HP Z840 workstation with dual E5-2620 v4 and 64GB RAM. I could potentially rebuild the setup on this, but it would require a bit of effort, so I don't want to go down that path if it's unlikely to give anything extra in terms of performance.
 



netforceatg

Just replaced one of the disks in the 11x8TB IronWolf pool. Getting great resilvering speeds, with 1.13-1.44GB/s read from the pool and 119-148MB/s written to the replaced disk. CPU usage is around 30-50%.

Code:
root@nas:~# zpool status tank1
  pool: tank1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jul 22 06:51:52 2021
        8.79T scanned at 2.26G/s, 5.86T issued at 1.50G/s, 32.0T total
        545G resilvered, 18.30% done, 0 days 04:56:35 to go
config:

        NAME                         STATE     READ WRITE CKSUM
        tank1                        DEGRADED     0     0     0
          raidz3-0                   DEGRADED     0     0     0
            c0t5000C5009364DA95d0    ONLINE       0     0     0
            replacing-1              OFFLINE      0     0     0
              c0t5000C5009365A998d0  OFFLINE      0     0     0
              c0t5000C500D69CF836d0  ONLINE       0     0     0  (resilvering)
            c0t5000C50093666100d0    ONLINE       0     0     0
            c0t5000C5009367D526d0    ONLINE       0     0     0
            c0t5000C500A2FFC365d0    ONLINE       0     0     0
            c0t5000C500D55ED79Ed0    ONLINE       0     0     0
            c0t5000C500B0B9015Ad0    ONLINE       0     0     0
            c0t5000C500B0BABB08d0    ONLINE       0     0     0
            c0t5000C500B0BAEA60d0    ONLINE       0     0     0
            c0t5000C500B0D129FBd0    ONLINE       0     0     0
            c0t5000C500B1789C82d0    ONLINE       0     0     0

errors: No known data errors


root@nas:~# zpool iostat tank1 2
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank1       32.0T  48.0T  1.30K     72   102M  13.4M
tank1       32.0T  48.0T  18.6K    831  1.14G   120M
tank1       32.0T  48.0T  17.5K    650  1.39G   143M
tank1       32.0T  48.0T  20.1K    624  1.44G   148M
tank1       32.0T  48.0T  10.7K    746  1.16G   119M
tank1       32.0T  48.0T  16.1K    672  1.44G   147M
tank1       32.0T  48.0T  21.7K    629  1.40G   144M
tank1       32.0T  48.0T  15.5K    638  1.13G   115M
tank1       32.0T  48.0T  21.1K    613  1.43G   146M
tank1       32.0T  48.0T  15.4K    824  1.13G   120M
tank1       32.0T  48.0T  17.7K    638  1.40G   142M
tank1       32.0T  48.0T  18.6K    644  1.38G   141M
tank1       32.0T  48.0T  13.9K    874  1.13G   119M
 

gea

ZFS gives best-of-all data security, but this comes at a price: checksums mean more data, software RAID means more CPU load, and copy-on-write means more fragmentation. If you want very high performance, say > 1GByte/s from a pool, you need fast disks (mainly regarding iops), a fast CPU to process data quickly, and a lot of RAM for read/write caches. Additional tuning options are a persistent L2ARC (automatically enabled on current OmniOS) or a special vdev mirror for small I/O and metadata.

The RAM-based write cache in particular can be critical. Its size is several gigabytes, which are lost on a crash; that can mean a corrupt database or guest filesystem for VMs. ZFS copy-on-write can only guarantee that the ZFS filesystem itself never becomes corrupted, not the data content. If you need to protect the write cache, e.g. for databases or VM storage, you must activate sync. That means every committed write is immediately logged to disk and additionally collected in the RAM-based write cache, to be flushed later as a large sequential write. In effect you write everything twice, so this can never be as fast as a single async write. With a dedicated Slog for sync logging, the best you can achieve is that the performance degradation of sync becomes acceptable.

In the end, you need raw disk, cpu and ram power for performance.
 
  • Like
Reactions: dswartz