FreeNAS SSD SLOG bottleneck?


lunadesign

Active Member
Aug 7, 2013
STHers - I'm really struggling with this one and could really use your help. I've been stuck on this for 2 weeks and have an urgent need to finish this system build.

Summary: I need to set up a FreeNAS box as an ESXi NFS datastore. For data integrity, I want to use sync=always. To "win back" at least some of the performance, I want to use striped SLOGs and vdevs. However, no matter how many SSDs I throw at the problem, I can't get ESXi NFS writes higher than 250 MB/s. There appears to be a bottleneck when I have sync writes with the smaller block sizes that NFS tends to use.

Box #1
VMware ESXi 6.0 U2
Supermicro X9SRE-F, E5-1650V2, 16GB ECC memory
Intel X710-DA4 10GbE
1 local SSD with 1 test Win 8.1 VM VMDK

Box #2
FreeNAS 9.10.2 U2
Supermicro X9SRE-F, E5-1650V2, 64GB ECC memory
LSI 9305-16i
Intel X710-DA4 10GbE
1 pool:
- 3 x Intel DC S3710 200GB striped SLOG (overprovisioned so only 16 GB is usable)
- 3 x Samsung SSD 850 Pro 1TB striped
- atime=off, sync=always
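For reference, here's roughly how a pool with that layout gets created from the command line (a minimal sketch, using hypothetical device names da0-da5 instead of the real gptids; FreeNAS itself builds the pool through the GUI):
Code:
        # 3 single-drive data vdevs (striped) plus 3 separate log vdevs
        zpool create vol1 da0 da1 da2 log da3 da4 da5
        zfs set atime=off vol1
        zfs set sync=always vol1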

Tests:
- NFS test: VM runs an Iometer 2MB sequential write test on a drive that's on the NFS datastore
- Local disk test: FreeNAS box runs a "dd" test copying 1 GB of data from a RAM disk to the pool
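To be concrete, the local test looks roughly like this (a sketch, assuming hypothetical mount points: the RAM disk at /mnt/ramdisk and the pool at /mnt/vol1; reading from a memory-backed file keeps the source side out of the measurement):
Code:
        # build a RAM disk and stage 1 GB of incompressible test data
        mkdir -p /mnt/ramdisk
        mdmfs -s 1280m md /mnt/ramdisk
        dd if=/dev/random of=/mnt/ramdisk/testfile bs=1m count=1024
        # copy it to the pool, varying only the output block size
        for obs in 1024k 512k 256k 128k 64k 32k 16k; do
          dd if=/mnt/ramdisk/testfile of=/mnt/vol1/ddtest ibs=1m obs=$obs
        done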

Observations from NFS test:
  1. If sync=always, the test maxes out at roughly 250 MB/s.
  2. I get roughly the same results with the following sets of drives:
    • 1 Intel SSD SLOG + 1 Samsung SSD pool
    • 2 Intel SSDs in a striped SLOG + 2 Samsung SSDs in a striped pool
    • 3 Intel SSDs in a striped SLOG + 3 Samsung SSDs in a striped pool
  3. Modifying the Iometer test to increase QD from 1 to 4 or even 32 barely increases the throughput (see the fio cross-check sketched after this list)
  4. If sync=disabled, the 3/3 combo can write 1050 MB/s.
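In case it helps, a rough server-side equivalent of that queue-depth test (using fio instead of Iometer; this assumes fio is available, e.g. via "pkg install fio", and uses a hypothetical test directory on the pool) is several parallel sequential sync writers:
Code:
        # 4 parallel sequential writers with O_SYNC, against a sync=always dataset
        mkdir -p /mnt/vol1/fiotest
        fio --name=syncwrite --directory=/mnt/vol1/fiotest --rw=write --bs=64k --size=512m --ioengine=psync --sync=1 --numjobs=4 --group_reporting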
Observations from local disk test:
  1. If sync=always, the output block size of the dd test has a significant effect on the throughput, and using extra SSDs only helps at the larger block sizes (see below)
    • Here are the results with 1 Intel SSD SLOG + 1 Samsung SSD pool:
      • obs=1024k: 304 MB/s
      • obs=512k: 293 MB/s
      • obs=256k: 281 MB/s
      • obs=128k: 258 MB/s
      • obs=64k: 191 MB/s
      • obs=32k: 128 MB/s
      • obs=16k: 78 MB/s
    • Here are the results with 3 Intel SSDs in a striped SLOG + 3 Samsung SSDs in a striped pool:
      • obs=1024k: 690 MB/s
      • obs=512k: 535 MB/s
      • obs=256k: 417 MB/s
      • obs=128k: 247 MB/s
      • obs=64k: 189 MB/s
      • obs=32k: 127 MB/s
      • obs=16k: 77 MB/s
  2. If sync=disabled, the 3/3 combo can write 1500 MB/s, even at obs=16k.
In all of these tests, I've watched the pool activity with "zpool iostat -v 1 | grep gpt" and can see the writes being spread across the various SSDs.

In the sync=always cases, it's especially strange to watch the 1 SLOG + 1 pool drive setup push roughly 250 MB/s per drive, then add drives and see the same 250 MB/s total spread evenly across the drives instead of the new drives taking on extra writes.
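One way to see whether the SLOG SSDs are actually saturated or just idling is to watch per-disk busy percentages and write latencies while the test runs (a sketch; gstat ships with FreeBSD/FreeNAS):
Code:
        # show physical disks only, refreshed every second; if the SLOG devices
        # sit well below 100% busy, the bottleneck is upstream of the disks
        gstat -p -I 1s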

Any ideas what I'm doing wrong?
 

nitrobass24

Moderator
Dec 26, 2010
TX
Well, throughput isn't really the correct test for judging the performance of your ESX datastore. You should be looking at the IOPS specific to your expected usage pattern.

For my setup, I see a roughly random mix of 33% writes and 67% reads. You should think about what your use case will look like and test accordingly.

As for your disk setup: when you say the 3x SLOG stripe, do you mean a single striped vDEV? I ask because a vDEV has the max performance of a single disk. If you want a stripe (in the traditional RAID0 sense), you need to create your SLOG from multiple single-drive vDEVs.
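To illustrate the difference (a sketch with hypothetical device names da3-da5 rather than your gptids):
Code:
        # three separate single-drive log vdevs (striped, RAID0-style):
        zpool add vol1 log da3 da4 da5
        # versus one mirrored log vdev, which writes at the speed of a single disk:
        zpool add vol1 log mirror da3 da4 da5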

I am not sure that will really help you that much, though, because you have a latency problem. NFS sync writes are latency-sensitive, and throwing more SATA SSDs at the problem simply won't fix it. The best thing you could do is get an NVMe drive for your SLOG, like an Intel DC P3700 400GB. You could probably sell your 3x S3710s for what the PCIe option will cost you.
 

lunadesign

Active Member
Aug 7, 2013
Well, throughput isn't really the correct test for judging the performance of your ESX datastore. You should be looking at the IOPS specific to your expected usage pattern.
That's a valid point. And I totally realize that any synthetic benchmark isn't necessarily a reflection of reality. Any suggested tests for IOPS? Or is Iometer good enough at this?

For my setup, I see a roughly random mix of 33% writes and 67% reads. You should think about what your use case will look like and test accordingly.
How did you measure this?

When you say the 3x SLOG stripe, do you mean a single striped vDEV? I ask because a vDEV has the max performance of a single disk. If you want a stripe (in the traditional RAID0 sense), you need to create your SLOG from multiple single-drive vDEVs.
I believe the latter is what I have. Here's a snippet from the output of "zpool status":
Code:
        NAME                                          STATE     READ WRITE CKSUM
        vol1                                          ONLINE       0     0     0
          gptid/3ac7790a-4332-11e7-a7eb-3cfdfe9eff20  ONLINE       0     0     0
          gptid/3af9ecea-4332-11e7-a7eb-3cfdfe9eff20  ONLINE       0     0     0
          gptid/3b2e62b5-4332-11e7-a7eb-3cfdfe9eff20  ONLINE       0     0     0
        logs
          gptid/3b66f6e8-4332-11e7-a7eb-3cfdfe9eff20  ONLINE       0     0     0
          gptid/44d6893d-4332-11e7-a7eb-3cfdfe9eff20  ONLINE       0     0     0
          gptid/4e80a4a1-4332-11e7-a7eb-3cfdfe9eff20  ONLINE       0     0     0
I am not sure that will really help you that much, though, because you have a latency problem. NFS sync writes are latency-sensitive, and throwing more SATA SSDs at the problem simply won't fix it. The best thing you could do is get an NVMe drive for your SLOG, like an Intel DC P3700 400GB. You could probably sell your 3x S3710s for what the PCIe option will cost you.
I understand, but at the same time it doesn't really make sense to me that adding extra SLOG drives doesn't increase performance at all. I've learned that NFS is supposed to be able to queue multiple sync requests at the same time so that the ZFS log can aggregate them, but that doesn't appear to be working. What's interesting is that I've just learned that iSCSI *is* able to take advantage of multiple SLOG drives, so that might be my solution, although I'm just learning about iSCSI and am nervous about the fragmentation issues. I'd love to go the P3700 route but am unfortunately out of PCIe 3.0 slots.
 

nitrobass24

Moderator
Dec 26, 2010
TX
That's a valid point. And I totally realize that any synthetic benchmark isn't necessarily a reflection of reality. Any suggested tests for IOPS? Or is Iometer good enough at this?


How did you measure this?


I believe the latter is what I have. Here's a snippet from the output of "zpool status":
Code:
        NAME                                          STATE     READ WRITE CKSUM
        vol1                                          ONLINE       0     0     0
          gptid/3ac7790a-4332-11e7-a7eb-3cfdfe9eff20  ONLINE       0     0     0
          gptid/3af9ecea-4332-11e7-a7eb-3cfdfe9eff20  ONLINE       0     0     0
          gptid/3b2e62b5-4332-11e7-a7eb-3cfdfe9eff20  ONLINE       0     0     0
        logs
          gptid/3b66f6e8-4332-11e7-a7eb-3cfdfe9eff20  ONLINE       0     0     0
          gptid/44d6893d-4332-11e7-a7eb-3cfdfe9eff20  ONLINE       0     0     0
          gptid/4e80a4a1-4332-11e7-a7eb-3cfdfe9eff20  ONLINE       0     0     0

I understand, but at the same time it doesn't really make sense to me that adding extra SLOG drives doesn't increase performance at all. I've learned that NFS is supposed to be able to queue multiple sync requests at the same time so that the ZFS log can aggregate them, but that doesn't appear to be working. What's interesting is that I've just learned that iSCSI *is* able to take advantage of multiple SLOG drives, so that might be my solution, although I'm just learning about iSCSI and am nervous about the fragmentation issues. I'd love to go the P3700 route but am unfortunately out of PCIe 3.0 slots.
IOMeter is what I use.

I determined my usage model by looking at the LBAs written and LBAs read in the SMART status of the drives in my pool. It's not an exact science, but it's better than guessing.
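If you want to pull the same counters from your drives, something like this works (a sketch; the attribute names vary by SSD vendor, and disks behind some HBAs need the device type spelled out):
Code:
        # dump SMART attributes and pick out the LBA read/write counters (names vary by vendor)
        smartctl -A /dev/da0 | egrep -i 'lba'
        # behind some controllers you may need: smartctl -A -d sat /dev/da0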

Hard to tell from that output; mind running it again with verbose output ("zpool status -v")?

If you are out of PCIe slots, then maybe you could grab a ZeusRAM (SAS connection). The bottom line is that latency does not change when you add more drives. Think about it this way: your source system sends a packet of data. It travels through the ether, into the motherboard, to the SATA controller, and finally to one of the drives. That path is exactly the same "distance" regardless of which drive(s) it goes to. So to make it faster, you have to "shorten the path".