Speedy NAS, help needed: Want 1.0GB/s

Jason Hirsch

Member
Feb 12, 2018
Thank you for taking the time to read-

I have a number of nodes that (I'm being told) now need to write to a NAS at much faster speeds than it currently manages. The NAS was designed for redundancy and capacity, not throughput.

Currently a Windows box with a 40GbE Mellanox card writes to the TrueNAS box (40GbE Chelsio) at 220MB/s. I've done some of the optimizing, etc., but it's still way too slow and not breaking 300MB/s. The pool is 4 vdevs, 8 drives wide, 8TB drives.

The individual PCs that are connecting are full of SSDs. From PC to PC (40GbE Mellanox again) I easily hit 500MB/s, and that's with no optimizations done yet.

So honestly, what does it take to get up there? Do I need to gut the nodes and slap all the NVMe drives into the disk shelf? I have to seriously consider retooling all of this because of design changes, and if I go that road I'm going to get roasted, but it'll be worth it to get it working.

Suggestions most welcome, because I'm at my wit's end.
 

BoredSysadmin

Active Member
Mar 2, 2019
Is that peak 1GB/s or steady write speed you're looking for? If you have a sufficiently large ZIL (sort of a write-cache device) then you could easily get higher speeds.
Does your TrueNAS box have a ZIL/SLOG drive?
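A quick way to answer that question is to look at the pool layout itself; a sketch, assuming the pool is named `tank` (a placeholder):

```shell
# List the pool's vdevs; a SLOG shows up under a separate "logs" section.
zpool status tank

# Watch per-vdev activity while a transfer runs; the write columns for
# the log device show whether it is actually being used.
zpool iostat -v tank 5
```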
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
CA
Is that peak 1GB/s or steady write speed you're looking for? If you have a sufficiently large ZIL (sort of a write-cache device) then you could easily get higher speeds.
Does your TrueNAS box have a ZIL/SLOG drive?
I'd urge that we don't throw darts and hope something sticks, and instead ask questions to get a better understanding of the workload.

@Jason Hirsch

1 - Is sync performance a concern?
1a - Is sync set to on, always, or off?

2 - What is the workload?
2a - Sequential large writes, or lots of small random writes?
2b - What's the pool configuration? (Compression? Dedup? 4k, 16k, or 128k recordsize?)

3 - Do you need 1GB/s from one system to storage, or 1GB/s combined across all of them?

4 - Is the connection dedicated to storage, or shared for other purposes too?

5 - What are the COMPLETE specifications of the NAS box? (CPU, RAM, etc.)
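Most of those questions can be answered in one shot from the TrueNAS shell; a sketch, with `tank` standing in for the actual pool name:

```shell
# sync, compression, recordsize, dedup and atime settings per dataset
zfs get sync,compression,recordsize,dedup,atime tank

# pool layout, including spares and any log/cache devices
zpool status tank

# CPU / RAM summary (FreeBSD-based TrueNAS)
sysctl hw.model hw.ncpu hw.physmem
```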
 

DedoBOT

Member
Dec 24, 2018
There is no single, simple answer. Ethernet, storage arrays, transport protocols, and the rest of the client/server hardware/software setup all play a part...

That's mine: 10GbE, an ES-16-XG switch, Win7 Pro / 2012 R2 Essentials clients, and Solaris:
[screenshots of transfer speeds attached]
 

Jason Hirsch

Member
Feb 12, 2018
The TrueNAS box has a ZIL and ARC. At least I think so... I'm having trouble finding my notes. Looking at an old hardware debug log, I see:
20 cores, 128GB RAM.
I also see a 16GB 'pmem' device. I recall a conversation about Optane memory (maybe in regard to that).

From those discussions, my understanding is that I had one in there that would constantly flush data.

The method of transferring has simply been copy/pasting from Windows to TrueNAS; each file is 400MB. I would transfer about 50 to 100 files at a time. This network is 100% isolated, and currently (there was no other work going on) the only significant traffic was between these machines.

The pool is configured as RAID-Z3: 4 vdevs, each 8 drives wide, plus 2 hot spares and 1 cold spare (sitting on the shelf next to it at 67°F...). No deduplication.

Currently no other ops running on anything.

Running iperf3 between NAS and NAS (Chelsio cards) I could get 35 to 39Gb/s. Between the Windows box (Mellanox) and the NAS (Chelsio), 23Gb/s to 36Gb/s (erratic from run to run, but consistent within each run).

I'd like to be able to write the data to the NAS as fast as possible. We're talking TB of data being offloaded from portable drives, so the faster the better.
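Since iperf3 already shows the network is healthy, the next step would be measuring the pool's local write speed with the network out of the picture. A sketch using fio (the mount point /mnt/tank, the job name, and the file size are all placeholders):

```shell
# 10GiB sequential write in 1MiB blocks, with a final fsync so the
# result reflects what actually reached the pool, not client caching.
fio --name=seqwrite --directory=/mnt/tank \
    --rw=write --bs=1M --size=10G --ioengine=psync \
    --end_fsync=1 --group_reporting
```

If this local test also tops out near 220-300MB/s, the bottleneck is the pool, not SMB or the NICs.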
 

DedoBOT

Member
Dec 24, 2018
Check the SMB/Samba properties; in particular, disable the SMB2 and SMB3 signing requirements and encryption. They're not needed on an isolated network.
Also consider `zfs set sync=disabled`, if you can afford it! Think of it like a hardware RAID controller's battery backup unit:
not always a necessary addition.
Copying/pasting large files like yours over SMB without those is completely fine by my standards.
I should add that sync=standard did not slow the file transfers one bit, but it was devastating in the AJA and Blackmagic benchmarks, because of their caching policies :)

btw,
if you chase performance, RAID 10 is mandatory; I ended up with a pool of 8 mirrors.
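Concretely, those two changes might look like this; a sketch, where `tank` is a placeholder pool name and the Samba options go in the SMB service's auxiliary parameters:

```shell
# Disable sync writes pool-wide. Only safe if losing the last few
# seconds of in-flight writes on a power failure is acceptable.
zfs set sync=disabled tank

# Samba auxiliary parameters for an isolated network:
#   server signing = disabled
#   smb encrypt = off
```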
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,045
1,583
113
CA
If it's all sequential data then disable sync.

Are you using lz4 compression? If not, why not? Enable it, slight performance boost.

Re-test, post results.
 

Jason Hirsch

Member
Feb 12, 2018
If it's all sequential data then disable sync.

Are you using lz4 compression? If not, why not? Enable it, slight performance boost.

Re-test, post results.
I don't recall that option being turned on. The data should compress nicely, at least 1.7:1 given past experience, but I'll check next.

Check the SMB/Samba properties; in particular, disable the SMB2 and SMB3 signing requirements and encryption. They're not needed on an isolated network.
Also consider `zfs set sync=disabled`, if you can afford it! Think of it like a hardware RAID controller's battery backup unit:
not always a necessary addition.
...
btw,
if you chase performance, RAID 10 is mandatory; I ended up with a pool of 8 mirrors.

Gonna have to hunt those settings down. I don't recall seeing them (but honestly, at this point I'm fried), so...
Originally this was to be a repository, and I was expecting about 700MB/s transfer. Given that I may be shifting it to a data dump, I'll probably have to go RAID-10, which of course will finally take me from 67% to 50% usable disk capacity. *sigh*
 

vanfawx

Active Member
Jan 4, 2015
Vancouver, Canada
Just be aware that if you set sync=disabled, you open yourself up to losing at least one TXG commit (up to 5s of writes, based on defaults). This is especially important if other systems writing to it assume that all the writes they sent were committed to stable storage, when in reality the last 5s never made it to disk. If all the access is local to this server, it's not as much of an issue.
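That exposure is easy to ballpark: sustained write rate times the TXG interval is the data at risk. A quick sketch of the arithmetic (the 1GB/s rate and 5s default interval are just the numbers from this thread):

```shell
rate_mb_per_s=1000   # sustained write rate, MB/s
txg_interval_s=5     # default TXG commit interval, seconds
at_risk_mb=$(( rate_mb_per_s * txg_interval_s ))
echo "${at_risk_mb} MB of acknowledged writes could be lost"
# prints: 5000 MB of acknowledged writes could be lost
```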
 

aero

Active Member
Apr 27, 2016
I'm not a ZFS guru like some folks here, but if your workloads are largely sequential, which it sounds like they are, I'd recommend changing the recordsize on your pool to 1MB. You'd likely see a huge improvement.
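For what it's worth, recordsize is a per-dataset property; a sketch (the dataset name `tank/dump` is a placeholder):

```shell
# 1MiB records suit large sequential files. Note this only affects
# newly written files; existing files keep their old record size.
zfs set recordsize=1M tank/dump
```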
 

vrod

Active Member
Jan 18, 2015
Well, in the worst-case scenario he will lose the entire pool without having sync on in a power-outage event.

If you can, get a SLOG. I can recommend the Intel P3600 series, which has 1GB/s sequential write speed.

Since the ZIL flushes its data to the pool every 5s, for 1GB/s you would need a ZIL of 5GB. I always double up on this and go for 10GB. I highly recommend creating just a 10GB partition rather than using the entire device. If you secure-erase the device and then create a 10GB partition, you are effectively over-provisioning the device and extending its lifespan.

Without a SLOG, the ZIL is located on your pool drives, which basically means they have to do double the IO for each task. By offloading this to an NVMe device, you greatly reduce latency and improve IO.

As mentioned by others, recordsize is an important thing to set correctly. It can be set on a running pool, but it will only be effective for new files. You should also disable atime.

It would be interesting if you could post the result of `zfs get all <poolname>`. :)
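The provisioning steps described above might look like the following; a sketch, assuming the SLOG device is /dev/nvme0n1 (device path, partition size, and pool name are all placeholders):

```shell
# Secure-erase so the unpartitioned flash stays free for wear leveling
nvme format /dev/nvme0n1 --ses=1

# Carve out only a 10GB partition: ~2x (5s TXG interval x 1GB/s)
parted -s /dev/nvme0n1 mklabel gpt mkpart slog 1MiB 10GiB

# Attach the partition as a dedicated log (SLOG) vdev
zpool add tank log /dev/nvme0n1p1
```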
 

Rand__

Well-Known Member
Mar 6, 2014
If you can, get a SLOG. I can recommend the Intel P3600 series which has a 1GB/s sequential write speed.
I don't think a P3600 can provide 1GB/s sync write speed, even on streaming writes and heavily over-provisioned?
 

gea

Well-Known Member
Dec 31, 2010
DE
On a crash during write, you lose the content of the RAM-based write cache. On genuine Solaris ZFS its size is time-restricted (the last 5s of writes); on Open-ZFS it is RAM-restricted (default is 10% of RAM, max 4GB).

As ZFS is Copy-on-Write, ZFS will never become corrupted on a crash or power outage. There is no danger to your pool, but data is lost. If you use ZFS as VM storage, a guest filesystem can become corrupted. If you use a database with dependent transactions (transferring money from one account to another), the database can become inconsistent.

To protect the RAM cache, you enable sync, which logs each committed write. On a crash, all missing writes are redone on the next reboot. Sync reduces disk write performance to a fraction of the async value. This is why you mostly do not use the pool itself for logging (ZIL) but a dedicated logging device (SLOG). One of the best SLOGs is the Intel Optane P4800X, which can go up to near 1GB/s sync performance. A P3600 may manage less than half that value.

As already said, recordsize can affect performance. In a multiuser VM environment, 32k-64k may be the fastest setting. In a media server environment, 1M may be faster. For average mixed use, keep the 128k default.

If this is a simple filer or backup server, disable sync. A file that is being written during a crash or power outage is damaged in nearly all cases regardless of the sync setting, unless the writing app (like MS Word) takes care of it. If you mostly write small files that may still be entirely in cache on a power outage (e.g. a mail server), prefer sync enabled. On a VM server or a database server, always enable sync.
 

vrod

Active Member
Jan 18, 2015
I don't think a P3600 can provide 1GB/s sync write speed, even on streaming writes and heavily over-provisioned?
I have used a P3600 with both VMware and ZFS. With VMware, moving VMs to such an SSD usually ran at about 1.1GB/s. With ZFS I recall getting a little over 1GB/s on sequential writes.

Of course, what's important not to forget is that the pool vdevs also need to be able to handle the load. But it doesn't really matter that they are "sync" writes; it all comes down to whether it's 4k, sequential, or whatnot. The big "problem" you have with sync writes without a SLOG is that your pool vdevs basically need to write everything twice. You get rid of that by adding the SLOG, which then carries that load. I have even heard somewhere that the ZIL "collects" all the writes that come in (obviously) and then arranges the data in a way that makes it possible to write it sequentially to the vdevs... but maybe that was a dream. It makes a lot of sense though, since an SSD SLOG speeds up 4K writes on an HDD pool a lot...

I am actually in the process of building a 3-vdev RAID-Z array with 15 S4510s and a P3600 800GB as SLOG. I will let you know the results once I have done my first tests :)
 

Rand__

Well-Known Member
Mar 6, 2014
Was that with the P3600 as the target device or as cache?
Looking forward to seeing the results :)
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
CA
I have used a P3600 with both VMware and ZFS. With VMware, moving VMs to such an SSD usually ran at about 1.1GB/s. With ZFS I recall getting a little over 1GB/s on sequential writes.

Of course, what's important not to forget is that the pool vdevs also need to be able to handle the load. But it doesn't really matter that they are "sync" writes; it all comes down to whether it's 4k, sequential, or whatnot. The big "problem" you have with sync writes without a SLOG is that your pool vdevs basically need to write everything twice. You get rid of that by adding the SLOG, which then carries that load. I have even heard somewhere that the ZIL "collects" all the writes that come in (obviously) and then arranges the data in a way that makes it possible to write it sequentially to the vdevs... but maybe that was a dream. It makes a lot of sense though, since an SSD SLOG speeds up 4K writes on an HDD pool a lot...

I am actually in the process of building a 3-vdev RAID-Z array with 15 S4510s and a P3600 800GB as SLOG. I will let you know the results once I have done my first tests :)


- There's no way a P3600 can handle 1GB/s random writes while acting as a SLOG device. Sequential: yes.

- The SLOG device is not read at all except after a crash/reboot. What it does is make the "OK, we got it, keep the data coming" acknowledgment FASTER than relying on the in-pool ZIL, which is why a SLOG is needed/beneficial.

- I've heard/read that the SLOG 'dumps' data sequentially, but if that were true, why isn't ZFS arranging the data in a transaction, all the time, every time, and writing to the pool sequentially, always?? That would be a huge performance increase since no pool writes would be random. There's more to how this part of ZFS functions. Maybe @gea has some input.
 

gea

Well-Known Member
Dec 31, 2010
DE
With ZFS, all writes go to the RAM cache, say 4GB of cache size. When the cache is full it is written to the pool as one large, fast sequential write with low fragmentation. In the meantime, new writes go to a second area of the write cache. This is why the RAM cache must be 2x the caching size, and this is why ZFS is fast even with small random writes.

The problem: on a crash during write, the cache content is lost. You can then enable sync. Unlike other systems, sync on ZFS does not mean disabling the write cache and writing every small commit immediately to disk; that would be very bad for performance and would introduce huge data fragmentation. Instead, ZFS continues to write all data sequentially via the RAM cache, but logs every commit to the on-pool ZIL to allow a write redo on the next bootup. As these small random writes are on-pool but outside the regular data area, they do not introduce additional data fragmentation that would otherwise negatively affect read performance. You can describe such a ZIL as a method of writing to a special area of the pool to protect the current RAM cache content. People often speak of a ZIL device, but this is really a write to a special area of the pool. As every small write commit must be on stable disk, you cannot additionally cache or optimize such writes. This is the price of security.

If you use an SLOG, meaning this "ZIL" logging is done to a separate disk optimized for this sort of small sync random writing, you can get much higher performance. Regular disks are bad at small random QD1 writes, even when only writing logs to a special data area of the pool.
 

vrod

Active Member
Jan 18, 2015
Unfortunately DHL decided to mess up my shipment of the rest of the S4510s, so it doesn't look like I will get them before next week. I managed to set up a small 3-vdev striped pool with the P3600 as SLOG.

Did a quick secure erase of the drive and achieved about 900MB/s. A dd with bs=1M onto the created pool with sync=always gave 800MB/s through the SLOG device.

I guess I was wrong about the P3600; everything aside, I still think it's a very viable candidate for SLOG.
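For reference, a dd run like the one described might look like this; a sketch where paths and pool name are placeholders, and incompressible data is staged first because /dev/zero would be largely compressed away if lz4 is on:

```shell
# Stage 10GiB of incompressible data, then time a sync=always write
dd if=/dev/urandom of=/tmp/testdata bs=1M count=10240
zfs set sync=always tank
dd if=/tmp/testdata of=/mnt/tank/testfile bs=1M
```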
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
CA
Unfortunately DHL decided to mess up my shipment of the rest of the S4510s, so it doesn't look like I will get them before next week. I managed to set up a small 3-vdev striped pool with the P3600 as SLOG.

Did a quick secure erase of the drive and achieved about 900MB/s. A dd with bs=1M onto the created pool with sync=always gave 800MB/s through the SLOG device.

I guess I was wrong about the P3600; everything aside, I still think it's a very viable candidate for SLOG.
Doh! I hate when that happens, or when they show up early and no one is there to sign!!

FWIW, for the money, my choices:

P3700 - ~$100 on eBay, 25,000 more write IOPS than the P3600 800GB
100GB enterprise Optane - ~$280, even better :) Or a 900p or 905p 280GB for $220-260, depending on deals.