ZLOG Benchmark is coming


Patrick

Administrator
Staff member
@BackupProphet I set up a box with a P3700 and an Optane P4800X on FreeNAS 11 U4, but the sync write test fails with "Operation not supported". Any ideas?

Code:
[patrick@freenas ~]$ sudo ./diskinfo -vSw /dev/nvd0
/dev/nvd0
        512             # sectorsize
        400088457216    # mediasize in bytes (373G)
        781422768       # mediasize in sectors
        131072          # stripesize
        0               # stripeoffset
        CVFT4291009H400GGN      # Disk ident.

Synchronous random writes:
         0.5 kbytes: diskinfo: Flush error: Operation not supported
[patrick@freenas ~]$
 

Patrick

Administrator
Staff member
FreeBSD has protection against writing directly to disk; try executing
Code:
sysctl kern.geom.debugflags=16
I tried that:

Code:
[patrick@freenas ~]$ sudo sysctl kern.geom.debugflags=16
Password:
kern.geom.debugflags: 16 -> 16
[patrick@freenas ~]$ sudo ./diskinfo -vSw /dev/nvd0
/dev/nvd0
        512             # sectorsize
        400088457216    # mediasize in bytes (373G)
        781422768       # mediasize in sectors
        131072          # stripesize
        0               # stripeoffset
        CVFT4291009H400GGN      # Disk ident.

Synchronous random writes:
         0.5 kbytes: diskinfo: Flush error: Operation not supported
Do you think this is a FreeNAS-specific issue? Did you try on vanilla FreeBSD?
 

Patrick

Administrator
Staff member
@BackupProphet I'm getting the same thing with FreeBSD 11.1.
Code:
root@slogtesting:/usr/home/patrick # ./diskinfo -vSw /dev/nvd0
/dev/nvd0
        512             # sectorsize
        400088457216    # mediasize in bytes (373G)
        781422768       # mediasize in sectors
        131072          # stripesize
        0               # stripeoffset
        CVFT4291009H400GGN      # Disk ident.

Synchronous random writes:
         0.5 kbytes: diskinfo: Flush error: Operation not supported
Tried it on an S3610 as a sanity check:
Code:
root@slogtesting:/usr/home/patrick # ./diskinfo -vSw /dev/mfid1
/dev/mfid1
        512             # sectorsize
        479559942144    # mediasize in bytes (447G)
        936640512       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        58303           # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
                        # Disk ident.

Synchronous random writes:
         0.5 kbytes: diskinfo: Flush error: Operation not supported
root@slogtesting:/usr/home/patrick # ./diskinfo -vSw /dev/mfid1p1
/dev/mfid1p1
        512             # sectorsize
        2147483648      # mediasize in bytes (2.0G)
        4194304         # mediasize in sectors
        0               # stripesize
        65536           # stripeoffset
        261             # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
                        # Disk ident.

Synchronous random writes:
         0.5 kbytes: diskinfo: Flush error: Operation not supported
 

BackupProphet

Well-Known Member
Another reason this could be happening is that the drive already has one or more partitions, so you need to remove them with
Code:
gpart destroy -F DISK-CVPR125004YA080BGN
You still need kern.geom.debugflags set to 16.
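Putting it together, the whole sequence would look roughly like this (just a sketch using Patrick's nvd0 device from above -- check the gpart show output before destroying anything, and run it as root):
Code:
sysctl kern.geom.debugflags=16   # allow raw writes to the disk
gpart show nvd0                  # check for an existing partition table
gpart destroy -F nvd0            # wipe it if one is present (destructive!)
./diskinfo -vSw /dev/nvd0        # re-run the synchronous write test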
 

Ayfid

New Member
without more data... I wouldn't say how awful the 950 pro is.
More likely ... look how shitty bsd nvme drivers are.

but now I am curious to find out which it is...
The ZIL is entirely fsync writes, which effectively disable the write cache on drives that do not have power loss protection. On an SSD, this absolutely destroys both performance and endurance. As neither the 950 Pro nor the 960 Pro has power loss protection, you should expect their performance to be awful.

This is the reason why you must use drives with power loss protection for the ZIL (and many vm and db uses too), as using a consumer SSD is nearly pointless and is a great way to wear out the drive.

It does not matter whether or not the drivers are terrible, as the hardware is just not capable of performing well under this kind of workload.
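If you want a second data point beyond diskinfo, a rough way to approximate the same workload is a QD1 4K random-write test with a flush after every write, pointed at a dataset on the pool so everything goes through the ZIL/SLOG. A sketch with fio (available from ports/pkg; the file path and runtime are just placeholders):
Code:
fio --name=slog-sim --filename=/mnt/tank/fio.test --size=1g \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --fsync=1 --runtime=30 --time_based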
 

lni

Member
This is the reason why you must use drives with power loss protection for the ZIL (and many vm and db uses too), as using a consumer SSD is nearly pointless and is a great way to wear out the drive.
The Intel 750 has pretty good fsync performance, and it is a wonderful choice for a personal dev workstation storing dev/test data. I have been doing that for ages, and it is pretty hard to destroy that disk. In fact, I am so impressed with the performance that I picked up another brand-new one yesterday, even though the product has already reached its EOL.
 

funkywizard

mmm.... bandwidth.
It looks like they rated a 950 Pro actually.

They mention "power loss protection may be related" -- i.e. the 950 Pro does not have power loss protection, so in order to ensure a cache flush has been executed, it must actually write the data to disk. It is possible the Intel drives simply ensure that they are prepared to flush the data successfully if a power loss occurs, rather than actually flushing the data when requested. I.E. perhaps the Intel drive can do (or is doing) writeback caching even when explicitly told to flush.

It would be interesting to see the result with a hardware RAID + BBU / supercap.
 

funkywizard

mmm.... bandwidth.
The Intel 750 has pretty good fsync performance, and it is a wonderful choice for a personal dev workstation storing dev/test data. I have been doing that for ages, and it is pretty hard to destroy that disk. In fact, I am so impressed with the performance that I picked up another brand-new one yesterday, even though the product has already reached its EOL.
Do you mean the Intel 710?

Intel 750: https://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-750-spec.pdf

Endurance Rating
70 GB Writes Per Day
Up to 127 TBW (Terabytes Written)

Pretty poor endurance.
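For what it's worth, those two numbers line up with a five-year warranty term: 70 GB/day x 365 days x 5 years is roughly 128 TB, so the 127 TBW figure looks like the daily rating simply extended over the warranty period rather than an independently measured wear-out limit.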
 

Patrick

Administrator
Staff member
@funkywizard I know the specs, but I have always thought 127TBW for the 750 is low to protect the enterprise market.
 

Rand__

Well-Known Member
Too bad nobody has done an endurance test on the 750 like the ones done on the S series ;)
 

Ayfid

New Member
They mention "power loss protection may be related" -- i.e. the 950 Pro does not have power loss protection, so in order to ensure a cache flush has been executed, it must actually write the data to disk. It is possible the Intel drives simply ensure that they are prepared to flush the data successfully if a power loss occurs, rather than actually flushing the data when requested. I.E. perhaps the Intel drive can do (or is doing) writeback caching even when explicitly told to flush.
Yes, this is exactly what is happening. Drives with power loss protection can ignore fsync commands, as their cache is considered safe. I think many people underestimate how significant a performance hit (and endurance hit too!) an SSD takes when it has to flush its buffer and write through to NAND on every write. It brings the drive down to performance more on par with a fast HDD, and every single write has to write out an entire block, which will drop your endurance by literally orders of magnitude.

Do you mean the Intel 710?
The Intel 750 is one of the only (the only?) consumer NVMe SSDs with power loss protection. Depending on your workload, the endurance may not be an issue (and you can replace the drive a few times before it costs the same as an enterprise NVMe drive), which makes the 750 a really good value option. A heavy database or virtual machine host that is loaded 24/7 will blow through that write endurance fast enough that the drive would probably not make sense, but there are still going to be many situations where it would not be such an issue. A ZFS SLOG device for a pool that is not touched often, but wants to be fast when it is (perhaps VM backups), would work well with the 750, as would most home or homelab uses, or, as lni said, dev workstations.
 

Rand__

Well-Known Member
Especially since the actually used size might only be 8-16 GB depending on network speed, so you have a ton of over-provisioning going for you even on the 400 GB model.
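For anyone wondering where the 8/16 GB figure comes from: the SLOG only ever needs to hold the sync writes from the last couple of ZFS transaction groups, and with the default ~5 second txg interval a saturated 10 GbE link (~1.25 GB/s) works out to roughly 6 GB per txg, so two to three txgs in flight lands you in the 8-16 GB range. That's the usual rule of thumb; the exact tunables vary by ZFS version.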
 

funkywizard

mmm.... bandwidth.
@funkywizard I know the specs, but I have always thought 127TBW for the 750 is low to protect the enterprise market.
Certainly possible. Even so, you should see the wear level indicator drop fairly quickly as you use it. I don't want to use a drive at wear 001 for an extended period without any idea when it's going to fail. It's true that at wear 001 the drive might have 50% of its lifespan remaining, or maybe only 10% more; in a handful of cases, maybe 80%. Intel could certainly use better or worse NAND for each batch, and as long as it meets the minimum spec they publish, they're in the clear. I don't want to take that risk.

I learned the hard way from the Samsung 850 Pro that the reasoning "it should be much better than rated, so the specs are probably a lie to get you to buy the more expensive one" is a tough one to rely on. The 850 Pro was reviewed along the lines of: "it uses 3D NAND, so the endurance rating only being X, whereas the previous generation was X/2, is probably very conservative to protect the enterprise market / margins. Based on (whitepaper talking about 3D NAND's future promises) it should have 10x the endurance of planar NAND. Besides, you won't need the endurance anyway, because that's tons of writes -- who would ever write that much." I've got a graveyard of worn-out drives to show I should not have listened to those reviews.

If this were a well-documented quirk of this model of drive and I didn't have better options, then by all means, I wouldn't want to throw money away. Luckily, some of the enterprise drives are on eBay at similar prices and performance, so it's moot at this point.
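For what it's worth, keeping an eye on that indicator is easy with smartmontools (a sketch; the device names are examples, and the exact field names vary by drive). On Intel SATA drives the normalized value of attribute 233 Media_Wearout_Indicator counts down from 100, while NVMe drives report a "Percentage Used" value in the health log that counts up:
Code:
smartctl -A /dev/ada0    # SATA: look for attribute 233 Media_Wearout_Indicator
smartctl -a /dev/nvme0   # NVMe: look for the "Percentage Used" line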
 

funkywizard

mmm.... bandwidth.
Yes, this is exactly what is happening. Drives with power loss protection can ignore fsync commands, as their cache is considered safe. I think many people underestimate how significant a performance hit (and endurance hit too!) an SSD takes when it has to flush its buffer and write through to NAND on every write. It brings the drive down to performance more on par with a fast HDD, and every single write has to write out an entire block, which will drop your endurance by literally orders of magnitude.



The Intel 750 is one of the only (the only?) consumer NVMe SSDs with power loss protection. Depending on your workload, the endurance may not be an issue (and you can replace the drive a few times before it costs the same as an enterprise NVMe drive), which makes the 750 a really good value option. A heavy database or virtual machine host that is loaded 24/7 will blow through that write endurance fast enough that the drive would probably not make sense, but there are still going to be many situations where it would not be such an issue. A ZFS SLOG device for a pool that is not touched often, but wants to be fast when it is (perhaps VM backups), would work well with the 750, as would most home or homelab uses, or, as lni said, dev workstations.
I do think that is very interesting. I hadn't considered that SSDs with power loss protection may be issuing write-commits to the OS well before the data is written, much the way that a hardware RAID would. I still suspect a hardware RAID controller would substantially increase performance in this use case when using SATA SSDs.

I am curious, however, how that configuration stacks up against an NVMe SSD with PLP. To my understanding, SSDs (even NVMe) don't perform all that great in a queue-depth-1 workload, which I would imagine is what this is. I've seen QD1 workloads on write-back-cached hardware RAID (a 9271 with CacheVault) in the 1-1.2 GB/s ballpark for anything that fits in the cache memory. I would bet that beats an NVMe drive for small-block writes at least. It would be interesting to see how they compare.
 

T_Minus

Build. Break. Fix. Repeat
It would be interesting to see the result with a hardware RAID + BBU / supercap.
I still suspect a hardware RAID controller would substantially increase performance in this use case when using SATA SSDs.
o_O I think you may be in the wrong thread; this is a ZFS SLOG device performance thread, with absolutely NOTHING to do with comparing against hardware RAID. No one here is in denial that other file systems perform better than ZFS, but they lack the benefits of ZFS as well ;)

Especially since the actually used size might only be 8-16 GB depending on network speed, so you have a ton of over-provisioning going for you even on the 400 GB model.
Yep! 127 TBW with 400 GB usable; chop that down to 8 GB usable (50x lower capacity) and your effective endurance goes through the roof -- likely enough for home labs and small production environments.
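Rough numbers, taking the 127 TBW rating at face value (the heavy over-provisioning should only help, by keeping write amplification down): 100 GB/day of sync writes burns through it in about 1,270 days, call it 3.5 years, while a pool sustaining 1 TB/day of sync traffic gets there in roughly four months. So it really comes down to how hard the pool gets hit.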
 

funkywizard

mmm.... bandwidth.
o_O I think you may be in the wrong thread; this is a ZFS SLOG device performance thread, with absolutely NOTHING to do with comparing against hardware RAID. No one here is in denial that other file systems perform better than ZFS, but they lack the benefits of ZFS as well ;)



Yep! 127 TBW with 400 GB usable; chop that down to 8 GB usable (50x lower capacity) and your effective endurance goes through the roof -- likely enough for home labs and small production environments.
Hardware RAID cards don't have to be used for RAID : )

Configure it as a 1-drive RAID 0 (or a 2-drive RAID 1 if you prefer redundancy) and you can benefit from the write-back cache on the RAID card. It would be interesting to see which provides better performance: a 400 GB DC S3700 as a "1-drive hardware RAID 0", or a much faster NVMe SSD.

Also, I bring up some of these use cases because the workloads I am more familiar with are similar in some ways, so perhaps those performance numbers are also relevant (but perhaps not).