ZLOG Benchmark is coming


Patrick

Administrator
Staff member
@BackupProphet I set up a box with a P3700 and an Optane P4800X on FreeNAS 11 U4, but the sync write test fails with "Operation not supported". Any ideas?

Code:
[patrick@freenas ~]$ sudo ./diskinfo -vSw /dev/nvd0
/dev/nvd0
        512             # sectorsize
        400088457216    # mediasize in bytes (373G)
        781422768       # mediasize in sectors
        131072          # stripesize
        0               # stripeoffset
        CVFT4291009H400GGN      # Disk ident.

Synchronous random writes:
         0.5 kbytes: diskinfo: Flush error: Operation not supported
[patrick@freenas ~]$
 

Patrick

Administrator
Staff member
FreeBSD has protection against writing directly to disk; try executing
Code:
sysctl kern.geom.debugflags=16
I tried that:

Code:
[patrick@freenas ~]$ sudo sysctl kern.geom.debugflags=16
Password:
kern.geom.debugflags: 16 -> 16
[patrick@freenas ~]$ sudo ./diskinfo -vSw /dev/nvd0
/dev/nvd0
        512             # sectorsize
        400088457216    # mediasize in bytes (373G)
        781422768       # mediasize in sectors
        131072          # stripesize
        0               # stripeoffset
        CVFT4291009H400GGN      # Disk ident.

Synchronous random writes:
         0.5 kbytes: diskinfo: Flush error: Operation not supported
Do you think this is a FreeNAS-specific issue? Did you try on vanilla FreeBSD?
 

Patrick

Administrator
Staff member
@BackupProphet I'm getting the same thing with FreeBSD 11.1.
Code:
root@slogtesting:/usr/home/patrick # ./diskinfo -vSw /dev/nvd0
/dev/nvd0
        512             # sectorsize
        400088457216    # mediasize in bytes (373G)
        781422768       # mediasize in sectors
        131072          # stripesize
        0               # stripeoffset
        CVFT4291009H400GGN      # Disk ident.

Synchronous random writes:
         0.5 kbytes: diskinfo: Flush error: Operation not supported
Tried it on an S3610 as a sanity check:
Code:
root@slogtesting:/usr/home/patrick # ./diskinfo -vSw /dev/mfid1
/dev/mfid1
        512             # sectorsize
        479559942144    # mediasize in bytes (447G)
        936640512       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        58303           # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
                        # Disk ident.

Synchronous random writes:
         0.5 kbytes: diskinfo: Flush error: Operation not supported
root@slogtesting:/usr/home/patrick # ./diskinfo -vSw /dev/mfid1p1
/dev/mfid1p1
        512             # sectorsize
        2147483648      # mediasize in bytes (2.0G)
        4194304         # mediasize in sectors
        0               # stripesize
        65536           # stripeoffset
        261             # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
                        # Disk ident.

Synchronous random writes:
         0.5 kbytes: diskinfo: Flush error: Operation not supported
 

BackupProphet

Well-Known Member
Another reason this could be happening is that the drive already has one or more partitions, so you need to remove them with
Code:
gpart destroy -F DISK-CVPR125004YA080BGN
You still need kern.geom.debugflags set to 16.
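Putting it together, the whole sequence would look roughly like this (just a sketch using Patrick's nvd0 device from above -- check the gpart show output before destroying anything, and run it as root):
Code:
sysctl kern.geom.debugflags=16   # allow raw writes to the disk
gpart show nvd0                  # check for an existing partition table
gpart destroy -F nvd0            # wipe it if one is present (destructive!)
./diskinfo -vSw /dev/nvd0        # re-run the synchronous write test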
 

Ayfid

New Member
without more data... I wouldn't say how awful the 950 pro is.
More likely ... look how shitty bsd nvme drivers are.

but now I am curious to find out which it is...
The ZIL is entirely fsync writes, which effectively disable the write cache on drives that do not have power loss protection. On an SSD, this absolutely destroys both performance and endurance. As neither the 950 Pro nor the 960 Pro has power loss protection, you should expect their performance to be awful.

This is the reason why you must use drives with power loss protection for the ZIL (and many vm and db uses too), as using a consumer SSD is nearly pointless and is a great way to wear out the drive.

It does not matter whether or not the drivers are terrible, as the hardware is just not capable of performing well under this kind of workload.
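If you want a second data point beyond diskinfo, a rough way to approximate the same workload is a QD1 4K random-write test with a flush after every write, pointed at a dataset on the pool so everything goes through the ZIL/SLOG. A sketch with fio (available from ports/pkg; the file path and runtime are just placeholders):
Code:
fio --name=slog-sim --filename=/mnt/tank/fio.test --size=1g \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --fsync=1 --runtime=30 --time_based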
 

lni

Member
This is the reason why you must use drives with power loss protection for the ZIL (and many vm and db uses too), as using a consumer SSD is nearly pointless and is a great way to wear out the drive.
The Intel 750 has pretty good fsync performance, and it is a wonderful choice for a personal dev workstation storing dev/test data. I have been doing that for ages, and it is pretty hard to destroy that disk. In fact, I am so impressed with the performance that I picked up another brand-new one yesterday, even though the product has already reached its EOL.
 

funkywizard

mmm.... bandwidth.
It looks like they rated a 950 Pro actually.

They mention "power loss protection may be related" -- i.e. the 950 Pro does not have power loss protection, so in order to ensure a cache flush has been executed, it must actually write the data to disk. It is possible the Intel drives simply ensure that they are prepared to flush the data successfully if a power loss occurs, rather than actually flushing the data when requested. I.E. perhaps the Intel drive can do (or is doing) writeback caching even when explicitly told to flush.

It would be interesting to see the result with a hardware RAID + BBU / supercap.
 

funkywizard

mmm.... bandwidth.
The Intel 750 has pretty good fsync performance, and it is a wonderful choice for a personal dev workstation storing dev/test data. I have been doing that for ages, and it is pretty hard to destroy that disk. In fact, I am so impressed with the performance that I picked up another brand-new one yesterday, even though the product has already reached its EOL.
Do you mean the Intel 710?

Intel 750: https://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-750-spec.pdf

Endurance Rating
70 GB Writes Per Day
Up to 127 TBW (Terabytes Written)

Pretty poor endurance.
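For what it's worth, those two numbers line up with a five-year warranty term: 70 GB/day x 365 days x 5 years is roughly 128 TB, so the 127 TBW figure looks like the daily rating simply extended over the warranty period rather than an independently measured wear-out limit.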
 

Patrick

Administrator
Staff member
@funkywizard I know the specs, but I have always thought 127TBW for the 750 is low to protect the enterprise market.
 

Rand__

Well-Known Member
Too bad nobody has done an endurance test on the 750 like the ones done on the S series ;)
 

Ayfid

New Member
They mention "power loss protection may be related" -- i.e. the 950 Pro does not have power loss protection, so in order to ensure a cache flush has been executed, it must actually write the data to disk. It is possible the Intel drives simply ensure that they are prepared to flush the data successfully if a power loss occurs, rather than actually flushing the data when requested. I.E. perhaps the Intel drive can do (or is doing) writeback caching even when explicitly told to flush.
Yes, this is exactly what is happening. Drives with power loss protection can ignore fsync commands, as their cache is considered safe. I think many people underestimate how significant a performance hit (and endurance hit too!) an SSD takes when it has to flush its buffer and write through to NAND on every write. It brings the drive down to performance more on par with a fast HDD, and every single write has to write out an entire block, which will drop your endurance by literally orders of magnitude.

Do you mean the Intel 710?
The Intel 750 is one of the only (the only?) consumer NVMe SSDs with power loss protection. Depending on your workload, the endurance may not be an issue (and you can replace the drive a few times before it costs the same as an enterprise NVMe drive), which makes the 750 a really good value option. A heavy database or virtual machine host that is loaded 24/7 will blow through that write endurance fast enough that the drive would probably not make sense, but there are still going to be many situations where it would not be such an issue. A ZFS SLOG device for a pool that is not touched often, but wants to be fast when it is (perhaps VM backups), would work well with the 750, as would most home or homelab uses, or, as lni said, dev workstations.
 

Rand__

Well-Known Member
Especially since the actually used size might only be 8-16 GB depending on network speed, so you have a ton of over-provisioning going for you even on the 400 GB model.
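For anyone wondering where the 8/16 GB figure comes from: the SLOG only ever needs to hold the sync writes from the last couple of ZFS transaction groups, and with the default ~5 second txg interval a saturated 10 GbE link (~1.25 GB/s) works out to roughly 6 GB per txg, so two to three txgs in flight lands you in the 8-16 GB range. That's the usual rule of thumb; the exact tunables vary by ZFS version.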
 

funkywizard

mmm.... bandwidth.
@funkywizard I know the specs, but I have always thought 127TBW for the 750 is low to protect the enterprise market.
Certainly possible. Even so, you should see the wear level indicator drop fairly quickly as you use it. I don't want to use a drive at wear 001 for an extended period without any idea when it's going to fail. It's true that at wear 001 the drive might have 50% of its lifespan remaining, or maybe only 10% more; in a handful of cases, maybe 80%. Intel could certainly use better or worse NAND for each batch, and as long as it meets the minimum spec they publish, they're in the clear. I don't want to take that risk.

I learned the hard way from the Samsung 850 Pro that the reasoning "it should be much better than rated, so the specs are probably a lie to get you to buy the more expensive one" is a tough one to rely on. The 850 Pro was reviewed along the lines of: "it uses 3D NAND, so the endurance rating only being X, whereas the previous generation was X/2, is probably very conservative to protect the enterprise market / margins. Based on (whitepaper talking about 3D NAND's future promises) it should have 10x the endurance of planar NAND. Besides, you won't need the endurance anyway, because that's tons of writes -- who would ever write that much." I've got a graveyard of worn-out drives to show I should not have listened to those reviews.

If this were a well-documented quirk of this model of drive and I didn't have better options, then by all means, I wouldn't want to throw money away. Luckily, some of the enterprise drives are on eBay at similar prices and performance, so it's moot at this point.
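For what it's worth, keeping an eye on that indicator is easy with smartmontools (a sketch; the device names are examples, and the exact field names vary by drive). On Intel SATA drives the normalized value of attribute 233 Media_Wearout_Indicator counts down from 100, while NVMe drives report a "Percentage Used" value in the health log that counts up:
Code:
smartctl -A /dev/ada0    # SATA: look for attribute 233 Media_Wearout_Indicator
smartctl -a /dev/nvme0   # NVMe: look for the "Percentage Used" line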
 

funkywizard

mmm.... bandwidth.
Yes, this is exactly what is happening. Drives with power loss protection can ignore fsync commands, as their cache is considered safe. I think many people underestimate how significant a performance hit (and endurance hit too!) an SSD takes when it has to flush its buffer and write through to NAND on every write. It brings the drive down to performance more on par with a fast HDD, and every single write has to write out an entire block, which will drop your endurance by literally orders of magnitude.



The Intel 750 is one of the only (the only?) consumer NVMe SSDs with power loss protection. Depending on your workload, the endurance may not be an issue (and you can replace the drive a few times before it costs the same as an enterprise NVMe drive), which makes the 750 a really good value option. A heavy database or virtual machine host that is loaded 24/7 will blow through that write endurance fast enough that the drive would probably not make sense, but there are still going to be many situations where it would not be such an issue. A ZFS SLOG device for a pool that is not touched often, but wants to be fast when it is (perhaps VM backups), would work well with the 750, as would most home or homelab uses, or, as lni said, dev workstations.
I do think that is very interesting. I hadn't considered that SSDs with power loss protection may be issuing write-commits to the OS well before the data is written, much the way that a hardware RAID would. I still suspect a hardware RAID controller would substantially increase performance in this use case when using SATA SSDs.

I am curious, however, how that configuration stacks up against an NVMe SSD with PLP. To my understanding, SSDs (even NVMe) don't perform all that great in a queue-depth-1 workload, which I would imagine is what this is. I've seen QD1 workloads on write-back-cached hardware RAID (a 9271 with CacheVault) in the 1-1.2 GB/s ballpark for anything that fits in the cache memory. I would bet that beats an NVMe drive for small-block writes at least. It would be interesting to see how they compare.
 

T_Minus

Build. Break. Fix. Repeat
It would be interesting to see the result with a hardware RAID + BBU / supercap.
I still suspect a hardware RAID controller would substantially increase performance in this use case when using SATA SSDs.
o_O I think you may be in the wrong thread; this is a ZFS SLOG device performance thread, with absolutely NOTHING to do with comparing against hardware RAID. No one here is in denial that other file systems perform better than ZFS, but they lack the benefits of ZFS as well ;)

Especially since the actually used size might only be 8-16 GB depending on network speed, so you have a ton of over-provisioning going for you even on the 400 GB model.
Yep! 127 TBW with 400 GB usable; chop that down to 8 GB usable (50x lower capacity) and your effective endurance goes through the roof -- likely enough for home labs and small production environments.
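Rough numbers, taking the 127 TBW rating at face value (the heavy over-provisioning should only help, by keeping write amplification down): 100 GB/day of sync writes burns through it in about 1,270 days, call it 3.5 years, while a pool sustaining 1 TB/day of sync traffic gets there in roughly four months. So it really comes down to how hard the pool gets hit.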
 

funkywizard

mmm.... bandwidth.
o_O I think you may be in the wrong thread; this is a ZFS SLOG device performance thread, with absolutely NOTHING to do with comparing against hardware RAID. No one here is in denial that other file systems perform better than ZFS, but they lack the benefits of ZFS as well ;)



Yep! 127 TBW with 400 GB usable; chop that down to 8 GB usable (50x lower capacity) and your effective endurance goes through the roof -- likely enough for home labs and small production environments.
Hardware RAID cards don't have to be used for RAID : )

Configure it as a 1-drive RAID 0 (or a 2-drive RAID 1 if you prefer redundancy) and you can benefit from the write-back cache on the RAID card. It would be interesting to see which provides better performance: a 400 GB DC S3700 as a "1-drive hardware RAID 0", or a much faster NVMe SSD.

Also, I bring up some of these use cases because the workloads I am more familiar with are similar in some ways, so perhaps those performance numbers are also relevant (but perhaps not).