ZLOG Benchmark is coming

gea · Oct 17, 2017

From the ZFS point of view, a hardware RAID behaves like a single disk. Thanks to its checksums, ZFS will detect any data corruption. But as there is no redundancy from the ZFS point of view, the corruption cannot be repaired. The hardware RAID has redundancy but cannot detect the problem, as it is not aware of the checksums.

Another problem is the write hole of hardware RAID, which can lead to a corrupt RAID or filesystem after a crash during a write. Even a cache + BBU cannot help.

The last problem is that ZFS wants to guarantee that a committed sync write is on disk. A cache on a disk or RAID controller can cause data loss. A cache + BBU can help.
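As a rough illustration of the detect-versus-repair point above, here is a minimal Python sketch. It is not ZFS code: `read_block` and the use of SHA-256 are assumptions for the example. The list `copies` stands for the redundant copies the filesystem itself can see (one entry for a hardware RAID volume, two for a ZFS mirror).

```python
import hashlib
from typing import List, Optional, Tuple


def checksum(data: bytes) -> str:
    # Stand-in for a block checksum; the hash choice is an assumption.
    return hashlib.sha256(data).hexdigest()


def read_block(copies: List[bytes], expected: str) -> Tuple[Optional[bytes], str]:
    """Return the first copy whose checksum matches, plus a status string.

    With a single visible copy a mismatch can only be reported; with
    more than one, a good copy can be used to repair the bad one.
    """
    for index, data in enumerate(copies):
        if checksum(data) == expected:
            status = "ok" if index == 0 else "repaired from copy %d" % index
            return data, status
    return None, "corruption detected, no good copy available"


good = b"payload"
bad = b"pay1oad"  # simulated silent corruption
expected = checksum(good)

# Hardware RAID volume: the filesystem sees one device, so one copy.
single = read_block([bad], expected)
# Filesystem-level mirror: two copies are visible.
mirror = read_block([bad, good], expected)

print(single[1])
print(mirror[1])
```

In the single-copy case the corruption is noticed but the read fails; in the mirror case the second copy satisfies the checksum and can be returned.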

T_Minus · Oct 17, 2017

BackupProphet said:
ZFS doesn't need SMART data. ZFS behaves correctly as long as it can read bytes from a storage device. It is very common to run ZFS with hardware RAID. A failure on a hardware RAID will behave exactly the same as it would on any drive.

VERY common to use ZFS with hardware RAID !? What?

Where is this common? Please, show 10+ examples, I mean you said "Very common" so maybe show me 100+ installs that would be starting to get "very common". This is completely absurd, and untrue until you prove otherwise.

Hardware - OpenZFS

"Hardware RAID controllers should not be used with ZFS. While ZFS will likely be more reliable than other filesystems on Hardware RAID, it will not be as reliable as it would be on its own."

"Hardware RAID will limit opportunities for ZFS to perform self healing on checksum failures"

Go to the above URL if you want to read more of why it's a TERRIBLE IDEA.

BackupProphet · Oct 17, 2017

Creating a RAID6 on hardware RAID and then giving it to ZFS is of course a very bad idea. But creating single-disk devices with write-back enabled and then assembling the pool with ZFS is a great thing.

whitey · Oct 17, 2017

SMFH is all I can say to 'some' of this thread's topics today :-D

That's all I'll say...to the adventurous...HAVE AT IT!

gea · Oct 17, 2017

BackupProphet said:
Creating a RAID6 on hardware RAID and then giving it to ZFS is of course a very bad idea. But creating single-disk devices with write-back enabled and then assembling the pool with ZFS is a great thing.

Not true, apart from a very special but conceivable case: an Slog on a hardware RAID controller with more than 4GB cache + BBU. Otherwise the ZFS RAM-based write cache in high-speed system memory (default 4GB with enough RAM) is far faster and larger than any cache on a hardware RAID controller.

Running ZFS pools on a hardware RAID controller, even without using its RAID functions, is not a good idea. Use a pure HBA in RAID-less IT mode and you get the fastest and most reliable ZFS pools with the best driver quality.

_alex · Oct 17, 2017

gea said:
From the ZFS point of view, a hardware RAID behaves like a single disk. Thanks to its checksums, ZFS will detect any data corruption. But as there is no redundancy from the ZFS point of view, the corruption cannot be repaired. The hardware RAID has redundancy but cannot detect the problem, as it is not aware of the checksums.

Another problem is the write hole of hardware RAID, which can lead to a corrupt RAID or filesystem after a crash during a write. Even a cache + BBU cannot help.

The last problem is that ZFS wants to guarantee that a committed sync write is on disk. A cache on a disk or RAID controller can cause data loss. A cache + BBU can help.

isn't this the same for any other filesystem, that just would not detect corruption, or even zfs on a single drive with copies=1?

BackupProphet · Oct 17, 2017

gea said:
Not true, apart from a very special but conceivable case: an Slog on a hardware RAID controller with more than 4GB cache + BBU. Otherwise the ZFS RAM-based write cache in high-speed system memory (default 4GB with enough RAM) is far faster and larger than any cache on a hardware RAID controller.

Running ZFS pools on a hardware RAID controller, even without using its RAID functions, is not a good idea. Use a pure HBA in RAID-less IT mode and you get the fastest and most reliable ZFS pools with the best driver quality.

Hardware RAID controllers with write-back cache have safe writes. That is the big win of using them, not the "hardware accelerated parity calculations".

gea · Oct 17, 2017

BackupProphet said:
Hardware RAID controllers with write-back cache have safe writes. That is the big win of using them, not the "hardware accelerated parity calculations".

Only if you use a BBU to protect the cache.
ZFS uses a ZIL/Slog device to guarantee safe writes, in addition to the fast sequential writes via the RAM-based write cache. Hardware RAID + BBU is simply not the way Sun designed ZFS to work.

gea · Oct 17, 2017

_alex said:
isn't this the same for any other filesystem, that just would not detect corruption, or even zfs on a single drive with copies=1?

ZFS can detect all errors thanks to the data + metadata checksums on every data block. Other filesystems and hardware RAID lack this feature.

BackupProphet · Oct 17, 2017

Oracle/Sun does say it is OK to use ZFS with a hardware RAID controller: Recommended Storage Pool Practices - Managing ZFS File Systems in Oracle® Solaris 11.3. I never heard that they discourage it.

funkywizard · Oct 17, 2017

gea said:
The Slog must be capable of logging the content of the RAM-based ZFS write cache, by default up to 4GB, so a hardware RAID + cache + BBU may be a solution only if its RAM cache is large enough.

Mostly I would not see hardware RAID as a suitable Slog solution, because of the cache size, price, and reliability of BBUs.

The data gets periodically flushed from the ram to the SSD, so the ram does not need to be the full size of the data. As cache flushes can be large sequential writes, it makes better use of the performance of the underlying SSD.
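To make the coalescing idea concrete, here is a small Python sketch. It is only an illustration of merging buffered small writes into larger contiguous extents before a flush; `coalesce` and the sample offsets are made up for the example and do not describe how any particular controller firmware works.

```python
def coalesce(writes):
    """Merge (offset, length) writes that touch or overlap into larger extents."""
    extents = []
    for offset, length in sorted(writes):
        if extents and offset <= extents[-1][0] + extents[-1][1]:
            # This write starts inside or right after the previous extent:
            # grow that extent instead of issuing a separate write.
            start, current_length = extents[-1]
            end = max(start + current_length, offset + length)
            extents[-1] = (start, end - start)
        else:
            extents.append((offset, length))
    return extents


# Four 4 KiB writes arrive out of order; three of them are adjacent.
pending = [(8192, 4096), (0, 4096), (4096, 4096), (65536, 4096)]
flushed = coalesce(pending)
print(flushed)  # [(0, 12288), (65536, 4096)]
```

Four small writes become two device writes, one of them a 12 KiB sequential run.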

As to BBUs, really the better option is a supercapacitor such as cachevault. I still use the term BBU because people are more familiar with it.

In any case, I only bring this up because of the *abysmal* performance scores at the link from the OP.

The example they gave of a "good" SLOG drive (Intel 710) only managed 24.7 MB/s for 4k writes, and the "bad" example (Samsung 950 Pro) gave the horrific performance of 1.9 MB/s.
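For scale, those throughput figures can be converted into writes per second and time per write. The sketch below assumes 1 MB = 10^6 bytes, a 4k write of 4096 bytes, and one write in flight at a time; none of these are stated in the linked benchmark, and `per_write_stats` is a helper made up for the example.

```python
def per_write_stats(throughput_mb_s, write_bytes=4096):
    """Return (writes per second, microseconds per write).

    Assumes decimal megabytes and one outstanding write at a time.
    """
    writes_per_s = throughput_mb_s * 1_000_000 / write_bytes
    microseconds_per_write = 1_000_000 / writes_per_s
    return writes_per_s, microseconds_per_write


good_iops, good_us = per_write_stats(24.7)  # Intel 710 figure from the post
bad_iops, bad_us = per_write_stats(1.9)     # Samsung 950 Pro figure from the post

print("24.7 MB/s: %.0f writes/s, %.0f us per write" % (good_iops, good_us))
print(" 1.9 MB/s: %.0f writes/s, %.0f us per write" % (bad_iops, bad_us))
```

Under these assumptions the "good" drive lands at roughly 6,000 writes per second and the "bad" one at under 500, i.e. a per-write time in the millisecond range.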

Call me crazy if you want to, but I have a hard time seeing how you *wouldn't* get better performance out of a 9271-4i HW RAID together with a 400GB Intel S3700. The 1GB RAM would be used for coalescing writes into large sequential blocks, which that drive is capable of handling at 400 MB/s.

This very well could be a bad idea or just not be as fast as I assume. Even so, what's the solution then? It sure sounds like writing to the SLOG is, well, a slog.

whitey · Oct 17, 2017

This is an example of a HUSMM1620 sas3 drive as a SLOG. I say 1.5Gbps sucked in and flushing down to disks is 'good nuff for me' for a $70 device.

400MB/sec is optimistic for ZFS/SLOG IMHO, I never even saw it on a P3700, close to 300MB but still, bang for buck I'll stick to my sas3 ent class drives for SLOG.

T_Minus · Oct 17, 2017

This discussion/argument/debate about HW RAID + S3700 for a SLOG is ridiculous and beside the point.

If you don't care about integrity, turn sync OFF: you don't NEED a SLOG device at all and you get the best performance...

i386 · Oct 17, 2017

T_Minus said:
I'm still looking forward to an NVMe vs Optane SLOG device comparison

vs Flashtec Nvram

Flashtec NVRAM Drives | Microsemi

gea · Oct 18, 2017

funkywizard said:
Call me crazy if you want to, but I have a hard time seeing how you *wouldn't* get better performance out of a 9271-4i HW RAID together with a 400GB Intel S3700. The 1GB RAM would be used for coalescing writes into large sequential blocks, which that drive is capable of handling at 400 MB/s.

This very well could be a bad idea or just not be as fast as I assume. Even so, what's the solution then? It sure sounds like writing to the SLOG is, well, a slog.

Some performance tests would indeed be needed to show the differences between an Slog built from
- Intel S3700
- Intel S3700 behind a 9271 with 1GB cache and cache protection (BBU vs CacheVault are probably identical performance-wise)
- P3700 NVMe as Slog, or a simple pool of them without an extra Slog

I would expect performance to be of the same order, with the best solution being a pool of P3700s without a dedicated Slog, as this improves regular write performance as well, not only logging performance.

My tests show a huge difference in sync write performance depending on the size of data written per commit. Even a disk-based pool of 2 x 6-disk HGST Ultrastar can give around 1 GB/s with sync enabled when writing 5GB of data. With ongoing small writes of 8k per commit it went down to 32KB/s. A similar SSD-only pool gave me 840 KB/s (all tests using the on-pool ZIL, not an Slog). I would be really interested in the performance differences due to the RAM cache between the above configurations.

see my tests at simple sync write test
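One simple way to think about why commit size matters so much is a model where every commit pays a fixed latency plus the time to stream its data. The Python sketch below is only that model: the 5 ms latency and 1 GB/s streaming rate are illustrative assumptions, not values measured in the tests above, and `sync_throughput` is a name made up for the example.

```python
def sync_throughput(commit_bytes, commit_latency_s, stream_bytes_per_s):
    """Bytes per second when each commit costs fixed latency plus transfer time."""
    seconds_per_commit = commit_latency_s + commit_bytes / stream_bytes_per_s
    return commit_bytes / seconds_per_commit


# Illustrative assumptions only: 5 ms per commit, 1 GB/s streaming rate.
LATENCY_S = 0.005
STREAM_BYTES_PER_S = 1_000_000_000

small = sync_throughput(8 * 1024, LATENCY_S, STREAM_BYTES_PER_S)   # 8k per commit
large = sync_throughput(1024 ** 3, LATENCY_S, STREAM_BYTES_PER_S)  # 1 GiB per commit

print("8k commits:    %.1f MB/s" % (small / 1e6))
print("1 GiB commits: %.1f MB/s" % (large / 1e6))
```

With small commits the fixed latency dominates and throughput collapses; with large commits the same device runs close to its streaming rate.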

funkywizard · Oct 18, 2017

gea said:
Some performance tests would indeed be needed to show the differences between an Slog built from
- Intel S3700
- Intel S3700 behind a 9271 with 1GB cache and cache protection (BBU vs CacheVault are probably identical performance-wise)
- P3700 NVMe as Slog, or a simple pool of them without an extra Slog

I would expect performance to be of the same order, with the best solution being a pool of P3700s without a dedicated Slog, as this improves regular write performance as well, not only logging performance.

My tests show a huge difference in sync write performance depending on the size of data written per commit. Even a disk-based pool of 2 x 6-disk HGST Ultrastar can give around 1 GB/s with sync enabled when writing 5GB of data. With ongoing small writes of 8k per commit it went down to 32KB/s. A similar SSD-only pool gave me 840 KB/s (all tests using the on-pool ZIL, not an Slog). I would be really interested in the performance differences due to the RAM cache between the above configurations.

see my tests at simple sync write test

I agree it needs testing to know the difference. If SSDs provided the "expected" level of performance I wouldn't even consider it.

Another question worth asking -- if enterprise SSDs have power loss protection and are doing some form of on-device write caching, why is the single queue depth small block performance so bad? Isn't the point of committing the write before it's actually written that the drive can coalesce a lot of small writes into larger sequential ones? If so, you'd think that small block random write performance would not be far off from large block random write performance -- but it's not even close.

If the reason the 950 Pro is so terrible is because it doesn't have power loss protection, and that is what is slowing down sync writes, then why don't we see incredible sync write performance from those SSDs that do have it? You see an improvement, but nowhere near what I would expect if they were well optimized devices.

T_Minus · Oct 18, 2017

The Samsung 950 Pro and same-generation (and older) Samsung enterprise NVMe drives have high latency compared to Intel drives. I don't know specifically why, but I'm pretty sure it's the firmware of the drive itself.

Newer-generation Intel NVMe drives are higher capacity and have higher capacity per die, which is one reason why they suffer in random write workloads: there are fewer chips to write to at once, just as a 4x RAID0 would have less performance than a 10x RAID0.

Latency is the limiting factor when trying to get low-QD / QD1 performance. A lot impacts latency, the SATA interface being one of the biggest bottlenecks, especially with how queues are limited/handled on SATA.

You need to follow the entire process of data movement if you want to understand why performance is how it is.

This is why there's more room for performance for Intel's XPoint as a DIMM than as NVMe, why SSDs got faster on NVMe vs. SATA and SAS, and why SAS is faster/better than SATA. To understand why each is fast and/or limited you'll need to deep-dive into the movement of data from starting point to finish, and then take into account varying firmware, which may purposely cripple performance so as not to cannibalize other parts of the market.

This is also why consumer drives are rated for X but steady state is crap compared to enterprise.

Even the highly regarded (and high-performing, for that matter) Intel S3700 drives drop from around 80,000 write IOPS down to around 30,000 IOPS at steady state. That is why so many people see the 'rated up to' number for consumer drives and think they're faster than enterprise drives: consumer drives are simply not rated at sustained performance.

i386 · Oct 18, 2017

Firmware on ssds can do magic, just look at the Intel ssds from 2010-2013. They used the same nand and Sandforce controllers that made other ssds with the same hardware notorious.

funkywizard said:
why is the single queue depth small block performance so bad?

I think you mean queue depth of one (= one command in the queue for the ssd controller)?
Because you can't run algorithms in firmware to optimize one command: this command is executed as fast as possible with the controller + nand.

Stux · Oct 18, 2017

Rand__ said:
Too bad nobody did an endurance test on the 750 as it has been done on the S series

There's always this...

Intel 750 PCIe (nvme) Wear Endurance falling insanely fast (on RMA'd drive too)??

lni · Oct 19, 2017

Stux said:
There's always this...

Intel 750 PCIe (nvme) Wear Endurance falling insanely fast (on RMA'd drive too)??

That thread is _not_ about a wear endurance test; it is about a single user experiencing an abnormal wear endurance issue. As mentioned by that user himself/herself, there is a public discussion of another user's Intel 750 wear endurance experience, in which "100 TB of data to his 750 over the course of three or four months which caused his write durance to drop 10 or 15 points".
