RAID for Windows (That's Not Storage Spaces)?


gea

Well-Known Member
Dec 31, 2010
3,578
1,406
113
DE
I know that. I edited my post above to say "No SLOG", sorry, was writing too fast. :)

So, if there's no SLOG, and sync=always, and the pool is all on the raid controller(s), it'll try to write there concurrently with RAM. Correct?
No. When you enable sync without a SLOG, logging is done in the (slower) on-pool ZIL area.
 

kapone

Well-Known Member
May 23, 2015
1,799
1,189
113
No. When you enable sync without a SLOG, logging is done in the (slower) on-pool ZIL area.
And if the pool is not "slow", then there are no issues, right?

I apologize for hammering this point, but people either use ZFS on "slow" HDDs or go straight to all-flash, with nothing in between. This is not a limitation of ZFS in any way, and yet it never gets talked about. We have implementations coming out that run ZFS over S3!

GitHub - Barre/ZeroFS: ZeroFS - The Filesystem That Makes S3 your Primary Storage. ZeroFS is 9P/NFS/NBD on top of S3.

ZFS can be run over block devices from a SAN; you're not doing "local" SLOGs in those cases either.
 

kapone

Radian RMS-200
This is just DDR3 DRAM with onboard flash and a supercap, with NVMe protocols on top. That's exactly what's inside a modern RAID controller, minus the NVMe protocols.

PCIe x8 Gen3 Host Bus Interface
PCIe Low Profile/Short-Length Form Factor
DDR3 NVRAM Capacities: 2GB, 4GB, or 8GB
On-Board Ultracapacitors (no remote ultracapacitor pack required)
NVMe Multi-Channel DMA Engines supporting the NVMe Command Set
Support for Programmed I/O (PIO) Operations
Fault Tolerant Flush-to-Flash™ Backup System
DuraLife™ Ultracapacitor Power Management System
DiaLog™ OEM Diagnostic Lifecycle Monitoring
 

mattventura

Well-Known Member
Nov 9, 2022
732
396
63
The "best practices" that get thrown around in relation to ZFS, simply don't work in this case. I can't give ZFS direct access to drives because:
1. It's too slow that way.
This sounds like something else is going on. It shouldn't be slower.

2. That takes away all the management and ops aspect of running a large storage infra. The production infra lives in an Equinix DC and I need to be able to tell a remote hands person "Go replace the disk in that bay where the red light is lit". No fumbling around, no running OS commands to find out which disk should be replaced etc. I can't do this with ZFS with native HDD access.
The OS command is `ledctl failure=/dev/sdX`, or let ledmon monitor your array and blink the LED of a failed disk automatically. Basically the same as you'd get with a hardware RAID card or standalone SAN device.

This is both a feature and a problem. The way ZFS works with write coalescing (in RAM) and an optional ZIL still leaves me with potential data loss in case of a power failure. Whatever device ZFS is given for a ZIL can be no better protected than the DRAM cache on a RAID controller (with supercap power-loss protection). And a RAID card does similar things: if it was unable to write out the full stripe from the DRAM cache to the underlying disks (due to a power event), the array will come up dirty and will need to be rebuilt. But the full stripe (to be written) is alive and well in the DRAM cache.
Neither one would lose data in that particular scenario. A good NVMe drive will either have its own PLP via capacitors and thus be able to write out any data, or won't report a write complete until it is completely committed. It's really the same probability, which is to say that it's not going to happen barring a bizarre hardware failure, but having a RAID card in the way just adds one more layer that can fail.

I'm not entirely convinced. Any data in flight before a file system gets it is certainly at risk in case of a power event. But the one advantage of copy-on-write is that the original blocks remain unchanged in such a case, hence no potential corruption. The problem, of course, is that there's data that did need to get written and I can't lose it. (Well, I can per se... but fixing that hole in the dataset is... painful.)
You have to understand the difference between sync and async writes. When a process writes data, it can either write it synchronously, in which case ZFS will not tell the process that the write is complete until it is written to the ZIL - whether it's on a SLOG device or your main devices. No chance of data loss unless all of your mirrored SLOG devices somehow fail. If the write is async, it will report the write as complete when the data is in RAM, because the entire point of an async write from a software standpoint is "I don't actually care if/when this finishes, it's a fire-and-forget". But as pointed out elsewhere, you can set `sync=always` to force it to always treat every write as synchronous on a per-dataset basis, if you need to work around buggy software. However, if the software is buggy enough, no filesystem or hardware raid will be able to fix it.
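To make the sync/async distinction concrete, here is a minimal userspace sketch in Python - ordinary POSIX calls, nothing ZFS-specific, and the file path is just for the demo:

```python
import os
import tempfile

# A scratch file to write into (path is just for this demo).
path = os.path.join(tempfile.mkdtemp(), "demo.bin")

# Async-style write: the kernel buffers the data and returns
# immediately; the bytes may not be on stable storage yet.
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
os.write(fd, b"async record")

# fsync() is the "I really need this committed" call: it blocks
# until the data is durable. This is the latency a ZIL/SLOG absorbs.
os.fsync(fd)
os.close(fd)

# Opening with O_SYNC makes every write() behave synchronously,
# roughly what sync=always imposes on an application's async writes.
fd = os.open(path, os.O_WRONLY | os.O_SYNC | os.O_APPEND)
os.write(fd, b" sync record")
os.close(fd)
```

The first `write()` returns as soon as the kernel has buffered the data; only `fsync()` (or the `O_SYNC` flag) forces the durability guarantee that sync-write applications rely on.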
 

gea

Sync logging confirms writes on every commit. Unlike the large writes via the ZFS write cache (GB range), such log writes are small, e.g. 4K. Small writes like that are quite slow even on a very fast SLOG, so your overall sync-write performance, even with the world's fastest DRAM-based SLOG, is significantly lower than with sync disabled.

If you create a ZFS pool on SAN block devices, e.g. FC/iSCSI targets, the situation is the same: you create a pool on these targets with a ZIL included.
 

kapone

@mattventura - I know all this.

This sounds like something else is going on. It shouldn't be slower.
It is and always will be. If you give ZFS bare access to an HDD:

1. It disables the disk drive write cache, which it should. And so do RAID controllers.
2. But with bare access now ZFS is writing directly to the disk with no buffer in between. With a RAID controller, the DRAM cache is the buffer (and it's power protected).

The OS command is `ledctl failure=/dev/sdX`, or let ledmon monitor your array and blink the LED of a failed disk automatically. Basically the same as you'd get with a hardware RAID card or standalone SAN device.
Somebody needs to run that command. That can't be the remote hands person, and the infra guys are not always glued to the monitoring system. A disk failure can happen anytime, and the usual way (and I operate similarly) is to keep ready spares onsite at the DC, with a well defined playbook on which rack/bay etc the data lives on, and what needs to be done.

The remote hands have standing instructions (and they get paid for this) to run that playbook when a ticket comes to them, and the ticket is generated automatically by the monitoring/observability systems in place. No one has to click anything for this to happen.

Neither one would lose data in that particular scenario. A good NVMe drive will either have its own PLP via capacitors and thus be able to write out any data, or won't report a write complete until it is completely committed. It's really the same probability, which is to say that it's not going to happen barring a bizarre hardware failure
Agreed. But that's my point. That power-protected NVMe is no different from a power-protected controller cache, is it?

having a RAID card in the way just adds one more layer that can fail.
But that's no different from the failure domain of the NVMe drive (in this case), is it? Unless you mean mirrored NVMe as SLOG vs. non-mirrored RAID controllers?

You have to understand the difference between sync and async writes. When a process writes data, it can either write it synchronously, in which case ZFS will not tell the process that the write is complete until it is written to the ZIL - whether it's on a SLOG device or your main devices. No chance of data loss unless all of your mirrored SLOG devices somehow fail. If the write is async, it will report the write as complete when the data is in RAM, because the entire point of an async write from a software standpoint is "I don't actually care if/when this finishes, it's a fire-and-forget". But as pointed out elsewhere, you can set `sync=always` to force it to always treat every write as synchronous on a per-dataset basis, if you need to work around buggy software. However, if the software is buggy enough, no filesystem or hardware raid will be able to fix it.
It's not about buggy software. If I'm gonna run ZFS in my use case, it has to run with sync=always. The data is real-time and fast (which is why it runs in the Equinix DC with very low latency to the NYSE; the NBBO feed can approach 40 Gbps...) and has to be written reliably.
 

kapone

Sync logging confirms writes on every commit. Unlike the large writes via the ZFS write cache (several GB), such log writes are small, e.g. 4K. Small writes like that are quite slow even on a very fast SLOG, so your overall sync-write performance, even with the world's fastest DRAM-based SLOG, is significantly lower than with sync disabled.
Argh. That's gonna be a problem.
 

gea

If your performance needs can only be achieved with sync disabled, a pool on hardware RAID with cache protection is probably the best compromise. Not as safe as a pure ZFS software RAID, but probably "quite ok" and faster.
 

kapone

If your performance needs can only be achieved with sync disabled, a pool on hardware RAID with cache protection is probably the best compromise. Not as safe as a pure ZFS software RAID, but probably "quite ok" and faster.
:) But... there's always a but, isn't there?

My storage needs are expanding faster than I can keep up with the current implementation (on XFS and hardware raid). The system is already close to 2PB in the DC, and data growth predictions are somewhere around 200-250TB per year. That doesn't sound like a lot per se, but markets are already starting to stay open longer than they were historically, and the SEC is reviewing proposals to allow 24x7 trading... :eek:

Like I said earlier, my main draw to ZFS is compression, because this data compresses beautifully. But, it has to be able to perform within the performance constraints.
 

kapone

@macdaddy2012 - I think I owe you an apology. I didn't mean to derail your thread, even though (I think), what's being discussed is quite relevant.
 

mattventura

@mattventura - I know all this.


It is and always will be. If you give ZFS bare access to an HDD:

1. It disables the disk drive write cache, which it should. And so do RAID controllers.
2. But with bare access now ZFS is writing directly to the disk with no buffer in between. With a RAID controller, the DRAM cache is the buffer (and it's power protected).
The apples-to-apples comparison here would be to use some SSDs with PLP as your ZIL SLOG. That way, data is written to a capacitor-backed DRAM cache, and the drive is able to safely write the data in the event of a power loss.

Agreed. But that's my point. That power protected NVME is no different than a power protected controller cache, is it?
They fill the same role, yes. But being able to directly expose the drive to ZFS is preferable for performance.

But, that's no different than the failure domain of the NVME drive (in this case), is it? Unless you mean mirrored NVME as SLOG, vs "not" mirrored RAID controllers?
Yes, you should mirror your SLOG to avoid having a single point of failure which could cause you to lose unwritten data in the event of a power loss. The same is true if you use a single non-redundant RAID controller - you can lose data if the RAID controller itself dies.

It's not about buggy software. If I'm gonna run ZFS in my use case, it has to run with sync=always. The data is realtime (and fast, which is why it runs in the Equinix DC with very low latency to the NYSE. The NBBO feed can approach 40gbps...) and has to be written reliably.
Let me explain this in more detail.

When an application writes data, it can do so synchronously or asynchronously. When an application performs a synchronous write, it is expecting that the write operation does not return back to the program until the write is truly committed (written to permanent storage or a battery-backed cache). This is used for things like databases where you don't want the DB to tell the client that the data is committed until and unless it is really committed. Async writes, on the other hand, immediately return control back to the program. This is useful for when the program is dealing with less-important or temporary data, and doesn't care that the data might actually be lost. Or for things like simple log files, where you don't want to impact the performance of the program just to write a log.

By default, i.e. `sync=standard`, ZFS will treat synchronous writes as complete when they are written to the actual storage devices, or, if there is a SLOG device, to the SLOG. Similarly, a RAID card with a battery- or supercap-backed write cache will report a sync write as "done" when it is in the cache. ZFS treats async writes as complete when they are in the RAM-based write cache. Async writes never actually hit the SLOG - ZFS just writes them directly to main storage. It doesn't use the SLOG as an accelerator for them, because nothing is actually waiting for the write to complete.

If you set sync=always, there is no difference for sync writes. All that does is cause ZFS to also treat async writes as sync writes. Likewise, setting sync=disabled causes it to treat sync writes as if they were async (dangerous, but has some niche uses). Thus, the only reason to use sync=always is if your application really should be using sync writes but is actually performing async writes.
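The remapping can be summarized in a few lines - a paraphrase of the behavior just described, not ZFS code, and the function name is made up:

```python
def effective_write_type(app_write: str, sync_prop: str) -> str:
    """How the ZFS 'sync' dataset property remaps application writes.

    The property values are standard (default), always, and disabled.
    """
    if sync_prop == "always":      # every write waits on the ZIL
        return "sync"
    if sync_prop == "disabled":    # even fsync() returns immediately (dangerous)
        return "async"
    return app_write               # sync=standard: honor what the app asked for

# sync=always only changes what happens to async writes:
assert effective_write_type("sync", "always") == "sync"
assert effective_write_type("async", "always") == "sync"
# sync=disabled treats everything as async:
assert effective_write_type("sync", "disabled") == "async"
```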

No solution, not even a hardware RAID with BBU, will stop all "data loss" - the point isn't to never lose data, it's to never lose data that the storage layer claimed was committed. Similarly, the purpose of a database transaction is not to guarantee that the data will be written - it's to ensure that if the DB says the data was written, that it actually was.
 

kapone

it's to ensure that if the DB says the data was written, that it actually was.
And that's what this is all about. The only thing running on top of this particular storage is a massive Postgres cluster (Both multi master and a couple of read replicas). The rest of the system runs on a separate flash based (redundant/replicated) storage, but the sizing needs for that are minuscule (it all fits in two 2U chassis).

As far as PG is concerned, when it says "write", I need the file and storage system to really write/commit. Which is why I said sync=always for this storage.
 

mattventura

And that's what this is all about. The only thing running on top of this particular storage is a massive Postgres cluster (Both multi master and a couple of read replicas). The rest of the system runs on a separate flash based (redundant/replicated) storage, but the sizing needs for that are minuscule (it all fits in two 2U chassis).

As far as PG is concerned, when it says "write", I need the file and storage system to really write/commit.
Yes, if your Postgres isn't majorly misconfigured, then it will be doing synchronous writes, so ZFS won't report a write as complete until it's in the ZIL (either a SLOG if present, or the main vdevs). sync=always wouldn't make a difference. If your SLOGs are enterprise-grade SSDs (i.e. with PLP), then it's going to give you the same level of safety and performance as a HW RAID cache.
 

kapone

sync=always wouldn't make a difference
Agreed. I was using that phrase as a metaphor, since it'll be coming from PG, not me configuring ZFS manually.

If your SLOGs are enterprise-grade SSDs (i.e. with PLP), then it's going to give you the same level of safety and performance as a HW RAID cache.
Agreed, but, apologies for belaboring the point. What if there's no SLOG? Then ZFS is writing (sync writes) to the ZIL, which is in the VDEVs, which are on the RAID controller. Is my understanding correct?
 

mattventura

Agreed. I was using that phrase as a metaphor, since it'll be coming from PG, not me configuring ZFS manually.


Agreed, but, apologies for belaboring the point. What if there's no SLOG? Then ZFS is writing (sync writes) to the ZIL, which is in the VDEVs, which are on the RAID controller. Is my understanding correct?
Yes, that is correct. It will write to the normal ZIL in the vdevs so that it can write sequentially for better performance, and report the write as complete. Under normal circumstances, it will then proceed to write the data from RAM to its real location on disk. It only reads back the ZIL in the event that the system crashes.
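That flow (append to a sequential log, acknowledge, flush lazily, replay only after a crash) is the classic write-ahead-log pattern. A toy sketch of the idea - all names invented, not ZFS internals:

```python
class ToyZil:
    """Toy intent log illustrating the write-ahead-log pattern.

    Not ZFS internals -- just the flow: log, ack, flush, replay on crash.
    """

    def __init__(self):
        self.log = []    # sequential intent log (the "ZIL")
        self.ram = {}    # dirty data held in RAM until the next commit
        self.disk = {}   # final on-disk locations

    def sync_write(self, key, data):
        self.log.append((key, data))  # sequential append: fast even on HDD
        self.ram[key] = data
        return "ack"                  # caller is unblocked here

    def txg_commit(self):
        # Normal path: RAM -> final location; the log becomes irrelevant.
        self.disk.update(self.ram)
        self.ram.clear()
        self.log.clear()

    def crash_replay(self):
        # Only consulted after a crash: replay acked-but-unflushed writes.
        for key, data in self.log:
            self.disk[key] = data
        self.log.clear()
```

Simulating a crash means losing `ram` before `txg_commit()` runs; `crash_replay()` then recovers every acknowledged write from the log, which is exactly why the log is never read during normal operation.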
 

kapone

It will write to the normal ZIL in the vdevs so that it can write sequentially for better performance, report the write as complete.
Awesome. Appreciate all the patience during my uh, well...animated discussion on the internals of ZFS.

So, really my options are:

- If the hardware RAID arrays on the controllers offer enough performance for the ZIL, then no separate device is needed. They're already redundant and power protected.

- If they don't, then potentially look at dedicated SLOG device(s), whatever they may be.

The RAID arrays are currently (in my test lab) configured for a 256 KB stripe size, and ZFS is configured for the default record size of 128 KB (which I'll be changing to fine-tune the PG<-->ZFS page/recordsize combo). In either case the multiples are even, so none of the components should be unhappy.

The RAID arrays can saturate a pcie 3.0x8 bus raw, i.e. without ZFS on top of them. I need to test more to see how the performance evolves after adding ZFS on top of them.
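For what it's worth, the "even multiples" sanity check above is just divisibility - a tiny sketch (function name invented):

```python
def tiles_evenly(stripe_kib: int, record_kib: int) -> bool:
    # Full records tile cleanly into RAID stripes (or vice versa)
    # only when one size divides the other.
    return stripe_kib % record_kib == 0 or record_kib % stripe_kib == 0

assert tiles_evenly(256, 128)      # the lab config: 256 KB stripe, 128 KB recordsize
assert not tiles_evenly(256, 96)   # a mismatched pair would straddle stripe boundaries
```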
 

mattventura

Awesome. Appreciate all the patience during my uh, well...animated discussion on the internals of ZFS.

So, really my options are:

- If the hardware RAID arrays on the controllers offer enough performance for the ZIL, then no separate device is needed. They're already redundant and power protected.

- If they don't, then potentially look at dedicated SLOG device(s), whatever they may be.

The RAID arrays are currently (in my test lab) configured for a 256kb stripe size and ZFS is configured for a (the default) record size of 128kb (which I'll be changing, to fine-tune the PG<-->ZFS page/recordsize combo). In either case, the multiples are even, so none of the components should be unhappy.

The RAID arrays can saturate a pcie 3.0x8 bus raw, i.e. without ZFS on top of them. I need to test more to see how the performance evolves after adding ZFS on top of them.
If you're already able to bottleneck the PCIe bus using only the RAID controller, then having a SLOG device also controlled by that raid controller would probably hurt, because data would need to be written to the SLOG and then again to the main array. The one advantage of hardware RAID here is that writes are more efficient in terms of bus usage - the bus only sees the normal data, and any additional mirrors or parity of the data happens on the RAID card. Using separate NVMe devices for your SLOG would avoid at least the first bottleneck.

The record size doesn't mean that ZFS will always write that amount. Any write smaller than the record size will only consume up to the nearest block (e.g. if using 4k blocks, a 6kb write will only take up 8kb).
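The space accounting is plain round-up-to-block-size arithmetic - a quick sketch (function name invented, 4 KiB blocks assumed):

```python
def allocated_bytes(write_size: int, block_size: int = 4096) -> int:
    """Round a write up to whole blocks; recordsize is only a ceiling."""
    blocks = -(-write_size // block_size)   # ceiling division
    return blocks * block_size

assert allocated_bytes(6 * 1024) == 8 * 1024      # the 6 KB example: takes 8 KB
assert allocated_bytes(128 * 1024) == 128 * 1024  # exact fit, no waste
```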

If you're trying to maximize throughput and are stuck on PCIe 3.0 x8, then you'd want to split it across multiple HBAs, but that precludes the use of hardware RAID unless you're doing multiple HW raids. It gets a little messy when you try to mix HW RAID and ZFS regardless of how you do it. I don't think you'd be likely to see much of an increase in performance. I'd test out:
1. Pure HW RAID
2. HW RAID for main vdev, 2x NVMe SLOG
3. Plain HBA (possibly multiple), let ZFS do all the RAID, 2x NVMe SLOG

What hardware is this hosted on?
 

kapone

If you're already able to bottleneck the PCIe bus using only the RAID controller, then having a SLOG device also controlled by that raid controller would probably hurt, because data would need to be written to the SLOG and then again to the main array. The one advantage of hardware RAID here is that writes are more efficient in terms of bus usage - the bus only sees the normal data, and any additional mirrors or parity of the data happens on the RAID card. Using separate NVMe devices for your SLOG would avoid at least the first bottleneck.

The record size doesn't mean that ZFS will always write that amount. Any write smaller than the record size will only consume up to the nearest block (e.g. if using 4k blocks, a 6kb write will only take up 8kb).

If you're trying to maximize throughput and are stuck on PCIe 3.0 x8, then you'd want to split it across multiple HBAs, but that precludes the use of hardware RAID unless you're doing multiple HW raids. It gets a little messy when you try to mix HW RAID and ZFS regardless of how you do it. I don't think you'd be likely to see much of an increase in performance. I'd test out:
1. Pure HW RAID
2. HW RAID for main vdev, 2x NVMe SLOG
3. Plain HBA (possibly multiple), let ZFS do all the RAID, 2x NVMe SLOG

What hardware is this hosted on?
Good questions.

then having a SLOG device also controlled by that raid controller would probably hurt
No, my assumption was that if a separate SLOG is needed, it would not be on the raid controllers.

Using separate NVMe devices for your SLOG would avoid at least the first bottleneck.
That's the intent, if a SLOG really ends up being needed.

The record size doesn't mean that ZFS will always write that amount. Any write smaller than the record size will only consume up to the nearest block (e.g. if using 4k blocks, a 6kb write will only take up 8kb).
I know, which is why I said fine tune. No tuning may be necessary, if it doesn't offer any tangible benefits.

stuck on PCIe 3.0 x8,
That's only in the test lab; the production hardware will be different (which is a whole other topic... since I'll need to go through a refresh cycle on that).

then you'd want to split it across multiple HBAs,
It already is. 48 spindles + 8x SSDs (maxCache) per Adaptec RAID controller x 2 controllers in each node. There's two nodes, with replication between them.

you're doing multiple HW raids.
I am. Because a single would become a bottleneck like you said.

It gets a little messy when you try to mix HW RAID and ZFS regardless of how you do it.
I kinda agree, but not really. A block device is a block device. Whether ZFS operates on 100 disks natively, or over two very large block devices from the RAID arrays, should not matter. If it does matter, then something definitely is wrong.

Does it increase operational complexity? Yes. Does it offer any operational enhancements? Yes. Does it offer any performance benefits? Majorly.

I'd test out:
1. Pure HW RAID
2. HW RAID for main vdev, 2x NVMe SLOG
3. Plain HBA (possibly multiple), let ZFS do all the RAID, 2x NVMe SLOG
1. - Already done. Raw sequential r/w close to 7GB/s
2. Already did the first part (main VDEVs), but haven't done the second yet. And haven't run any performance tests on the first part yet.
3. Already did, with multiple LSI 9300 (and 9400 just to rule out hardware) HBAs, but without a dedicated SLOG. The performance left a lot to be desired, and the CPU consumption was higher than I'm comfortable with. A file system should not be eating this much CPU. While this can be mitigated, it's still a concern.

What hardware is this hosted on?
The test lab is on old(er) hardware (Supermicro X9 series, Adaptec 81605Zq controllers, HGST HE10 10TB disks). The production hardware will be different.
 

mattventura

I kinda agree, but not really. A block device is a block device. Whether ZFS operates on 100 disks natively, or over two very large block devices from the RAID arrays, should not matter. If it does matter, then something definitely is wrong.

Does it increase operational complexity? Yes. Does it offer any operational enhancements? Yes. Does it offer any performance benefits? Majorly.
When you dig deeper, you find that "a block device is a block device" starts to break down. Even for a single physical device, that isn't true - for example, an NVMe SSD with X GB/s of throughput will beat a similarly-specced SAS SSD in real world-use cases, because the NVMe software and hardware stack is optimized for that from the ground-up. Block devices have different block sizes, different optimal I/O patterns, different queueing types, and so on. If anything, optimizing ZFS to run efficiently on top of a HW RAID requires a lot more tuning, because a RAID array performs quite a bit differently than a normal block device.

ZFS has lots of logic to optimize operations by doing things like prioritizing operations, coalescing operations, or spreading operations across drives more intelligently. These optimizations tend to either become useless or counterproductive the more layers you add between ZFS and the real storage devices. Its ability to repair any silent data corruption is also severely hampered - if the data in question is important, you should absolutely care about that if nothing else.

The test lab is on old(er) hardware (Supermicro X9 series, Adaptec 81605Zq controllers, HGST HE10 10TB disks). The production hardware will be different.
Yeah, that's probably part of it. Modern hardware is going to run circles around that, so the apparent CPU usage should be much lower. Especially with a SLOG, otherwise the more like-for-like comparison would be the RAID card with the cache disabled.

What kind of performance and CPU usage did you see on that hardware?