Optane NVMe for Slog / Pooldisks or All-in-One via vdisk on OmniOS

gea

Well-Known Member
Dec 31, 2010
2,485
837
113
DE
Intel is not known for making mistakes in the datasheet of a new product, so what's going on?

- the 900P has PLP, but Intel removed it as an advertised feature
to push 4800X sales (we saw this earlier with other non-DC SSDs)

- PLP on the 900P was not reliable enough to guarantee that a committed write is on disk after a sudden power loss

- Intel plans a 900P Pro/server edition:
a 900P as advertised earlier but at a higher price; the current 900P is for gamers only

- it really was a mistake; sorry, we sacked the person responsible

In general, as an Optane needs no cache,
PLP seems to me mainly a firmware/software feature rather than a hardware feature.
 

gea
I wonder about the real consequences.

For a critical production setup it is clear that the advantages of Optane are huge. The other good Slog devices are expensive too (I paid more for a ZeusRAM than I would now for a 4800X), and I would not suggest a Slog device without the guarantee that committed writes are on stable storage.

For many setups you may weigh the risk against the price of a solution. Without a Slog the whole write cache is at risk, say 4 GB of data.

The Slog only has to guarantee the last commit, so the worst case is one lost committed data block. That is bad if it affects a critical financial transaction or a piece of metadata that results in filesystem corruption. But in general the real risk is low, much lower than the risk of a corrupted filesystem or a corrupted database without sync.

If I go back a few years, this risk was accepted with the first reasonable Slog (the Intel SLC X25-E), which came without PLP, as it was better than anything else at that time.

So maybe for many setups the Optane cache versions (16/32 GB) or the faster 900P remain a good ZFS and Slog option, while I hope for a cheaper alternative to the 4800X in the future.
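For a setup like this, attaching the Optane as a Slog and forcing sync semantics takes only a few commands. A minimal sketch, assuming a pool named "tank" and a device name c2t1d0 (both placeholders for your own layout):

```shell
# Attach the Optane NVMe device as a dedicated log (Slog) to the pool
zpool add tank log c2t1d0

# Force sync semantics on a critical dataset so every committed write
# goes through the Slog before it is acknowledged to the application
zfs set sync=always tank/db

# Check that the log vdev is attached and healthy
zpool status tank
```

Removing the log device later with `zpool remove tank c2t1d0` is non-destructive; the pool simply falls back to the in-pool ZIL.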
 

_alex

Active Member
Jan 28, 2016
874
94
28
Bavaria / Germany


so, what should the caps protect in the event of a power loss, when the ack back to the OS is only issued after in-flight writes are persisted to the Optane media, without DRAM?
 

gea
isn't all 3D XPoint / Optane 'PLP by design', as there is no cache?
Writing data consists of many steps in software. On a sudden power loss a system should complete the whole current transaction for a committed write and behave properly to ensure filesystem consistency; with Optane this is mainly a question of firmware quality. Intel guarantees this for some models like the 4800X, but not for others like the cheaper Optanes.

One may debate whether the firmware is the same, but for some setups you must rely on a vendor's guarantee for a feature, no matter whether the difference is only marketing.
 

_alex
well, Intel states (and I guess still does) that the controller of the 900P is the same as in the 4800X, and I really wonder what PLP would even look like without a DRAM cache. This could also explain why there are no caps on the 4800X ...
 

_alex
but for some setups you must rely on a vendor's guarantee for a feature, no matter whether the difference is only marketing.
This is an argument.
But is a feature that is no longer necessary (or existing) because of a technology change, and that is therefore only a sentence on a spec sheet, still worth considering?

I have no problem with people paying a massive premium for an empty, obsolete sentence backed by zero difference in operational behavior, if they need to.

I really wonder when/if there will be an official statement about the need for, or existence of, PLP as a 'feature' with Optane.
 

gea
You are right: if the PLP guarantee of the 4800X is only a marketing gag by Intel, then you can safely ignore the PLP=no of the cheaper Optanes.

You say there is no cache that needs protecting. That seems true, but shortly before a cache flush from ZFS the Slog can hold, say, 4 GB of data. After a crash this data must be written to the pool on the next reboot. This is why you need the Slog.

Now remember, you want a Slog to guarantee the validity of a database or a filesystem on ZFS against incomplete atomic writes. Why do you expect that the same problem would not affect the data and filesystem on the Slog itself? What if it is corrupted by a crash?
 

_alex
If data on the Slog becomes corrupt, would a (cap- or marketing-backed) PLP be able to prevent this in any way?
There is always a small window in time that can't be covered, be it only a single CPU cycle.

The only way to be 100% safe is maybe to verify the data that has been written to the Slog (by ZFS/the filesystem) before the ack goes back to the app. And that is certainly no performance boost.
 

_alex
oh, and would ZFS write corrupted data from the Slog at all, or just fall back to the last txg before the power loss when it sees that the checksums on the Slog are not OK?

imho, giving up the corrupted txg and forcing a manual rollback to the last consistent state/sane txg commit would be much safer than writing a corrupt txg from the Slog.
 

gea
In the end, only an Intel developer can answer what the differences are between the Optanes with and without PLP.

Until then: ZFS only guarantees that the filesystem is not corrupt, while a Slog must guarantee the validity of the write cache and the last committed transactions. If you treat the Slog only like a regular ZFS filesystem, you cannot guarantee the last transactions.
 

_alex
yes, this is somewhat esoteric :D
Not sure if there is any difference, as ZFS only sees the ack for the writes to the Slog.

If these land on a persistent layer and not in a RAM-based cache, I have a hard time seeing what could be different.

Either the data is written and survives the reboot, or it is not written. In the latter case no ack should be given.

So as long as a drive acks honestly there should never be an issue.

If data gets corrupted after an honest ack, something went seriously wrong, with a good portion of bad luck that imho is not related to the presence of PLP.

In this case a rollback with the loss of a single txg is maybe the only option.
 

gea
Maybe this is similar to the ECC functionality that Intel artificially limits to server chipsets. From a manufacturing point of view the chips cost the same, with the result that many cheap NAS boxes now come without ECC even when they offer 64 GB of RAM.

I expect the same with PLP: it only affects firmware quality, and no additional hardware is required (unlike hardware RAID with its BBU or flash backup unit). The feature split exists simply to maximise profits.
 

_alex
yes, but in this case it would mean the firmware on the cheaper drives intentionally messes things up or acks too early. Can't imagine this, but who knows.
imho this is expensive FUD, smoothed over by an additional feature listed on ARK that is no longer present for the 900P. But I might be wrong, and only Intel knows what is really going on.
 

Stux

Member
May 29, 2017
30
10
8
42
Maybe completed sync writes are fine but in-flight async writes are not ;)

After all, a sync write should *only* be acknowledged once it is committed.
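That acknowledgement rule is easy to see from the application side: an async write returns once the data is in the OS cache, while a sync write only returns once the data is reported to be on stable storage. A rough sketch with GNU dd (oflag=dsync and status=none are GNU extensions; on OmniOS the GNU tools may be installed as gdd), writing to a throwaway temp file rather than a real Slog-backed dataset:

```shell
f=$(mktemp)

# Async: dd returns as soon as the blocks are in the OS write cache
dd if=/dev/zero of="$f" bs=4k count=100 status=none

# Sync: each 4k block is only acknowledged after it is reported to be
# on stable storage; on a ZFS dataset this path exercises the ZIL/Slog
dd if=/dev/zero of="$f" bs=4k count=100 oflag=dsync status=none

rm -f "$f"
```

On a real pool you would time both runs against a dataset with and without the Slog attached to see the latency difference.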
 

J-san

Member
Nov 27, 2014
67
42
18
40
Vancouver, BC
I have a P3600 and a DC S3700 SATA around.
I will include them when I redo my tests with different hardware and a different set of benchmarks.
Would love to see a 200 GB or 400 GB Intel DC S3700 SATA drive in there as a Slog for comparison!

Keep up the good work!
 

NYCone

Member
Jun 23, 2017
35
8
8
57
My suggested AiO setup

- use a USB stick to boot ESXi
- create a local datastore on an Intel Optane 900P and place the napp-it storage VM onto it
- use an LSI HBA or Sata in pass-through mode for your data disks

- add a 20 GB vdisk on the Optane datastore to the napp-it storage VM and use it as Slog for your data pool
- add a vdisk for L2ARC (around 5x and no more than 10x the size of RAM)
Gea,

As you suggested, I've switched to a Solaris 11.4 AiO. I'm a bit of a newbie to ESXi and Solaris; do you have a how-to on what parameters you used for the vdisk Slog and the L2ARC? Is the exact provisioning important in ESXi?

Thanks
 

gea
Just create a vdisk (on the Optane datastore) for the Slog with size = 20 GB and a second vdisk for the L2ARC with a size between 5x and 10x the RAM that you have assigned to Solaris.

You can use the default virtual SCSI controller. On newer ESXi you can try the virtual NVMe controller (it may have lower latency).
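The steps above can be sketched as a few commands. The datastore name "optane900p", the VM folder, the 100 GB L2ARC size, and the Solaris device names c2t1d0/c2t2d0 are all assumptions; substitute your own:

```shell
# On the ESXi host: create the two vdisks on the Optane datastore.
# Eager-zeroed thick for the Slog keeps latency predictable; the
# L2ARC vdisk can stay thin since it holds only reconstructable data.
vmkfstools -c 20G -d eagerzeroedthick /vmfs/volumes/optane900p/napp-it/slog.vmdk
vmkfstools -c 100G -d thin /vmfs/volumes/optane900p/napp-it/l2arc.vmdk

# After attaching both vdisks to the storage VM, inside Solaris/OmniOS:
zpool add tank log c2t1d0      # the 20 GB vdisk as Slog
zpool add tank cache c2t2d0    # the 100 GB vdisk as L2ARC
```

`zpool status tank` afterwards should list the log and cache vdevs alongside the data vdevs.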