How does one go about testing that a storage device's PLP is actually working?


BLinux

cat lover server enthusiast
Jul 7, 2016
2,669
1,081
113
artofserver.com
Question in title. Products marketed with PLP are fine, and one can open up the drive and examine the components to see if the expected hardware is there, but how do you actually go about testing that PLP is working?

Do you just write a continuous stream of data and yank the power cord and see what happens? How do you record the last bit of data that hit the drive's cache?

Sorry if this is a dumb question, but please educate me...

Particular use cases I'm thinking about:
1) the simple one: product X claims to have PLP; confirm it works.
2) a refurbished product has PLP, but the backup power source may have degraded capacity, so how do you test that it still works?
 

Stephan

Well-Known Member
Apr 21, 2017
920
698
93
Germany
For Intel DC-type products, check https://www.intel.com/content/dam/w.../ssd-power-loss-imminent-technology-brief.pdf page 4; there are SMART attributes for it.
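For SATA models those attributes can be read with smartctl. Attribute 175 (Power_Loss_Cap_Test) is what some Intel DC SATA drives expose -- treat the ID/name as an assumption to verify against the brief for your model, and note the sample line below is an illustrative stand-in, not real drive output:

```shell
# On a live Intel DC SATA drive (requires root):
#   smartctl -A /dev/sda | grep -i power_loss
# The line below is an illustrative stand-in for such output, in the
# smartctl -A column layout (ID, name, flags, value, worst, thresh, ...).
line='175 Power_Loss_Cap_Test 0x0033   100   100   010    Pre-fail  Always       -       1'
# Pull out the attribute name (field 2) and normalized value (field 4).
out=$(echo "$line" | awk '{ printf "attr=%s normalized=%s", $2, $4 }')
echo "$out"
```

If the normalized value sits above the threshold column, the drive's own capacitor self-test is passing; a failed test should also show up in the drive's error log.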

No idea about NVMe; I just looked at some SSDs I have but couldn't make anything out. Maybe errors will show up in the error log.

Short of desoldering the capacitors and measuring them, which I wholeheartedly do not recommend unless you have Louis Rossmann-level skills, you probably have to trust that the caps work, trust the firmware to check them regularly, and watch for SMART errors. The caps employed are usually good-quality Nichicon or Panasonic parts rated for 105 °C and 2,000-5,000 hours, which at a server operating temperature of 40 °C means roughly 10x more, i.e. up to ~50,000 hours, before they go out of spec. That would be ~6 years. Presumably the PLP functionality still works even at 50% capacitance, so we are talking more like 10-20 years of capacitor lifetime.
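That derating can be sanity-checked numerically. The rule of thumb used below (electrolytic lifetime doubles for every 10 °C below the rated temperature) is a common alternative to the flat 10x factor above; the 2,000 h / 105 °C rating is taken from the post, and the result lands in the same 10-20 year ballpark:

```shell
# Back-of-envelope capacitor lifetime at server temperature, assuming the
# "lifetime doubles per 10 degC below rating" rule of thumb.
rated_hours=2000   # datasheet rating (lower bound from the post)
rated_temp=105     # degC
op_temp=40         # degC, typical server internals
hours=$(awk -v h=$rated_hours -v rt=$rated_temp -v ot=$op_temp \
  'BEGIN { printf "%.0f", h * 2^((rt - ot) / 10) }')
years=$(awk -v h=$hours 'BEGIN { printf "%.1f", h / 8760 }')
echo "$hours hours ~ $years years"
```

Even from the pessimistic 2,000-hour rating, the doubling rule predicts roughly two decades at 40 °C, consistent with the estimate above.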
 

BLinux

cat lover server enthusiast
Jul 7, 2016
2,669
1,081
113
artofserver.com
That's (almost) how a (mobile gaming) company from Serbia tested PLP with different SSDs in 2015: Power Failure Testing with SSDs
Yeah, I found that too in my own search for an answer. It's an interesting approach: they log the I/O transactions over the network, pull the plug, the "other" node that stays up has a record of what I/O happened, then they boot the downed system back up and verify against the log.
 

UhClem

just another Bozo on the bus
Jun 26, 2012
434
247
43
NH, USA
Question in title. Products marketed with PLP are fine, and one can open up the drive and examine the components to see if the expected hardware is there, but how do you actually go about testing that PLP is working?
"Trust--but verify." ... a wise maxim :)
Do you just write a continuous stream of data and yank the power cord and see what happens?
That should work ... [since you say "power cord", I assume (in script below) you mean system power (vs SSD power)]
How do you record the last bit of data that hit the drive's cache?
[assuming Linux] (shame on you, @BLinux, if that's a bad assumption)
Note, also, that we need to write directly to the tested SSD device itself (or a partition on it); else our "log" could lose sync with the SSD cache. dev-name (below) should be, e.g., /dev/nvme0n1 or /dev/nvme0n1p3.
Code:
# write 1 MiB at a time, directly to the device (O_DIRECT), logging each step
for i in {0..1000000}
do
    echo -n "$i "    # log the index about to be written
    dd if=/boot/vmlinuz of=dev-name oflag=direct bs=1M seek=$i count=1 2> /dev/null
    echo "== $i"     # log that the write completed
done
This will output:
Code:
0 == 0
1 == 1
2 == 2
...
In an attempt to (completely/significantly) fill the device cache, don't pull the cord until N (the "== N" in the output) reaches ~ C / (B - W)
where
C = cache size (in MiB)
B = bus speed (in GB/sec)
W = estimated sustained write speed (in GB/sec)
[Edit:] Hence, e.g., >>IF<< C is 100000 (~100 GB) and B is 3.2 (PCIe gen3 x4) and W is 1.2, then N = C / 2
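Plugging the example numbers into that rule of thumb (the values are the post's own; real cache size and speeds vary per drive):

```shell
# When should the device cache be full, per N ~ C / (B - W)?
C=100000   # cache size, MiB (the post's example figure)
B=3.2      # bus speed, GB/s (PCIe gen3 x4)
W=1.2      # estimated sustained write speed, GB/s
N=$(awk -v c=$C -v b=$B -v w=$W 'BEGIN { printf "%.0f", c / (b - w) }')
echo "pull the cord around iteration N = $N"
```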
Will probably take several seconds.
[Make note of that last N at cord-pull -- so it's best to use an actual console (or ssh in from a different system), which won't go blank at cord-pull.]

Then, after a fresh boot, do:
Code:
# signature: three octal words from the first line of the source file's dump
od /boot/vmlinuz | head -1 | cut -d " " -f 7-9 | tr " " x > /tmp/cut-me
# scan the device (decimal addresses), x-joined the same way, for that signature
od -Ad dev-name | tr " " x | grep `cat /tmp/cut-me`
That last command will run (and spew) for a long time. Go have a meal, watch a show ... [It could be made a lot faster by having originally written to a file (instead of the device), but that would risk the accuracy of N if, at cord-pull, the file metadata actually written lagged the (O_DIRECT) output.]

When it stops spewing, you can ctrl-C it.
For the last line of output: take the leading (decimal) number [preceding the first "x"], divide it by 1048576 (1 MiB), and compare with that N (noted just before cord-pull). How's your PLP? -- Did you get it all???

[Disclaimer: yes, I know I could have added " | tail -1 | ... cut ... | bc ..." instead of the ^C, but ... :)]
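The signature-matching step can be rehearsed on ordinary files first, with no raw device or reboot involved. This is only a sketch of the same idea: the /tmp paths are made-up stand-ins, awk replaces the cut field selection, and tr -s tolerates od's space padding:

```shell
# /tmp/plp-src stands in for /boot/vmlinuz, /tmp/plp-img for the SSD; the
# 4096-byte zero prefix simulates where the "last write" landed on the device.
printf 'PLPTESTSIGNATURE' > /tmp/plp-src
{ head -c 4096 /dev/zero; cat /tmp/plp-src; } > /tmp/plp-img
# signature: first three octal words of the source file, joined with "x"
sig=$(od -An /tmp/plp-src | head -1 | awk '{ print $1 "x" $2 "x" $3 }')
# scan the image with decimal addresses, squeezing spaces to "x" to match
hit=$(od -Ad /tmp/plp-img | tr -s ' ' x | grep "$sig")
# the leading decimal number is the byte offset where the data was found
offset=$(echo "$hit" | cut -d x -f 1)
echo "found at byte offset $offset"
```

The offset recovered this way is what gets divided by 1048576 and compared against N on the real device.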
 
Last edited:

UhClem

just another Bozo on the bus
Jun 26, 2012
434
247
43
NH, USA
Consider the above hack (an attempt at) a proof of concept. If it appears viable, there are tweaks to make it faster and more robust.

Note: edited above. C (cache size) of 100GB ?? Wrong! (for at least 5-10 years) Actually, I was thinking DRAM + SLC(?) -- but PLP only needs to protect DRAM, right?

@BLinux : rsvp w/ bugs, results, etc.
 

i386

Well-Known Member
Mar 18, 2016
4,220
1,540
113
34
Germany
but PLP only needs to protect DRAM, right?
Power-loss protection protects the controller, DRAM, and NAND. The DRAM contains data from the OS for in-flight I/O, as well as the page-mapping data the controller needs to manage the NAND.