Slowdown over time of Crucial P3 SSD (vs Intel P4610)


homeserver78

New Member
Nov 7, 2023
26
13
3
Sweden
November last year I started using a newly built NAS. The main storage drives are:

  • One 6.4 TB Intel P4610 (Oracle FW...) hooked up via an adapter card in a PCIe slot. Used for OS + VM images + remote file storage for desktop computers.
  • One Crucial P3 4 TB in an M.2 slot. Used for WORM storage - movies and music.
(I use single-disk ZFS pools on these and periodically snapshot and migrate the data to an HDD on the same system as a first-line backup.)


Although the P4610 data is more "active" than the P3 data, the vast majority of the data on both disks was written initially and hasn't been rewritten since.

Now to the interesting part: the time it takes to scrub the disks! I ran a scrub first thing after having migrated the data to them, and from memory the first scrub took about 20 minutes on the P4610 and 45 minutes on the P3. From 2023-12-12 I have weekly automated scrubs running with email reporting, and the timings look like this:

P4610: scrub repaired 0B in 00:19:41 with 0 errors on Tue Dec 12 23:26:25 2023 => 1033 MB/s (zpool ALLOC "1.22T")
CrucialP3: scrub repaired 0B in 00:59:47 with 0 errors on Wed Dec 13 02:25:51 2023 => 552 MB/s (zpool ALLOC "1.98T")

P4610: scrub repaired 0B in 00:20:31 with 0 errors on Tue Dec 19 13:41:19 2023
CrucialP3: scrub repaired 0B in 01:21:30 with 0 errors on Wed Dec 20 04:51:31 2023

P4610: scrub repaired 0B in 00:19:19 with 0 errors on Wed Dec 27 02:49:20 2023
CrucialP3: scrub repaired 0B in 01:27:57 with 0 errors on Wed Dec 27 05:27:58 2023

P4610: scrub repaired 0B in 00:18:45 with 0 errors on Wed Jan 3 02:48:46 2024
CrucialP3: scrub repaired 0B in 01:24:07 with 0 errors on Wed Jan 3 05:24:08 2024

P4610: scrub repaired 0B in 00:18:01 with 0 errors on Wed Jan 10 02:48:02 2024
CrucialP3: scrub repaired 0B in 01:31:39 with 0 errors on Wed Jan 10 05:31:40 2024

P4610: scrub repaired 0B in 00:20:00 with 0 errors on Wed Jan 17 02:50:01 2024
CrucialP3: scrub repaired 0B in 01:35:10 with 0 errors on Wed Jan 17 05:35:11 2024

P4610: scrub repaired 0B in 00:19:51 with 0 errors on Wed Jan 24 02:49:53 2024
CrucialP3: scrub repaired 0B in 01:39:18 with 0 errors on Wed Jan 24 05:39:19 2024

P4610: scrub repaired 0B in 00:19:10 with 0 errors on Wed Jan 31 02:49:11 2024
CrucialP3: scrub repaired 0B in 01:47:47 with 0 errors on Wed Jan 31 05:47:48 2024

P4610: scrub repaired 0B in 00:18:44 with 0 errors on Wed Feb 7 02:48:45 2024 => 1334 MB/s (zpool ALLOC "1.50T")
CrucialP3: scrub repaired 0B in 01:37:40 with 0 errors on Wed Feb 7 05:37:41 2024 => 343 MB/s (zpool ALLOC "2.01T")
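(For reference, the MB/s figures above are just the zpool ALLOC divided by the scrub duration, treating "T" as 10^12 bytes and MB as 10^6 - e.g. for the first P4610 scrub:)
Code:
# 1.22 TB read in 00:19:41 (1181 s) -> ~1033 MB/s
echo "1.22 * 10^6 / (19 * 60 + 41)" | bc -l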

So clearly the Crucial P3's read speed slows down significantly during the first few weeks after data has been written - to less than half, if my memory of the first scrub is correct! - but it seems to reach a plateau eventually. The P4610, if anything, gets a bit faster with time! This is something I haven't seen taken into account in online reviews of SSDs.

I wonder how something like the WD SN850X does in this regard.
 

pimposh

hardware pimp
Nov 19, 2022
140
78
28
Actually, the P3 is one of the worst drives available on the market. Slowdowns down to circa 100 MB/s due to the crappy controller and QLC NAND are nothing uncommon.

So everything is all right with yours.
 

homeserver78

New Member
Nov 7, 2023
26
13
3
Sweden
Thanks, pimposh. Yeah, I know the P3 is a bottom-of-the-barrel drive, so I don't really have any expectations of its performance. This phenomenon though - the slowdown of reads with the age of the written data - is something I haven't seen mentioned before, and I thought it was interesting. Does it happen with all QLC-based drives? What about consumer-oriented TLC drives? Etc. Perhaps it's something that should be added to reviewers' toolboxes (compare read speed of freshly written data to the same data after a week of idle time).

Anyway, I'm happy with this drive (so far): e.g. indexing of my music collection is still an order of magnitude faster compared to when I had it on an HDD (WD30EFRX).
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,649
2,065
113
The workload is unlikely to actually be 100% READ though... so a mixed workload, a consumer SSD and a low-end version at that... not unexpected.
 

homeserver78

New Member
Nov 7, 2023
26
13
3
Sweden
@T_Minus: You mean you have seen similar behaviour before?

(Again: the slowdown over time is not a problem for me, I just found it interesting. The workload is me writing almost 2 TB to the drive and then reading parts of it back now and again, while also adding a few tens of gigabytes of new data over the last months. Smartctl reports 2.52 TB written and 26.5 TB read. So most of the data on the drive is the same age, from when I first wrote it a few months ago.)

Edit: That 2.52 TB written also includes some testing I did on the drive before taking it "into production".
 

nexox

Well-Known Member
May 3, 2023
699
285
63
I suspect the phenomenon that @T_Minus refers to is how writes affect reads on NAND storage, which tends to be a lot more significant on less sophisticated SSD controllers, and which gets worse as the drive gets more full. If you have any writes running during the scrub, that will slow the reads, potentially a lot, but you wouldn't have seen that when the drive was new, because it would have had plenty of erased blocks ready to take writes without any background shuffling of data to contend with user IO.

It's been a while and technology has moved on a bit, but back when I was paid to abuse storage, using a bursty random write benchmark I was able to make a Samsung 840 Evo hit read latencies of over one second, and an 850 Evo almost half a second, with a write load of only 1 MB/s average. The P3 is likely in about that general performance class, with newer controller technology partly offset by the penalty of QLC.
 

homeserver78

New Member
Nov 7, 2023
26
13
3
Sweden
Hmm, well, that shouldn't be the case here given how the drive's been used, and the 2.52 TB total written reported by smartctl? It's a whole-disk zfs pool used for WORM storage only (no OS, log files or similar). zpool autotrim is on, and it's mounted using noatime. So the 2.52 TB total written checks out. There should still be almost 1.5 TB of space left on the drive that's never been written to. I think this really is due to age of data and not internal fragmentation or lack of erased blocks. (Unless I'm completely missing something?)

Edit: Also, the drive should have been completely erased from the start since I did an nvme format to change block size to 4K.
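(From memory the procedure was roughly this - the LBA format index to pass differs per drive, so check what id-ns reports first:)
Code:
# list the supported LBA formats; pick the one with a 4096-byte data size
nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"
# reformat the namespace with that index (1 here is just an example) - erases everything
nvme format /dev/nvme0n1 --lbaf=1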
 

nexox

Well-Known Member
May 3, 2023
699
285
63
Write amplification is likely the missing bit; from what I understand ZFS is often kinda weak there, as are low-end SSD controllers, so I would assume that the NAND has taken 2-3x as many writes as the SMART host writes stat indicates. If it's actually the age of the write causing read performance to drop, then that's a sign that the NAND is losing charge and requiring re-reads to pass checksum, in which case toss that thing right now.
 

homeserver78

New Member
Nov 7, 2023
26
13
3
Sweden
The smart stats I'm referring to are the "Data Units Written", i.e. the number of 512 kB blocks written. Not the host write stats (serial numbers and likely irrelevant lines removed for brevity):
Code:
# smartctl -a <path-to-disk>

=== START OF INFORMATION SECTION ===
Model Number:                       CT4000P3SSD8
Firmware Version:                   P9CR30A
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4 000 787 030 016 [4,00 TB]
Namespace 1 Formatted LBA Size:     4096

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    51 952 844 [26,5 TB]
Data Units Written:                 4 935 036 [2,52 TB]
Host Read Commands:                 202 961 944
Host Write Commands:                22 769 235
Controller Busy Time:               1 166
Power Cycles:                       29
Power On Hours:                     2 067
Unsafe Shutdowns:                   14
Media and Data Integrity Errors:    0
Error Information Log Entries:      88
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               32 Celsius
Temperature Sensor 2:               35 Celsius
Temperature Sensor 8:               32 Celsius
(The error log entries are "Invalid Field in Command" errors.)

The NAND losing charge and getting harder to read is pretty much the mechanism that I had in mind when I saw these slowdowns. I did not expect the effect to be this large though. So that's why I feel it would be interesting to know how other drives fare.

Either way, I have backups, zfs checks my data integrity, and I can tolerate some downtime, so I will keep this thing running if only for curiosity :D
 

nexox

Well-Known Member
May 3, 2023
699
285
63
The SMART data units written is still generally host writes, not NAND writes, though that may vary by controller. You can always mount it read only during a scrub to test, though you may need to give it some time to finish background work before starting the scrub.
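Something along these lines should do it (pool name is just a placeholder, and note the scrub itself may still update some pool metadata):
Code:
zfs set readonly=on tank     # block user writes to all datasets in the pool
zpool scrub tank
zpool status tank            # watch progress / final scrub time
zfs inherit readonly tank    # back to normal once the scrub is done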
 

homeserver78

New Member
Nov 7, 2023
26
13
3
Sweden
I started a manual scrub and it seems the scrub itself generates a little bit of write activity, which is reflected in the Data Units Written and Host Write Commands fields. So yeah, maybe there's (yet another) bug in ZFS in that it doesn't trim blocks freed during a scrub, and maybe the Data Units Written only reflects host writes and not actual writes to the NAND; in combination that could make the drive believe there are no free blocks left for wear leveling and whatnot, and make it start shuffling data around so that reads get slower.

I'm not sure how one would untangle this except by writing raw data (no file system) to a freshly erased drive and then reading it back immediately + later and comparing. I don't think I'm interested enough to do that though.
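Something like this with fio is roughly what I have in mind (device path, size and job options are just placeholders, and the write pass destroys whatever is on the drive):
Code:
# write a fixed region of the raw device, bypassing any filesystem
fio --name=fill --filename=/dev/nvme1n1 --rw=write --bs=1M --size=500G \
    --direct=1 --ioengine=libaio --iodepth=8
# read it back immediately after writing...
fio --name=read-fresh --filename=/dev/nvme1n1 --rw=read --bs=1M --size=500G \
    --direct=1 --ioengine=libaio --iodepth=8 --readonly
# ...then re-run the read job after a week/month of idle time and compare MB/s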
 

homeserver78

New Member
Nov 7, 2023
26
13
3
Sweden
For completeness, here's the smart data after the scrub completed. The starting conditions were the same as the earlier data, i.e. no further data was written between taking the previous smart data and starting the scrub. (This scrub took 1:43:51 to finish, BTW.)

Code:
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    56 264 691 [28,8 TB]
Data Units Written:                 4 939 317 [2,52 TB]
Host Read Commands:                 219 818 373
Host Write Commands:                22 983 298
Controller Busy Time:               1 268
Power Cycles:                       29
Power On Hours:                     2 070
Unsafe Shutdowns:                   14
Media and Data Integrity Errors:    0
Error Information Log Entries:      93
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               33 Celsius
Temperature Sensor 2:               43 Celsius
Temperature Sensor 8:               33 Celsius
So the scrub led to an additional 4281 data units written, or about 2.2 GB. The scrubs done so far thus still amount to way less than what would be required to fill up the 1.5 TB of erased blocks remaining, even if ZFS didn't trim them. And even if the drive does a lot of write amplification internally, it should still keep track of the amount of free (trimmed) blocks, right? So I'm still thinking that this has to do with data age rather than something else.
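(A smartctl "data unit" is 1000 x 512 bytes = 512 kB, so:)
Code:
echo "4281 * 512000 / 10^9" | bc -l   # ~2.19 GB written by the scrub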
 

pimposh

hardware pimp
Nov 19, 2022
140
78
28
Does it happen with all QLC-based drives? What about consumer-oriented TLC drives? Etc.
To some extent yes but no. Enterprise-grade QLC drives these days are getting better with every iteration, and with a certain budget available I wouldn't pass on them myself.
TLC, due to its nature, is also prone to slowdowns, but simple math can give you an answer as to how many more rewrites at the cell level might be necessary in each case (xLC vs xLC), plus the factor of NAND layer counts increasing. In general this simple matter gets very complicated.

Perhaps it's something that should be added to reviewers' toolboxes (compare read speed of freshly written data to the same data after a week of idle time).
Testing scenarios always differ depending on the workload, but to simplify all of this: fill the drive up to 70% a couple of times, then play with fio in a way similar to the workload you will have - either sequential or random things (r/w) - e.g. along the lines of the sketch below.
2-3 hours of playing and you'll get an impression of whether the drive is going to serve you or annoy you.
At the end of the day that is all you want to know, vs. understanding internal drive behaviour at the NAND level.
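For example something like this (device path and numbers are just illustrative - point it at your own data/patterns):
Code:
# pre-fill ~70% of the drive a couple of times
fio --name=prefill --filename=/dev/nvme1n1 --rw=write --bs=1M --size=70% --direct=1
# then run a mixed random workload resembling real use and watch the numbers
fio --name=mixed --filename=/dev/nvme1n1 --rw=randrw --rwmixread=70 --bs=4k \
    --runtime=600 --time_based --direct=1 --ioengine=libaio --iodepth=16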

Everything is there on STH if you dig deeply enough.

P3
Combination of a single-core controller/CPU (per this) + the fact that scrubbing is never a read-only case + probably very poorly implemented garbage collection + the fact that it's HMB/DRAM-less + slow NAND + a COW file system.

There's a lot to consider regarding what is going on in the background.
In the case of your drive, the worst thing is that you will never be able to get it back to pre-fill write/read speeds unless you format the namespace completely with the nvme tool.

And emptying/moving around filesystem-level blocks is somewhat different from the NVMe drive's controller/CPU clearing internal NAND pages. Although the goal is the same, these processes aren't completely related to each other, which also magnifies this issue.

And if one combines N such drives in a pool, the results are annoying to say the least, and point again to: buy nice or buy twice.
Unfortunately, these days it's easy to step into the shallow waters of consumer crap without a bit of diligence. This might be a helpful site for you, with all the data stored in one place and a nice chart for dummies (no pun intended). This as well.
 

homeserver78

New Member
Nov 7, 2023
26
13
3
Sweden
To some extent yes but no.
Yes but no, huh? :)

2-3 hours of playing and you'll get an impression of whether the drive is going to serve you or annoy you.
At the end of the day that is all you want to know, vs. understanding internal drive behaviour at the NAND level.
For the umpteenth time, I already know that this drive serves me well (so far at least). :) This thread is not about solving a problem (or using multiple of these drives in a pool, or getting back to higher speeds, or...). On the contrary: it's all about understanding the internal drive behaviour. I guess I'm a nerd! :D But thank you for trying to help.

There's a lot to consider regarding what is going on in the background.
Indeed. But in the end, if a sector has been trimmed the drive knows it's free, and given enough idle time (which this particular drive has lots of) it should be able to consolidate blocks to erase. But most importantly, this drive has only gotten some 2.5 TB of writes in total during its entire lifetime, at least 2 TB of which was at its start-of-life. The read slowdown has happened gradually since, and mostly during a few weeks when just a few tens of gigabytes were written. I'm not sure how you correlate that with poor garbage collection, or lack of DRAM, or a COW file system.

But if you have detailed knowledge that actually explains how these things could cause this particular type of slowdown, I'm all ears.

---

Apart from the NAND losing charge and getting more difficult to read, I'm thinking one possible reason for slowdown of reads over time could be the drive consolidating written data, either from SLC/MLC/TLC to QLC storage, or maybe even to weaker NAND blocks, to have more reserves for future writes? Several weeks of idle time seems "a bit" much for QLC consolidation, but to find weaker NAND blocks? Maybe? I'm totally speculating/thinking out loud here; I have no idea if this is a thing that's even done. Maybe there's some other reason for the drive to slowly move data around in a way that happens to make reads slower?
 

nexox

Well-Known Member
May 3, 2023
699
285
63
Are you on the latest firmware with the P3? It could easily have a TRIM bug or anything else. Background consolidating and preemptive erasing is itself a source of write amplification, perhaps the controller doesn't do too much because the lifetime of super cheap QLC is not great. Also if ZFS is issuing TRIM commands during the scrub, that could interfere with read requests as well; with no DRAM or PLP, that data has to be written to NAND, and historically cheap drives use simple data structures that can be very write-intensive to update.

You may also want to try various nvme-cli commands to see if that will tell you more than SMART, even just nvme list should show how many blocks are used in the namespace (at least it does with enterprise drives.)
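For example (device path is a placeholder):
Code:
nvme list
# nsze / ncap / nuse from identify-namespace show size vs. actual utilization
nvme id-ns /dev/nvme0n1 | grep -E "nsze|ncap|nuse"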
 

twin_savage

Member
Jan 26, 2018
58
32
18
33
Apart from the NAND losing charge and getting more difficult to read, I'm thinking one possible reason for slowdown of reads over time could be the drive consolidating written data, either from SLC/MLC/TLC to QLC storage, or maybe even to weaker NAND blocks, to have more reserves for future writes? Several weeks of idle time seems "a bit" much for QLC consolidation, but to find weaker NAND blocks? Maybe? I'm totally speculating/thinking out loud here; I have no idea if this is a thing that's even done. Maybe there's some other reason for the drive to slowly move data around in a way that happens to make reads slower?
The behavior on the consumer drive you are experiencing is almost certainly NAND cell charge decay.
The SLC cache recovery on these consumer drives happens within a matter of minutes to maybe an hour for the very largest SSDs with a high SLC cache ratio.

I'd expect the consumer drive to continue to slow down by 1-2 orders of magnitude from its original read performance over the course of a couple of years (depending on the pattern of writes); this is par for the course on consumer SSDs. Most enterprise SSDs don't experience this problem because of some of the clever algorithms they've added to their GC routines.

We've been having this same discussion over on L1 and even had Malventano drop in on the thread with some nuggets of wisdom.
 

homeserver78

New Member
Nov 7, 2023
26
13
3
Sweden
Are you on the latest firmware with the P3? It could easily have a TRIM bug or anything else.
I guess I am; at least Crucial haven't released any updated firmware for the drive.

Background consolidating and preemptive erasing is itself a source of write amplification, perhaps the controller doesn't do too much because the lifetime of super cheap QLC is not great.
Yes, but even if the controller does do a lot, 1) it shouldn't affect the number of known-by-the-drive free blocks over time, and 2) it should reasonably have caused the same amount of slowdown from the start, after the first 2 TB was written, since very little data has been written since.

Also if ZFS is issuing TRIM commands during the scrub, that could interfere with read requests as well; with no DRAM or PLP, that data has to be written to NAND, and historically cheap drives use simple data structures that can be very write-intensive to update.
Again, why would that cause a significant slowdown now, but not when the data was fresh?

You may also want to try various nvme-cli commands to see if that will tell you more than SMART, even just nvme list should show how many blocks are used in the namespace (at least it does with enterprise drives.)
For both my P3 and the three namespaces I have on my P4610, this shows the usage as 100 % (4 TB/4 TB, 5.92 TB/5.92 TB, 240.52 GB/240.52 GB etc). Unfortunately 'man nvme-list' doesn't give any explanation of that field at all. It looks almost like it lists how much partitioned space each device contains?
 

homeserver78

New Member
Nov 7, 2023
26
13
3
Sweden
The behavior on the consumer drive you are experiencing is almost certainly NAND cell charge decay.
The SLC cache recovery on these consumer drives happens within a matter of minutes to maybe an hour for the very largest SSDs with a high SLC cache ratio.
Interesting, thanks!

I'd expect the consumer drive to continue to slow down by 1-2 orders of magnitude from its original read performance over the course of a couple of years (depending on the pattern of writes); this is par for the course on consumer SSDs.
That's rather horrible. I would have expected (or at least accepted) that kind of slowdown of writes due to the drive filling up, reducing the available SLC cache. Two orders of magnitude slowdown of reads though ... o_O

We've been having this same discussion over on L1 and even had Malventano drop in on the thread with some nuggets of wisdom.
That sounds interesting, I'd like to read that! Is this the thread? Ssd data retention