Hey All,
I've been scratching my head at this for a bit. I posted this on Reddit, but I figure the experienced folks here will know better how to approach or troubleshoot it. It's proving elusive because I can't get any meaningful logging out of my Synology NAS or the drives.
Re-writing my post here properly, to save some clicks/reading, I hope:
These are two 2TB Micron 7450 Pro NVMes, installed in a Synology RS1221+ via an E10M20-T1 PCIe card.
They are detected fine and without error, and:
- I can benchmark (read and write to) the drives.
- I can perform read-write tests via CLI using tools like fdisk, mkfs.ext4, and dd (example commands just after this list).
- msecli and nvme can read information about the devices just fine.
- I can reformat the namespaces. I tried 4K and 512-byte formats, without any new outcomes so far.
- I can get device SMART and log counts/outputs, and there are no problems or events of interest logged.
- Errors are not incrementing on the drives between each attempt at reproducing the issue, according to SMART info.
- I have installed the latest Micron firmware into one of the firmware slots and reloaded/reset/rebooted.
- Seagate Firecuda 530 NVMes were working completely fine on the PCIe card, as cache.
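For concreteness, the CLI read/write and SMART checks look something like this (illustrative commands; the device names are whatever your namespaces enumerate as, and the dd writes are destructive):

Code:
# Raw write/read sanity tests straight to the namespace (destroys any data on it)
dd if=/dev/zero of=/dev/nvme0n1 bs=1M count=1024 oflag=direct
dd if=/dev/nvme0n1 of=/dev/null bs=1M count=1024 iflag=direct

# SMART stats and error-log entries via nvme-cli
nvme smart-log /dev/nvme0
nvme error-log /dev/nvme0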
Attempting to create an SSD read-write cache on them, however, fails every time; I can retry, to no avail:

"The system failed to mount an SSD read-write cache on Volume 1. Please try again."
And if I view the SMART stats via the UI, it errors out, unable to gather SMART stats, and marks the drive as Critical.
Interestingly, I also can't write to either of the devices' namespaces via CLI anymore; I just see Operation not permitted errors.

Either the namespace or controller is entering some problem state that I can't discern/find from CLI and logs, or the NAS' OS is preventing writes to the devices (a bit more likely, but still a guess).
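If it's the latter, one thing that might narrow it down is whether the kernel has flagged the block devices read-only; this should be checkable with stock util-linux (a guess on my part, I haven't verified what DSM actually flips):

Code:
# Query the kernel read-only flag for each namespace (1 = read-only)
blockdev --getro /dev/nvme0n1
blockdev --getro /dev/nvme1n1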
I can get back into a usable state by clearing the critical state of the devices in the Synology DBs. I run the following Python script to simplify/automate this:
Python:
import sqlite3

# DSM disk-state databases that record the drives' health/critical status
database_paths = [
    "/var/log/synolog/.SYNODISKDB",
    "/var/log/synolog/.SYNODISKTESTDB",
    "/var/log/synolog/.SYNODISKHEALTHDB"
]

# Note: substitute these with your actual drive serial numbers
target_strings = ["your-serial-number1", "your-serial-number2"]

for db_path in database_paths:
    # Connect to the SQLite database
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Fetch all table names
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    tables = cursor.fetchall()

    for table in tables:
        table_name = table[0]

        # Fetch column names for the table
        cursor.execute(f"PRAGMA table_info({table_name});")
        columns = [column[1] for column in cursor.fetchall()]

        # Drop every row that mentions one of the target serial numbers
        for col in columns:
            for target in target_strings:
                delete_query = f"DELETE FROM {table_name} WHERE {col} LIKE ?"
                cursor.execute(delete_query, (f"%{target}%",))

    # Commit changes and close the connection
    conn.commit()
    conn.close()

print("Finished processing all databases.")
Then you have to disable and re-enable the NVMe devices and rescan the PCI bus. I scripted some of the following (a consolidated sketch is below the list), but here's a rundown:

1. Run lspci -q to list and look up all devices.
2. Locate the Non-Volatile memory controller entries and note the IDs.
3. Run echo "1" > /sys/bus/pci/devices/0000\:06\:00.0/remove, where "06" is the correct PCI ID from step 2.
4. Run the same thing for the other device: echo "1" > /sys/bus/pci/devices/0000\:07\:00.0/remove
5. Run sleep 1 to wait one second, particularly if scripting this stuff.
6. Run echo "1" > /sys/bus/pci/rescan to rescan and pick up the NVMes again.
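The scripted version is roughly this (a minimal sketch; the PCI IDs are examples and need to match your own lspci output):

Code:
#!/bin/sh
# Detach both NVMe controllers from the PCI bus, then rescan.
# 0000:06:00.0 and 0000:07:00.0 are example IDs; substitute your own.
for dev in 0000:06:00.0 0000:07:00.0; do
    echo 1 > "/sys/bus/pci/devices/$dev/remove"
done

# Give the kernel a moment before re-enumerating
sleep 1
echo 1 > /sys/bus/pci/rescan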
DSM will immediately detect the cache devices, and the aforementioned Operation not permitted errors upon writing via CLI will cease. I can write to the namespaces again, read SMART stats in the GUI, manage the devices, etc., until trying again to attach them as a cache device.
After one of many additional attempts, I took closer notice of two things.
First, despite there not being a cache at the moment, /dev/mapper/cachedev_0 exists, and is the mount point for several BTRFS snapshot-enabled volumes; maybe more specifically, volumes where the snapshots are browsable.

Second, on one occasion/attempt, I noticed two partitions created on one of the NVMes, interestingly:
Code:
# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 1.8 TiB, 1920383410176 bytes, 3750748848 sectors
Disk model: Micron_7450_MTFDKBG1T9TFR
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xbd083443
Device Boot Start End Sectors Size Id Type
/dev/nvme0n1p1 8192 16785407 16777216 8G fd Linux raid autodetect
/dev/nvme0n1p2 16785408 20979711 4194304 2G fd Linux raid autodetect
That had not occurred before, or at least not on every attempt, and the other NVMe remained empty.

So, my thinking for now is that the cache creation process is attempting some command that isn't supported by the controller, and the controller reports an error (invalid field in CDB, maybe, though I have no logs or events that corroborate any of this). Then the cache creation wizard does not exit gracefully, and I think this leaves the system-side cache device present:
Code:
Disk /dev/mapper/cachedev_0: 48.9 TiB, 53693533650944 bytes, 104870182912 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 65536 bytes / 458752 bytes
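As a side check, dmsetup should show what's actually backing that node; I can't compare against a cache-free system myself, but for anyone who can (standard device-mapper tooling, run as root):

Code:
# Show the device-mapper target(s) behind cachedev_0
dmsetup table cachedev_0
# And the overall device-mapper tree
dmsetup ls --tree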
And it's probably locking the devices down to be non-writable outside the kernel. It just... probably should not do that unless the cache creation process actually finishes.

Caveat for sure: I'm making several assumptions here. Maybe /dev/mapper/cachedev_? always exists for each volume, and is a Synology user's primary mount point for several things, even if you've never created a cache before. I can't confirm/compare this anywhere at the moment.

I can't find anything in the journal or exported syslogs, drive logs, etc. Would greatly appreciate some suggestions or help!