Hey All,
I've been scratching my head at this for a bit. I posted this on Reddit, but I figure the experienced folks here will know better how to approach or troubleshoot it. It's proving elusive because I can't get any meaningful logging out of my Synology NAS or the drives.
Re-writing my post here properly, to save some clicks/reading, I hope:
These are two 2TB Micron 7450 Pro NVMes, installed in a Synology RS1221+ via an E10M20-T1 PCIe card.
They are detected fine and without error, and:
- I can benchmark (read and write to) the drives.
- I can perform read-write tests via CLI using tools like fdisk, mkfs.ext4, and dd (example commands just after this list).
- msecli and nvme can read information about the devices just fine.
- I can reformat the namespaces. I tried 4K and 512-byte formats, without any new outcomes so far.
- I can get device SMART and log counts/outputs, and there are no problems or events of interest logged.
- Errors are not incrementing on the drives between each attempt at reproducing the issue, according to SMART info.
- I have installed the latest Micron firmware into one of the firmware slots and reloaded/reset/rebooted.
- Seagate Firecuda 530 NVMes were working completely fine on the PCIe card, as cache.
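For concreteness, the CLI read/write and SMART checks look something like this (illustrative commands; the device names are whatever your namespaces enumerate as, and the dd writes are destructive):

Code:
# Raw write/read sanity tests straight to the namespace (destroys any data on it)
dd if=/dev/zero of=/dev/nvme0n1 bs=1M count=1024 oflag=direct
dd if=/dev/nvme0n1 of=/dev/null bs=1M count=1024 iflag=direct

# SMART stats and error-log entries via nvme-cli
nvme smart-log /dev/nvme0
nvme error-log /dev/nvme0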
Attempting to create an SSD read-write cache on them, however, fails every time; I can retry, to no avail:

"The system failed to mount an SSD read-write cache on Volume 1. Please try again."
And if I view the SMART stats via the UI, it errors out, unable to gather SMART stats, and marks the drive as Critical.
Interestingly, I also can't write to either of the devices' namespaces via CLI anymore; I just see Operation not permitted errors.

Either the namespace or controller is entering some problem state that I can't discern/find from CLI and logs, or the NAS' OS is preventing writes to the devices (a bit more likely, but still a guess).
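If it's the latter, one thing that might narrow it down is whether the kernel has flagged the block devices read-only; this should be checkable with stock util-linux (a guess on my part, I haven't verified what DSM actually flips):

Code:
# Query the kernel read-only flag for each namespace (1 = read-only)
blockdev --getro /dev/nvme0n1
blockdev --getro /dev/nvme1n1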
I can get back into a usable state by clearing the critical state of the devices in the Synology DBs. I run the following Python script to simplify/automate this:
Python:
import sqlite3

# DSM disk-state databases that record the drives' health/critical status
database_paths = [
    "/var/log/synolog/.SYNODISKDB",
    "/var/log/synolog/.SYNODISKTESTDB",
    "/var/log/synolog/.SYNODISKHEALTHDB"
]

# Note: substitute these with your actual drive serial numbers
target_strings = ["your-serial-number1", "your-serial-number2"]

for db_path in database_paths:
    # Connect to the SQLite database
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Fetch all table names
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    tables = cursor.fetchall()

    for table in tables:
        table_name = table[0]

        # Fetch column names for the table
        cursor.execute(f"PRAGMA table_info({table_name});")
        columns = [column[1] for column in cursor.fetchall()]

        # Drop every row that mentions one of the target serial numbers
        for col in columns:
            for target in target_strings:
                delete_query = f"DELETE FROM {table_name} WHERE {col} LIKE ?"
                cursor.execute(delete_query, (f"%{target}%",))

    # Commit changes and close the connection
    conn.commit()
    conn.close()

print("Finished processing all databases.")
Then you have to disable and re-enable the NVMe devices and rescan the PCI bus. I scripted some of the following (a consolidated sketch is below the list), but here's a rundown:

1. Run lspci -q to list and look up all devices.
2. Locate the Non-Volatile memory controller entries and note the IDs.
3. Run echo "1" > /sys/bus/pci/devices/0000\:06\:00.0/remove, where "06" is the correct PCI ID from step 2.
4. Run the same thing for the other device: echo "1" > /sys/bus/pci/devices/0000\:07\:00.0/remove
5. Run sleep 1 to wait one second, particularly if scripting this stuff.
6. Run echo "1" > /sys/bus/pci/rescan to rescan and pick up the NVMes again.
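The scripted version is roughly this (a minimal sketch; the PCI IDs are examples and need to match your own lspci output):

Code:
#!/bin/sh
# Detach both NVMe controllers from the PCI bus, then rescan.
# 0000:06:00.0 and 0000:07:00.0 are example IDs; substitute your own.
for dev in 0000:06:00.0 0000:07:00.0; do
    echo 1 > "/sys/bus/pci/devices/$dev/remove"
done

# Give the kernel a moment before re-enumerating
sleep 1
echo 1 > /sys/bus/pci/rescan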
DSM will immediately detect the cache devices, and the aforementioned Operation not permitted errors upon writing via CLI will cease. I can write to the namespaces again, read SMART stats in the GUI, manage the devices, etc., until trying again to attach them as a cache device.
After one of many additional attempts, I took closer notice of two things.
First, despite there not being a cache at the moment, /dev/mapper/cachedev_0 exists, and is the mount point for several BTRFS snapshot-enabled volumes; maybe more specifically, volumes where the snapshots are browsable.

Second, on one occasion/attempt, I noticed two partitions created on one of the NVMes, interestingly:
Code:
# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 1.8 TiB, 1920383410176 bytes, 3750748848 sectors
Disk model: Micron_7450_MTFDKBG1T9TFR
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xbd083443
Device Boot Start End Sectors Size Id Type
/dev/nvme0n1p1 8192 16785407 16777216 8G fd Linux raid autodetect
/dev/nvme0n1p2 16785408 20979711 4194304 2G fd Linux raid autodetect
That had not occurred before, or at least not on every attempt, and the other NVMe remained empty.

So, my thinking for now is that the cache creation process is attempting some command that isn't supported by the controller, and the controller reports an error (invalid field in CDB, maybe, though I have no logs or events that corroborate any of this). Then the cache creation wizard does not exit gracefully, and I think this leaves the system-side cache device present:
Code:
Disk /dev/mapper/cachedev_0: 48.9 TiB, 53693533650944 bytes, 104870182912 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 65536 bytes / 458752 bytes
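As a side check, dmsetup should show what's actually backing that node; I can't compare against a cache-free system myself, but for anyone who can (standard device-mapper tooling, run as root):

Code:
# Show the device-mapper target(s) behind cachedev_0
dmsetup table cachedev_0
# And the overall device-mapper tree
dmsetup ls --tree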
And it's probably locking the devices down to be non-writable outside the kernel. It just... probably should not do that unless the cache creation process actually finishes.

Caveat for sure: I'm making several assumptions here. Maybe /dev/mapper/cachedev_? always exists for each volume, and is a Synology user's primary mount point for several things, even if you've never created a cache before. I can't confirm/compare this anywhere at the moment.

I can't find anything in the journal or exported syslogs, drive logs, etc. Would greatly appreciate some suggestions or help!