Samsung is like a box of chocolates, you never know what you're gonna get.
care to attach smart log?
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 990 PRO 1TB
Serial Number: removed for reasons
Firmware Version: 1B2QJXD7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 1.000.204.886.016 [1,00 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 2.0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1.000.204.886.016 [1,00 TB]
Namespace 1 Utilization: 18.244.898.816 [18,2 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 4b214093b8
Local Time is: Mon Mar 20 14:52:21 2023 MZ
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055): Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f): S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 9.39W - - 0 0 0 0 0 0
1 + 9.39W - - 1 1 1 1 0 200
2 + 9.39W - - 2 2 2 2 0 1000
3 - 0.0400W - - 3 3 3 3 2000 1200
4 - 0.0050W - - 4 4 4 4 500 9500
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 10%
Data Units Read: 447.759 [229 GB]
Data Units Written: 702.829 [359 GB]
Host Read Commands: 7.854.856
Host Write Commands: 16.103.321
Controller Busy Time: 142
Power Cycles: 11
Power On Hours: 1.671
Unsafe Shutdowns: 7
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 40 Celsius
Temperature Sensor 2: 50 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
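For anyone checking the bracketed totals in the log above: per the NVMe spec, one data unit is 1000 blocks of 512 bytes, i.e. 512,000 bytes. Here is a minimal shell sketch of that arithmetic. The function name `data_units_to_gb` is made up for illustration, and the counts are entered without the locale's thousands separators.

```shell
#!/bin/sh
# One NVMe data unit = 1000 * 512 bytes = 512000 bytes.
# data_units_to_gb is a hypothetical helper, not part of smartctl.
data_units_to_gb() {
    # $1: data unit count, digits only (e.g. 702829, not 702.829)
    # Integer division truncates to whole decimal gigabytes.
    echo $(( $1 * 512000 / 1000000000 ))
}

data_units_to_gb 702829   # Data Units Written -> prints 359
data_units_to_gb 447759   # Data Units Read    -> prints 229
```

So this drive reports 10% Percentage Used after roughly 359 GB of host writes.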
Controller busy time has nothing to do with failures. If you get a server-pull drive, you are almost guaranteed to see a huge number on that attribute.
By this it looks like it never crashed from temperature, but the controller busy time strikes me as a potential sign of it hanging.
(here's one where temp issues did bring it down - also samsung.)
View attachment 27992
I'd say avoid samsung in general.
BTW, I just got a PM9A1 running the 7301 firmware from a friend, with some data integrity errors recorded on it. I will try writing large amounts of data using IOmeter and do a before/after comparison to see whether the new firmware really mitigates the issue completely. In any case, I would recommend people stay away from new NAND (SSV6+ BICS5 B37/47 and all YMTC/Hynix). They are all pretty problematic in their own ways. Also, do not use desktop drives in even slightly write-intensive applications. This isn't 10 years ago, when firmware could absolutely botch the ECC algorithm and still get away with it.
The host issues read commands on a cold data region at some interval. The error bits on the NAND wordline holding that cold data grow abnormally with the interval reads and finally result in a read UECC.
The read recovery time, which is meant to eliminate the applied voltage, was insufficient. During a read, a certain voltage is applied, and a recovery operation then follows to eliminate that voltage. In this failure case, because insufficient time was given for the recovery operation, a little voltage remains and disturbs other wordlines.
Interesting. How sure are we this is a Windows Server issue?
Samsung drives newer than the 970 (the last to use a Samsung NVMe driver) are not reliable with Windows Server.
Over the past several months I've been debugging issues with both Intel and AMD servers that reset to the BIOS, after which the SSD is not detected until a power off/on cycle. I was really reluctant to suspect that the entire product line of SSDs was buggy, since this is a very popular drive, but I'm certain now, and it isn't specific to a particular firmware version.
On 4 different systems using Asus, AsRock Rack, or Supermicro motherboards, a high-load system would crash every couple of weeks or so without ever writing a minidump. I suspected defective drives and swapped a 1TB drive for a 2TB one, or swapped the 980 Pro for a 990, but the behavior persisted. Meanwhile, several systems with 970 Pros and the same workload run stable for months.
After I swapped my X13SAE-F motherboard for an Asus W680-ACE (thinking the X13SAE was at fault), I tried restoring a SQL Server database from backup and observed that this was 100% effective at crashing the SSD controller, causing it to disappear until a power cycle. I checked every related BIOS setting and all the SQL Server fixes regarding drive sector size, to no avail.
Some of the servers that crash are running MySQL instead of SQL Server, and the only common denominator is high IO load, Windows Server 2016 or later, and Samsung 980 or later. With so many bug reports related to Samsung's firmware issues it's hard to find corroborating bug reports, so I thought I'd share this here.
Be careful with Sabrent Rockets. I've had Sabrent Rocket 4s drop like flies on me. The originals died 2 years in. I had to fight with them to get the RMA (they wanted me to have registered the drives within days of purchase), and then the RMA replacements died less than 2 years after that. I'm talking complete brick: non-responsive, does not appear as a connected device on any system.
...and two systems with Sabrent Rockets are perfectly stable.
There goes my "I only trust Intel and Samsung SSD's" shopping philosophy, as they are the only brands I've ever used that haven't failed on me.
I'd say avoid samsung in general.
Well, I am going to make sure I put heatsinks on mine and point a small fan straight at them, though this might be tricky, what with the M.2 ports on the H12SSL series of motherboards potentially blocking longer PCIe boards if the heatsinks get too tall...
Trust no marketing. Always do your homework: check the temperature limits for the components and stay within them. I have also learnt not to trust temperature reporting from a device that may be failing. When more performance is promised and the use scenario and/or ambient temperature is above normal, an additional heatsink is needed for all brands. Slower NVMe drives don't have that issue as often, since they run out of DRAM and their speed drops, which lowers the heat output generated by the controller.
fio will do random r/w, though I can never remember the commands and always have to look them up. Here's a page of examples to get you started (no endorsement of Oracle, however): Sample FIO Commands for Block Volume Performance Tests on Linux-based Instances
Some brands may be better or worse than others in general terms, but the better way to shop is to just avoid consumer drives and only buy hardware with full PLP, because manufacturers have a lot of incentive to avoid bugs when they sell to big enterprises with large budgets and long memories, whereas consumer drives are primarily designed to win 20 second benchmarks and cost as little as possible to make.
There goes my "I only trust Intel and Samsung SSD's" shopping philosophy
Thank you sir. I will look up fio commands.
Note that if your drives are already down to 60%, random write tests could take a big dent out of the remaining lifetime. You might be better off just assuming these are dead and replacing them with something better soon.
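As a starting point, a mixed random read/write run could look roughly like the sketch below. The job name, test file path, size, block size, queue depth, and 70/30 read mix are all assumptions picked for illustration, not values from this thread. It targets a file rather than a raw device, since pointing fio at a raw device overwrites whatever is on it. The script only prints the command unless `RUN_FIO=1` is set.

```shell
#!/bin/sh
# Hypothetical test file. Use a path on the drive under test;
# --direct=1 may fail on filesystems without O_DIRECT support (e.g. tmpfs).
TEST_FILE="${TEST_FILE:-./fio-testfile}"

# Build the fio command line (illustrative parameters, see lead-in).
fio_cmd() {
    echo "fio --name=randrw-test --filename=$TEST_FILE --size=1G" \
         "--rw=randrw --rwmixread=70 --bs=4k --ioengine=libaio" \
         "--iodepth=32 --direct=1 --runtime=60 --time_based --group_reporting"
}

# Always show what would be run.
fio_cmd

# Actually run it only when explicitly requested and fio is installed.
if [ "${RUN_FIO:-0}" = "1" ] && command -v fio >/dev/null 2>&1; then
    sh -c "$(fio_cmd)"
fi
```

With `--time_based` and `--runtime=60` the run stops after 60 seconds regardless of file size, which keeps the amount written bounded on a drive that is already worn.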
That is true. I take this philosophy with most hardware, but with drives the cost penalty is a little much for my home system, so instead I have just been making sure I have some redundancy and decent backups and buying consumer drives. (Except for my SLOG drives in ZFS, where I used Optanes for obvious reasons.)
Micron/Crucial (MX500) models. (It's a US-based company.)
There goes my "I only trust Intel and Samsung SSD's" shopping philosophy, as they are the only brands I've ever used that haven't failed on me.
Intel exited the market, and Samsung seems to have gone to shit. Now I have no idea what brands I can actually trust.
I know the fins are pointing the wrong way, which is why in my final build I will be sticking a small 40 mm fan on there to make sure I move some air over the fins. It's surprising that Supermicro would point the slots in that direction, knowing that the airflow in most servers is perpendicular to that. They are usually better than that.
Just make sure your finstack has the proper direction so air can pass through it. As they are now, the fins won't get any airflow over them.
Thank you. I am very familiar with smartctl, I just can't run it while I am running an hours-long memtest.
Almost all NVMe drives run hot like that without airflow. I recommend a larger finstack heatsink or proper airflow.
You can use HWiNFO (Windows) or smartctl -a /dev/sdX | grep "Temperature" on Linux to read their temps while running tests.
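To log the temperature over the course of a test instead of reading it once, the grep above can be wrapped in a small polling loop. This is only a sketch: `parse_temp` and `watch_temp` are made-up helper names, `/dev/nvme0` is an assumed device path (NVMe drives usually show up as `/dev/nvmeN` rather than `/dev/sdX`), and the 5 second interval is arbitrary.

```shell
#!/bin/sh
# Hypothetical helpers, not part of smartmontools.

# Read smartctl output on stdin and print the composite temperature value.
# Matches only the "Temperature:" line, not "Temperature Sensor N:".
parse_temp() {
    awk '/^Temperature:/ { print $2; exit }'
}

# Poll a device until interrupted.
# $1: device (assumed default /dev/nvme0), $2: interval in seconds.
watch_temp() {
    dev="${1:-/dev/nvme0}"
    interval="${2:-5}"
    while true; do
        temp=$(smartctl -A "$dev" | parse_temp)
        echo "$(date '+%H:%M:%S') ${temp:-n/a} C"
        sleep "$interval"
    done
}

# Usage (needs root and smartmontools installed):
#   watch_temp /dev/nvme0 5
```

Running it in a second terminal while the load test is going gives a timestamped trace that can be compared against the warning and critical thresholds the drive reports.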