OKAY, thank you very much for the details. I have saved this information, but not on an LTO tape, unfortunately.
Difficult but doable. Also on a budget.
For starters, avoid corruption of data in the running system by employing stable CPU/RAM/board/software combinations, so there is no garbage in/garbage out. This means the full stack from mature application software, mature filesystem (only ZFS can really compete - checksums everything), mature OS (sorry no ReactOS), mature server hardware and no consumer grade (so Xeon E5, Scalable, EPYC, any ECC-RAM).
Hardware should be verifiable and verified regularly by dedicated software for errors. That can be RAM ECC or CPU errors, so smaller E3 Xeons or Ryzen are out, too (not enough uncore-oomph for diagnostics). Observe SMART data from drives, cooling fan performance, ZFS errors during weekly or monthly scrubs, unexpected kernel errors suddenly appearing in the system logs. All drives are vulnerable to data loss. Beaten-up SSDs from ebay left powered off are the worst. Could see unreadable blocks within a year when left sitting in a hot place. HDDs (SATA, SAS) not even one order of magnitude better, I give them a handful of years until they also grow unreadable sectors, e.g. while sitting in a drawer. Only LTO tapes and quality DVD-R/BD-R last 20-50 years until irrecoverable corruption sets in. But discs are 50 GB max and discerning quality is near impossible. Store tapes vertically and in a cold, dark, and moderately humid place.
Only buy hardware with free and easy-to-access firmware updates, especially SSDs and NVMe drives. There should have been 2-3 updates already so the most egregious programming errors are corrected. Flash devices without power-loss protection (PLP) are always the inferior choice and should be avoided. Both items are a nasty source of data corruption if ignored.
Employ aggressive snapshotting with automatic grandfathering (scriptable) so a simple user error, like deleting everything, is easy to correct.
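To make the "grandfathering" idea concrete, here is a minimal Python sketch of a grandfather-father-son retention rule. Snapshot tools implement this for you; the function name `select_keep` and the tier sizes are assumptions chosen for illustration, not the behaviour of any particular tool.

```python
from datetime import datetime, timedelta


def select_keep(snapshots, daily=7, weekly=4, monthly=6):
    """Return the set of snapshot timestamps to keep.

    For each tier, the newest snapshot in each of the most recent
    `limit` buckets (calendar day, ISO week, calendar month) is kept.
    Everything not returned would be a candidate for deletion.
    """
    tiers = [
        (daily, lambda t: t.strftime("%Y-%m-%d")),
        (weekly, lambda t: "%d-W%02d" % t.isocalendar()[:2]),
        (monthly, lambda t: t.strftime("%Y-%m")),
    ]
    keep = set()
    newest_first = sorted(snapshots, reverse=True)
    for limit, bucket_of in tiers:
        seen = []
        for ts in newest_first:
            bucket = bucket_of(ts)
            if bucket not in seen:
                if len(seen) == limit:
                    break  # tier is full, older buckets are dropped
                seen.append(bucket)
                keep.add(ts)  # first hit is the newest in this bucket
    return keep


if __name__ == "__main__":
    # Hypothetical hourly snapshots covering February and March 2024.
    snaps = [datetime(2024, 2, 1) + timedelta(hours=h) for h in range(60 * 24)]
    for ts in sorted(select_keep(snaps)):
        print(ts.isoformat())
```

The point of the tiers is that recent history stays dense while older history thins out, so an accidental deletion noticed a day or a month later can still be rolled back.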
Write backups to disk and then to LTO tape. Google the 3-2-1 backup scheme. Tape to create an airgap between encrypting blackmail trojans and your data. Tape to take stuff offsite in case there is a fire or you get robbed. Backup without verification is just wishful thinking. Verify restore is working once a year.
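One way to turn "verify the restore" into something checkable is to compare checksums of the source tree against a test restore. The following is a rough Python sketch under the assumption that both trees are mounted as ordinary directories; the helper names (`sha256_of`, `manifest`, `compare`) are made up for this example and are not part of any backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare(source_root, restored_root):
    """Return (missing, changed): files absent from or different in the restore."""
    src, dst = manifest(source_root), manifest(restored_root)
    missing = sorted(set(src) - set(dst))
    changed = sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k])
    return missing, changed
```

Note that this only proves the restore matches the current source. If the source was already corrupted before the backup ran, both sides will agree on the bad data, which is why storing the manifest alongside older backups is worth considering.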
Consider lightning. One 300kA ("Wilder Hausrüttler", roughly "wild house shaker") strike into the building will zap everything with a chip. Isolate PoE cable runs, 110/230V mains and DSL/telephone lines with quality protection devices. If fiber is available, get fiber.
Prepare for power outages and UPS failures. APC used to be sturdy, but now they're just another Made in China shop. Eaton is the same, though I currently still like them better. For UPS failures, an automatic transfer switch (ATS) can switch over to mains within 20 milliseconds should your UPS shut down due to failure.
Personally I use 2nd gen Xeon Scalables, RDIMM ECC RAM, HGST helium SATA drives, Micron MAX SSDs, Supermicro and ASRock Rack boards, Seasonic power supplies. For storage I have a NetApp DS4246 shelf with redundant LSI 9207-8e controllers. For networking I use Mellanox Ethernet cards, cables and switches. Lightning protection by Dehn and some specialty brands.
On the software side I run ZFS on Linux, with pyznap for snapshots, Bareos for backup to LTO tape, rasdaemon, smartd, zed, with comprehensive error reporting by e-mail.
Also sorry for lack of brevity. Didn't have time to write a short answer, so I wrote a long one instead.
Problem to solve: VM host/server and important-to-me data.
Probably only by testing/verifying the data regularly...
ZFS (the magical silver bullet), NTFS and other filesystems will happily persist corrupt data they get from an application (I had a somewhat recent problem where Mp3tag changed more than just the ID3 tags).
Btw, what problem are you trying to solve?
For an enterprise, the backup strategy and hardware could be totally different from what you would need for your Plex library at home.
1) You mean consumer boards? Easy: While they have a "supports ECC" sticker on the box, how can you be certain that the CPU is actually performing periodic patrol reads of the RAM, looking for errors? And that the OS will get a report about it? You can't be certain with consumer boards; you need heavy server-grade hardware for that. E3 for entry features, formerly E5, and these days EPYC and Xeon Scalable for the best feature set. The board manufacturer also has to test that this error detection and reporting to the operating system actually works.
2) See 1.
3) Lower total max capacity compared to RDIMMs: usually 2 channels with 4 sticks, not 6 or 8 channels with 2 sticks each as on big boards. UDIMMs are possibly more expensive. RDIMM is standard stuff, and DDR4, for example, is available in quantity.
4) Any M.2 NVMe or SATA SSD that has PLP. You need a search engine that allows parametric searches. Personally I use Micron MAX 5200 and 5300 SATA drives for VMs. Not many simultaneous users, though, and any databases fit into RAM.
5) ZFS RAIDZ2 if you have 8 large disks or more. Below that, RAIDZ1. If you only have two disks, mirror.
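The rule of thumb in point 5 and its capacity cost can be written down in a few lines. This is only a back-of-the-envelope Python sketch: `suggest_layout` is a hypothetical helper, and the capacity figure ignores ZFS metadata, padding and reserved space, so real usable space will be somewhat lower.

```python
def suggest_layout(disk_count, disk_tb):
    """Return (layout, approximate usable TB) for equally sized disks.

    Follows the rule of thumb above: 2 disks -> mirror,
    3 to 7 disks -> RAIDZ1, 8 or more -> RAIDZ2.
    """
    if disk_count < 2:
        raise ValueError("need at least two disks for redundancy")
    if disk_count == 2:
        return "mirror", disk_tb
    parity = 2 if disk_count >= 8 else 1
    return f"raidz{parity}", (disk_count - parity) * disk_tb


if __name__ == "__main__":
    for n in (2, 4, 8, 12):
        print(n, "disks of 10 TB:", suggest_layout(n, 10))
```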
If you are just starting out, don't go overboard, start cheap like with an E3 HP Z240. For something bigger, check out this guy:
Some Tips about using a Dell PowerEdge T640 as a Workstation (or Frankenstation) (vcojot.blogspot.com)
Some Tips about PowerEdge as Workstation (Revisited for 14th Gen servers) (vcojot.blogspot.com)
The Dell T140 as a frankenstation : Compact, Silent and Powerful enough. if you feel adventurous and like to solder
Some Tips about running a Dell PowerEdge Tower Server as your workstation (vcojot.blogspot.com)
Budget is "flexible", but I might rephrase the question: on the second-hand market, what is the cheapest EPYC CPU + board combo that will get me started as an entry point to ECC RDIMM and other important features?
We can't know what 'break the budget' is unless you've told us a budget.
Honestly, if your goal is storage and some VMs, without major CPU load, you could just get a Synology or QNAP pre-built NAS box and move on.
> As far as ECC and storage corruption, those are completely unrelated.
As a former user of ZFS in a professional environment (Datto employee) across a very large fleet of devices, I can emphatically say that a lack of ECC and a stick of RAM that has gone bad can and does result in ZFS corruption in a very reproducible fashion. Reproduce means:
1. I detect corruption (errors at the bottom of `zpool status -v`)
2. I rollback or destroy data to clear the corruption
3. Corruption comes back in live data/newer snapshots
4. Replacing the bad RAM and destroy/rolling back the corruption again results in permanent resolution of the issue.
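Step 1 above can be automated so that corruption gets noticed without someone reading the output by hand. Here is a minimal Python sketch that extracts the file list from the `errors:` section of `zpool status -v` text. The `SAMPLE` output is made up for illustration, and the parsing assumes the usual "errors: Permanent errors have been detected..." wording, which may differ between OpenZFS versions.

```python
import subprocess

# Illustrative, hand-written sample; not captured from a real pool.
SAMPLE = """\
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /tank/media/album.flac
        tank/vm@daily-1:/disk0.img
"""


def permanent_errors(status_text):
    """Return the entries listed under 'errors: Permanent errors ...'."""
    files, collecting = [], False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("errors:"):
            collecting = "Permanent errors" in stripped
            continue
        if stripped.startswith("pool:"):
            collecting = False  # next pool's section begins
        if collecting and stripped:
            files.append(stripped)
    return files


def read_status():
    """Run `zpool status -v` and return its text (requires ZFS installed)."""
    result = subprocess.run(
        ["zpool", "status", "-v"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    print(permanent_errors(SAMPLE))
```

If the same files, or new ones, show up again after a rollback, that matches the "corruption comes back" pattern from step 3 and points away from the disks and toward RAM or another component in the data path.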
Use good drives. Scrub your pool regularly. Use ECC. Replicate your data, preferably offsite.
You can use Ryzen and ECC as it is fully supported. I would just as soon say buy a used E5, because of more features, more RAM capacity, and maybe a lower price, but it is not a requirement. Memory scrubbing helps to mitigate multi-bit errors proactively before they can occur by detecting single-bit errors and fixing them (the "correcting" part of ECC). Without scrubbing, single-bit errors still get fixed when they are detected on read, but it is not a proactive approach. Once a multi-bit error occurs, the system is halted (the "detecting" part of ECC), as it is favorable to stop execution rather than continue with known bad data. Most likely, even with a system that does not have a BMC, you would see a BERT (Boot Error Record Table) entry during startup that would tell you that you ran into a multi-bit error that caused the system to stop.
Section 18.3 Advanced Configuration and Power Interface Specification (uefi.org)
acpi, apei: Add Boot Error Record Table (BERT) support - Patchwork (kernel.org)
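To check that ECC events actually reach the OS, one option on Linux is to look at the EDAC counters in sysfs, where `ce_count` holds corrected and `ue_count` uncorrected errors per memory controller. The Python sketch below assumes an EDAC driver for your platform is loaded; if the directory is missing or empty, that by itself says error reporting is not wired up. `read_edac_counts` is a name invented for this example.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs.

    An empty dict means no memory controller is exposed, for example
    because no EDAC driver is loaded for this platform.
    """
    counts = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers with unreadable counters
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A slowly rising corrected count on one controller is the early warning that scrubbing is supposed to give you before a multi-bit error halts the machine.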
> Do you reckon that if I went the Ryzen route, instead of "budget" EPYC route that I was asking for guidance on, that I would be missing some critical features and is it a major disadvantage to have unregistered ECC as that is the only kind that Ryzen supports ?
If you are using it for home use, even for data you care about, I see no issue. The downsides of consumer Ryzen boards are very limited: mainly a lack of out-of-band management, meaning that if the system hangs or goes down, physical intervention would be necessary. Unregistered just costs more. There are some technical differences, but using unbuffered ECC as opposed to registered ECC does not jeopardize your data.
With EPYC or one of the AM4 ASRock Rack boards that has IPMI, you could remotely intervene if there was a problem that caused a loss of access to the OS (e.g. system hang, network misconfiguration, etc.). The more enterprise platforms generally allow you to have better uptime. You're not jeopardizing your data on a consumer platform with ECC, but you may be limiting the uptime reliability, if that makes sense.
To be clear, when I was referring to 'storage corruption' I was talking about corruption in the storage devices themselves, not on the path to/from them, or in the CPU/RAM/etc. All of these are important things to consider, but I'd classify the larger group as 'data corruption', not 'storage corruption'. In that context the use or non-use of ECC RAM in the machine isn't relevant, as it has no effect on the storage devices' ability to properly return uncorrupted data that they had been given to store.
Now I am really confused.