Proxmox Ceph and ECC vs non-ECC


marcel

New Member
Oct 15, 2016
19
4
3
51
Does ECC memory have any value on a Proxmox Ceph cluster, or is non-ECC good enough for a homelab?
 

Stephan

Well-Known Member
Apr 21, 2017
944
712
93
Germany
The reason why most people get away with non-ECC is that the window for truly critical writes of data from RAM to disk is tiny.

Consider RAM bit-flips from decaying atoms within chip packaging or super-fast cosmic ions from outer space. If you run DDR5, structures within the chip are so tiny and fragile that it has to have ECC internal to each chip (on-die ECC) to work reliably. What's more, most flips will not matter in the long run. Flips will likely not be in a hot code path that leads to a crash. And if they are, just reboot. Flips will change data that will be overwritten within minutes or days, or just show benign effects such as a letter on this page changing from a to b. Hit reload and that b is an a again. As long as flips aren't written to disk as corrupt data, a simple reboot will always cure it.

And even if bad data is written to disk, frankly a lot of data in a homelab is worth little anyway. A flip somewhere in a movie will likely not even be noticed. A flip in some database will be flagged by the database software's checksumming code and thrown out, or maybe once every 10-20 years you will wonder what bug introduced suspiciously wrong values into your Grafana panel. Or a temperature gauge in Grafana might show 31 instead of 30 and you will never even find out. Or a flip did happen that destroys an entire file. But these are Schrödinger's bit-flips, because you will never look at the data again. Like pictures from a party 30 years ago. And even if you do, you took 200 photos and only one is bad. Since you married the cute girl from that event anyway, you have tons more where that came from by now.
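To make the checksumming point concrete, here is a minimal Python sketch of how a stored checksum exposes a single flipped bit. It uses CRC32 from the standard library purely as a stand-in; real databases and filesystems use their own checksum schemes, and the record format and the `flip_bit` helper are made up for illustration.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted (hypothetical helper)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


# Illustrative record; the checksum is computed while the data is still good.
record = b"temperature=30"
stored_crc = zlib.crc32(record)

# Flip the lowest bit of the last byte: ASCII '0' (0x30) becomes '1' (0x31).
corrupted = flip_bit(record, (len(record) - 1) * 8)

print(corrupted)                           # the 30 has silently become 31
print(zlib.crc32(corrupted) == stored_crc) # False: the mismatch reveals the flip
```

Note that a checksum only catches flips that happen after it was computed. If the bit flips in RAM before the software checksums the data, the corrupt value is checksummed and stored as if it were valid, which is the gap ECC closes.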

That idyllic picture of digital stoicism and nihilism changes once you run one million or ten million computers, like CERN or Google. RAM errors will become annoying. Did we just detect a first clue to dark matter, or was it just a bit flip? That kind of annoying.

The people who invented ZFS, myself, and a bunch of others here run a tight ship: only a few machines, but run like CERN's. One undetected flip in 100 years is too much. It's the spirit of pushing for more, where at the end of this path of thinking stand the pyramids. Okay, maybe not. Personally, I have witnessed a LOT of online sites and data valuable to me disappear into digital Sto-vo-kor over the decades. Not a week goes by where I do not wonder where I'd get some obscure blog post if it wasn't for archive.org. So I save and archive, a lot, and often. If you read up on destroyed ancient libraries like Alexandria's or Carthage's (excuse the ignorance of important Indian, Persian et al. libraries, equally burned to the ground by mad men), you may encounter a man named Cassiodorus. Around 550 AD he was your first important private-sector datahoarder, bridging the gap from antiquity into and beyond the Middle Ages. He archived often, and a lot, and he even did 3-2-1 backups: copies in multiple places. So when the Republic could no longer protect its citizens and institutions from destructive vandalism, the works survived. Practicing good data integrity habits like ECC elevates you to a circle of people who take comfort from knowing that whatever group of mad men, savages or vandals may manage to burn down civilization as we know it, it could be rebuilt from the collection of material of any one of your average datahoarders.

On the practical side, with ECC you get error counters that let you detect dying memory sticks.
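As a rough sketch of what reading those counters can look like on Linux: when the kernel's EDAC driver is loaded for your memory controller, it exposes corrected (`ce_count`) and uncorrected (`ue_count`) error totals per controller under `/sys/devices/system/edac/mc/`. The Python below assumes that layout; the function name is made up, and on machines without ECC or without an EDAC driver the directory is simply missing or empty.

```python
from pathlib import Path


def read_edac_counters(root="/sys/devices/system/edac/mc"):
    """Collect corrected/uncorrected error counts per memory controller.

    Returns a dict like {"mc0": {"ce_count": 0, "ue_count": 0}}.
    An empty dict means no EDAC data was found (assumed: no ECC or no driver).
    """
    counters = {}
    base = Path(root)
    if not base.is_dir():
        return counters
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counters[mc.name] = entry
    return counters


if __name__ == "__main__":
    found = read_edac_counters()
    if not found:
        print("No EDAC memory controllers found")
    for mc, counts in found.items():
        print(f"{mc}: corrected={counts['ce_count']} uncorrected={counts['ue_count']}")
```

A corrected-error count that keeps climbing on one controller is the usual hint that a stick is on its way out, well before it causes uncorrectable errors.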
 

marcel

Thank you for the explanation! Much appreciated.

It is just a homelab. I'd like to host a couple of VMs on a Proxmox cluster with Ceph, plus some storage for personal data and a few movies. I'm making plans for a new homelab now.
 

marcel

I've decided to go the Minisforum MS-01 route rather than build a system myself. The MS-01 has no ECC memory, if I'm correct, but I like the idea of a small barebone system. Thank you all for your input. :)