What's the state of the art for mission-critical systems on COTS hardware?

This isn't a request for specific advice; it's more curiosity. I'm not up on all of the buzzwords, trends, and technologies, so I'd like to know what's out there and read a little about each topic, so I can say "yup, I may need that someday" or "nope." A lot of what exists now didn't exist when I last looked into this. (For instance, I'm seeing new things about fault-tolerant and high-availability virtualization, which apparently makes a virtual machine more reliable than the hardware it runs on, because it can migrate between hosts or be restarted automatically.)

For instance, I've heard that big mainframe gear has features that, to my knowledge, still have no equivalent at the desktop level, such as multiple processors that compare outputs to catch a one-in-a-trillion processing error, where performance doesn't matter but perfection does. Is there any way to program virtual machines to replicate this? Likewise, could software replace the need for ECC RAM (or enhance it for even greater error tolerance where it's already used)? Are we at the point where a virtual machine might have an uptime of decades, being migrated between servers that reboot independently? How does one update software that is in use and can never be shut off?

Clustering seems to count for a lot, as does virtualization, and so does designing systems for fault tolerance and automatic failover. Data integrity with newer filesystems like ZFS seems remarkably well thought out. What is left to do in the future? What is coming soon? Have we hit the point where the right systems and software can make even a lowly desktop PC node reliable enough, in aggregate with other systems in a cluster, to run a nuclear reactor or a life-sustaining function where a mistaken calculation has lethal consequences?

Sorry if I'm full of questions; you guys here just seem cutting edge and knowledgeable, so I figured you'd know what's out and what's upcoming better than anyone. :) I'm assuming the greatest problem is, and will always remain, software, and I wonder whether anything has ever been proposed to make even that more reliable, like running multiple versions of an application in parallel and comparing the outputs. (That's similar to how I think the five computers on the Space Shuttle work, a pretty mission-critical task: I believe there are two sets of software and five nodes, and a disagreeing answer is outvoted and assumed to be an error.)
 

Scott Laird

Active Member
Aug 30, 2014
257
102
43
A "once in a trillion" processing error would happen constantly in any reasonably powerful system. In a top-of-the-line current GPU it'd potentially happen ~10x/second.
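The arithmetic behind that estimate can be checked with a quick sketch. The throughput figure below is an assumed round number (about 10^13 operations per second for a high-end GPU), not a measured value, and the errors are assumed to be independent:

```python
def expected_errors_per_second(ops_per_second: float, error_probability: float) -> float:
    """Expected number of faulty operations per second, assuming independent errors."""
    return ops_per_second * error_probability


# Assumption: ~1e13 operations/second, a round figure for a high-end GPU.
GPU_OPS_PER_SECOND = 1e13
# "Once in a trillion" expressed as a per-operation probability.
ONE_IN_A_TRILLION = 1e-12

rate = expected_errors_per_second(GPU_OPS_PER_SECOND, ONE_IN_A_TRILLION)
print(f"about {rate:.0f} expected errors per second")
```

At that throughput, a one-in-a-trillion error is an everyday event rather than a rare one.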

Take a look at what Google, Facebook, etc. do: they generally don't use redundant hardware at all. No redundant power supplies, no RAID (not even mirroring, not even for boot drives), no redundant networking. If anything fails, then the entire server is dead. If power or networking fails, then a rack (or even part of a datacenter) is dead. All redundancy is added in software, in layers well above individual machines. If one machine dies, then pick a different one and use that instead, re-replicating data as needed. Redundancy and spares only show up where the cost of redundancy is less than the expected cost of failure.
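As a rough illustration of "pick a different one and use that instead," here is a minimal sketch of failover done in software. The names `call_with_failover`, `dead_server`, and `healthy_server` are hypothetical stand-ins for real network calls, and this is not a description of how any particular company implements it:

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when every replica in the list has failed."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful response."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except ConnectionError as exc:
            # This machine is treated as dead; move on to the next one.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed")


def dead_server() -> str:
    # Hypothetical replica that simulates a failed machine.
    raise ConnectionError("server is down")


def healthy_server() -> str:
    # Hypothetical replica that simulates a working machine.
    return "ok"


result = call_with_failover([dead_server, healthy_server])
print(result)
```

The caller never needs the individual machine to be reliable; it only needs enough replicas that at least one answers.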

If you want to be able to cope with random math failures in a CPU, then you *could* custom-design a machine with multiple CPUs that run in lock-step and vote on each instruction to make sure they're correct, or you could run the same calculations on multiple cheap servers and compare the end results. I'm having a hard time coming up with a situation where the hardware solution makes financial sense. Space flight, I guess, where mass is critical, manual maintenance is impossible, decade-plus operation is a requirement, and you're paying 8 or 9 digits up front.
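The "run the same calculation on multiple cheap servers and compare the end results" option might look something like the sketch below. The replicas here are plain functions standing in for separate machines, and `majority_vote` and `run_replicated` are hypothetical helper names:

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value reported by a strict majority of replicas."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no strict majority among replicas")
    return value


def run_replicated(task: int, replicas: Sequence[Callable[[int], int]]) -> int:
    """Run the same task on every replica and vote on the end results."""
    return majority_vote([replica(task) for replica in replicas])


def good(x: int) -> int:
    return x * x


def faulty(x: int) -> int:
    # Simulates a replica that produced a wrong answer (e.g. a flipped bit).
    return x * x + 1


answer = run_replicated(7, [good, faulty, good])
print(answer)
```

Voting on end results only protects against replicas that fail independently; if every replica runs the same buggy code, they will all agree on the wrong answer.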
 
Right, that's actually what I'm curious about: examples like Google and Amazon, where they basically need and have 100% uptime, and every workaround is done in software. They also seem to get nearly unlimited scalability almost exclusively through parallelism, which works because it suits their highly parallel workloads. (Any problem requiring sequential computation, like certain encryption algorithms deliberately designed to be non-parallelizable, would just do what it always does: bottleneck, take resources, and slow things down.)

I'm aware custom machines could be designed to do what I understand Power8-type servers already do (which is exactly that), but I'm curious whether there are already software packages that can bring that kind of redundancy to the desktop through virtualization. :) I mean with existing software that doesn't require many resources. It sounds like VMware is halfway there with its fault-tolerant VMs. I could easily see something like "spawn 3 VMs and compare states," and, as a further step for software testing, 5 VMs running two different software versions (the old, known-to-work version on 3 and the new one on 2 for testing, with the new version's outputs ignored at first but compared against the running servers, then eventually brought online).
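The "outputs ignored at first but compared" idea can be sketched at the application level rather than the VM level. This is only an illustration under that assumption; `serve_with_shadow`, `stable_version`, and `candidate_version` are hypothetical names, not features of VMware or any other product:

```python
from typing import Callable, List, Tuple

Mismatch = Tuple[str, str, str]  # (request, stable response, candidate response or error)


def serve_with_shadow(
    request: str,
    stable: Callable[[str], str],
    candidate: Callable[[str], str],
    mismatches: List[Mismatch],
) -> str:
    """Answer from the stable version; run the candidate only for comparison."""
    response = stable(request)
    try:
        shadow = candidate(request)
    except Exception as exc:
        # A crash in the candidate must never affect the real response.
        mismatches.append((request, response, repr(exc)))
    else:
        if shadow != response:
            mismatches.append((request, response, shadow))
    return response


def stable_version(text: str) -> str:
    # The old, known-to-work behavior.
    return text.strip().lower()


def candidate_version(text: str) -> str:
    # The new version under test; it forgets to strip whitespace.
    return text.lower()


log: List[Mismatch] = []
responses = [
    serve_with_shadow(r, stable_version, candidate_version, log)
    for r in ["Hello", " World "]
]
print(responses, log)
```

Once the mismatch log stays empty for long enough, the candidate could be promoted to serve real traffic.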

And none of this is so much about making sense right now; it's just total curiosity about what's out there and what's coming next. I'm learning about things in virtualization that weren't even concepts ten years ago when I last investigated it, so I'm curious what they'll be doing in 2022.
 

Scott Laird

I don't know about Amazon, but very little at Google is virtualized as such (ignoring Google Cloud, of course). Building a redundant VM manager that gets full performance is more or less impossible, and *still* doesn't actually solve a lot of interesting problems. Try statefully replicating a single VM to 3 different continents, for example, without being so slow that a calculator from the 70s would outperform you. Also, magic VMs largely limit you to ~1 instance of anything, which isn't useful at scale.

If you want highly reliable, high-performance services, then you really need to design them to be highly reliable, and not try to shim reliability in after the fact. There are a bunch of design patterns that make this workable. Generally, you want to break complicated services up into manageable components, and then work on making each component reliable. You end up doing things like making your front ends stateless, so it's trivial to run hundreds or thousands of them as needed. Bits that need to be stateful are carefully designed to work in active-active mode, not master-slave. Each layer depends on load balancing to communicate with the other layers, because that isolates higher layers from failures in lower layers. You need to pay attention to cascading failures, because multithreaded systems with queuing pretty much always have unexpected behavior when overloaded.
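One common way to keep an overloaded queue from turning into a cascading failure is to bound it and reject work once it is full. The post doesn't name a specific technique, so treat this as just one possible mitigation; `LoadShedder` is a hypothetical class built on Python's standard `queue` module:

```python
import queue


class LoadShedder:
    """Bounded request queue that rejects new work instead of growing without limit."""

    def __init__(self, max_pending: int) -> None:
        self._pending: "queue.Queue[int]" = queue.Queue(maxsize=max_pending)

    def submit(self, request: int) -> bool:
        """Return True if the request was queued, False if it was shed."""
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            # Failing fast lets the caller retry elsewhere instead of piling up.
            return False

    def next_request(self) -> int:
        """Take the oldest queued request; raises queue.Empty if none is pending."""
        return self._pending.get_nowait()


shedder = LoadShedder(max_pending=2)
accepted = [shedder.submit(i) for i in range(4)]
print(accepted)
```

Rejecting early gives the load balancer in the layer above a clear signal to send the request to a different instance.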

Try reading some of the papers that Google has published, like "Large-scale cluster management at Google with Borg" and maybe "Google - Site Reliability Engineering".