Maybe I'm just getting old...

pricklypunter · Aug 31, 2019

It seems like lately everything I get involved with falls apart. I had a couple of disks fail at home, losing my redundancy from a rather large ZFS raidz2 pool. No problem, I thought, I can still pull the data off in a degraded state while I wait on replacements. Nope, it wasn't to be: 10 mins in, another one crapped out, array offline, no getting that data back anytime soon, if at all. Of course, ZFS being the helpful soul that it is, kept trying to re-silver everything in sight ad nauseam, making it almost impossible to achieve any meaningful disk access.

Moving on, I pull the 2 definitely dud disks and work on repairing the last one, which I do, eventually, and get it running again. I break sweat, but finally recover 10TB of data successfully, albeit that it was now spread over at least a half dozen random smaller disks of varying years. Not a great place to be in, but at least I got my data back. Replacement disks arrive and I set about getting things back together. More re-silvering...

Wondering why not just wipe the pool and restore from a back-up? Well, I do have one, but it's on the other side of the planet and would involve even more pain to get it here. The really important stuff, I have backed up locally, but that only amounts to about 4TB and is easily portable.

Moving on again, stupidity took hold of me at this point, as it does, and I decided, as it was all bolloxed up anyway, I may as well upgrade everything along the way and start fresh. I cut my teeth not too long ago on Debian, and although I have dabbled with other distros over the years, I like it and it has been good to me, so I move up to using Buster for all my VMs. I'm no expert by any means, but I have been playing with it long enough to find my way around, and anyway, that's what Google was invented for, i.e. to give you all the answers to the questions you didn't actually ask, but decide you need to know now because they popped up in front of your eyeballs during a search and you didn't want to waste the electricity.

I also decide to bring my ESXi up to date and apply the latest 6.7u3 patch. I thought, in for a penny...
I should have had someone slap me at this point.

Well, the ESXi thing went sideways immediately, and I was left with a non-bootable Host. That was a real pain in the proverbial to get going again, and took me almost 2 days of head scratching, desktop hypervisors and a new Sandisk USB stick to fix. I did eventually get back to where I started, but my head well and truly hurt afterward. My only saving grace was that I usually keep at least 6 old saved config bundles on my laptop, having learned my lesson after last time. As it happened, I needed the oldest one. Now, for some reason I can't explain, despite being able to happily patch up and down until now, I no longer can. The patches all fail with the dreaded ERRNO28 and for various odd random reasons: tools issues, out of disk space when there's plenty, various file issues, cryptic error messages etc. It's an issue that is going to come back to haunt me, I can feel it, but I'll take the win and for the time being, the Host is back up and running, so I move on to upgrading all of my VMs. This should be the easy bit, I thought.

It all looks like it's going great, except, now for some reason my iSCSI config will not load on (re)boot. Much, much digging led me to a bug report from April (before Buster was released, I hasten to add, so it could have been fixed beforehand, or at least an update released) stating that one of the libraries used by targetcli is missing a couple of things. Now it does run, if I manually restore my config, but the first couple of times I rebooted, I was like wtf, where did my targets go? Then, not realising that my config was still there, just not loaded, I proceeded to exit the shell, wiping out said config. The pain didn't stop there, as the new fancy ZFS 0.8.1 doesn't want to play straight out of the box either. Again, after much pain trawling through other people's tales of woe, I finally stumble my way through, one by one, all of the issues I was seeing and begin figuring out what the hell was going wrong. I don't remember having any of these ZFS issues when just testing Buster, but I was using 0.8.0 then. Well, it turns out that some libraries are in the wrong place for a start, then there's the fact that it can't import your pool, then it will, if you do it manually, but won't on boot, oh wait, was that my iSCSI, wtf, where did my targets go. You get the idea...

In the end I deleted everything I had just done, got a bare Debian VM install going and took a snapshot. I got so tired of trying to remember which steps didn't work and had to be reversed again that I just blew it away each time and tried some other approach.

I get to the end of all of that misery and finally get things running again, and the power supply in the Host dies, taking the bloody RAM with it. So I quickly do a shonky "get the wife happy again" wiring job (meaning I'll need to spend even more money replacing my new Seasonic cables), swap in some untested RAM and get things going again, while I'm waiting on replacements. All was going great, that is, up until tonight, when some of my disks began dropping out again, quickly followed by ZFS trying to build me a life raft out of 1's and 0's, which means I can't get at my damned data again. I feared the worst, I could hear my wallet sobbing in the kitchen from the basement, but it turns out that the onboard LSI controller on the Mainboard is now dying. So I have added in a spare controller to get me going yet again just now, and have just ordered a new X9SCM-F to swap out the Mainboard with, because I know where this is heading. I think when I finally get done with this, I'll have the chassis re-galvanised, it feels like that's the only thing I haven't replaced yet.

I swear, I should go out and buy a lottery ticket tonight...

Evan · Aug 31, 2019

Always the way though, isn't it, nothing ever breaks and gets fixed in a simple way :-/
Being old may be an advantage in this stuff. I always have this feeling that mostly only people of a certain age can really pull everything together end to end with complex systems.

amalurk · Aug 31, 2019

Old brains know the value of time so they have a lower tolerance for shit not working like it should.

Stephan · Aug 31, 2019

10 days ago I updated my personal Linux VM box from qemu 4.0 to 4.1. A couple of days later I found out a Windows 7 VM was severely corrupted, to the point where it would not even boot anymore. What??

First I suspected a regression introduced by the 5.2-stable kernel, which I use together with some custom patches for wifi. Would not have been the first case, been bitten before by stuff greg kh puts out as "-stable". Stable my ass... Anyone who lived through the last disk corruption case about 9 months ago (see 201685 – Incorrect disk IO caused by blk-mq direct issue can lead to file system corruption) that took painful 5 weeks to remedy, will know what I mean. I do hope "Ming Lei" got fired for being that sizable jerk after it was determined that his/her commits were at fault.

Then I suspected rootflags=data=journal,nodelalloc which I introduced only weeks earlier to fortify ext4 better against data loss or trash written into files. Some CERN people hinted in their tracker that this might be a little-tested code path. That wasn't it, either.

After qemu downgrade to 4.0 things calmed down. I did some more tests and finally figured something about qcow2 compression from pre-4.1 must have been tripping up qemu 4.1. Wow.

All in all this cost me a good 3 working days to resolve. It made me rethink whether Arch Linux is such a suitable distribution for me after all. I will give it some more chances because I have many useful customizations that otherwise work really well. Better and more elegantly than possible in overcomplicated, aka bordering on unfixable, RPM/DEB based systems.

My own lessons:

- Always have world-class backup. Borg backup with its compression and deduplication really is a game changer, because here I could walk back the daily backup history to figure out the last good image. Then I correlated that with changes made after that. In addition I use LTO6 and LTO5 tape backup with Bareos. Just in case the whole machine goes belly up, lightning strikes, or the next bug wipes all data. At most I will lose one working day of data. "3:2:1" is the rule: Keep 3 copies of data at all times, on 2 different media, with 1 copy offsite.

- Keep your setup simple. Complexity is enemy #1 when things go wrong. That's why I use ext4 on mdraid and not ZFS or BTRFS. Using out-of-tree filesystems always sounded like a slightly bad idea to me. In the future I may even switch to mergerfs and snapraid. One drive dead, no problem, two drives dead, only files on that drive will be gone, not the whole file system. Add a larger drive? Sure, just make sure the parity drive is at least as big as the latest added. Complexity extends to complicated pass-through VM setups for Plex transcoding acceleration, networking (do you really need hitless L2 failover in the living room?), or OS choice (install-only, you can't upgrade because it will be broken, like Ubuntu 14 LTS to 16 LTS in-place upgrades and similar disasters). The longer the critical chain, the more chances something will break.

- Choose LTS releases. I really needed to learn this the hard way, but unless I super urgently need a feature from -head or -stable, I will preferably use LTS releases. Let the latest greatest data corruption fsck-up be figured out by people who get paid for it at RedHat, Canonical, etc.

- When doing any non-trivial update, always be prepared for things to break. Qemu was spitting out warnings the minute I upgraded and rebooted that VM. Next time I'll audit logs a lot more aggressively than before. Same with Windows, same with VMware. No-one is paying substantially for Linux QA, commercial ventures have fired their teams. So no-one is really doing any QA anymore to reach mainframe-like code stability. Or at least aspire to that level. In addition, software is under constant adaptation pressure from constantly evolving hardware. It has been like this for 20-30 years. And so things will break, often.

- If you do not like long downtimes, having spare hardware around in a box that you could use as replacement is always a good idea.
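As a rough illustration of two of the rules of thumb above, here is a minimal Python sketch. It assumes the live data counts as one of the 3 copies, and it reads the parity advice as "the parity drive must be at least as large as the largest data drive". The `BackupCopy` type, the function names and the example sizes are all hypothetical and not part of Borg, Bareos or SnapRAID.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class BackupCopy:
    name: str      # hypothetical label, e.g. "live array"
    medium: str    # e.g. "disk" or "tape"
    offsite: bool  # True if the copy is stored at another location


def satisfies_3_2_1(copies: List[BackupCopy]) -> bool:
    """At least 3 copies, on at least 2 different media, at least 1 offsite."""
    media = {c.medium for c in copies}
    return len(copies) >= 3 and len(media) >= 2 and any(c.offsite for c in copies)


def parity_is_large_enough(parity_tb: float, data_disks_tb: Iterable[float]) -> bool:
    """Assumed rule: parity must be at least as big as the largest data drive."""
    return parity_tb >= max(data_disks_tb, default=0.0)


# Example setup (illustrative values only).
copies = [
    BackupCopy("live array", "disk", offsite=False),
    BackupCopy("daily dedup backup", "disk", offsite=False),
    BackupCopy("tape set", "tape", offsite=True),
]
print(satisfies_3_2_1(copies))
print(parity_is_large_enough(8.0, [4.0, 6.0, 8.0]))
```

Dropping the offsite tape from the list, or adding a data drive bigger than the parity drive, makes the corresponding check fail, which is the point where the setup needs attention before something breaks.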

ecosse · Aug 31, 2019

amalurk said:
Old brains know the value of time so they have a lower tolerance for shit not working like it should.

And despite all the standards on interoperability that sh&t doesn't seem to work any better than it did 20 years ago. Wasted half a day trying to get a Denon AV to perform a simple component switch - f-useless!!

pricklypunter · Aug 31, 2019

Ahh...good, I feel less alone in my misery! Sometimes I just want to stop farkin' about with all this crap and just sit down and watch a TV show on the Plex that I have been struggling so hard for weeks to keep alive for everyone else to watch.

Stephan · Aug 31, 2019

Just been thinking about this some more: One has to wonder if there is even any interest at all to sell pretty much "perfect" products. I.e. stuff that is secure, "just works", and does not demand much maintenance at all.

The reason I am saying this is that I think one can draw parallels between the IT crowd and the lawyer/accountant crowd. The latter have lobbied for decades to pile ever more laws and regulations on top of the existing pile, making it pretty much impossible to run a mid/large size company without such people. The built-up complexity ensures that they keep their jobs.

Compare this with IT today: Everything is getting more difficult to manage, more complex to integrate, more costly to operate. Is the IT crowd copying the lawyer and accountants crowd here? Establishing more and more job safety by making it hard for the non-IT-crowd to replace them with something? Or optimize the job away without going to great lengths?

I wonder.

pricklypunter · Aug 31, 2019

I think the ultimate problem is that this is all consumer driven. It is the consumer that is demanding things be nice and easy and "just work" right out of the box. There is little, if any, input from the end user now in the setup or configuration of pretty much any product. Of course, behind the scenes, in order for all of that to happen, the layers of complexity are pushed our way, and to those further up and down the line. Not to mention having to think about everything on behalf of the consumer. I think back to when I was younger: there was very little in the way of products that didn't have to come with at least some instruction booklet or other, and that didn't require at least some effort on the part of the end user to get it installed and working. There is also the fact that as time passes, our goals seem to be bigger, better, faster, stronger, more efficient etc, and are ever increasing. That brings its own complexity as we strive to achieve new technological advancement.

I heard what, to me, was a truly shocking idea being floated around about schools, on the news the other day. They are talking about having no need of pen and paper during exams, because the kiddos no longer write anything, it's all point and click! All fine and dandy, till there's a power cut or the batteries run flat, or you run out of toner. It makes me really wonder about the direction we are all headed in, because even if you don't like the look of the horizon, you are on the bus along with the rest of us heading for it.
