arcconf rescan 1

Myth · Jun 25, 2018

Hey guys,

I have a server in production and it has a failed HDD. It's actually been failed for over 6 months. Luckily it was RAID 6 so it's been running smoothly on RAID 5 ever since the HDD failed.

I've verified that the HDD is actually dead, by pulling it and trying to get it to boot on a server in the lab. But it doesn't boot. So I know it's a faulty HDD.

So I bring a new HDD into the production server, it flashes a little blue light for a second, it spins up and then it doesn't do anything.

We use Adaptec controller cards, and the maxView software doesn't show the drive at all. I've already told the client to reboot the server, but he's afraid it will make things worse. He then said that in two weeks they will have no need for the data on that RAID array, so he asked me if he should wait until the data is negligible (after a TV show is completed and delivered).

I told him it was his choice. Then I get a call from his boss and they are so angry. It's a political nightmare.

But anyways, they want this server fixed, they don't want to pay for a backup. They don't want to reboot it.

So I was thinking of running arcconf rescan to see if it could find the new drive that way. What do you guys think? There is talk of it being a bad backplane, because some of the LED identifiers don't work when you highlight an HDD in the software and try to locate it physically on the backplane. But I think the reason the HDD isn't automatically recognized is that the slot has been offline for over six months.

What do you guys think?

OH also, drive 0 now has 16,000 aborted commands and I'm afraid it won't last another two weeks.

It's so difficult to troubleshoot because the client (their tech guy) is so sure it's a backplane failure, but I'm 90% positive it's just an HDD issue. So if I tell the client to do something and it fails, they judge me as incompetent.

i386 · Jun 25, 2018

I have some RAID controllers from Adaptec, a 6805 and an 81605ZQ. Both detect "new" devices after a few seconds in maxView.

The only time the controllers didn't detect a drive was when I hadn't inserted the SFF-8087 cable properly. The LEDs would blink when HDDs/SSDs were inserted, but the controller wouldn't see them.

Myth said:
What do you guys think?

About the situation? Or the controller?

Myth · Jun 25, 2018

Do you think that the rescan feature would work to detect the new HDD?

i386 · Jun 25, 2018

I don't know.

aero · Jun 25, 2018

Is the data backed up in some fashion?
That should be priority #1 in my opinion.

I think the most sane course of action after that is to do a rescan with the new drive installed.
https://storage.microsemi.com/en-us...ew_ug.xml/Topics/Rescanning_a_Controller.html
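If you'd rather do it from the command line than from maxView, here is a rough sketch of what I mean: list the physical devices, rescan, list them again, and diff the two listings to see whether anything new showed up. It assumes the controller is number 1 (as in the thread title), that `arcconf` is on the PATH and run with admin rights, and that a short wait after the rescan is enough for the listing to update. The helper names and the wait time are mine, not anything from Adaptec's documentation.

```python
import difflib
import subprocess
import time


def arcconf(*args):
    """Run arcconf with the given arguments and return its stdout as text."""
    result = subprocess.run(
        ["arcconf", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def listing_changes(before, after):
    """Return a unified diff (list of lines) between two device listings."""
    return list(
        difflib.unified_diff(before.splitlines(), after.splitlines(), lineterm="")
    )


def rescan_and_compare(controller="1", wait_seconds=30):
    """List physical devices, rescan the controller, list again, return the diff.

    controller="1" and wait_seconds=30 are assumptions for illustration.
    """
    before = arcconf("getconfig", controller, "pd")
    arcconf("rescan", controller)
    time.sleep(wait_seconds)  # assumed delay; adjust for the actual hardware
    after = arcconf("getconfig", controller, "pd")
    return listing_changes(before, after)
```

Calling `rescan_and_compare()` and printing the returned lines shows what changed. An empty list means the controller still sees the same devices, which would point more toward the cabling or backplane than toward the rescan itself.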

ecosse · Jun 26, 2018

One thing I learnt through a number of cloud designs is to always provide a standard backup for any environment other than dev. Do not give anyone the option to skip backups, and keep a minimum of 7 days' retention. For dev, you obviously back up the repos.

I'd probably back up the server daily, or at whatever interval matches the data loss (RPO) you think the client can afford. Learn from it for next time (and get better monitoring, of course!)

No actual help on fixing it: as long as I could back up the data, I'd probably leave it in place. If you have another setup with the same set of disks, you could copy the data across and do a card swap. That would work assuming the backplane isn't the faulty part. Or do you have a second server that you could use?

P.S. Virtualisation / containerisation helps in this regard - easier mobility

moblaw · Jun 27, 2018

I know it sounds odd, but can you low-level format the new drive? Then insert it, and the manager should pick it up. The Adaptec manual will tell you to reboot and press CTRL+R to force a rebuild. The rebuild would probably take 1-3 days, and the array will be in an even more fragile state once the rebuild has started. HDDs are more prone to fail during a rebuild, so if the array can tolerate a complete deletion in a few weeks, I'd not bother; the risk is higher than the reward.

16,000 aborted commands could very well be a backplane/cable problem as well as an HDD failure in general. But still, if a second drive fails, the array is still in working condition. 16,000 aborted commands is a lot.

You will run into downtime trying to fix this problem, that's for sure. So keep it running.

Myth · Jun 27, 2018

I offered a backup for the client but the client refused the service because he didn't want to pay for it. But I'm sure if it fails, then he will blame us.

I've been having heart palpitations when dealing with this client and my boss; both men are egotistical maniacs.

Oh, and also, I told this client 6 months ago to replace the failed HDD, but the client was shopping around for cheaper options, so he said, "I'm looking at my options, I'll let you know."

EffrafaxOfWug · Jun 27, 2018

Myth said:
I offered a backup for the client but the client refused the service because he didn't want to pay for it. But I'm sure if it fails, then he will blame us.

You need to get them to confirm, in writing, that they absolve you of responsibility if the data is lost. If they're going to hold you hostage over a) a perfectly routine hardware failure and b) purposefully failing to allow you to mitigate that failure then you need to be prepared to walk away from the table.

Every situation I've ever been in when a client has refused to budget for backups and then blames everyone but themselves when some data is inevitably lost has been a grade-A twazzock looking to sue someone for their own incompetence. Don't be that victim.
