[SOLVED] - DL380 Gen9, iLO_300_DriveStatusChanged_Failed


Tim

Member
Nov 7, 2012
Problem solved!

A short recap; see also the summary in post #18 of this thread.

After updating the server with the latest SPP and populating box3 with 8 SATA HDDs,
the server sent its first service request and HPE support called me.
Before this I had used the server with two SSDs without any trouble.
Two service requests later, HPE support started to take this issue seriously.

Doing all sorts of HDD/SSD rotation (see post #5) didn't reveal anything beyond
the fact that the bug was triggered by populating box3 with all 8 disks.
The server was green/OK on all parts; the only symptom was the annoying service request.

The HPE custom ESXi image was a bit more helpful in that it at least reported
which disk the server thought was causing trouble (a false positive).
But the web interface is a bit buggy and unreliable since wbem/sfcbd is unstable.
And again it was not consistent: everything reported green/OK, but different disks
were flagged as failing at random, pointing to a false-positive bug.

HPE support first replaced the backplane, cables and P440ar; the bug was still present.
Then they replaced the motherboard, and now everything is OK.



Original first post follows:




This is the second time the server has been booted, and each time it has sent a service request to HPE.
The error message that is sent to HPE is "iLO4_300_DriveStatusChanged_Failed".

But I can't figure out what's wrong.

In the iLO4 service event log I have the following:
The event ID is 300, indicating a Physical Disk Drive Service Event.
The event category is HPQSA0300.

This first happened after installing the latest SPP (everything went OK).
The only thing installed at the moment is the ESXi 6.5 HPE custom image,
booting off the internal USB (official HP USB key).

There's no error to be found on the system.
The P440ar is green and OK (firmware 5.04),
and the same goes for the Wellsburg AHCI controller.

All 8 connected drives are green/OK.
Six are MM1000GBKAL HDDs with firmware HPGE in original trays,
and two are Intel DC S3710 SSDs with firmware G2010110, also in original trays.

Any idea what's wrong?
I guess HPE will call me in a day or two (like last time).
But it would be nice to have something to help me understand where this error is coming from.
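
In case it helps anyone else, one way to cross-check what iLO itself thinks of the drives is to query its Redfish API directly. This is a rough, untested sketch; the iLO address, credentials and the exact SmartStorage paths below are assumptions for recent iLO 4 firmware:

# Rough sketch (untested): list physical drive health straight from iLO 4's
# Redfish API, to cross-check what the service event claims.
# Assumptions: the iLO address/credentials below, a recent iLO 4 firmware that
# exposes the SmartStorage resources, and the 'requests' module being installed.
import requests
requests.packages.urllib3.disable_warnings()  # iLO usually has a self-signed cert

ILO = "https://ilo-hostname"          # assumption: replace with your iLO address
AUTH = ("Administrator", "password")  # assumption: replace with real credentials

def get(path):
    r = requests.get(ILO + path, auth=AUTH, verify=False)
    r.raise_for_status()
    return r.json()

controllers = get("/redfish/v1/Systems/1/SmartStorage/ArrayControllers/")
for member in controllers.get("Members", []):
    ctrl_url = member["@odata.id"].rstrip("/")
    # assumption: drives are exposed under .../DiskDrives/ on this firmware
    drives = get(ctrl_url + "/DiskDrives/")
    for d in drives.get("Members", []):
        pd = get(d["@odata.id"])
        print(pd.get("Location"), pd.get("Model"),
              pd.get("Status", {}).get("Health"))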

UPDATE:

In the service event details on HPE.com I found this:

"RecommendedActions:
A hard drive has experienced a failure.
Check for known FW issues with the drive FW rev and if none are found
proceed with drive replacement using spare part number undefined."

The problem is that I can't find any faulty drive.
And in the description, the server doesn't know either:

"Failing Part: Drive Type: Drive Model number: Drive Serial Number: Drive FW Rev:
Drive Spare P/N: undefined Drive Location: Physical Volume Port: Box: 0 Bay: 0
Failing Part Location: The failing hard drive is installed at Physical Volume Port: Box: 0 Bay: 0
of the array controller installed in Smart Array P440ar RAID in Slot 0 of the server XXX a ProLiant DL380 Gen9 with serial number nnnnnnn."

I've replaced the server name and serial number with XXX and nnnnnnn, but the rest of the missing values are just that: missing.

Is this a known error with the firmware from the SPP?
Or do I have a hdd failure I just can't find?
 
Last edited:

Tim

Member
Nov 7, 2012
According to HPE support this is just a "hiccup".
I find that hard to believe, but I was asked to wait for the third incident before they would look further into this error.

I did generate the ADU diagnostic report for the P440ar and there was no error at all for any of the disks.
My guess is that this must be a bug in the latest SPP. Some kind of false positive.
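
For anyone wanting to do the same, the drive status check and the ADU report can be generated from the ESXi shell with the bundled hpssacli; a rough, untested sketch follows (the tool path differs between HPE image versions and the exact diag syntax may vary by hpssacli release, so treat both as assumptions):

# Rough sketch (untested), meant for the ESXi shell of the HPE custom image,
# which ships a Python interpreter and the hpssacli tool.
# Assumptions: the hpssacli path below (it differs between image versions)
# and the diag syntax, which may vary by hpssacli release.
import subprocess

HPSSACLI = "/opt/hp/hpssacli/bin/hpssacli"  # assumption: adjust to your image

# Show the status of every physical drive on the P440ar (slot 0)
print(subprocess.check_output(
    [HPSSACLI, "ctrl", "slot=0", "pd", "all", "show", "status"]).decode())

# Generate an ADU diagnostic report to hand over to HPE support
subprocess.check_call(
    [HPSSACLI, "ctrl", "all", "diag", "file=/tmp/ADUreport.zip"])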

I'll report back if anything changes.
 

darklight

New Member
Oct 20, 2015
Had a similar issue with my ProLiant a while ago. Swapping the disks between drive bays helped. Had to rebuild the entire RAID, unfortunately :-(
 
  • Like
Reactions: Tim

Tim

Member
Nov 7, 2012
I did pull the drives and put them back in, just in case one of them was badly seated.
I'll try once more and rotate them to see if that helps.

The next report is scheduled for July 1st, so I'm hoping everything will be OK by then.
It would be nice to be able to deactivate those two reports, but the only ways I've found are to de-register the server or to put it into a timed maintenance mode. I could possibly block it in the firewall.
 

Tim

Member
Nov 7, 2012
Is it the tray revision?

I had access to the server today, so I rotated the disks randomly and booted it.
Straight away it sent a new service request!

So I figured I'd do a round of "debugging".
First, I put it in maintenance mode (iLO - Remote Support - Service Events - maintenance mode checkbox).
Then I pulled all the disks and noted the tray revisions.
I've got 651687-001 B trays in revisions 2.005, 2.008 and 4.010,
all original HP (I only swapped the two SSDs in).

I did a rotation test with the disks, doing a full reboot for each change and waiting a few minutes for a report.

My first test setup:
Channel 1I Box3 Bay1 - Tray rev. 4.010 - Intel SSD DC S3710, firmware G2010110 (non-hp ssd)
Channel 1I Box3 Bay2 - Tray rev. 4.010 - Intel SSD DC S3710, firmware G2010110 (non-hp ssd)
Then I moved the Bay2 SSD through Bay3-Bay8; everything stayed green/OK and no service request was sent at all.
I figured this would confirm that the P440ar, both channels, box3 and all 8 bays are working?

The second test setup:
Channel 1I Box3 Bay1 - Tray rev. 4.010 - Intel SSD DC S3710, firmware G2010110 (non-hp ssd)
Channel 1I Box3 Bay2 - Tray rev. 4.010 - Intel SSD DC S3710, firmware G2010110 (non-hp ssd)
Then I tried all six HDDs in Channel 1I Box3 Bay3 and Channel 2I Box3 Bay5.
They all returned green/OK and no service request was sent.
So there is nothing wrong with the SSDs/HDDs/trays, as they all work separately, I guess?

My third test was to boot with only a disk in bay1, wait for a report, reboot and add a new disk.
As listed below, I ended up with disks in bay1 through bay7. Still all green/OK and no report.

The final and only setup that failed was with all 8 disks like this:
Channel 1I Box3 Bay1 - Tray rev. 4.010 - Intel SSD DC S3710, firmware G2010110 (non-hp ssd)
Channel 1I Box3 Bay2 - Tray rev. 4.010 - Intel SSD DC S3710, firmware G2010110 (non-hp ssd)
Channel 1I Box3 Bay3 - Tray rev. 2.008 - HP P/N 614829-003, Model MM1000GBKAL, HP firmware HPGE
Channel 1I Box3 Bay4 - Tray rev. 2.008 - HP P/N 614829-003, Model MM1000GBKAL, HP firmware HPGE
Channel 2I Box3 Bay5 - Tray rev. 2.008 - HP P/N 614829-003, Model MM1000GBKAL, HP firmware HPGE
Channel 2I Box3 Bay6 - Tray rev. 2.008 - HP P/N 614829-003, Model MM1000GBKAL, HP firmware HPGE
Channel 2I Box3 Bay7 - Tray rev. 2.005 - HP P/N 614829-003, Model MM1000GBKAL, HP firmware HPGE
Channel 2I Box3 Bay8 - Tray rev. 2.005 - HP P/N 614829-003, Model MM1000GBKAL, HP firmware HPGE

They all returned green/ok, but triggered a service request like before.

I swapped bay7 and bay8, still green/ok and this time NO service request was sent.
Then I swapped back, still green/ok and again NO service request.
This was not as expected.

Next up, finding the needle in the haystack.
Even though the server reports nothing wrong, something is triggering the service request.
But for now I can't figure out how/what/when/why.
My best bet is that it has something to do with the tray revision. 2.005 might just be buggy? Or the combination of tray revisions?
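
To avoid waiting for the HPE callback after each rotation, one option might be to read the IML over the iLO Redfish API after each boot and look for new event 300 entries; a rough, untested sketch along the lines of the earlier one (the iLO address, credentials and the IML path are assumptions for this firmware):

# Rough sketch (untested): dump the most recent IML entries from iLO 4 via
# Redfish after each rotation test, so a new event 300 shows up without
# waiting for the HPE callback.
# Assumptions: the iLO address/credentials below and that this firmware
# exposes the IML at the path used here.
import requests
requests.packages.urllib3.disable_warnings()

ILO = "https://ilo-hostname"          # assumption
AUTH = ("Administrator", "password")  # assumption

entries = requests.get(ILO + "/redfish/v1/Systems/1/LogServices/IML/Entries/",
                       auth=AUTH, verify=False).json()
for member in entries.get("Members", [])[-10:]:   # last few entries only
    e = requests.get(ILO + member["@odata.id"], auth=AUTH, verify=False).json()
    print(e.get("Created"), e.get("Severity"), e.get("Message"))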
 

Evan

Well-Known Member
Jan 6, 2016
Gen8/Gen9 smart trays, right?

I have hundreds or thousands of them at work; if you want me to check anything, I can on Wednesday.
I wonder if the SSD and HDD trays are different or anything like that. I can't imagine one tray is any different from another, and I'd guess the revisions are just minor spec changes in production.
They may be called smart trays, but I thought the trays were essentially dumb.
 
  • Like
Reactions: Tim

Tim

Member
Nov 7, 2012
Thanks, I really appreciate any help here.
I've tried to reproduce the bug today to no avail.
There's no apparent pattern other than the random service request as long as all 8 disks are present.

Yes it's gen8/gen9 smart trays.

Since there's no connection between the disks and the trays, I guess the only "smart" thing about the trays is a chip that controls the LED blinking according to what the P440ar tells it. But I'm not using those features, as the controller is in HBA/AHCI mode and not in any sort of RAID setup.

There's no error and no service request sent while only the ssd's are in the trays.
The only time I've got this error is when all 8 bays are populated.

I really don't know what testing I could ask for.
The only thing I can think of is that mixing tray revisions triggers this bug, possibly only when all bays are populated: tray revisions 2.xxx and 4.xxx, a mix of SSDs and SATA HDDs, and the controller in AHCI mode.
So the only test I can think of is to test exactly that, but since I can't reproduce the bug reliably, it's hard to tell anything from a single random test.

Since nothing seems wrong, all I need is for this service request bug to stop.
It won't tell me what's wrong and every test shows that everything is ok.
So I guess it's a software/firmware bug in iLO or the P440ar controller or a signal fault along the path.
 

Evan

Well-Known Member
Jan 6, 2016
For security reasons we don't report anything back to HP, but it would not surprise me if it's a bug; I have certainly seen the odd error with spinning disks that isn't an error at all.

Nearly all the SSDs we have are connected to H240 HBAs and not the P440ar internal controllers. I have not seen any issues with them, but they are all SAS3. All the SATA systems are just 2 boot disks, so none with all 8 bays populated.

Very, very strange.
 
  • Like
Reactions: Tim

Tim

Member
Nov 7, 2012
De-registering the server is an option, although it shouldn't be needed.
If I do, I'll miss out on the SPP and such, as the only support agreement I've got is the one that came with the server. Since this seems to be a bug and not a real failure of any hardware, I guess this is the best option for now.

What do you do instead with these reports?
Do you have a dedicated server that collects them for internal use?
Is it better to use insight remote support than insight online direct connect?
Will this let me keep the server registered and let me block the reports?

Yeah, I was going for the H240ar, but the offer I was given was with the P440ar.
I might replace the P440ar with a new H240ar just to see if that's better supported for my setup.
Can I reuse the cables already in the server that are used by the P440ar, and just replace the unit?

I've rebooted several times now and let the server run for about 30 minutes between each reboot.
Swapping some disks around every time. Even pulled the power a few times to give it a fresh start.
Still no sign of this bug.

I guess HPE support will call me in about 10 hours from now; I'll report back.

EDIT
20 minutes after I posted this I got the service request bug again.
I really don't get what's triggering it.
 
Last edited:

Tim

Member
Nov 7, 2012
As expected, HPE support had nothing new on this bug, so they requested an AHS log.
I uploaded it and used the AHS Viewer (on their web page), but it shows no information about this bug.
There's no sign of the service request, and the fault detection analytics comes back with "No server faults detected!".
The viewer is very basic, so they might get more out of the 17MB (last 3 days) log than me.

Is there an offline tool to read the complete log?

I'll wait for their answer before I disable the Insight Online direct connect.
I would like to stay "registered" so that I can download the SPP and such, but without the reporting.

As far as I'm concerned there's no hardware error, this is just a bug and I can't be bothered with all the calls from support.

Still open for suggestions.
 

Evan

Well-Known Member
Jan 6, 2016
Tim said:
"Yeah, I was going for the H240ar, but the offer I was given was with the P440ar.
I might replace the P440ar with a new H240ar just to see if that's better supported for my setup.
Can I reuse the cables already in the server that are used by the P440ar, and just replace the unit?"
786092-B21 HP DL380 Gen9 8SFF H240 Cable Kit

That's what we have been using when fitting H240s to replace the onboard P440ar.
The existing cables have the bends in the wrong place and they are rather stiff, so I don't know if they will otherwise fit. If I get a chance tomorrow, I will take a look at a server to see whether the existing cables can be used.

Can't say I love the cable kit mentioned above, but it does work, if a tight fit.
 
  • Like
Reactions: Tim

Tim

Member
Nov 7, 2012
Thank you for the cable information.
If the existing cables can be used, that would be great, if only to avoid the trouble of fitting the new ones.

I've not heard back from HPE today, so I guess I'll wait a few more days before I decide what to do.

I forgot to mention that the tray revisions should not be the cause of the error, according to support.
Everything is at the latest firmware as far as I know, too.
 

Tim

Member
Nov 7, 2012
Support came back with nothing from the AHS log.
They've requested the ADU report; I'm waiting for the result.

In the meantime I've used ESXi some more. I'm using this image:
VMware-ESXi-6.5.0-OS-Release-5146846-HPE-650.9.6.5.27-May2017.iso
This is 6.5.0b as far as I can tell; I'm wondering when 6.5.0d will be available from HPE.
Or is it safe to stay on the HPE custom release and just do a regular upgrade?
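
A quick way to confirm the patch level the host is actually on is to check the build and the installed image profile from the ESXi shell; a small, untested sketch (run with the host's own Python interpreter, as in the earlier example):

# Rough sketch (untested): confirm which build and image profile the host is
# actually running, to map it to a 6.5.0 patch level.
import subprocess

# Full version and build number, e.g. "VMware ESXi 6.5.0 build-nnnnnnn"
print(subprocess.check_output(["vmware", "-vl"]).decode())

# Name and creation date of the installed image profile
print(subprocess.check_output(["esxcli", "software", "profile", "get"]).decode())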

And now this is no longer just a service-request reporting bug with no apparent hardware failure.
iLO shows that everything is green/OK and no service requests are sent.
ESXi reports the same.
But under "Host - Monitor - Hardware - Storage" ESXi also reports
"The physical element is failing" on 1-2 random HDDs whenever I refresh the list.
Never the SSDs, and not all of the HDDs so far.

A random service request is "OK", but I need to be able to trust this; otherwise I have no idea if/when I really have a failing disk.
So now I'm thinking that this might be more than a bug in the iLO reporting back to HPE.
This might be a real hardware/firmware failure in one or more of the HDDs, trays, box3, cables or P440ar.
But why would iLO and ESXi report all green/OK and generate no service request, while ESXi still flags random disks as failing?
This bug drives me crazy.
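
To see how often that "physical element is failing" status actually flips (and whether sfcbd just stops answering), a small poll of the host's hardware health through the vSphere API might help; a rough, untested sketch with pyVmomi, where the host name and credentials are placeholders:

# Rough sketch (untested): poll the ESXi host's storage health elements via
# pyVmomi and print anything that isn't green, to see how often the
# "physical element is failing" status flips between refreshes.
# Assumptions: pyVmomi installed on a workstation, placeholder host/credentials.
import ssl
import time
from pyVim.connect import SmartConnect, Disconnect

ctx = ssl._create_unverified_context()  # the host has a self-signed cert
si = SmartConnect(host="esxi-hostname", user="root", pwd="password", sslContext=ctx)
try:
    # connected directly to the host, so there is exactly one HostSystem
    host = si.content.rootFolder.childEntity[0].hostFolder.childEntity[0].host[0]
    for _ in range(10):  # ten samples, one per minute
        health = host.runtime.healthSystemRuntime
        stamp = time.strftime("%H:%M:%S")
        if health is None or health.hardwareStatusInfo is None:
            print(stamp, "no health data returned (sfcbd down?)")
        else:
            bad = [e for e in health.hardwareStatusInfo.storageStatusInfo
                   if e.status.key.lower() != "green"]
            for e in bad:
                print(stamp, e.name, e.status.key, e.status.summary)
            if not bad:
                print(stamp, "all storage elements green")
        time.sleep(60)
finally:
    Disconnect(si)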
 

Evan

Well-Known Member
Jan 6, 2016
Sounds crazy!

I took photos of the cables the other day. Assuming you're using bay3 (right side as you look at the server), the cables should be able to be re-used with no issues; if you want me to, I can actually cable it up to check.
It would be useful for my own info, although generally I boot ESXi on 2 x 300GB in bay3 and put all the vSAN disks in bays 2 and 1 with 1 or 3 H240s.
 
  • Like
Reactions: Tim

Tim

Member
Nov 7, 2012
Thank you, I really appreciate all this info.
But don't do it unless you need to know for yourself.

I'll wait for the outcome of this bug hunt; it may sort itself out if HPE replaces any faulty hardware.
If not, I'll test the H240ar (cheaper to replace that than 6 drives with trays).
I'm hoping it's not box3 that's at fault here; hunting it down component by component would be expensive.
Or worse, a bug on the motherboard causing this erroneous reporting.

This was exactly the kind of error I was hoping to avoid by going with professional equipment instead of a DIY solution.
I guess bugs are everywhere.
 

Tim

Member
Nov 7, 2012
One step closer to a plausible theory.
I've done some more testing and I think I've found the pattern.

I've ruled out the third-party SSDs, since every tray and disk is now a 100% original HP part with the latest firmware.

At all times, every hardware status in iLO and ESXi is green/OK/normal.
Even when SFCBD in ESXi reports failing drives (1-2 at random).
Also in the cases where SFCBD doesn't report back any data at all.
And iLO only sends the service request when all 8 bays are populated.

ESXi's SFCBD sometimes fails to report the P440ar while still reporting the disks.

iLO always reports the disks under the P440ar controller in the physical view,
but only randomly under the Wellsburg AHCI controller (why does it report them under that controller anyway?).

My theory at this point is that these patterns show there's nothing wrong with the drives and related hardware.
Since the system always reports the hardware to be OK and only the "master monitoring" firmware returns errors,
the fault must lie in that monitoring firmware, which gets triggered by random false positives.
The fact that the service request only gets sent when all 8 bays are populated also looks like a bug in that system.

Seems plausible?

I'm not sure what part is to be replaced as I don't know where this monitoring firmware is located.
I guess the whole motherboard needs to be replaced to fix this bug, or at least a firmware update.

I'll report back when I hear from HP support again.
Most likely after the weekend.

EDIT:
HPE support just called and said they will send out a technician on Monday to change the backplane.
Fingers crossed that this solves the problem.
 
Last edited:

Tim

Member
Nov 7, 2012
They replaced the backplane, the cables and the P440ar, and as I predicted, the bug is still here.

Everything is green/OK on the server and in ESXi, but in the ESXi monitoring view random bays still get marked as failing. And after a second reboot the server sent a new service request. The strangest thing about this bug is that it only shows up while all 8 bays are in use.

Waiting for a reply from HPE support but I guess the motherboard will be replaced this time.
 

Tim

Member
Nov 7, 2012
Solved!
See the recap in post #1

To sum up, this has been nearly 4 weeks of "downtime" and problem solving.
I've learnt a great deal about the server, and about HPE support in a positive way:
they have been very active, both calling me and sending me emails.
But nearly 4 weeks of "downtime" is not a good thing.
They should have understood this at an earlier stage, at least 12 days ago.
That's when we figured out what most probably caused this, but they decided
to test it step by step, and I don't blame them; that's the cheapest way.

I put "downtime" in quotes, because even though the server has been up
and I've been able to use it, the constant hunting for the bug has made it
impossible to have a stable uptime for services and development.
Also, since the bug only was a false positive I was able to use the server
between the testing. Glad this only is my home server for getting experience
with HPE servers at this point.

The solution:
HPE support first replaced the backplane, cables and P440ar.
That didn't change anything.
Then they replaced the motherboard and one of the HDDs.
I don't know why they replaced the disk, as it was clearly not faulty: the bug
was present with other drives in box3 as well.

Anyway the problem seems to be fixed and I'm happy.
 
  • Like
Reactions: Evan