[Solved] LSI 9341-8i L2/L3 Cache Error.

Apil

New Member
Feb 14, 2017
11
0
1
33
Hi! :)

I have just experienced what I presume is a hardware malfunction on my SAS9341-8i RAID card.
While the server was running, a program that was writing to the RAID suddenly got a lot of I/O errors, and then the RAID disappeared in Windows.
After a reboot, I now get an I/O error in Device Manager in Windows Server 2012 R2.
And after POST, it tells me: "L2/L3 Cache error was detected on the RAID controller.
Please contact technical support to resolve this issue. Press 'X' to continue or else power off the system, replace the controller and reboot."

From that message I presume that the card has to be replaced to get my RAID up and running again.
Is there any way to get this card running again?
If not, then I have a few questions:
Could a FW upgrade possibly solve this problem? It seems that everything else is fine (it detects all disks in the MegaRAID config etc.).
Will I be able to simply replace the card with no loss of data?
Does it need to be the exact same model / FW version for me to get the RAID up again, in case I replace the card?
Any other way you can help me, or advice on this situation?
Thank you for your time and help.
Best regards
Apil
 

Tom5051

Active Member
Jan 18, 2017
236
29
28
42
These cards are generally pretty reliable when they report a problem like this. I would suggest replacing the controller with another one that is known to work; hopefully you can borrow one from a friend?
Otherwise replace the controller. A firmware update is unlikely to succeed, and unlikely to cure the card if it does.
Has it got the correct airflow over the card? They get pretty hot.

Also, you didn't say what level of RAID the array was built with. It's possible that with a replacement controller the array will still be optimal, but there is always the chance that it has degraded or failed.
You may need backups.
The replacement controller will attempt to read the array configuration from the disks, provided it is not corrupt.
Quite often I move an array of 8 disks between servers, and the RAID cards pick up the array config and boot with no problem.
 

Apil

Hi Tom.

Thanks for your reply.

The only option I have for replacing it is to buy a brand new one, so it's not that easy.
Here I have some doubts about what will happen to the existing RAID if I replace the card.

It's a RAID 5 with 8 disks, and all disks seem to be fine.
But again, I cannot open the MegaRAID software in Windows anymore.
In the MegaRAID config <ctrl - p> during boot, though, it says that the array is optimal and finds all disks.

What strikes me as strange about this "L2/L3 cache error" is that the card doesn't have any cache?

From what I have read, the card is known to run hot normally, and it has been, at around 90 degrees Celsius.
There has been no dedicated fan straight on the card, but plenty of case airflow passes the card, which I thought was sufficient.
I have now set a 120 mm fan straight on the card, but it is probably too late.

Something still tells me that it would be strange if this is a hardware fault with permanent damage?
Especially since the card does not have any cache?

Thanks again for any help, it is very much appreciated, since my RAID is currently down :-(

Best regards
Apil
 

Tom5051

That is a strange error; Google has nothing about any LSI cards with this error.
Are you able to check if there is a read cache enabled on the RAID card? If so, try disabling it. Same for the write cache, which should already be disabled since you don't have a BBU.
 

Apil

I agree, the first thing I did was to try to google it, and I got absolutely nothing.

Do you have any idea how I check if the read/write cache is enabled or disabled?
Is it a jumper on the board, or a BIOS setting?

Any ideas :) ?

Best regards
Apil
 

Tom5051

Sorry, I meant in the RAID card BIOS. I think you said you could still get in there?
Also tell us a bit more about the rest of the hardware specs in this server if possible: motherboard, CPU, any other PCIe cards.
Have you updated anything recently? Motherboard settings, BIOS updates, a new PCIe network card?
 

Tom5051

Most RAID cards that I have used over the years can enable the on-board write cache even if the backup battery (or capacitors), usually an optional extra, is not present. Likewise, the on-board read cache is enabled by default.
You can also turn off each disk's read cache, but I doubt this will help with your problem.
I think it's either the on-board read/write cache, or possibly the RAID card's dedicated CPU has some sort of L2/L3 cache.
There isn't much difference between a general-purpose CPU and a dedicated controller CPU; the controller CPU is just built to do a specific task.
Does that make sense?
 

Apil

Yeah, I can still access the <ctrl - p> config after this error. I was looking around in there yesterday and didn't find anything interesting.
Do you know which option in there to disable/enable?
Otherwise I'll try to have a look around again :)

The server is a Dell T20 that I pulled out and placed in a custom rack-mounted case, with added case fans.
It has a Xeon E3-1225 v2/3 (can't remember which), an Intel dual-gigabit NIC, and 8 x WD Red 3 TB disks, with a Kingston 120 SSD as the system disk.

I did update the BIOS and firmware of the motherboard and the RAID card when I built it 6-12 months ago, because I was having problems getting the card to work (the classic "cannot start hardware" Error 10 in Device Manager). That seemed to be because the card does not have any RAM/cache, so I had to disable/enable some settings in the motherboard BIOS to get it to start. Since then it has been running flawlessly, until now.
 

Apil

Yeah, that makes sense. I just thought that since this is the 9341 version and not the 9361, there was no RAM/cache on the board, and therefore it used the system RAM as memory, or the cache of the CPU, since it has no dedicated memory.

By the way, sorry for my bad English and lack of correct terms.
 

vanfawx

Active Member
Jan 4, 2015
359
67
28
41
Vancouver, Canada
Unfortunately I think it's talking about the on-board L2/L3 cache of the raid card CPU, not the on-board RAM cache. If the CPU L2/L3 cache has failed, then it's a sign the CPU itself might be failing on the raid card.

Hope that helps.
 

Apil

I just tried to update the firmware on the RAID card, and it seems to have done something.
Now the error doesn't come up anymore, and I get the RAID in Windows.
So that is something.
Though now MegaRAID is giving me this:


Any thoughts ?
 

Apil

Tried another reboot.
Everything seems to work now: I can access the RAID, browse the files etc., except that I'm getting this:


Maybe I should try to downgrade the FW?
[Edit] Trying to update the driver in Windows now.

-Apil
 

Apil

Yay, after installing the newest driver and rebooting, there are no more error pop-ups from MegaRAID, and everything seems fine!
:):):):):):)
 

Tom5051

Nice work fixing it. Weird error for sure.
You're right about there being no cache on that card; from your settings you can see it is set to write through.
If a cache were available, it would have the option for write back, or write back with backup battery protection.
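
To illustrate the difference between the two policies mentioned above, here is a minimal toy model in Python. It is only a sketch: the class and method names (ToyDisk, ToyController, power_loss) are made up for illustration and have nothing to do with the MegaRAID firmware or tools. It shows why write back is normally paired with a battery or capacitor backup: acknowledged writes that only live in the cache are lost if power drops before they are flushed.

```python
class ToyDisk:
    """Stands in for the backing disks: a simple block-number -> data map."""

    def __init__(self):
        self.blocks = {}


class ToyController:
    """Hypothetical controller model supporting two write policies."""

    def __init__(self, disk, policy):
        if policy not in ("write-through", "write-back"):
            raise ValueError("unknown policy: " + policy)
        self.disk = disk
        self.policy = policy
        self.dirty = {}  # cached writes not yet on disk (write-back only)

    def write(self, lba, data):
        if self.policy == "write-through":
            # Data reaches the disk before the write is acknowledged.
            self.disk.blocks[lba] = data
        else:
            # Data is acknowledged as soon as it sits in the cache.
            self.dirty[lba] = data
        return "ack"

    def flush(self):
        # Push all cached writes down to the disk.
        self.disk.blocks.update(self.dirty)
        self.dirty.clear()

    def power_loss(self):
        # Without battery/capacitor protection the cache contents vanish.
        self.dirty.clear()


wt_disk, wb_disk = ToyDisk(), ToyDisk()
wt = ToyController(wt_disk, "write-through")
wb = ToyController(wb_disk, "write-back")

wt.write(0, b"data")
wb.write(0, b"data")
wt.power_loss()
wb.power_loss()

print("write-through kept block 0:", 0 in wt_disk.blocks)
print("write-back kept block 0:", 0 in wb_disk.blocks)
```

In this model the write-through disk still holds the block after the simulated power loss, while the unprotected write-back one does not unless flush() was called first.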
 

stin9ray

New Member
Jan 5, 2018
1
0
1
51
Hi everybody,

Thank you for posting the above. It helped me figure out what was going on.

And I have some good news as well: in my case I did not even have to re-flash the firmware. Here is a description of what happened, to hopefully help others, but also for myself in case this happens again ;-)

Setup: I am using the controller for a FreeNAS VM running on ESXi, with the controller passed through to the VM. As is preferred for ZFS, I use JBOD only, so there was no controller-level RAID that I had to worry about. In my case the controller is a 3008 SAS on the mobo.

Situation: shutting down the FreeNAS VM hard-reset or purple-screened the ESXi server. On the next boot vSphere would restart the VM and I'd be back to square one. Disabling vSphere HA helped to finally get into ESXi maintenance mode. However, I am guessing that somewhere in the half a dozen or so crashes the configuration stored on the controller got corrupted.

In FreeNAS I saw this in the system log:

Jan 6 12:32:39 fns mfi0: <Fury> port 0xb000-0xb0ff mem 0xfcef0000-0xfcefffff,0xfcd00000-0xfcdfffff irq 17 at device 0.0 on pci28
Jan 6 12:32:39 fns mfi0: Using MSI
Jan 6 12:32:39 fns mfi0: Megaraid SAS driver Ver 4.23
Jan 6 12:32:39 fns mfi0: Firmware fault
Jan 6 12:32:39 fns mfi0: Firmware not in READY state, error 6
Jan 6 12:32:39 fns device_attach: mfi0 attach returned 6
Jan 6 12:32:39 fns mfi0: <Fury> port 0xb000-0xb0ff mem 0xfcef0000-0xfcefffff,0xfcd00000-0xfcdfffff irq 17 at device 0.0 on pci28
Jan 6 12:32:39 fns mfi0: Using MSI
Jan 6 12:32:39 fns mfi0: Megaraid SAS driver Ver 4.23
Jan 6 12:32:39 fns mfi0: Firmware fault
Jan 6 12:32:39 fns mfi0: Firmware not in READY state, error 6
Jan 6 12:32:39 fns device_attach: mfi0 attach returned 6
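
For anyone who wants to watch for this pattern automatically, here is a minimal Python sketch that scans syslog lines for the two fault messages shown above. The function name find_firmware_faults, the regular expression, and the default device name are my own assumptions based only on the log excerpt in this post, not anything provided by FreeNAS or the mfi driver.

```python
import re

# Assumed line layout, taken from the excerpt above:
# "<Mon> <day> <hh:mm:ss> <host> <source>: <message>"
LINE_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<source>[^:]+):\s+(?P<message>.*)$"
)

# The two messages that indicated the problem in the log above.
FAULT_MARKERS = ("Firmware fault", "Firmware not in READY state")


def find_firmware_faults(lines, device="mfi0"):
    """Return (time, message) pairs for fault lines logged by `device`."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group("source") != device:
            continue
        message = match.group("message")
        if any(marker in message for marker in FAULT_MARKERS):
            hits.append((match.group("time"), message))
    return hits


sample = [
    "Jan 6 12:32:39 fns mfi0: Using MSI",
    "Jan 6 12:32:39 fns mfi0: Megaraid SAS driver Ver 4.23",
    "Jan 6 12:32:39 fns mfi0: Firmware fault",
    "Jan 6 12:32:39 fns mfi0: Firmware not in READY state, error 6",
    "Jan 6 12:32:39 fns device_attach: mfi0 attach returned 6",
]

hits = find_firmware_faults(sample)
for time, message in hits:
    print(time, message)
```

The device_attach line is skipped on purpose, because it is logged by a different source; only lines from the controller driver itself are matched.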


To make the nested setup work, I had the Oprom Control for the controller disabled in the Intel mobo BIOS. I went into the BIOS and re-enabled the oprom:
  • F2 on boot to get into BIOS
  • "Setup Menu"
  • "Advanced"
  • "PCI Configuration"
  • "PCIe Port Oprom Control"
  • "Enabled" on all entries

On the next boot I got exactly the same error as Apil posted at the beginning of the thread:

L2L3_cache_error.jpg

After pressing X to continue and Ctrl-R to get into the RAID controller BIOS, I set the controller to factory defaults:
  • Ctrl-N twice to get to the "Ctrl Mgmt" page
  • lots of Tab to get to "Set Factory Defaults"
  • Ctrl-S to save
  • lots of Esc to get all the way out to the prompt that tells you to use Ctrl-Alt-Del

factory_reset.jpg

On the next boot the error was gone, and it listed the connected physical (JBOD) drives instead, as per normal. Yes.

Clean-up: back into the mobo BIOS to disable the oprom for the controller.

After booting ESXi, turning vSphere HA back on and booting the FreeNAS VM, the controller, all the disks, and the ZFS mirrored pool were back as if nothing had ever happened.

:)

Update 2018-09-08: I am glad I made this post, because it just saved my bacon again. Somebody (kids) stacked boxes in front of my home server rack, and I assume the controller overheated, cooked by all the disks. The LSI controller probably got into an inconsistent state when it did a thermally triggered emergency shutdown, and I can't really blame it for that. Anyhow, with my own instructions I got everything back up and running, but boy is it scary when your disks go missing.
 