Ceph with a failed raid controller


Franks-Arous

New Member
Mar 17, 2022
7
0
1
Dear All Members,

Hope you are all doing great.

We have a Ceph cluster made up of five Huawei servers.
Each server has 10 OSDs. This morning one of the servers reported a failed hardware RAID controller, even though the server is still reachable
via SSH.

So if I replace the RAID card, will it come back online?

Or is there a procedure to follow?

Thanks in advance.

Regards,
Franks
 

Franks-Arous

New Member
Mar 17, 2022
7
0
1
That... seems logical? SSH has nothing to do with your failed RAID controller.
Thanks Wasmachineman_NL,

Will the data on that server move to the other servers automatically when this issue happens?


Are there any procedures to follow when removing the old card and installing the new one, to avoid losing the configuration and get things working properly again?
Your help is really appreciated.

Thanks in advance.
 

Wasmachineman_NL

Wittgenstein the Supercomputer FTW!
Aug 7, 2019
1,880
620
113
Can't say I have any experience with hardware RAID or Ceph in general so I wouldn't know unfortunately.
 

NablaSquaredG

Layer 1 Magician
Aug 17, 2020
1,342
816
113
Each server has 10 OSDs. This morning one of the servers reported a failed hardware RAID controller, even though the server is still reachable
via SSH.

So if I replace the RAID card, will it come back online?
That depends.

Are you using actual RAID Controllers in RAID Mode (which is a rather bad idea with Ceph) or HBAs?
 

Sean Ho

seanho.com
Nov 19, 2019
774
357
63
Vancouver, BC
seanho.com
Which raid controller? You'd want to ensure your replacement has compatible firmware (ideally, same model and same firmware), but it should be able to resurrect your array.

Is there a reason you were using ceph on top of hardware raid? That's not a typical implementation of ceph.
 

Franks-Arous

New Member
Mar 17, 2022
7
0
1
Thanks Sean Ho,

Below is the RAID card that is installed:
03024JMY, Manufactured Board, SR450C-M 2G, BC11RLCB, SR450C-M 2G SAS/SATA RAID Card MR, RAID0,1,5,6,10,50,60, 2GB Cache (Avago3508), Support SuperCap and Sideband Management, Board ID 0X2a, 2*2


So if we replace this bad card with another one of the same model and firmware, it will work? Is it only a swap, unplug and plug, with no action to take before unplugging?

If we can't find another identical card, is there a risk of losing the node? Is there any procedure to survive that?

Thanks in advance
 

Sean Ho

seanho.com
Nov 19, 2019
774
357
63
Vancouver, BC
seanho.com
ooh, it also looks to be a proprietary form factor for a storage slot on the Huawei? Yes, don't touch the OSD data dirs (/var/lib/ceph or similar) on that node's root, don't remove any OSDs (it's ok that they're marked down and out), and don't like pull the drives to try to mount them on another system. Once you get your replacement RAID controller, import the "foreign" raid config (i.e., ask the controller to scan the disks to determine raid topology), then the OSD should just come right back up and start the global recovery event.

If you can't find an identical RAID card, you might have luck with any other card using SAS3508 chip and standard LSI/Avago firmware.
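
A minimal sketch for checking that the node's OSDs really did come back up after the foreign config import. It assumes the JSON printed by `ceph osd tree -f json` has a `nodes` list in which host entries carry a `children` list of IDs and OSD entries carry a `status` field; the host name `node3` and the sample data are made up for illustration, so adjust to what your cluster actually prints.

```python
from __future__ import annotations

import json


def down_osds_on_host(osd_tree_json: str, host: str) -> list[str]:
    """Return the names of OSDs under `host` whose status is not 'up'."""
    tree = json.loads(osd_tree_json)
    nodes = {n["id"]: n for n in tree["nodes"]}
    for node in nodes.values():
        if node.get("type") == "host" and node.get("name") == host:
            children = [nodes[c] for c in node.get("children", []) if c in nodes]
            return sorted(
                c["name"]
                for c in children
                if c.get("type") == "osd" and c.get("status") != "up"
            )
    raise KeyError(f"host {host!r} not found in OSD tree")


# Hypothetical, trimmed-down sample of the OSD tree (not real cluster output).
SAMPLE = json.dumps(
    {
        "nodes": [
            {"id": -1, "name": "default", "type": "root", "children": [-2]},
            {"id": -2, "name": "node3", "type": "host", "children": [20, 21]},
            {"id": 20, "name": "osd.20", "type": "osd", "status": "down"},
            {"id": 21, "name": "osd.21", "type": "osd", "status": "up"},
        ]
    }
)

if __name__ == "__main__":
    # An empty list would mean every OSD on the host is back up.
    print(down_osds_on_host(SAMPLE, "node3"))
```

On a live cluster the first argument could come from running `ceph osd tree -f json` via `subprocess.run(..., capture_output=True, text=True)` on a node with admin access.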
 

Franks-Arous

New Member
Mar 17, 2022
7
0
1
Thanks Sean Ho for the clarification. Really appreciate it.

We found an Avago RAID card, but we are not sure about its firmware. Do you have any idea how to check the firmware of this card before installing it in the server?
 

Sean Ho

seanho.com
Nov 19, 2019
774
357
63
Vancouver, BC
seanho.com
Do you know the firmware version of the old RAID controller? Is it the same as in other nodes; can you check them? In theory, the on-disk header format of MegaRAID shouldn't change. In practise, it does sometimes depend on firmware version, so you may want to flash your new card to the same version. megacli should help you.
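
As a rough sketch of that comparison: save the adapter info text from a healthy node and from the replacement card (for example what megacli prints for its adapter info) and compare the firmware lines. The field labels `FW Package Build` and `FW Version`, and the version strings in the samples, are assumptions for illustration only; check them against what your tool actually prints.

```python
from __future__ import annotations

import re

# Assumed field labels; change these if your tool prints different ones.
FIRMWARE_FIELDS = ("FW Package Build", "FW Version")


def firmware_info(adapter_text: str) -> dict[str, str]:
    """Pick the firmware-related 'label : value' lines out of adapter info text."""
    info: dict[str, str] = {}
    for line in adapter_text.splitlines():
        m = re.match(r"\s*([^:=]+?)\s*[:=]\s*(\S.*?)\s*$", line)
        if m and m.group(1) in FIRMWARE_FIELDS:
            # Keep the first occurrence of each field.
            info.setdefault(m.group(1), m.group(2))
    return info


def firmware_mismatches(reference: str, candidate: str) -> dict[str, tuple]:
    """Return {field: (reference_value, candidate_value)} for fields that differ."""
    ref, cand = firmware_info(reference), firmware_info(candidate)
    return {
        field: (ref.get(field), cand.get(field))
        for field in FIRMWARE_FIELDS
        if ref.get(field) != cand.get(field)
    }


# Placeholder samples, not real controller output.
HEALTHY_NODE = """\
Product Name    : Example RAID Card
FW Package Build: 1.2.3-0001
FW Version         : 1.000.00-0001
"""
REPLACEMENT_CARD = """\
Product Name    : Example RAID Card
FW Package Build: 1.2.4-0002
FW Version         : 1.000.00-0001
"""

if __name__ == "__main__":
    # Any entry here points at a field worth aligning before importing the array.
    print(firmware_mismatches(HEALTHY_NODE, REPLACEMENT_CARD))
```

An empty result would suggest the two cards report the same firmware for the fields checked. It says nothing about other settings on the card.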
 