C6100 boards biting the dust at a fast clip...

Sep 24, 2014
I have 10 C6100 chassis [40 nodes total] running a render farm. I ordered 8 spare boards, plus RAM and CPUs, with the original order. This setup is obviously under heavy load, maxing out the CPUs for hours at a time. The systems are in a server room with sufficient air conditioning; room temperature is usually 70°F, with directed venting pointed at the front panel of the rack, which stays at a consistent 60°F.

Boards are failing left and right. I'm down to 28 nodes online, and that's after using all of my spare boards. The failed systems will not POST, but when the board is inserted into the chassis the drive lights come on and the IPMI appears to be working. BIOS is set to defaults except for the virtualization options, so IPMI is sharing a NIC rather than using the dedicated one. IPMI is not acquiring a DHCP address when connected to any port.

When connected to a switch, the right-side LEDs on both system NICs light up solid and don't flash. When the dedicated IPMI NIC is connected to a switch, both LEDs light up and flash. Additionally, a motherboard LED labeled CR24 flashes steadily [on a working, powered-on board it is constantly lit]. See here for LED position — Motherboard

Any suggestions before I order replacement boards?
 

Naeblis

Active Member
Oct 22, 2015
Folsom, CA
Considered checking your power?

<lol> or maybe north Korea trying to get a sneak peek at your movies and using powerful microwaves</lol>
 

Mech

New Member
Dec 8, 2015
I'd try monitoring the IPMI on the working nodes at idle, when a big render drops, and when it finishes...
Pay particular attention to temperatures AND voltages.
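
Something like this could capture a baseline (just a rough sketch, assuming the nodes run Linux with ipmitool installed and a local BMC interface; the log path and poll interval are placeholders):

Code:
#!/usr/bin/env python3
# Rough sketch: poll the local BMC via ipmitool and append temperature/voltage
# readings to a CSV so idle vs. full-render values can be compared later.
# Assumes ipmitool is installed and a local BMC is present (run as root).
import csv, subprocess, time
from datetime import datetime

LOGFILE = "/var/log/ipmi_sensors.csv"   # placeholder path
INTERVAL = 60                           # seconds between polls

def read_sensors():
    out = subprocess.run(["ipmitool", "sensor"],
                         capture_output=True, text=True, check=True).stdout
    rows = []
    for line in out.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # ipmitool sensor output: name | value | unit | status | thresholds...
        if len(fields) >= 3 and fields[2] in ("degrees C", "Volts"):
            rows.append((fields[0], fields[1], fields[2]))
    return rows

while True:
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(LOGFILE, "a", newline="") as f:
        writer = csv.writer(f)
        for name, value, unit in read_sensors():
            writer.writerow([stamp, name, value, unit])
    time.sleep(INTERVAL)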

Having 40 machines suddenly go from idle to 100% might be causing problems for your power...
I know the top 2 nodes (node 1, node 3) of both my chassis 'lost' the BMC after a power failure -
the nodes booted, but the BMC had to be reflashed multiple times to get it 'seen' again.

You can also check the BMC logs on the working nodes to see if there are any events corresponding to the node failures.
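
If the BMCs on the working nodes are reachable over the network, something along these lines could dump each event log so the timestamps can be lined up against when nodes dropped out (a sketch only; the addresses and credentials are placeholders, and -I lanplus assumes the BMCs accept IPMI-over-LAN):

Code:
#!/usr/bin/env python3
# Sketch: dump the SEL from a list of BMCs so event timestamps can be compared
# against the times the dead nodes failed. IPs and credentials are placeholders.
import subprocess

BMC_HOSTS = ["10.0.0.101", "10.0.0.102"]   # placeholder BMC addresses
USER, PASSWORD = "root", "changeme"        # placeholder credentials

for host in BMC_HOSTS:
    print(f"=== SEL for {host} ===")
    subprocess.run(["ipmitool", "-I", "lanplus", "-H", host,
                    "-U", USER, "-P", PASSWORD, "sel", "elist"])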

Good Luck!
 

Chuckleb

Moderator
Mar 5, 2013
Minnesota
We have about 3 racks of C6100s and haven't had the failure rate you describe. The biggest failures we've had were 1) user failures from students inserting CPUs wrong and bending pins... sigh, and 2) memory slot failures. We're monitoring this more often now to see if there is any pattern. This is an HPC environment, so they run at 100% load all the time, with X5675 CPUs.

Let me ask my folks.
 
Sep 24, 2014
Considered checking your power?

<lol> or maybe north Korea trying to get a sneak peek at your movies and using powerful microwaves</lol>
Power is good. The rack is supplied with 208V/50A going into an APC SmartUPS RT 8000. Stepdown transformers deliver 120V/15A, with two C6100s plugged into each circuit. I've never had a problem with the circuit breakers tripping, and the load on the APC has never been over 75%.
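
For what it's worth, the headroom on those 120V/15A circuits is easy to sanity-check; the per-chassis wattage below is purely an assumed placeholder to swap for a measured figure (e.g. from the UPS or a metered PDU), not a spec:

Code:
# Quick sanity check of per-circuit headroom on a 120V/15A branch circuit.
# WATTS_PER_CHASSIS is an assumed placeholder - substitute a measured value.
VOLTS, AMPS = 120, 15
CHASSIS_PER_CIRCUIT = 2
WATTS_PER_CHASSIS = 700          # placeholder full-load draw per 4-node chassis

circuit_watts = VOLTS * AMPS                     # 1800 W breaker rating
continuous_limit = 0.8 * circuit_watts           # 1440 W (80% continuous-load rule)
load = CHASSIS_PER_CIRCUIT * WATTS_PER_CHASSIS   # estimated load at the assumed draw

print(f"Circuit rating: {circuit_watts} W, continuous limit: {continuous_limit:.0f} W")
print(f"Estimated load: {load} W -> headroom: {continuous_limit - load:.0f} W")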
 

sag

Member
Apr 26, 2013
I had an issue where some drives went bad and certain nodes wouldn't boot. I removed all the drives and the nodes booted just fine. Have you tried removing the drives?