I received a couple of new 32GB LRDIMMs today and wanted to test them in the Cisco server, but things went horribly wrong.
I removed the previously installed memory, installed the new DIMMs and then ran memtest86. Everything was fine.
I then added the previously installed DIMMs back alongside the new ones, but somehow ended up with one memory riser (or, to be exact, its SMI2 channels) going bonkers.
The server successfully recognised the installed memory, but showed this on POST:
Total Memory = 512GB Effective Memory = 448GB
and gave me this look:
No amount of rebooting, swapping memory risers or swapping DIMMs between risers while the server was off brought me any closer to a fully working server...
I was already expecting a dead CPU (because the problem always stayed on memory riser slot 4, even when I swapped the memory risers around), but I managed to solve the issue without replacing a CPU.
This is what I did:
- Power off the server (no need to de-energise it)
- Remove all memory risers
- Remove all DIMMs on the faulted riser except one
- Choose one random memory riser (but not the one that faulted!) and plug it into slot 1
- Boot the server to Windows
- Hot-add the remaining memory risers in random order (but not the one that faulted) and verify that everything is working
  - Procedure:
    - Plug the memory riser into an empty slot
    - Press the "Attention" (ATTN) button for ~1 second
    - A couple of seconds later the green POWER LED should light up (solid green)
    - The amber Attention (ATTN) LED should start blinking
    - At some point the amber Attention (ATTN) LED turns off and the green POWER LED starts blinking
    - About 1m30s after the POWER LED first lit up, it should stop blinking and show a solid green light
  - NOTE: It's perfectly normal for Windows to appear frozen during the memory training (while the riser is blinking)!
  - NOTE: Windows Server 2019 does not seem to recognise hot-added memory
- Procedure:
- Power down the server
- Remove the initially faulted riser
- Restore the initially faulted riser to the desired memory configuration
- Boot to Windows
- Hot-Add the initially faulted riser as described above
- Reboot
- Be welcomed with
Total Memory = 512GB Effective Memory = 512GB
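
To put a number on the gap between the two POST summaries quoted in this post, here is a minimal Python sketch. The function name `missing_memory_gb` and the regular expression are my own illustration, and the pattern assumes the line is formatted exactly as quoted above; other BIOS versions may print it differently.

```python
import re

# Matches the POST summary line as quoted in this post (assumed format).
POST_LINE = re.compile(
    r"Total Memory\s*=\s*(\d+)\s*GB\s+Effective Memory\s*=\s*(\d+)\s*GB"
)


def missing_memory_gb(post_line: str) -> int:
    """Return how many GB are detected but not usable, per the POST line."""
    match = POST_LINE.search(post_line)
    if match is None:
        raise ValueError(f"unrecognised POST line: {post_line!r}")
    total_gb, effective_gb = (int(group) for group in match.groups())
    return total_gb - effective_gb


if __name__ == "__main__":
    before = "Total Memory = 512GB Effective Memory = 448GB"
    after = "Total Memory = 512GB Effective Memory = 512GB"
    print(missing_memory_gb(before))  # 64
    print(missing_memory_gb(after))   # 0
```

So the faulted riser was hiding 64GB before the fix, and nothing is missing afterwards.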
I guess this is a firmware or silicon bug in the Intel Jordan Creek Scalable Memory Buffer...