X10QBi and v3/v4 CPUs (e.g. Supermicro SYS-4048B-TRFT)

angel_bee

Member
Jul 24, 2020
I assume you have tried the A1, B1, C1, D1 config with different DIMMs to rule out a faulty one?

Resetting the CMOS would've been my next idea; there are some LRDIMM-specific settings which might not have mattered when you used RDIMMs before.

Which BIOS version are you using?

The X10QBi is a nasty diva, I always pray that I don't get any crude VM memory errors when rebooting the machine.
It seems I pre-emptively answered your questions a few minutes ago! I'm using the very latest BIOS and Redfish IPMI firmware.

(screenshot attached)

P.S. Yes, I am absolutely sure none of the new 64GB sticks is faulty. I'm running on all 24 of them in Ubuntu right now; the problem is just the DIMMC/DIMMD thing.
 

NablaSquaredG

Active Member
Aug 17, 2020
Hmm... That's bad. Honestly, I'm out of ideas.

My troubleshooting approach would be
  • Start with 1 memory board in P1M1, single DIMM, check working
  • Increase DIMM count in the memory board from 1 to 4 (A1, B1, C1, D1) in that order, check working
  • If it doesn't work: Try with a different memory board
  • If it still doesn't work: Swap DIMMs
  • Perhaps swap CPUs to rule out a damaged CPU
Note: I've had some hefty issues with P1M1 and various errors ("TMSE TXC DQS DCC", "VMSE Train read failure", "VMSE Train write failure", "VMSE Read fine failure"); the P1M1 slot is a bit wonky... I tried everything, and the final shitfix, aka heating up the slot with a hair dryer, worked and still works.

BTW, I wouldn't bother contacting Supermicro support if I were you. I've had a LENGTHY argument with them regarding memory compatibility (I've also had some issues), in which the international support impressively demonstrated their gross incompetence and inability to comprehend complex questions and issues.

I can send you the email chain via PM if you're interested.
 

angel_bee

Member
Jul 24, 2020
Yes, I already increased the DIMM count gradually. With 4 memboards it works completely fine with DIMMA and DIMMB, but as soon as I put anything in DIMMC it complains. I swapped memory boards too. Here's the thing: the training/VMSE failures are ALWAYS at DIMMC of one or more boards, so it really obviously points to something that fails as soon as memory training moves on from channel B to channel C. Plus, it works fine with 64 sticks filling all channels in slots 1 and 2.

At this point I'm just trying to test whether my current configuration is stable. If it is, I'm just going to leave it be; I don't see myself needing more than 1.5 TB of RAM.
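For reference, the numbers above work out as follows, assuming the usual X10QBi layout of 4 sockets, 2 memboards per socket, 4 channels (A-D) per memboard and 3 DIMM slots per channel. This is just back-of-the-envelope arithmetic, not anything the BIOS reports:

Code:
# Rough X10QBi slot/capacity arithmetic (assumed layout, see above)
SOCKETS = 4
BOARDS_PER_SOCKET = 2        # PxM1 and PxM2
CHANNELS_PER_BOARD = 4       # DIMMA..DIMMD
SLOTS_PER_CHANNEL = 3        # DIMMA1..DIMMA3, etc.

total_slots = SOCKETS * BOARDS_PER_SOCKET * CHANNELS_PER_BOARD * SLOTS_PER_CHANNEL
print(total_slots)           # 96 DIMM slots in the whole system

# "64 sticks in all channels, slots 1 and 2" = every channel, first two slots only
print(SOCKETS * BOARDS_PER_SOCKET * CHANNELS_PER_BOARD * 2)   # 64

# 24 x 64 GB LRDIMMs (the current working set)
print(24 * 64)               # 1536 GB, i.e. the ~1.5 TB mentioned above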

Thanks for your suggestions anyway. If there are further developments I'll post again here :D

P.S. Oh, I'm so glad someone else voiced this too. Supermicro tech support is the worst. Actually, the first time I emailed them asking for the 8-pin GPU power pinouts they were so nice and gave me all the detailed blueprints. After that, I think they had a change of management maybe? These new people are so bloody incompetent, condescending and dismissive. But I already emailed them tonight anyway.

Also going to call Samsung tomorrow. Fun times. Not.
 

angel_bee

Member
Jul 24, 2020
This has since been worked through, but I'm keeping this post for archive purposes.

Update: I just spent an entire day rebooting and have barely managed to get 4 memboards with all channels populated, with random mixing of Q and non-Q RAM.

However, I am still concerned about the stability.

What happened was that I noticed "VMSE DC detect failure" is actually relatively benign: if you only get these messages, you just have to wait for the system to reboot and have another crack at training the RAM (maybe with looser timings, I don't know). However, "DDR training failure" is bad. VERY bad. If you get this, it means something is incompatible between the memboard and the slot showing the DDR training failure, and you must remove the memboard from the complaining slot before you do anything else.

I noticed that:
a) PxM2 slots cannot be filled using my RAM. I'm no longer sure whether it's due to the presence of Q-version RAM (I do not have enough non-Q sticks to test). However, I do know that using purely Q-version RAM will not work in PxM2 slots. All memboards must be placed in P(1-4)M1 or you get the stupid DDR training failure.

b) I have conclusively shown that it is impossible to use all 8 memboards with all 4 channels at this stage. I get the DDR training failure regardless of the combination of memboards. It seems to be a stability issue: the more boards I put into the system, the less stable everything gets and the more "DC detect failures" I have to cycle through before it POSTs.

c) With my current configuration, with only the M1 slots filled, I can boot into the OS. However, my Geekbench 5 scores (Xeon E7-8880 v3 here) are SO LOW. I used to get ~32k and now I'm only getting 15k on the multi-core test. Is this normal going from 8GB RDIMMs to 64GB LRDIMMs? Did this happen to anyone else as well?

At this stage I don't even know whether my performance is severely degraded due to the Q-version RAM or because 8Rx4 LRDIMMs are inherently much slower... I'm seriously considering buying actual non-Q versions to see if they indeed work seamlessly.

I don't know.
 

angel_bee

Member
Jul 24, 2020
UPDATE:

It's the CPUs.
Somehow.

Scratch my previous posts about Q vs. non-Q variants. @NablaSquaredG you were right, you actually can mix them. I put in 4x E7-4820 v2s and it instantly worked.

Prior to this, I was running 4x 8880 v3 with MEM1 Rev. 1.01 boards, and it was working fine with 64 x 8 GB RDIMMs, so I took the compatibility for granted... sigh. There must be something about LRDIMMs that makes the system increasingly unstable as more channels are populated.

The best fix would probably be to update to a BIOS that supports this configuration or to buy some MEM1 Rev. 2.00 boards, but no new BIOS has been released since 2019 and MEM1 Rev. 2.00 boards are impossible to find.

UPDATE #2, 4/4/2021 (it's midnight now): wrapping up my experience with v3 CPUs and 64 GB LRDIMMs - a guide for using 64 GB LRDIMMs with MEM1 rev. 1.01 boards when you're getting VMSE DC detect failure / VMSE DDR training failure.

After another day of frustration, I've got my system working in pretty much the best way I feasibly can. Hopefully the lessons I've learnt can help others in the future.

As others have previously mentioned, the Jordan Creek memory buffer differs between MEM1 rev. 1.01, MEM1 rev. 2.00 and MEM2. Others in this forum have only described usage with up to 32 GB DIMMs, never with 64 GB LRDIMMs. As a general observation, I think the issues I ran into may be specific to 8Rx4 LRDIMMs running on MEM1 rev. 1.01 boards; the issue does not show up with RDIMMs of lower rank or capacity.

I found that in lockstep 1:1 mode, everything seems to work perfectly. It is only in 2:1 performance mode that the issue emerges. This led me to think that maybe the memory buffer on MEM1 rev. 1.01 does not have enough bandwidth to support these massive LRDIMMs, so when it tries to double the data rate in 2:1 performance mode, it chokes.

So you CAN use all 4 channels on all memboards with MEM1 rev. 1.01, provided you use lockstep mode. However, because lockstep mode literally halves DRAM performance, it is highly undesirable, since you'd be using LRDIMMs for large in-memory computes anyway. It results in a maximum of 4 sockets x 2 memboards per socket x 2 effective channels per memboard = 16 effective RAM channels because, as I understand it, DIMMA is tied to DIMMC and DIMMB is tied to DIMMD on each board in lockstep. So is there a way to get more than 16 effective channels?
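To make the channel counting explicit, here is the same arithmetic as a small sketch (again assuming 4 sockets, 2 memboards per socket and 4 physical channels per memboard, with the lockstep pairing as described above):

Code:
# Effective channels in lockstep 1:1 vs. 2:1 performance mode (assumed layout)
SOCKETS = 4
BOARDS_PER_SOCKET = 2
CHANNELS_PER_BOARD = 4

# 2:1 performance mode: every physical channel is independent
# (the ideal case, which fails DDR training here when fully populated)
performance_channels = SOCKETS * BOARDS_PER_SOCKET * CHANNELS_PER_BOARD
print(performance_channels)   # 32

# Lockstep 1:1 mode: DIMMA is tied to DIMMC and DIMMB to DIMMD on each board,
# so each memboard only contributes 2 effective channels
lockstep_channels = SOCKETS * BOARDS_PER_SOCKET * (CHANNELS_PER_BOARD // 2)
print(lockstep_channels)      # 16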

It turns out you can fill a maximum of 6 of the 8 available channels per socket in 2:1 performance mode. Beyond this, I get the DDR training failure and the show has to stop. Technically, I think it means that MEM1 rev. 1.01 with v3/v4 CPUs can only handle the bandwidth equivalent of at most 6 "doubled" physical channels per socket (or, equivalently, 6 independent logical channels per socket). If anyone has a better explanation, I'd like to hear it.

SO... this means for each socket, e.g. processor #1, you fill DIMMA1, DIMMB1, DIMMC1 and DIMMD1 on P1M1 (all 4 channels occupied), then on P1M2 you only fill DIMMA1 and DIMMB1 (channels 5 and 6). I'm sure you can fill more slots per channel, e.g. DIMMA2, DIMMA3... but I don't have enough LRDIMMs to find out.

Populating all 8 memboards in this 4/2/4/2/4/2/4/2 configuration in 2:1 performance mode actually POSTs. Time to celebrate. This means the total in this configuration is 4 sockets x 6 independent channels per socket = 24 effective channels.

A test on Geekbench confirmed the RAM speed indeed improved, but as expected, the score is consistent with a 25% RAM performance drop from missing the last 2 physical channels on every socket.
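Putting the 4/2 population scheme into numbers (same kind of rough sketch as above, and assuming RAM bandwidth scales roughly with the number of populated channels):

Code:
# 2:1 performance mode workaround: 6 of 8 physical channels populated per socket
SOCKETS = 4
CHANNELS_PER_SOCKET = 8                 # 2 memboards x 4 channels
POPULATED_PER_SOCKET = 4 + 2            # all of PxM1 + DIMMA1/DIMMB1 on PxM2

total_channels = SOCKETS * POPULATED_PER_SOCKET
print(total_channels)                   # 24 independent channels

fraction_of_peak = POPULATED_PER_SOCKET / CHANNELS_PER_SOCKET
print(fraction_of_peak)                 # 0.75 -> roughly the 25% drop seen above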

In summary, as a general recommendation: using 64GB 8Rx4 LRDIMMs brought out the underlying limitations of running v3/v4 CPUs in the unsupported configuration (i.e. on MEM1 rev. 1.01), despite fully healthy memboards and fully working LRDIMMs taken straight from the Supermicro compatibility list. Since MEM1 rev. 2.00 boards are practically impossible to find, this is a workaround that enables using cheap DDR3 LRDIMMs, but it sacrifices 25% of the peak RAM performance, and when it POSTs you'll have to wait longer while it gradually works through the VMSE DC detect failures (which are relatively benign).
 

synchrocats

New Member
Oct 20, 2020
Just to note: if you're stuck somewhere in the night with a f*cked-up BIOS and cannot recover it via IPMI, use the Supermicro Update Tool (SUM):
./sum -i 192.168.0.107 -u ADMIN -p ADMIN -c UpdateBios --file <PATH TO BIOS FILE> --force_update --reboot
Also, the BIOS chip on the X10QBi is an MX25L12835FMI-10G.
 

synchrocats

New Member
Oct 20, 2020
Also, has anybody got PCIe bifurcation working? I saw it in the manual and in the BIOS firmware (along with other interesting features like TDP control, via AMIBCP 5.02), but I can't see it in the BIOS setup.
 

NablaSquaredG

Active Member
Aug 17, 2020
My X10QBi is probably dead.

After I was able to temporarily fix the issue with P1M1, I'm now getting complete system hang-ups after at most 10 minutes, on both Windows and Linux, even in the Linux setup (so no heavy load or anything). Once I got a "Timeout: Not all CPUs entered broadcast exception handler" kernel panic.

I'll test with P1M1 out asap and see what happens.

Honestly I'm underwhelmed by the X10QBi. That board should NEVER freeze. This whole platform is built around reliability (with the possibility to hot swap memory on other servers). Random freezes are definitely something I don't wanna see....
 

angel_bee

Member
Jul 24, 2020
NablaSquaredG said:
My X10QBi is probably dead.
[...]
That sounds horrible :( I hope it's fixable.

Would it be a good idea to run with memory mirroring and rank sparing to see whether it's a RAM issue?
 

NablaSquaredG

Active Member
Aug 17, 2020
angel_bee said:
I hope it's fixable.

Probably not.
I have the suspicion that I damaged the board when I disassembled the server to add GPU power cables...

angel_bee said:
Would it be a good idea to run with memory mirroring and rank sparing to see whether it's a RAM issue?

Thanks for the suggestion!
I'll first check without P1M1. If that doesn't work, I'll try memory mirroring and rank sparing.