Dual Xeon-D


whbeers

Member
Jul 11, 2020
42
46
18
Sounds the same as my setup on the second node, m.2 to PCI-E, into a Quadro for the Proxmox install. Then I have dual Sata SSDs for Proxmox and three WD 4TB drives for LXC/VMs. Running smoothly here for 4 days straight. So it sounds like a hardware issue, might be an idea to reset the Bios just to make sure?
I definitely took inspiration from your approach after realizing that installing proxmox over serial wasn't going to work - thanks for the ideas!

This also seems like a pretty ideal use case for these boards - I've been running my two working nodes whenever I'm not worried about noise in my home office and they've been rock solid. I was worried that live migration with zfs as a backing store would be slow, but nvme as an l2arc + 10G and proxmox built-in periodic zfs replication makes it faster than vsan.
 

bob_dvb

Active Member
Sep 7, 2018
214
116
43
Not quite London
www.orbit.me.uk
I haven't got my first node (with the bad 10Gs) running yet, just the second. I am torn between FreeNAS and another Proxmox instance as a cluster; inspired by you, I might do Proxmox because I already have a NAS.

Thanks for the information on the tracing and pins, that's interesting.

I also figured out what the PORT80 connectors are, they are 80h POST connectors for diagnostics. Could be either for factory diagnostics or maybe there is a remote management board that wasn't provided.
 

whbeers

These can also be used to attach TPMs, but given that this board already has TPMs under the epoxy blobs, debugging seems more likely. I also found various port80/lpc debug boards with an asrock-compatible connection, but couldn't find much in the way of documentation of what they allow you to debug beyond what you already get with the 7-segment display.

Otherwise, I spent some quality time with the SFP+ spec last night and refreshed on the specs of my cheap usb logic analyzer (...and saw prices on MSOs that operate in the ~8-10GS/s range (!)). So, analyzing the low speed signalling to compare what's going on with a working node vs the broken one looks like a viable troubleshooting step. I'm still not ruling out that the FPGA is putting it into a tx disabled state for some reason.

I have to keep reminding myself that this board has jtag ports too... something else I've been meaning to learn about, and which I recently acquired some tools to enable (jtagulator and blackmagic probe).
 

whbeers

Spent time with a logic analyzer yesterday (digilent digital discovery):
- all low-speed sfp lines (tx fault, output disabled, rx loss of signal, rate 0/1) behave identically across the working / not working nodes.
- unfortunately managed to kill this board before I was able to get good data from i2c - had some bursts of traffic, but it appeared to be addressing unrelated devices (according to SFF-8472, all transceivers use the addresses A0h and A2h)
- I'm not sure if I accidentally flicked a tiny resistor off the back of it with the soldering iron or if solder wicked into the SFP connector and formed a short, but I didn't want to debug a non-booting board once it got into this state. So, while I might come back to it and clean up my solder / check for shorts at some point, for now I've ordered one last board to play the odds.

(The micrograbber probes I was using were unable to stay attached to the tiny bits of the SFP connector exposed on the back of the board, so I soldered on short lengths of 30ga magnet wire instead.)
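
For anyone repeating the i2c capture: a rough sketch of how the address bytes from a logic analyzer could be checked against the A0h / A2h addresses mentioned above. It assumes the analyzer exports the raw first byte of each transfer (7-bit address plus the read/write bit); the function names and sample bytes are made up for illustration.

```python
from typing import List, Tuple

# 8-bit bus addresses quoted from SFF-8472 in the post above.
# The low bit is the read/write flag, so the 7-bit forms are 0x50 and 0x51.
SFP_ADDRESSES_8BIT = {0xA0: "A0h", 0xA2: "A2h"}


def decode_address_byte(byte: int) -> Tuple[int, str]:
    """Split the first byte of an i2c transfer into (7-bit address, direction)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("address byte must fit in 8 bits")
    direction = "read" if byte & 1 else "write"
    return byte >> 1, direction


def sfp_label(byte: int) -> str:
    """Return 'A0h' / 'A2h' for transceiver traffic, or 'other' for anything else."""
    return SFP_ADDRESSES_8BIT.get(byte & 0xFE, "other")


def summarize(capture: List[int]) -> List[str]:
    """One human-readable line per captured address byte."""
    lines = []
    for byte in capture:
        addr7, direction = decode_address_byte(byte)
        lines.append("0x{:02X} {:5s} -> {}".format(addr7, direction, sfp_label(byte)))
    return lines


if __name__ == "__main__":
    # Hypothetical capture: two transceiver accesses and one unrelated device.
    for line in summarize([0xA0, 0xA3, 0x90]):
        print(line)
```

If every address in a capture comes back as "other", the bus being probed is probably not the one wired to the SFP cage.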
 

faninx

New Member
Feb 28, 2020
15
4
3
Does anyone know why this happened?
I plug in the power cable, and only the green LED and SFP LED light up. The fan doesn't work and the debug LED doesn't display anything.

IMG_2543.jpg
 

whbeers

replacement board came and worked without a hitch - no logic analyzer or soldering necessary this time :)
...it does seem to be a bit more picky about power, and starts/stops 5-6 times before actually booting. it's reliable once it boots though, so my 3U4N box is racked and working!

@faninx what I'd try:
- just in case: be sure you're waiting 10-20 seconds after applying power - there's an inconsistent startup delay after power-up.
- wiggle the battery holders to see if they're making good contact. I had one board that had a loose one and it wouldn't boot until it seated into position
- check the orientation of the socketed bios / fpga firmware chips
- check jumpers to make sure the clear CMOS jumpers aren't set (the others that are visible in the image you shared look good)
- make sure you're using RDIMMs (not LRDIMMs), and try moving the dimms to other slots.
- if you're set up for it, flash the bioses with the 1.20 image provided by @fake-name earlier in the thread - I've had sufficiently good luck with that image that when my replacement board arrived I swapped out the bios chips for freshly-flashed ones from a dead board before doing anything else.
- visually inspect the board for any cut traces or scuffs - I had bad luck with the boards that were damaged in obvious ways.

I did have one board that was in a similar state to what you describe that I never got working. In my case the SFP LEDs would continuously flicker as soon as power was applied, so there might still be hope for yours if there's no flicker
 

faninx

Does only ECC memory work? I used non-ECC memory. But it makes no difference whether RAM is plugged in or not.
 

whbeers

I'm having trouble finding a definitive reference, but other motherboards with integrated D1541s do support non-ECC memory - I haven't tried these boards with any though (and I only have ECC ddr4 to test with).

The most similar retail board is likely the D1541D4U-2O8R, whose specs indicate a max non-ECC capacity of 64GB. If you're using DIMMs larger than 16GB, that could also be a source of trouble.
 

whbeers

if you tried the other options and the board still doesn't boot, you might just have a dud :(

For the record, my success rate here was 2.5 out of 5:
*Purchased two initially:
- board #1: good, but required me to swap in the L1.20 firmware before it worked well.
- board #2: good except for the SFP port issue. only discovered the SFP port issue after replacing a damaged USB port.

*Purchased another two while I was fixing the USB port on the second one, as a hedge:
- board #3: DOA (flickering SFP LEDs and no other signs of life). this one had very obvious physical damage
- board #4: appeared good, but one node had random memory corruption (other node is perfect). this board also had some physical damage, but not as obvious as the first one

*Went back to trying to revive the SFP port on board #2 after I discovered the memory corruption issue on board #4, ultimately killed it

*One last purchase:
- board #5: zero signs of damage. replaced bios with pre-flashed L1.20 chips from another board before anything else, and it booted right up with no issues.

boards #1 and #5 are installed/racked and have been stable for the last ~24hrs.


....writing this up reminded me that I still have a third mostly-intact board to play with. Maybe I'll poke at it with jtag for fun if I get bored of other projects :)
 

bob_dvb

Does anyone know why this happened?
I plug in the power cable, and only the green LED and SFP LED light up. The fan doesn't work and the debug LED doesn't display anything.

View attachment 16052
Check that you have the correct orientation for the CPU power connectors. They are too easy to put in the wrong way and my guess would be you have shorted the 12V rail. I did the same thing and my board still works.
 

n17ikh

Member
Jul 12, 2019
62
61
18
I picked up four Transcend SSD370S 128GB SSDs to use as boot drives. At $15/ea they were cheaper than SATADOMs and I had the space in the chassis.
Installed Proxmox on each drive using a spare laptop, corrected /etc/network/interfaces, and added the console lines to /etc/default/grub. Worked great.
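
For reference, a sketch of what serial console lines in /etc/default/grub typically look like, generated here so the unit and speed are easy to change. The exact lines used on these boards aren't given in the post; ttyS0 at 115200 is an assumption, and `grub_serial_lines` is just an illustrative helper name.

```python
from typing import List


def grub_serial_lines(unit: int = 0, speed: int = 115200) -> List[str]:
    """Build /etc/default/grub settings for a serial console on ttyS<unit>.

    unit and speed are assumptions; match them to the port and baud rate
    that actually work on your board.
    """
    return [
        # Kernel messages and login go to both the local console and the serial port.
        'GRUB_CMDLINE_LINUX="console=tty0 console=ttyS{0},{1}n8"'.format(unit, speed),
        # The GRUB menu itself is shown on both as well.
        'GRUB_TERMINAL="console serial"',
        'GRUB_SERIAL_COMMAND="serial --unit={0} --speed={1} --word=8 --parity=no --stop=1"'.format(unit, speed),
    ]


if __name__ == "__main__":
    print("\n".join(grub_serial_lines()))
```

On a GRUB-booted Debian-based install, the generated config would still need to be refreshed (e.g. with update-grub) after editing the file.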
I present the world's worst quad-node Xeon-D machine:
20201013_235546r.jpg

"Molex to SATA, lose all your what now?"

20201013_235621r.jpg

The SSDs definitely aren't just jammed in the space between the PSUs and the side of the chassis.

I discovered something interesting tonight while putting this together. On the board that had to cycle 4 or 5 times before powering up, I noticed it was complaining in a brief message at boot time about not having the CMOS clock set, so I checked the batteries. Both batteries were totally flat, zero volts. Upon replacing the batteries with new ones, the board wouldn't boot anymore, just sitting there forever displaying "3" on the LCDs. Pulled the batteries, booted without them (after the typical several power cycles), powered down, put the new batteries back in - and now it boots up first time every time. Could be the second part of that was due to a loose CMOS socket, but the dead batteries seem to be what causes the early boot power cycling.
 

craig5571

Member
May 31, 2020
60
6
8
For the RJ-45 ports that sit on top of each other, what are they for? are they console ports? serial? when are they live?
every time i have tried to connect to them i get no data out of them

my board was stuck on "3" on the led.. then i removed the CMOS battery in both nodes and the boards booted up.. i put new cmos batteries in and then nothing worked again. the led was stuck at "3" again. i removed the cmos batteries and node 0 is back to functioning (minus the sfp port) but node 1 is booting but not really acting right. it freezes..

i still get the power stalls when it is starting up, it takes about 3 times before it stays on...

the sfp ports don't work in either node.. but i got ESXi installed on them (before i installed and removed the cmos batteries). for network access i used the usb ports connected to usb nics. any way to troubleshoot the sfp ports? they don't show up in the bios at all. i have another board that everything works fine on.

funny thing is when i plug the sfp ports into my switch, the link light on the switch lights up...

 

craig5571

Cisco pin-out serial console. It's back a ways in the thread.
thanks for the reply. is it standard 9600 n 8 1?

the other n0 and n1 ports that are horizontal.. and use the usb cord .. work fine.. but after the system loads up they seem to turn off. are both sets of ports the same thing? just different form factors?

i have a cisco console cable.. but for some reason can't get any data out of them... it's a prolific usb console cable and works on my other devices..
 

itronin

Well-Known Member
Nov 24, 2018
1,234
793
113
Denver, Colorado
IIRC usb is ttyS1 and the rj45 is ttyS0.

BIOS shows up on the USB port. If you are booting an OS configured with a serial console and have it spec'ed to go to ttyS0 then that transition should happen after POST, once the bootloader is up. serial settings for your OS I guess will be specified in the bootloader. at a guess probably 38400, 57600, or 115200 8 n 1 - you could try each and just hit enter a few times to see if something appears.
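
A rough sketch of automating the "try each rate and hit enter" approach. It assumes the pyserial package is installed and that the console cable shows up as something like /dev/ttyUSB0 (both assumptions, adjust for your setup); the 9600 rate from the earlier question is included alongside the guesses above.

```python
from typing import Callable, Iterable, Optional

# Rates guessed in this thread; all tried as 8 data bits, no parity, 1 stop bit.
CANDIDATE_BAUDS = (9600, 38400, 57600, 115200)


def looks_like_text(data: bytes, threshold: float = 0.8) -> bool:
    """True if most of the bytes are printable ASCII or common whitespace."""
    if not data:
        return False
    printable = sum(1 for b in data if 32 <= b < 127 or b in (9, 10, 13))
    return printable / len(data) >= threshold


def open_serial(port: str, baud: int):
    """Open the port 8N1 with a 1 second read timeout (requires pyserial)."""
    import serial  # imported lazily so the helpers above work without pyserial

    return serial.Serial(
        port,
        baudrate=baud,
        bytesize=serial.EIGHTBITS,
        parity=serial.PARITY_NONE,
        stopbits=serial.STOPBITS_ONE,
        timeout=1,
    )


def probe(port: str,
          bauds: Iterable[int] = CANDIDATE_BAUDS,
          opener: Callable = open_serial) -> Optional[int]:
    """Send a few carriage returns at each rate; return the first rate that
    produces readable output, or None if none of them do."""
    for baud in bauds:
        conn = opener(port, baud)
        try:
            conn.write(b"\r\r\r")
            reply = conn.read(256)
        finally:
            conn.close()
        if looks_like_text(reply):
            return baud
    return None


if __name__ == "__main__":
    # /dev/ttyUSB0 is a placeholder device name.
    print(probe("/dev/ttyUSB0"))
```

A result of None doesn't prove the port is dead: the OS may simply not have a console configured on that tty yet.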
 

bob_dvb

For the RJ-45 ports that sit on top of each other, what are they for? are they console ports? serial? when are they live?
every time i have tried to connect to them i get no data out of them

my board was stuck on "3" on the led.. then i removed the CMOS battery in both nodes and the boards booted up.. i put new cmos batteries in and then nothing worked again. the led was stuck at "3" again. i removed the cmos batteries and node 0 is back to functioning (minus the sfp port) but node 1 is booting but not really acting right. it freezes..

i still get the power stalls when it is starting up, it takes about 3 times before it stays on...

the sfp ports don't work in either node.. but i got ESXi installed on them (before i installed and removed the cmos batteries). for network access i used the usb ports connected to usb nics. any way to troubleshoot the sfp ports? they don't show up in the bios at all. i have another board that everything works fine on.

funny thing is when i plug the sfp ports into my switch, the link light on the switch lights up...

I would suggest a full CMOS reset on both nodes, not just removing the battery. It might be that they are in some bad config?

Only one of my nodes has SFP+ issues, but you might find that the SFP+ ports don't like non-10G SFPs? I've read that can happen. Has anyone else used non-10G SFPs?