Dual Xeon-D

bob_dvb · Sep 3, 2020

Just found this on Amazon and so I might see if I can make some use of those SFF NVMe connectors... I don't have much use for U.2 drives.

SAS MiniSAS SFF-8643 to PCI-e Gen 3 x4 Lanes slot Adapter - Read 2369MB/s/Write 1058MB/s !!!: Amazon.co.uk: Computers & Accessories

www.amazon.co.uk

bob_dvb · Sep 7, 2020

Ran two tests today:
1) Does an NVIDIA P600 GPU work when plugged into an M.2 NVMe to PCIe x16 riser? Why yes, it does. It looks slightly ridiculous hanging in the case, and I had to be careful, but for the purposes of testing I got text on screen. It also showed me why Proxmox wasn't booting, which was useful. Just in case I want to do this permanently, I've ordered a low-profile bracket for the GPU so that it can be suspended above the motherboard in a comical way.

2) What happens when you plug in an OCP ethernet adapter? I am guessing I am the first to plug in an OCP adapter so far, because otherwise it would have got more attention... either the Mellanox card I got is unique or... the OCP connector is backwards! The SFP connector faces IN to the motherboard. The good thing was that the lights came on on the adapter and blinked correctly!

Clearly these OCP connectors were designed for SATA or some other internal magic. Also in my EATX case you couldn't fit anything in the Node 2 OCP because it's too close to the edge of the board and thus the case.

itronin · Sep 7, 2020

bob_dvb said:
Ran two tests today:
1) Does an NVIDIA P600 GPU work when plugged into an M.2 NVMe to PCIe x16 riser? Why yes, it does. It looks slightly ridiculous hanging in the case, and I had to be careful, but for the purposes of testing I got text on screen. It also showed me why Proxmox wasn't booting, which was useful. Just in case I want to do this permanently, I've ordered a low-profile bracket for the GPU so that it can be suspended above the motherboard in a comical way.

2) What happens when you plug in an OCP ethernet adapter? I am guessing I am the first to plug in an OCP adapter so far, because otherwise it would have got more attention... either the Mellanox card I got is unique or... the OCP connector is backwards! The SFP connector faces IN to the motherboard. The good thing was that the lights came on on the adapter and blinked correctly!

Clearly these OCP connectors were designed for SATA or some other internal magic. Also in my EATX case you couldn't fit anything in the Node 2 OCP because it's too close to the edge of the board and thus the case.

pictures or it didn't happen

<joking>

good to know. I have a pair of ASRock OCP to SAS 3008 cards that I'm going to plug in.

bob_dvb · Sep 7, 2020

itronin said:
pictures or it didn't happen <joking>

good to know. I have a pair of ASRock OCP to SAS 3008 cards that I'm going to plug in.

Challenge accepted... At least for the OCP part. Not going to insert a photo of my crimes against GPUs.

fake-name · Sep 7, 2020

fake-name said:
The mezzanine connectors are weird too. Just following a few of their pins makes me think they're pinned out for the standard OCP 2.0 mezzanine interface, but the headers are *backwards*. The connectors would be pointed towards the CPUs if you install a mezz card that fits the OCP standard.

bob_dvb said:
Clearly these OCP connectors were designed for SATA or some other internal magic. Also in my EATX case you couldn't fit anything in the Node 2 OCP because it's too close to the edge of the board and thus the case.

Hah! I thought they looked weird!

I wonder what their actual use-case was. I wonder if they had custom OCP mezzanine cards for this..

bob_dvb · Sep 11, 2020

While I am logging odd issues with my board....
My board has always taken four attempts to boot: it spins up then dies, spins up then dies... does this four times, then boots normally. I assumed it was the draw on the 12V rail from my cheap no-name PSU, and I saw a Corsair HX650 going cheap on eBay, so I nabbed it because it has three 12V rails. I plugged it in with no drives, just the motherboard, and... it still does this weird boot power cycling when it turns on.

Anyone else seen this?

Oh, and the Mini-SAS to PCIe adapter arrived, it was the wrong connector, it was SFF-8639 instead of SFF-8643 but something in the back of my mind had already told me not to buy the cables before the adapter board arrived, so I am not out of pocket. However the 8639 cables are slightly more expensive, so I have gone to AliExpress and will have to wait a month instead.

Current plan with this unit: my case is a 4U EATX rack case, so I am thinking of putting in an acrylic sheet as a mezzanine level for PCI-E extenders supporting half-height cards. It's not classy, but it's interesting. If there was an exact case for this board I would probably buy it, but I didn't see anyone suggesting there was one.

whbeers · Sep 12, 2020

For anyone else playing with these boards, I've been wrestling for a couple days after some initially positive tests on two boards. After flashing to L0.20 (thanks @fake-name !), I ended up with two boards consistently hanging at code "53".

What I've finally realized is that the 32GB modules I filled the boards out with are LRDIMMs, which are not compatible with the D1541. My initial testing was with 16GB Registered DIMMs from another project... luckily I've got another box I can throw these LRDIMMs into, so not a huge loss.

Hopefully this helps someone else avoid a costly mistake!

whbeers · Sep 13, 2020

Current progress - two boards / three working nodes - waiting for parts to repair one of the USB ports. My eventual plan is to mount these stacked in a 3U chassis for a 3U4N esxi cluster.

Under load a single board as configured (six 860qvo ssds, two m.2 970evoplus ssds, two each of 120mm and 40mm fans, and four 10GBase-T transceivers) peaks around 205W, so one of my 800W 3U PSUs should be able to power both boards comfortably ...with some sketchy-looking splitters I have on the way (Amazon.com: EPS 8 Pin Splitter,TeamProfitcom ATX CPU 8 Pin Female to Dual 8(4+4) Pin Male 12V for Motherboard Power Adapter Cable Braided Sleeved EPS Y-Splitter 8 Pin EPS Extension Cable 9 inches: Industrial & Scientific and ATX 24Pin 1 to 2 port Power Supply Extension cable PSU Male to Female Y Splitter 6963228384465 | eBay).
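As a rough sanity check on that estimate, here is a minimal sketch of the arithmetic. The 205W and 800W figures come from the post; the 80% derating factor is an assumption added for illustration, and the function name is hypothetical. It only covers steady-state load, not startup inrush or per-rail limits.

```python
# Rough PSU headroom check using the figures quoted in the post.
BOARD_PEAK_W = 205   # measured peak for one board as configured (from the post)
PSU_RATED_W = 800    # rated output of the 3U PSU (from the post)
BOARDS = 2           # both boards on one PSU


def psu_headroom(board_peak_w, boards, psu_rated_w, derate=0.8):
    """Return (total_load_w, usable_w, margin_w).

    derate is an assumed safety factor (not from the post): only this
    fraction of the rated output is treated as usable.
    """
    total = board_peak_w * boards
    usable = psu_rated_w * derate
    return total, usable, usable - total


total, usable, margin = psu_headroom(BOARD_PEAK_W, BOARDS, PSU_RATED_W)
print(f"load {total} W, usable {usable:.0f} W, margin {margin:.0f} W")
```

With these numbers the two boards together draw about 410W at peak, roughly half of the PSU's rating, which is consistent with "comfortably" as far as steady-state load goes.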

For testing, I was able to get ubuntu server installed over serial by adding "console=tty1 console=ttyS1,115200,n8" to the kernel parameters in grub. If anyone knows how to get the esxi installer to output to serial, please let me know (output seemed to stop after all the modules loaded).
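For making the serial console persistent after install, one option is to add the same parameters to `/etc/default/grub` and then run `update-grub`. The sketch below is only an illustration of that edit: the helper name is hypothetical, it assumes the usual `GRUB_CMDLINE_LINUX_DEFAULT="..."` line exists, and it writes the serial options in the form the kernel documentation uses (`115200n8`, without the second comma that appears in the post).

```python
import re

# Kernel documentation form: console=ttyS<n>,<baud><parity><bits>
SERIAL_ARGS = ["console=tty1", "console=ttyS1,115200n8"]


def add_console_args(grub_text, args=SERIAL_ARGS,
                     key="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return grub_text with args appended to the key's quoted value.

    Arguments already present are not added twice, so the function can
    be applied repeatedly without changing the result.
    """
    pattern = re.compile(
        rf'^({re.escape(key)}=")([^"]*)(")[ \t]*$', re.MULTILINE)

    def patch(match):
        current = match.group(2).split()
        merged = current + [a for a in args if a not in current]
        return match.group(1) + " ".join(merged) + match.group(3)

    new_text, count = pattern.subn(patch, grub_text)
    if count == 0:
        raise ValueError(f"{key} not found")
    return new_text


# Stand-in for the contents of /etc/default/grub (sample data, not a real file).
sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet"\n'
print(add_console_args(sample))
```

The last `console=` entry is the one the kernel uses for `/dev/console`, so keeping `ttyS1` last matches the ordering in the post.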

I briefly looked into what it would take to build an 8-lane pcie breakout adapter from the OCP connector... beyond my skill level, but I'm guessing not too difficult for someone with custom pcb design experience.
- relevant OCP spec: https://www.opencompute.org/documents/ocp-mezz-2-0-rev1-1-20200103-nocb-pdf
- this ought to work for the mezzanine card connector: 61083-124402LF Amphenol ICC (FCI) | Connectors, Interconnects | DigiKey

n17ikh · Sep 14, 2020

whbeers said:
Current progress - two boards / three working nodes - waiting for parts to repair one of the USB ports. My eventual plan is to mount these stacked in a 3U chassis for a 3U4N esxi cluster.

Under load a single board as configured (six 860qvo ssds, two m.2 970evoplus ssds, two each of 120mm and 40mm fans, and four 10GBase-T transceivers) peaks around 205W, so one of my 800W 3U PSUs should be able to power both boards comfortably ...with some sketchy-looking splitters I have on the way (Amazon.com: EPS 8 Pin Splitter,TeamProfitcom ATX CPU 8 Pin Female to Dual 8(4+4) Pin Male 12V for Motherboard Power Adapter Cable Braided Sleeved EPS Y-Splitter 8 Pin EPS Extension Cable 9 inches: Industrial & Scientific and ATX 24Pin 1 to 2 port Power Supply Extension cable PSU Male to Female Y Splitter 6963228384465 | eBay).

For testing, I was able to get ubuntu server installed over serial by adding "console=tty1 console=ttyS1,115200,n8" to the kernel parameters in grub. If anyone knows how to get the esxi installer to output to serial, please let me know (output seemed to stop after all the modules loaded).

I briefly looked into what it would take to build an 8-lane pcie breakout adapter from the OCP connector... beyond my skill level, but I'm guessing not too difficult for someone with custom pcb design experience.
- relevant OCP spec: https://www.opencompute.org/documents/ocp-mezz-2-0-rev1-1-20200103-nocb-pdf
- this ought to work for the mezzanine card connector: 61083-124402LF Amphenol ICC (FCI) | Connectors, Interconnects | DigiKey

View attachment 15731

That looks like a really slick setup. I saw the seller has some of these in stock again; after missing out the first two times I've put in an offer for two boards. I don't have any DDR4 though, looks like that's gonna cost me quite a bit depending on how much I want to feed into these.

bob_dvb · Sep 19, 2020

I see someone in NJ selling theirs...

ASRock AK-D1541 Server motherboard with dual (2) Intel Xeon D-1541 CPU's | eBay

This is a custom board that is made by ASRock Rack. Model on the board says AK-D1541. Has two XEON D 1541 CPU's with each one acting as a separate and independent node. Has a serial to USB bios, and two RJ45's consoles.

www.ebay.com

bob_dvb · Sep 19, 2020

I've ordered four OCP to dual PCI-E 4x boards from China which I couldn't find anywhere other than Taobao. But thanks to another thread here I finally located the adapters.

I plan on making the adapters available to anyone in Europe because they don't seem to be available anywhere else. Not retailing, I just think I only need one, so I bought four!

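Tmall Taobao Overseas: spend less, find treasures!

Tmall Taobao Overseas is a cross-border e-commerce platform aimed at overseas Chinese shoppers, covering consumers in more than 200 countries and regions. Its core sites include Hong Kong (China), Macau (China), Taiwan (China), Singapore, Malaysia, Australia, and Canada.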

m.intl.taobao.com

whbeers · Sep 19, 2020

For the BU8 epoxy blobs: I had an extra board I couldn't get working, so chipped away carefully at the epoxy.

Looks like they're TPMs - Infineon SLB9670VQ2 as best I can tell.

I also noticed traces on the bottom of the board going from the FPGA to the SFP+ connectors, which makes me hopeful that there might be some form of remote management built into it. When I get some time I'll investigate if they're chatty at all during boot.

n17ikh · Sep 21, 2020

My two boards have arrived. Nothing broken (as far as I could tell). Spent a bit wrangling the serial console but finally got it to cooperate this evening. One of the two boards, however, exhibits the same behavior that @bob_dvb is seeing, where it tries to boot four times and then boots. Same power supply in my test setup for both boards, so I don't know that that's the issue. I did notice that the 7-segment LED indicator stays powered up and indicating the whole time it's power cycling, which is pretty odd. Also on the "bad" board someone has taken the stickers off the two BIOS chips. Maybe it had the BIOS ICs in backwards at some point. The BIOS version is P0.18 on both nodes. I didn't notice what the problem-free board had for BIOS versions, I'll take a look the next time I've got it hooked up to power.

I suppose I'm lucky in that all the LAN ports seem to work without any issue on all nodes.

fake-name · Sep 24, 2020

Well I caved and bought another. I'm a sucker.

I want to see if I can stuff 2 boards into 2U.

n17ikh · Sep 25, 2020

fake-name said:
Well I caved and bought another. I'm a sucker.

I want to see if I can stuff 2 boards into 2U.

I think it's quite possible to fit two into 2U; in fact it's what I wanted to do but I haven't been able to dig up a suitable chassis at a reasonable price yet. That feeling of being a sucker didn't start up for me until I started adding RAM and SSDs and shopping for a switch with enough 10GbE ports.

For now, I was test fitting in a 4U Rosewill chassis that has been gathering dust:

The disadvantage of this kind of setup is that it makes it pretty inconvenient to work on the bottom system. Maybe I should be shopping for a pair of 1U cases instead.

whbeers · Sep 25, 2020

As an update on my end: the only working second board I had ended up with random and (luckily) frequent data corruption, enough that it was difficult to get an OS installed. I managed to rule out the USB port and memory as culprits by moving drives and memory from one (working) node to the problematic one, only to watch the corruption slowly eat away at the rootfs when getting a first round of updates.

I also couldn't manage to get both boards to start when powering them off a single PSU after mounting them in a 3U chassis (I could have sworn I tested this previously?) - not sure if the addition of drives and extra memory bumped it over a threshold or what, but I have two 1U FSP PSUs on the way as a backup plan.

When I have a gap between other projects that have taken priority I'll poke at one of the boards with a bad NIC to see if I can bring it back to life... otherwise @MONTREAL-COMPUTERS seems to be getting low and the price per board has gone up

n17ikh · Sep 25, 2020

whbeers said:
As an update on my end: the only working second board I had ended up with random and (luckily) frequent data corruption, enough that it was difficult to get an OS installed. I managed to rule out the USB port and memory as culprits by moving drives and memory from one (working) node to the problematic one, only to slowly eat away at the rootfs when getting a first round of updates.

I also couldn't manage to get both boards to start when powering them off a single PSU after mounting them in a 3U chassis (I could have sworn I tested this previously?) - not sure if the addition of drives and extra memory bumped it over a threshold or what, but I have two 1U FSP PSUs on the way as a backup plan.

When I have a gap between other projects that have taken priority I'll poke at one of the boards with a bad NIC to see if I can bring it back to life... otherwise @MONTREAL-COMPUTERS seems to be getting low and the price per board has gone up

Is your rootfs on the NVME drives? I wonder if you're running into PCIe latency/sleep issues.

whbeers · Sep 25, 2020

n17ikh said:
Is your rootfs on the NVME drives? I wonder if you're running into PCIe latency/sleep issues.

No nvme during install

This is proxmox on a ZFS root (spread across 6x sata ssds). After install on the working nodes I added an nvme drive as an l2arc, but during install I'm using an m.2->pcie adapter for a gpu so that's ruled out too.

bob_dvb · Sep 26, 2020

whbeers said:
No nvme during install

This is proxmox on a ZFS root (spread across 6x sata ssds). After install on the working nodes I added an nvme drive as an l2arc, but during install I'm using an m.2->pcie adapter for a gpu so that's ruled out too.

Sounds the same as my setup on the second node: M.2 to PCI-E, into a Quadro for the Proxmox install. Then I have dual SATA SSDs for Proxmox and three WD 4TB drives for LXC/VMs. Running smoothly here for 4 days straight. So it sounds like a hardware issue; might be an idea to reset the BIOS just to make sure?

whbeers · Sep 26, 2020

spent a bit more time on the board with a bad nic, still not working but adding some notes here.

to be more precise, the issue is that both 10G interfaces on node1 of this board do not negotiate a connection (node2 is fine).
- activity/connectivity LEDs do not light up, except (steady) for a few seconds when the board is first powered up
- bios configuration is identical on both nodes, jumper configuration is identical to another working board
- probed all the pins of the SFP+ cage, all voltages were comparable across the two nodes, with (predictably) more variability on the differential rx/tx pairs of the working node
- probed a handful of voltages around the Inphi PHY, all seem to be comparable to a working node
- infrared camera shows similar temperatures for all components around the transceiver cages / PHYs
- I realized that the SFP cages/connectors are snap-fit and non-soldered, so was also able to rule out the connection between the transceiver and the board by swapping the connectors and trying the same connector/transceiver pairing across two nodes (kinda already ruled them out as a failure point by probing the sfp pins though)
- the OS does see two 10G network interfaces and seems to be able to interact with them in a limited capacity
- `ethtool -p` is not able to blink interface LEDs, `ethtool -r` does not have an effect
- I can get an eeprom dump using `ethtool -e`, and it's mostly similar to one from a working node.
- most surprising (and encouraging!) to me, `ethtool -m` can actually interact with the transceiver and get real data. nothing different from a working node with the same 10GBaseT transceiver outside of SN and temperature/voltage readings.
- in case you're wondering: yes, I was just going down the list of options in `ethtool -h` to see what I could do with it at this point
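The comparison of `ethtool -m` / `ethtool -e` output between a working and a suspect node could be automated along these lines. This is a sketch, not what was actually run: the function names are hypothetical, the ignored field labels are an assumption about how ethtool prints the serial number and temperature/voltage lines, and the sample text is made up.

```python
import difflib
import subprocess

# Fields expected to differ between two healthy modules (per the notes
# above: serial number and temperature/voltage readings). The exact label
# text is an assumption and may need adjusting to match real output.
IGNORE_PREFIXES = ("Vendor SN", "Module temperature", "Module voltage")


def ethtool_output(flag, iface):
    """Run e.g. `ethtool -m <iface>` and return its text (usually needs root)."""
    result = subprocess.run(["ethtool", flag, iface], check=True,
                            capture_output=True, text=True)
    return result.stdout


def meaningful_diff(good, bad, ignore=IGNORE_PREFIXES):
    """Return unified-diff lines between two dumps, skipping ignored fields."""
    def keep(text):
        return [line.rstrip() for line in text.splitlines()
                if not line.strip().startswith(ignore)]
    return list(difflib.unified_diff(keep(good), keep(bad),
                                     "working", "suspect", lineterm=""))


# Made-up sample dumps; only the ignored fields differ between them.
good = "Vendor name : ACME\nVendor SN : A1\nModule temperature : 40.1 degrees C\n"
bad = "Vendor name : ACME\nVendor SN : B2\nModule temperature : 42.7 degrees C\n"
print(meaningful_diff(good, bad))
```

On real hardware the two strings would come from `ethtool_output("-m", ...)` on each node, saved to files and compared on one machine; an empty result would back up the observation that the transceiver data matches apart from serial number and sensor readings.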

I also followed traces on the bottom of the board between the SFP pins and the FPGA - if I'm remembering correctly the only exposed traces are running to 2 (tx fault), 3 (tx disable) and 8 (receiver loss of signal). There's clearly a lot of buried traces going to other pins, so I can't rule out additional connectivity to the FPGA - I still plan to watch for chattiness at boot at some point.

I keep coming back to a theory that the FPGA is for some reason setting the tx disable / fault pins, but the measured voltages don't support that theory. So, unsure if I'll get much further without finally teaching myself to use a logic analyzer and inspecting protocol data on the RX/TX lines...which does sound fun, but not exactly where I was hoping this project would take me
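For reference while probing, here is a small sketch that maps a measured voltage on those three pins to a logic state. The pin numbers and active-high meanings follow the standard SFP+ pinout (SFF-8431); the voltage thresholds are assumed LVTTL-style values and the readings in the example are invented, not measurements from this board.

```python
# Low-speed SFP+ status/control pins mentioned above (SFF-8431 pinout).
# Each entry: pin number -> (signal name, meaning when the pin is HIGH).
PINS = {
    2: ("TX_Fault", "module reports a transmitter fault"),
    3: ("TX_Disable", "host has switched the transmitter off"),
    8: ("RX_LOS", "module reports loss of receive signal"),
}

# Assumed LVTTL-style thresholds; check the relevant datasheet for real limits.
V_HIGH_MIN = 2.0
V_LOW_MAX = 0.8


def interpret(pin, volts):
    """Describe the logic state implied by a DC voltage on a status pin."""
    name, meaning_when_high = PINS[pin]
    if volts >= V_HIGH_MIN:
        return f"{name}: HIGH ({meaning_when_high})"
    if volts <= V_LOW_MAX:
        return f"{name}: LOW (deasserted)"
    return f"{name}: indeterminate ({volts:.2f} V)"


# Invented example readings, for illustration only.
for pin, volts in [(2, 0.05), (3, 0.10), (8, 3.28)]:
    print(f"pin {pin}: {interpret(pin, volts)}")
```

A steady multimeter reading only shows the resting level, so a pin that is briefly pulsed high during link bring-up would still look deasserted here, which is one reason a logic analyzer capture could show something the voltage measurements did not.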
