4P board with 8 double-width card backplane (unobtanium?)


RobertFontaine

Active Member
Dec 17, 2015
Winterpeg, Canuckistan
The SM X9-DRG-Qf or X10-DRG-Qf boards seem like a perfect setup for a four-Xeon-Phi workstation/server, and I have managed to grab an X9-DRG for cheap.

Which led me to looking at 4P boards and thinking of 8 Xeon Phis, because more.

All the GPGPU boards with an 8-card backplane that I am seeing are 2P. The Xeon Phi's strength is that it can be used in both distributed and SMP models, as a general processor as well as an FP64 coprocessor. For offloading that has a lot of chatter between the mother ship and the Phi card, you want all the PCIe bandwidth you can get. A 2P board is probably optimal for a server with five or six cards, or a workstation with four plus a video card for Quake (or a RAID card).
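To make the chatter concrete, here is a minimal offload sketch in C using Intel's LEO pragmas (which these coprocessors support); the array names, sizes, and build line are made-up placeholders. Every in/out clause is a trip across the PCIe link, which is why slot bandwidth matters when the host and card talk a lot:

/* Hypothetical offload sketch; assumes the Intel compiler and MPSS are set up,
   built with something like: icc -qopenmp offload_demo.c -o offload_demo */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)   /* made-up problem size, ~128 MB of doubles */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) { a[i] = (double)i; b[i] = 0.0; }

    /* This region ships 'a' to card 0 and brings 'b' back,
       so the PCIe link is crossed in both directions per call. */
    #pragma offload target(mic:0) in(a:length(N)) out(b:length(N))
    {
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            b[i] = 2.0 * a[i];
    }

    printf("b[42] = %f\n", b[42]);
    free(a);
    free(b);
    return 0;
}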

I've never really looked at commercial boards with backplanes before. Did/does SM make an X9/X10 board that does this kind of HPC (I don't see one on their site)? Am I even going to find such a beast on the junk market, or is this stepping into custom-board territory?

Thanks,
Robert
 

Patriot

Moderator
Apr 18, 2011
Why are y'all wanting Phis so bad? The GPGPU performance is abysmal compared to gfx cards. Next gen should be much better... but currently they suck.
 

RobertFontaine

Active Member
Dec 17, 2015
Winterpeg, Canuckistan
If you want to use a Phi to calculate a matrix operation, then they suck compared to a $4k Tesla (although these are $125, a video card like a 7990 would still be better).

If you want a teraflop of general-purpose compute that you can either offload or distribute, then the Phi architecture crushes video cards.
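Back-of-the-envelope from the spec sheet, as I remember it: a 5110P is 60 cores at about 1.05 GHz, each doing an 8-wide double-precision FMA per cycle, so 60 x 1.053 GHz x 8 x 2 ≈ 1.01 TFLOPS peak FP64 per card.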

Knights Landing is also NOT a video card, and at $4k it will also suck if you compare it to the next generation of Teslas, IFF you are rendering pictures.

But then again I am not rendering video. :)
 
  • Like
Reactions: mstone

Patriot

Moderator
Apr 18, 2011
I am not rendering videos either... I am scaling out a neural network. Still not competitive lol.
I have 3 5110Ps btw. lol.

Neural nets don't require double precision... so even the older K10s I have do 4.5 TFLOPS at 200 W.
But yes, if you want double precision, which not many things need, then I guess the Phis are not the worst choice... They are just deprecated; Knights Landing will be nothing like the current ones.
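Rough spec-sheet numbers, from memory and approximate: a K10 (two GK104s) does that ~4.5 TFLOPS in single precision but only about 0.19 TFLOPS in double (FP64 runs at roughly 1/24 rate on GK104), while a 5110P is roughly 2 TFLOPS single and 1 TFLOPS double. So for single-precision neural-net work the K10s win easily on flops per watt, and for FP64 the Phi comes out several times ahead.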
 

RobertFontaine

Active Member
Dec 17, 2015
Winterpeg, Canuckistan
Picking an algorithm that works in FP32 and does not require general compute, and then saying "look, my 980 Ti is faster", is great advertising for NVIDIA. You should take note that the only parallel compute they pushed was DNNs. FP64 has been crippled since Fermi, and a CUDA core is not capable of general computation. I digress. GPGPU is absolutely the fastest at what it does, but it is a one-trick pony.
 

RobertFontaine

Active Member
Dec 17, 2015
Winterpeg, Canuckistan
When I can afford Knights Landing for my basement lab I might jump in (likely not). I keep hoping they will have a developer program, or that the US government will block them from shipping to China (Tianhe) after they have filled their warehouse with them again.

Till then the Phi is perfect for my pile of purposes: Kaggle box, work on making GnuBG a little smarterer, data mining, machine learning, general math box, studying OpenMP/OpenACC programming, and a couple of other interesting ideas that I have wanted to explore but am not ready to chat about. C++/R/Python (Anaconda)/MATLAB, etc.

The programming model for the Phi and Knights Landing is the same. Knights Landing will use a more powerful set of CPU cores and will add a nice collection of 512-bit-wide instructions, but from a development perspective the code that works well on my Phi box will scale well on a Knights Landing cluster. I think that when I have a model I am happy with on my little box, the best thing to do is buy AWS time to run the job, or find a sponsor with heavy metal.

Three nodes is actually a pretty good number. It isn't 1 or 2, so it is effectively many. With more than three Phis, the smart lab thing to do is probably create a second node and put a 40Gb fibre cable between them. So much of optimizing these programs is latency: PCIe, QPI, cache. Amdahl's law is a pain in the butt, and in practice it's tough to get past all the wait times.
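For reference, the quick sanity check here is Amdahl's law with made-up numbers (p is the parallel fraction, N the thread count, both assumed for illustration):

S(N) = 1 / ((1 - p) + p / N)
S(240) = 1 / (0.05 + 0.95 / 240) ≈ 18.5x for p = 0.95 on a 60-core, 4-thread-per-core Phi

Even with 95% of the work parallel, the serial 5% caps the speedup nowhere near 240x, before PCIe, QPI, and cache latency take their cut.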
 

tjsgifan

Member
Nov 20, 2015
These solutions do have PCIe switching in place due to the number of PCIe lanes available, so in theory your bottleneck could be two offloads going across the same PCIe switch port at the same time, but with PCIe 3.0 x16 that's still a fast bus. If you pair it with some 18-core E5 v3 CPUs, that still gives you a lot of power.
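Rough numbers for that, assuming spec-sheet rates: PCIe 3.0 is about 985 MB/s per lane, so an x16 switch port is roughly 15.75 GB/s each way; PCIe 2.0 is 500 MB/s per lane, so a Phi's own x16 gen-2 link tops out around 8 GB/s. Two Phis offloading through the same gen-3 x16 upstream port (2 x 8 GB/s = 16 GB/s) are just about at the port's ceiling, but a single card never sees the gen-3 limit at all.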

The Supermicro 8-way GPU servers, such as the SuperServer 4028GR-TRT, are still probably the fastest single-image system you can get off the shelf at a reasonable cost.
You could talk to someone like SGI for a single-image system that scales to 32 GPUs, but then your budget needs to be much bigger. Look at the SGI UV series for such a product.

Or you could look at working outside of the box: get your 4-way server with lots of PCIe 3.0 x16 slots, and then use an external PCIe expansion cabinet such as the Quantum EXR3600 from Exxact Corp to connect all your cards to your server.

:)

We have done some interesting solutions using this type of product and can always find something that will fit the job....
 

RobertFontaine

Active Member
Dec 17, 2015
Winterpeg, Canuckistan
Don't get too excited, my rack is built with plywood. :)

I'm thinking risers with a PLX chip for models where offload isn't the constraint. The Phis are only PCIe 2.0, so the constraint would be the switching overhead of the PLX chip.

Sometimes that could be perfectly fine.
The v3s should drop in price (hopefully in early '17); 16-core v3s, however, probably won't be in the bargain bin till '19.

Will probably build three 2P, 4-Phi compute nodes and stop, as that gives me n nodes in a networked environment to write code against. The 8-core v1s are a bargain. 1,400 watts per node may annoy my wife if I don't pay the power bill though. ;)
 

William

Well-Known Member
May 7, 2015
I ran a review on a Supermicro 7048GR-TR about a year ago, impressive machine.
Supermicro 7048GR-TR (Intel C612) Workstation Tower System Review

You can see that when running LinPack on everything it pulled about 1,600 watts; running 8x Phis would put you close to 30 amps. Not many homes can handle that.
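Rough math on that, using the review's own numbers: if four Phis plus the host pulled about 1,600 W under LinPack, doubling to eight cards adds very roughly another 4 x 250-300 W, so call it 2,700-2,800 W at the wall. At 115 V that is around 24 A, which is why you're looking at something close to a 30 A circuit rather than a standard 15 A household one.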

I only had those cards for a very short time and had to do a crash course on operations, right in the middle of Christmas/New Year's, so the farthest I got was getting LinPack to run on them. It's not an easy task, for me anyway. But if you know what you are doing you should have no problems.
 

RobertFontaine

Active Member
Dec 17, 2015
Winterpeg, Canuckistan
I have no idea what I'm doing but it's entirely a skunkworks project and I am perfectly happy to learn how to build it out as well as write the code.

If it turns into something, that would be nice, but if not, it's a hobby that keeps me properly engaged.
 

William

Well-Known Member
May 7, 2015
In that link I posted I went over how to get communication with a card started under Windows; from there you can upload programs and run them. It should be rather simple to do, but if you have any questions, fire away.

Basically each Phi is a separate Linux-based computer on a PCIe slot that "can" communicate with the host, with other cards, or over the network. It can get rather complex pretty fast.
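As a small, hypothetical illustration of that (assuming the Intel compiler and MPSS; file names are made up), a native-mode program is just cross-compiled for the card, copied over, and run on the card's own Linux:

/* hello_mic.c - native-mode sketch, built for the card itself with
 *   something like: icc -mmic -qopenmp hello_mic.c -o hello.mic
 * then copied to the card and run there (the card has its own Linux and IP):
 *   scp hello.mic mic0:~/ ; ssh mic0 ./hello.mic
 */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* On a 60-core Phi this typically reports 240 threads (4 per core). */
    #pragma omp parallel
    {
        #pragma omp single
        printf("Hello from the Phi: %d OpenMP threads\n", omp_get_num_threads());
    }
    return 0;
}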

They were fun to mess around with, and I would have liked to get a few other benchmarks running on them, but I just ran out of time. Intel only allowed us to have the 4x cards for 30 days, and I had to deep-dive very fast on how to get these up and running; it was no easy task for me LOL
 
  • Like
Reactions: Chuntzu

RobertFontaine

Active Member
Dec 17, 2015
Winterpeg, Canuckistan
Thanks, with a little luck the power supply in my backpack will get me to first boot this evening. KVM/CentOS for this one... It will be an all-purpose machine for a couple of months while I build a cheap workstation and a file/data sewer.
 
  • Like
Reactions: William

Deslok

Well-Known Member
Jul 15, 2015
deslok.dyndns.org
What about Dell's R9xx series systems? Taking a look at them, an R930 could hold 7 double-width cards; I'm not sure about the older models offhand.
 

William

Well-Known Member
May 7, 2015
Well those are beasts, that's for sure. I can picture lots of noise and a 30-amp line needed.
If you can budget the systems and power use, sure ;)
 

RobertFontaine

Active Member
Dec 17, 2015
Winterpeg, Canuckistan
Alphacool out of Germany makes a water-cooling block for the 31S1P, so it is entirely possible to build a wife-safe rig. I have two of them and the fit is adequate.

With 3 compute nodes running I should be able to make espresso. A Xeon Phi draws about the same power as a GTX 580, 270 watts-ish per card. Essentially the same as running 4 video cards.

220V and a proper power system will probably be an issue, but I will burn that bridge when I come to it. I might find a victim, I mean sponsor, with rack space before the divorce papers are filed.

For this week I will be happy if I can get KVM/CentOS/Phis booting and a development VM running. Purely version 0.1.
 

Deslok

Well-Known Member
Jul 15, 2015
deslok.dyndns.org
Does that water cooler actually reduce it to a single-slot card? If so, that should certainly help with 8 of them in a 4P system.
 

RobertFontaine

Active Member
Dec 17, 2015
Winterpeg, Canuckistan
No, it's very double-wide.
The X9DRG-QF is set up nicely for 4 double-wide cards, 2 on each CPU...

Except the RAM blocks the 5th PCIe slot. I'm going to need a PCIe extension cable to get my InfiniBand card on the node with 4 Phis.