Assembled an 8-GPU installation from an ASUS RS720-E6 server and Tesla S2050 boxes

falanger

Member
Jan 30, 2018
70
13
8
37
Good day, everyone. If anyone is interested in assembling an inexpensive 8-GPU system for professional computing from old components, I'm happy to describe what was done and how.
There are assembly quirks that are hard to work out without prior experience, particularly around the Tesla S2050 GPU server boxes.
I am building this rig for deep-learning work on neural networks.
Apologies in advance for my bad English; I am from Russia, my knowledge of the language is poor, and I am writing through Google Translate.
 

falanger

If anyone is interested in this topic, write here and I will describe it in detail and photograph the equipment.
 

jverschoor

New Member
Mar 12, 2021
15
0
1
Hi Falanger, yes, we are interested. We purchased an Nvidia S1070 and are looking to put 4 Nvidia K80s in it.

We won't use the Gen 2 HIC connectors; instead, we are trying to route Gen 3 PCIe extension cables from the rear interposer on the C6220 II nodes directly to the K80s in the S1070.

So we are using the S1070 exclusively to power and cool the K80s. (We calculated that the S1070 should have a sufficient power supply: since the PCIe slots will be powered from the C6220 II nodes, the S1070 won't have to power them, which saves 4 x 75 W. Only cooling may be an issue, but the four K80s are very unlikely to be maxed out simultaneously.)

So yes, we would be very interested in your layout. (We have watched your videos on the S2050, which were very helpful.)

Kind regards, JJ
 

falanger

If you try to connect cards in the S2050/S1070 boxes directly to the C6220, you will most likely just damage the equipment. Even if nothing burns out, the cables will not be of sufficient quality, and the boxes' own onboard electronics will interfere, so you will not be able to use the devices.
Since you have already bought these boxes, the most reasonable option is to buy K20m 5 GB cards for them, which draw 225 W, and make sure the coolers work, as I do at home. Or use K20X or K40 cards, which draw 230-250 W, and lower their power limit to 225 W in software with nvidia-smi. You will have an x16 Gen 2 bus, but you will be able to use the cards fully and without risk. For the K80s, buy an EEB server board with four x16 slots, flexible 11.8-inch Thermaltake riser cables (the kind made of separate ribbon cables, available on Amazon), and a 2 kW ATX GPU-mining power supply.
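Lowering the board power limit in software is done with `nvidia-smi`; a minimal sketch (GPU index 0 is an assumption, and the allowed limit range varies by card and driver):

```shell
# Sketch: cap a Kepler card (K20X/K40) at 225 W via nvidia-smi.
# Requires the NVIDIA driver and root; GPU index 0 is assumed.
sudo nvidia-smi -i 0 -pm 1        # enable persistence mode so the limit sticks
nvidia-smi -i 0 -q -d POWER       # check the card's min/max enforceable limits first
sudo nvidia-smi -i 0 -pl 225      # set the board power limit to 225 W
```

The limit resets on reboot unless reapplied, so a startup script or systemd unit is the usual place for it.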
 

jverschoor

Hi Falanger,

Thanks for this:

- We are using those high-quality 3M/Dell PCIe extenders (the same ones you have in your video where you connect the K80s to the C8220 mobo), so cable-quality-wise I think it will be OK.
- When you say "the boxes' electronics working on their own", do you mean that there is no "emergency throttling cable" installed? We figured that as long as we provide sufficient power and cooling to the K80s, there would be no need for throttling.
- Coincidentally, we have a person from Ukraine joining in May, so I will ask him whether he speaks Russian and is willing to help out already.

I will let you know. Thanks so much for all this already.

Kind regards, JJ
 

falanger

The S1070 and S2050 boxes have their own electronics that control the x16 slots: they detect which cards are installed, then compress and re-encode the signals for transmission over special cables to the P797 HIC cards. If you take a riser cable like mine, cut it, and solder it directly to an x16 slot inside the card box, it probably won't work, and there is a real risk of burning out both your server board and the box's board. I understand that you want x16 Gen 3, but this box does not use it. In fact, x16 Gen 2 is enough for most tasks, unless you have decided to compete with Google and serve millions of users.
For 99% of the tasks I run on my system (astrodynamics of star clusters, deep learning of neural networks, recognition of text, video, and audio, hydro- and gas-dynamics calculations), a complete reload of a K20X 6 GB card's memory happens only about twice per second. x16 Gen 2 is enough, and without the risk of a DIY experiment burning expensive equipment.
I really want to help you, but you have to understand that soldering leads onto equipment in ways it was not designed for can destroy it.
In addition, the S1070/S2050 boxes contain PLX switch chips that split each x16 link in two. You can see this by running lspci under Linux (sudo lspci) and looking for the PLX lines. If you just solder wires between a slot in the box and a slot in your C6220, you will not get two working x16 slots at once; either one slot will work, or both will malfunction because the switch cannot tell which slot a signal is meant for.
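The lspci check can be sketched like this (the sample device line below is hypothetical, shown only to illustrate what a hit looks like; 10b5 is PLX Technology's PCI vendor ID):

```shell
# On the live box, list PCI devices and keep only the PLX switches:
#   sudo lspci | grep -i plx
#   sudo lspci -d 10b5:        # 10b5 = PLX Technology vendor ID
# Hypothetical sample line, to show what a hit looks like:
sample='02:00.0 PCI bridge: PLX Technology, Inc. PEX 8632 32-lane PCIe switch'
echo "$sample" | grep -ci 'PLX Technology'
```

On an S1070/S2050, each host link should fan out through such a switch to two GPU slots, so a non-zero count confirms the topology described above.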
Moreover, you will end up paying an engineer more than if you had simply put K20/K40 cards in your box and left them to work. For your K80s, buy an ASUS Z9PE-D16/2L server board with four x16 Gen 3 slots, a powerful 2 kW ATX power supply, and flexible Thermaltake risers that work reliably, and connect them all together.
 

jverschoor

Hi Falanger,

Once again thank you for your helpful insights, but I didn't explain our intent very well.

See,

- We don't intend to solder the connections. Rather, we intend to take the HIC cards out of the S1070 in their entirety (and since the HIC cards are integrated with the PCIe slots, we are essentially removing the PCIe slots from the S1070).
- Then, in the space freed up by removing the HIC cards, we intend to route the GPGPU expansion cables so that their PCIe ports sit where the HIC PCIe slots used to be. The other end of each GPGPU expansion cable attaches to the GPGPU rear slot on a C6220 II, so the cable essentially forms a bridge between the C6000 chassis and the S1070 chassis.


So the entire layout is more like what you have in your C8220x videos, where you:

- connected a K80 to the C8220x Mobo via a flex riser
- cooled K80 independently via the axial fan
- powered the K80 independently


It's just that we are trying to use the S1070 as a convenient 1U solution to:

- house 4 K80 cards
- power the 4 K80 cards
- cool the K80 cards

Because:

- we are space-constrained (the alternative would be a 3U mining chassis to house the GPUs)
- we can only have one C6220 II node address one GPU at most (due to software limitations).


So your videos on the C8220x and S2050 are very very helpful in making this solution work.

I guess the only question remaining is whether the front fans on the S1070 will continue to run when the HIC cards are completely removed (and the S1070 doesn't see any GPU installed), but I reckon that removing the two blue wires from the power connector on the S1070 (as you did on the S2050) bypasses this check as well.

I will have some time this weekend to think it through a bit better and let you know.

Kind regards, JJ
 

falanger

I understand: you want to disconnect the narrow boards with the x16 slots and the built-in P797s. But you must understand the risk that voltage from the box can get into your C6220/K80 and disable it. On my own build with a pair of K80s, when I tried to power the cards from two different ATX supplies (850 W + 1000 W), I had problems: +12 V fed back into the second supply and would not let it switch off, and the protection circuitry tripped constantly. I decided not to risk equipment that cost me a year's earnings and bought a single 1600 W supply instead. The problems disappeared, and I have been using the machine for calculations for a long time since; I just never filmed a video of it.
 

jverschoor

Hi Falanger,

Thanks once again. I think I understand what you are trying to tell me now, i.e.:

You were using two different power supplies to power your mobo and your GPUs independently, namely:

- 1 x ATX 1000 W (Chieftec Proton 1 kW) to power both K80s (2 x 225 W = 450 W)
- 1 x ATX 850 W (Thermaltake RGB 850 W) to power the mobo + 3 PCIe ports + peripheral equipment, i.e.:

- E5-2650L v2 (2 x 70 W = 140 W)
- 8 x 16 GB DDR3-1600 (8 x 2.5 W = 20 W)
- 3 PCIe ports (2 x 75 W + 1 x 4 W = 154 W)
- a bunch of disks (4 x 2 W = 8 W)
- fans (6 x 7 W = 42 W)
- which leaves (850 W - 364 W =) 486 W for the mobo + whatever else is left

So each PSU individually was sufficient for the equipment it was connected to, but whenever the 1000 W unit lost power and the K80s no longer received power on their 8-pin connectors, the mobo was unaware and kept driving signals to the K80s.
However, moving to the single 1600 W PSU (SIRIUS LW1600PG) solved this problem, as when power failed, both the mobo and the K80s would shut down simultaneously.

I think I am willing to take that risk, as the S1070 and C6000 both reside in a data center where power is very reliable, and I have never seen a PSU fail. We are using:

- 2 x 1400 W PSUs to power the 4 x C6220 II nodes in the C6000 chassis, and
- 1 x 1400 W to power the 4 x K80s in the S1070 chassis. (The S1070 power supply is an Eltek Valere C1250C1-NV, which I think is 1400 W.)

If I am willing to take that split-power risk, the remaining risks are:

- whether 1400 W is sufficient to power 4 K80s: we think so, as 4 x 225 W = 900 W, and the S1070 fans + board won't consume much power. (Currently there are 4 x Tesla M2090s in the S1070, rated at 250 W, whilst the K80s are rated just 50 W higher, i.e. 300 W. So we would need 4 x 50 W = 200 W more to power the K80s; however, if we draw the PCIe-slot power from the C6000 PSUs rather than from the S1070 PSU, we reduce the S1070 supply requirement by 4 x 75 W = 300 W. So, net-net, we would stay within the S1070's power budget.)

- whether the S1070 has sufficient cooling capacity for 4 K80 GPUs: I presume that pulling the blue pins from the S1070 power supply makes the fans run at maximum continuously, which likely provides sufficient cooling as long as not all K80s are maxed out simultaneously. (The Nvidia Tesla S870 supposedly doesn't check for GPUs, so with one of those we would not have to pull the blue pins and fan control would likely be retained... but I just can't find any S870 anywhere online.)

- whether the C6220 II mobo can supply the additional 75 W to the edge slot without using the extra 12 V power on the nearby 4-pin connector.
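The power-budget arithmetic above can be sanity-checked with a short sketch (all wattages are the nameplate figures quoted in this thread; the 1400 W PSU rating is an estimate, not a measurement):

```shell
# Power budget for swapping 4x M2090 (250 W) for 4x K80 (300 W) in the S1070.
N=4; K80=300; M2090=250; SLOT=75; PSU=1400
extra=$(( N * (K80 - M2090) ))     # extra GPU draw vs. the current M2090s
saved=$(( N * SLOT ))              # slot power moved to the C6000 PSUs
load=$(( N * (K80 - SLOT) ))       # what the S1070 PSU still delivers (8-pin only)
echo "extra=$extra W saved=$saved W s1070_load=$load W"
[ "$load" -le "$PSU" ] && echo "within budget"
```

This confirms the +200 W / -300 W / 900 W figures, with the caveat that real draw under load can exceed nameplate TDP in short spikes.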


I think the only way to know is to try it and see whether it works.
I will let you know, but, for now, thanks again so much for all your help and online directions.

Kind regards, JJ
 

falanger

I'm glad I could help you, and I hope you won't run into problems. About the risks of two power supplies: powering from two sources creates two situations that are dangerous for your equipment.

1. Inside the Tesla K80, in the circuitry that forms the power phases for the processors and memory, current can begin to flow from one of your power sources (the card box) into the other (the C6220 chassis). A voltage difference of only 0.1 V can drive up to 10 A through such a low-resistance path, which can damage the device. Even in mining systems, where different cards are powered from two different supplies, a card is ALWAYS powered entirely from one source, slot and auxiliary connectors together; powering the slot from one PSU and the connectors from another is never done, and the only links between boards are the card control lines and the common chassis ground. When the riser is powered from one source and the card itself from another, the card may burn out.

2. In the same way, because the power systems inside the card are interconnected, current from the card can start flowing into your C6220 and damage board components if the box's supply voltage is even 0.1 V higher; 0.5-1 V higher is deadly. Or, vice versa, current from your C6220 may start flowing toward the box and disable it.

I hope you will be fine, but verify very precisely that the supply voltages of your C6220 nodes and the box match, and be prepared to shut the equipment down very quickly if you smell hot plastic. Shut it down by pulling the power cables; if you use the button, you may not be in time. Better yet, monitor the heating with a good thermal imager and power everything off immediately if anything exceeds 70 degrees Celsius. Never let it run hotter than 70 degrees Celsius: above that, the internal structure of processors and memory begins to degrade. I am telling you this as a specialist in computer repair.
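The 0.1 V / 10 A figure above is just Ohm's law across a low-resistance 12 V interconnect; a tiny sketch, where the 10 milliohm path resistance is an assumed illustrative value, not a measurement:

```shell
# Cross-current between two mismatched 12 V rails tied through a card (Ohm's law).
# dv = rail mismatch in volts; r = assumed path resistance in ohms (cables,
# connectors, PCB power planes). I = dv / r.
awk 'BEGIN { dv = 0.1; r = 0.010; printf "cross-current: %.0f A\n", dv / r }'
```

The point is that the path resistance is so small that even a tiny rail mismatch produces a damaging circulating current.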
Of course, I was trained on pirated copies of the DEC LSI-11, but the principles are the same now as then. Frankly, in your place I would use the boxes with Tesla K20X 6 GB or K40 12 GB cards: two K40s together give you 200-300 MHz more core clock and 200-300 more CUDA cores than a K80, since the K80's core count and clocks were reduced so it heats up less; two K40s are at least the equal of one K80. And for the K80s, I would buy the 3U chassis you mentioned together with the ASUS server board I named; that is how I assembled a system with 4 K80s, and you would end up with not one but several systems. Because of our fool Putin, who "is at war with the whole world" when he ought to be negotiating and trading, my income has fallen and I cannot afford to buy GPU cards and assemble a 4 x K80 system the way I am suggesting you do. If you could give me a pair of K40s, or one K80, I would be very grateful; even old K20m/K20X cards would help me. One of my K20X cards burned out (a mounting screw landed on the contacts of the power controller and I didn't notice), and no one here can repair them.
 

jverschoor

Hi Falanger,

I do thank you again for your insights, but I am sorry to say that we don't have any K20s or K40s, and we need the K80s ourselves.

In reply: it looks like the GPGPU riser cable (DP/N 0DJC89) is an isolated powered riser that would prevent these cross-current/ground-loop problems. (In the C8000 chassis, both the mobo and the riser + GPUs are powered by the same PSU backbone, yet all the risers are still isolated, i.e. the front risers as well as the rear GPGPU flex riser. I reckon Dell just wanted to err on the side of caution.)

Accordingly, we were thinking to power:

- the riser as well as the GPU from one PSU (the one in the S1070 chassis), and
- the Mobo from another PSU (the one in the C6000 chassis).

So, I do hope the problems you encountered were the result of your Thermaltake riser not being isolated; otherwise, we will indeed face the same problems you did.

Further, I hear what you say about a 3U system with 4 K80s, but due to software limitations we can only have 1 K80 per mobo. So what we are trying to do now is:

- put the 4 K80s in the Nvidia S1070 chassis
- route the GPGPU riser cables from the rear C6220 II PCIe slots directly into the S1070
- take the 2 HIC cards out of the S1070
- solder some cables to the S1070 mid-plane power connectors (these used to supply power to the HIC cards; we think 2 of them are 12 V and 2 are ground)
- attach those cables to a 4-pin connector that we can plug into the 0DJC89 riser power sockets.

I have no idea whether it will work, but I will let you know.

Kind regards, JJ