Render Farm Network Design sanity check


ycp

Member
Jun 22, 2014
234
17
18
Hey,

I am designing a network for our small temple render farm. Our temple produces many 3D animations, and we need to upgrade our network.

We have 60 render nodes, each with a gigabit network card.
Our server has 2 x 2TB Samsung 850 Pro SSDs in a RAID 0 configuration.
The server also has a 6 x 6TB Hitachi NAS drive RAID 10 array.

So we were thinking of using two teamed 10Gbps network connections from the server to our main switch.
We are planning to purchase a Dell X-Series X1052 switch, which has 4 x 10Gbps SFP+ ports.
We are also planning to purchase an Intel X710-DA2 NIC,
and then LACP the two NIC ports to two of the switch's SFP+ ports for a total of 20Gbps.
The render nodes would then be connected to the Dell switch.

We are a small non-profit organization, so budget is a major concern.
If anyone has any better ideas for our planned network, please help.
 

ycp

Member
Jun 22, 2014
234
17
18
Our server will be running Windows Server 2012 R2. Also, can the Mellanox cards be teamed?
 

ycp

Member
Jun 22, 2014
234
17
18
Our render nodes are just regular tower PCs with i7 CPUs inside, nothing fancy. At the beginning of a render, all the render nodes need to be fed data before they can start, so when 60 PCs try to copy around 30GB of data at once it strains the server.

In the beginning we had only a 1Gbps connection from the server; then we added 4 NICs, so with teaming we had a 4Gbps network connection, which really sped things up. With our current 4Gbps setup we can see that network utilization sits at 99% when a render job is submitted.

Now that we have 2 SSDs in RAID 0, we think the bottleneck is the network connection, so by going to 10Gbps we would get the data to the render nodes faster.

If our logic is wrong, please explain.
 

Jeggs101

Well-Known Member
Dec 29, 2010
1,529
241
63
125 MB/s (gigabit) x 60 clients = 7.5 GB/s
1,000 MB/s (10 gigabit) x 60 clients = 60.0 GB/s
550 MB/s (one SSD) x 2 SSDs = 1.1 GB/s

I used rounded figures there, and they may be a bit high, but on a good network you might hit that. I think you're right that a faster NIC would help, but if you truly have 60 simultaneous connections then the SSDs are going to be the bottleneck, unless you're serving from RAM, in which case the SSDs don't matter.
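
If you want to play with the numbers yourself, here is the same back-of-the-envelope arithmetic as a quick Python sketch. The per-device rates are rounded assumptions, and the 30GB-per-node job size is my reading of the earlier post, not a measurement:

```python
# Back-of-the-envelope bottleneck check. All rates are rounded assumptions:
# ~125 MB/s per gigabit client, ~1,250 MB/s per 10GbE port, ~550 MB/s per SATA SSD.
clients = 60
client_demand_mb_s = clients * 125      # 7,500 MB/s if every node pulls at line rate
server_uplink_mb_s = 2 * 1250           # planned 2 x 10GbE team, ~2,500 MB/s
ssd_array_mb_s = 2 * 550                # 2 x 850 Pro in RAID 0, ~1,100 MB/s sequential

bottleneck = min(client_demand_mb_s, server_uplink_mb_s, ssd_array_mb_s)
print(f"client demand {client_demand_mb_s/1000:.1f} GB/s, "
      f"server uplink {server_uplink_mb_s/1000:.1f} GB/s, "
      f"SSD array {ssd_array_mb_s/1000:.1f} GB/s")

# Time to feed a job, assuming each node pulls its own 30 GB copy.
total_mb = 30 * 1000 * clients
print(f"bottleneck ~{bottleneck/1000:.1f} GB/s, "
      f"~{total_mb/bottleneck/60:.0f} minutes to feed all {clients} nodes")
```

Whichever of the three numbers is smallest is where the job stalls; with these assumptions it is the SSD array, not the 20Gbps uplink.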

But I do think we need pictures! I love seeing render farms
 

RTM

Well-Known Member
Jan 26, 2014
956
359
63
As you are only considering using two of the 10Gbps ports on the switch, you could look at buying a cheaper switch than the Dell X1052, such as the Quanta LB4M; a few members here have bought those. Of course, if you do this, you should check beforehand whether you can enable LACP on it.
 
  • Like
Reactions: ycp

Chuckleb

Moderator
Mar 5, 2013
1,017
331
83
Minnesota
He is suggesting going with InfiniBand, which can do 40/56Gbps and very large jumbo frames (65k) in connected mode. The nodes can see each other over an IP-over-InfiniBand layer called IPoIB. There are a few posts about that throughout the forum. The cost per card can be in the $80 range for ConnectX-2 VPI cards, and the QDR switches are generally under $500 for 36 ports. You can connect the switches in whatever topology makes the cable lengths work, and you can also go fiber for more length.
 
  • Like
Reactions: ycp and Zankza

markpower28

Active Member
Apr 9, 2013
413
104
43
An InfiniBand solution will be very cost effective for your setup, and it will bring the best performance as well.

1. You are using 2012 R2, which supports SMB 3. SMB 3 is great, but it's only half of the story; the other half relies on RDMA. See the link below for the performance difference: http://www.servethehome.com/custom-firmware-mellanox-oem-infiniband-rdma-windows-server-2012/
2. Cost: it will be cheaper than anything else out there (eBay) if budget is a concern.
3. Setup: you can keep your existing 1Gb network and simply add the InfiniBand network alongside it. SMB 3 multichannel will use the RDMA-based InfiniBand storage network without any extra configuration.

I'd like to see how you push jobs across 60 nodes.
 
  • Like
Reactions: ycp

ycp

Member
Jun 22, 2014
234
17
18
Hey guys, thanks for the help, but InfiniBand technology is new to me. Is there a resource online where I can learn more about it?
I still have a few questions regarding this:
1. Do I have to connect all the render nodes using InfiniBand, or can I just connect the server and the InfiniBand switch together?
2. If just the server and switch are connected, how does the rest of my existing 1Gb Ethernet network connect to the InfiniBand network?

Also, regarding how we push jobs across the 60 nodes: we use a render manager called RenderPal. This is probably the most cost-effective render manager based on our research. Beyond that, I don't completely understand what you mean by how we push jobs across 60 nodes.
 

Chuckleb

Moderator
Mar 5, 2013
1,017
331
83
Minnesota
You'd want a whole separate network for the InfiniBand, with each machine directly connected to the switches, as well as the storage server. Bridging between the two fabrics is (a) potentially slower and (b) hard to do, especially when you are inexperienced. I would just run a whole separate fabric.
 

Patrick

Administrator
Staff member
Dec 21, 2010
12,514
5,805
113
With a render cluster that big, I might be inclined to agree with the IB idea.

Just to give you an idea, large storage and high-performance computing shops use IB. Since they can never get enough low-latency, high-bandwidth interconnect, they upgrade regularly and the older generations hit eBay very inexpensively.

As @Chuckleb mentioned: use gigabit Ethernet for connecting machines to the outside world (e.g. if you need to download patches from the Internet), and use IB as an internal-only network for the jobs.
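
As a purely illustrative sketch of the separate-fabric idea, here is one way the addressing could be laid out; the two subnets below are hypothetical placeholders, not anything from this thread:

```python
# Illustrative only: each machine gets one address on the existing gigabit LAN
# (Internet, patches, management) and one on the IPoIB fabric (job/storage traffic).
# Both subnets are made-up examples.
from ipaddress import IPv4Network

gige = list(IPv4Network("192.168.1.0/24").hosts())    # assumed existing 1GbE LAN
ipoib = list(IPv4Network("10.10.10.0/24").hosts())    # assumed separate IPoIB fabric

plan = {"server": (str(gige[0]), str(ipoib[0]))}
for n in range(1, 61):                                 # 60 render nodes
    plan[f"node{n:02d}"] = (str(gige[n]), str(ipoib[n]))

print(plan["server"], plan["node01"], plan["node60"], sep="\n")
```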
 

TechIsCool

Active Member
Feb 8, 2012
263
117
43
Clinton, WA
techiscool.com
Either store the 30GB distribution file in RAM (ECC memory) on the server and push it out from there, or use a peer-hosting solution, which would also work well since distribution speed increases every time another host gets the files.
 

abstractalgebra

Active Member
Dec 3, 2013
182
26
28
MA, USA
Agreed that 40Gb (QDR) / 56Gb (FDR) InfiniBand (IB) is a great, fast solution and rather cost effective, but how much better is it than 10GbE? I'm curious, and since I suspect this farm was put together some time ago, you might also have a lot of energy-saving options.
  • How many hours a week/month are you rendering, and what is the hardware like in the render nodes?
  • Have you measured the power draw so you know each render costs you $x in electricity (and in cooling for the room)?
  • How much time is spent transferring data to nodes vs. rendering?
This will let you put a dollar figure on the problem and then report and justify the upgrade based on ongoing cost, energy, and speed improvements (see the rough sketch below). You can use a cheap Kill A Watt device (~$20) to measure the power draw of one node, or here is a nice power strip:
Amazon.com: P3 P4330 Kill A Watt PS-10 Surge Protector: Electronics ~$76
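
Here is that per-render cost calculation as a minimal Python sketch; every number in it is a made-up placeholder you would replace with your own Kill A Watt readings and electricity rate:

```python
# Rough electricity cost per render job. All inputs are assumed placeholders:
# measure real draw with a Kill A Watt and use your local $/kWh rate.
nodes = 60
watts_per_node = 250       # assumed average draw per render node under load
render_hours = 6           # assumed wall-clock length of one job
price_per_kwh = 0.15       # assumed electricity price, $/kWh
cooling_factor = 1.3       # assumed 30% extra for cooling the room

kwh = nodes * watts_per_node * render_hours / 1000 * cooling_factor
print(f"~{kwh:.0f} kWh per job, roughly ${kwh * price_per_kwh:.2f} in electricity")
```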

For the server, moving to a PCIe SSD should be an easy upgrade for possibly 3-4x more performance.
Intel 750 PCIe SSD 400GB, which hits a sequential read of 2,200MB/s, $369 (AnandTech review)

What is the total cost of the complete solution for 60 nodes, and would it show much difference compared to 10GbE?

InfiniBand

Server:
Intel 750 PCIe SSD 400GB, sequential read of 2,200MB/s - $369 (AnandTech review)
2 x dual-port PCIe InfiniBand QDR HCAs - $80 each
(MHQH29B-XTR Mellanox InfiniBand 4X QDR QSFP ConnectX-2 VPI dual-port 40Gb/s HCA)

Nodes:
60 x dual-port PCIe InfiniBand QDR HCAs - $80 each (did not yet find a better deal on a single-port VPI card)
(same MHQH29B-XTR ConnectX-2 VPI card as above)

Switches:
2 x new factory-sealed Mellanox MIS5025Q-1SFC 36-port QDR InfiniBand switches (IS5025, 2.88Tb/s, 40Gb/s per port, QSFP, rails, QoS) - $699 each
https://forums.servethehome.com/index.php?threads/mellanox-8-port-40gb-qdr-ib-switches.6575/
New F/S Mellanox MIS5025Q-1SFC 36-port QDR InfiniBand switch 2.88Tb/s IS5025 | eBay

Cables - estimate $15 each

Really rough estimate: $7,700
You get 40Gb IB to each node.

--------------
10Gb Ethernet

How about putting dual 10GbE in the server and running it to a Mikrotik CRS226 to distribute out to your existing switches? What are your existing switches? I'm not sure whether it is worth doing three dual-port cards to three switches or getting a bigger switch (or switches). Another option is to go dual gigabit Ethernet to each node, but I'm not sure the cost is worth it.

Server:
Intel 750 PCIe SSD 400GB, sequential read of 2,200MB/s - $369
1 x dual-port PCIe 10GbE NIC - $99
(Chelsio T420-SO-CR dual-port 10GbE SFP+)
1 x Mikrotik CRS226 24-port gigabit Ethernet switch with 2 x SFP+ 10GbE uplinks - $250
4 x SFP+ cables to the switch - $15 each

Estimate: $800
You only fix the server bottleneck; each node is still on 1GbE.
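
For reference, here is roughly how the two totals above add up; the prices are the eBay/street figures quoted in this post and will obviously drift:

```python
# Rough totals for the two options above, using the quoted street prices.
nodes = 60

# InfiniBand: a dual-port QDR HCA per node plus two for the server,
# two 36-port QDR switches, one cable per HCA, and the PCIe SSD.
ib_total = 369 + (nodes + 2) * 80 + 2 * 699 + (nodes + 2) * 15

# 10GbE: only the server side changes; nodes stay on their gigabit NICs.
tengbe_total = 369 + 99 + 250 + 4 * 15

print(f"InfiniBand option: ~${ib_total:,}")     # ~$7,657 -> call it $7,700
print(f"10GbE option:      ~${tengbe_total:,}")  # ~$778   -> call it $800
```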
 

abstractalgebra

Active Member
Dec 3, 2013
182
26
28
MA, USA
Either store the 30GB distribution file in RAM (ECC memory) on the server and push it out from there, or use a peer-hosting solution, which would also work well since distribution speed increases every time another host gets the files.
Great idea, and it really kills the server disk I/O bottleneck. Now the question is what the server hardware is and how cheaply RAM can be added.
How does the peer-hosting solution work? Would you designate, say, nodes 10/20/30/40/50/60 to have dual gigabit Ethernet and also host the distribution file?
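
One way to picture the peer-hosting idea, as a toy model rather than any particular tool's behaviour: if every machine that already has the file can re-serve one full copy per round over its gigabit link, the number of sources doubles each round, so the whole farm is seeded in a handful of rounds.

```python
# Toy model of peer-hosted distribution. Assumes each machine that already has
# the 30 GB file can push one full copy per "round" over a gigabit link; real
# tools (BitTorrent-style seeding, chained copies) behave differently in detail.
nodes = 60
round_minutes = 30 * 1000 / 125 / 60     # 30 GB at ~125 MB/s is about 4 minutes

sources, rounds = 1, 0                   # the server starts as the only source
while sources < nodes + 1:               # until every node plus the server has it
    sources *= 2                         # each source seeds one new peer per round
    rounds += 1

print(f"~{rounds} rounds (~{rounds * round_minutes:.0f} min) to seed all {nodes} nodes,")
print(f"vs. ~{nodes * round_minutes:.0f} min for {nodes} serial copies from one gigabit port")
```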