Seeking suggestions for server room setup


jena

New Member
May 30, 2020
Hi all,

I am working for a research lab at a university, and we are starting to have more deep-learning and general computing needs.
I am seeking suggestions for finalizing the setup.
Basically, for this project I volunteered as an "amateur" IT consultant.
I marked my questions with the o_O emoji.

The need
1. A storage server
2. A few CPU + GPU servers for computation
3. 10G networking to regular Dell OptiPlex/Precision workstations

Plans
Two types of server: a storage server and a few Proxmox nodes.

1. o_O Storage server: Supermicro 540P-E1CTR45L, 4U, 45 drive bays
Single Intel Xeon Gold 6314U processor, 32-core 2.3GHz
8 x 32GB or 4 x 64GB (4 DIMMs would allow a future Optane upgrade)
TrueNAS (really hoping that by Oct 2021 the main TrueNAS SCALE functions will be stable)
Mirrored boot drives: D3-S4510 240GB x 2
1 vdev for now, 9 x 12TB drives in RAID-Z3; 4 vdevs in a future upgrade (rough capacity math in the sketch after the rack diagram below)
Single P5800X 400GB SLOG (too expensive to buy two)
No L2ARC (it seems that with enough RAM, L2ARC might not be that useful)
2. First Proxmox node (a more workstation-ish server)
Origin PC prebuilt
Threadripper 3970X
256GB RAM
RTX 2080 Ti x 1
Plan to upgrade the PSU to a Corsair 1600W and add an RTX 3090 as a second GPU
3. o_O Second Proxmox node (we'd like to get more professional gear; maybe a 2022 purchase)
2 x EPYC 32-core (high-frequency SKU, maybe 7543)
Candidates:
Supermicro 4124GS-TNR 4U (the product webpage is confusing; it lists 2+2 2000W PSUs, which I assume means 4000W total system power)
Gigabyte G292-Z44 2U
Maybe an ASUS or Tyan model

4. o_O Router
Not sure it is worth the effort to go with pfSense, since we will not be needing advanced features.
We will be under the university's network. I assume we just need a simple router (currently using a Ubiquiti EdgeRouter X) to provide DHCP for VMs.
If there are complications, we could also just make a LAN that is not connected to the outside,
then keep a separate link for necessary internet traffic (installing updates and packages).

5. o_O Switch: Mikrotik CRS326-24S+2Q+RM or UniFi Switch Pro Aggregation?
Basically we need 10G SFP+ ports, plus a few ports (maybe 25G or 40G) for uplink and the connection to the storage server.

6. The rack and power design
PDU-V: vertical 0U PDU, Tripp Lite PDUMV20HV-36 or APC AP6002A
SM: Supermicro
W#: wall receptacle

We will have a renovation of our space and add 208V 30A service.
It will be a 27U rack. We want to keep it low profile and don't want to go with a 42U full-size rack.
From our perspective, we don't want to push too big and become the center of attention.
o_O It seems to be much cheaper to buy two 2U 2200VA (10A max) or 3000VA (13A max) UPSes (UPS2 and UPS3) for the second Proxmox host (which consumes 3000-4000W) than one 4U 6000VA model. Looking at the APC SMX2200RMHV2U. (Amperage math is in the sketch below the diagram.)
o_O This design assumes the second Proxmox server will be the Supermicro 4124GS-TNR.
It has 2 + 2 2000W PSUs (1800W at 208V; PSU-A, B, C, D).
In my diagram, I assumed that PSU-A and PSU-B are one redundant pair and PSU-C and PSU-D are the other redundant pair.
In reality, they might have more intelligent load balancing.
UPS2 and UPS3 supply the second Proxmox server directly without a PDU (saving cost by omitting PDU-V2B and PDU-V3B).
My guess is that, no matter what, PSU-A (UPS2) and PSU-B (PDU-V2A) combined will not draw more than 10A;
the same goes for PSU-C (UPS3) and PSU-D (SM-D, PDU-V3A).
PDU-V2A has 14A left available; so does PDU-V3A.
UPS1 supplies the PDU (UPS1 208V 1A) on the left, which supplies the storage server (PSU-A), the router and the switch.
PDU-V1A (Group 2) supplies the storage server (PSU-B).
(Attachment: RackDesign_v02.PNG, rack power diagram)
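
Here is the back-of-the-envelope sanity check referenced in items 1 and 6 above. It is only a sketch under my own assumptions: RAID-Z3 giving up three drives of parity per vdev, and each redundant PSU pair of the 4124GS-TNR carrying roughly half of a ~3600W worst-case load. Real ZFS usable space will be somewhat lower after metadata, padding and reservations, and the real PSU load sharing depends on the chassis.

Code:
# Back-of-the-envelope checks for the storage and power plan above.
# Assumptions (mine, not vendor figures): RAID-Z3 uses 3 drives of parity
# per vdev, and each redundant PSU pair of the GPU server carries about
# half of a ~3600W worst-case system load.

TB = 1e12    # drive vendors use decimal terabytes
TIB = 2**40  # ZFS tools report binary tebibytes

def raidz3_usable_tib(drives_per_vdev: int, drive_tb: float, vdevs: int = 1) -> float:
    """Rough usable space before ZFS metadata/padding overhead (a few %)."""
    data_drives = drives_per_vdev - 3  # RAID-Z3 = 3 parity drives per vdev
    return vdevs * data_drives * drive_tb * TB / TIB

def amps(watts: float, volts: float = 208.0) -> float:
    """Current drawn at a given voltage, ignoring power factor."""
    return watts / volts

# Storage pool: 1 vdev of 9 x 12TB now, 4 vdevs later
print(f"1 vdev : ~{raidz3_usable_tib(9, 12):.0f} TiB usable (before overhead)")
print(f"4 vdevs: ~{raidz3_usable_tib(9, 12, vdevs=4):.0f} TiB usable")

# Storage server: ~1000W max fits easily on one 10A branch
print(f"Storage server at 1000W: {amps(1000):.1f} A at 208V")

# GPU server: if each redundant pair carries ~1800W, the pair draws ~8.7A.
# Split between UPS2 and PDU-V2A that is comfortable, but if one PSU of the
# pair fails, the survivor pulls the whole ~8.7A through a 10A (2200VA) UPS,
# which leaves very little headroom.
pair_watts = 3600 / 2
print(f"One PSU pair at {pair_watts:.0f}W: {amps(pair_watts):.1f} A at 208V")

If those assumptions hold, the two-small-UPS idea works, but the single-PSU-failure case sits close to the 10A limit of a 2200VA unit.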
 

jena

New Member
May 30, 2020
Reserved for future updates.

The server room:
A closed 20 sq ft room with a heavy-duty door, its own thermostat, and supply and return air ducts.
No one works in it for extended periods.

The building:
12 floors, 5 elevators, new HVAC, renovated 3 years ago.
No generator, but building power is very stable and I don't recall any outages in the past 3 years.
It has three-phase service (to run the elevators).
We lean towards getting single-phase 208V from two of the three 120V legs.

The department higher-ups have no objections to a mini server room (27U rack) so far.
From our perspective, we don't want to push too big and become the center of attention.
 
Jul 14, 2017
pfSense isn't difficult or complicated to set up. If you've got a spare computer that isn't a complete antique, it's a good way to put an old machine to use.

It's a firewall, though, and not intended to be used as a router. Obviously it can serve as one to an extent, but particularly for any high-bandwidth application you are better off with a hardware router. There are some other software options I've seen mentioned if you do want to go that way, though.

As to the switch, you might also want to look at used professional gear on eBay. From your language, I'm guessing you're not a native English speaker, unless you volunteered to be part of a dynamo (you wanted "amateur"), so if you aren't in the US that might not be as viable an option.

One consideration should be: where is this all going to be located? Gear like this tends to be very loud. You don't usually want it anywhere you are going to be spending any significant amount of time.

You might want to elaborate on exactly what it is you need out of the storage server and other hardware. Obviously you've put a significant amount of time and thought into what to get and the hardware all seems very capable. So really the questions you are asking should be "what do I need to do X" or "will this do X".

The 4124GS-TNR Proxmox server has 4 power supplies, arranged as two redundant pairs (2+2), so up to two PSUs can fail and the system will still run.
 

Blinky 42

Active Member
Aug 6, 2015
PA, USA
Do you have any budget in mind?
What expansion plans do you have for the equipment in the next year or so, or more accurately what do you want to be prepared for?
How much runtime do you need the UPSs to support? Is there any other power backup available (like a generator), and do you just need 5 minutes to hold until the generator spins up and/or enough time for a "clean" shutdown of the processing on your servers?
If you have 208V service (instead of 230/240), then you are pulling power from a 3-phase feed somewhere. If you don't have 3-phase to work with, just do single-phase 230/240 (depending on where you are located). All of the major UPS brands you will want to look at can be configured to support any of these voltages.

In general I would connect the UPSs directly to the outlets for the power feeds into the room and not use extra PDUs before the UPS. They would either be hard-wired, or have L6-30 plugs and be set up for L6-30R outlets with a dedicated breaker for each of them.
I would also go with just A+B feeds and bigger UPSs (5-6kVA). Why build in 3x 6kVA of power with the three L6-30 outlets if your UPSs are only set up for 3x 2kVA of capacity? If you are going to be using a lot of power eventually, small 2kVA units are a waste unless they are already bought, because they will need to be swapped out or you will have to add even more little UPS units over time. If you do have 3-phase power, look into a proper 3-phase UPS and get its outputs configured to support what you want in the rack.
The runtime part is important - if you need the UPSs to actually cover power outages by themselves, you will want to look at adding extra batteries to get longer runtime, and that consumes more space.

27U is really small .... do you need to move the rack around the room? That isn't very safe with several hundred lbs of equipment in the rack plus the weight of the rack itself. If it tips over it can crush/kill someone really easily.

If access to the equipment is needed, proper chassis have slide-mount rails and you can pull the equipment fully out of the rack to work on it, if you connect the cables to allow for it or disconnect them before sliding the servers out. But having a 150lb weight cantilevered 1m out of the front of the rack will cause the rack to tip over, which is why racks are usually bolted down or have other measures in place to prevent movement.

Also consider whether you are building up a dedicated server room for the equipment vs. just adding a rack to the corner of a larger room:
- Cooling in the new space
- A used room-sized UPS from a real vendor vs. individual rack-mount units. It can often be expanded with additional capacity and runtime when you need it, and it doesn't suck up rack space. You can also get a 3rd-party vendor to support and maintain the unit over time.
- Fire suppression?
 

jena

New Member
May 30, 2020
pfSense isn't difficult or complicated to set up. If you've got a spare computer that isn't a complete antique, it's a good way to put an old machine to use.

It's a firewall, though, and not intended to be used as a router. Obviously it can serve as one to an extent, but particularly for any high-bandwidth application you are better off with a hardware router. There are some other software options I've seen mentioned if you do want to go that way, though.

As to the switch, you might also want to look at used professional gear on eBay. From your language, I'm guessing you're not a native English speaker, unless you volunteered to be part of a dynamo (you wanted "amateur"), so if you aren't in the US that might not be as viable an option.

One consideration should be: where is this all going to be located? Gear like this tends to be very loud. You don't usually want it anywhere you are going to be spending any significant amount of time.

You might want to elaborate on exactly what it is you need out of the storage server and other hardware. Obviously you've put a significant amount of time and thought into what to get and the hardware all seems very capable. So really the questions you are asking should be "what do I need to do X" or "will this do X".

The 4124GS-TNR Proxmox server has 4 power supplies, arranged as two redundant pairs (2+2), so up to two PSUs can fail and the system will still run.
In general, I am seeking suggestions and validation of my design to avoid any major design flaws.

From your comment, I guess we don't need a firewall like pfSense, since the firewall is already provided by the university.

To allow VMs to get IP addresses without affecting the existing university network, a workaround is to have a sub-network with DHCP (either a completely isolated LAN or one attached under the university network). I would need to work with university IT to come up with a solution and get their approval.
It sounds like a fast hardware router like an EdgeRouter (maybe a 10G-uplink model) should be sufficient.

I am in the US. I used the word "amateur" mainly because we don't want to spend money on the huge upcharges and overhead of having university IT or a university-hired third-party consultant build all of this, and also get locked in to Dell stuff.
We have some funds to cover the hardware: the storage server and the 2nd Proxmox server. We cannot buy used gear from eBay.
We have a room with its own HVAC supply and return air, which should be sufficient for just a couple of servers.

The idea is not to have a fully built "mini" data center.
The needs are convenient and rapid prototyping of project ideas and daily research computation.
Instead of buying several "so-so" ("rip-off") Dell-Intel prebuilt workstations and only running Windows or Linux one at a time, we thought virtualization was the way to go.
The problems with several separate Dell workstations are:
  • Windows-Ubuntu dual boot often has various problems, due to Secure Boot, Windows BitLocker and other assorted quirks.
  • Windows-based VM solutions like VirtualBox cannot pass a GPU through.
  • After setting up dependencies on one Linux machine, it is time consuming to clone them to other machines. It is also hard to switch between different versions.
The first Threadripper-based Proxmox "server" has already shown its value in our workflow.

The need for a storage server is to store all the data that the Proxmox servers use (with a daily backup script to university-provided storage).
We have a huge amount of data, currently about 10TB and increasing at a rate of 0.5TB/month. It is time consuming to upload all that data to university storage after experiments at gigabit Ethernet speed.
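
For a sense of scale, here is the rough arithmetic behind that complaint. It assumes about 70% of wire speed in practice (protocol overhead, disk limits); actual throughput will vary.

Code:
# Rough transfer-time estimate for moving our data over the network.
# Assumes ~70% of wire speed in practice; actual throughput varies.

def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    bits = data_tb * 1e12 * 8  # decimal TB -> bits
    return bits / (link_gbps * 1e9 * efficiency) / 3600

for link in (1, 10):
    print(f"10 TB over {link:>2} GbE: ~{transfer_hours(10, link):.1f} h")
    print(f"0.5 TB (one month's growth) over {link:>2} GbE: ~{transfer_hours(0.5, link):.1f} h")

Roughly 32 hours versus 3 hours for the full data set is the main motivation for the 10G network.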
 

klui

༺༻
Feb 3, 2019
This is a great project either at work if you're paid or at home for self-learning.

Realize that if you create the "solution," you will be on the hook for resolving any issues that arise. The part about "don't want to spend money on huge upcharges and overhead to let University IT or University-hired third party..." means you will definitely be called if things go wrong. If you're a student, make sure you set limits on what kind of time you can afford so it doesn't intrude on your studies.

Getting a system set up is only part of the solution. What happens if components fail and you need to recover and resume business continuity ASAP? Disaster recovery is more critical when people other than you/your immediate household are affected and their studies/projects are impacted.
 

jena

New Member
May 30, 2020
Thank you very much for the detailed suggestions.
Here are my comments and rationale.

In my drawing, all wall receptacles are on dedicated circuits, each with its own breaker.
The things that I need to decide and tell the contractor early next week are:
  • How many wall receptacles (and at what voltage), each with its own dedicated breaker?
  • What style of wall receptacle?
My plan is three 208V 30A L6-30R and one 120V 30A L5-30R.
Is my plan OK? (I have a detailed reply below.)

There is already another regular 120V 20A receptacle in that room.

Do you have any budget in mind?
What expansion plans do you have for the equipment in the next year or so, or more accurately what do you want to be prepared for?
Not exactly clear yet. We should have at least $25k to cover the initial stage of the storage server (9 drives) and maybe the 2nd Proxmox server.

I think we will not exceed these in the next 5 years:
Three Proxmox servers (each with 2-4 RTX 3090 level GPUs). We already have one.
One storage server (which should have a service life of 10 years).

If we ever outgrow this current plan, we will most likely have enough funds to move to a newer space with a new renovation. At that time, we will hire professionals to plan the server room.

How much runtime do you need the UPSs to support? Is there any other power backup available (like a generator), and do you just need 5 minutes to hold until the generator spins up and/or enough time for a "clean" shutdown of the processing on your servers?
No generator, but our building power is very stable and I don't recall any outages in the past 3 years.
For the storage server, the UPS just needs enough runtime to do a clean shutdown initiated by the standard UPS-USB link.
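
For illustration only, the kind of shutdown hook I have in mind could be built on NUT (Network UPS Tools), which TrueNAS also supports. The sketch below polls the UPS status over the USB link and powers the host off after it has been on battery too long; the UPS name "apc" and the thresholds are hypothetical, and in practice NUT's own upsmon does this job better than a hand-rolled script.

Code:
# Illustration only: poll a NUT-managed UPS over the USB link and shut the
# host down cleanly when it has been on battery for too long.
# Assumes NUT is installed and the UPS is defined as "apc" in ups.conf
# (hypothetical name); NUT's upsmon normally handles this itself.
import subprocess
import time

UPS = "apc@localhost"   # hypothetical NUT UPS name
GRACE_SECONDS = 300     # shut down after 5 minutes on battery

def ups_status() -> str:
    # `upsc <ups> ups.status` prints e.g. "OL" (online) or "OB" (on battery)
    out = subprocess.run(["upsc", UPS, "ups.status"],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def main() -> None:
    on_battery_since = None
    while True:
        status = ups_status()
        if "OB" in status or "LB" in status:
            on_battery_since = on_battery_since or time.time()
            if "LB" in status or time.time() - on_battery_since > GRACE_SECONDS:
                subprocess.run(["shutdown", "-h", "now"], check=False)
                return
        else:
            on_battery_since = None
        time.sleep(10)

if __name__ == "__main__":
    main()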

For the Proxmox computation server, a UPS isn't even strictly needed, because having the first Proxmox server is already a huge improvement for us.
If we were still using our own Dell OptiPlexes (with no UPS; UPSes are not standard equipment from university IT), we would have lost the current working data anyway.
So I consider runtime less important for the Proxmox servers.
Our computations so far last a day or so at most.

If you have 208V service (instead of 230/240), then you are pulling power from a 3-phase feed somewhere. If you don't have 3-phase to work with, just do single-phase 230/240 (depending on where you are located). All of the major UPS brands you will want to look at can be configured to support any of these voltages.
We are in a commercially wired building. The renovation contractor told me that they can provide 208V. I think it is three-phase 120V, which can supply single-phase 208V (using two of the three 120V legs at 120 degrees).
I want to stay with 208V single phase for simplicity, because I am afraid that three phase will cause a lot more buzz and might not get approved.

In general I would connect the UPSs directly to the outlets for the power feeds into the room and not use extra PDUs before the UPS. They would either be hard-wired, or have L6-30 plugs and be set up for L6-30R outlets with a dedicated breaker for each of them.
I would also go with just A+B feeds and bigger UPSs (5-6kVA). Why build in 3x 6kVA of power with the three L6-30 outlets if your UPSs are only set up for 3x 2kVA of capacity? If you are going to be using a lot of power eventually, small 2kVA units are a waste unless they are already bought, because they will need to be swapped out or you will have to add even more little UPS units over time. If you do have 3-phase power, look into a proper 3-phase UPS and get its outputs configured to support what you want in the rack.
The runtime part is important - if you need the UPSs to actually cover power outages by themselves, you will want to look at adding extra batteries to get longer runtime, and that consumes more space.
My original thinking was:
We will purchase the storage server first for sure. It is only one CPU plus a bunch of disks, which will only draw around 1000W max (6A at 208V) even when all 45 bays are populated.

There is currently no definitive plan or specific model for the second Proxmox server (we will add it as the need arises). We might go with a Threadripper Pro server from Puget Systems for using RTX 3090s, because a 4U rack chassis is really not ideal for open-air RTX cards. It is also common in computer science departments to use RTX cards for deep learning at much lower cost.

If we can only afford one 2200 or 3000VA (10-13A) UPS at a time, connecting such a small UPS directly to a 30A wall receptacle does not fully utilize the amp capacity. The renovation project will charge per wall receptacle. More receptacles also need more breakers in the panel (which might cause some issues, like department approval).
I agree that a big 6kVA (26A) unit might be better, but it is just so expensive at $6000 or more and also takes 4U or more. That is why I thought of using two 3kVA models at $2000 each to supply the Supermicro GPU server (if we purchase one in 2022; we might go with a Threadripper-based Puget Systems build instead, not sure yet).


27U is really small .... do you need to move the rack around the room? That isn't very safe with several hundred lbs of equipment in the rack plus the weight of the rack itself. If it tips over it can crush/kill someone really easily.
If access to the equipment is needed, proper chassis have slide-mount rails and you can pull the equipment fully out of the rack to work on it, if you connect the cables to allow for it or disconnect them before sliding the servers out. But having a 150lb weight cantilevered 1m out of the front of the rack will cause the rack to tip over, which is why racks are usually bolted down or have other measures in place to prevent movement.
Very good point!
I will contact the contractor to see how they could come up with an anti-tipping solution. We would definitely bolt it down if the building allows, or use a side-mounted anti-tipping bracket (any suggestions)?
The side-mounted bracket would require less construction.

We don't want to go with a full 42U rack and draw too much attention, since people might object. Currently we are approved for a 27U server rack (from an initial plan of an 18U rack, we already pushed a bit more).

If we outgrow 27U, I can move the network gear to a small, lightweight 12U rack. If we outgrow that too, we will need a new space with a dedicated design for cooling, power, etc.

Also consider whether you are building up a dedicated server room for the equipment vs. just adding a rack to the corner of a larger room:
- Cooling in the new space
- A used room-sized UPS from a real vendor vs. individual rack-mount units. It can often be expanded with additional capacity and runtime when you need it, and it doesn't suck up rack space. You can also get a 3rd-party vendor to support and maintain the unit over time.
- Fire suppression?
"just adding a rack to the corner of a larger room"
Yes. That is what we are doing.
It is a 20 sq ft room, with its own thermostat, supply and return air duct. I might install a curtain to further direct air flow.

Fire suppression?
Standard fire suppression the building has.
We might get a couple IOT-enabled fire alarm sensor.

I guess that our server itself should be reliable and not a fire hazard?
I think offsite backup is the way to minimize this risk. Because, if God forbids, all of our other computers will be gone.
 

jena

New Member
May 30, 2020
This is a great project either at work if you're paid or at home for self-learning.

Realize that if you create the "solution," you will be on the hook for resolving any issues that arise. The part about "don't want to spend money on huge upcharges and overhead to let University IT or University-hired third party..." means you will definitely be called if things go wrong. If you're a student, make sure you set limits on what kind of time you can afford so it doesn't intrude on your studies.

Getting a system set up is only part of the solution. What happens if components fail and you need to recover and resume business continuity ASAP? Disaster recovery is more critical when people other than you/your immediate household are affected and their studies/projects are impacted.
I agree 100%.

I am off the hook in a year as I will graduate next year. We have a new long-term PhD student who is seasoned in IT.
Our current headache is that when using the university's cluster, the wait time is long, data space is limited, we can easily wait a month just to get a package installed, diagnosing a dependency problem can drag on even longer over email, and there is less freedom in terms of working on prototype projects. We have one specialty server hosted at the university data center, but the IT support speed is meh and we get charged a few hundred dollars per month. Basically, the actual real-life downtime would be even higher if we let university IT manage it.

As a backup solution in case the storage server is down, I plan to have an rsync script that syncs critical working data hourly to a couple of USB external hard drives, which can be unplugged and plugged into a regular PC.
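
Roughly what I have in mind for that script is sketched below. The paths are placeholders (the dataset directory and the USB mount point are made up), and it just wraps plain rsync so the drive holds a browsable copy that any regular PC can read; it would run hourly from cron or a systemd timer.

Code:
# Sketch of the hourly sync job (hypothetical paths). Wraps plain rsync so
# the USB drive holds a normal, browsable copy of the working data.
import subprocess
import sys
from pathlib import Path

SOURCE = "/tank/working-data/"                 # placeholder: critical working data
DEST = Path("/mnt/usb-backup/working-data")    # placeholder: USB drive mount

def main() -> int:
    # Skip the run rather than fill the root filesystem if the drive is absent.
    if not DEST.parent.is_mount():
        print("USB backup drive is not mounted; skipping this run.")
        return 1
    # -a keeps permissions/times, --delete mirrors deletions, --partial resumes
    result = subprocess.run(["rsync", "-a", "--delete", "--partial", SOURCE, str(DEST)])
    return result.returncode

if __name__ == "__main__":
    sys.exit(main())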

If my professor sees the need, he will figure out a more professional solution. My suggestion for that would be to rent AWS or Linode, or something similar. At this point, we are OK with some server downtime.

I learned from or solved most of the quirks while building the first Proxmox server, and hopefully the second one will just be a matter of installing and copying over the VMs.
;)
 

newabc

Well-Known Member
Jan 20, 2019
For 4.
(1) Do you need 10Gbps from the firewall/router to the LAN?
(2) Do you need IDS/IPS on the firewall/router? VPN?

If you stick with Ubiquiti routers/firewalls, I think the Dream Machine Pro or the future UXG-Pro can do over 500Mbps up and down (both ways at the same time) with its level 5 IDS ruleset. Their only issue is that the CPU is only a 1.7GHz 4-core ARM. But the upside is that users don't need to pay extra for updates to Ubiquiti's IDS/IPS ruleset.

If the budget for the router/firewall hardware is bigger, Netgate's 5100 or the coming 6100, or even the 7100, should be good. But pfSense+ is a paid service after the 1st year, or you can go back to the free community edition. Both the 6100 and 7100 have 2 SFP+ interfaces.

I appreciate the GUI of Suricata IDS/IPS on pfSense.

Personally, if I can use 2nd-hand hardware in my homelab, I will use a Wyse 5070 Extended or an HP T730 with a dual-port SFP+ NIC. Both CPUs score around 3000 in PassMark. The HP T740 has much more CPU power. If I needed 10Gbps IDS/IPS or gigabit VPN, I would go with the HP T740.
 

Blinky 42

Active Member
Aug 6, 2015
PA, USA
Thanks, the extra details help fill in the picture.

I totally agree that you need to be careful if you are not staff there and are getting roped into deciding how to spend the money on the project. You do want to get buy-in from the long-term staff who will be responsible for keeping what is built up and running after you leave. You don't want to be known as the person who is blamed for "___", especially in a university scenario where your circle of contacts with those people, or people who know those people, will continue for decades after you have left. The kicker is that "___" is totally unknown until after you leave ;)

What types of experiments are you working on? ML/AI? Graphics-related things? I mention it because in a university setting I expect the projects people work on will cycle over time as the students rotate through, and each research project could vary a lot in its hardware needs. Unless the group is explicitly focused on one type of research that sort of locks you in on toolchains and things for 5+ years, you may want to stick with spending $10-20k at a time on awesome workstations that help you get that year's students going, vs. dropping a lot of $ on a setup that is fun today but meh for the next students and totally ignored 4 years out. You may want to consider buying hardware that you intend to sell off or move around to other departments to fund the latest and greatest for your team before the value of the item drops to zero.

You have 3 different types of systems: the fileserver, the virtualization hosts, and a GPU server. While they will all probably last 10 years, depending on what you are doing with them, whether they provide the same bang for the buck 3 or 5 years down the line is not guaranteed. Would a 3- or 5-year-old GPU cluster help you with what the students are doing today? You can always find something to put on a VM server, so that isn't really a problem. Storage is another matter - if you are doing ML-type things and you want speed, then slow SATA/SAS 3.5" storage probably isn't worth spending new money on (especially now with Chia f'ing up the market). Unless it is mostly static storage of data, you may be underwhelmed without at least SSDs and 25Gb networking in the mix. If you are looking at only ~100TB, can you get a used 24-bay server from another department that wants to upgrade, swap out the drives, and spend more of your budget on compute/GPU/network?

You probably have already considered this, but if what you need is dedicated storage and CPU time, can you buy the hardware and pay to put it in the IT department's existing data center? Adding 10U of hardware to a room that is already set up for it is way easier than building out everything from scratch, and you get to put more $ into the things you really want to achieve vs. the support crap like PDUs, racks, UPSs, HVAC and sq ft.


Some thoughts to ponder, having had to fix problems after things were done the wrong way more than once, having built out a few small datacenters for offices over the years, and having been in lots of "colo" environments of every possible interpretation of the word.

- You don't have a lot of equipment to start, so you will probably be OK at this level, but expanding to more GPU servers or more VM servers gets you closer to problems. Designing to support that responsibly long term sounds like more than your current budget will allow, so you may need to consider being ruthless: just meet today's core need and consider the whole thing throw-away after the 5-year life of the servers.
- You don't want to work anywhere near a rack that is drawing 6kVA of power on a regular basis, let alone the full power your 3x 30A lines could supply. While the solution might be technically great, if you all go insane because the jobs take 36h to run and you have to work next to the racks, then the project missed the mark big-time.
- A 20 sq. ft room is basically an office-sized space - standard building A/C probably isn't going to cut it unless the building A/C has an extra 2 tons of cooling capacity and can run 24x7x365 (or at least whenever the rack equipment is running, if that isn't 24x7). Normal building A/C also isn't designed to run all winter long if you are in a cold area. Also, cranking up the A/C for a portion of the building to cool one room will probably make lots of people mad with how cold the other rooms have to be to keep the pressure balanced. If you add more hardware you will probably at least need some sort of mini-split setup for that room alone to augment the building cooling.
- Fire suppression - if it is just building fire suppression with water, plan on losing it all even if someone just pulls a prank (it is a school after all - higher % than an office building). Doing it right requires a sealed room and expensive systems that I suspect are out of your budget.

We don't want to go with a full 42U rack and draw too much attention, since people might object. Currently we are approved for a 27U server rack (from an initial plan of an 18U rack, we already pushed a bit more).
This is concerning on a few levels - if the higher-ups are worried about a full-height rack in the corner, then the noise and added cooling load are probably going to be a huge problem. This is where buy-in from the departments that handle HVAC and power, from the main IT department, and sign-off from the long-term f/t staff there will help CYA. Do you need to budget for the power usage of your setup plus cooling as well?

Regarding noise and workplace comfort - as a point of reference, one facility I work in has ~40A @ 240V drawn across 2 racks of servers and switches 24x7 in a room that is ~12x18 ft. When I have to work in there for any length of time I wear a noise-cancelling headset and/or full over-the-ear hearing protection (like when I run a chainsaw) so I don't go deaf after a few hours. You also need to semi-yell to talk to people and, depending on where you are, can't talk on the phone because it is too loud. Plus, working on the hot side of the racks feels like it strips all the moisture out of your body after a surprisingly short amount of time. Your initial build-out diagrammed above isn't going to be that bad, thankfully, but if you do add more GPU servers into the mix it will be, so make sure that everyone has the right expectations in place. Running 6kW of servers at full throttle 24x7 is a different game than running a workstation with 2 or 3 GPUs for a weekend.

Depending on what the uni is charging you to build out the outlets, just go with the basics of what you need. You don't need 120V for anything, and if you have a laptop or something temporary, just use the existing wall power.
If the electricians want you to have a balanced load across the 3 phases, then defer to their power people so they end up with something they are happy with. If they are picky about it, there will be fewer problems down the road if you stay out of that part. Don't get the UPS new, but don't get it off eBay either - contact a local company who can install and service a 10kVA+ UPS and have them find a used one they will install and support for you, working with the building staff and electricians to get it going and meet your needs. These are expensive pieces of equipment that need routine maintenance and $ yearly to keep them running. Companies go out of business or swap out equipment all the time, and since they are long-term equipment you can get them serviced professionally and keep them running for decades no problem.
Even if you do buy 2kVA units on your own, you should budget to maintain them over the time frame the whole system lasts. You will probably need only a few replacement hard drives, but you will need a fresh battery set on a regular basis if you want the UPSs to do their job. See if you can get the IT department to supply and maintain them for you, if nothing else so it is one less problem your team has to handle.
 

Sean Ho

seanho.com
Nov 19, 2019
Vancouver, BC
I have witnessed the progression of a similar "shadow IT" project in a research lab within a large CS dept. CS often has specific hw/sw needs that University IT is not suited for; there's nothing wrong with building up your department's own computing infrastructure -- but it's much more than hardware.

To underscore what has been said above, staffing is more important than equipment in the long run. From the perspective of a student, you're in and out in five years or less. From the perspective of the professors and the research program, something sustainable for a much longer term is needed.

What often happens is there are leftover grant funds that need to be spent, so a bunch of fancy hardware is purchased and installed, and things are shiny for a time -- the servers are fast, work gets done, life is good. The group gets accustomed to having its own cluster on-demand, and research workflows adapt. You graduate and leave, taking institutional memory with you. In time (often much less than 5 years), hw/sw failures happen, or the cluster just can't keep up with increasing demands, and some poor 1st-year student gets saddled with fixing a system that they don't understand and didn't design, and they need to do it immediately because an analysis needs to be re-run for a paper submission deadline in 12 hours. The department can't just fall back to the way things were before with University IT, because they've already structured their workflows around their own cluster.

The default repository of institutional memory is the tenured professors. They are, however, incredibly busy getting grants, and do not have the time or interest to spend all-nighters in the lab teaching said 1st-year student about admining the cluster. In the long term, what your department needs is multiple, staggered, 5-year grants that include budget for infrastructure, sufficient to fund a FTE staff position to support the research computing infrastructure. That staffer would then be the repository of institutional memory, managing the system, handling failures/upgrades, coordinating with University IT, documenting, etc. -- all the usual sysadmin tasks. It's a ton of work, and it's one of the reasons why managed IT costs big bucks.
 

jena

New Member
May 30, 2020
For 4.
(1) Do you need 10Gbps from the firewall/router to the LAN?
(2) Do you need IDS/IPS on the firewall/router? VPN?
(1) I wish I could, but the switch that serves the whole floor is gigabit.
(2) I assume that intrusion detection is already provided by the university firewall.
I can maybe set up MAC and IP filtering at the router to only allow our lab computers to access the lab LAN.
The university has a VPN. I can connect to it and then remote into our lab LAN.


If you stick with Ubiquiti routers/firewalls, I think the Dream Machine Pro or the future UXG-Pro can do over 500Mbps up and down (both ways at the same time) with its level 5 IDS ruleset. Their only issue is that the CPU is only a 1.7GHz 4-core ARM. But the upside is that users don't need to pay extra for updates to Ubiquiti's IDS/IPS ruleset.
If the budget for the router/firewall hardware is bigger, Netgate's 5100 or the coming 6100, or even the 7100, should be good. But pfSense+ is a paid service after the 1st year, or you can go back to the free community edition. Both the 6100 and 7100 have 2 SFP+ interfaces.
I appreciate the GUI of Suricata IDS/IPS on pfSense.
Personally, if I can use 2nd-hand hardware in my homelab, I will use a Wyse 5070 Extended or an HP T730 with a dual-port SFP+ NIC. Both CPUs score around 3000 in PassMark. The HP T740 has much more CPU power. If I needed 10Gbps IDS/IPS or gigabit VPN, I would go with the HP T740.
Got it.
I have watched Lawrence Systems' videos, and it seems the UDM Pro's IDS is good on paper but pfSense is better.
The appealing part of UniFi to me is that if we need VLANs, it is much easier to propagate the VLAN settings to other UniFi devices.

Thanks for your suggestion.
 

jena

New Member
May 30, 2020
Thanks, the extra details help fill in the picture.
I totally agree that you need to be careful if you are not staff there and are getting roped into deciding how to spend the money on the project. You do want to get buy-in from the long-term staff who will be responsible for keeping what is built up and running after you leave. You don't want to be known as the person who is blamed for "___", especially in a university scenario where your circle of contacts with those people, or people who know those people, will continue for decades after you have left. The kicker is that "___" is totally unknown until after you leave ;)
Agree 100%!
I will explicitly tell my professor that he needs a long-term person to maintain it.
I will also tell the lab members to have realistic expectations.
The things I am building are a supplement and a convenience for lab members to increase their work efficiency.
They should keep a way to switch back to their own workstations for computation.

The hardware purchase will not start until the new students arrive in the fall to discuss their needs.
The new PhD student I mentioned as seasoned in IT and programming did a short internship with us, spun off a few project ideas at that time (NVIDIA Jetson, etc.), and only then found out that they were not feasible under the existing university network. For example, we couldn't even add a switch to extend a few ports when there is only one port per cubicle. Installing a new port costs $300-500 and takes a few months.

What types of experiments are you working on? ML/AI? Graphics-related things? I mention it because in a university setting I expect the projects people work on will cycle over time as the students rotate through, and each research project could vary a lot in its hardware needs. Unless the group is explicitly focused on one type of research that sort of locks you in on toolchains and things for 5+ years, you may want to stick with spending $10-20k at a time on awesome workstations that help you get that year's students going, vs. dropping a lot of $ on a setup that is fun today but meh for the next students and totally ignored 4 years out. You may want to consider buying hardware that you intend to sell off or move around to other departments to fund the latest and greatest for your team before the value of the item drops to zero.

You have 3 different types of systems: the fileserver, the virtualization hosts, and a GPU server. While they will all probably last 10 years, depending on what you are doing with them, whether they provide the same bang for the buck 3 or 5 years down the line is not guaranteed. Would a 3- or 5-year-old GPU cluster help you with what the students are doing today? You can always find something to put on a VM server, so that isn't really a problem. Storage is another matter - if you are doing ML-type things and you want speed, then slow SATA/SAS 3.5" storage probably isn't worth spending new money on (especially now with Chia f'ing up the market). Unless it is mostly static storage of data, you may be underwhelmed without at least SSDs and 25Gb networking in the mix. If you are looking at only ~100TB, can you get a used 24-bay server from another department that wants to upgrade, swap out the drives, and spend more of your budget on compute/GPU/network?

You probably have already considered this, but if what you need is dedicated storage and CPU time, can you buy the hardware and pay to put it in the IT department's existing data center? Adding 10U of hardware to a room that is already set up for it is way easier than building out everything from scratch, and you get to put more $ into the things you really want to achieve vs. the support crap like PDUs, racks, UPSs, HVAC and sq ft.
Currently most of our data is on the university's network drive (offsite, two locations) and can be accessed at 100MB/s.
Our data sets are mainly large image data (5GB each) and experimental recordings from instrumentation (1-5GB each).
We cannot afford all-flash bulk storage at this time.
What I can do is allocate a pair of enterprise NVMe SSDs at each Proxmox node as fast scratch storage for computation.

I actually only have two types of servers - sorry for the confusion. I don't think we need a dedicated GPU cluster yet. If we do, I will suggest that lab members and the professor use the university cluster or a commercial one.
  • a bulk storage server, which is much faster to access than the offsite university network drive
  • Proxmox nodes, for general computation needs and prototyping code
On the first Proxmox node (Threadripper), we already used GPU passthrough for one student to do CUDA-based simulation.

Our computation needs can be very diverse.
CPU-based: heavy reliance on MATLAB (some of it can be parallelized), plus various GUI-based image pre-processing tools in Windows and Linux.
GPU-based: CUDA-based simulation; the deep learning projects are actually still in their infancy.

I will ask around to see if we can get surplus server racks, etc.

The main problem is that we only get 1 gigabit Ethernet.
We still need to do some data processing on our office workstations. A few lab members ran out of disk space on those; I helped add 6TB HDDs, but ultimately there are only so many HDD bays. They ended up uploading less-used data to the university network drive and downloading it when needed.
Those office workstations also don't have RAID redundancy. If an HDD dies, it relies on each individual having backed up to the university network drive. Uploading and downloading 10-50GB of data at a time is a pain over gigabit Ethernet.

I would love to host these in the data center if the Ethernet infrastructure were better.
For that, we would be willing to pay them to upgrade our Ethernet ports to 10G, but it doesn't seem to be feasible.
Our building is under another jurisdiction (affiliated with the university), and they are reluctant to improve the network speed.
It took them almost a year (after lengthy emails and a few meetings) to fix our network speed, which was running at 100 Mbps (yes, 10MB/s) even though all the hardware supports gigabit.

There are a lot of good IT staff at the university for sure - I have worked with a few, very professional - but they are usually busy with more difficult tasks.
One would be surprised how inefficient and incompetent some of the other "meh" IT staff are. For example, during our Win7 to Win10 migration, an IT staffer broke the connector on our hard drive and wiped my D: data drive (thank God I had backups) by reimaging my PC onto the D: drive instead of the C: drive. He was later fired, obviously.

The 20 sq. ft room is a sealed room with a heavy-duty door and basic sound insulation. No one will work in it regularly. I can move my own homelab server (HP DL380p) in there to test the noise isolation. If it is too noisy, I will look into quieter solutions, like 45Drives (they seem to use Noctua fans).

Good point! I will double-check the room for a sprinkler (I don't think it has one).

I don't think the department will care about the power usage. Compared to other equipment in the department, 6kW from us is insignificant.
The building (a 12-floor commercial building) was completely renovated 3 years ago, so the HVAC is all new. We often feel too cold (like 70F) and put on jackets and hoodies. :D

The servers will not be at 100% load 24x7x365.
One computing session (often just a CPU load from me or just a GPU load from my lab mate) may last up to a day, and then we need time to evaluate the results and make changes before another run. My computations are CPU-only but need a lot of RAM and only last 20 minutes each.

I have been hosting our first Proxmox node (the Threadripper, at my home due to COVID work-from-home) for more than a year now.
The Supermicro GPU server I listed above is already the biggest piece of equipment we would ever get (and most likely we won't even get that). I am leaning towards the Gigabyte 2U.

If we need a GPU cluster like 8x A100 in the future, we will definitely host it in the university data center, or, more cost-effectively, rent computing time from a commercial provider like AWS or Linode.

This is concerning on a few levels - if the higher-ups are worried about a full-height rack in the corner, then the noise and added cooling load are probably going to be a huge problem. This is where buy-in from the departments that handle HVAC and power, from the main IT department, and sign-off from the long-term f/t staff there will help CYA. Do you need to budget for the power usage of your setup plus cooling as well?

Regarding noise and workplace comfort - as a point of reference, one facility I work in has ~40A @ 240V drawn across 2 racks of servers and switches 24x7 in a room that is ~12x18 ft. When I have to work in there for any length of time I wear a noise-cancelling headset and/or full over-the-ear hearing protection (like when I run a chainsaw) so I don't go deaf after a few hours. You also need to semi-yell to talk to people and, depending on where you are, can't talk on the phone because it is too loud. Plus, working on the hot side of the racks feels like it strips all the moisture out of your body after a surprisingly short amount of time. Your initial build-out diagrammed above isn't going to be that bad, thankfully, but if you do add more GPU servers into the mix it will be, so make sure that everyone has the right expectations in place. Running 6kW of servers at full throttle 24x7 is a different game than running a workstation with 2 or 3 GPUs for a weekend.

Depending on what the uni is charging you to build out the outlets, just go with the basics of what you need. You don't need 120V for anything, and if you have a laptop or something temporary, just use the existing wall power.
If the electricians want you to have a balanced load across the 3 phases, then defer to their power people so they end up with something they are happy with. If they are picky about it, there will be fewer problems down the road if you stay out of that part. Don't get the UPS new, but don't get it off eBay either - contact a local company who can install and service a 10kVA+ UPS and have them find a used one they will install and support for you, working with the building staff and electricians to get it going and meet your needs. These are expensive pieces of equipment that need routine maintenance and $ yearly to keep them running. Companies go out of business or swap out equipment all the time, and since they are long-term equipment you can get them serviced professionally and keep them running for decades no problem.
Even if you do buy 2kVA units on your own, you should budget to maintain them over the time frame the whole system lasts. You will probably need only a few replacement hard drives, but you will need a fresh battery set on a regular basis if you want the UPSs to do their job. See if you can get the IT department to supply and maintain them for you, if nothing else so it is one less problem your team has to handle.
The department higher-ups have no objections so far. It is we ourselves who don't want to push too big and become the center of attention.
We are the only engineering-focused lab in the department and have been very successful in our research, bringing the department funding.
If we keep growing on that path, the department will consider hiring more people to support us.

Depending on what the uni is charging you to build out the outlets, just go with the basics of what you need. You don't need 120V for anything, and if you have a laptop or something temporary, just use the existing wall power.
If the electricians want you to have a balanced load across the 3 phases, then defer to their power people so they end up with something they are happy with. If they are picky about it, there will be fewer problems down the road if you stay out of that part. Don't get the UPS new, but don't get it off eBay either - contact a local company who can install and service a 10kVA+ UPS and have them find a used one they will install and support for you, working with the building staff and electricians to get it going and meet your needs. These are expensive pieces of equipment that need routine maintenance and $ yearly to keep them running. Companies go out of business or swap out equipment all the time, and since they are long-term equipment you can get them serviced professionally and keep them running for decades no problem.
Even if you do buy 2kVA units on your own, you should budget to maintain them over the time frame the whole system lasts. You will probably need only a few replacement hard drives, but you will need a fresh battery set on a regular basis if you want the UPSs to do their job. See if you can get the IT department to supply and maintain them for you, if nothing else so it is one less problem your team has to handle.
Yes. I will just tell them that I need three 208V 30A receptacles and let them figure out the best way.
Our building is big enough, with five elevators and HVAC systems; our 6kW load is probably OK.

Getting a large UPS locally with service support is very good advice that I will remember for the future (I have other plans, unrelated to the university, that might need this).
"See if you can get the IT department to supply and maintain them" - I will contact them to see what they have. If they can handle it, that would be great.
 

jena

New Member
May 30, 2020
I have witnessed the progression of a similar "shadow IT" project in a research lab within a large CS dept. CS often has specific hw/sw needs that University IT is not suited for; there's nothing wrong with building up your department's own computing infrastructure -- but it's much more than hardware.

To underscore what has been said above, staffing is more important than equipment in the long run. From the perspective of a student, you're in and out in five years or less. From the perspective of the professors and the research program, something sustainable for a much longer term is needed.

What often happens is there are leftover grant funds that need to be spent, so a bunch of fancy hardware is purchased and installed, and things are shiny for a time -- the servers are fast, work gets done, life is good. The group gets accustomed to having its own cluster on-demand, and research workflows adapt. You graduate and leave, taking institutional memory with you. In time (often much less than 5 years), hw/sw failures happen, or the cluster just can't keep up with increasing demands, and some poor 1st-year student gets saddled with fixing a system that they don't understand and didn't design, and they need to do it immediately because an analysis needs to be re-run for a paper submission deadline in 12 hours. The department can't just fall back to the way things were before with University IT, because they've already structured their workflows around their own cluster.

The default repository of institutional memory is the tenured professors. They are, however, incredibly busy getting grants, and do not have the time or interest to spend all-nighters in the lab teaching said 1st-year student about admining the cluster. In the long term, what your department needs is multiple, staggered, 5-year grants that include budget for infrastructure, sufficient to fund a FTE staff position to support the research computing infrastructure. That staffer would then be the repository of institutional memory, managing the system, handling failures/upgrades, coordinating with University IT, documenting, etc. -- all the usual sysadmin tasks. It's a ton of work, and it's one of the reasons why managed IT costs big bucks.
Very good advice.
I will be more upfront about this with my professor and lab members so that they have the right level of expectations.

Maybe a 3rd (less powerful) Proxmox server so Proxmox can run in high-availability mode in its final form.
We will also try to keep most computing needs runnable on regular workstations (just more slowly).

"It's a ton of work, and it's one of the reasons why managed IT costs big bucks."
Agree!
It's just that the university's general IT is so far behind that we have no hope of getting anything done by them any time soon.
 

petree77

New Member
Mar 10, 2015
I don't think I saw a response above, but you do need to worry about air conditioning for that room. While you said it has its own dedicated thermostat and return air duct, does the AC in the building run during the winter? Most large buildings can't provide both AC and heating at the same time.

You'll need to get explicit confirmation about this. Even if the climate where you are gets nice and cold outside during the winter, if there's no way to bring that cool air into the room in question, you're going to have a very hot room during the winter.