Thanks, the extra details help fill in the picture.
I totally agree that you need to be careful if you are not staff there and are getting roped into deciding how to spend the money on the project. You do want buy-in from the long-term staff who will be responsible for keeping what is built up and running once you are gone. You don't want to be known as the person who gets blamed for "___", especially in a university setting where your circle of contacts with those people, or people who know those people, will continue for decades after you have left. The kicker is that "___" is totally not known until after you leave.
Agree 100%!
I will explicitly tell my professor that he needs a long-term person to maintain it.
I will also tell the lab members to have realistic expectations.
What I am building is a supplement and a convenience to increase the lab members' work efficiency.
They should keep a way to switch back to their own workstations for computation.
The hardware purchase will not start until the new students arrive in the fall and we can discuss their needs.
The new PhD student I mentioned as seasoned in IT and programming did a short internship with us and floated a few project ideas (like NVIDIA Jetson, etc.) at that time, only to find out they are not feasible on the existing University network. For example, we couldn't even add a switch to extend a few ports when there is only one port per cubicle. Installing a new port costs $300-500 and takes a few months.
What types of experiments are you working on? ML/AI? Graphics-related things? I mention it because in a university setting I expect the projects people work on will cycle over time as the students rotate through, and what each research project needs in hardware can vary a lot. Unless the group is explicitly focused on one type of research that sort of locks you in on tool chains and things for 5+ years, you may want to stick with spending $10-20k at a time on awesome workstations that help you get that year's students going vs. dropping a lot of $ on a setup that is fun today but meh for the next students and totally ignored 4 years out. You may want to consider buying hardware that you intend to sell off or move around to other departments to fund the latest and greatest for your team before the value of the item drops to zero.
You have 3 different types of systems: the fileserver, the virtualization hosts, and a GPU server. While they will all probably last 10 years, depending on what you are doing with them, whether they provide the same bang for the buck 3 or 5 years down the line is not guaranteed. Would a 3- or 5-year-old GPU cluster help with what the students are doing today? You can always find something to put on a VM server, so that isn't really a problem. Storage is another matter - if you are doing ML-type things and you want speed, then slow SATA/SAS 3.5" storage probably isn't worth spending new money on (especially now with Chia f'ing up the market). Unless it is mostly static data storage, you may be underwhelmed without at least SSDs and 25Gb networking in the mix. If you are looking at only ~100TB, can you get a used 24-bay server from another department that wants to upgrade, swap out the drives, and spend more of your budget on compute/GPU/network?
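To put rough numbers behind the SSD/25Gb point, here is a quick back-of-envelope sketch (the per-device throughput figures are generic assumptions for sequential reads, not measurements of any specific hardware):

```python
# Back-of-envelope: how many drives it takes to keep a network link busy
# with sequential reads. Per-device rates are ballpark assumptions.
HDD_MBPS = 200          # ~200 MB/s per 7200 rpm 3.5" drive (sequential, best case)
SATA_SSD_MBPS = 500     # ~500 MB/s per SATA SSD
LINKS_MBPS = {"1GbE": 125, "10GbE": 1250, "25GbE": 3125}   # line rate in MB/s

for link, rate in LINKS_MBPS.items():
    print(f"{link}: ~{rate / HDD_MBPS:.1f} HDDs or "
          f"~{rate / SATA_SSD_MBPS:.1f} SATA SSDs to saturate")
```

Rough takeaway from those assumed numbers: a single spinning disk already saturates 1GbE, a modest HDD array can feed 10GbE for big sequential transfers, and 25GbE (or anything random/small-file heavy) is where flash really starts to matter.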
You probably have already considered this, but if what you need is dedicated storage and CPU time, can you buy the hardware and pay to put it in the IT department's existing data center? Adding 10U of hardware to a room that is already set up for it is way easier than building out everything from scratch, and you get to put more $ into the things you really want to achieve vs. the support crap like PDUs, racks, UPSs, HVAC, and sq ft.
Currently most of our data is on the University's network drive (offsite, two locations) and can be accessed at about 100 MB/s.
Our data sets are mainly large image data (~5GB each) and experimental recordings from instrumentation (1-5GB each).
We cannot afford all-flash bulk storage at this time.
What I can do is allocate a pair of enterprise NVMe SSDs in each Proxmox node as fast temporary storage for computation.
I actually only have two types of servers; sorry for the confusion:
- a bulk storage server, which is much faster to access than the offsite University network drive
- Proxmox nodes, for general computation needs and prototyping code
I don't think we need a dedicated GPU cluster yet. If we ever do, I will suggest that the lab members and the professor use the University cluster or a commercial one. On the first Proxmox node (the Threadripper), we already use GPU passthrough for one student's CUDA-based simulation.
Our computation needs are quite diverse:
- CPU-based: heavy reliance on MATLAB (some of it can be parallelized; see the sketch below) and various GUI-based image data pre-processing tools on Windows and Linux.
- GPU-based: CUDA-based simulation; the deep learning work is honestly still in its infancy.
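Since most of the CPU-side work is independent per file, the pattern we would parallelize is basically a batch map over the dataset. Here is a minimal Python sketch of that pattern (the preprocess() function and the data directory are hypothetical placeholders; the real processing lives in MATLAB and the GUI tools):

```python
# Minimal sketch of the embarrassingly-parallel batch pattern our CPU jobs follow.
# preprocess() and the data directory are hypothetical placeholders.
from multiprocessing import Pool
from pathlib import Path

def preprocess(image_path: Path) -> str:
    # Placeholder for per-file work (filtering, registration, feature extraction, ...).
    # Each ~5GB image is independent, so the files can be processed in parallel.
    return f"processed {image_path.name}"

if __name__ == "__main__":
    images = sorted(Path("/scratch/dataset").glob("*.tif"))
    with Pool(processes=8) as pool:          # one worker per core, tuned to RAM limits
        for result in pool.imap_unordered(preprocess, images):
            print(result)
```

The MATLAB equivalent is a parfor loop over the same file list with the Parallel Computing Toolbox.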
I will ask around to see if we can get surplus server racks, etc.
The main problem is that we only get 1 gigabit Ethernet.
We still need to do some data processing on our office workstations. A few lab members ran out of disk space on those; I helped add 6TB HDDs, but ultimately there are only so many HDD bays. They end up uploading less-used data to the University network drive and downloading it when needed.
Those office workstations also have no RAID redundancy, so if an HDD dies it comes down to whether the individual backed up to the University network drive. Uploading and downloading 10-50GB of data at a time is a pain over gigabit Ethernet.
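For a concrete sense of that pain, a quick back-of-envelope timing (the effective rates are assumptions: ~110 MB/s for a healthy 1GbE link, ~700 MB/s for a typical real-world 10GbE file copy):

```python
# Rough wall-clock time to move a dataset at different effective link speeds.
SIZES_GB = [10, 50]
EFFECTIVE_MBPS = {"1GbE": 110, "10GbE": 700}   # assumed effective rates, MB/s

for size_gb in SIZES_GB:
    for link, rate in EFFECTIVE_MBPS.items():
        minutes = size_gb * 1000 / rate / 60
        print(f"{size_gb:>2} GB over {link}: ~{minutes:.1f} min each way")
```

Seven to eight minutes each way per 50GB, several times a week per person, is exactly the kind of friction a faster local storage setup would remove.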
I would love to host those in a data center if the Ethernet infrastructure were better.
For that, we are willing to pay them to upgrade our Ethernet ports to 10G, but it doesn't seem to be feasible.
Our building is under another jurisdiction (affiliated with the University), and they are reluctant to improve the network speed.
It took them almost a year (after lengthy emails and a few meetings) to fix our network, which was running at 100 Mbps (yes, ~10 MB/s) even though all the hardware supports gigabit.
There are plenty of good IT staff at the University, for sure; I have worked with a few who were very professional, but they are usually busy with more difficult tasks.
One would be surprised how inefficient and incompetent some of the other "meh" IT staff are. For example, during our Win7-to-Win10 migration, an IT staffer broke the connector on one of our hard drives and wiped my D: data drive (thank God I had backups) by reimaging my PC onto D: instead of C:. He was later fired, obviously.
The 20 sq. ft. room is sealed, with a heavy-duty door and basic sound insulation, and nobody will be working in it regularly. I can move my own home lab server (an HP DL380p) in there to test the noise isolation. If it is too noisy, I will look into quieter options like 45Drives (they seem to use Noctua fans).
Good point! I will double-check the room for a sprinkler (I don't think it has one).
I don't think the department will care about the power usage. Compared to other equipment in the department, 6kW from us is insignificant.
The building (a 12-floor commercial building) was completely renovated 3 years ago, so the HVAC is all new. We often feel too cold (around 70°F) and put on jackets and hoodies.
The servers will not be at 100% load 24x7x365.
One computing session (often just CPU load from me, or just GPU load from my lab mate) may last up to a day, and then we need time to evaluate the results and make changes before the next run. My own computations are CPU-only, need a lot of RAM, and last only about 20 minutes each.
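To attach some rough numbers to the 6kW figure and that duty cycle, here is a sketch where the idle draw, duty cycle, and electricity rate are all assumptions for illustration only:

```python
# Back-of-envelope monthly energy for the planned ~6 kW setup.
# Idle draw, duty cycle, and $/kWh are assumptions, not measurements.
PEAK_KW, IDLE_KW = 6.0, 1.5
DUTY_CYCLE = 0.25            # assumed fraction of hours at full load
RATE_USD_PER_KWH = 0.12      # assumed electricity rate
HOURS_PER_MONTH = 730

avg_kw = PEAK_KW * DUTY_CYCLE + IDLE_KW * (1 - DUTY_CYCLE)
kwh_month = avg_kw * HOURS_PER_MONTH
print(f"Average ~{avg_kw:.1f} kW -> ~{kwh_month:.0f} kWh/month, "
      f"~${kwh_month * RATE_USD_PER_KWH:.0f}/month before cooling")
print(f"Flat-out 24x7 worst case: ~{PEAK_KW * HOURS_PER_MONTH:.0f} kWh/month, "
      f"~${PEAK_KW * HOURS_PER_MONTH * RATE_USD_PER_KWH:.0f}/month before cooling")
```

Under those assumptions the realistic case is on the order of a couple hundred dollars a month of electricity, and the 24x7 worst case roughly double that, plus whatever cooling overhead the building charges.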
I have been hosting our first Proxmox node (the Threadripper, at my home due to COVID work-from-home) for more than a year now.
The Supermicro GPU server that I put up there is already the biggest thing we would ever get (and most likely we won't even get that). I am leaning towards the Gigabyte 2U.
If we ever need a GPU cluster like 8x A100, we will definitely host it in the university data center, or, more cost-effectively, rent computing time from a commercial provider like AWS or Linode.
This is concerning on a few levels - if the higher-ups are worried about a full-height rack in the corner, then the noise and added cooling load are probably going to be a huge problem. This is where buy-in from the departments that handle HVAC and power, the main IT department, and sign-off from the long-term f/t staff there will help CYA. Do you need to budget for the power usage of your setup + cooling as well?
Regarding noise and workplace comfort - as a point of reference, one facility I work in has ~40A @ 240V drawn across 2 racks of servers and switches 24x7 in a room that is ~12x18. When I have to work in there for any length of time I wear a noise-cancelling headset and/or full over-the-ear hearing protection (like when I run a chainsaw) so I don't go deaf after a few hours. You also need to semi-yell to talk to people and can't talk on the phone, depending on where you are, because it is too loud. Plus, working on the hot side of the racks feels like it strips all the moisture out of your body after a surprisingly short amount of time. Your initial build-out diagrammed above isn't going to be that bad, thankfully, but if you do add more GPU servers into the mix it will be, so make sure that everyone has the right expectations in place. Running 6kW of servers full throttle 24x7 is a different game than running a workstation with 2 or 3 GPUs for a weekend.
The department higher-ups don't have any objections so far. It is we who don't want to push too big and become the center of attention.
We are the only engineering-focused lab in the department, and we have been very successful in our research and in bringing the department funding.
If we keep growing on that path, the department will consider hiring more people to support us.
Depending on what the uni is charging you to build out the outlets, just go with the basics of what you need. You don't need 120V for anything, and if you have a laptop or something temporary, just use the existing wall power.
If the electricians want you to have a balanced load across the 3 phases, then you want to defer to their power people so they end up with something they are happy with. If they are picky about it, there are fewer problems down the road if you stay out of that part. Don't get the UPS new, but don't get it off eBay either - contact a local company who can install and service a 10kVA+ UPS, and have them find a used one they will install and support for you, working with the building staff and electricians to get it going and meet your needs. They are expensive pieces of equipment that need routine maintenance and $ yearly to keep them running. Companies go out of business or swap out equipment all the time, and since these are long-term pieces of equipment, you can get them serviced professionally and keep them running for decades, no problem.
Even if you do buy 2kVA units on your own, you should budget to maintain them over the time frame the whole system lasts. You will probably need only a few replacement hard drives over that span, but you will need a fresh battery set on a regular basis if you want the UPSs to do their job. See if you can get the IT department to supply and maintain them for you, if nothing else, so it is one less problem your team handles.
Yes. I will just tell them that I need three 208V 30A circuits and let them figure out the best way.
Our building is big enough, with five elevators and its own HVAC, so our 6kW load is probably fine (rough check below).
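A quick sanity check on three 208V 30A circuits versus the ~6kW load (the 80% continuous-load derating is the usual rule of thumb, taken here as an assumption):

```python
# Capacity check: three 208 V / 30 A circuits vs. the planned ~6 kW load.
VOLTS, AMPS, CIRCUITS = 208, 30, 3
DERATE = 0.8                      # common 80% continuous-load rule of thumb
per_circuit_kva = VOLTS * AMPS / 1000
usable_kva = per_circuit_kva * DERATE * CIRCUITS
print(f"{per_circuit_kva:.2f} kVA per circuit, "
      f"~{usable_kva:.1f} kVA usable across {CIRCUITS} circuits vs ~6 kW planned")
```

So there is roughly 2x headroom even before accounting for the load never running flat out.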
Getting a large UPS locally with installation and service support is very good advice that I will keep in mind for the future (I have other plans, unrelated to the University, that might need this).
"See if you can get the IT department to supply and maintain them" - I will contact them to see what they have. If they can handle it, that is great.