I am building a budget server to run AI and I have no experience running AI software. I'm thinking of starting with a Llama LLM, but I'd also like to get into making AI pictures and videos, plus who knows what else once I learn more about this. The hardware is ordered but hasn't arrived yet, so I'm gathering information now so I know how to get started when it gets here.
System specs:
Dual E5-2686 v4 (36 cores, 72 threads total)
128GB ECC RAM
2TB Gen 4 NVMe SSD
(4) 1TB SATA SSDs in RAID 0
(4) Tesla P40 24GB cards (they use the GP102 chip, same as the Titan Xp and 1080 Ti)
I'm planning to run this headless and remote into it. This is just for tinkering at home and I'm not worried if it isn't the fastest system in the world.
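To give an idea of what I want to try first on the LLM side, here is the kind of minimal test script I'm picturing, going off the llama-cpp-python docs. The model file and all the settings are placeholders since I haven't run any of this yet, and my understanding is it needs llama.cpp built with CUDA support, which ties into my driver question below.

```python
# Rough first test I have in mind (file name and settings are placeholders)
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-70b-instruct.Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=-1,            # offload every layer to the GPUs
    tensor_split=[1, 1, 1, 1],  # split the layers evenly across the 4 P40s
    n_ctx=4096,                 # context window for this test
)

out = llm("Explain what a Tesla P40 is in one paragraph.", max_tokens=200)
print(out["choices"][0]["text"])
```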
What would be the best OS?
What drivers are the best to use with the Tesla P40 cards?
Any other thoughts on this setup, or suggestions?
Do I need to use NVLink on the cards in order to use all the VRAM?
I am thinking of using bifurcation and running each card on 8 PCIe Gen 3 lanes. Do you think that would cause a bottleneck? My rough math is below.
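Here is my back-of-the-envelope math on the x8 link, in case it helps frame the question (these are just the headline numbers from the PCIe 3.0 spec, so correct me if I'm reasoning about this wrong):

```python
# Rough x8 PCIe Gen 3 bandwidth numbers (my assumptions, not measured)
lane_gb_s = 0.985            # ~985 MB/s usable per PCIe 3.0 lane after 128b/130b encoding
link_gb_s = 8 * lane_gb_s    # x8 link: roughly 7.9 GB/s in each direction

model_gb = 24                # worst case: filling one P40's 24GB of VRAM with weights
print(f"x8 Gen 3 link: ~{link_gb_s:.1f} GB/s")
print(f"Time to push {model_gb} GB of weights to a card: ~{model_gb / link_gb_s:.0f} s")
```

From what I've read, most of the traffic happens at model-load time and the weights stay resident in VRAM afterward, so my guess is the x8 link mostly just slows down loading, but that's really what I'm asking.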