How is everyone setting up Local AI in thier home labs? Idea thread.

Notice: Page may contain affiliate links for which we may earn a small commission through services like Amazon Affiliates or Skimlinks.

marcoi

Well-Known Member
Apr 6, 2013
1,696
409
83
Gotha Florida
Well I been running a few different things in my home lab to get some local AI knowledge and testing done.
My current setup:

1. Nvidia 4060 with 8GB Vram in a Qnap NAS running Quts in Container Station.
  • Used for Dockers right now primarily for Open-webui with Ollama.
  • I use it for my local AI Chatbot.
  • Running nemotron-3-nano:4b really well. Quick to load, answer and doesnt use a lot of memory.
  • Leaves the card open for other dockers to run like immich ML and jellyfin transcoding.
2. Miniforums N5PRO with 96GB ECC Ram.
  • I got this at launch before ramogendom started.
  • I use NPU with lemonaid server and FastFlowLM - works decent for chatbot.
  • I used the iGPU with LM studio to run some LLM - seems to like qwen3.5 the most. I have igpu using 48GB Ram of the 96GB.
  • I recently added intel arc pro b70 card via deg1 external gpu dock. Its running with LM studio, but LM studio only shows the dgpu and not igpu. So there seems to be a know bug with either the LM or the runtime which is setup to use vulkan. I havent figured out how to run get that fixed to run both igpu and dgpu on lm studio.
  • I setup a virtual box and ubuntu desktop VM to run claude code cli connected to LM studio so it runs locally.
  • I setup Gitea as a docker to store code changes that claude does.
  • It runs fairly well. The igpu is a bit slow but claude will do coding and such.
  • I run claude with full permissions since it inside a VM.
  • Goal is to run NPU, Igpu and dgpu with various models.
3. Alts - I have a two gaming pcs, which have a 3090 and 5090 card in them. Occasionally I keep them on to run LM studio and connect claude code Ubuntu VM on.
  • 5090 is crazy fast, it will suck up full 600 watts and has some coil whine when running full tilt. like 180 t/s
  • 3090 a bit slower in token per second like 120 t/s (i dont recall.)
  • I prefer not to run these cards as it means another system on.

Goals:
  • Keep chat on NAS since its on 24x7 already and card installed.
  • Setup the N5pro to run three models at the same time, using NPU, igpu and dgpu.
  • Setup more virtual machines on my XCP servers to run development tasks using claude code cli.
  • Learn more about ML and AI in general while taking advantage of coding aspect for work projects.

Let me know how everyone else is doing their setup at home. What been working, not working. What OS you like to use and apps etc.
 
  • Like
Reactions: Patrick

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,883
2,219
113
As another user of the 5090 and 3090 there's a reason I got mac studios... and that's POWER and HEAT!

The studios idle so low the meter blips between no power and 15w, the darn 5090 and 3090 systems idle at 10x that :(

What I'm trying to do is test the 5090 and 3090 on old E5 v4 system @ pciex8
 
Last edited:

marcoi

Well-Known Member
Apr 6, 2013
1,696
409
83
Gotha Florida
@T_Minus your power is absurdly insane if you are paying 44C/kWh. I though FL @22c was bad. I already got my house with 63 panels and 33kw battery setup since i work from home and AC running 24x7 and so is my home lab which averages around 700 watts for servers/Nas/switches etc. My desk pc averages like 500w due to specs and monitors but thats only on for 9ish hours a day. If you can get solar running it worth it in my experience. I finally got like 500 extra KWh saved up using net metering which ill burn through during the last few months of the year when the sun isnt out as much.

I would test my 3090 in dell r730 with dual e5-28xx v4 cpu but its a kingpin model and it wont physically fit. But i think you maybe okay with the slower bandwidth pcie slot. Like you said it take longer to load the model but once there it should run similar to higher spec pcie slot. The only other place it will matter/bottleneck is when your trying to run multi gpu to span memory to run bigger LLM, then i could see it slowing things down a lot.
but if you want three 3090 with three different LLM it probably doable. you can always test by pulling the card out of your gaming pc and running it solo in the e5 system. I think time to load will be longer but tokens per sec come close to gaming pc.

how does the mac studio compare to 5090 card. cost of the mac is like 7500 for 256gb mem model. big mula for home lab lol.

Im hoping the intel b70 card software keeps maturing and it gets to at least 3090 tokens per second. Way better to get 4 1k card to get up to the 128GB share memory. Nvidia cards are expensive.

I moved all my llm downloads to NAS and point LM studio to the share folder. Most of the pcs are on 2.5gb or 10gb on the home network now. It was getting crazy having so much storage wasted on multiple machines.
 

Patriot

Moderator
Apr 18, 2011
1,513
834
113
My lab is 99% for learning not continuous uptime.

I run 0 AI continuously in prod... about to re-fire up a basic setup with twin RTX 8000s (turing) 48gb ea. and 512gb ddr4 2666 with a milan 7763.

Though I am tempted to run 4-8 V100S' and power cap them...they are 300w cards but place nice down to 115w caps.
I also have run a hive of Mi100s and would also like to try 2x 4card hives and power cap them... but they don't play as well under cap as the v100s.
Last I had a hive of Mi100s setup I ran Lama3 128B, its neat but I didn't find it terribly much better than 70B.

I want to try some new MOE models especially with mixed layer loading on vram/ram and disk... Optane Pmem is calling my name.
The main problem I run into is, I don't actually care or want to use AI in my daily life... It's just a tool, and I like to... learn new things.
 

foureight84

Well-Known Member
Jun 26, 2018
458
387
63
My server:

Supermicro X11SPI-TF
Intel Xeon 8260M
1TB Optane Series 100 in AppDirect mode
384GB DDR4-2666
4TB Intel SSDPE2KX040T8P
3 RTX 3090
1 Tesla V100
1600W PSU

I'm probably going to get rid of my Tesla V100 and look for another RTX 3090. The 3090s are capped at 240W and frequency capped to 1800. whereas the V100 @ 200W and 1380Mhz max.

I am running ik_llamacpp with Optane as LLM storage and cache mount in DAX mode. I also have opencode and a few MCP servers setup for tooling. Mostly doing web and mobile application development with it. The primary model running now is Qwen 3.6 with max context size. The V100 is running in RPC work mode and serves mostly for context overflow instead of having it dumped to the Optane cache.
 
Last edited:
  • Like
Reactions: T_Minus

TRACKER

Active Member
Jan 14, 2019
340
147
43
My server:

Super Micro X11SPI-TF
Intel Xeon 8260M
1TB Optane Series 100 in AppDirect mode
384GB DDR4-2666
4TB Intel SSDPE2KX040T8P
3 RTX 3090
1 Tesla V100
1600W PSU

I'm probably going to get rid of my Tesla V100 and look for another RTX 3090. The 3090s are capped at 240W and frequency capped to 1800. whereas the V100 @ 200W and 1380Mhz max.

I am running ik_llamacpp with Optane as LLM storage and cache mount in DAX mode. I also have opencode and a few MCP servers setup for tooling. Mostly doing web and mobile application development with it. The primary model running now is Qwen 3.6 with max context size. The V100 is running in RPC work mode and serves mostly for context overflow instead of having it dumped to the Optane cache.
What read/write speed do you get from optane? :)
 

CyklonDX

Well-Known Member
Nov 8, 2022
1,820
659
113
Supermicro 7049GP-TRT
2x Gold 6254
768GB DDR4 2666MHz 3DS
9400-8i8e (24 bay 3.5" sc jbod for 260TB zfs, and 4x 1.6T sas3 ssds zfs fast storage for kvm's)
bifurcation nvme x16 card (micron 3400 1t, hynix p41 1t, toshiba something 1.75t, oculink to nic 10gig - used as caches for zpools)
aoc-slg-4e4t (4x u.2 intel p4600's - faster pool for projects)

asrock creator 7900xtx (blower) ~ most often for windows kvm, sometimes models on int8/fp16 models
intel b60 blower ~ idle, at some point once kernel changes tickle down to LTS -> sr-iov workload for kvm's
3080ti blower ~ kvm's (want to get rid of it at some point and replace with b60)
4080 blower ~ most ai stuff *(I do cap its power to 150W using nvidia-smi when it gets warmer, it doesn't affect the performance all that much.)

I run most of the ai stuff inside dockers> (and often expose them in some capacity through vps vpn bridge masquerade)
ACE-Step ~> music generation
ComfyUI ~> ai image, video generation (mostly used for ai upscaling of old videos pre 720p quality, vhs family tapes etc)
first manual export of scenes using davinci (i.e. each camera change)
then the scene is exported into seperate frames
then each seperate frame is upscaled/or is adding details
(there are sometimes problems where different details are added, so its something to play with... biggest problem is the heat, so its only doable during winter - and its more like 4-6h of material per winter)
llama.cpp used to do llm's, but also bind multiple different gpu's, and even cpu/ram if i want to run larger models. (oddest config ever ran was mi100, 7900xtx, a4000, 3080ti + cpu/ram and it worked, not fast by any mean but it did work, without cpu much faster, but still at slowest gpu speed.)

I was thinking about something for antrophic claude, but so far i've been keeping out of it; out of from laziness to set up secure enviroment that would keep me fully safe and away from my internal home network *i.e. would need to set up vpn tunnel just for it going to vps box, and kvm with added layers of security... lot of work for potentially couple hours of fun.
 
  • Like
Reactions: T_Minus

foureight84

Well-Known Member
Jun 26, 2018
458
387
63
What read/write speed do you get from optane? :)
It's around PCIE 4 mvme speed. But the advantage comes from the latency. Data stored in Optane Appdirect (Dax) are accessible in the nano second. With applications that are DAX aware, you get to bypass having to load from disk to kernel buffer and the to application. It basically becomes an extension of RAM. With llamacpp, DAX is supported so loading the model is pretty fast, especially restoring cache, near instant context checkpoint search and it's especially noticeable when context gets offloaded to disk (nvme will be much slower than on optane). CXL should provide the same type of low latency storage too, I believe.
 
  • Like
Reactions: T_Minus and TRACKER

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,883
2,219
113
It's around PCIE 4 mvme speed. But the advantage comes from the latency. Data stored in Optane Appdirect (Dax) are accessible in the nano second. With applications that are DAX aware, you get to bypass having to load from disk to kernel buffer and the to application. It basically becomes an extension of RAM. With llamacpp, DAX is supported so loading the model is pretty fast, especially restoring cache, near instant context checkpoint search and it's especially noticeable when context gets offloaded to disk (nvme will be much slower than on optane). CXL should provide the same type of low latency storage too, I believe.
what's GPU configuration? how do you fit 4 on that motherboard?
 

marcoi

Well-Known Member
Apr 6, 2013
1,696
409
83
Gotha Florida
what are you guys running to do benchmarks. i want to take a few runs of the intel b70 on the n5pro then move it over to another system which has b580 to see if the higher speed pcie has an impact. I also have an old bitmining rig which ran like 4-5 gpus when i was mining. They are all connected at pciex1 so no idea how bad it impacts cards. so benching them using a standard method would be good.
I prefer windows since all three system have it as OS, and switching to linux can be done on the n5pro since i installed ubuntu 26 on 2nd drive but the other two systems would be a pain.
 
  • Like
Reactions: T_Minus

foureight84

Well-Known Member
Jun 26, 2018
458
387
63
what's GPU configuration? how do you fit 4 on that motherboard?
Slot 6 (x16), 4 (x16), 2 (x8) (3 RTRX 3090), and 1 (x4) for the Telsa V100 to go through PCH. The Tesla V100 is running as an RPC worker and primarily for context.
 
Last edited:
  • Like
Reactions: T_Minus

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,883
2,219
113
Slot 6 (x16), 4 (x16), 2 (x8) (3 RTRX 3090), and 1 (x4) for the Telsa V100 to go through PCH. The Tesla V1 is running as an RPC worker and primarily for context.
Any pictures? Looking at the X11SPI-TF I don't see how you're fitting 3x 3090s and a 4th card on that, my single 3090 TI is huge (not blower)... which 3090s are they and how big, a picture would be really helpeful! my 5090 FE is much more compact than the 3090, but at current cost rather just get a single 6000 vs multiple 5090s, at least with the 3090s can find them here or there affordable lol

I'm trying to decide which setup for my multi-GPU box, sorry to ask so many questions :)
 

foureight84

Well-Known Member
Jun 26, 2018
458
387
63
Any pictures? Looking at the X11SPI-TF I don't see how you're fitting 3x 3090s and a 4th card on that, my single 3090 TI is huge (not blower)... which 3090s are they and how big, a picture would be really helpeful! my 5090 FE is much more compact than the 3090, but at current cost rather just get a single 6000 vs multiple 5090s, at least with the 3090s can find them here or there affordable lol

I'm trying to decide which setup for my multi-GPU box, sorry to ask so many questions :)
Oh lol. Forgot to mention I am using riser cables on a cheap $30 open frame. Here's a photo from a year ago when I started the project (upgraded power supply since).

Screenshot 2026-04-23 at 1.54.02 PM.png
 
  • Like
Reactions: TRACKER and marcoi

foureight84

Well-Known Member
Jun 26, 2018
458
387
63
Any pictures? Looking at the X11SPI-TF I don't see how you're fitting 3x 3090s and a 4th card on that, my single 3090 TI is huge (not blower)... which 3090s are they and how big, a picture would be really helpeful! my 5090 FE is much more compact than the 3090, but at current cost rather just get a single 6000 vs multiple 5090s, at least with the 3090s can find them here or there affordable lol

I'm trying to decide which setup for my multi-GPU box, sorry to ask so many questions :)
If you're planning on using Optane, I would at least stick to the Ice Lake since you'll get the benefit of PCIe 4.0.
 
  • Like
Reactions: T_Minus

Jelle458

Member
Oct 4, 2022
91
46
18
I am more curious what kind of Qnap that is. I like Qnap and run it myself, but I have never seen one with a GPU in it.
And how you got the GPU in there would be interesting to see as well.
My older TS-1679U-RP and TS-1280U-RP has no chance of getting a GPU in, and I have my doubts that QTS has a driver for them.
Would be cool to run an Nvidia A2 in one of them for Plex transcoding.

Not AI I know, sorry :eek:
 

marcoi

Well-Known Member
Apr 6, 2013
1,696
409
83
Gotha Florida
I am more curious what kind of Qnap that is. I like Qnap and run it myself, but I have never seen one with a GPU in it.
And how you got the GPU in there would be interesting to see as well.
My older TS-1679U-RP and TS-1280U-RP has no chance of getting a GPU in, and I have my doubts that QTS has a driver for them.
Would be cool to run an Nvidia A2 in one of them for Plex transcoding.

Not AI I know, sorry :eek:
The qnap is a tvs-h1688x. I upgraded the CPU to w-1290 (non P) and memory to 128GB. I run QuTS hero 5.3.x OS.
The video card is on their approved hardware list and it is a ZOTAC Gaming GeForce RTX 4060 8GB Twin Edge OC DLSS 3 8GB GDDR6 128-bit 17 Gbps PCIE 4.0 Compact Gaming Graphics Card, ZT-D40600H-10M.
Nvidia driver seems to be version 575.64.05, it installed as a qnap app.
I have the video card assigned to container station to run dockers.
1777032673922.png
The card fits perfectly inside the qnap on the side where the drives and psu are located. There is one pcie slot there and psu has power for 8 pins.
The key is the size of the card since the space is really limited. Also the power draw of the card, since the psu like 5 or 600 watts.

In container station I have all my compose files loaded in the application section.
1777032833274.png

then in whatever docker you want the gpu exposed you add the following:
Code:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
 
  • Like
Reactions: Jelle458

marcoi

Well-Known Member
Apr 6, 2013
1,696
409
83
Gotha Florida
I have a dell r730 with 256GB ram and e5-2884v4 cpu. It doesnt really have space to run video cards. I maybe able to get the intel b70 to fit. problem is it only has pcie 3.0 slots. (as far as I know i have reviewed specs in a while. ) I also have a bunch of sas SSD and sata SSD in the 2.5 bays.

I was trying to find a motherboard that accepted the same memory and offered multiple pcie 4.0 slots. getting the board and cpu shouldnt be as much as getting memory for a new board, thus wanted to pull the memory from the dell and re-use it.

Anyone have any ideas if such a board exists?
 

seany

Member
Jul 14, 2021
47
41
18
H12SSL-I with a 7302P and 512gb of 3200. Mostly as a nas, but it has a VM on it with one of my k3s nodes;128gb of ram and a rtx 4000 ada passed through to it. Right now hosting one of the qwen models to deal with simpler things from openclaw, but also hosts an image generation backend that the wife and I can both access from Krita.

If we wern't trying to move at the moment I'd probably sell the 4000 and get a 6000 pro. More vram would be nice.
 
Last edited:
  • Like
Reactions: T_Minus and Patrick

bayleyw

Active Member
Jan 8, 2014
347
125
43
Stuff that I've tried, in no particular order. The dates matter because model sizes and context lengths have changed a lot in the past three years:

  • A single 3060 (circa 2023): loved it. Ran 7B models in q4 at 16K context just fine, super handy for sundry text manipulation work (summarization, translation, ...). Also ran Stable Diffusion 1.5 and XL at good speeds via diffusers without running out of RAM (this mattered back then, because the U-NET based models did not like being offloaded into system RAM)
  • 8x V100 16GB with NVLINK (mid 2024): meh. It was worth the $3K I put into it and it ran Llama-2-70B at 50 tokens per second in q4, but the software for this thing was a real struggle to set up: vLLM was still on the v0 engine in 2024 and as slow as molasses and the alternative tensor parallel frameworks didn't really want to cooperate
  • A single Arc B50 (2026): surprisingly useful. Runs Z-Image-Turbo at 10 sec/MP in 70W. Bad for language models though (I haven't even tried) but maybe you can get something out of four of them doing tensor parallel?
  • Two RTX A5000 with NVLINK (2026): interchangeable with two 3090s. Generally nice *right now* because one of the best "home" models (Qwen3.6-27B) fits neatly in 48GB and vLLM is no longer slow. Falls apart the instant we get bigger models though, Qwen3.6-27B-AWQ barely fits with max context and you really do need max context in 2026 to do interesting things
  • Epyc 7702 with octal channel DDR4 (2023): 200GB/sec but not really. Ran Llama1-65B back when running an LLM at home was a crazy thing to do. I think you could probably get 10 tok/sec out of it, maybe 15, on Qwen3.5-122B-A10B so if you already own an Epyc it doubles up as a nice rig for overnight tasks
  • 8x V100 32GB with NVLINK (mid 2026): somehow Volta aged well. vLLM stopped sucking, some guy named 1CatAI in China backported it to Volta, and sparse models became the norm. It's REALLY loud though, and the high idle (700W) hurts even if you don't care about your power bill, because you need to rebuild your home HVAC to pump the extra heat out
  • Xeon 8468 with octal channel DDR5 (2024): I really had high hopes for this one. Its greatest achievment was pulling off 20 AMX Tflops doing a full parameter fine tune of a 7B model, but other than that it was really underwhelming. I think the Intel ecosystem is better now, but 256GB of DDR5 RDIMMs are something like $5K+ so you're almost better off buying GPUs
  • A laptop 5090 24GB: I've never been more impressed by a $2800 laptop. It manages to run any video/image generation model there is courtesty of fp8 and ComfyUI offloading and even fits Qwen27B with decent context. Plus, you can show off to your friends how your entire AI coding stack fits in your backpack
The real problem is outside of image generation on a laptop or the B50, none of this makes financial sense. Even the octal V100 running Qwen122B would take two billion tokens to break even with free power. The laptop is at least a useful machine outside of AI, and the B50 could work its way as some sort of incidental device on a router or NAS.