> I found a guy in the EU selling Mi50 32GB for 240 euros + shipping.

I'd grab at least four if I were you, maybe 5. 160GB VRAM lets you load really big models like Qwen3 235B Q4 with lots of room for context. It runs at ~20t/s and is really a replacement for chatgpt.
I am getting 2. If someone in Finland wants to join so we can save on shipping, now is the time!
> And then do what with such models? I really have a problem understanding the use of AI/LLM. I tried a bunch of models but never really felt they fulfill any necessity for me.

I use Qwen 3 Coder 30B Q8 all the time with C#, Python, JS, C++ and Rust. It's not the best but it does really well for most use cases. I use Roo in VS Code. You need to be explicit about what you want and how you want it done. If you expect to give a couple of lines of high-level description and get what you want, you'll be very disappointed.
Deepseek is meh.
Qwen coder is rubbish and the code has bugs. I will have to rewrite it.
Seriously, what do people use such models for?
Running such a server would burn 150 watts idle. Running it 24/7 is just shy of 200 euros a year, for what?
I honestly just need a single GPU for basic models used by my self-hosted services. I am getting a second one as a spare. I am pretty sure even a 16GB GPU will serve my needs.
I am just really wondering what practical uses does AI/LLM have?
> I use Qwen 3 Coder 30B Q8 all the time with C#, Python, JS, C++ and Rust. It's not the best but it does really well for most use cases. I use Roo in VS Code. You need to be explicit about what you want and how you want it done. If you expect to give a couple of lines of high-level description and get what you want, you'll be very disappointed.

It is very difficult to give all the details, and by the 3rd iteration it slows down significantly and starts outputting nonsense.
I can get a day's worth of work done in about one hour using LLMs. Even with writing very thorough descriptions of what I want done it's much faster than doing all the code by hand, and it can do unit tests with practically full coverage with minimal guidance. I can also dump a long stack trace and it will point me exactly to the relevant part.
Qwen3 235B is very good for the cases where 30B is unable to deal with the task. I find 235B also very useful with Linux system administration, something I'm not as experienced in as in Windows. It helped me set up Proxmox and ZFS, explained LXC containers to me, helped me set up NAS and Gitea containers, etc.
I use gpt-oss-120b to help me learn German and for rubber ducking to convert ideas into detailed project and implementation plans. It can take me from a short idea description to a 15-page project definition and a 15-page architecture and implementation plan. It is also our private chatgpt at home where we can ask about random stuff. Of course we double check by googling, but it's still very helpful when you "don't know what you don't know".
BTW, I don't leave my LLM servers running 24/7. That's part of the reason why I use boards with integrated IPMI. I turn them on as needed with ipmitool and shut them down when I'm done. Sometimes I need only one machine on, sometimes I need all three, depending on what I'm doing.
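As a hedged sketch of that on-demand routine (the BMC address and credentials below are placeholders, and the real invocations are left in comments so nothing fires without an actual BMC on the network):

```shell
# Sketch: remote power control of an LLM box via its board's BMC.
# 10.0.0.42 / admin / changeme are made-up placeholders.
BMC="ipmitool -I lanplus -H 10.0.0.42 -U admin -P changeme"

echo "$BMC chassis power on"      # boot the server before a session
echo "$BMC chassis power status"  # check whether it is up
echo "$BMC chassis power soft"    # graceful OS shutdown when done
```

The `-I lanplus` interface is what lets you reach the BMC over the network while the host OS is off.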
> It is very difficult to give all the details, and by the 3rd iteration it slows down significantly and starts outputting nonsense.

I do each iteration in a new chat.
> I also have to carefully review what it outputs because it's not always conforming to what I asked. That's with qwen 2.5 coder.

Qwen3 is way way way better than 2.5. I used 2.5 32B Q8 and it was borderline useless beyond simple tasks. Qwen 3 30B (MoE) is a whole different level. Qwen3 235B is very close to chatgpt premium.
> I work for a company that demands office presence so I cannot really use an LLM from home. I have been using linux for almost 25 years. It is easy to ask AI for help but google is really faster for most cases.

If you know what to do, then googling can indeed be faster. I find it also depends on what google is doing with search. There are days when they insist on showing irrelevant stuff. One thing google doesn't tell me is full commands, even when I know how to do something but am not sure about the arguments. And because I'm running locally, I'm not worried about sharing personal details like paths, IP addresses, usernames, etc.
> What I really like is following deepseek's line of thought, but if I repeat a query twice, I get different answers. Too difficult for me to rely on that.

-s in llama.cpp really makes a difference in reliably getting the same answer again and again. If you like DS's thinking, you'll also like gpt-oss-120b. It really punches above its weight. I haven't felt the need to use DS since Qwen3, and even deleted it a few weeks ago. 20t/s in VRAM beats 4-5t/s hybrid.
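For reference, pinning the seed looks like this (the model path is a placeholder; the actual llama-cli run is left in a comment since it needs a model file):

```shell
# Sketch: fix llama.cpp's sampling seed so the same prompt reproduces
# the same answer. MODEL is an assumed path, not a real file here.
MODEL="$HOME/models/gpt-oss-120b.gguf"
SEED=42
CMD="llama-cli -m $MODEL -s $SEED -p 'Explain LXC vs VMs'"
# Running $CMD twice with the same seed and prompt should give
# identical output; change SEED to get a different sample.
echo "$CMD"
```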
> I do each iteration in a new chat.

Okaaaaaaay. Now that is quite a trick. You live and learn. Thank you!
Start with a description of the idea I had and what I thought about it, and tell the LLM to ask me questions to clarify ambiguities and provide candidate answers for each question. I answer each question either from the candidate answers or write something else. After answering all questions, I tell the LLM to integrate my answers with the original description in a new coherent description without omitting any details (this last nugget actually makes a big difference). I take that and use it as the starting point in a new chat, and copy paste the same prompt I had telling the LLM to ask questions. Rinse, repeat until the questions become of little value or I'm satisfied with the scope/detail. This now becomes the project specification document, and the starting point for working out the architecture. Rinse, repeat the same process until I have an architecture and implementation plan I'm happy with.
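The opening prompt in that loop can be kept as a reusable template, something like this (the wording here is my illustrative guess, not the poster's exact text):

```shell
# Illustrative clarification-loop prompt, stored in a variable so it can
# be pasted into any chat UI. Wording is an assumption.
PROMPT='Here is an idea: <description>.
Ask me questions to clarify any ambiguities, and propose candidate
answers for each question. After I answer, integrate my answers with
the original description into a new coherent description without
omitting any details.'
echo "$PROMPT"
```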
> Qwen3 is way way way better than 2.5. I used 2.5 32B Q8 and it was borderline useless beyond simple tasks. Qwen 3 30B (MoE) is a whole different level. Qwen3 235B is very close to chatgpt premium.

I honestly never tried chatgpt. I am a privacy freak and I don't trust those guys, but if Qwen 3 is way better than 2.5 then I should give that a try.
> One thing google doesn't tell me is full commands, even when I know how to do something but am not sure about the arguments.

I just check man pages but sometimes it is still a PITA. I agree.
> Okaaaaaaay. Now that is quite a trick. You live and learn. Thank you!

You're most welcome! This workflow, simple as it is, took a lot of experimenting to get to.
> I honestly never tried chatgpt. I am a privacy freak and I don't trust those guys, but if Qwen 3 is way better than 2.5 then I should give that a try.

I think you should, be it chatgpt, claude or gemini, if only to get a baseline for capabilities. You don't have to create an account and log in. Just use it for random stuff you don't care about.
> But isn't 30B MoE runnable on the CPU since it's MoE?

Everything can be run on CPU. It all depends on how fast you want your answer, or how patient you are.
> 235B is definitely beyond my capabilities to run.

You can definitely get a taste for it using CPU if you have more than 128GB RAM. It's a MoE with 22B active. Unsloth's Q4_K_XL GGUF is 134GB.
> How much vram does gpt oss need? Qwen 30B should be fine on 32GB but qwen 235B is probably hopeless.

gpt-oss-120B is ~64GB. I run it with plenty of context with three 3090s (72GB VRAM). So, you'll need three 32GB Mi50s if you want to run it. It's 5B active so it's really really fast. On the 3090s using llama.cpp I get ~120t/s up to ~8k context.
> I found an X11SPi-TF with xeon silver 4114 for 240 euros delivered, but the cost of populating it with RAM + 2 more GPUs? That's 1k more or less. Tempting, but not really sure it's worth all this :/

Two GPUs is not worth it. IMO, aim for four and get a QQ89 Cascade Lake ES. 128GB VRAM plus 128GB system RAM will let you run Qwen3 235B probably around 15t/s. You can run gpt-oss-120b on three cards, while running Qwen3 Coder 30B on the fourth card with context on the third. You'll need to manually split layers of gpt-oss-120b so that only context is on the 3rd card, same with Qwen3 30B.
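The manual split is done with llama.cpp's `--tensor-split` option, which takes per-GPU proportions rather than gigabytes. A hedged sketch (model path and ratios are assumptions to illustrate the idea, not the poster's exact values):

```shell
# Sketch: weight llama.cpp's layer split so the first two cards carry
# the model layers and the third card stays mostly free for KV cache.
# MODEL is a placeholder path; ratios need tuning per setup.
MODEL="$HOME/models/gpt-oss-120b.gguf"
CMD="llama-server -m $MODEL --n-gpu-layers 99 --tensor-split 10,10,1 -c 32768"
# The real run would be: $CMD
echo "$CMD"
```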
> Two GPUs is not worth it. IMO, aim for four and get a QQ89 Cascade Lake ES. 128GB VRAM plus 128GB system RAM will let you run Qwen3 235B probably around 15t/s. You can run gpt-oss-120b on three cards, while running Qwen3 Coder 30B on the fourth card with context on the third. You'll need to manually split layers of gpt-oss-120b so that only context is on the 3rd card, same with Qwen3 30B.

I do hear you, but that is 220 for the board + 90 CPU + 300-400 for the RAM + 500 for the GPUs, to use a model for 2 hours every week or so?
> I think my problem is I assumed such models are intelligent. Seems I have to treat them as dumb things that need everything to be specified.

They're somewhere in between.
> Wouldn't 2x AMD Radeon Mi50 32GB (total 64GB VRAM) be enough for gpt-oss-120b? Seems like it needs >= 60 GB of RAM, but yeah, maybe that's just to store the Model and doesn't leave any Space for Prompt Processing?

The model is ~64GB and you still need a few more GBs for context. I'd say the minimum to run it with decent context is 72GB.
> I still need to figure out how to integrate the 20 x AMD Radeon Mi50 16GB. 8 of those will be 128GB VRAM, so I guess I could fill 2 Chassis and then have a couple in two Desktop Systems.

Your comment reminded me of that old Dell PowerEdge C410x "GPU shelf". It runs only PCIe Gen 2 but can hold 8 two-slot GPUs and connects to the host server via a single x16 card. They used to be very cheap on ebay, but I see they've also become unobtainium now.
> How are you integrating / interfacing to it though? I know there is Open WebUI for Ollama, but I also saw some People using e.g. N8N for Home Automation (there are some Videos on Youtube), Continue (VS Code / Codium Extension) and of course there is also the API / Model Context Protocol (MCP) that you could call from e.g. a Python Script.

OpenWebUI works with anything that exposes an OpenAI compatible API. Most open source VS Code extensions also work with anything OpenAI API compatible.
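On the wire, all of those front-ends speak the same chat-completions endpoint. A minimal sketch against a local llama-server (default port 8080 assumed; the model name is a placeholder, and the curl call is commented out since it needs a running server):

```shell
# Sketch: raw OpenAI-compatible request a UI like OpenWebUI would send
# to a local llama-server. "qwen3-coder-30b" is an assumed model name.
REQ='{"model":"qwen3-coder-30b","messages":[{"role":"user","content":"Hi"}]}'
# curl -s http://localhost:8080/v1/chat/completions \
#      -H "Content-Type: application/json" -d "$REQ"
echo "$REQ"
```

Any client that can POST this JSON shape can drive the same backend, which is why swapping UIs is painless.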
> The model is ~64GB and you still need a few more GBs for context. I'd say the minimum to run it with decent context is 72GB.

Uhm, I don't think it's supported to have 2x32GB + 1x16GB, is it?
> Your comment reminded me of that old Dell PowerEdge C410x "GPU shelf". It runs only PCIe Gen 2 but can hold 8 two-slot GPUs and connects to the host server via a single x16 card. They used to be very cheap on ebay, but I see they've also become unobtainium now.

I could also get a X9DRX+-F, X10DRX or X10DRG-Q, which would be good for 5x GPUs without Risers.
> OpenWebUI works with anything that exposes an OpenAI compatible API. Most open source VS Code extensions also work with anything OpenAI API compatible.

I actually use Codium, not VS Code, on my Personal Desktop. Different Story at work (Windows + VS Code).
> For chat, I use OpenWebUI. For inference I use llama.cpp exclusively, and use llama-swap to switch models.

So you don't run Ollama, you just use the most recent changes from the upstream project of Ollama, basically?
> In VS Code I use RooCode. I don't use vLLM because model loading takes forever. Llama.cpp has regressions sometimes, but I just keep previous builds to get around that. I have a simple bash script to build it into a directory named after the git tag and then copy the artifacts to a fixed directory where my llama-swap config points to.

OK, I didn't have to build custom Stuff so far, I've just been playing a bit with Ollama.
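The build-per-tag idea mentioned above might look roughly like this (paths are assumptions, and the cmake/copy steps are left as comments since they need a llama.cpp checkout):

```shell
# Sketch: keep each llama.cpp build in a directory named after its git
# tag, so a regression can be dodged by pointing back at an older build.
TAG="b4500"                        # real script: TAG=$(git describe --tags)
BUILD_DIR="$HOME/llama.cpp-builds/$TAG"
mkdir -p "$BUILD_DIR"
# cmake -B build -DGGML_CUDA=ON && cmake --build build -j
# cp build/bin/llama-server build/bin/llama-cli "$BUILD_DIR/"
# ln -sfn "$BUILD_DIR" "$HOME/llama.cpp-current"  # llama-swap points here
echo "$BUILD_DIR"
```

Repointing the symlink is what makes rollback a one-liner instead of a rebuild.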
> Uhm, I don't think it's supported to have 2x32GB + 1x16GB, is it?

Repeat after me: VRAM is VRAM. The models don't care, and with MoE models running in llama.cpp or any of its derivatives, you can split however you want with almost no penalty (within reason) because there is no tensor parallelism.
> I could also get a X9DRX+-F, X10DRX or X10DRG-Q, which would be good for 5x GPUs without Risers.

I have some X10DRX that I plan to sell in January, if you're interested.
> I actually use Codium, not VS Code

IIRC, it supports the same plug-ins as the full Spyware VS Code.
> So you don't run Ollama, you just use the most recent changes from the upstream project of Ollama, basically?

You can even say I actively hate ollama. I fully understand the premise: easy to set up for a noob. I used it for about a week when I was starting. It gets very irritating very quickly the moment you know what you want to do, and the way they leech off llama.cpp without giving anything back (they don't even acknowledge that it's basically a GUI wrapper for llama.cpp) is irritating.
> Is that because you don't need a GUI (although I think llama.cpp just received a GUI recently) or need to do more advanced Stuff?

No, it's because they make it very painful to do anything beyond using it as-is. Any configuration changes have to be done by polluting your environment with 30 env vars. I also dislike how they mangle model filenames, making it practically impossible to use any model outside of ollama despite them being regular GGUFs, how they hide quantizations, and how they sometimes even deceive by naming distillations to make them seem like the full-fat models (ex: deceiving people into thinking DeepSeek 8B is the same model as the 671B).
> OK, I didn't have to build custom Stuff so far, I've just been playing a bit with Ollama.

Ollama tends to be 2-6 weeks behind llama.cpp in terms of features, except when they decide to implement support for some new model (which they never contribute back to llama.cpp).
I built quite a few custom Docker Containers so it's not like it scares me, but I guess there are some special Things to know about llama.cpp and the AMD Radeon Mi50.