EU [WTB] Storage and low end recent gpus

MSameer

Member
May 8, 2025
81
69
18
I found a guy in the EU selling Mi50 32GB for 240 euros + shipping.

I am getting 2. If someone in Finland wants to join so we can save on shipping then that is the time :)
 

iraqigeek

Member
Sep 17, 2018
96
65
18
I found a guy in the EU selling Mi50 32GB for 240 euros + shipping.

I am getting 2. If someone in Finland wants to join so we can save on shipping then that is the time :)
I'd grab at least four if I were you, maybe 5. 160GB VRAM lets you load really big models like Qwen3 235B Q4 with lots of room for context. It runs at ~20 t/s and is a real replacement for chatgpt
 

MSameer

Member
May 8, 2025
81
69
18
And then do what with such models? I really have a problem understanding the use of AI/LLMs. I tried a bunch of models but never really felt they fulfilled any need for me.

Deepseek is meh.

Qwen coder is rubbish and the code has bugs. I will have to rewrite it.

Seriously, what do people use such models for?

Running such a server would burn 150 watts idle. Running it 24/7 is just shy of 200 euros a year, for what?
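Quick back-of-the-envelope on that figure (the ~0.15 euro/kWh electricity price is an assumption; prices vary a lot across the EU):

```python
idle_watts = 150
hours_per_year = 24 * 365              # 8760 h
eur_per_kwh = 0.15                     # assumed; check your local tariff

kwh_per_year = idle_watts * hours_per_year / 1000   # 1314 kWh
cost_per_year = kwh_per_year * eur_per_kwh          # ~197 EUR

print(kwh_per_year, round(cost_per_year))
```

So "just shy of 200 euros a year" checks out at that price.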

I honestly just need a single GPU for basic models used by my self-hosted services. I am getting a second one as a spare. I am even sure my needs will be served by a 16GB GPU.

I am just really wondering: what practical uses do AI/LLMs have?
 

iraqigeek

Member
Sep 17, 2018
96
65
18
And then do what with such models? I really have a problem understanding the use of AI/LLMs. I tried a bunch of models but never really felt they fulfilled any need for me.

Deepseek is meh.

Qwen coder is rubbish and the code has bugs. I will have to rewrite it.

Seriously, what do people use such models for?

Running such a server would burn 150 watts idle. Running it 24/7 is just shy of 200 euros a year, for what?

I honestly just need a single GPU for basic models used by my self-hosted services. I am getting a second one as a spare. I am even sure my needs will be served by a 16GB GPU.

I am just really wondering: what practical uses do AI/LLMs have?
I use Qwen 3 Coder 30B Q8 all the time with C#, Python, JS, C++ and Rust. It's not the best, but it does really well for most use cases. I use Roo in VS Code. You need to be explicit about what you want and how you want it done. If you expect to give a couple of lines of high-level description and get what you want, you'll be very disappointed.

I can get a day's worth of work done in about one hour using LLMs. Even with writing very thorough descriptions of what I want done it's much faster than doing all the code by hand, and it can do unit tests with practically full coverage with minimal guidance. I can also dump a long stack trace and it will point me exactly to the relevant part.

Qwen3 235B is very good for the cases where 30B is unable to deal with the task. I find 235B also very useful for Linux system administration, something I'm not as experienced in as Windows. It helped me set up Proxmox and ZFS, explained LXC containers to me, helped me set up NAS and Gitea containers, etc.

I use gpt-oss-120b to help me learn German and for rubber-ducking to convert ideas into detailed project and implementation plans. It can take me from a short idea description to a 15-page project definition and a 15-page architecture and implementation plan. It is also our private chatgpt at home where we can ask about random stuff. Of course we double-check by googling, but it's still very helpful when you "don't know what you don't know".

BTW, I don't leave my LLM servers running 24/7. That's part of the reason why I use boards with integrated IPMI. I turn them on as needed with ipmitool and shut them down when I'm done. Sometimes I need only one machine on, sometimes I need all three, depending on what I'm doing.
 
  • Wow
Reactions: luckylinux

MSameer

Member
May 8, 2025
81
69
18
I use Qwen 3 Coder 30B Q8 all the time with C#, Python, JS, C++ and Rust. It's not the best, but it does really well for most use cases. I use Roo in VS Code. You need to be explicit about what you want and how you want it done. If you expect to give a couple of lines of high-level description and get what you want, you'll be very disappointed.

I can get a day's worth of work done in about one hour using LLMs. Even with writing very thorough descriptions of what I want done it's much faster than doing all the code by hand, and it can do unit tests with practically full coverage with minimal guidance. I can also dump a long stack trace and it will point me exactly to the relevant part.

Qwen3 235B is very good for the cases where 30B is unable to deal with the task. I find 235B also very useful for Linux system administration, something I'm not as experienced in as Windows. It helped me set up Proxmox and ZFS, explained LXC containers to me, helped me set up NAS and Gitea containers, etc.

I use gpt-oss-120b to help me learn German and for rubber-ducking to convert ideas into detailed project and implementation plans. It can take me from a short idea description to a 15-page project definition and a 15-page architecture and implementation plan. It is also our private chatgpt at home where we can ask about random stuff. Of course we double-check by googling, but it's still very helpful when you "don't know what you don't know".

BTW, I don't leave my LLM servers running 24/7. That's part of the reason why I use boards with integrated IPMI. I turn them on as needed with ipmitool and shut them down when I'm done. Sometimes I need only one machine on, sometimes I need all three, depending on what I'm doing.
It is very difficult to give all the details, and by the 3rd iteration it slows down significantly and starts outputting nonsense.

I also have to carefully review what it outputs because it doesn't always conform to what I asked. That's with Qwen 2.5 Coder.

I work for a company that demands office presence, so I cannot really use LLMs from home. I have been using Linux for almost 25 years. It is easy to ask AI for help, but Google is really faster for most cases.

What I really like is following Deepseek's line of thought, but if I repeat a query twice, I get different answers. Too difficult for me to rely on that.
 
  • Like
Reactions: iraqigeek

iraqigeek

Member
Sep 17, 2018
96
65
18
It is very difficult to give all the details, and by the 3rd iteration it slows down significantly and starts outputting nonsense.
I do each iteration in a new chat.

Start with a description of the idea I had and what I thought about it, and tell the LLM to ask me questions to clarify ambiguities and provide candidate answers for each question. I answer each question either from the candidate answers or write something else. After answering all questions, I tell the LLM to integrate my answers with the original description in a new coherent description without omitting any details (this last nugget actually makes a big difference). I take that and use it as the starting point in a new chat, and copy paste the same prompt I had telling the LLM to ask questions. Rinse, repeat until the questions become of little value or I'm satisfied with the scope/detail. This now becomes the project specification document, and the starting point for working out the architecture. Rinse, repeat the same process until I have an architecture and implementation plan I'm happy with.
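To make that concrete, here's a minimal Python sketch of the loop. The prompt wording is my paraphrase, and ask() and answer() are hypothetical callables: ask() would wrap whatever sends a single prompt to your local OpenAI-compatible server, and answer() is you, the human, replying to the questions.

```python
QUESTION_PROMPT = (
    "Here is a project idea:\n\n{description}\n\n"
    "Ask me questions to clarify ambiguities, and provide candidate "
    "answers for each question."
)
MERGE_PROMPT = (
    "Integrate my answers with the original description into a new, "
    "coherent description without omitting any details.\n\n"
    "Original description:\n{description}\n\nMy answers:\n{answers}"
)

def refine(description, ask, answer, rounds=3):
    """Each round is a fresh chat: a question pass, human answers,
    then a merge pass whose output seeds the next round."""
    for _ in range(rounds):
        questions = ask(QUESTION_PROMPT.format(description=description))
        answers = answer(questions)
        description = ask(MERGE_PROMPT.format(description=description,
                                              answers=answers))
    return description
```

You stop iterating when the questions stop being valuable, and the final description becomes the spec document.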

I also have to carefully review what it outputs because it doesn't always conform to what I asked. That's with Qwen 2.5 Coder.
Qwen3 is way, way, way better than 2.5. I used 2.5 32B Q8 and it was borderline useless beyond simple tasks. Qwen 3 30B (MoE) is a whole different level. Qwen3 235B is very close to chatgpt premium.

I work for a company that demands office presence, so I cannot really use LLMs from home. I have been using Linux for almost 25 years. It is easy to ask AI for help, but Google is really faster for most cases.
If you know what to do, then googling can indeed be faster. I find it also depends on what Google is doing with search. There are days when they insist on showing irrelevant stuff. One thing Google doesn't tell me is full commands, even when I know how to do something but am not sure about the arguments. And because I'm running locally, I'm not worried about sharing personal details like paths, IP addresses, usernames, etc.

What I really like is following Deepseek's line of thought, but if I repeat a query twice, I get different answers. Too difficult for me to rely on that.
Setting the seed (-s) in llama.cpp really makes a difference in reliably getting the same answer again and again. If you like DS's thinking, you'll also like gpt-oss-120b. It really punches above its weight. I haven't felt the need to use DS since Qwen3, and even deleted it a few weeks ago. 20 t/s in VRAM beats 4-5 t/s hybrid.

Apart from Qwen3 235B, and gpt-oss-120b, I run all other models at Q8. gpt-oss-120b is native Q4.

FYI, Mistral models are also very good, and so is Google's Gemma. Mistral just released a bunch of new models today, including a 675B MoE.
 

MSameer

Member
May 8, 2025
81
69
18
I do each iteration in a new chat.

Start with a description of the idea I had and what I thought about it, and tell the LLM to ask me questions to clarify ambiguities and provide candidate answers for each question. I answer each question either from the candidate answers or write something else. After answering all questions, I tell the LLM to integrate my answers with the original description in a new coherent description without omitting any details (this last nugget actually makes a big difference). I take that and use it as the starting point in a new chat, and copy paste the same prompt I had telling the LLM to ask questions. Rinse, repeat until the questions become of little value or I'm satisfied with the scope/detail. This now becomes the project specification document, and the starting point for working out the architecture. Rinse, repeat the same process until I have an architecture and implementation plan I'm happy with.
Okaaaaaaay. Now that is quite a trick. You live and learn. Thank you :)


Qwen3 is way, way, way better than 2.5. I used 2.5 32B Q8 and it was borderline useless beyond simple tasks. Qwen 3 30B (MoE) is a whole different level. Qwen3 235B is very close to chatgpt premium.
I honestly never tried chatgpt. I am a privacy freak and I don't trust those guys, but if Qwen 3 is way better than 2.5 then I should give it a try.

But isn't 30B MoE runnable on the CPU since it's MoE?

235B is definitely beyond my capabilities to run.


One thing Google doesn't tell me is full commands, even when I know how to do something but am not sure about the arguments.
I just check man pages but sometimes it is still a PITA. I agree.

How much vram does gpt oss need? Qwen 30B should be fine on 32GB but qwen 235B is probably hopeless.

I found an X11SPi-TF with a Xeon Silver 4114 for 240 euros delivered, but the cost of populating it with RAM + 2 more GPUs? That's 1k, more or less. Tempting, but not really sure it's worth all this :/
 

iraqigeek

Member
Sep 17, 2018
96
65
18
Okaaaaaaay. Now that is quite a trick. You live and learn. Thank you :)
You're most welcome! This workflow, simple as it is, took a lot of experimenting to get to.
Even if you have fast hardware, models still "lose focus" in complex, long multi-turn tasks. Moving to a new chat helps keep them focused. However, they're really great at poking holes in your ideas, and just as good at giving suggestions for what to do. Between those two, you can really take simple idea descriptions into full-fledged projects. Bigger models also do significantly better at this and ask surprisingly good, non-trivial questions.

This is more of a personal preference: your prompt should also include style guidelines. Ex: they like to spit out tables and use emojis. So, I tell it: no tables and no emojis, free text and bullet lists only. I also use the same titles and questions in my answers to make sure to anchor each answer to the question it asked.

I honestly never tried chatgpt. I am a privacy freak and I don't trust those guys, but if Qwen 3 is way better than 2.5 then I should give it a try.
I think you should, be it chatgpt, claude or gemini, if only to get a baseline for capabilities. You don't have to create an account and login. Just use it for random stuff you don't care about.

But isn't 30B MoE runnable on the CPU since it's MoE?
Everything can be run on CPU. It all depends on how fast you want your answer or how patient you are.

Prompt processing will always be slow on CPU (assuming you don't have a $10k CPU). I get 1100 t/s PP on a 3090 and ~400 on a Mi50. PP speed makes a big difference when coding, even with prompt caching.
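To put numbers on it: an agentic coding request can easily carry tens of thousands of tokens of context. Using 30k tokens as an illustrative figure (my assumption, not a measured workload) with the speeds above:

```python
def prompt_seconds(prompt_tokens, pp_tokens_per_s):
    """Time spent just reading the prompt, before any token is generated."""
    return prompt_tokens / pp_tokens_per_s

ctx = 30_000   # illustrative agentic-coding request size
print(round(prompt_seconds(ctx, 1100), 1))   # ~27.3 s at 3090 speeds
print(round(prompt_seconds(ctx, 400), 1))    # 75.0 s at Mi50 speeds
```

That gap per request is why PP speed, not just generation speed, dominates the coding experience.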

235B is definitely beyond my capabilities to run.
You can definitely get a taste for it using CPU if you have more than 128GB RAM. It's a MoE with 22B active. Unsloth's Q4_K_XL GGUF is 134GB.

How much vram does gpt oss need? Qwen 30B should be fine on 32GB but qwen 235B is probably hopeless.
gpt-oss-120B is ~64GB. I run it with plenty of context with three 3090s (72GB VRAM). So, you'll need three 32GB Mi50s if you want to run it. It's 5B active so it's really really fast. On the 3090s using llama.cpp I get ~120t/s up to ~8k context.
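Those sizes follow from simple params-times-bits-per-weight arithmetic. The bpw values below are approximations (real GGUFs mix quant types, and I'm treating the marketing size as the parameter count), so this is a sanity check, not an exact formula:

```python
def gguf_gb(params_billions, bits_per_weight):
    # Weights only; KV cache and activations come on top of this.
    return params_billions * bits_per_weight / 8

print(round(gguf_gb(120, 4.25)))   # ~64 GB: gpt-oss-120b at its native ~4-bit
print(round(gguf_gb(235, 4.56)))   # ~134 GB: Qwen3 235B at a Q4_K_XL-ish bpw
```

Which is why ~64GB of weights needs 72GB+ of VRAM once you add context.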

I found an X11SPi-TF with a Xeon Silver 4114 for 240 euros delivered, but the cost of populating it with RAM + 2 more GPUs? That's 1k, more or less. Tempting, but not really sure it's worth all this :/
Two GPUs is not worth it. IMO, aim for four and get a QQ89 Cascade Lake ES. 128GB VRAM plus 128GB system RAM will let you run Qwen3 235B probably around 15t/s. You can run gpt-oss-120b on three cards, while running qwen3 Coder 30B on the fourth card with context on the third. You'll need to manually split layers of gpt-oss-120b so that only context is on the 3rd card, same with Qwen3 30B.

gpt-oss-120b is a really nice all rounder that can do most things well, including coding with python, js and even C++. Qwen Coder is faster and better at less common coding tasks. Mistral's Devstral 24B is also quite good. I use Roo Code in VS Code for all. Qwen3 235B is good at solving complex tasks that smaller models fail at. I use it mainly for debugging and when gpt-oss-120b can't solve the problem.
 

MSameer

Member
May 8, 2025
81
69
18
Previously chatgpt needed an account thus I refrained.

I think my problem is I assumed such models are intelligent. Seems I have to treat them as dumb assistants that need everything spelled out.


Two GPUs is not worth it. IMO, aim for four and get a QQ89 Cascade Lake ES. 128GB VRAM plus 128GB system RAM will let you run Qwen3 235B probably around 15t/s. You can run gpt-oss-120b on three cards, while running qwen3 Coder 30B on the fourth card with context on the third. You'll need to manually split layers of gpt-oss-120b so that only context is on the 3rd card, same with Qwen3 30B.
I do hear you but that is 220 for the board + 90 CPU + 300-400 for the RAM + 500 for the GPUs to use a model for 2 hours every week or so?

It makes sense for you as you seem to use it daily, but I think it is still cheaper to pay $1.5 for an hour of a Linode GPU instance every 2 weeks or so.
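Rough break-even math with those numbers (treating dollars and euros as roughly equal, and taking 350 as the midpoint of the RAM estimate):

```python
local_build = 220 + 90 + 350 + 500     # board + CPU + RAM midpoint + GPUs
cloud_rate = 1.5                       # per GPU-hour
hours_per_year = 2 * 26                # ~2 h every two weeks

cloud_per_year = cloud_rate * hours_per_year
years_to_break_even = local_build / cloud_per_year
print(round(cloud_per_year), round(years_to_break_even, 1))   # 78, 14.9
```

At that usage level the cloud is indeed far cheaper; the local build only pays off if usage grows a lot.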

But I am thankful for everything you said and for taking the time to educate me. I will be experimenting more and maybe I will enjoy using it as much as you do :)
 
  • Like
Reactions: iraqigeek

iraqigeek

Member
Sep 17, 2018
96
65
18
I think my problem is I assumed such models are intelligent. Seems I have to treat them as dumb assistants that need everything spelled out.
They're somewhere in between.

We often forget how much "embedded context" we have when interacting with other people. This context defines scope and clarifies intention. LLMs don't have that. It's like grabbing a random person in the street and asking them the same question. Some solve it by giving a long system prompt that details who they are, what they do, their skills and knowledge, what they want or expect from the LLM, etc. Basically, all the info someone who knows you personally or professionally could say about you. I'm too lazy to do that and it slows everything down because every conversation will start with a 20k system prompt.

I'm used to dealing with less-than-stellar "developers" at work who need to be spoon-fed what needs to be done in a ticket, so I treat the LLM like a dim bulb, except it takes a minute to do the task you give it, rather than the week it takes its human counterpart.
 

luckylinux

Well-Known Member
Mar 18, 2012
1,473
455
83
Wouldn't 2x AMD Radeon Mi50 32GB (total 64GB VRAM) be enough for gpt-oss-120b? It seems like it needs >= 60 GB of RAM, but yeah, maybe that's just to store the Model and doesn't leave any Space for Prompt Processing?

I still need to figure out how to integrate the 20 x AMD Radeon Mi50 16GB. 8 of those will be 128GB VRAM so I guess I could fill 2 Chassis and then have a couple in two Desktop Systems :) .

How are you integrating / interfacing to it though ? I know there is Open WebUI for Ollama, but I also saw some People using e.g. N8N for Home Automation (there are some Videos on Youtube), Continue (VS Code / Codium Extension) and of course there is also the API / Model Context Protocol (MCP) that you could call from e.g. a Python Script.
 

iraqigeek

Member
Sep 17, 2018
96
65
18
Wouldn't 2x AMD Radeon Mi50 32GB (total 64GB VRAM) be enough for gpt-oss-120b? It seems like it needs >= 60 GB of RAM, but yeah, maybe that's just to store the Model and doesn't leave any Space for Prompt Processing?
The model is ~64GB and you still need a few more GBs for context. I'd say the minimum to run it with decent context is 72GB.

I still need to figure out how to integrate the 20 x AMD Radeon Mi50 16GB. 8 of those will be 128GB VRAM so I guess I could fill 2 Chassis and then have a couple in two Desktop Systems :) .
Your comment reminded me of that old Dell PowerEdge C410x "GPU shelf". It's only PCIe Gen 2, but it can hold 8 two-slot GPUs and connects to the host server via a single x16 card. They used to be very cheap on eBay, but I see they've also become unobtainium now.

How are you integrating / interfacing to it though ? I know there is Open WebUI for Ollama, but I also saw some People using e.g. N8N for Home Automation (there are some Videos on Youtube), Continue (VS Code / Codium Extension) and of course there is also the API / Model Context Protocol (MCP) that you could call from e.g. a Python Script.
OpenWebUI works with anything that exposes an OpenAI compatible API. Most open source VS code extensions also work with anything OpenAI API compatible.

For chat, I use OpenWebUI. For inference I use llama.cpp exclusively, and use llama-swap to switch models. In VS Code I use RooCode. I don't use vLLM because model loading takes forever. Llama.cpp has regressions sometimes, but I just keep previous builds to get around that. I have a simple bash script that builds it into a directory named after the git tag and then copies the artifacts to a fixed directory that my llama-swap config points to.
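For reference, a llama-swap entry along those lines might look roughly like this. The key names are from memory of llama-swap's README and the paths and model filename are made up, so verify against the current docs before copying anything:

```yaml
# llama-swap config sketch: one entry per model; llama-swap starts and
# stops the matching llama-server process on demand.
models:
  "qwen3-coder-30b":
    cmd: >
      /opt/llama.cpp/current/llama-server
      -m /models/Qwen3-Coder-30B-A3B-Q8_0.gguf
      --port ${PORT} -ngl 99
```

Pointing cmd at a fixed current/ directory is what makes the rebuild script convenient: new builds land in tag-named directories, and artifacts get copied into current/ only once they're known good.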
 
  • Like
Reactions: luckylinux

luckylinux

Well-Known Member
Mar 18, 2012
1,473
455
83
The model is ~64GB and you still need a few more GBs for context. I'd say the minimum to run it with decent context is 72GB.
Uhm, I don't think it's supported to have 2x32GB + 1x16GB, is it ?

Your comment reminded me of that old Dell PowerEdge C410x "GPU shelf". It's only PCIe Gen 2, but it can hold 8 two-slot GPUs and connects to the host server via a single x16 card. They used to be very cheap on eBay, but I see they've also become unobtainium now.
I could also get a X9DRX+-F, X10DRX or X10DRG-Q, which would be good for 5x GPUs without Risers.

OpenWebUI works with anything that exposes an OpenAI compatible API. Most open source VS code extensions also work with anything OpenAI API compatible.
I actually use Codium, not VS Code, on my Personal Desktop. Different Story at work (Windows + VS Code).


For chat, I use OpenWebUI. For inference I use llama.cpp exclusively, and use llama-swap to switch models.
So you don't run Ollama, you just use the most recent changes from the upstream project of Ollama, basically ?

Is that because you don't need a GUI (although I think llama.cpp just received a GUI recently) or need to do more advanced Stuff ?

In VS Code I use RooCode. I don't use vLLM because model loading takes forever. Llama.cpp has regressions sometimes, but I just keep previous builds to get around that. I have a simple bash script that builds it into a directory named after the git tag and then copies the artifacts to a fixed directory that my llama-swap config points to.
OK I didn't have to build custom Stuff so far, I've just been playing a bit with Ollama.

I built quite a few custom Docker Container so it's not like it scares me :) , but I guess there are some special Things with llama.cpp and the AMD Radeon Mi50 to know.
 

iraqigeek

Member
Sep 17, 2018
96
65
18
Uhm, I don't think it's supported to have 2x32GB + 1x16GB, is it ?
Repeat after me: VRAM is VRAM. The models don't care, and with MoE models running in llama.cpp or any of its derivatives, you can split however you want with almost no penalty (within reason) because there is no tensor parallelism.
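Concretely, llama.cpp's --tensor-split flag takes per-GPU proportions, so a 2x 32GB + 1x 16GB mix is just --tensor-split 32,32,16. The fractions that implies:

```python
vram_gb = [32, 32, 16]                  # 2x Mi50 32GB + 1x Mi50 16GB
total = sum(vram_gb)
split = [v / total for v in vram_gb]    # share of layers per GPU
print(total, split)                     # 80 [0.4, 0.4, 0.2]
```

With the default layer-wise split mode, each card just gets a proportional slice of the layers, which is why the mismatched sizes don't matter.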

I could also get a X9DRX+-F, X10DRX or X10DRG-Q, which would be good for 5x GPUs without Risers.
I have some X10DRX that I plan to sell in January, if you're interested :D :D :D
I'm using one with an eight P40 build (watercooled) with zero risers.

I actually use Codium, not VS Code
IIRC, it supports the same plug-ins as the full spyware VS Code.

So you don't run Ollama, you just use the most recent changes from the upstream project of Ollama, basically ?
You could even say I actively hate ollama. I fully understand the premise: easy to set up for a noob. I used it for about a week when I was starting. It gets very irritating very quickly the moment you know what you want to do, and the way they leech off llama.cpp without giving anything back (they don't even acknowledge that it's basically a wrapper around llama.cpp) doesn't help.

Is that because you don't need a GUI (although I think llama.cpp just received a GUI recently) or need to do more advanced Stuff ?
No, it's because they make it very painful to do anything beyond using it as-is. Any configuration change has to be done by polluting your environment with 30 env vars. I also dislike how they mangle model filenames, making it practically impossible to use any model outside of ollama despite them being regular GGUFs, and how they hide quantizations and sometimes even name distillations to make them seem like the full-fat models (ex: making people think DeepSeek 8B is the same model as the 671B).

I never cared for the ollama GUI from day one. I've run it with open-webui since the beginning, because I want to run the thing on one machine but be able to access it from anywhere on my home network. Llama.cpp's web UI is nice for testing that things work, but it also doesn't cut it for me because it stores chats in the browser's local storage, so you can't access your chats from another machine.

OK I didn't have to build custom Stuff so far, I've just been playing a bit with Ollama.

I built quite a few custom Docker Container so it's not like it scares me :) , but I guess there are some special Things with llama.cpp and the AMD Radeon Mi50 to know.
Ollama tends to be 2-6 weeks behind llama.cpp in terms of features, except when they decide to implement support for some new model (which they never contribute back to llama.cpp).

The work to bring proper support for the Mi50 to llama.cpp was done by the same German physics PhD student who wrote the P40 kernels. A single guy turned both those GPUs from e-waste into hot cakes. That guy is a legend.
 
  • Like
Reactions: luckylinux