Overheating problems with 2x Epyc 7742 in Define 7 XL 1TB/2TB RAM

gabe-a

New Member
Sep 10, 2020
16
1
3
I have a dual Epyc 7742 workstation built for me by one of the few companies that builds these things. I have been plagued with overheating problems since day 1, and a rebuild in the Define 7 XL case has only partially mitigated the problem.

Here are the specs [edited for accuracy]:

Specs:
SuperMicro H11DSI-NT motherboard
2 AMD EPYC™ 7742 64-core CPUs
16x 128GB DDR4-2933 ECC RAM Micron load-reduced
1 1.0TB Intel® SSD Pro 7600p M.2 (OS, CentOS 8)
1 AOC, Dual m.2 Asset Drive
1 2.0TB Intel® SSD Pro 7600p M.2 (on AOC)
1 1TB Samsung 860 EVO SSD (WIN 10 Pro 64-bit)
6 HGST Ultrastar DC SAS HC530 16TB 512e HDD (RAID0)
LSI 9361-8i SAS RAID Card
1 NVIDIA® GeForce® RTX 2080 Ti 11GB GPU

Fans:
2 Noctua EPYC™ CPU Coolers
2 140mm top fans labeled "industrial"
2 XXXmm front fans (which look similar to the top fans judging by the orange corner nibs)
1 default Define 7 XL chassis fan (low airflow)
4 Corsair Vengeance RAM cooling fans
1 unknown fan-like object occupying the bottom PCI slot.


The first issue was that the RAM overheated almost immediately when running HPC applications. This was more or less resolved with the 4x Corsair RAM fans above (the modules still get close to the IPMI's near-critical threshold of 80C, but they don't quite touch it anymore).

Now this has been replaced with new overheating problems -- the various VRMs.

There are 2 types of VRM here that are overheating (exceeding 100C and triggering a loud motherboard beep).

I looked through threads here and found 4 separate instances (I think) of exactly the same problem with VRM overheating with this motherboard and this processor (and/or lower processors that have been OC'd to reach this one's performance). One forum maestro, alex_stief, suggested placing dual 40mm fans underneath some sort of "VRM heatsink". Another suggestion entailed water cooling (which some people say is frivolous but they probably aren't doing hardcore NUMA-aware high-AVX2 teraflop computation using this high-voltage CPU).

But I don't know what the latest guidance is. Has anyone tried to use this case? Anybody want pictures of the innards? It is shocking to me that the only consumer motherboard is this silly H11-DSi[-NT] which apparently isn't designed for towers but flat air-streamed loud servers on a rack.

What can I do?

Further reading on other users' VRM overheating issues with the same motherboard that covers both the CPU and CPU VRM (but not the RAM VRMs presumably):
https://forums.servethehome.com/ind...c-and-ram-voltage-on-supermicro-h11dsi.28068/ (mentions OC which I don't and can't do, but their clocks are the same as the 7742 stock, so it may be applicable)

A "custom cooling block" (?) has been made for the "-B" revision of the motherboard (don't know if that's compatible with my presumably "non-B" revision, or if it will fit with dual CPUs, or what else it would entail since it seems to require some watercooling DIY expertise): Monoblock HCM PRO MBD-H11DSI-NT-B

Pictures of the rig and the various items mentioned in the thread can be found here: gabedev.com/rig
 

alex_stief

Active Member
May 31, 2016
652
204
43
35
I would definitely be interested in some pictures. Partly because I am interested in that case myself, but we could also judge if there could be any improvement to case air flow.

From your list I conclude that you only have 4 case fans, 3 of them exhausting air at the top. That could be improved quite a bit.
1) A top exhaust fan towards the front of the case does not help airflow through the case over the critical components. I would remove that and place it somewhere else.
2) Put in as many front intake fans as you can. 3x140mm according to the specs. If you want to remove the PSU shroud, another intake fan at the bottom won't hurt.
3) Have 3 exhaust fans total: one at the back, and two at the top towards the rear of the case.

That won't completely solve the CPU VRMs overheating, because those are in a very bad spot to benefit from case airflow. But it will get overall temps down.
For the VRMs themselves, there are still only two solutions, you found them both. Which one to use depends on how much effort you want to put into it, and how thoroughly you want to solve the problem ;)
What's the other type of VRM overheating? I glued some tiny copper heatsinks to the various memory VRMs. Mostly for good measure, I never had them overheating, despite running memory bandwidth intensive applications. But maybe the situation here gets worse with that amount of memory. Or maybe memory overheating was just a result of poor case airflow, combined with a lot of heat being dumped into the case.
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,240
419
83
Only one intake fan on a setup like that won't provide enough airflow; three exhaust fans won't pull enough air across all the hot components in the case. A quick look at a review says the default intake fans also top out at only 1000rpm; quiet, but won't provide enough airflow for this setup by a long shot.

You'll want at least two fans on the front intake, and a couple at the bottom probably wouldn't hurt either. Quiet fans probably aren't going to cut it when everything's running flat out, so you're probably going to want some high static pressure intake fans that'll go up to at least 1500-2000rpm if the temperatures get high enough.

Installing as many fans on the front and bottom intake slots as you can tolerate should be the first step. I'm fond of the Noctua NF-A12x25 PWM fans myself, quiet but capable of ramping up to a goodly amount of airflow.
 

mirrormax

Member
Apr 10, 2020
44
22
8
I have a very similar setup with the same case/cpu/mb but "only" 256gb ram
is there a windows benchmark i could run to try to reproduce it?
I've only seen major vrm overheating while trying to overclock, but i also haven't run a sustained avx2 load over time to check.

Even though i don't have stability issues with my workload, I am still looking to change motherboard from the h11dsi to Gigabyte's MZ72-HB0, which should be available in a week's time, for the pcie4.0 and better layout, and hopefully better vrm. you could consider the same.

edit: also your fan layout is a bit unclear, you have 3 top fans exhausting but only 1 front fan for intake? you probably wanna max out on front intake fans.
 

i386

Well-Known Member
Mar 18, 2016
1,992
517
113
31
Germany
Put noctua industrial fans (2k rpm for a compromise between cooling and noise, 3k rpms for better cooling/stability) in that chassis
 

gabe-a

Hi again,

Apologies for posting incomplete/misleading specs -- there are more fans in the rebuilt case than were initially quoted. I just pulled off the front to check and there appear to be 2 intake fans at the front of the unit and 3 fans at the top of the unit, and one final chassis fan at the back.

Here are the pics, as promised! It won't look great because I am not an expert (yes, I know I'm posting in a forum with real wizards -- this rig wasn't built by me, but for me! And I had to muck around to get the RAM fans installed as per their suggestions, but am not adept at figuring out how to hide the cables):

(Okay it seems they're too big for the forum so I put them on my site: Index of /rig )


 

alex_stief

The top fans need to be switched to exhaust instead of intake. Right now, you are mostly creating positive pressure inside the case with your fan setup. What you would rather need for lower temperatures is airflow.
 

gabe-a

First of all I want to say thanks for the responses. You folks really are wizards with this stuff which makes me seem even more foolish having to follow-up on them.
What's the other type of VRM overheating? I glued some tiny copper heatsinks to the various memory VRMs. But maybe the situation here gets worse with that amount of memory. Or maybe memory overheating was just a result of poor case airflow, combined with a lot of heat being dumped into the case.
The two types of VRMs overheating according to the IPMI logs (ipmitool dumps are my new friends -- example report attached -- as they report all the motherboard sensors):

1. the CPU VRMs (which overheat during CPU-intensive applications that fit into cache such as 16 instances of lbcb-sci/raven with 16 threads each; which also runs fine on my dual Xeon 8180 1.5TB RAM rig from HP and my 400GB RAM closed-loop-cooled Xeon Phi rig from Colfax). These VRMs are called VRMCpu2 and VRMCpu1.
2. The RAM VRMs (which overheat during RAM-intensive applications such as 16 instances of voutcn/megahit at 16 threads each). I believe you are correct when it comes to more RAM running hotter -- Colfax's engineer told me they have seen the latest crop of 1 and 2TB 2933MHz+ RAM modules running very hot, and the load-reduced high density modules may demand more of the VRMs in some way. These VRMs are called VRMP1ABCD, VRMP1EFGH, VRMP2ABCD, VRMP2EFGH. I mostly have trouble with VRMP1EFGH (but the others, especially VRMP2EFGH, are not far behind).
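
For anyone who wants to sift a sensor dump like this automatically, here is a minimal Python sketch. It assumes the pipe-separated layout that `ipmitool sensor` normally prints (name, reading, unit, status, then thresholds). The sample rows, readings and thresholds are invented for illustration and are not taken from ipmidump.txt; only the VRM sensor names come from this post.

```python
# Hypothetical sample in the pipe-separated layout of `ipmitool sensor`.
# All readings and thresholds here are invented for illustration only.
SAMPLE = """\
CPU1 Temp      | 62.000  | degrees C | ok | 5.000 | 5.000 | 10.000 | 90.000 | 95.000  | 100.000
VRMCpu1        | 88.000  | degrees C | ok | 5.000 | 5.000 | 10.000 | 95.000 | 100.000 | 105.000
VRMP1EFGH      | 101.000 | degrees C | cr | 5.000 | 5.000 | 10.000 | 95.000 | 100.000 | 105.000
P1-DIMMA1 Temp | 81.000  | degrees C | nc | 5.000 | 5.000 | 10.000 | 80.000 | 85.000  | 90.000
FAN1           | na      |           | na | na    | na    | na     | na     | na      | na
"""


def hot_sensors(dump: str) -> list[tuple[str, float, str]]:
    """Return (name, reading, status) for temperature sensors whose status is not 'ok'."""
    flagged = []
    for line in dump.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue  # not a sensor row
        name, reading, unit, status = fields[:4]
        if unit != "degrees C" or status in ("ok", "na"):
            continue  # skip fans, voltages, absent sensors and healthy readings
        try:
            value = float(reading)
        except ValueError:
            continue  # reading was 'na' or otherwise not numeric
        flagged.append((name, value, status))
    # Hottest first, so the worst offender is at the top
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, value, status in hot_sensors(SAMPLE):
        print(f"{name}: {value:.0f}C ({status})")
```

To try it on real data, one option is to save the output of `ipmitool sensor` to a text file while the workload is running and pass the file contents to `hot_sensors`.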

None of these VRMs are labeled in the manual. I spoke with a SuperMicro rep who told me the location of the VRMP1EFGH chip is in a different place than what another rep said, which is of course maddening. Sadly, their support has been among the worst I've experienced in my lifetime. The manual also doesn't document how to turn off the alarm (there are apparently no jumpers for it), but a forum member here (thanks STH!) mentioned I just have to disconnect the power supply for a while and it will clear itself.

Attachments/extras:
- ipmidump.txt shows an example dump of the sensors. Here you can see one is critical and a couple more are nc (bordering on critical).
- onerep/otherrep show what 2 different SuperMicro reps stated regarding the physical location of VRMP1EFGH
http://www.gabedev.com/rig/otherrep.jpg

is there a windows benchmark i could run to try to reproduce it?
I've only seen major vrm overheating while trying to overclock, but i also haven't run a sustained avx2 load over time to check.
Unfortunately, when I tried Windows 10 on it, I was unable to get the NUMA pinning to work correctly (in Linux you can use numactl and also the numa.h toolkit in C/C++ code, which many tools in bioinformatics HPC are using). Hence the benchmarks may not actually be performing as well as they could be, leading to less heat as the latency just swishes across RAM domains while the CPUs idle. This is just speculation, though -- just because I couldn't figure it out in Windows doesn't mean it can't be done. One other thing that won't work -- numactl in WSL [the Windows Subsystem for Linux]. Thanks for the tip about the new motherboard -- that's probably a little beyond my level of expertise, as I've mostly just tinkered with systems that were already installed at the motherboard level, but perhaps I could pay someone to swap it out for me if/once that board is proven effective at thermal management.
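
To make the Linux-side pinning concrete, here is a rough Python sketch that builds one `numactl` command line per instance, binding both the CPUs and the memory of each instance to a single NUMA node. The node count of 8 is only an assumption (the real number depends on BIOS settings; `numactl --hardware` shows the actual layout), and `./my_tool` with its `--threads` flag is a hypothetical placeholder for whatever is being run.

```python
import shlex


def pinned_commands(tool_args, instances, numa_nodes):
    """Build numactl command lines, spreading instances round-robin over NUMA nodes."""
    commands = []
    for i in range(instances):
        node = i % numa_nodes
        # Bind both the CPUs and the memory allocations to the same node
        cmd = ["numactl", f"--cpunodebind={node}", f"--membind={node}", *tool_args]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Hypothetical workload: 16 instances of ./my_tool at 16 threads each,
    # on a machine assumed to expose 8 NUMA nodes.
    for cmd in pinned_commands(["./my_tool", "--threads", "16"], instances=16, numa_nodes=8):
        print(shlex.join(cmd))
```

This only prints the command lines; launching them (for example with `subprocess.Popen`) is left out so the sketch stays safe to run anywhere.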


The top fans need to be switched to exhaust instead of intake. Right now, you are mostly creating positive pressure inside the case with your fan setup. What you would rather need for lower temperatures is airflow.
Wait, you're saying the top fans are literally pointed the wrong way? Dang. I will flip them around. Thanks for that. I wonder why they were installed that way...

[edit] I also want to point out I updated the first post with accurate details and more information!
 


mirrormax

i could replicate it by doing an FPU stress test in aida64, vrm shoots up like none of the other stress tests in a very short time. would have hit 100 if i didn't stop it. i don't have fans on the vrm like some others on here have installed, pointing straight at the vrm (40mm ones), so you could consider that.
 

gabe-a

Interesting that there's an aida64 test that reveals the same issue. That's rather surprising, as I used to run the aida64 suite of tests on my dual Xeon systems without issue (and they have AVX512!). For me it seems any routine HPC software causes the issue (but all HPC software is built to maximize instructions per clock for high performance on large datasets).

Thanks, I have stumbled across mention of the 40mm fans, as well as a custom heatblock. However, I live in the US and it doesn't seem those particular pieces are available -- although I do see plenty of 40mm fans, some appear quite thick so they might not fit under the large CPU fans (see the pics of the rig). Are there any slim 40mm fans you'd recommend? Would it require removal of the CPU coolers to fit the 40mm fans underneath? (Important consideration for whether to get thermal paste or other tools).

Has anyone here tried the custom water block mentioned in the first post near the bottom? I've never done watercooling (is it hard to build/install?) but I've tinkered a bit with fans...
 

gabe-a

The other problem is what to do about the overheating RAM VRMs (I understand you're talking about the CPU VRMs). The RAM VRMs aren't so conveniently positioned, it seems -- any idea how to cool those?
 

alex_stief

Let's tackle this one issue at a time.

The first thing you need to fix is case airflow. Right now, you have a lot of case fans, but set up in a way that produces little to no useful air flow inside the case.
1) Turn around the two fans at the top. They are currently set up as intake fans, when they really should be exhaust.
2) If possible, remove the two small covers in the PSU shroud towards the front, which block airflow from a third front intake fan at the bottom of the front panel. And mount a third front intake fan in that bottom position. If you can, use the three most powerful fans as front intake, they have to work against more pressure drop than the exhaust fans.
define_7xl_psu_shroud.jpg
3) Get rid of that blower accessory in the bottom PCIe slot. Other than creating more noise, all it does is starve the GPU for cool air.

With that out of the way, you should have some decent air flow in the case. That makes dealing with overheating issues on various components much easier later on, if it does not fix some of them right away.
Everything you can do from this point on -except water cooling everything- relies on case air flow to work. So there are no shortcuts here, case airflow needs to be fixed first.
 

ari2asem

Active Member
Dec 26, 2018
544
86
28
The Netherlands, Groningen
i agree with @alex_stief ....airflow in the case should be fixed first.

but my own conclusion with this board is: it should be used in a server rack case as it is meant to be, with an air shroud. then all overheating problems would be minor problems.
said in other words....we all use the wrong computer cases for this board. and i think that dual socket gigabyte-board will have the same issues as this h11-board has
 

gabe-a

Thanks @alex_stief -- I've done #3 so far, and I have also confirmed I can remove the Fractal partition separating the PSU and the rest of the case.

1) Turn around the two fans at the top. They are currently set up as intake fans, when they really should be exhaust.
2) If possible, remove the two small covers in the PSU shroud towards the front, which block airflow from a third front intake fan at the bottom of the front panel. And mount a third front intake fan in that bottom position. If you can, use the three most powerful fans as front intake, they have to work against more pressure drop than the exhaust fans.
View attachment 15709
One issue is heterogeneous fans -- the top 2 fans are 4000rpm fans, the back fan is 1000rpm, and the front 2 fans are both 4000rpm and 1000rpm fans. So, I have a total of 3 fast fans and 2 slow fans. This leads to multiple possibilities for rearrangement, but if I understand correctly the best idea for me is:
- 2 fast fans in front chassis, intake
- (Buy 1 additional fast fan for the front chassis, intake)
- 1 fast fan in back chassis, exhaust
- 2 slow fans at top of chassis, exhaust

I'll do it! Thanks again.
 

alex_stief

As far as I am aware, the fastest 140mm industrial fans sold by Noctua go up to 3000 rpm. But that's beside the point.
What I would do for now, with the fans you already have:
1) Mount two fast fans at the front intake position towards the top. And a slower one in the bottom slot
2) Mount the remaining "4000rpm" Noctua fan as rear exhaust. This is where most of the hot air from the CPU coolers will try to exit the case first, so having a fast fan here is ideal
3) The slow stock case fan can go in top exhaust position for now.
Then as soon as you can, replace the slow stock fans with something better, and add another top exhaust.

With that out of the way, we can move on to CPU VRMs.
Those are located under the aluminium heatsink between both CPU sockets. Since you have pretty small CPU coolers, maybe you could try something other than mounting tiny 40mm fans to that heatsink: an 80mm fan mounted at about a 45° angle relative to the board, below the CPU coolers, pushing air towards that heatsink. As a side-benefit, this will also help temperatures of the second CPU, because it will be fed some cooler air that was not recycled by the first CPU cooler. How the hell could you mount a fan in that position you ask? Zip-ties and creativity I guess ;)
Considering the whole machine is not quiet anyway, maybe use a fan that can go up to ~3000rpm
By the way, your CPU coolers don't look like they were made by Noctua. They rather look like Supermicro SNK-P0064AP4, which according to some sources, is only rated for up to 180W CPUs. So if you should ever run into CPUs themselves overheating, maybe consider swapping those for something better.

And lastly: memory VRMs. These are located all over the board, but close to the DIMM slots. Let me see if I can find them all:
h11dsi_mem_vrms_01.jpg

With plenty of airflow through the case, adding small heatsinks to these mosfets should bring down temperatures enough to stop throttling.
What I used: 2x https://www.alphacool.com/shop/graf...ool-gpu-ram-copper-heatsinks-6-5x6-5mm-10-stk.
These should be available on amazon and other shops. The main benefit: they are the perfect size for these components, no filing or Dremel required.
And for sticking them on: (€112,00*/100g) Phobya 2-Komponenten Waermeleitkleber 5g - Wärmeleitpasten | Mindfactory.de
I would not trust double-sided thermal tape for this application, so I really recommend some 2-component thermal adhesive.
For best adhesion, this is how it should be done:
1) slightly sand the contact surface of those copper heatsinks
2) thoroughly clean both the heatsink and the mosfet on the board with isopropyl alcohol
3) mix both components of the thermal adhesive with the recommended ratio. Mix them really thoroughly
4) apply a little bit of the mixture to either the heatsink or the mosfet, and press them together gently
5) let it cure for at least 8 hours with the motherboard lying flat. Better yet, let it cure for 24 hours
The result:
IMG_20200405_160208.jpg
 

gabe-a

Thanks Alex!
the fastest 140mm industrial fans sold by Noctua go up to 3000 rpm
Just ordered 6x 3000 RPM Noctua fans (confirmed my "4000rpm" are indeed 2000RPM: NF-A14 industrialPPC-2000, so wanted to step up to the PPC-3000). I'll fill the last spot(s) on the top of the case with 2 of the 2000rpm fans or mix-and-match as needed to test what's best for airflow.
your CPU coolers don't look like they were made by Noctua. They rather look like Supermicro SNK-P0064AP4, which according to some sources, is only rated for up to 180W
Confirmed the CPU coolers are exactly the model you specified (how on earth did you know the model?!). Very troubling news if they only support 180W as these CPUs are nominally rated at 225W and actually burst to 250W (with configurable TDP to 240W via BIOS as well). But if this means buying bigger CPU coolers, that may be impossible as the current RAM fans take up almost all the space (and these fans are absolutely, absolutely necessary as the RAM overheats in seconds without them but takes hours to just touch "near critical" 80C with them). By the way, I tried aluminum heatsinks on the RAM and that was a horrible idea -- they made the RAM heat up at 6-10x the rate, but this could also be due to the stagnant airflow within the case.
you could try something other than mounting tiny 40mm fans to that heatsink: an 80mm fan mounted at about a 45° angle relative to the board, below the CPU coolers, pushing air towards that heatsink.
Remember I am actually a super-newbie when it comes to hardware. I had to look up zip ties! I've seen them before but was never able to figure out how to undo them (always had to cut them with scissors when they were in my way). Meaning I'll probably not be able to implement the 45-degree angled fan solution, as my "creativity" is far less than zero (yes, newbie-level destructive, as I'd likely tie them to the CPU fans and rip out the processors or something, causing a fire and $30,000+ component damage).

And lastly: memory VRMs. These are located all over the board, but close to the DIMM slots.
...[A]dding small heatsinks to these mosfets should bring down temperatures enough to stop throttling.
...They are the perfect size for these components, no filing or Dremel required.
And for sticking them on: (€112,00*/100g) Phobya 2-Komponenten Waermeleitkleber 5g - Wärmeleitpasten | Mindfactory.de
I would not trust double-sided thermal tape for this application, so I really recommend some 2-component thermal adhesive.
Thanks, I bought both from Amazon. I don't know what filing or Dremel entail (a quick Google tells me they're about trimming dogs' nails, so presumably cutting the metal in some way to make it fit the components?).

Also bought sandpaper because you mentioned "sanding" (I have seen references to this in ultra-extreme overclocking forums a long time ago where people popped lids off of CPUs and sanded down something called the PCB to attach a heatsink directly). I don't know how to do this -- I bought a bunch of sandpaper with grain from 120 to 3000 on Amazon. Any idea what grain to use, and how much to sand? Should I make it much shorter, or a little shorter, or just enough to add visible grooves for increasing surface area adhesion of the compound? How do I sand a motherboard without scraping off the green wafer and nearby pieces? These components are pretty tiny... maybe tape the sandpaper around a pencil eraser? (Tell me if this would actually be a very bad idea!).

Really sorry I'm like a 4yr old listening to an engineering professor here, and I truly beg your patience and understanding. Bear with me! :)
But even as an absolute newbie, I am really wondering why the company I bought my computer from:
- Could not test for overheating properly
- Could not orient the fans properly, and mixed-and-matched a hodgepodge of weak fans
- Used inadequate CPU coolers for the CPU spec (!!)

But maybe they're not quite as knowledgeable about server-grade computers...
 

alex_stief

Just to avoid misunderstandings about the sanding part: all that is supposed to do is rough up the very smooth copper surface of the heatsinks a little bit, so the epoxy has something to hold on to. It has absolutely nothing to do with sanding down a CPU heatspreader, let alone a CPU die. Any sandpaper between 80-400 grit should do the job.
Edit: and please don't sand anything on the motherboard. Just the contact surface of the copper heatsinks.

Up until now, I was under the impression that some tech-savvy friend assembled this workstation as a favor, or for some pocket money. If you bought it from a company, you have every right to get a working machine, that does not overheat within seconds. If I bought this machine, I would not stop pestering them before they either fix it, or give me a refund. That's the whole point of buying a prebuilt system: not having to fiddle around in order to get it working.
May I ask which company you bought it from?

Don't worry too much about TDP ratings on CPU coolers. The 180W I found were not official spec by Supermicro anyway. And if you asked me, TDP ratings on coolers should not be a thing. As long as the CPUs don't overheat and you are ok with the noise, everything is fine.

If you can't figure out how to get another fan mounted to blow air towards the CPU VRMs, maybe let's just leave that part out for now. If we are lucky, fixing case airflow brings down overall temperatures enough to solve that. I'll try to come up with an easier solution or better instructions.
Btw: most zip-ties aren't supposed to be undone without cutting them. Of course, that doesn't stop cheap folks like me from trying to reuse them anyway :rolleyes:
 

gabe-a

Sure, looks like you know your parts better than you think: On the official supermicro website for this part, just click on "detailed specifications". Straight from the source, you nailed the 180W. It's good to know it's not the end of the world, though. :)

Yes, sure, happy to tell my story. It is an a-X2 from MediaWorkstations dot net. The rig you see here is actually their "rebuild"! Prepare for a fun rant... (This turned into a huge missive but it felt soooo good to write! Yay catharsis).

TL;DR: MediaWorkstations is incompetent, and NEMIX is conniving. And I am out much of a life's savings and most of my love of computers.

Introduction.

I wanted an upgrade from my dual Xeon Platinum 8180 system from HP which is awesome and I love it to death. I bought it in 2018 with 1.5TB RAM, crazy Quadro graphics, SSDs in RAID. Part gift, part subsidized, part self-funded, this computer made me fall in love with high-end computers (along with my Colfax Intl Knights Landing machine in 2016 which I upgraded to 400GB RAM and the 68-core CPU with help from a friend; BTW RAM was 384GB main + 16 HBM). Come late 2019, I decide it's time for my 2020 rig, which would be an AMD (since it was outperforming Intel at the time). But the only people building it were third-party companies. I picked the most expensive one with the nicest looking case, MediaWorkstations. What sealed the deal was they offered to install custom RAM, with the caveat that if it didn't work, I had to buy RAM from them (apparently 1TB). But others online had gotten 2TB working and the motherboard site claimed it was fine (mentions up to 4TB at 3200MHz with Rev 2+) so I didn't suspect a thing...

Play-by-play

0. RAM ordering fiasco.
I initially wanted 4TB of RAM -- so I ordered 16x256GB modules on NewEgg (engineering samples, apparently, but the listing didn't say that!). It cost me so much money. But I get a call from Nemix the next day claiming they were sold out. They asked what to do and I said invoice me for 2TB RAM and we can go from there if it looks good. They did not invoice me, but instead suddenly shipped 2TB of RAM at an unknown price to MediaWorkstations. Eventually the NewEgg balance is adjusted, after I demanded they refund the difference. But they did not refund enough -- they had overcharged me thousands of dollars. I call them out on it. They give me most of the overcharged money back, but not all of it -- they have a resident BS guy who kept feeding me conflicting information that was later proven false. First offense, he claimed prices rose since I bought it (but they didn't -- I showed him the NewEgg URL of the exact RAM module sent, up to date. Price had not budged). Second offense, he claimed it should work fine with the motherboard, then backtracked and said it was my fault because it won't work with the motherboard, then backtracked again and said only 2933MHz worked with the motherboard. (Later he will spew more BS, but stay tuned).

1. Build fiasco (part 1)
Paid Nov 25 for a build according to the invoice instructions (claiming Wire or ACH; I chose wire for speed). The next day they claimed wire wasn't supported so had to pay again (scary). Nemix had shipped the mystery 2TB of RAM to them around the same time. They start building around Dec 13 but then claimed the chassis was damaged. (Red flag?). They told me they could build it into some other case they had lying around or reorder this one, so I said reorder if you can promise it's sturdy (they swore by it). Dec 27 (over a month after the order) they claim the RAM doesn't work (they seriously couldn't have tested it earlier? They knew I got it from another vendor, Nemix, with a limited return window!). They then charged $5300 for 1TB of RAM and sent the Nemix RAM back to them. Meanwhile, Nemix without anyone's knowledge or consent decides they don't want to refund it but replace it instead with bargain-bin 2666MHz modules, so they suddenly ship a slower replacement to MediaWorkstations a week after MediaWorkstations shipped me the completed computer with 1TB of RAM. The Nemix BS-guy then claimed that "they can't print return labels" (even though they did before) and "they can only ship to the first address on NewEgg" which was the MediaWorkstations build site (you'll soon see this is also false, as they can ship wherever they want).

By this point, the return window had closed on NewEgg for returning the NEMIX RAM (and NEMIX can just claim ignorance or incompetence but get away with keeping my money). NEMIX absolutely refused returns at this point. After much begging and complaining they finally let me exchange it for the happy medium -- 2TB of 2933MHz RAM (obviously no more refund would be issued, so I'm still out thousands of dollars). But their RMA guy shipped it straight to me (invalidating the BS guy's statement earlier about not being able to ship anywhere but the first NewEgg address). MediaWorkstations then decided to charge me $250 to ship the bargain-bin RAM back to NEMIX so NEMIX could ship me the "final" replacement. So I end up getting gouged again for RAM I don't want.

Finally, to get the computer delivered, apparently I had to call a bunch of people and talk to a bunch of folks at MW and FedEx in realtime to "coordinate shipping the unit" (seriously? At that time I was out of town and could not take calls). MediaWorkstations blamed me with passive aggressive emails about not answering my phone (I have never needed a phone to buy a machine before, OR to "coordinate shipping"). The computer finally arrived with its 1TB of MediaWorkstations RAM, and I still have the 2TB of RAM sitting here. It does work fine in the system actually, but I decided to use the 1TB of MediaWorkstations in case of problems. (Both have identical thermal profiles BTW, so I may go with the 2TB and sell the 1TB).

2. RAID Fiasco
So the machine arrives on a FedEx Freight wooden pallet, shrink-wrapped with a bunch of Chinese-labeled empty boxes flanking it (I guess for insulation? No idea). At first (roaring loud) boot, it beeps like it's the end of the world. Turns out the RAID card was throwing a fit. Tech support told me to open up the chassis and check the hard drives. Turns out they were sliding all over the place because they hadn't been screwed in like literally every hard drive on every computer I've ever owned has been to date. So I got my first crash-course in server hardware -- "re-seat" the drives. I had to read the Avago manual and learn how to navigate the obtuse AVAGO MegaRAID control panel in the BIOS to rebuild my drives. But it still didn't work -- the alarm would go off every couple of hours. Probably overheating. Solution? Take out the RAID card (something or other MegaRAID) and MediaWorkstations would be happy to ship a bunch of individual cables for me to connect each hard drive without the RAID. Funny enough, this sort of worked.

3. OS inconveniences
- The computer came with a Windows (unactivated!) partition. Having not built my own computers before, every system I owned had come with factory pre-activated Windows via key in the BIOS (Microsoft tech support actually explained this is called "SLIC OEM product key insertion into the BIOS/UEFI"). Solution? They sent me the activation key upon request. So not a big deal, and just my own inexperience talking. But stay tuned, Windows activation trouble will be making a grand re-appearance!
- A CentOS partition was there, but set up with broken graphics drivers (no display was shown whatsoever), and when I tried to log in remotely, it required a login password I was never provided until I asked for it. Yes, a system was installed that I literally could not have accessed even if the graphics had worked. Solution? I figured out how to install Fedora myself using a hybrid signed kernel and both GPT and MBR partition-table aware OS profiling schemes in order to co-exist with Windows and the BIOS. That was an OS journey, but I got it working fine.

4. Stuck in debug mode
The system came with "debug settings" active that were "accidentally" left on in the BIOS. Fans were at 100% full blast and it sounded like a rocket ship at all times, leaving my ears ringing. I did not know it was possible for a computer to be so loud, and I own a dual Xeon Platinum workstation and a Knights Landing Xeon Phi system! But wait, what were they debugging? Ah, apparently they'd run into some trouble with the RAID controller too! Which they couldn't fix! And they shipped an obviously incomplete (roaring!!) system anyway. Solution? I had to plug an ethernet cable into a specific debug port, then remotely connect to an active IPMI server at a DNS address I had to set up in the BIOS to access a hidden control panel console via browser to fix fan speed and other debugging settings.

5. Overheating RAM
Once the system could finally boot to Linux, I tried running some simple bioinformatics software (in my case, megahit to assemble some public metagenomes). Some strange clicking sounds came from the box, and the system ground to a halt very quickly. Thanks to my earlier adventures with IPMI, I figured out how to dump the sensors and see what was happening. Turns out both processors were at 400MHz and RAM was at 84C (a hair away from "critical"). I felt very angry, but would give them a chance to explain. They could not. They told me I needed to buy, at my own expense, either RAM fins or dedicated RAM fans. I tried the fins. What took nearly 5 minutes to overheat before now took 10 seconds -- the cheap, LED-boasting aluminum RAM "cooling" fins they had suggested (Easy-DIY) had insulated the RAM and made it much hotter. And cost $400. The fans didn't all fit into the chassis (literally, that humongous chassis). And it overheated anyway. Solution? They would rebuild.
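For anyone wanting to do the same kind of sensor dump, here is a minimal sketch of how it could be scripted. It is not the exact procedure I used. It assumes `ipmitool sensor` prints its usual pipe-separated table with the reading in the second column, the unit in the third, and the upper-critical threshold in the ninth; the function names and the 5-degree margin are my own hypothetical choices.

```python
import subprocess


def parse_sensor_table(text):
    """Parse pipe-separated `ipmitool sensor` output.

    Assumed column order: name | reading | unit | status | LNR | LCR | LNC | UNC | UCR | UNR.
    Returns (name, reading_c, upper_critical_c) for temperature rows that have a numeric reading.
    """
    rows = []
    for line in text.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) < 10 or cols[2] != "degrees C":
            continue  # not a temperature sensor row
        try:
            reading = float(cols[1])
        except ValueError:
            continue  # "na" means the sensor has no reading
        try:
            upper_critical = float(cols[8])
        except ValueError:
            upper_critical = None  # no threshold configured
        rows.append((cols[0], reading, upper_critical))
    return rows


def near_critical(rows, margin_c=5.0):
    """Return the rows whose reading is within margin_c of the upper-critical threshold."""
    return [
        (name, reading, crit)
        for name, reading, crit in rows
        if crit is not None and reading >= crit - margin_c
    ]


def read_sensors():
    """Run `ipmitool sensor` locally (usually needs root and the IPMI kernel modules)."""
    result = subprocess.run(
        ["ipmitool", "sensor"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the machine itself, something like `near_critical(parse_sensor_table(read_sensors()))` would list whatever is about to trip an alarm.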

6. The Great Rebuild
I demanded a refund, outlining the overheating problems and what I had done to mitigate them, with limited success. They refused, claiming that orders over $8k were ineligible. They did agree to rebuild the unit to avoid overheating, but decided to completely ignore everything I told them and requested: I requested water-cooling. I requested good RAM fins/heatspreaders (as found in my Knights Landing system). I requested RAM fans (as found in my HP Z8 system as part of the specialized cooling setup). They laughed me off and said the new case would magically solve everything. I offered to help them run real HPC software, or benchmark it, and work with them to check temps/throttling (I learned all of this while trying to fix it the first time myself). They again completely ignored me and brushed me off.

After I shipped my old unit back, they sat on it for nearly 4 months waiting for a "response from SuperMicro" about something to do with the motherboard (why they didn't just order a new motherboard, which was readily available, is beyond me...). They were content to let the damn thing I ordered in Nov 2019 sit even longer until June 2020 collecting dust in their hands. That's 7 months since I ordered it. But in 7 months' time, they must have learned from my detailed and frequent communications, right? Nope. They didn't do much better this time around, as you could immediately see.

7. (Re-)Build fiasco (part 2)
RAM on the "rebuild" overheated within seconds (any benchmark showed the throttling; real HPC applications and any RAM-intensive benchmarks running over 5 minutes throttled the CPU to 400MHz, and the RAM shot up over 80C anyway, the near-critical temp!). So I installed my own RAM fans, learning how to move pins around in a motherboard in the process. I had to move the graphics card further down to make the RAM fans fit, which jeopardized the ability of the PCI blower to do anything (although it may have done nothing to begin with). They were keen to suggest which RAM fans and fins to use at my own cost and time expense, despite not installing them themselves while knowing full well the heating problems I had described. By the way, there is still RAM throttling even with 4 fans blowing at max speed on the RAM constantly, but it's tolerable until the RAM VRMs reach 100C after an hour or two of RAM-intensive work, and then it's throttle-heaven. For compute-intensive work, it's 60 seconds, then the alarm and 400MHz until I physically disconnect the power supply and reboot (CPU VRMs in this case).

The rebuild also came with an "invalidated" Windows 10. I asked for an updated key and they did not provide one, only told me to call Microsoft at a phone number (with no other instructions). I did, and MS told me that new computers have OEM keys in the BIOS if they came with Windows 10, and that I should clear the BIOS and reset the OS to "rule out tampering". This caused the HDMI output to stop working, and the only way to fix it was to reset jumpers on the BIOS, and I had to purchase a VGA monitor for the first time in over a decade to see the BIOS screen. And the sales guy/CEO (probably the same dude) started yelling at me and criticizing me for doing something I wasn't supposed to (even though, on the contrary, I did exactly what was said, as insufficient as the instructions were). Solution: The rebuild had changed the motherboard, which invalidated the Windows activation. Run "slui.exe 4" in a terminal to start phone activation. But when you call them, you MUST indicate in the menu that you DO NOT HAVE a key in order to be able to use your phone to enter a reset code -- I had pressed "I have a key" before, and that led me to live customer support, which caused the BIOS reset rabbit-hole.

Concluding remarks.

I've spent so much money -- my savings! -- on this piece of garbage, so much that it would make you sick, and now my lifelong passion for computers is almost completely dead, beaten into the ground by these companies' shenanigans, and yet they make out like bandits as I sit here with a huge hole in my account, shattered ambitions, and without a working computer (Nov will mark my 1yr anniversary of non-working computer). Oh, and 2TB of RAM sitting in a box next to me.

And yet it's funny -- I would have paid alex_stief twice the amount I paid for labor in a heartbeat. Why are MediaWorkstations, a company, so much less knowledgeable?

Whew. Rant over. May this missive rest in peace in the serene graveyard of the internet.

gabe-a

Hi,

Update!

I got six 3000rpm fans and plugged four of them into the two 4-pin headers on the motherboard labeled fan8 and fan9, using two 2-way splitters. (Interestingly these 2 headers don't show up in ipmitool.) All 3 front fans and the rear exhaust fan are plugged into these two 4-pin headers.

The 2 top fans are plugged into normal 3-pin sockets nearby because I couldn't find another 4-pin nearby and am out of splitters. But they should run at full speed when plugged into 3-pins, right?

Here are some pics of the build coming together (I removed the bottom PCI blower; it's just sitting there to show it's being cast out).

I couldn't figure out how to remove the PSU sheath at the bottom without doing something risky (no easy flaps like the rest of the case that I could see). But it has holes and is hollow on the back so the bottom front intake fan still is doing something useful.

My overheating problems have not been fixed (despite it sounding like a jet engine when it ramps up its fans). VRMP1EFGH still reaches 100C pretty quickly (although it takes 5 minutes now instead of 60 seconds, haha!). Interestingly, the CPU VRMs used to be the first to overheat on this task -- now they get up near 90C but aren't the first to trigger the alarm (although given time, they probably would if the RAM VRMs hadn't blown first).

Here's the log. So the rig is now (much) louder, but overheating just happens after 5 minutes instead of 1, and the RAM VRMs seem to overheat before the CPU VRM.

I'm worried my RAM fans are blocking some airflow to the RAM VRMs, ironically. But I can't remove them or else RAM is the first to blow and the system throttles so bad it masks the problems everywhere else!
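If anyone wants to put a number on "overheats after 5 minutes instead of 1" from their own sensor dumps, here is a minimal sketch. It is not how I produced my log. It assumes you already have (elapsed seconds, temperature in C) samples in time order for one sensor, and it linearly interpolates between the two samples that straddle the threshold; the function name and the 100C default are hypothetical choices.

```python
def seconds_to_threshold(samples, threshold_c=100.0):
    """Estimate when a sensor first reaches threshold_c.

    samples: iterable of (elapsed_seconds, temp_c) pairs in time order.
    Returns the interpolated elapsed time in seconds, or None if the
    threshold is never reached in the samples given.
    """
    prev = None
    for t, temp in samples:
        if temp >= threshold_c:
            if prev is None:
                return float(t)  # already over threshold at the first sample
            t0, temp0 = prev
            # temp0 < threshold_c <= temp here, so the denominator is positive.
            return t0 + (threshold_c - temp0) * (t - t0) / (temp - temp0)
        prev = (t, temp)
    return None
```

Running the same workload before and after a fan change and comparing the two returned times gives a rough like-for-like measure of whether a change actually bought any headroom.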
 

gabe-a

Sorry, forgot the pics.

And forgot to ask if the space between the top 2 fans is okay. (3 fans won't fit because the hard drive panel blocks a third fan!).