Thanks for your thoughts, Mike -- much appreciated.
I can hopefully answer some of your questions.
1. We do not know ambient temps ... This has a huge effect on any thermals if the delta between the cooling air and the heat sources reaches the 'point-of-no-return'. if the delta is too small you will never be able to cool it down again properly other than stopping any workload or shutting it off.
Ambient temps: the room air is kept at 68F (20C). The IPMI logs also report a "case temp" for the air inside the case, which sometimes approaches 37C but does not seem to go above it. Unfortunately, we don't know where on the motherboard/case this is actually measured, only what the logs say.
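To track how far each sensor sits above room temperature, something like the following could be run against the sensor dump. This is only a minimal sketch: it assumes the text looks like the pipe-separated lines that `ipmitool sdr type Temperature` commonly prints, and the sensor names and readings in SAMPLE are made up for illustration (only the 37C case temp and the 20C ambient come from the numbers above).

```python
import re

# Assumed line shape (as commonly printed by `ipmitool sdr type Temperature`):
#   "CPU1 Temp        | 01h | ok  |  3.1 | 45 degrees C"
LINE_RE = re.compile(
    r"^\s*(?P<name>[^|]+?)\s*\|.*\|\s*(?P<temp>-?\d+(?:\.\d+)?)\s*degrees C\s*$"
)

def parse_temps(text):
    """Return {sensor name: temperature in C}, skipping lines with no reading."""
    temps = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            temps[match.group("name")] = float(match.group("temp"))
    return temps

def delta_above_ambient(temps, ambient_c):
    """Return {sensor name: degrees C above the given ambient temperature}."""
    return {name: temp - ambient_c for name, temp in temps.items()}

# Hypothetical sample data, not taken from the real logs.
SAMPLE = """\
System Temp      | 0Bh | ok  |  7.1 | 37 degrees C
CPU1 Temp        | 01h | ok  |  3.1 | 82 degrees C
Peripheral Temp  | 0Ch | ns  |  7.2 | No Reading
"""

if __name__ == "__main__":
    for name, delta in delta_above_ambient(parse_temps(SAMPLE), 20.0).items():
        print(f"{name}: {delta:.1f} C above ambient")
```

Logging these deltas over the course of a workload would show whether the case air keeps creeping up or levels off.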
2. The case with this load of periphery inside is 'useless' in terms of providing airflow ... Simple said it's just too small to allow any flow
I'm not sure the case is the only factor here, as I have a Knights Landing system and a dual Xeon Platinum 8180 system from HP, both with identical or smaller cases. That said, the HP case has a bunch of unusual cooling hardware inside, and the Knights Landing is watercooled. Neither of those systems overheats on the same workloads. A friend of mine also built a system in an ATX mid case with a couple of older Xeons, and it (apparently) doesn't overheat.
3. Did you consider the installed periphery giving up heat into the case? I mean there are tons of added NVME's, HBA, graphic card etc ... All producing lots of thermal energy
No, I didn't -- and this is a good point. There is a lot of...stuff in here. Unlike the HP, which has a bunch of modular compartments with dedicated fans whooshing across each separately, everything is sort of mixed together in here like in a desktop PC. During my workloads, though, the GPU is never running load while the CPUs are, and the I/O is usually staggered around the computation, so it's not terribly likely the components would all be under load at once. Of course RAM is an exception, as the CPUs and RAM are often under very high load together.
7. Not to mention the direction of the airflow. Front to back & top is in my opinion the wrong way. The front fans are cooling the HDD's. Nothing else. The air around the HDD cages becomes already heated up. The first CPU fan takes the air directly from the heated HDD. The second CPU cooler sucks air from the super-hot controller between both CPUs.
This is interesting. Putting my hand in the case shows that the air coming past the hard drives is still relatively cool (of course, this is unscientific). But I do know the hard drives turn off when not in use, and I don't pull data from the RAID HDDs while doing computation, so they are confirmed to be powered down in an idle state while the CPUs are doing their work (which caches in RAM and occasionally checkpoints onto the NVMe PCIe card). Changing the airflow direction has noticeably improved temperatures: a workload that failed in 60 seconds now runs to completion, albeit with some throttling at the 10-minute mark as the RAM VRM cools off.
The lower the delta in temps between periphery and air temp in the case, the worse it gets ... You will never escape this circle if:
1. You don't use a bigger case allowing flowing air
The flux of air out of the exhaust fans is now quite high, where it was low before. A piece of paper dropped at case height is now blown ~4 meters. Previously, the same piece of paper was not blown even 0.5 meters. I don't think case size alone is the greatest factor, considering my functioning HP Z8 and Colfax Knights Landing rigs. But I'm sure you're right that it plays a role. This case is huge, though.
2. If you do not cool down your ambient temp
This is interesting -- how would I do this? Do you mean ambient as in "within-case" or "outside of case"? Room temp A/C ensures 68F/20C consistently. A thermometer placed in front of the intake fans reports 68F during load.
3. If you place periphery in the direction of airflow and therefore slow down airflow
This makes sense. I do have this problem. I don't know how to fix it. I do know air is rushing past, but there are indeed a bunch of HDDs (and even the front of the case, which redirects inflow from the sides of the front panel!). There are also a bunch of RAM fans in the way, and a bunch of other pieces throughout the case.
4. If you use way too much fans interfering with each other in a counter-productive way
This is a good point, but I'm not sure how to ensure airflow without more fans.
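One rough way to put numbers on the airflow question is the standard heat balance for air, P = rho * cp * V * dT, solved for the volumetric flow V. The sketch below assumes textbook air properties near room temperature and a perfectly mixed exhaust. The 600 W heat load is a hypothetical placeholder, not a measurement from this machine; the 20C intake and 37C case temperature are the figures mentioned above.

```python
AIR_DENSITY = 1.2      # kg/m^3, approximate for air near 20 C at sea level
AIR_CP = 1005.0        # J/(kg*K), approximate specific heat of air
M3S_TO_CFM = 2118.88   # cubic meters per second -> cubic feet per minute

def required_airflow_m3s(heat_w, intake_c, exhaust_c):
    """Airflow needed to carry heat_w watts with the given air temperature rise."""
    delta_t = exhaust_c - intake_c
    if delta_t <= 0:
        raise ValueError("exhaust air must be warmer than intake air")
    return heat_w / (AIR_DENSITY * AIR_CP * delta_t)

if __name__ == "__main__":
    heat_w = 600.0  # hypothetical total heat load in watts (assumption)
    flow = required_airflow_m3s(heat_w, intake_c=20.0, exhaust_c=37.0)
    print(f"{flow:.4f} m^3/s = {flow * M3S_TO_CFM:.0f} CFM")
```

The useful takeaway is the inverse relationship: for the same heat load, halving the allowed temperature rise doubles the airflow needed, which is why obstructions and fans working against each other matter more than raw fan count.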
etc etc
This can go further and further. Start from new and totally rethink the environment.
I would actually really appreciate it if you could elaborate on the "etc etc", because all of this is helpful. I did not build this computer and have no idea what to rethink. Can you provide some pointers for a rebuild? (I might not be able to do it myself, but it would help the company I bought it from to rebuild it, as this is now the third reincarnation of the computer.)