> I thought running too hot hurt the HBM, is that the case?

From what I have read from various sources, damage starts kicking in at 110C and usually occurs around 120C.
> I can confirm NVLink does not work, because they are missing the traces on these GPUs.

How certain are you (or the Chinese forums, which seem to be up to date) that this is because of the missing traces, versus the NVLink clock being different from 156.25MHz, which is the default for the P100 and V100? If the board is meant to be used in a dual-card setup, at least some links should be present, right?
> How certain are you (or the Chinese forums, which seem to be up to date) that this is because of the missing traces, versus the NVLink clock being different from 156.25MHz, which is the default for the P100 and V100? If the board is meant to be used in a dual-card setup, at least some links should be present, right?

They seem to be pretty certain the traces are missing. I actually asked whether it's possible to get it working somehow; their feedback was that they have had success by migrating the whole chip to a PCIe donor board. I assume they have tested the circuits on the SXM2 board if they have gone to that length, but that's all the info I have.
To be clear, I don't know what the clock for the A100 should be, but I can imagine it's different.
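If anyone wants to sanity-check this from software, a rough way is to ask NVML for the per-link state. Below is a minimal sketch assuming the nvidia-ml-py (pynvml) package and that the link-state query is supported on these boards, so treat it as a probe rather than a definitive test; if the traces really are missing, every link should come back inactive or "Not Supported".

Python:
# Minimal probe of NVLink link state via NVML (assumes `pip install nvidia-ml-py`).
# If the traces are physically absent, every link should report inactive or raise
# "Not Supported"; an active link would point at a clock/config issue instead.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU; adjust the index as needed

for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
    try:
        state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
        active = state == pynvml.NVML_FEATURE_ENABLED
        print(f"link {link}: {'active' if active else 'inactive'}")
    except pynvml.NVMLError as err:
        print(f"link {link}: {err}")

pynvml.nvmlShutdown()

On a board where the links do work, `nvidia-smi nvlink -s` should tell the same story.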
> I also have another interesting development on my side... Out of the 4 QS I have received, 1 of them was faulty. I have shipped it back and the seller has prepared another unit for exchange, and it seems like they will be sending me the CS version. I will keep an eye out for any differences (if any) I notice on my system.

Mine have been sitting in the corner of my room for months now; I hope none are faulty.
> Mine have been sitting in the corner of my room for months now; I hope none are faulty.

Why not use them?
> Why not use them?

The rest of the setup isn't ready yet. It almost is, though.
> I finally had a chance to test the Bykski water block. I can't make the GPU exceed 50 degrees Celsius, but it still crashes under sufficient load. With the following Python code, the power usage goes up to 460W+:
>
> Python:
> import torch
> import torch.nn.functional as F
>
> a = torch.randn(32768, 32768, device="cuda", dtype=torch.bfloat16)
> w = torch.randn(32768, 32768, device="cuda", dtype=torch.bfloat16)
> print("START")
> with torch.no_grad():
>     while True:
>         F.linear(a, w)
>
> The VRMs are significantly hotter than the core, but it doesn't seem catastrophic. Does anybody have an idea what is happening?
>
> View attachment 41781

I currently have approximately 64 of these A100 PROD cards in a cluster using GPUDirect RDMA. When running inference tasks, the power consumption is normal. However, during training the cards do drop out. Monitoring shows this is caused by excessively high HBM temperature: the HBM soon reaches 110 degrees Celsius, while the core exceeds 80 degrees Celsius. I removed the top cover of the card and reapplied the thermal paste, and everything is normal now.
> I currently have approximately 64 of these A100 PROD cards in a cluster using GPUDirect RDMA. When running inference tasks, the power consumption is normal. However, during training the cards do drop out. Monitoring shows this is caused by excessively high HBM temperature: the HBM soon reaches 110 degrees Celsius, while the core exceeds 80 degrees Celsius. I removed the top cover of the card and reapplied the thermal paste, and everything is normal now.

Using the script there, I had the GPU drop out when the core was less than 80C and the HBM was less than 60C.
> I currently have approximately 64 of these A100 PROD cards in a cluster using GPUDirect RDMA. When running inference tasks, the power consumption is normal. However, during training the cards do drop out. Monitoring shows this is caused by excessively high HBM temperature: the HBM soon reaches 110 degrees Celsius, while the core exceeds 80 degrees Celsius. I removed the top cover of the card and reapplied the thermal paste, and everything is normal now.

Interesting...
> Using the script there, I had the GPU drop out when the core was less than 80C and the HBM was less than 60C.

This card doesn't have any downclocking strategy. Therefore, once either the core or the HBM exceeds its temperature limit, the card will malfunction and the system will freeze.
> Interesting...

The HBM temperature can be seen through other SMI commands; what nvidia-smi shows is just the core temperature. I completely removed the Integrated Heat Spreader (IHS) from the core. After removal, I found that the thermal paste inside had become ineffective. However, the removed IHS had also deformed and could no longer be reused, so I made the cooler contact the core directly.
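For reference, here is roughly what reading both sensors looks like from Python. It is a minimal sketch assuming the nvidia-ml-py (pynvml) package; the NVML_FI_DEV_MEMORY_TEMP field and the .value.uiVal accessor are based on recent NVML releases, and older drivers may not expose the memory sensor at all.

Python:
# Minimal sketch: read core and HBM temperature through NVML
# (assumes `pip install nvidia-ml-py`; the memory-temperature field and the
# value accessor may differ on older driver/library versions).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

core_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
hbm = pynvml.nvmlDeviceGetFieldValues(handle, [pynvml.NVML_FI_DEV_MEMORY_TEMP])[0]
print(f"core: {core_c} C, HBM: {hbm.value.uiVal} C")

pynvml.nvmlShutdown()

`nvidia-smi -q -d TEMPERATURE` should also print a memory-temperature line on these data-center parts when the driver exposes it.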
> I removed the top cover of the card and reapplied the thermal paste, and everything is normal now.

How are you getting the HBM temperatures?
So you delidded the heat spreader and reapplied the thermal paste underneath, right?
Do you have some instructions/photos on how to remove it safely?
> This card doesn't have any downclocking strategy. Therefore, once either the core or the HBM exceeds its temperature limit, the card will malfunction and the system will freeze.

That's the point I was making. The card did not exceed any temperature limit, since it was below 80C on the core and 60C on the HBM.
> That's the point I was making. The card did not exceed any temperature limit, since it was below 80C on the core and 60C on the HBM.

I'm pretty sure it does depend on temperature, because if I worsen the radiator contact or lower the pump speed, it dies with much less load. Maybe the temperature spikes briefly and the card dies before nvidia-smi would show the issue.
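One way to test that theory is to poll the sensors much faster than a manual nvidia-smi refresh while the stress script runs in another process. Below is a minimal logging sketch with the same nvidia-ml-py and memory-temperature-field assumptions as the snippet earlier; the file name and poll interval are arbitrary.

Python:
# Minimal sketch: log core/HBM temperature and power several times per second
# while the stress script runs in another process, to catch short spikes that
# a manual nvidia-smi refresh might miss. Assumes `pip install nvidia-ml-py`;
# the memory-temperature field is the same assumption as in the sketch above.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

with open("temps.csv", "w") as log:
    log.write("time,core_c,hbm_c,power_w\n")
    while True:
        core = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        hbm = pynvml.nvmlDeviceGetFieldValues(
            handle, [pynvml.NVML_FI_DEV_MEMORY_TEMP]
        )[0].value.uiVal
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # milliwatts -> watts
        log.write(f"{time.time():.2f},{core},{hbm},{power_w:.0f}\n")
        log.flush()  # keep the last samples on disk if the card takes the system down
        time.sleep(0.1)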
> If it is just LLM inference, the inference speed of this A100 32G card is not much faster than that of the V100 32G.

Really? I heard that the tensor cores on the V100 only support FP16, but on the A100 they support FP32. Actually, I'm just wondering whether I should choose this A100 rather than a V100, since I'm going to run model inference in FP32.
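For what it's worth, Ampere tensor cores accelerate FP32 matmuls through TF32 (FP32 range, 10-bit mantissa in the multiply), which Volta does not have, so whether that counts as "FP32 support" depends on how much precision your model can tolerate losing. In PyTorch it is toggled by a flag; below is a rough comparison sketch, assuming a CUDA build of PyTorch, with the matrix size and iteration count chosen only for illustration.

Python:
# Rough comparison of FP32 matmul throughput with and without TF32 tensor cores
# on an Ampere GPU (a sketch; on a V100 the flag has no effect because Volta
# has no TF32 path).
import time
import torch

def bench(label):
    # plain FP32 inputs; size chosen only so each matmul takes a measurable time
    a = torch.randn(8192, 8192, device="cuda")
    b = torch.randn(8192, 8192, device="cuda")
    a @ b  # warm-up so one-time startup cost isn't included in the timing
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(50):
        a @ b
    torch.cuda.synchronize()
    print(f"{label}: {time.time() - t0:.2f} s for 50 matmuls")

torch.backends.cuda.matmul.allow_tf32 = False  # strict FP32 on the CUDA cores
bench("FP32")
torch.backends.cuda.matmul.allow_tf32 = True   # let matmuls use the TF32 tensor cores
bench("TF32")

If TF32 accuracy is acceptable for your model, that is where the A100 pulls well ahead for FP32 inference; in strict FP32 the gap to the V100 is much smaller.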