High temp reported on MCX4121A-ACAT


TRACKER

Active Member
Jan 14, 2019
182
56
28
Hello folks,

I have the following situation and would like to ask for advice :)

I am using two dual-port 25Gbps MCX4121A-ACAT NICs installed next to each other (pic attached).
Both cards are running the latest available firmware.
The OS is the latest TrueNAS Core 13 U6.1.
I know these cards run hot.

Error in dmesg is:
mlx5_core1: WARN: mlx5_temp_warning_event:227:(pid 12): High temperature on sensors with bit set 0 0x1
mlx5_core0: WARN: mlx5_temp_warning_event:227:(pid 12): High temperature on sensors with bit set 0 0x1
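
For anyone who wants to pull these warnings out of a log, here is a minimal Python sketch. It assumes the two values after "bit set" are the upper and lower 64-bit halves of a sensor bitmask printed in hex; that reading is an assumption and is not confirmed anywhere in this thread. The names `WARNING_RE` and `parse_temp_warning` are made up for illustration.

```python
import re

# Matches the warning format quoted above. Assumption: the two trailing
# values are the high and low 64-bit words of a sensor bitmask, in hex.
WARNING_RE = re.compile(
    r"(?P<dev>mlx5_core\d+): WARN: .*?"
    r"High temperature on sensors with bit set "
    r"(?P<hi>(?:0x)?[0-9a-fA-F]+) (?P<lo>(?:0x)?[0-9a-fA-F]+)"
)


def parse_temp_warning(line):
    """Return (device, [sensor bit indexes]) or None if the line does not match."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    # int(..., 16) accepts both "0" and "0x1".
    mask = (int(match.group("hi"), 16) << 64) | int(match.group("lo"), 16)
    sensors = [bit for bit in range(128) if (mask >> bit) & 1]
    return match.group("dev"), sensors


line = ("mlx5_core1: WARN: mlx5_temp_warning_event:227:(pid 12): "
        "High temperature on sensors with bit set 0 0x1")
print(parse_temp_warning(line))
```

Under that assumption, both warnings above point at sensor 0 of their respective device.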

core0 and core1 are the mce0 and mce1 adapters, respectively.
When I check the temperature reporting from the console, one of the cards shows strangely high temps:

top card:
MCE0 -> dev.mce.0.hw_temperature: 107000 (107°C)
MCE1 -> dev.mce.1.hw_temperature: 107000 (107°C)

bottom card:
MCE2 -> dev.mce.2.hw_temperature: 75000 (75°C)
MCE3 -> dev.mce.3.hw_temperature: 75000 (75°C)
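
The readings above can be compared per card with a short Python sketch. It relies on two things taken from the post: the sysctl value is in thousandths of a degree Celsius (107000 is shown as 107°C), and ports 0-1 belong to one card while ports 2-3 belong to the other. The function names and the `ports_per_card` parameter are hypothetical.

```python
import re
from collections import defaultdict

SYSCTL_RE = re.compile(r"dev\.mce\.(\d+)\.hw_temperature:\s*(-?\d+)")


def parse_hw_temperatures(text):
    """Return {port index: temperature in °C} from sysctl-style lines."""
    return {int(idx): int(raw) / 1000 for idx, raw in SYSCTL_RE.findall(text)}


def per_card(temps, ports_per_card=2):
    """Group ports into cards (assumed: consecutive ports share a card)."""
    cards = defaultdict(list)
    for port, celsius in sorted(temps.items()):
        cards[port // ports_per_card].append(celsius)
    # Report the hottest port reading for each card.
    return {card: max(values) for card, values in cards.items()}


sample = """\
dev.mce.0.hw_temperature: 107000
dev.mce.1.hw_temperature: 107000
dev.mce.2.hw_temperature: 75000
dev.mce.3.hw_temperature: 75000
"""

temps = parse_hw_temperatures(sample)
cards = per_card(temps)
delta = max(cards.values()) - min(cards.values())
print(cards, delta)
```

On a live system the same text could come from running `sysctl dev.mce` and passing its output to `parse_hw_temperatures`.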

I understand that's the temperature of the chip die itself, but there is still a 30-degree difference!
There is active cooling in front of the cards at a distance of 10cm (a 12cm fan at ~1800 rpm).


I plan to measure the temps with an IR thermometer the next time I open the case.
Do you think the issue may be with the heat sink on the top card?

Any other ideas are highly appreciated :)
 


i386

Well-Known Member
Mar 18, 2016
4,245
1,546
113
34
Germany
What chassis?
How many fans? Specs?
Are there other add-on cards under the second Mellanox NIC? (The fan might push all the air into that space instead of toward the top NIC.)
 

TRACKER

Chassis: Cooler Master Centurion Silencio RC-550-KKN1
Mobo: SM X10SRI-F
CPU: Intel Xeon 2667 v4
RAM: 4x16GB DDR4 2400MHz
1) 4xNVMe -> temps around 45°C idle, 55°C under load
2) 2xNVMe -> temps around 45°C idle, 55°C under load
3) LSI SAS 2308 -> temps unknown
4) Broadcom 57810 2x10Gbps -> temps unknown
5) 12cm fan, 1800rpm; the cables visible in the picture have since been removed to make room for the fan

After the fan was installed, the temps of all NVMe cards dropped by 5-6°C.
The chipset temp also dropped by around 10°C.
Unfortunately I don't know what the NIC temps were before I installed the fan, because I did not know how to check Mellanox NIC temps (I found out how 2 days ago).

Just to add: I have absolutely no issues with any of the hardware in my storage machine.
No errors or dropouts of the NICs or storage under heavy load (e.g. a 5-6GB/s scrub and 30-40Gbps of traffic balanced across 4 Mellanox cards).
The only things that bother me are the temp values for mce0/mce1 and the warning messages in dmesg.
 


i386

What temperatures do you get when you swap the nic positions?
 

TRACKER

I have not tried swapping them yet.
Next week I will test what the temps are if I swap the NICs.
 

TRACKER

I've measured the temps of the running system with a thermal imager, and they are similar (within 1-2°C).
The heat sink on card 1 was at 65°C, and on card 2 at 64°C.
So it seems the temp sensor on card 1 is wrongly reporting 105°C.
The SFP+ modules are clearly visible (red, on the left); their temp was around 46°C.
 


TRACKER

"Or the heatsink is not seated properly on the ASIC, or there is too little or too much thermal grease between the ASIC and the heatsink."
Yes, that will be the next thing I check, but I need downtime for the machine, so it won't be today :)
 

TRACKER

I've re-pasted the card.
There was some pinkish paste on it; I cleaned it off and applied new "some brand" paste :) (Cooler Master, I think).
No change...

dev.mce.0.hw_temperature: 102000
dev.mce.1.hw_temperature: 102000
dev.mce.2.hw_temperature: 73000
dev.mce.3.hw_temperature: 73000
 


mach3.2

Active Member
Feb 7, 2022
132
87
28
What happens if you swap the positions of both cards?

If the temps stay the same, it might be the card being wonky...
 

TRACKER

Yes, I forgot to mention that.
When I swap the cards, the situation stays the same: the "overheating" card still reports over 100°C.
The non-overheating card reports normal temps of ~75°C.
 

TRACKER

One more thing: I also found the error logs from when I bought the card (Jan 2023), and even then I was getting temperature warnings. That means the reported temp back then must have been around 105°C (the temperature at which the warning message is generated in the system log).
I don't think the card would have survived more than a year if the real temp were over 100°C, but who knows :)
I guess I will just keep it as it is and ignore this temp reading.
 

TRACKER

Update on my "issue": I was finally able to figure out why one of the cards runs ~30°C hotter.
The difference is caused by the PCIe slot.
The card with the higher temp is installed in a PCIe 3.0 x8 slot, electrical x8.
The card with the lower temp is installed in a PCIe 3.0 x8 slot, electrical x4.
When I installed the latter in a PCIe 3.0 x8 slot with x8 electrical, its temp was similar (a bit lower because the optic modules are not plugged in at the moment).

dev.mce.0.hw_temperature: 94000
dev.mce.1.hw_temperature: 94000
dev.mce.2.hw_temperature: 85000
dev.mce.3.hw_temperature: 85000

I was able to decrease the temp from 105°C to 90-95°C by installing an additional fan blowing over the slots.
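
Since the slot's electrical width turned out to matter here, a quick way to spot a card running on fewer lanes is to look at the negotiated link width. The Python sketch below assumes that `pciconf -lc` on FreeBSD prints link lines in the form `link x4(x8)`, with the negotiated width first and the maximum in parentheses. The sample text, device names, and function name are invented for illustration and are not output from this machine.

```python
import re

DEVICE_RE = re.compile(r"^(\S+)@pci")
LINK_RE = re.compile(r"link x(\d+)\(x(\d+)\)")


def link_widths(pciconf_text):
    """Return {device: (negotiated lanes, maximum lanes)}."""
    result = {}
    device = None
    for line in pciconf_text.splitlines():
        dev = DEVICE_RE.match(line)
        if dev:
            device = dev.group(1)  # remember which device the next lines describe
            continue
        link = LINK_RE.search(line)
        if link and device is not None:
            result[device] = (int(link.group(1)), int(link.group(2)))
    return result


# Hypothetical sample, shaped like the assumed pciconf output.
sample = """\
mlx5_core0@pci0:2:0:0:
    cap 10[60] = PCI-Express 2 endpoint
                 link x8(x8) speed 8.0(8.0)
mlx5_core2@pci0:3:0:0:
    cap 10[60] = PCI-Express 2 endpoint
                 link x4(x8) speed 8.0(8.0)
"""

widths = link_widths(sample)
reduced = [dev for dev, (cur, mx) in widths.items() if cur < mx]
print(widths, reduced)
```

Any device listed in `reduced` is linked at fewer lanes than it supports, which in this thread corresponded to the card reporting the lower temperature.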
 