Overheated Transceiver: KAIAM 100G CWDM4 SM 2xLC


DavidWJohnston

Active Member
Sep 30, 2020
242
191
43
I thought I'd create a new post about this and share some pics.

Due to an accidental fan unplugging, my PCIe cards got extremely hot and ran for an hour before my 100G connection dropped. The transceiver was extremely hot, way beyond normal, and after it cooled it no longer worked. Luckily the cards themselves are unharmed.

I've heard these early CWDM4s can have thermal issues, which I have now experienced. Perhaps they undergo thermal runaway, or the earlier designs don't have much thermal headroom. With only a slow fan aimed across the PCIe cards, the exposed metal of the transceiver gets only slightly warm. So it needs airflow but not a lot.

I took it apart, and there is nothing visibly wrong that I can see, but the features are tiny, and it smells like burnt electronics inside. Here are some pics:

1681229816606.png
1681229861820.png
I put a question mark there but I'm pretty sure it's toast.

1681229909361.png

There is some thermal interface material on 2 of the chips, and on the optical/electrical converter. Under the microscope there are some interesting features like bond wires:

1681230032385.png
1681230088477.png

The optical interface on the LC connector is also interesting. Bare fibers appear to be epoxied into the ceramic ferrules, then they loop around and are fused to those rectangular prism type things that go into the optical/electrical converter.
1681230135180.png
 

awedio

Active Member
Feb 24, 2012
776
225
43
Is it safe to assume that the same "fate" would apply to the Intel version?
 

DavidWJohnston

I don't really know for sure - I have so many Intel ones now that I could do an experiment and overheat one intentionally but I don't want to risk frying my cards.

I've read a lot of comments about CWDM4s and thermal issues in datacenters so there's probably something to it. Maybe someone who works in a DC would know more.
 

klui

Well-Known Member
Feb 3, 2019
831
454
63
Did you ever note the temperatures of the KAIAM? I only have their 40G CWDM4s.

For comparison, the Intel 500m CWDM4s run 5C higher than a Mellanox AOC. Arista CWDM4s run 5C hotter than the Intel, but they are rated to 70C.
 

DavidWJohnston

Normally, when the fan is working, it feels only slightly warm, and the exposed part of the transceiver measures in the low 30s °C with an IR thermometer. I'm sure the internal components are much warmer.

There's probably an mlx* command in Windows, or something in SONiC, to display the CX4's transceiver temp, but I don't know what it is.
 

klui

You can use ethtool -m <int> on the host.
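
As a rough sketch of how that could be scripted on a Linux host: the snippet below runs `ethtool -m` and pulls out the module temperature. It assumes the output contains a line of the form `Module temperature : 45.79 degrees C / 114.42 degrees F`, which can vary with the ethtool version and module type, so treat the regex as an assumption. The function names are made up for illustration.

```python
import re
import subprocess
from typing import Optional

# Assumed line format (may differ between ethtool versions):
#   Module temperature : 45.79 degrees C / 114.42 degrees F
_TEMP_RE = re.compile(r"Module temperature\s*:\s*(-?\d+(?:\.\d+)?)\s*degrees C")


def parse_module_temp_c(ethtool_output: str) -> Optional[float]:
    """Return the module temperature in degrees C, or None if not reported."""
    match = _TEMP_RE.search(ethtool_output)
    return float(match.group(1)) if match else None


def read_module_temp_c(interface: str) -> Optional[float]:
    """Run `ethtool -m <interface>` and parse the temperature from its output."""
    result = subprocess.run(
        ["ethtool", "-m", interface],
        capture_output=True,
        text=True,
        check=True,  # raise if ethtool fails (no module, no permission, ...)
    )
    return parse_module_temp_c(result.stdout)
```

Calling `read_module_temp_c("<int>")` with the real interface name usually needs root. A result of `None` means the module did not report a temperature in the assumed format.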

For SONiC, it's show interfaces transceiver eeprom <int> -d or sudo sfputil show eeprom -p <int> --dom. The latter works better because show interfaces transceiver eeprom ... often doesn't show any transceivers, at least on the D4040/DX010. There is currently some work on SFP refactoring for the DX010, so hopefully things will improve sooner rather than later: [Seastone] DX010 platform switch to sfp-refactor based sfp impl by qnos · Pull Request #13972 · sonic-net/sonic-buildimage
 

DavidWJohnston

Thanks that helps a lot - I remember picking an older SONiC version that had the platform code working properly to show the SFP, but I didn't know about the extra individual interface info.

I run Windows 11 Pro for Workstations on the host, so I can't use ethtool, but here is what SONiC reports for its end. It's safe to assume this temp (45.8C) is lower than at the host end, because the switch has far more airflow.

Code:
admin@sonic:~$ show interfaces transceiver eeprom Ethernet88 -d
Ethernet88: SFP EEPROM detected
        Application Advertisement: N/A
        Connector: LC
        Encoding: 256B257B
        Extended Identifier: Power Class 4(3.5W max), CDR present in Rx Tx
        Extended RateSelect Compliance: Unknown
        Identifier: QSFP28 or later
        Length Cable Assembly(m): 0
        Nominal Bit Rate(100Mbs): 255
        Specification compliance:
                Extended Specification compliance: 100G CWDM4
        Vendor Date Code(YYYY-MM-DD Lot): 2018-12-07
        Vendor Name: KAIAM CORP
        Vendor OUI: 14-ed-e4
        Vendor PN: XQX5170
        Vendor Rev: 1A
        Vendor SN: BL1832000BN
        ChannelMonitorValues:
                RX1Power: -2.4972dBm
                RX2Power: -0.9184dBm
                RX3Power: -1.4752dBm
                RX4Power: -1.4454dBm
                TX1Bias: 57.046mA
                TX1Power: 0.6115dBm
                TX2Bias: 46.428mA
                TX2Power: 0.61dBm
                TX3Bias: 42.978mA
                TX3Power: 0.1098dBm
                TX4Bias: 60.54mA
                TX4Power: 0.0039dBm
        ChannelThresholdValues:
                RxPowerHighAlarm  : -0.8492dBm
                RxPowerHighWarning: -0.8492dBm
                RxPowerLowAlarm   : -0.8492dBm
                RxPowerLowWarning : -0.8492dBm
                TxBiasHighAlarm   : 25.218mA
                TxBiasHighWarning : 26.0mA
                TxBiasLowAlarm    : 52.4mA
                TxBiasLowWarning  : 30.894mA
        ModuleMonitorValues:
                Temperature: 45.7891C
                Vcc: 3.2618Volts
        ModuleThresholdValues:
                TempHighAlarm  : 17.7969C
                TempHighWarning: 0.0C
                TempLowAlarm   : 7.5C
                TempLowWarning : 0.0C
                VccHighAlarm   : 0.0Volts
                VccHighWarning : 1.9265Volts
                VccLowAlarm    : 0.0064Volts
                VccLowWarning  : 1.8753Volts
For comparison purposes, here is an Arista 100G MM transceiver w/ MPO connector: (31.3C)

Code:
admin@sonic:~$ show interfaces transceiver eeprom Ethernet0 -d
Ethernet0: SFP EEPROM detected
        Application Advertisement: N/A
        Connector: MPOx12
        Encoding: 256B257B
        Extended Identifier: Power Class 4(3.5W max), CDR present in Rx Tx
        Extended RateSelect Compliance: QSFP+ Rate Select Version 1
        Identifier: QSFP28 or later
        Length Cable Assembly(m): 50
        Nominal Bit Rate(100Mbs): 255
        Specification compliance:
                Extended Specification compliance: 100GBASE-SR4 or 25GBASE-SR
        Vendor Date Code(YYYY-MM-DD Lot): 2017-11-12
        Vendor Name: Arista Networks
        Vendor OUI: 00-1c-73
        Vendor PN: QSFP-100G-SR4
        Vendor Rev: 20
        Vendor SN: AMD1801000LT
        ChannelMonitorValues:
                RX1Power: -1.5839dBm
                RX2Power: -2.0004dBm
                RX3Power: -2.0004dBm
                RX4Power: -1.8916dBm
                TX1Bias: 7.494mA
                TX1Power: -1.411dBm
                TX2Bias: 7.494mA
                TX2Power: -1.6292dBm
                TX3Bias: 7.494mA
                TX3Power: -1.1537dBm
                TX4Bias: 7.494mA
                TX4Power: -0.9696dBm
        ChannelThresholdValues:
                RxPowerHighAlarm  : 2.6057dBm
                RxPowerHighWarning: 1.2529dBm
                RxPowerLowAlarm   : 3.2899dBm
                RxPowerLowWarning : -0.8492dBm
                TxBiasHighAlarm   : 25.696mA
                TxBiasHighWarning : 4.0mA
                TxBiasLowAlarm    : 34.0mA
                TxBiasLowWarning  : 0.44mA
        ModuleMonitorValues:
                Temperature: 31.25C
                Vcc: 3.2578Volts
        ModuleThresholdValues:
                TempHighAlarm  : 17.7969C
                TempHighWarning: 0.0C
                TempLowAlarm   : 12.5C
                TempLowWarning : 0.0C
                VccHighAlarm   : 0.0Volts
                VccHighWarning : 1.6754Volts
                VccLowAlarm    : 1.28Volts
                VccLowWarning  : 2.6995Volts
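
To compare ports without reading through the whole dump, the live readings can be pulled out of that output with a few lines of Python. This is a minimal sketch that assumes the layout shown above, with section headers ending in `Values:` and `Name: <number><unit>` lines beneath them. The function name is hypothetical. It deliberately skips the threshold sections, since the thresholds as pasted look inconsistent (for example, a TempHighAlarm of 17.7969C sits below the current temperature).

```python
import re
from typing import Dict

# Sections that hold live readings in the dump layout shown above.
_MONITOR_SECTIONS = ("ChannelMonitorValues:", "ModuleMonitorValues:")
_READING_RE = re.compile(r"(\w+)\s*:\s*(-?\d+(?:\.\d+)?)")


def parse_sonic_dom(text: str) -> Dict[str, float]:
    """Return live DOM readings (units stripped) keyed by field name."""
    readings: Dict[str, float] = {}
    in_monitor_section = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.endswith("Values:"):
            # Entering a new section; only the monitor sections are kept.
            in_monitor_section = stripped in _MONITOR_SECTIONS
            continue
        if not in_monitor_section:
            continue
        match = _READING_RE.match(stripped)
        if match:
            readings[match.group(1)] = float(match.group(2))
    return readings
```

With the KAIAM dump above as input, `parse_sonic_dom(text)["Temperature"]` gives the 45.7891 reading, and the `TX1Bias` to `TX4Bias` keys give the laser bias currents in mA.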
 

DavidWJohnston

Ok got it! Here is what the host says: (47C)

Code:
C:\...> mlxcables -d mt4115_pciconf0_cable_0
Querying Cables ....

Cable #1:
---------
Cable name    : mt4115_pciconf0_cable_0
>> No FW data to show
-------- Cable EEPROM --------
Identifier                     : QSFP28 (11h)
Technology                     : 1310 nm DFB (40h)
Compliance                     : 100G CWDM4 MSA with FEC
Wavelength                     : 1310 nm
OUI                            : 0x14ede4
Vendor                         : KAIAM CORP
Serial number                  : BL1741001P7
Part number                    : XQX5170
Revision                       : 1A
Temperature [c]                : 47 [-10..80]
Digital Diagnostic Monitoring  : YES
Length [m]                     : 0 m
Maybe what I'll do is switch to the Intel ones at both ends and query it again.
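
For anyone who wants to log this on the host side, here is a small sketch that parses the temperature line from the mlxcables output above. It assumes the line always looks like `Temperature [c] : 47 [-10..80]` and that the bracketed pair is the module's reported low/high temperature limits. That reading of the brackets is my assumption, not something I have confirmed in the Mellanox docs. The class and function names are made up.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed line format, taken from the output above:
#   Temperature [c]                : 47 [-10..80]
_TEMP_RE = re.compile(r"Temperature \[c\]\s*:\s*(-?\d+)\s*\[(-?\d+)\.\.(-?\d+)\]")


@dataclass
class CableTemp:
    current_c: int
    low_limit_c: int   # assumed: lower limit reported by the module
    high_limit_c: int  # assumed: upper limit reported by the module

    @property
    def headroom_c(self) -> int:
        """Degrees C remaining before the assumed upper limit."""
        return self.high_limit_c - self.current_c


def parse_mlxcables_temp(output: str) -> Optional[CableTemp]:
    """Extract the temperature reading and limits, or None if absent."""
    match = _TEMP_RE.search(output)
    if match is None:
        return None
    current, low, high = (int(g) for g in match.groups())
    return CableTemp(current, low, high)
```

For the reading above this gives 47 C against an upper limit of 80 C, so 33 C of headroom if the bracketed range means what I think it means.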
 

DavidWJohnston

OK, I swapped in the Intel CWDM4s in place of the KAIAMs. Here is what the temps indicate after giving them a while to reach equilibrium:

Switch: 43.6C (2C cooler than the KAIAM)
Host: 44C (3C cooler than the KAIAM)

So it would seem the Intels run a bit cooler. Still quite a bit hotter than the Multimode Aristas though.

I'll check the temps again tomorrow to see if it's stable.
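
The arithmetic behind those comparisons, as a quick sketch using only the numbers posted in this thread. The Arista SR4 reading is the earlier 31.3C one from a different switch port, so that comparison is only approximate. The dictionary layout and function name are just for illustration.

```python
# Temperatures in degrees C as reported earlier in this thread.
readings_c = {
    ("switch", "KAIAM CWDM4"): 45.8,
    ("switch", "Intel CWDM4"): 43.6,
    ("switch", "Arista SR4"): 31.3,  # measured earlier, on a different port
    ("host", "KAIAM CWDM4"): 47.0,
    ("host", "Intel CWDM4"): 44.0,
}


def delta_c(location: str, module: str, baseline: str) -> float:
    """Temperature of `module` minus `baseline` at the same location."""
    diff = readings_c[(location, module)] - readings_c[(location, baseline)]
    return round(diff, 1)  # one decimal, matching the precision of the readings


if __name__ == "__main__":
    print("Intel vs KAIAM (switch):", delta_c("switch", "Intel CWDM4", "KAIAM CWDM4"))
    print("Intel vs KAIAM (host):  ", delta_c("host", "Intel CWDM4", "KAIAM CWDM4"))
    print("Intel vs Arista SR4:    ", delta_c("switch", "Intel CWDM4", "Arista SR4"))
```

Negative values mean the first module runs cooler than the baseline.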
 

TRACKER

Active Member
Jan 14, 2019
177
54
28
I am observing a similar relationship between the temps of a 100G MM transceiver w/ MPO connector (FS) and my KAIAM XQX4302: around 8-10°C difference between the two types.
On the CRS504 switch, the SR4 transceiver reports around 46°C, while the CWDM4 transceiver stays around 54-56°C. The temperature measured with an IR thermometer on the parts sticking out of the switch is 32-34°C for both types of transceivers. By the way, when I remove the CWDM4 transceiver after it has been running for some time, it feels very hot in the hand, so I assume the temp I can feel is actually around 50°C :)
 


awedio

So, what did you conclude here?