PSA: New builds of older SONiC versions no longer work on the Celestica DX010


DavidWJohnston

Well-Known Member
Sep 30, 2020
295
252
63
So I just went through the painful process of reinstalling SONiC on my DX010 numerous times. I wanted to try the newest builds in the pipeline.


New builds work for QSFP28 transceivers but the breakout functionality is broken. The interfaces will come up when you first do the breakout, and everything looks good, but on the next reboot they die and never come back. I did a lot of troubleshooting but no dice.

So I thought, OK, I will download my previous version (202111) and everything will be fine. Unfortunately, it looks like backported changes broke the builds for all previous versions as well, so the working builds are no longer available for download.

Luckily I keep almost everything, and I have the Downloads folder from my old workstation on LTO tape, and I was able to restore the old BIN file I downloaded before.

There may be some way to still download the older versions, but here it is in case anyone is in the same situation. The password is: celestica

sonic-broadcom-dx010-202111.208607-f1f42c9a6.bin

Has anybody else had this breakout problem with the newer builds and resolved it?
 

klui

༺༻
Feb 3, 2019
991
581
93
Does this affect all builds in the current 202111 pipeline, which goes back to late July? I've experienced the opposite: messing with the port configuration drops the link or the metadata goes bad, but a reboot fixes it.

You may want to file a report but who knows how long that would take to get fixed.
 

DavidWJohnston

I didn't bother trying any of the July 202111 builds. If one of them does happen to work, it's just going to get aged off anyway, as new stuff keeps getting backported. But it might help pinpoint when the problem started.

I may create a GitHub issue. At this point I've reinstalled my previous working version. Maybe on the weekend I'll have some energy to repeat it and collect all the info to support a ticket.
 

traetox

New Member
May 22, 2024
6
0
1
If I ever find you in a bar, or know that you are within striking distance of me, I am going to buy you all the beers.

I wasted two full days trying to figure out why breakout wasn't working with my SFPs and/or DACs.

Fun fact: if you connect two Celestica DX010 switches to each other, the link comes right up. Connect the switch to ANYTHING ELSE and you get nothing. I was about to take these things into the desert and start shooting them.

THANK YOU SIR!
 

DavidWJohnston

You're welcome, glad I could help. If you're using L3 functionality with that build, you may run across an ARP table bug I raised here: Aging-off of ARP entries for nexthop gateways can break static routes · Issue #1210 · sonic-net/SONiC

As a workaround, I use a rather hideous script (since modified) that periodically pings everything in the ARP table from inside the correct VRFs. It wouldn't scale in production.
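
For anyone who wants to try something similar, here is a minimal Python sketch of the idea. It is not the actual script mentioned above. It assumes the usual `ip -4 neigh show` line shape (`<ip> dev <ifname> lladdr <mac> <STATE>`), and the `dev_to_vrf` mapping is a hypothetical input you would have to build yourself for your own VRF layout.

```python
import subprocess

def parse_neighbors(ip_neigh_output):
    """Return (ip, device) pairs for entries that have a resolved MAC."""
    pairs = []
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        # Assumed shape: "<ip> dev <ifname> lladdr <mac> <STATE>"
        if len(fields) >= 5 and fields[1] == "dev" and "lladdr" in fields:
            pairs.append((fields[0], fields[2]))
    return pairs

def build_ping_cmd(ip, dev, vrf=None):
    """Build a one-shot ping, wrapped in `ip vrf exec` when a VRF is given."""
    cmd = ["ping", "-c", "1", "-W", "1", "-I", dev, ip]
    if vrf:
        cmd = ["ip", "vrf", "exec", vrf] + cmd
    return cmd

def refresh_once(dev_to_vrf):
    """Ping every resolved IPv4 neighbor once.

    dev_to_vrf is a hypothetical {interface: vrf_name} mapping; interfaces
    that are missing from it are pinged from the default VRF.
    """
    out = subprocess.run(["ip", "-4", "neigh", "show"],
                         capture_output=True, text=True, check=True).stdout
    for ip, dev in parse_neighbors(out):
        subprocess.run(build_ping_cmd(ip, dev, dev_to_vrf.get(dev)),
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
```

Calling `refresh_once(...)` from cron or a timer, at an interval shorter than the ARP aging time, would keep the entries fresh. Like the original workaround, this is a stopgap and would not scale in production.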

I don't know if you'll come across that or not, but there's a bunch of other random things I fixed. Just tag me in your post if you get stuck during your setup and I will try to help.

Good Luck!!
 

traetox

I am all layer 2 for now. I appreciate your willingness to help very much.

You wouldn't happen to know if this build supports STP while in an MC-LAG setup? My EdgeCore switches oddly won't enable STP when MC-LAG is active.
 

DavidWJohnston

I've never done any multi-chassis stuff at all, nor even regular LAG or STP. Due to the DX010 being broken by some backport, the features in the new builds won't be available to us. Unfortunately in the last few years a massive ton of features have been added.

The underlying reason the new builds don't work properly may be a very tiny and fixable bug. If I have some extra free time I might try to dig into it.
 

DavidWJohnston

All of the builds (all versions) in the pipeline AFAIK are currently broken on the DX010 (specifically port breakout post-reboot) due to some problematic backported patch that was done in the last year or so. My recommendation would be to install the version from my OneDrive link in my post. This one works.

If you're not using port breakout at all, and/or want to try out the new ones, go ahead, I'm curious to hear about your experience if so.

Upgrading ONIE is probably not necessary. I installed the OS by booting into stock ONIE, mounting a FAT32-formatted USB drive containing the .bin file at /mnt/usb, and then running onie-nos-install /mnt/usb/blah.bin.

If you do it this way, make sure to plug the USB drive into the switch AFTER you get to the ONIE prompt, so it is detected as the last /dev/sdX in the system. I also use a small USB drive, because ONIE's detection of the target system disk for installation can be based on the largest disk in the system.
 

traetox

Ok, so I just finished an MC-LAG configuration on two DX010 switches and while the configs all took, the build posted above does not have the "mclag" argument on the "show" command:

admin@TOR2:~$ show mclag status
Usage: show [OPTIONS] COMMAND [ARGS]...
Try "show -h" for help.

Error: No such command "mclag".
admin@TOR2:~$



I also can't get the PortChannel interfaces to come up on any of my devices that are NOT the switch:

Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available,
       S - selected, D - deselected, * - not synced
  No.  Team Dev        Protocol     Ports
-----  --------------  -----------  -------------------------
    0  PortChannel0    LACP(A)(Up)  Ethernet0(S) Ethernet4(S)
    1  PortChannel1    LACP(A)(Dw)  Ethernet124(D)


I am going to keep poking around, but it looks like mc-lag might not be enabled for
 

traetox

Welp... that explains that.... Do you happen to know if I have to completely rebuild all of SONiC with the MCLAG feature, or can I just build the ICCPD service/Docker container?
 

klui

No, but at a minimum you need to retrieve the commit id of the build you've downloaded and use that. Otherwise you'd need to decide which branch/commit to use. Newer branches will have issues. Aside from MLAG, there appears to be a memory leak affecting most if not all Broadcom-based switches.
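
As a rough illustration of pulling the commit id out of a downloaded image name, here is a minimal Python sketch. The `sonic-<vendor>-<platform>-<branch>.<build>-<commit>.bin` pattern is an assumption inferred from the file name posted earlier in this thread, not a documented naming rule, and `parse_image_name` is a hypothetical helper.

```python
import re

# Assumed pattern: sonic-<vendor>-<platform>-<branch>.<build>-<commit>.bin
IMAGE_RE = re.compile(
    r"^sonic-(?P<vendor>[a-z0-9]+)-(?P<platform>[a-z0-9]+)-"
    r"(?P<branch>\d+)\.(?P<build>\d+)-(?P<commit>[0-9a-f]+)\.bin$"
)

def parse_image_name(name):
    """Split an image file name into its parts, or return None if it does not match."""
    m = IMAGE_RE.match(name)
    return m.groupdict() if m else None

info = parse_image_name("sonic-broadcom-dx010-202111.208607-f1f42c9a6.bin")
print(info["branch"], info["commit"])  # prints: 202111 f1f42c9a6
```

If the trailing hash really is a commit in the build repository, that is the commit to check out before building. On a switch that is already running the image, the output of show version should be a more reliable place to confirm it than the file name.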
 

traetox

This is kind of frustrating... Are any of the other switch OS's more stable?

At the end of the day I just need layer 2 and MCLAG with LACP support.
 

klui

Cumulus supports MLAG and is stable, but requires a license. However, after Nvidia acquired Cumulus, Broadcom-based switches took a backseat (or, as some folks will say, the ejection seat), so only bug fixes are released for the platform.

You can expect behavior similar to SONiC's from other open-source or freely available NOSes for these types of open switches. These NOSes are really for people with teams that have the expertise to modify the software to suit their needs.

Read these threads on r/networking for example:

https://www.reddit.com/r/networking/comments/s8t683
https://www.reddit.com/r/networking/comments/s468zd

If you need 100G, get a switch from a good vendor, but you will pay for the stable, battle-tested software. Arista is my first choice.
 

traetox

You are entirely correct. I typically only buy Arista... went out on a limb on these Celesticas... oh well... lesson learned.
 

darthray

New Member
Apr 11, 2021
20
7
3
Hey folks,

Is this DX010 issue still a thing? I'm considering a DX010 to play with and I know I'll eventually use breakout cables. I'm interested in knowing if anyone is running this config and can share how SONiC is behaving.