PSA: New builds of older SONiC versions no longer work on the Celestica DX010


DavidWJohnston

Well-Known Member
Sep 30, 2020
295
252
63
I've been running the DX010 for a few years now, and I'm the originator of this post. I'm happy to help if you come across any problems.

I am still using the old build I posted about, where breakout still works. The link and password in my original post are still active. I have not tried the new builds again in a long time. As long as you install "my" build, you will be OK. But the DX010 (and SONiC, for that matter) isn't for everyone:

I used to have a Dell Z9100, which I sold on this forum. It is a very clean, bug-free, prod-ready switch. If that's what you're looking for, you may not like a DX010. SONiC on whitebox hardware has a lot of advantages (like unlocked SFP compatibility), but it takes some degree of messing around, so be ready.

There are a couple of other annoyances I have discovered that you may run into with the older build (or possibly the new ones too). I'll share them below for any DX010 users out there; I suspect we will see an influx because they are available cheaply again.

- This post and its comments have a lot of useful getting-started info:
https://www.reddit.com/r/homelab/comments/n5opo2
After starting with that post, also be aware of the following:

- You need to add a MGMT_PORT entry manually into /etc/sonic/config_db.json to avoid errors. Here is where you put it:
[Image: 1727483289378.png]

The text is:
Code:
    "MGMT_PORT": {
        "eth0": {
            "admin_status": "up",
            "alias": "eth0"
        }
    },

- The static route add command doesn't always put the correct entries in the config file. See my own config file at the bottom for examples.

- If you enable DCB (Data Center Bridging) for RoCE PFC/QoS, any later breakout fails. I run RoCE without DCB and it works perfectly fine, so I would recommend just not enabling it. If you must, you will need to remove all DCB config, do the breakout, and then configure DCB again.

- If you create multiple VRFs, and create routes that cross VRF boundaries, that traffic needs to be routed by the slow Atom CPU, not the ASIC. This isn't unique to the DX010. In my case, I have one VRF with all VLANs in there and it works perfectly.

- There is a strange issue that happens with very little network load: the ARP entries for the upstream gateways can age off in a way that prevents them from coming back. Not everybody seems to hit this issue, but if you do, see my ridiculous workaround in the reference links below.

- If you use the DHCP helper service, it sometimes just stops working for no apparent reason and starts putting errors in the syslog. I have never really tried to fix it because it happens so infrequently; I usually reboot the switch often enough to never hit it. I'm open to creating a workaround if other people have this.

- For some SFPs and far-end devices, a specific set of commands is needed to get the link to come up after the breakout, especially for 1G and some 10G. The commands below are written for 1G. They are also necessary for a specific model of 4x10G AOC breakout I use; in that case, put 10000 instead of 1000. I also have a 4x25G AOC that needs speed 25000 and fec mode fc. For 1x100G and 2x50G, usually none of the below is needed; all you have to do is set fec mode rs and it comes right up.

Code:
sudo config int fec Ethernet32 none
sudo config int speed Ethernet32 1000
sudo config int advertised-speeds Ethernet32 1000
sudo config int autoneg Ethernet32 disabled
sudo config int transceiver lpmode Ethernet32 disable
sudo config int shut Ethernet32
sudo config int start Ethernet32
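
The sequence above can be wrapped in a small helper so that the port, speed and FEC mode become arguments. This is only a sketch: the function name `link_bringup_cmds` is made up for illustration, and it prints the commands instead of running them so they can be reviewed first. The commands themselves are copied unchanged from the list above.

```shell
# Hypothetical helper: print (not run) the bring-up commands for one port.
# Usage: link_bringup_cmds PORT SPEED [FEC]   (FEC defaults to "none")
link_bringup_cmds() {
    local port="$1" speed="$2" fec="${3:-none}"
    printf 'sudo config int fec %s %s\n' "$port" "$fec"
    printf 'sudo config int speed %s %s\n' "$port" "$speed"
    printf 'sudo config int advertised-speeds %s %s\n' "$port" "$speed"
    printf 'sudo config int autoneg %s disabled\n' "$port"
    printf 'sudo config int transceiver lpmode %s disable\n' "$port"
    printf 'sudo config int shut %s\n' "$port"
    printf 'sudo config int start %s\n' "$port"
}

# Example: review the commands for a 4x25G AOC sub-port with fec mode fc.
link_bringup_cmds Ethernet32 25000 fc
```

Once the printed commands look right, they can be piped to a shell on the switch, for example `link_bringup_cmds Ethernet32 10000 | bash`.
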
Links:

My DX010 config file: Online Notepad with Privacy and Publishing

My Github post about ARP age-off and workaround: Aging-off of ARP entries for nexthop gateways can break static routes · Issue #1210 · sonic-net/SONiC
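
Since the MGMT_PORT entry described earlier is added by hand, a quick sanity check after editing can catch a missing key or a stray brace. The sketch below is an assumption-laden example, not part of the original instructions: `check_config_db` is a hypothetical name, the key check is a plain text search, and the JSON syntax check only runs if `python3` happens to be installed.

```shell
# Hypothetical helper: sanity-check a hand-edited config_db.json.
check_config_db() {
    local file="$1"
    # Plain text search for the key; this does not verify where it sits.
    if ! grep -q '"MGMT_PORT"' "$file"; then
        echo "no MGMT_PORT entry in $file"
        return 1
    fi
    # Optional syntax check; skipped when python3 is not available.
    if command -v python3 >/dev/null 2>&1; then
        if ! python3 -m json.tool "$file" >/dev/null 2>&1; then
            echo "$file is not valid JSON"
            return 1
        fi
    fi
    echo "ok: $file"
}

# On the switch: check_config_db /etc/sonic/config_db.json
```
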
 


AlbertD007

New Member
Jan 1, 2024
16
1
3
Hi @DavidWJohnston and @BackupProphet, I also tested with the 202405 build and I can confirm the breakouts work across reboots. However, I think there is another configuration causing the issue, because when I configure the switch with all the QoS stuff and a bunch of other things, the ports won't come up... Thankfully I made a backup of my old config file.

I have yet to dial in the switch, but here are my rough-and-dirty config files to have a look at. Maybe someone can quickly spot the issue, because it makes zero sense to me so far...

I will try a clean run again and reconfigure the switch from absolute scratch with just the bare essentials to locate the problem child. At this stage I can't see why the interfaces won't come up.
 

Attachments

DavidWJohnston

@AlbertD007 I know in my case if I did any additional breakouts after configuring QoS/DCB it would break everything.

I notice in your "broken" config file, Ethernet0 thru Ethernet3 (those 4 broken-out sub-ports) do not seem to have any QoS profile attached. I think this is a problem. At this point, you might need to manually add it in the JSON.

Are the interfaces simply not up, or is the interface list totally empty? You can also run "docker ps" to see if any of the containers are broken.
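
To make the container check quicker to read, the output of `docker ps` can be filtered down to the containers that are not running. This is a minimal sketch: `not_up` is a hypothetical helper name, and the sample lines fed to it are made up for illustration rather than taken from a real DX010.

```shell
# Hypothetical helper: read "NAME STATUS..." lines on stdin and print the
# names of containers whose status does not start with "Up".
not_up() {
    awk 'NF && $2 != "Up" { print $1 }'
}

# On the switch, feed it the real container list:
#   docker ps -a --format '{{.Names}} {{.Status}}' | not_up

# Self-contained example with made-up sample lines:
printf '%s\n' 'swss Up 3 hours' 'syncd Exited (1) 2 minutes ago' 'bgp Up 3 hours' | not_up
```

An empty result means every listed container reports an "Up" status.
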

I have not tried the new SONiC build yet; I'm still on 202111.208607-f1f42c9a6. But the posts here are looking good, so I think I will try it.

Let us know how it goes, thanks.
 

AlbertD007

Hi @DavidWJohnston, I see, okay will give that a go, that makes sense. I will do some more testing.

Yeah, the interfaces just would not come up, but they certainly all show up in the list.
I tried various things, including transceiver resets, startup, and messing with the cable length, but no luck.

Yeah, I keep checking all the containers constantly and make sure they are all up before I do anything on the switch.

Will do!
 

darthray

New Member
Apr 11, 2021
19
5
3
@DavidWJohnston, I very much appreciate you taking the time to provide all that good information. It took me some time, but I bought a DX010 and used your post to find my way around it. I'm running 202405, but I'm not doing anything complicated (1x 40G and 1x 100G link on trunked ports). I plan to add other hosts to it, but I don't expect it to get any more complicated than some VLAN definitions. It seems to be working fine, although I can now see how rough SONiC is around the edges compared to other stuff I'm used to (basically just Cisco and Brocade NOSes). It's slow too -- it seems it loads some unjitted Python code for every command, all because those files don't end with .py. I'm not sure why they do that.

BTW, is there a way to pass multiple interfaces when configuring something? Say I want to change all interfaces from routed to trunk. Is there a way to do that with a single command? I tried passing "Ethernet0-Ethernet124" but it didn't like it.

Thanks again!
 

DavidWJohnston

Glad to hear about a new DX010 user! It looks like they're still available for ~$400 USD on eBay if anyone else is following.

The OS is fiddly, and it is much slower compared to "normal" NOSs, you're right. And it's not that hard to break the mgmt network, so have a serial console ready. One of the best additions to my lab was a 48-port OpenGear serial console server. It's easy on power too.

I'm not aware of any direct way to use interface ranges. In the Reddit post I linked to, there are some examples of using BASH and awk/sed/grep and such to batch changes. When I was doing many repeated re-installs, I had everything inside a BASH script. I can't find it atm, but it's nothing fancy: just every command to run, one after the other, with comments. I made most of it using loops, just like that Reddit post.
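
A loop of that kind might look like the sketch below. It is not the script mentioned above: `expand_ports` is a hypothetical helper, and the default step of 4 is an assumption based on the un-broken-out ports being named Ethernet0, Ethernet4, and so on up to Ethernet124. The loop only echoes the commands as a dry run.

```shell
# Hypothetical helper: expand a numeric range into SONiC port names.
# Usage: expand_ports FIRST LAST [STEP]   (STEP defaults to 4, an assumption)
expand_ports() {
    local first="$1" last="$2" step="${3:-4}" n
    # Refuse a zero or negative step, which would never terminate.
    [ "$step" -ge 1 ] || return 1
    n="$first"
    while [ "$n" -le "$last" ]; do
        printf 'Ethernet%d\n' "$n"
        n=$((n + step))
    done
}

# Dry run: print one command per port instead of executing it.
for port in $(expand_ports 0 12); do
    echo "sudo config int fec $port rs"
done
```

Dropping the `echo` (or piping the output to `bash`) would apply the change; a step of 1 covers the sub-ports of a broken-out port.
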

There are also "presets" you can load. I never did this, but the first comment in the Reddit post shows an example. YMMV with the new build, the platform string or profile number might need to be changed or something. I think preset 12 gives you a more traditional L2 config.

I'm open to writing a command wrapper to accept interface ranges. Once I try the new build, I might do that.
 

darthray

Thanks. Yeah, I saw some of the scripts in one of the Reddit posts, probably the one you linked. I'm comfortable with bash and scripts; I just thought SONiC would support ranges out of the box (it does support ranges for VLANs, but I couldn't make it work for interface ranges).

Also, alias naming mode (which makes more sense for me) is a bit broken, presumably because not a lot of folks use it.

I'll continue to play with it. If you ever try a new build, let us know your findings.
 

pimposh

hardware pimp
Nov 19, 2022
391
226
43
I can also confirm 202405 works with breakout DACs. So far I've noticed nothing unexpected in my fairly simple config.
 

AlbertD007

@DavidWJohnston Okay, I tried to get QoS going but haven't had much success with the breakout interfaces.
Here is my current config, a printout of the interfaces, and the running config.

I am a little confused about how I should set up QoS so I can get RoCE going, etc. Do you have a guide I can follow to set it up properly, @DavidWJohnston?

I followed these steps (thanks to the Reddit thread on the DX010):
1. Copy some base config files to the Celestica-DX010 dir.
cd /usr/share/sonic/device/x86_64-cel_seastone-r0/
sudo cp -a ./Celestica-DX010-C32/. ./Celestica-DX010/
sudo cp -a ./Seastone-DX010/. ./Celestica-DX010/

2. Edited ./Celestica-DX010/buffers.json.j2 and changed the profile to t0 as I am using short DACs.

3. Ran the following command to load in the configuration.
/usr/local/bin/sonic-cfggen -d --write-to-db -t /usr/share/sonic/device/x86_64-cel_seastone-r0/Celestica-DX010/buffers.json.j2,config-db -t /usr/share/sonic/device/x86_64-cel_seastone-r0/Celestica-DX010/qos.json.j2,config-db -y /etc/sonic/sonic_version.yml

4. Modified the config_db.json and set the cable lengths to 5m instead of 40m. I'm not really sure whether this protects the transceivers; they are DACs, not optics.


One of the particular reasons I am so unsure is that when I ran config qos reload, it failed looking for files in the Seastone-DX010 dir:
> config qos reload
Operation not completed successfully, please save and reload configuration.
Buffer definition template not found at /usr/share/sonic/device/x86_64-cel_seastone-r0/Seastone-DX010/buffers.json.j2

The contents of the directory were just this:
[root@ccoresw01]/usr/share/sonic/device/x86_64-cel_seastone-r0
> ll Seastone-DX010
-rw-r--r-- 1 root root 2720 Feb 20 2017 hwsku.json
-rw-r--r-- 1 root root 2077 Feb 20 2017 port_config.ini
-rw-r--r-- 1 root root 85 Feb 20 2017 sai.profile
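
A quick pre-check can show whether a HWSKU directory has the templates before running the reload. This is a sketch under an assumption: it treats `buffers.json.j2` and `qos.json.j2` (the two templates named in the sonic-cfggen command above) as the files that matter. `missing_qos_templates` is a hypothetical helper name.

```shell
# Hypothetical helper: report which QoS templates are missing from a
# HWSKU directory. Prints nothing and returns 0 when both are present.
missing_qos_templates() {
    local dir="$1" f rc=0
    for f in buffers.json.j2 qos.json.j2; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            rc=1
        fi
    done
    return "$rc"
}

# Example, using the platform path from this thread:
missing_qos_templates /usr/share/sonic/device/x86_64-cel_seastone-r0/Seastone-DX010 || true
```
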

Thinking to myself, ah, maybe the Reddit thread led me astray and I should be copying files over to the Seastone-DX010 dir instead. However, copying everything from the Celestica-DX010 dir into it (being careful not to overwrite files) did not go well after a reboot. Containers kept dying, so it really didn't look good...

LOL, jumping on the next Arista 100G switch looks like a better and better plan. I saw quite a few guides for the Aristas.
 

Attachments

AlbertD007

Okay, I found my problem with the files being in different folders: I was using the Seastone-DX010 preset. Funnily enough, that was recommended in the DX010 initial config Reddit thread. This is actually bad advice, because it assumes you want next to nothing and will be copying files from the other preset/template directories to make your switch work the way you want.

After I stopped using the preset, life got easier, except for a new bug in the 202405 builds... or so I think?
Build 202405.680018-378ae5876 just doesn't have "rs" as an FEC option on breakout interfaces, so I went back to a previous build I had from June.
Build 202405.580192-926d03322 has the same issue. What is so special about these breakout interfaces? I don't think this is a show stopper.
I'm not sure if I am being stupid or have to configure something, but from memory, on a 2022 build I have, I could set rs on breakout interfaces without any special config...
I still have the ability to set the FEC to rs on any other interface...

The following is a ramble. Does anyone know of a requirement to set all broken-out interfaces to the same speed? If not, read on...

I am using Build 202405.580192-926d03322 at the moment (the one from June).
I did just the bare essentials on the switch: removed BGP neighbors and IPs from all interfaces and added 2 VLANs. Then I decided to do the breakout and test. I performed the breakout and set the interface speed and the FEC for Ethernet0, and it worked; the interface came up, no surprise there...
I rebooted the switch and it stopped working again. This made me think there is a config issue, because I knew I had it going before on this build with my old config.
Same results for the 202405.680018 build.

I decided to compare the config files and the running config exports, and there is very little difference between them...
Then I replaced my current config file with the one that is working and, huzzah, the breakout works... I have attached both configs and running configs to compare.

Okay, I decided to roll back my config file. Then I set each breakout interface to the same speed, as this was one of the differences. This means that on my 4-way breakout cable, I set the speed for each of the 4 interfaces (10G in my case), despite only 2 of the interfaces being connected to a 10G box.
I rebooted and, huzzah again, the interfaces came up.
I should add here that I have always set Ethernet0 to 10G.

I wondered whether the speed mismatch on Ethernet3 (the other broken-out interface connected to a 10G box) was the issue, causing some enumeration problem in the switch's code, or maybe hitting some limit.
I set its speed to 10G and rebooted the switch, and now Ethernet3 comes up just fine. At this point the only difference between Ethernet0 and Ethernet3 is the MTU, but Ethernet0 stays down.

Setting all 4 broken out interfaces to 10G kept my interfaces up.
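
Setting the same speed on every sub-port of one breakout can be scripted so none is missed. This is a minimal sketch, assuming the sub-ports are numbered consecutively (Ethernet0 through Ethernet3 here); `breakout_speed_cmds` is a hypothetical helper that only prints the commands.

```shell
# Hypothetical helper: print a speed command for every sub-port of one
# breakout. Usage: breakout_speed_cmds FIRST_PORT_NUMBER COUNT SPEED
breakout_speed_cmds() {
    local first="$1" count="$2" speed="$3" i=0
    while [ "$i" -lt "$count" ]; do
        printf 'sudo config int speed Ethernet%d %s\n' "$((first + i))" "$speed"
        i=$((i + 1))
    done
}

# Example: the 4x10G breakout on Ethernet0 through Ethernet3.
breakout_speed_cmds 0 4 10000
```

After running the printed commands, the config still has to be saved before a reboot; on SONiC that is normally `sudo config save -y`.
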

I'm going to move on to the QoS stuff now and see how it goes. I have attached the fresh switch config and the old working config I had for some reading, including the running configs.
 

Attachments

NablaSquaredG

Bringing 100G switches to homelabs
Aug 17, 2020
1,827
1,206
113
AlbertD007 said:
Build 202405.680018-378ae5876 just doesn't have "rs" as an FEC option on breakout interfaces... So went back to a previous build I had from June.
Build 202405.580192-926d03322 and it has the same issue, what is so special about these breakout interfaces??? Don't think this is a show stopper.
Not sure if I am being stupid or have to configure something but from memory on a 2022 build I have, I could set rs on breakout interfaces without any special config...

The DX010 does NOT support RS-FEC on breakout interfaces. That's a limitation of the Tomahawk ASIC (and was fixed with Tomahawk+).

See "Switch Port Attributes | Cumulus Linux 3.7", which says:
"Tomahawk predates 802.3by. It does not support RS FEC or auto-negotiation of RS FEC on a 25G port or subport. It does support Base-R FEC."
 

AlbertD007

Okay @DavidWJohnston, I got the interfaces to stay up on the breakout after applying the QoS via the Reddit thread:
1. Copy some base config files to the Celestica-DX010 dir.
cd /usr/share/sonic/device/x86_64-cel_seastone-r0/
sudo cp -a ./Celestica-DX010-C32/. ./Celestica-DX010/
sudo cp -a ./Seastone-DX010/. ./Celestica-DX010/

2. Edited ./Celestica-DX010/buffers.json.j2 and changed the profile to t0 as I am using short DACs.

3. Ran the following command to load in the configuration.
/usr/local/bin/sonic-cfggen -d --write-to-db -t /usr/share/sonic/device/x86_64-cel_seastone-r0/Celestica-DX010/buffers.json.j2,config-db -t /usr/share/sonic/device/x86_64-cel_seastone-r0/Celestica-DX010/qos.json.j2,config-db -y /etc/sonic/sonic_version.yml

4. Modified the config_db.json and set the cable lengths to 2m instead of 40m.

It is working okay apart from one problem: I can't ssh into the switch from a Linux box, and I am seeing RX_DRP going up on the breakout interface connected to the access switch that I ssh through to reach the DX010.

I also can't ssh to any boxes attached to the switch from my workstation, which is attached to the access switch. Hilariously, I can ssh from the switch to a directly connected box and back to the switch without issue, and I see no drops on the other, non-broken-out interfaces.
This was working just fine before, and RX_DRP did not go up.

My guess is that this is because QoS is now enabled on this port too. Any advice from anyone?
 

AlbertD007

@BackupProphet That's certainly what I mean by QoS. I want RoCE in all its glory.

BTW, I fixed my SSH issue: disabling all QoS for the breakout interfaces solved some of the connectivity issues. Annoyingly, I had to do it manually by modifying the config file directly. (I am really trying to rely on commands as much as possible, so that if I ever need to make a change with servers connected to the switch, they keep happily ticking along.)

I'm still seeing RX_DRP go up, but I'm not seeing any drops on my access switch for the port. I suspect I need to check the transceiver on the access switch side.
 