Celestica DX010 100GbE switch w/ Intel Avoton C2358 CPU - AVR54 C0 stepping failure?


AveryFreeman

consummate homelabber
Mar 17, 2017
413
54
28
42
Near Seattle
averyfreeman.com
Hey,

I searched around the net to see if anyone had reported their Celestica DX010 switch dying from the AVR54 bug (the Avoton LPC clock failure that the C0 stepping is supposed to fix), and didn't find anything. Curious to know how dangerous the bug is to these switches.

Does anyone on here have one of these switches, and if so, how are you mitigating the impending doom these switches apparently face?

Anyone here had one die?
 

klui

Well-Known Member
Feb 3, 2019
834
457
63
If mine fails I'm going to ask a coworker who has experience with hardware, scopes and so on to do the rework. Hopefully I won't need to, since it was manufactured in late 2019. There is another thread here about Celestica's 40G D4040 where @okrasit reworked his, but the PCB layout changed in later versions. It might have been reworked by Celestica already.

There was an eBay listing for a unique version with an "LPC" suffix. My guess is that it specifically had the rework done, and that all units built after that date (September 2016) were reworked.
 

AveryFreeman

Yeah, I read the thread on the D4040; that's how I learned about the bug and the DX010. I'm always looking for deals on interesting gear on eBay and was thinking about getting a switch for my currently direct-connected ConnectX-3s.

The DX010 is just so nice and so ridiculously affordable right now that I'd love to give SONiC a try. I just don't want to shell out for something that's going to die in a year or two. And unlike the D4040 or the affected Synology NAS, I haven't seen any info or guides on how to remedy the AVR54 bug on it.

Anecdotally, have you ever heard or read about any of these DX010s (pre-2017 or otherwise) dying? I imagine you've probably done quite a bit of research since you own one...
 

AveryFreeman

DX010 does not use LPC to start, so it shouldn’t be affected in fact
Wow, you really think so? How does it "start"? (what do you mean by "start" exactly? That's kind of an ambiguous term...) Are you talking about the boot procedure? If it doesn't use LPC to boot, how does it go about it?

It's gotta use the LPC for the clock, right? I imagine that affects everything... but I admit I don't know much about how this bug works exactly, just that it supposedly kills anything with an Avoton made before 2018ish.
 

nasbdh9

Active Member
Aug 4, 2019
166
96
28
Wow, you really think so? How does it "start"? (what do you mean by "start" exactly? That's kind of an ambiguous term...) Are you talking about the boot procedure? If it doesn't use LPC to boot, how does it go about it?
The processor can read the BIOS image from SPI or LPC and then boot (the DX010 uses SPI).

It's gotta use the LPC for the clock, right? I imagine that affects everything... but I admit I don't know much about how this bug works exactly, just that it kills supposedly anything with an Avoton made before 2018ish
I’m not sure if there are other places where the LPC bus is used; I don’t remember (my DX010 is already sold :oops:).
 

thefloyd

New Member
Dec 21, 2020
29
7
3
The LPC is the 'clock' line (hence the 'C') which sets the timings for the SPI (and other buses). The switch can't just magically read the BIOS via SPI with a dead clock.

**EDIT: just found out that the SPI bus uses a completely separate clock (and that LPC actually stands for Low Pin Count). Neato!
 

LodeRunner

Active Member
Apr 27, 2019
540
227
43
I'm having trouble finding it, but I seem to recall (maybe confusing it with the D4040) that something in the boot chain would still call or try to set a register on the LPC and if the LPC is dead, it hangs forever trying to set the register.
 

Sjhwilkes

New Member
Oct 17, 2020
28
2
3
Resurrecting this - just bought one of these myself and very tempted to buy four plus spares to solve a customer issue where we can't get anything 100G new fast enough. Anyone got any further anecdotes on these failing?
 

thefloyd

I've got two, and while this is far from any kind of scientific study, neither of my 2018 models has suffered any ill effects. I popped one open and compared it to the photos on the Google Drive attached to the Reddit thread (see below). I can confirm my 2018 model has different component values around the SoC compared to the model that failed, but it also has different values compared to the even more recently made ones (i.e. my values match NEITHER set of pics). My tentative explanation: flawed Atoms could run fine with some tweaks to the external circuitry, which is probably what my unit has, whereas yet another board revision followed once the Atom SoC itself got a new stepping.

https://www.reddit.com/r/homelab/comments/n5opo2
 

klui

DX010s should have the C0 stepping if they were manufactured in 2018. There is a thread on r/homelab where someone posted the proper command to check when booted into some form of Linux such as SONiC. lscpu and /proc/cpuinfo do not work for this.

https://www.intel.com/content/dam/w...ion-updates/atom-c2000-family-spec-update.pdf page 15.
The BIOS is able to determine the silicon stepping of the entire SoC. This is accomplished by reading the 32-bit CUNIT_CFG_REG_CLASSCODE register in the configuration space, bus 0, device 0, function 0, offset 8h. The SoC stepping is shown in bits [7:0]. Note that Ethernet Rev ID will not change from B0 to C0 stepping.
Parameter     B0 SoC   C0 SoC
PCI Rev ID    2        3
Code:
setpci -s 00:00.0 8.b
If it returns 03 then the SoC is C0 stepping. 02 means B0 stepping.
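For anyone checking several units, here is a minimal sketch that turns that byte into a stepping name. The function name decode_stepping is made up for illustration and is not part of pciutils; the 02/03 mapping is the one from the spec update table above. It assumes setpci (from pciutils) is installed and that you run it as root.

```shell
#!/bin/sh
# decode_stepping is a hypothetical helper name, not part of pciutils.
# It maps the PCI Revision ID byte of device 00:00.0 to the SoC stepping,
# using the B0 = 02 / C0 = 03 values from the Intel spec update.
decode_stepping() {
    rev=${1#0x}          # tolerate a leading 0x
    rev=${rev#0X}
    case "$rev" in
        2|02) echo "B0" ;;
        3|03) echo "C0" ;;
        *)    echo "unknown (rev $1)" ;;
    esac
}

# Usage on the switch itself (needs root and pciutils):
#   decode_stepping "$(setpci -s 00:00.0 8.b)"
```

Anything other than 02 or 03 is reported as unknown rather than guessed at, since the spec update only lists those two values.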

It appears ONIE does not have the setpci command.
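Where setpci is missing, the same byte can in principle be read straight from the kernel's sysfs view of PCI config space. This is an untested sketch as far as ONIE goes: it assumes sysfs is mounted at /sys and that the busybox build ships dd and od with these options. read_rev_id is a made-up helper name.

```shell
#!/bin/sh
# read_rev_id is a hypothetical helper: it prints the byte at offset 8
# (the PCI Revision ID) of a config-space file as two hex digits.
# The default path is the standard Linux sysfs node for bus 0, device 0,
# function 0; whether ONIE exposes it is an assumption.
read_rev_id() {
    cfg=${1:-/sys/bus/pci/devices/0000:00:00.0/config}
    dd if="$cfg" bs=1 skip=8 count=1 2>/dev/null | od -An -tx1 | tr -d ' \t\n'
}

# Usage on the switch:
#   read_rev_id        # 03 would indicate C0, 02 would indicate B0
```

The first 64 bytes of config space are normally readable without root, so this should not need elevated privileges for offset 8.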
 

thefloyd

Neat! Thanks for that! I was able to confirm my 2018-manufacture DX010 is indeed a C0 SoC, and thus I shouldn't have to worry about the Rangeley issues.
 

AveryFreeman

That's really helpful, thanks for sharing your link to your reddit post. Info on these is definitely obscure so it's great when people can source info across disparate locations.

It's helpful you detail getting back ONIE, but isn't basically anyone who buys this thing getting it to run SONiC? (I'm giving you a hard time, of course)

I guess the most important point you make here is that the switches haven't taken a shit. The bug has scared previous owners into unloading them en masse to liquidators, who are fire-selling these originally $4000 switches for like $300-$400 apiece, so for their sake I kinda hope there's something to it, and for everyone else's sake (yours included, of course) there's not.

Did you manage to run the setpci test on them to see which stepping you have? I think that's the "scientific study" you're looking for. If I missed that, apologies in advance.