Bug in Intel Atom C2000 series processors?

smithse79 · Feb 6, 2017

I've not seen it mentioned over here just yet, but there is quite a bit of talk on the various IT /r/'s about something that came out in Intel's 4Q16 earnings statement about a problem in certain processors. It appears looking in this whitepaper: http://www.intel.com/content/dam/ww...ion-updates/atom-c2000-family-spec-update.pdf (AVR54 on page 34) that it's a bug in the Atom C2000 series chips. Most of the discussion so far has been surrounding Cisco equipment that seems to be using these chips. Apparently a lot of their mid to higher-end gear uses them. I'm more concerned because I know a lot of folks around here use them for low-power servers. Has anyone seen much about this outside the Cisco talk?

Patrick · Feb 6, 2017

From what I heard, this was mostly due to a specific implementation of the chips, so I did not think it would impact most STH readers.

But as an FYI - Rangeley CPUs are used in all kinds of networking gear. I would not be surprised if switch management was the #1 market for these things. Not just Cisco; I believe QCT, Arista and others use(d) them.

smithse79 · Feb 6, 2017

The issue I take with it is that this is a Super Scary Bug^tm and Intel has not come clean about it at all. I've been impressed with Cisco's response once they found out about it, but Intel's has left something to be desired.

Drewy · Feb 6, 2017

Care to elaborate on those "specific implementations"?

Patrick · Feb 6, 2017

@Drewy - I am not sure how much I am allowed to talk about publicly on it but there are use cases where it is likely to be a big deal, and use cases where it is unlikely to be an issue from what I understand.

Drewy · Feb 6, 2017

I guess a phone call to Supermicro is on the to-do list for the morning.

smithse79 · Feb 6, 2017

Patrick said:
@Drewy - I am not sure how much I am allowed to talk about publicly on it but there are use cases where it is likely to be a big deal, and use cases where it is unlikely to be an issue from what I understand.

Ah, so there is some NDA info involved. I was curious where this was coming from. I hadn't seen anything from the normal sources about this, but there seemed to be a LOT of scrambling from Cisco and nobody else. I wasn't sure if it was conjecture on your part or if you had seen something we hadn't and weren't sharing. Now I think I know.

Evan · Feb 6, 2017

Do Cisco really use these CPUs except in low-end devices like the ASA5506-X? All the stuff I see seems to use Xeon.

smithse79 · Feb 6, 2017

Evan said:
Do Cisco really use these CPUs except in low-end devices like the ASA5506-X? All the stuff I see seems to use Xeon.

That's not all:

Clock Signal Component Issue

Evan · Feb 6, 2017

From my perspective that is still pretty much low-end stuff, although it does not really matter: the more common the component, the larger the quantity of items to be replaced, assuming all are affected.
(I am sure Cisco sell a lot more ASA5506 etc. than 5585!)

What's the deal with the bug? Is there a fixed version of the hardware / CPU, or is it seemingly present in every C2000 CPU made until now?

I guess they also found the same bug in the C3000, and that's one of the reasons for its delay as well.

nj47 · Feb 6, 2017

From the linked doc:

System May Experience Inability to Boot or May Cease Operation

The SoC LPC_CLKOUT0 and/or LPC_CLKOUT1 signals (Low Pin Count bus clock outputs) may stop functioning.

If the LPC clock(s) stop functioning the system will no longer be able to boot.

So while certainly not ideal, it doesn't sound like it's a security vulnerability with unknown repercussions - rather if it affects you, you will know because your server won't turn back on.
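For anyone wanting to find affected boxes before one fails to boot, here is a minimal sketch, assuming a Linux host where `/proc/cpuinfo` reports a model name like "Intel(R) Atom(TM) CPU  C2758  @ 2.40GHz". The function name and regex are my own for illustration. It only identifies the CPU family; it cannot tell whether a board already has a platform-level fix.

```python
import re

# Matches a /proc/cpuinfo "model name" line for an Atom C2xxx part and
# captures the model number (e.g. "C2758"). The exact string format is an
# assumption based on typical Linux output for these chips.
C2000_PATTERN = re.compile(
    r"^model name\s*:.*\bAtom\b.*\b(C2\d{3})\b", re.MULTILINE
)


def find_c2000_models(cpuinfo_text):
    """Return the sorted, de-duplicated C2xxx model numbers found in the text."""
    return sorted(set(C2000_PATTERN.findall(cpuinfo_text)))


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            models = find_c2000_models(f.read())
    except OSError:
        # Not a Linux host, or /proc is unavailable.
        models = []
    if models:
        print("Atom C2000-series CPU detected:", ", ".join(models))
    else:
        print("No Atom C2000-series CPU detected.")
```

Run it on each server, or feed it `/proc/cpuinfo` output collected over SSH. An empty result means no C2xxx model name was seen, not that the hardware is guaranteed to be unaffected.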

Jon Massey · Feb 7, 2017

El Reg (usual caveats apply) are reporting Synology issues: Intel's Atom C2000 chips are bricking products – and it's not just Cisco hit

smithse79 · Feb 7, 2017

nj47 said:
From the linked doc:

So while certainly not ideal, it doesn't sound like it's a security vulnerability with unknown repercussions - rather if it affects you, you will know because your server won't turn back on.

I'd say less of a security issue and more of a stability issue. It's not like there is a gaping back door; your system just randomly won't turn back on one day... It's MORE secure that way ;-)

Drewy · Feb 7, 2017

I think I'm screwed. Supermicro want RMAs via dealers, and I purchased mine outside the US since I couldn't get them in the U.K.
I'll have to check my legal recourse, but ultimately I guess I'll be running them until they die, if in fact they do.

smithse79 · Feb 7, 2017

Where did you see that SM is doing RMA for this?

Patrick · Feb 7, 2017

Just had it confirmed that Supermicro will do an RMA for the platform level fix if a customer is concerned.

Also confirmed that C2000 series products from Supermicro shipped from Jan 2017 onwards have the platform fix applied.

smithse79 · Feb 7, 2017

Patrick said:
Just had it confirmed that Supermicro will do an RMA for the platform level fix if a customer is concerned.

Also confirmed that C2000 series products from Supermicro shipped from Jan 2017 onwards have the platform fix applied.

Got a link to a white paper I can show my boss? Our physical domain controller is running on a C2758 from 2015

leonroy · Feb 7, 2017

Don't suppose anyone knows whether the issue is exacerbated by something in particular, e.g. heat, power cycling, etc.?

EffrafaxOfWug · Feb 7, 2017

From an AC post at Slashdot from someone who also sounds like they're avoiding an NDA:

Can't post to The Register, since they don't have ACs.

Anyway, the issue is damage to the LPC (low-pin-count) bus clock line. This is a secondary bus where you hang old ISA-style devices, like the system FLASH. If the FLASH is the only thing in there, it will mostly render the system unbootable (so, stuff that never gets power-cycled would just keep going). But LPC can generate interrupts, and one often hangs other crap to that bus, such as i2c controllers for hot-swap bays, motherboard management controllers, and other sensors. In that case, you can expect severe runtime misbehavior.

The issue is caused by *continuous degradation due to use*, so repairing it is easy, if costly: replace the motherboard with a new one under warranty (and even if out of warranty period wherever this kind of "stealth" manufacturing defect is not subject to warranty time period limitations, such as in Brazil). It will "reset" the counter. This is your zero-day solution to the issue.

Depending on time-to-market for the new stepping (hardware revision) B1/C0 of the Atom C2000, you might need an interim solution, which is the "platform-level change", i.e. redesigned board with extra components that work around Intel's hardware design error. As soon as you have these, you start using these to replace any boards returned due to the defect, or start a "recall" to preemptively replace boards.

Depending on the total cost of the board plus other components, you keep the old boards you replaced around, and when revision B1/C0 of the Atom C2000 is out, you BGA-replace them in a factory (about US$ 25 per board in large volumes, if that much), maybe replace any liquid electrolytic capacitors and other crap that ages badly, and use the boards either as new or as refurbished, depending on your corporate/regulatory ethics. This kind of repair almost always really resets the boards MTBF. If Intel supplies the replacement Atoms at no charge, the cost of repair might well be far less than the cost of the production run for boards you'd want to keep around for warranty services, anyway.

Mind you, at 1.5 years per failure, it will be rare the legislation/contract that forces more than one replacement... so, let's hope they don't replace a faulty board with a brand-new virgin but-still-timebombed board. You'd have trouble to replace it a second time if it fails after the warranty period.

Depending on how true that is, it sounds like a design flaw in the C2xxx SoC itself (probably a gate too thin to take the clock voltage over a prolonged period) and thus any device with it will be affected.

Patrick · Feb 7, 2017

Did a post on this. It is all being NDA'd. It is not 18 months per failure like that AC posted. Period.

The Intel Atom C2000 Series Bug - Why it is so quiet

I think I lit a fire under some marketing folks today

Do keep me posted if you hear of any other programs to replace.
