Bug in Intel Atom C2000 series processors?

smithse79

Active Member
Sep 17, 2014
196
33
28
40
I've not seen it mentioned here yet, but there is quite a bit of talk on the various IT subreddits about a problem in certain processors that came up in Intel's 4Q16 earnings statement. Judging by this spec update: http://www.intel.com/content/dam/ww...ion-updates/atom-c2000-family-spec-update.pdf (AVR54 on page 34), it's a bug in the Atom C2000 series chips. Most of the discussion so far has centered on Cisco equipment that seems to use these chips; apparently a lot of their mid- to higher-end gear does. I'm more concerned because I know a lot of folks around here use them for low-power servers. Has anyone seen much about this outside the Cisco talk?
 

Patrick

Administrator
Staff member
Dec 21, 2010
11,883
4,845
113
From what I heard, this was mostly due to a specific implementation of the chips, so I did not think it would impact most STH readers.

But as an FYI, Rangeley CPUs are used in all kinds of networking gear. I would not be surprised if switch management was the #1 market for these things. It is not just Cisco; I believe QCT, Arista and others use(d) them.
 

smithse79

The issue I take with it is that this is a Super Scary Bug^tm and Intel has not come clean about it at all. I've been impressed with Cisco's response once they found out about it, but Intel's has left something to be desired.
 

Patrick

@Drewy - I am not sure how much I am allowed to talk about publicly on it but there are use cases where it is likely to be a big deal, and use cases where it is unlikely to be an issue from what I understand.
 

Drewy

Member
Apr 23, 2016
168
23
18
50
I guess a phone call to Supermicro is on the to-do list for the morning :(
 

smithse79

Patrick said:
"@Drewy - I am not sure how much I am allowed to talk about publicly on it but there are use cases where it is likely to be a big deal, and use cases where it is unlikely to be an issue from what I understand."
Ah, so there is some NDA info involved. I was curious where this was coming from. I hadn't seen anything from the normal sources about this, but there seemed to be a LOT of scrambling from Cisco and nobody else. I wasn't sure if it was conjecture on your part or if you had seen something we hadn't and weren't sharing. Now I think I know.
 

Evan

Well-Known Member
Jan 6, 2016
3,026
499
83
Do Cisco really use these CPUs except in low-end devices like the ASA5506-X? All the stuff I see seems to use Xeons.
 

Evan

From my perspective those are still mostly low-end products, although that does not really matter: the more common the component, the more massive the quantity of items to be replaced, assuming all are affected?
(I am sure Cisco sell a lot more ASA5506 etc. than 5585!)

What's the deal with the bug? Is there a fixed version of the hardware/CPU, or is it seemingly present in every C2000 CPU made so far?

I guess they also found the same bug in the C3000, and that's one of the reasons for the delay as well.
 

nj47

New Member
Jan 2, 2016
13
4
3
28
From the linked doc:

System May Experience Inability to Boot or May Cease Operation

The SoC LPC_CLKOUT0 and/or LPC_CLKOUT1 signals (Low Pin Count bus clock outputs) may stop functioning.

If the LPC clock(s) stop functioning the system will no longer be able to boot.
So while certainly not ideal, it doesn't sound like it's a security vulnerability with unknown repercussions - rather if it affects you, you will know because your server won't turn back on.
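For anyone unsure whether a given Linux box is even running one of these chips, here is a minimal sketch that looks for a C2xxx model number in /proc/cpuinfo. The model-name format in the comment (for example "Intel(R) Atom(TM) CPU  C2758  @ 2.40GHz") is an assumption about how these parts report themselves, and `find_c2000_model` is a hypothetical helper name. It only tells you the CPU family; it says nothing about whether the LPC clock has degraded.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed model-name format, e.g. "Intel(R) Atom(TM) CPU  C2758  @ 2.40GHz".
C2000_PATTERN = re.compile(r"Atom\(TM\)\s+CPU\s+(C2\d{3})\b")


def find_c2000_model(cpuinfo_text: str) -> Optional[str]:
    """Return the C2xxx model number found in /proc/cpuinfo-style text, or None."""
    for line in cpuinfo_text.splitlines():
        # Only inspect the "model name" lines; other fields are irrelevant here.
        if line.lower().startswith("model name"):
            match = C2000_PATTERN.search(line)
            if match:
                return match.group(1)
    return None


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only
    if path.exists():
        model = find_c2000_model(path.read_text())
        print(model or "no Atom C2000 model name found")
    else:
        print("/proc/cpuinfo not available on this system")
```

A match only means the system is in the affected family per the spec update; whether a particular board has the platform-level workaround is a question for the vendor.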
 

smithse79

nj47 said:
"So while certainly not ideal, it doesn't sound like it's a security vulnerability with unknown repercussions - rather if it affects you, you will know because your server won't turn back on."
I'd say less of a security issue and more of a stability issue. It's not like there is a gaping back door; your system just randomly won't turn back on one day... It's MORE secure that way ;-)
 

Drewy

I think I'm screwed. Supermicro want RMAs via dealers, and I purchased mine out of the US since I couldn't get them in the U.K.
I'll have to check my legal recourse, but ultimately I guess I'll be running them until they die, if in fact they do.
 

Patrick

Just had it confirmed that Supermicro will do an RMA for the platform level fix if a customer is concerned.

Also confirmed that C2000 series products from Supermicro shipped from Jan 2017 onwards have the platform fix applied.
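If you have more than a couple of these boxes, the ship-date cut-off above can be turned into a quick triage list. This is only a sketch under stated assumptions: the cut-off is taken to be 1 Jan 2017 based on the statement above, it applies to Supermicro products only, and the host names and ship dates in `inventory` are made up for illustration.

```python
from datetime import date

# Cut-off taken from the statement above: Supermicro C2000 products shipped
# from Jan 2017 onwards are said to have the platform fix applied.
# Treating that as 1 Jan 2017 is an assumption.
PLATFORM_FIX_CUTOFF = date(2017, 1, 1)


def needs_followup(ship_date: date) -> bool:
    """True if the board shipped before the assumed platform-fix cut-off."""
    return ship_date < PLATFORM_FIX_CUTOFF


# Hypothetical inventory: host name -> ship date of its Supermicro C2000 board.
inventory = {
    "dc01": date(2015, 6, 1),
    "fw02": date(2017, 3, 15),
}

# Hosts worth raising with the vendor (or your dealer) about an RMA.
flagged = sorted(name for name, shipped in inventory.items() if needs_followup(shipped))
print(flagged)  # ['dc01']
```

Boards from other vendors would need their own cut-off dates, if the vendor publishes one at all.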
 

smithse79

Patrick said:
"Just had it confirmed that Supermicro will do an RMA for the platform level fix if a customer is concerned.

Also confirmed that C2000 series products from Supermicro shipped from Jan 2017 onwards have the platform fix applied."
Got a link to a white paper I can show my boss? Our physical domain controller is running on a C2758 from 2015.
 

leonroy

Member
Oct 6, 2015
62
7
8
39
Don't suppose anyone knows whether the issue is exacerbated by anything in particular, e.g. heat, power cycling, etc.?
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,197
405
83
From an AC post at Slashdot, from someone who also sounds like they're avoiding an NDA:
Can't post to The Register, since they don't have ACs.

Anyway, the issue is damage to the LPC (low-pin-count) bus clock line. This is a secondary bus where you hang old ISA-style devices, like the system FLASH. If the FLASH is the only thing in there, it will mostly render the system unbootable (so, stuff that never gets power-cycled would just keep going). But LPC can generate interrupts, and one often hangs other crap to that bus, such as i2c controllers for hot-swap bays, motherboard management controllers, and other sensors. In that case, you can expect severe runtime misbehavior.

The issue is caused by *continuous degradation due to use*, so repairing it is easy, if costly: replace the motherboard with a new one under warranty (and even if out of warranty period wherever this kind of "stealth" manufacturing defect is not subject to warranty time period limitations, such as in Brazil). It will "reset" the counter. This is your zero-day solution to the issue.

Depending on time-to-market for the new stepping (hardware revision) B1/C0 of the Atom C2000, you might need an interim solution, which is the "platform-level change", i.e. redesigned board with extra components that work around Intel's hardware design error. As soon as you have these, you start using these to replace any boards returned due to the defect, or start a "recall" to preemptively replace boards.

Depending on the total cost of the board plus other components, you keep the old boards you replaced around, and when revision B1/C0 of the Atom C2000 is out, you BGA-replace them in a factory (about US$ 25 per board in large volumes, if that much), maybe replace any liquid electrolytic capacitors and other crap that ages badly, and use the boards either as new or as refurbished, depending on your corporate/regulatory ethics. This kind of repair almost always really resets the boards MTBF. If Intel supplies the replacement Atoms at no charge, the cost of repair might well be far less than the cost of the production run for boards you'd want to keep around for warranty services, anyway.

Mind you, at 1.5 years per failure, it will be rare the legislation/contract that forces more than one replacement... so, let's hope they don't replace a faulty board with a brand-new virgin but-still-timebombed board. You'd have trouble to replace it a second time if it fails after the warranty period.
Depending on how true that is, it sounds like a design flaw in the C2xxx SoC itself (probably a gate too thin to take the clock voltage over a prolonged period) and thus any device with it will be affected.