Atom Bay Trail dying because of LPC bus design flaw?


Petr

New Member
Apr 22, 2017
9
5
3
41
Hello.

Intel recently published PCN 116196-00 for the Atom E3800 series CPUs, stating that they are going to transition to the D-1 stepping to mitigate LPC bus degradation issues:
http://qdms.intel.com/dm/i.aspx/BF417A50-F5AE-4179-8393-08083292AC0D/PCN116196-00.pdf

Summary of changes of the D-1 stepping:
1. Intel identified possible circuit design issues in the LPC bus, USB2.0 LS/FS and SD Card logic which may result in degradation of the LPC bus, USB2.0 LS/FS and SD Card signals over time at a rate higher than Intel's quality goals. The D-1 stepping dispositions these possible issues.


If you look at the Specification Update for the mentioned series:
https://www.intel.com/content/dam/w...ion-updates/atom-e3800-family-spec-update.pdf

... you will find this:

VLI89 System May Experience Inability to Boot or May Cease Operation

Problem: Under certain conditions where activity is high for several years the LPC, USB (low speed and full speed) and SD Card circuitry may stop functioning in the outer years of use.

Implication: LPC circuitry that stops functioning may cause operation to cease or inability to boot. SD Card or USB circuitry that stops functioning may cause SD Cards to be unrecognized or Low Speed or Full Speed USB devices to not function. Intel has only observed this behavior in simulation. Designs that implement the LPC interface at the 1.8V signal voltage are not affected by the LPC part of this erratum.

Workaround: Firmware code changes for LPC circuitry and mitigations for SD Card & USB circuitry have been identified and may be implemented for this erratum.

Status: For the steppings affected, see the Summary Tables of Changes



Does it look familiar? Yes, it is the same issue that killed Atom C2000 - the famous AVR54 bug mentioned in this Specification Update:
https://www.intel.com/content/dam/w...ion-updates/atom-c2000-family-spec-update.pdf

The Atom C2000 is a server product expected to run 24/7, so if there is a design flaw it will likely be hit first. The Atom E3800 is an embedded design often used to run industrial machines - for example, the Intel NUC DE3815TYBE, which uses it, has I2C bus headers of the kind common on devices controlling sensors, motors etc. Those Atom E3800s are therefore also likely to be in use for significant amounts of time.

Now the possibly big thing here is that if you look at the Specification Update for Celeron and Pentium N- and J- series:
https://www.intel.com/content/dam/w...n2820-n2815-n2806-j1850-j1750-spec-update.pdf

...you will find that most of the documented bugs are the same as on the Atom E3800 series (although listed in a different order and under different names). Given all this, it is hard to believe that the Bay Trail products are not all based on a single design and therefore all affected by this LPC bus design flaw. The LPC flaw is not (yet) mentioned in the consumer products' specification update, maybe because those parts are not expected to be used as heavily? But if the issue is there and someone plans to use those consumer products as a home server or over a longer time, he/she might be impacted the same way the Atom C2000 and E3800 are.

I hope Intel will clarify this soon, as this might be a much bigger deal than the Atom C2000 fiasco.
 

Patrick

Administrator
Staff member
Dec 21, 2010
12,511
5,792
113
@Petr - I sent Intel a request to comment/ clarify and will keep our readers updated.
 

Evan

Well-Known Member
Jan 6, 2016
3,346
598
113
I am going to assume that, since the flaw is maybe not as bad as on the C2000 (i.e. it happens less often because of the use case), and since consumer parts run less 24x7 and have a shorter expected lifetime (i.e. they are not 7-year enterprise lifecycle parts), it will be a case of out of luck, sorry, no replacement or fix.

Let’s see how Intel plays this....
On the enterprise side I would assume something more like the C2000 approach.
 

BlueFox

Legendary Member Spam Hunter Extraordinaire
Oct 26, 2015
2,059
1,478
113
Atom E38xx is an enterprise (embedded) part, not a consumer part. I'd expect the kind of device that you'd find these in to be on 24/7 and generally expected to seldom be touched or serviced. Expected service life would be far longer than anything with an Atom C2xxx CPU.

Here's hoping it's an easy fix as I've been looking at buying a few devices with these CPUs.
 

Evan

Well-Known Member
Jan 6, 2016
3,346
598
113
The LPC flaw is not (yet) mentioned in the consumer products' specification update, maybe because those are not expected to be used that much? But if the issue is there and someone plans to use those consumer products for home server or for longer time, he/she might be impacted the same way Atoms C2000 and E3800 are impacted.
@BlueFox I was actually answering @Petr's comments about the other consumer CPUs, e.g. the J1900 and so on...
As you say, I expect the enterprise versions like the E3800 series will indeed be replaced, like the C2000. Or at least that's the hope.
 

BlueFox

Legendary Member Spam Hunter Extraordinaire
Oct 26, 2015
2,059
1,478
113
Didn't realize this affected other stuff. Have an industrial MSI barebone with a J1900 at home as my router. That's going to be fun to replace. :(
 

Terry Kennedy

Well-Known Member
Jun 25, 2015
1,140
594
113
New York City
www.glaver.org
I hope Intel will clarify on this soon as this might be much bigger deal than with the Atom C2000 fiasco.
And remember, a few C2000 products needed to be redesigned because Intel removed functionality as part of the "fix". From the C2000 Specification Update: "All of the LPC interface signals (Table 10) are defined as muxed with GPIO signals. This implies that if the LPC interface is not used in your design these signals can be GPIO signals. This specification change removes the muxed support of GPIO signals for all LPC signals except LPC_CLKOUT1. These signals must be left in their default (LPC) state and not de-selected via software to be GPIO pins. Bits [26:21] and bits [29:28] in register SC_USE_SEL should not be set to 1."
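The bit ranges in that quote are easy to misread, so here is a minimal Python sketch of the mask they describe. It only covers the arithmetic: how the SC_USE_SEL value is actually read on a given board is not covered by the quote, so the value passed in here is a hypothetical input, and the function names are made up for illustration.

```python
def bit_range_mask(high: int, low: int) -> int:
    """Return a mask with bits low..high (inclusive) set."""
    return ((1 << (high - low + 1)) - 1) << low


# Bits [26:21] and [29:28] of SC_USE_SEL, per the quoted specification change.
LPC_GPIO_SELECT_MASK = bit_range_mask(26, 21) | bit_range_mask(29, 28)


def violates_lpc_rule(sc_use_sel_value: int) -> bool:
    """True if any of the bits that 'should not be set to 1' is set."""
    return bool(sc_use_sel_value & LPC_GPIO_SELECT_MASK)


if __name__ == "__main__":
    print(hex(LPC_GPIO_SELECT_MASK))
    print(violates_lpc_rule(1 << 27))  # bit 27 is outside both ranges
    print(violates_lpc_rule(1 << 21))  # bit 21 is inside [26:21]
```

Bit 27 is deliberately left out of the mask, since the quote names two separate ranges and skips it.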

I have big concerns:

1) This is another "signal integrity degradation" type of design error. This appears to be the same class of error as the Cougar Point SATA3 failure. The last VLSI chips that were hand-routed (AFAIK) were the DEC Alphas, so this has to have been an undetected fault in their auto-router rules package. If this is indeed the same class of error, it is disturbing that the same thing happened 5+ years after that.

2) Similarly, if this was detected in one family of chips, why weren't the other chips that share the same base architecture checked at the same time? Or were they, and Intel decided to keep quiet about it in the hope that nobody would notice?
 

Petr

New Member
Apr 22, 2017
9
5
3
41
I did some research on this and found it likely also impacts 14nm Goldmont CPUs. There is a Specification Update for the Pentium N/J4000 and Celeron N/J3000 series (Apollo Lake):
https://www.intel.com/content/dam/w...n-n-series-j-series-datasheet-spec-update.pdf

The erratum APL46 is again "System May Experience Inability to Boot or May Cease Operation", with this description:
"Under certain conditions where activity is high for several years the LPC, RTC, SD Card and GPIO Termination Circuitry may stop functioning in the outer years of use."

No fix is planned for affected B0 and B1 steppings.


What is interesting is that the Atom C3000 is not affected - https://www.intel.com/content/dam/w...ion-updates/atom-c3000-family-spec-update.pdf

Neither is the 14nm Goldmont Plus architecture (Pentium N/J5000 and Celeron N/J4000 series) - https://www.intel.com/content/dam/w...product-briefs/silver-celeron-spec-update.pdf


It seems this bug was introduced in Silvermont and impacted all those CPUs (with the Atom C2000 hit the hardest because of its 24/7 use case), then was carried over to the 14nm products; after it was found, Intel delayed the introduction of the Atom C3000 to fix it. All CPUs introduced in H2/2017 have this fixed; older CPUs either received a new stepping (Atom C2000, Atom E3800) or are left as they are with some more or less working software mitigations - likely some power-off function (consumer products). Now we know why it took Intel half a year to deliver the Atom C3000 after it was announced last year.
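To keep the families straight, the summary above can be written as a small lookup. This is only a sketch of what posters in this thread report, not an official Intel list: the table covers just a handful of model numbers named here, the J1900 entry is a suspicion rather than a documented erratum, and the function name and status wording are made up for illustration.

```python
import re

# Status per model as described in this thread (an assumption-laden summary,
# not taken from any single Intel document).
THREAD_STATUS = {
    "C2758": "Atom C2000: erratum AVR54, new stepping issued",
    "E3815": "Atom E3800: erratum VLI89, D-1 stepping issued",
    "J1900": "Bay Trail consumer part: suspected, not in its spec update",
    "J3455": "Apollo Lake: erratum APL46, no fix planned for B0/B1",
    "J4105": "Goldmont Plus: erratum not listed",
    "J5005": "Goldmont Plus: erratum not listed",
}

# Model numbers here look like one capital letter followed by four digits.
MODEL_RE = re.compile(r"\b([A-Z]\d{4})\b")


def thread_status(model_name: str) -> str:
    """Map a CPU model string (e.g. from /proc/cpuinfo) to the thread's summary."""
    match = MODEL_RE.search(model_name)
    if match is None:
        return "no model number found"
    return THREAD_STATUS.get(match.group(1), "not discussed in this thread")


if __name__ == "__main__":
    print(thread_status("Intel(R) Celeron(R) CPU J3455 @ 1.50GHz"))
    print(thread_status("Intel(R) Atom(TM) CPU  C2758  @ 2.40GHz"))
```

A model that is missing from the table is reported as "not discussed in this thread", which says nothing either way about whether it is affected.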
 

Petr

New Member
Apr 22, 2017
9
5
3
41
Here is something about the impacts on the congatec site:
https://www.congatec.com/fileadmin/...Documents/BYT_xA30_Errata_LPC_USB_SD_Card.pdf

LPC bus
There is a firmware change which enables something called the "LPC_CLKRUN# feature", which reduces LPC bus utilization and therefore decreases the rate of degradation (but does not eliminate it).

Also, the Serial IRQ Mode has to be switched from Continuous to Quiet, which limits usage (as described here https://opencores.org/websvn/filedetails?repname=wb_lpc&path=/wb_lpc/trunk/doc/wb_lpc.pdf).

Some systems, however, are not compatible with those workarounds or may not work correctly, as described in the congatec document and also in this discussion related to Serial IRQ Quiet mode on ICH4: "High speed serial driver stops writing data".


USB
Limit active time to a maximum of 10%. There is a 50TB life expectancy per port.


SD card interface
Do not use SD card as boot device. Remove SD card when not in use.
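
To put the USB numbers into perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the 50TB figure is a simple cumulative traffic budget per port and that the 10% active time applies per day; both are my reading of the summary above, not something the congatec document is quoted as saying, and the function names are made up.

```python
USB_BUDGET_TB = 50.0         # life expectancy per port, as summarized above
MAX_ACTIVE_FRACTION = 0.10   # maximum active time, as summarized above


def years_until_budget_used(tb_per_day: float, budget_tb: float = USB_BUDGET_TB) -> float:
    """Years until the traffic budget is used up at a constant daily rate."""
    if tb_per_day <= 0:
        raise ValueError("tb_per_day must be positive")
    return budget_tb / tb_per_day / 365.0


def full_drive_copies(drive_tb: float, budget_tb: float = USB_BUDGET_TB) -> float:
    """How many times a drive of the given size could be filled within the budget."""
    return budget_tb / drive_tb


def max_active_hours_per_day(fraction: float = MAX_ACTIVE_FRACTION) -> float:
    """Hours per day a port may be active under the active-time limit."""
    return 24.0 * fraction


if __name__ == "__main__":
    print(round(years_until_budget_used(0.05), 2))  # 50 GB of traffic per day
    print(full_drive_copies(10.0))                  # 10TB external drive
    print(round(max_active_hours_per_day(), 1))
```

With these assumptions a nightly 50GB backup over one port would use up the budget in under three years, which is well inside the 24/7 lifetimes discussed in this thread.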
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,394
511
113
I was about to point out that the erratum info above was already all over Wikipedia, but looks like it was Petr who made the changes already so well done for pushing it to a wider audience :)

The LPC, USB and SD Card buses circuitry degradation issues also apply to other Bay Trail processors such as Intel Celeron J1900 and N2800/N2900 series.[21] and also to Pentium N3500, J2850, J2900 series and Celeron J1800 and J1750 series as those are based on the same affected silicon.

Cisco stated failures of Atom C2000 processors can occur as early as 18 months of use with higher failure rates occurring after 36 months.[22]

Mitigations[23] were found to limit impact on systems. Firmware update for the LPC bus called LPC_CLKRUN# reduces the utilization of the LPC interface what in turn decreases (but not eliminates) LPC bus degradation - some systems are however not compatible with this new firmware. USB should have a maximum of 10% active time and there is a 50TB transmit traffic life expectancy over the lifetime of the port. It is recommended not to use SD card as a boot device and to remove the card from the system when not in use.

Intel admitted on the issue stating the impact on consumers depends on use condition.[24]
Not sure how this works in countries in the rest of the EU, but in the UK at least, if the manufacturer has acknowledged a design flaw, refuses to fix it, and it goes wrong within the device lifetime, they're on the hook for it. The important definition of device lifetime isn't up to the manufacturer either; it's decided by the courts, and it's very common for defects well outside of warranty (up to six years via the small claims court) to be compensated. I'm saying this as someone who has these CPUs in use at home and at work; I'm keeping a file on all of this info for if and when they break.

I assume the only real "fix" is to switch to systems running Goldmont (Apollo Lake + Denverton) chips when they come out? Wish the embedded Epyc 3000 would start to make more waves...
 

Evan

Well-Known Member
Jan 6, 2016
3,346
598
113
I don’t think the issue is Intel replacing failed consumer chips; the difference is that it likely won’t be proactive like the C2000 enterprise recall.

Wow, 50TB for USB will be nothing for some people; that’s just filling a 10TB external drive 5 times. And people with 10TB SATA could just use USB drives on rotation for backup.
 

ullbeking

Active Member
Jul 28, 2017
506
70
28
45
London
Oh crap. I recently received an order of 3x Intel 6th gen NUCs using the Celeron J3455. I am planning to make a small, quiet homelab cluster from them. I was really impressed with, and excited by, the J3455.

Am I now on the hook for these or is it possible to RMA them? I ordered from Amazon and the order was fulfilled by a private seller (CCL). I am in the UK. It looks as though Amazon is offering me an option to return these to the seller. I should just take them up on this, return them, and get a refund, right? (While/if I've still got the opportunity.) And then re-evaluate and try to find different low-power, quiet, SFF machines that are suitable for the home. Any opinions?
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,394
511
113
Depends on the definition of "recently", but distance selling regulations mean you get a no-questions-asked refund window for 14 days from when you receive the goods - you can return them for a full refund in that time (and stipulate to the seller that you've just learned about the above flaw).

Outside of that 14 day window it's a bit more of a grey area, but you can probably use the above as evidence of the goods being faulty (i.e. having a manufacturer-acknowledged design flaw) and not wishing to continue using them.
 

ullbeking

Active Member
Jul 28, 2017
506
70
28
45
London
Thanks for the clarification. Yes, I am aware of the 14-day window, but it’s been a week or two longer than that. Some retail stores are very strict about it, but as you mentioned, as the manufacturer has ack’ed the serious flaw, I would expect a little leeway.

Moreover, as I purchased through Amazon, they provide a channel for returning things and I think they might even extend the returns window. I submitted my request, which was accepted by Amazon, so now I need to print a few return labels and hand the unopened package to the post office, which will hopefully make things happen.
 

mattlongman

New Member
Jul 12, 2018
6
2
3
I feel a bit late to this one, having received an AsRock J3455M yesterday (and only here searching for it to try to work out why I can't boot to a RAID card).

Implication: LPC circuitry that stops functioning may cause operation to cease or inability to boot. SD Card or USB circuitry that stops functioning may cause SD Cards to be unrecognized or Low Speed or Full Speed USB devices to not function.
I'm not so worried about USB circuitry (aside from an initial boot from USB to install, I won't have any USB devices, and there is no SD card slot), but now I'm concerned about installing this board in a system that will be on 24/7 for 2 years or so.

Then again, this is to replace another old m-atx board, and a relatively inexpensive swap. Moving to a different (and hopefully unaffected?) low power embedded board would mean additional costs (new case + DDR4 SO-DIMM).

How worried should I be?
 

ullbeking

Active Member
Jul 28, 2017
506
70
28
45
London
@mattlongman I don't know how worried you should be in practice.

I think the AsRock J3455M looks like a fantastic board, but knowing that it has those faults means that I won't buy it. The Celeron J3455 has fantastic specs on paper. I actually purchased three Intel NUCs with this CPU to make myself a homelab cluster, but I returned them for a refund as soon as I found out about this issue (just in the nick of time according to their returns policy).

But to put things into another perspective, I have several colleagues who have been using the Supermicro A1SRi-2758F in demanding business environments and who have never experienced the dreaded AVR.54 defect (see the article "Intel's Atom C2000 chips are bricking products – and it's not just Cisco hit"). I don't personally know anybody who has experienced the effects of this defect.

You have to weigh up the consequences of what happens if it does manifest. For example, I'm going to RMA all of my 2758F mainboards because they will be sent to a remote colo. If the bug does manifest then it's going to cost me downtime, a lot of money to repair (including possibly a flight to another country), and stress due to the nuisance of the problem. Aside from this defect, I think the C2000 series boards are fantastic machines.
 

mattlongman

New Member
Jul 12, 2018
6
2
3
If the bug does manifest then it's going to cost me downtime, a lot of money to repair (including possible a flight to another country), and stress due to the nuisance of the problem. Aside from this defect, I think the C2000 series boards are fantastic machines.
Thanks @ullbeking - good to know the reasoning behind the returns.

I think with the price of this board vs a more expensive board and other components, combined with cost of returning it (to Germany from UK), I'll take the risk. It's for a home NAS, storage as pooled NTFS drives, and anything critical replicated elsewhere.
 

mattlongman

New Member
Jul 12, 2018
6
2
3
I'll take the risk. It's for a home NAS, storage as pooled NTFS drives, and anything critical replicated elsewhere.
Just a quick update on this as it's kind of related: I had another look, and the J4105M seems to be an identical board (besides the faster processor and DDR4 memory). I ordered it (same price as the J3455M), but the big issue I found is that there's no option anywhere for CSM. I believe it should be there according to the spec, but neither the v1.00 nor the v1.30 firmware shows it. This means I can't boot from a PCI-E RAID card, and after some hours of fiddling with Windows Server software RAID, I decided it wasn't for me.

Unless there is actually a way of enabling the compatibility mode, I'll have to return the board and memory and stick with the J3455M. Shame really as the J4105 seems to be a nice bump over the J3455 (not to mention the LPC flaw).

As a side note, the J4105 processor (and J5005) supposedly only support 8GB memory (according to Intel), but both seem to support up to 32GB in practice.
 

ullbeking

Active Member
Jul 28, 2017
506
70
28
45
London
Thanks for the update. I’m considering the J4105 too so your notes are super helpful. Pardon my ignorance, but what exactly is CSM and how does this CPU/board differ from the J3455 in this regard?