Atom Bay Trail dying because of LPC bus design flaw?

Discussion in 'Processors and Motherboards' started by Petr, Apr 11, 2018.

  1. Petr

    Petr New Member

    Joined:
    Apr 22, 2017
    Messages:
    5
    Likes Received:
    5
    Hello.

    Intel recently published PCN 116196-00 for the Atom E3800 series CPUs, stating that they are going to transition to the D-1 stepping to mitigate LPC bus degradation issues:
    http://qdms.intel.com/dm/i.aspx/BF417A50-F5AE-4179-8393-08083292AC0D/PCN116196-00.pdf

    Summary of changes of the D-1 stepping:
    1. Intel identified possible circuit design issues in the LPC bus, USB2.0 LS/FS and SD Card logic which may result in degradation of the LPC bus, USB2.0 LS/FS and SD Card signals over time at a rate higher than Intel's quality goals. The D-1 stepping dispositions these possible issues.


    If you look at the Specification Update for the mentioned series:
    https://www.intel.com/content/dam/w...ion-updates/atom-e3800-family-spec-update.pdf

    ... you will find this:

    VLI89 System May Experience Inability to Boot or May Cease Operation

    Problem: Under certain conditions where activity is high for several years the LPC, USB (low speed and full speed) and SD Card circuitry may stop functioning in the outer years of use.

    Implication: LPC circuitry that stops functioning may cause operation to cease or inability to boot. SD Card or USB circuitry that stops functioning may cause SD Cards to be unrecognized or Low Speed or Full Speed USB devices to not function. Intel has only observed this behavior in simulation. Designs that implement the LPC interface at the 1.8V signal voltage are not affected by the LPC part of this erratum.

    Workaround: Firmware code changes for LPC circuitry and mitigations for SD Card & USB circuitry have been identified and may be implemented for this erratum.

    Status: For the steppings affected, see the Summary Tables of Changes



    Does it look familiar? Yes, it is the same issue that killed Atom C2000 - the famous AVR54 bug mentioned in this Specification Update:
    https://www.intel.com/content/dam/w...ion-updates/atom-c2000-family-spec-update.pdf

    The Atom C2000 is a server product line expected to run 24/7, so if there is a design flaw, it is likely to be affected first. The Atom E3800 is an embedded design often used to run industrial machines. For example, the Intel NUC DE3815TYBE, which uses it, has I2C bus headers of the kind commonly used by devices controlling sensors, motors, etc. E3800 Atoms are therefore also likely to be in use for long periods of time.

    Now the possibly big thing here is that if you look at the Specification Update for Celeron and Pentium N- and J- series:
    https://www.intel.com/content/dam/w...n2820-n2815-n2806-j1850-j1750-spec-update.pdf

    ...you will find that most of the documented bugs are the same as on the Atom E3800 series (although listed in a different order and under different names). With all this, it is hard to believe that the Bay Trail products are not all based on a single design and are not all affected by this LPC bus design flaw. The LPC flaw is not (yet) mentioned in the consumer products' specification update, maybe because those are not expected to be used that much? But if the issue is there and someone plans to use those consumer products as a home server or for a longer time, he/she might be impacted the same way the Atom C2000 and E3800 are impacted.

    I hope Intel clarifies this soon, as this might be a much bigger deal than the Atom C2000 fiasco.
     
    #1
    MellowTone, Patriot and ecosse like this.
  2. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,045
    Likes Received:
    3,996
    @Petr - I sent Intel a request to comment/clarify and will keep our readers updated.
     
    #2
  3. Evan

    Evan Well-Known Member

    Joined:
    Jan 6, 2016
    Messages:
    2,184
    Likes Received:
    301
    I am going to assume that, since the flaw is maybe not as bad as on the C2000 (i.e. it happens less often because of the use case), consumer parts run 24x7 less often, and consumer parts have a shorter expected lifetime (i.e. they are not 7-year enterprise lifecycle parts), it will be a case of: out of luck, sorry, no replacement or fix.

    Let’s see how Intel plays this...
    On the enterprise side I would assume something more like the C2000 approach.
     
    #3
  4. BlueFox

    BlueFox Active Member

    Joined:
    Oct 26, 2015
    Messages:
    445
    Likes Received:
    158
    Atom E38xx is an enterprise (embedded) part, not a consumer part. I'd expect the kind of device that you'd find these in to be on 24/7 and generally expected to seldom be touched or serviced. Expected service life would be far longer than anything with an Atom C2xxx CPU.

    Here's hoping it's an easy fix as I've been looking at buying a few devices with these CPUs.
     
    #4
  5. Evan

    Evan Well-Known Member

    Joined:
    Jan 6, 2016
    Messages:
    2,184
    Likes Received:
    301
    @BlueFox I was actually answering @Petr's comments about other consumer CPUs, e.g. the J1900 and so on...
    As you say, I expect the enterprise versions like the E3800 series will indeed be replaced etc. like the C2000. Or at least that's the hope.
     
    #5
  6. Petr

    Petr New Member

    Joined:
    Apr 22, 2017
    Messages:
    5
    Likes Received:
    5
    Thank you Patrick. We will see what they say.
     
    #6
    Patrick likes this.
  7. BlueFox

    BlueFox Active Member

    Joined:
    Oct 26, 2015
    Messages:
    445
    Likes Received:
    158
    Didn't realize this affected other stuff. Have an industrial MSI barebone with a J1900 at home as my router. That's going to be fun to replace. :(
     
    #7
  8. Terry Kennedy

    Terry Kennedy Well-Known Member

    Joined:
    Jun 25, 2015
    Messages:
    953
    Likes Received:
    417
    And remember, a few C2000 products needed to be redesigned because Intel removed functionality as part of the "fix". From the C2000 Specification Update: "All of the LPC interface signals (Table 10) are defined as muxed with GPIO signals. This implies that if the LPC interface is not used in your design these signals can be GPIO signals. This specification change removes the muxed support of GPIO signals for all LPC signals except LPC_CLKOUT1. These signals must be left in their default (LPC) state and not de-selected via software to be GPIO pins. Bits [26:21] and bits [29:28] in register SC_USE_SEL should not be set to 1."
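
    The bit ranges in that quote can be turned into a mask, which makes it easier to check a register value by hand. This is a minimal sketch, assuming you already have the 32-bit SC_USE_SEL value from somewhere; the function names are made up for illustration, and nothing here reads real hardware.

```python
# Sketch only: build a mask for the SC_USE_SEL bits named in the quoted
# specification change and test a register value against it.
# The helper names are hypothetical; only the bit ranges come from the quote.

def bit_range_mask(high: int, low: int) -> int:
    """Return a mask with bits low..high (inclusive) set."""
    return ((1 << (high - low + 1)) - 1) << low

# Bits [26:21] and [29:28] "should not be set to 1" per the quoted text.
RESTRICTED_BITS = bit_range_mask(26, 21) | bit_range_mask(29, 28)

def violates_lpc_restriction(sc_use_sel: int) -> bool:
    """True if any of the restricted bits is set in the given value."""
    return (sc_use_sel & RESTRICTED_BITS) != 0

print(hex(RESTRICTED_BITS))  # 0x37e00000
```

    Note that bit 27 is deliberately left out of the mask, since the quote lists [26:21] and [29:28] as two separate ranges.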

    I have big concerns:

    1) This is another "signal integrity degradation" type of design error. This appears to be the same class of error as the Cougar Point SATA3 failure. The last VLSI chips that were hand-routed (AFAIK) were the DEC Alpha chips, so this has to have been an undetected fault in their auto-router rules package. If this is indeed the same class of error, it is disturbing that the same thing happened 5+ years after that.

    2) Similarly, if this was detected in one family of chips, why weren't the other chips that share the same base architecture checked at the same time? Or were they, and Intel decided to keep quiet about it in the hope that nobody would notice?
     
    #8
  9. Petr

    Petr New Member

    Joined:
    Apr 22, 2017
    Messages:
    5
    Likes Received:
    5
    I did some research on this and found that it likely also impacts 14nm Goldmont CPUs. There is a Specification Update for the Pentium N/J4000 and Celeron N/J3000 series (Apollo Lake):
    https://www.intel.com/content/dam/w...n-n-series-j-series-datasheet-spec-update.pdf

    The erratum APL46 is again "System May Experience Inability to Boot or May Cease Operation", with this description:
    "Under certain conditions where activity is high for several years the LPC, RTC, SD Card and GPIO Termination Circuitry may stop functioning in the outer years of use."

    No fix is planned for affected B0 and B1 steppings.


    What is interesting is that the Atom C3000 is not affected - https://www.intel.com/content/dam/w...ion-updates/atom-c3000-family-spec-update.pdf

    Neither is the 14nm Goldmont Plus architecture (Pentium N/J5000 and Celeron N/J4000 series) - https://www.intel.com/content/dam/w...product-briefs/silver-celeron-spec-update.pdf


    It seems this bug was introduced in Silvermont and impacted all those CPUs (with the Atom C2000 being hit the hardest because of its 24/7 use case), then was carried over to the 14nm products, and after it was found Intel delayed the introduction of the Atom C3000 to fix it. All CPUs introduced in H2/2017 have this fixed; older CPUs either received a new stepping (Atom C2000, Atom E3800) or are left as they are with some more or less working software mitigations, likely some power-off function (consumer products). Now we know why it took Intel half a year to deliver the Atom C3000 after it was announced last year.
     
    #9
  10. Petr

    Petr New Member

    Joined:
    Apr 22, 2017
    Messages:
    5
    Likes Received:
    5
    Here is something about the impact, from the congatec site:
    https://www.congatec.com/fileadmin/...Documents/BYT_xA30_Errata_LPC_USB_SD_Card.pdf

    LPC bus
    There is a firmware change which enables something called the "LPC_CLKRUN# feature", which reduces LPC bus utilization and thereby decreases the rate of degradation (but does not eliminate it).

    Also, the Serial IRQ Mode has to be set to Quiet instead of Continuous, which limits usage (as described here https://opencores.org/websvn/filedetails?repname=wb_lpc&path=/wb_lpc/trunk/doc/wb_lpc.pdf).

    Some systems, however, are not compatible with those workarounds or may not work correctly with them, as described in the congatec document and also in this discussion related to Serial IRQ Quiet mode on ICH4: "High speed serial driver stops writing data".


    USB
    Limit active time to a maximum of 10%. There is a 50TB life expectancy per port.


    SD card interface
    Do not use SD card as boot device. Remove SD card when not in use.
     
    #10
  11. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    669
    Likes Received:
    233
    I was about to point out that the erratum info above was already all over Wikipedia, but looks like it was Petr who made the changes already so well done for pushing it to a wider audience :)

    Not sure how this works in the rest of the EU, but in the UK at least, if the manufacturer has acknowledged a design flaw and refuses to fix it, and it goes wrong within the device lifetime, they're on the hook for it. The important definition of device lifetime isn't up to the manufacturer either; it's decided by the courts, and it's very common for defects well outside of warranty (up to six years via the small claims court) to be compensated. Saying this as someone who has these CPUs in use at home and at work, I'm keeping a file on all of this info for if and when they break.

    I assume the only real "fix" is to switch to systems running Goldmont (Apollo Lake + Denverton) chips when they come out? Wish the embedded Epyc 3000 would start to make more waves...
     
    #11
    Last edited: Apr 14, 2018
  12. Evan

    Evan Well-Known Member

    Joined:
    Jan 6, 2016
    Messages:
    2,184
    Likes Received:
    301
    I don’t think the issue is Intel replacing failed consumer chips; the difference is that it likely won’t be proactive like the C2000 enterprise recall.

    Wow, 50TB for USB will be nothing for some people; that’s just filling a 10TB external drive 5 times. And people with 10TB of SATA storage could easily be using USB drives on rotation for backup.
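
    To put a rough number on that, here is a small back-of-the-envelope sketch. It takes the 50TB per-port figure quoted earlier in this thread at face value and assumes that every byte moved over the port counts against it, which this thread does not actually establish. The weekly backup volume is a made-up example value.

```python
# Rough arithmetic only: how quickly a per-port transfer budget is used up.
# budget_tb is the 50TB figure quoted in this thread; tb_per_week is a
# hypothetical backup volume chosen purely for illustration.

def full_drive_fills(budget_tb: float, drive_tb: float) -> float:
    """How many times a drive of drive_tb can be filled within the budget."""
    return budget_tb / drive_tb

def years_until_budget_used(budget_tb: float, tb_per_week: float) -> float:
    """Years until the budget is exhausted at a steady weekly volume."""
    weeks = budget_tb / tb_per_week
    return weeks / 52.0

print(full_drive_fills(50, 10))  # 5.0
print(round(years_until_budget_used(50, 1), 2))
```

    With a steady 1TB of backups per week through a single port, the budget would last a little under a year under these assumptions.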
     
    #12
  13. ullbeking

    ullbeking Member

    Joined:
    Jul 28, 2017
    Messages:
    183
    Likes Received:
    10
    Oh crap. I recently received an order of 3x Intel 6th gen NUCs using the Celeron J3455. I am planning to make a small, quiet homelab cluster from them. I was really impressed with, and excited by, the J3455.

    Am I now on the hook for these or is it possible to RMA them? I ordered from Amazon and the order was fulfilled by a private seller (CCL). I am in the UK. It looks as though Amazon is offering me an option to return these to the seller. I should just take them up on this, return them, and get a refund, right? (While/if I've still got the opportunity.) And then re-evaluate and try to find different low-power, quiet, SFF machines that are suitable for the home. Any opinions?
     
    #13
  14. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    669
    Likes Received:
    233
    Depends on the definition of "recently", but distance selling regulations mean you get a no-questions-asked refund window for 14 days from when you receive the goods - you can return them for a full refund in that time (and stipulate to the seller that you've just learned about the above flaw).

    Outside of that 14 day window it's a bit more of a grey area, but you can probably use the above as evidence of the goods being faulty (i.e. having a manufacturer-acknowledged design flaw) and not wishing to continue using them.
     
    #14
  15. ullbeking

    ullbeking Member

    Joined:
    Jul 28, 2017
    Messages:
    183
    Likes Received:
    10
    Thanks for the clarification. Yes, I am aware of the 14-day window, but it’s been a week or two longer than that. Some retail stores are very strict about it, but as you mentioned, as the manufacturer has ack’ed the serious flaw, I would expect a little leeway.

    Moreover, as I purchased through Amazon, they provide a channel for returning things and I think they might even extend the returns window. I submitted my request, which was accepted by Amazon, so now I need to print a few return labels and hand the unopened package to the post office, which will hopefully make things happen.
     
    #15
  16. mattlongman

    mattlongman New Member

    Joined:
    Jul 12, 2018
    Messages:
    6
    Likes Received:
    2
    I feel a bit late to this one, having received an AsRock J3455M yesterday (and only here searching for it to try to work out why I can't boot to a RAID card).

    I'm not so worried about USB circuitry (aside from an initial boot from USB to install, I won't have any USB devices, and there is no SD card slot), but now I'm concerned about installing this board in a system that will be on 24/7 for 2 years or so.

    Then again, this is to replace another old m-atx board, and a relatively inexpensive swap. Moving to a different (and hopefully unaffected?) low power embedded board would mean additional costs (new case + DDR4 SO-DIMM).

    How worried should I be?
     
    #16
  17. ullbeking

    ullbeking Member

    Joined:
    Jul 28, 2017
    Messages:
    183
    Likes Received:
    10
    @mattlongman I don't know how worried you should be in practice.

    I think the AsRock J3455M looks like a fantastic board, but knowing that it has those faults means that I won't buy it. The Celeron J3455 has fantastic specs on paper. I actually purchased three Intel NUCs with this CPU to make myself a homelab cluster, but I returned them for a refund as soon as I found out about this issue (just in the nick of time according to their returns policy).

    But to put things into another perspective, I have several colleagues who have been using the Supermicro A1SRi-2758F in demanding business environments and who have never experienced the dreaded AVR.54 defect (see "Intel's Atom C2000 chips are bricking products – and it's not just Cisco hit"). I don't personally know anybody who has experienced the effects of this defect.

    You have to weigh up the consequences of what happens if it does manifest. For example, I'm going to RMA all of my 2758F mainboards because they will be sent to a remote colo. If the bug does manifest then it's going to cost me downtime, a lot of money to repair (including possibly a flight to another country), and stress due to the nuisance of the problem. Aside from this defect, I think the C2000 series boards are fantastic machines.
     
    #17
    Last edited: Jul 24, 2018
    Tha_14 and mattlongman like this.
  18. mattlongman

    mattlongman New Member

    Joined:
    Jul 12, 2018
    Messages:
    6
    Likes Received:
    2
    Thanks @ullbeking - good to know the reasoning behind the returns.

    I think with the price of this board vs a more expensive board and other components, combined with cost of returning it (to Germany from UK), I'll take the risk. It's for a home NAS, storage as pooled NTFS drives, and anything critical replicated elsewhere.
     
    #18
    Tha_14 likes this.
  19. mattlongman

    mattlongman New Member

    Joined:
    Jul 12, 2018
    Messages:
    6
    Likes Received:
    2
    Just a quick update on this as it's kind of related: I had another look, and the J4105M seems to be an identical board (besides the faster processor and DDR4 memory). Ordered it (same price as the J3455M) but the big issue I found is that there's no option anywhere for CSM. I believe it should be there according to the spec, but neither the v1.00 nor the v1.30 firmware shows it. This means I can't boot from a PCI-E RAID card, and after some hours of fiddling with Windows Server software RAID, I decided it wasn't for me.

    Unless there is actually a way of enabling the compatibility mode, I'll have to return the board and memory and stick with the J3455M. Shame really as the J4105 seems to be a nice bump over the J3455 (not to mention the LPC flaw).

    As a side note, the J4105 processor (and J5005) supposedly only support 8GB memory (according to Intel), but both seem to support up to 32GB in practice.
     
    #19
  20. ullbeking

    ullbeking Member

    Joined:
    Jul 28, 2017
    Messages:
    183
    Likes Received:
    10
    Thanks for the update. I’m considering the J4105 too so your notes are super helpful. Pardon my ignorance, but what exactly is CSM and how does this CPU/board differ from the J3455 in this regard?
     
    #20