Beta Ryzen NVidia tester

DWSimmons

tl;dr: ECC on Ryzen AM4 Asus B350 verifies. Drivers and kernel are catching up to ECC, 1080ti, and Ryzen SMT, but manually upgrading worked well for me. Having a spare computer relieves all the pressure of a new build.

Use case: Machine learning, cryptocurrency mining, programming, VM and/or docker, gaming after hours.

Goals: Make some money back from crypto, learn more about Linux, have no hardware hurdles to VM/docker or ML (time being the exception), run all three monitors without headaches

About me: I used to be a network and Windows system administrator so my curiosity was in logging, benchmarking, and automation. My google-fu is strong, my Python is meh, and my Linux is beginner/intermediate. My hardware skills are strong but rusty.

Background: I've been learning Python, part machine learning (ML) and part GUI (PyQt and Tkinter). I had been using an i7-920 but a BIOS update bricked it. I fell back to a junker Eaglelake Core 2 Duo (released Q2 of 2008) that was actually doing fine. It was a Dell though and would not accept a larger video card for machine learning. Additionally, I could not run host and VM simultaneously without bringing the thing down. Enter Ryzen. My wife gave me the thumbs up for a new rig, so I started looking at solutions that would be dev and machine learning capable, a cryptocurrency miner, good for light gaming, and ideally "future proof". (How can anyone know what the future will bring, ergo how can anyone prepare for the unknown…???)

I asked STH for suggestions and they made complete sense for the specs that I gave ( Feedback needed on ML-CryptoMining-Dev box ). Hat tip to @TLN and @MiniKnight . Their recommendation of 2683v3 and X99 made complete sense. So I started down that path and realized that multiple GPUs were actually a "problem". There were very few radial/blower fan 1080ti cards, and those that I liked for price or maker were loud and/or not the best value. If I was going to go with axial/downward blowing fans, then I couldn't put multiple cards side-by-side. On the value claim, there are many radial fan 1060 and 1070 options that I could run side-by-side with low power and good CUDA core count. I started considering 4x1070 in X99. The return on investment was excellent for the prices of cryptocurrency at the time. They are still good but the issue wasn’t ROI per se. ML would thrive on it but for what I’m doing I didn’t really need 4x1070; a single or double 1080ti would suffice.

The real issue came with location. I have an office attached to my bedroom. If the computer was there, then running time would be limited. I could put it downstairs in the living room and run it headless but the location was such that the sound would travel right up the stairs. I could have spent time, energy, and money to dampen the sound but the location might be too small at that point. Also I'd need to run a line through the attic and drop it down through two stories of framing. The third option was by the TV but the starting and stopping due to noise during dinner and TV watching was a consideration. I punted on all of it. One Asus Strix 1080ti would take care of everything, and since I wanted some return on investment without the noise, that worked for all my other considerations.

Okay... so now I don't need 40 lanes of PCIe from X99. What do I need? Well, I still need at least 6 cores, and since I wasn't compiling/making that much, I didn't need many more than that. Also, there'd be plenty of time where the rig is running but I'm not on it. At some point, I stumbled across an article from Hardware Canucks that tested Ryzen with ECC ( ECC Memory & AMD's Ryzen - A Deep Dive ). I also saw 64GB UDIMM for sale from @Spartus . I saw articles here at STH about Proxmox ( How-to Guide - Create a Proxmox VE 5.0 All-in-One with Docker ) and I thought most of the things I wanted to achieve in ML and Python could be done in Docker. This rig might move from a dev box to a dedicated NAS/ML box later but that would be bonus points.


Combine all of the above and I decided to go with a Ryzen 1700 with 64GB UDIMM. I had initially decided on the ASUS Prime X370 or the ASRock X370 for their 10 SATA ports and crowdsourced knowledge on overclocking RAM. RAM frequency was initially a consideration for going 2x8@3200 or 2x16@3000, but after deciding on the 4x16 UDIMM, overclocking became a non-starter. Lack of supply of the motherboards plus the $150 difference pushed me over the edge so I got an Asus B350. I decided that 6 SATA ports + M.2 was sufficient. I knew this build was pushing the limits of my Linux knowledge and the maturity of the Ryzen platform but I dove in anyways...because hacking is fun. I also expected that, once all the kinks were worked out, the initial purchase would solve all or almost all of my issues. I had a working rig so I didn't have any time pressure, which was a great relief.



Actual build:
CPU: Ryzen 1700 (comes with Wraith cooler)
Motherboard: Asus B350
RAM: 64GB, Kingston 4x16 UDIMM, dual rank, CAS15, 2133MHz
GPU: Asus Strix 1080ti overclock
HDD: JBOD, and by JBOD I mean just a bunch of junk HDDs. I have one Samsung 850 EVO 256GB (not installed) and I'll be adding an M.2, but I wanted to build and test everything first.
PSU and case: Corsair 750 PSU, and a nice case I won in a contest from PugetSystems . (I highly recommend everything they do if it fits for you; they are niche, expensive, and awesome. No affiliation with them.)


Some issues I knew about:
Outdated (<4.4) Linux kernels had issues with Ryzen SMT
IOMMU group assignments if I did dual VMs, though not in my initial plans
ECC support on the AM4/B350 was an experiment
My JBOD had all sorts of data from old NTFS Windows installs that hadn't been cleaned up.
My research showed that the Linux benchmarking, logging, and monitoring landscape was very much like the rest of the Linux landscape: powerful, free, and full of variability.
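
For anyone following along, the kernel item above is easy to check up front. This is only a minimal sketch: the function name kernel_at_least and the MIN_KERNEL variable are made up for illustration, the default of 4.10 is simply the kernel this build ends up on later in the post, and it assumes a `sort` that supports -V (GNU coreutils does).

```shell
# Succeed (exit status 0) if kernel version $1 is at least version $2.
# Assumes `sort -V` (version sort) is available.
kernel_at_least() {
    current=${1%%-*}     # drop a distro suffix such as "-19-generic"
    minimum=$2
    lowest=$(printf '%s\n%s\n' "$current" "$minimum" | sort -V | head -n 1)
    [ "$lowest" = "$minimum" ]
}

# MIN_KERNEL is an assumption for illustration, not an official requirement.
MIN_KERNEL=${MIN_KERNEL:-4.10}
if kernel_at_least "$(uname -r)" "$MIN_KERNEL"; then
    echo "kernel $(uname -r) is >= $MIN_KERNEL"
else
    echo "kernel $(uname -r) is older than $MIN_KERNEL"
fi
```

Set MIN_KERNEL to whatever version your own research points to before running it.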

Everything built fine which I was very pleased with since I hadn't made a machine in 10+ years.

ASUS UEFI is supposedly the bee's knees. Umm... yea.... I don’t agree with that sentiment.

On the main splash screen it has a drag and drop boot order. "Hey that's handy for booting from USB drives". It doesn't work. First, if you change the boot order, save and exit, you are prompted with "you have not made any changes to the settings". This is true for a handful of sub-menus as well. It appears that by settings they mean rates, specs, etc. and not order or sequences. Any attempt to put the USB above the first SATA drive results in the USB being skipped and the machine going straight to the boot drive. Upon returning to the UEFI, it shows the change in boot order. /shrug Whatever, I'm looking for progress and will work around it. (This will be a theme throughout this process.)


By going to the Boot Menu [F8], I'm prompted with a list of all attached drives. It does not have a refresh button. Adding USB drives and returning to the main menu does not refresh the menu, but adding a USB drive and returning to either menu a second time does. (sigh...) On the Boot Menu, clicking on the drive immediately boots from it, with no save and exit. No confirmation, no warning. Umm...OK… The BIOS clock does not seem to “hold” the time. I’ve set it in the AM and PM with 24-hour time and no matter what I do, the BIOS does not have the correct time when I return to it. This bothers me the most out of all the weirdness in this “media acclaimed” BIOS.


Under the initial rev. 060? BIOS there’s a menu option for KVM. I see that it’s off. I decide to update the BIOS now so I can know whether to move on or not. I update to rev. 0803 via internet. Nothing bricks.


Attempt Proxmox install via USB drive. Proxmox install fails on GRUB. Not a GRUB prompt or GRUB rescue, but the text "GRUB" with a blinking cursor, no rescue mode, and no input response. No problem, that's not really what I wanted to do anyways, I realize (or say to myself). I want a bare-metal Ubuntu install first, then benchmark everything, then benchmark a Proxmox container, then benchmark an Ubuntu VM on Proxmox.


Oh hey, look at that, I didn't unmount my USB before doing the "dd proxmox.iso -> usb" command. Proxmox USB install instructions don't say you need to but I'll try that because I think you do. Can someone comment on that, please? Proxmox install still hangs on GRUB.
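
On the unmount question: it is generally considered safer to unmount any partitions of the stick before writing an image over it, since a mounted filesystem can still have pending writes. A minimal sketch, assuming a Linux mount table in /proc/mounts format; the helper name mounted_parts, the device /dev/sdX, and the file name proxmox.iso are placeholders, not anything from the Proxmox docs.

```shell
# Print every mount point that belongs to device $1 (e.g. /dev/sdX),
# reading a mount table in /proc/mounts format from file $2.
mounted_parts() {
    dev=$1
    table=${2:-/proc/mounts}
    # Field 1 is the device node, field 2 the mount point; match any
    # partition whose device name starts with the given device.
    awk -v dev="$dev" 'index($1, dev) == 1 { print $2 }' "$table"
}

# Hypothetical usage; double-check the device name before running dd:
#   for mp in $(mounted_parts /dev/sdX); do sudo umount "$mp"; done
#   sudo dd if=proxmox.iso of=/dev/sdX bs=1M
#   sync
```

The function only lists mount points, so it is harmless to run; the destructive part stays in the comments on purpose.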

16.04.02 LTS desktop version installs fine, hangs on reboot. The lights are blinking across like progress but there's a cursor blinking through the splash screen. I make dinner. Still hung. Maybe a kernel panic, maybe an alien...doesn't matter. I'm looking for progress and move on. I tell myself it doesn't really matter because I need a newer kernel anyways.


Pop in 17.04 and go to install. At the Ubuntu install menu I choose install Ubuntu and get an ACPI error on boot about being unable to access a region.

Ubuntu loads...16.04. Okay...

First things first, NVidia drivers to get the multi-monitor thing going. I have yet to figure out the magic to .run files. They hang all the time. I’m under the impression I’m supposed to chmod the thing to turn it into an .exe-style executable. I’m not a big fan of creating an executable under root. Oh, here's a PPA for NVidia, thank goodness.

sudo add-apt-repository ppa:graphics-drivers/ppa

I’m guessing because they are proprietary that they’ll likely always be untrusted and under development. Fine by me.

Okay, now we have triple monitors going, my neck and back are thankful.

It turns out that the NVidia CUDA 381.* drivers have Vulkan support for the first time. Vulkan is the successor to Mantle and has a lot of support from Google, NVidia, and AMD. I find this interesting though I don't think there is currently anything I can do with the information. I have read that the nouveau (XORG) display drivers on 17.04 are not up-to-date for the NVidia 1080ti, so I’m glad I took the 16.04 LTS to 16.10 route. In retrospect that’s an old lesson but I was glad I didn’t have to relearn it.


ECC: Curious, I test ECC first. google-fu says I should install edac-utils ( edac-util(1): EDAC error reporting utility - Linux man page ) to check on my ECC. I have many other things I want to do and check but I'll start here.

sudo apt-get install edac-utils

edac-util -s

No EDAC data, can't find the EDAC data in sysfs

Okay, so no EDAC data. sysfs is provided by the kernel, so let's update the kernel and see what happens. Using @Patrick's article here, I update the kernel from 4.06?? to 4.10.

edac-util -s

MC0: 0 uncorrected errors

MC0: 0 corrected errors
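
The same counters can be read straight from sysfs, which helps show whether the kernel's EDAC driver is exposing anything at all. A minimal sketch, assuming the usual EDAC layout under /sys/devices/system/edac/mc with one mcN directory per memory controller holding ce_count and ue_count files; the function name edac_counts is made up for illustration.

```shell
# Print corrected (ce_count) and uncorrected (ue_count) error totals for
# each memory controller under the EDAC sysfs directory given as $1.
edac_counts() {
    base=${1:-/sys/devices/system/edac/mc}
    found=0
    for mc in "$base"/mc*; do
        # Skip anything that does not look like a memory controller.
        [ -r "$mc/ce_count" ] || continue
        found=1
        echo "${mc##*/}: $(cat "$mc/ce_count") corrected, $(cat "$mc/ue_count") uncorrected"
    done
    if [ "$found" -eq 0 ]; then
        echo "no EDAC memory controllers under $base"
        return 1
    fi
}

# Usage on a real machine (no argument needed):
#   edac_counts
```

If it reports no controllers, that matches the "no EDAC data in sysfs" situation from before the kernel update.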

I manually update Ubuntu to 16.10 to see if anything blows up. Still the ACPI error, but nothing blows up.


I install kvm via (h/t to cyberciti.biz)

$ sudo apt-get install qemu-kvm libvirt-bin virtinst bridge-utils cpu-checker

$ kvm-ok

Returns clear errors about what is and is not working and advises me to enable virtualization in the BIOS. I enable virtualization in the BIOS.

$ kvm-ok

Returns clear confirmation that things are installed and will potentially work. I have no .iso available to test. I’ll come back to it...I say.
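
A rough way to see part of what kvm-ok looks at is to check the CPU flags yourself. This is a minimal sketch and not how kvm-ok is actually implemented: it only reports whether the CPU advertises svm (AMD-V) or vmx (Intel VT-x) in a cpuinfo file, and the BIOS switch still decides whether the feature is usable. The function name virt_flag is made up for illustration.

```shell
# Report which hardware virtualization flag appears in a cpuinfo file:
# "svm" (AMD-V), "vmx" (Intel VT-x), or "none".
virt_flag() {
    cpuinfo=${1:-/proc/cpuinfo}
    # Take the first "flags" line, split it into one word per line,
    # and keep only an exact match for svm or vmx.
    flag=$(grep '^flags' "$cpuinfo" | head -n 1 | tr -s ' \t' '\n\n' \
        | grep -x -e svm -e vmx | head -n 1)
    echo "${flag:-none}"
}

# Usage on a real machine:
#   virt_flag
#   [ -e /dev/kvm ] && echo "/dev/kvm exists, KVM looks usable"
```

On a Ryzen box the expected flag is svm; the /dev/kvm check in the comment is the closer hint that the kernel module actually loaded.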


Progress! Fist bump! Do laundry.


At this point I have an updated kernel, installed graphics drivers, and updated OS with “working” ECC (still needs real testing). I consider clonezilla or other assorted image snapshotting but I’ll likely nuke and pave again.

I play around with a bunch of stuff but nothing worth writing about. I didn’t take great notes and will nuke and pave and do it more cleanly, likely more than one time. While I do want benchmarks and comparative analysis of 16.04LTS vs 16.10 vs 17.04 and also 4.06 vs 4.10 and also virtualized vs bare metal, I’m too excited to be patient. I tell myself I’ll get back to it.

Benchmarks: My googling says Phoronix test suite. I start with

phoronix-test-suite ?

I get a blinking cursor, I abort.

phoronix-test-suite -h

I get a blinking cursor, I abort.

Okay...maybe I’m just being impatient.

phoronix-test-suite ?

Wait about 20 seconds, returns “No Internet Connectivity” then help page shows up. No command-line argument syntax and no example commands are included in the help page. There is “View the included PDF / HTML documentation or visit Phoronix Test Suite - Linux Testing & Benchmarking Platform, Automated Testing, Open-Source Benchmarking for full details.” but no reference on where it is located. PHP is a dependency.

I run the Unigine Heaven benchmark to see all the visual glory that my card can do. It looks like crap. The roof looks like a billboard, the materials don't have subsurface scattering, the light is one-dimensional, the shadows are all one tone of black. It turns out the benchmark was made in 2009. "Whatever, I'm looking for progress" and will move on.

It turns out, after hacking around, they have an “experimental” gui

phoronix-test-suite gui

I test some things and play around with it. I give it an overall rating of “meh” for a variety of reasons though this article is long enough now.

I reach a point where I feel like I need to iterate and automate this whole process. The real problem is that my Docker skills are garbage. In my ideal world I’d have one base image, snapshotted to a golden for backup. I’d use Docker as a sort of virtualenv for everything and then I could experiment, build, destroy and basically make a mess of everything that would go with a simple kill command. I don’t know how to do that so…
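
For what it's worth, the "virtualenv that disappears" idea maps fairly directly onto a disposable container. A minimal sketch, assuming Docker is installed; the function name throwaway_cmd, the python:3 image, and the /work mount point are illustrative choices, not a recommendation. It only prints the command so it is safe to try anywhere.

```shell
# Build the command line for a disposable container.
# $1 = image (assumed default: python:3), $2 = project directory to mount.
throwaway_cmd() {
    image=${1:-python:3}
    workdir=${2:-$PWD}
    # --rm deletes the container on exit, so any mess goes with it;
    # -v mounts the project directory so the code itself survives;
    # -w starts the shell inside that mounted directory.
    echo "docker run --rm -it -v $workdir:/work -w /work $image bash"
}

throwaway_cmd   # prints the command; paste it into a shell to run it
```

Anything installed inside the container vanishes when the shell exits, while files written under /work stay on the host.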

Next up, level up on Docker, then there’s this site called ServeTheHome and they have this benchmark called Linux-Bench so I’ll try that next.
 

Spartus

Hi @DWSimmons

Regarding ECC. I found the RAM solidly stable at 2400, but unstable at 2666. It would boot but throw many ECC errors, some correctable and eventually some uncorrectable and eventually the system would crash. This is a known issue with Ryzen ECC... I believe the general consensus on preferred behaviour is that the system should halt with error on uncorrected memory error, but Ryzen doesn't currently and continues running (I find that preferred behaviour debatable though).

Anyways, bottom line is if you want to check that ECC is actually correcting errors then try booting at 2666. No guarantee your experience will match mine, but if it does you will then know for sure that ECC is working.

Glad to know your build is going well so far.
 

DWSimmons

Huh, I guess that part got edited out... The RAM defaulted to 2133. I selected 2400MHz, hit save and exit, and it rebooted into 2400MHz no problems. I did this in 060? and it came back with CAS 15 timing from JEDEC. On 0803, it came back with slightly looser timings. I probably could manually tighten the timings but I was too excited to dig in.

One of my sub goals is to get proficient at reading Linux logs. Once I get some confidence in that I'll head back to checking out the ECC overclock.
 

Spartus

I was just encouraging you to check that ECC is actually correcting errors. I say this because Gigabyte explicitly states that on their B350 boards ECC memory is supported in non-ECC mode. But the AX370 Gaming 5 I use says ECC supported. I was suggesting you double check that your ECC is actually correcting by pushing it to being unstable.
 