Need Advice: V100 ECC Retired Page Error exceeds 63 and corrupted InfoRom


jena

New Member
May 30, 2020
28
6
3
I just bought a used V100 16GB PCIe as a learning tool for CUDA/Deep Learning.
My long-term plan is to get a second one and use NVLink to learn data- and model-parallel DL training.

I ran NVIDIA’s diagnostic test and it reported two errors (pictures attached):
  1. Retired Page Error exceeds 63
  2. corrupted InfoRom

During the test, the GPU temperature was about 62C with the TDP set to the maximum (250W).

Questions:
  1. Should I be worried?
  2. Are there any other tests I should run quickly?
  3. Any way to tell if it has been used for mining? (Is the retired page caused by mining?)
Thank you all in advance.

PXL_20230919_180220978~2.jpg
PXL_20230919_175721160~2.jpg
PXL_20230919_175638887~2.jpg
 

jena

"Can't you return it? Why bother dealing with faulted hardware?"

It was an "ok" deal.
The ebay seller specified no return.
I will need to go through some hoops with ebay protection program to get it returned.

So if it is a trivial issue, I can live with it.
 

Syr

Member
Sep 10, 2017
55
20
8
The card may function fine, but if the accuracy of your calculations on the GPU is important, you cannot trust the results with the GPU in this state.
I would recommend that you try flashing the Inforom; it may fix both errors. If the errors persist after flashing, the card is likely only viable for fault-tolerant applications such as rendering.

First try rebooting - there have been cases (such as sleep/suspend in Linux) where the driver entered an invalid state and spuriously reported problems with the Inforom.

If that does not work, you can attempt fixing it with the nvflash tool:
Code:
nvflash --repairfs

Page retirement error:
* This means that the GPU's page retirement table is full and cannot record any more bad pages. As a result, the GPU will keep using whatever frame buffer cells it has left, regardless of whether they begin to exhibit double-bit ECC errors or multiple single-bit ECC errors at the same address.
* On the surface, this only matters if you run the GPU in ECC mode. You may still get generally usable results from workloads that are more fault-tolerant, such as rendering or AI.
* The page retirement table is stored in the Inforom. If the Inforom is bad, however, this error can be reported spuriously.
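
If you want to look at the retired page counts yourself, here is a rough sketch that reads them through nvidia-smi's CSV query interface. The helper names and the RETIRED_PAGE_LIMIT constant are my own; the limit is only an assumption taken from the "exceeds 63" wording of the diagnostic message, so treat the table_full flag as a hint rather than an official threshold.

```python
import subprocess

# Assumption: limit taken from the "exceeds 63" wording of the diagnostic message.
RETIRED_PAGE_LIMIT = 63

# nvidia-smi --query-gpu fields: ECC mode, pages retired for single-bit and
# double-bit errors, and whether a retirement is pending a reboot.
FIELDS = "ecc.mode.current,retired_pages.sbe,retired_pages.dbe,retired_pages.pending"


def query_nvidia_smi():
    """Run nvidia-smi and return its CSV output (one line per GPU)."""
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout


def parse_retired_pages(csv_text):
    """Parse the CSV text into a list of dicts; non-numeric counts become None."""
    def to_int(value):
        return int(value) if value.isdigit() else None

    gpus = []
    for line in csv_text.strip().splitlines():
        ecc_mode, sbe, dbe, pending = [part.strip() for part in line.split(",")]
        sbe, dbe = to_int(sbe), to_int(dbe)
        total = None if sbe is None or dbe is None else sbe + dbe
        gpus.append({
            "ecc_mode": ecc_mode,
            "sbe": sbe,
            "dbe": dbe,
            "total": total,
            "pending": pending.lower() == "yes",
            # Hypothetical flag: True when the total exceeds the assumed limit.
            "table_full": total is not None and total > RETIRED_PAGE_LIMIT,
        })
    return gpus
```

On a machine with the driver installed, parse_retired_pages(query_nvidia_smi()) should return one entry per GPU. Keep in mind that if the Inforom itself is bad, these counts may not be reliable either.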

Corrupt Inforom error:
* This means that the data in the Inforom (a small non-volatile storage device on the GPU used to store various data for the GPU's operation, including things like the page retirement table) does not match the expected checksum. You can try flashing the Inforom, but if the error persists it means that the Inforom is damaged and cannot accurately store data.
* While in this state, the GPU will ignore the Inforom. Nvidia does not have any specifications for a GPU running without its Inforom and treats it as an undefined case - it may produce erroneous results or experience unpredictable behavior.
* There has also been an observed bug with Linux sleep causing the driver to think that the Inforom is corrupted, but rebooting solves the issue.
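
To check whether the corruption report survives a reboot or a flash, something like the sketch below can scan nvidia-smi's output for the warning. It assumes the driver prints a line that mentions both "infoROM" and "corrupt"; the exact wording may differ between driver versions, and the function names are mine.

```python
import subprocess


def find_inforom_warnings(smi_output):
    """Return the lines of nvidia-smi output that mention a corrupted Inforom."""
    return [
        line.strip()
        for line in smi_output.splitlines()
        # Assumption: the warning line contains both words, in any letter case.
        if "inforom" in line.lower() and "corrupt" in line.lower()
    ]


def run_nvidia_smi():
    """Capture stdout and stderr together, since the warning may be on either."""
    proc = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return proc.stdout + proc.stderr
```

Running find_inforom_warnings(run_nvidia_smi()) before and after a reboot gives a quick way to tell the spurious sleep/suspend case apart from a persistent fault: an empty list after rebooting suggests the earlier report was spurious.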

Sources:
https://www.reddit.com/r/homelab/comments/w7sb1l
 