ASUS Z10PA-D8 - Dual Socket 2011-3

Rand__

Well-Known Member
Mar 6, 2014
4,592
912
113
Well, a discussion with the seller, if only to get further insights...
and if only to see how many replacements he is willing to send out till they all work ..;)
 
Reactions: gb00s

Jernox

New Member
Jun 21, 2020
5
0
1
I wonder what I did wrong when contacting the seller. I explained the situation and just asked for some advice, and I didn't rush at all, there were 8 days in between my first and second message.

I had the intention to buy several of these mainboards, but due to my experience at that moment with the seller and troubleshooting, I wasn't really tempted to order some more.
 

gb00s

Active Member
Jul 25, 2018
223
61
28
Malta
... and if only to see how many replacements he is willing to send out till they all work ..;)
It appears to me they are not selling these boards anymore. So there might be no replacement.
I wonder what I did wrong when contacting the seller. I explained the situation and just asked for some advice, and I didn't rush at all, there were 8 days in between my first and second message.
As I said, no issues here.
 

Mauri Lehtikangas

New Member
Feb 8, 2016
6
1
3
41
I would have to re-read the whole conversation. But there was never a real discussion from their side. You could argue, it comes from a render farm and was 4yrs under their 'load'. But aren't these machines normally used for gpu workload?

EDIT: Time ... 11.33 pm ... Board shuts itself off again. Last state when the board died again. Logs empty.
It depends on the workload. But my motherboard had quite a lot of presets (with program names) saved in the BIOS for processor-intensive workloads. I would think these were used with the 2630v3 (those were also sold by the seller) to tax the processors and the board.
 

trubok

New Member
Jun 21, 2020
5
1
3
I wonder what I did wrong when contacting the seller. I explained the situation and just asked for some advice, and I didn't rush at all, there were 8 days in between my first and second message.

I had the intention to buy several of these mainboards, but due to my experience at that moment with the seller and troubleshooting, I wasn't really tempted to order some more.
Hey,

Please tell me your eBay username, maybe we missed your request on eBay. Normally I reply to all messages.

Currently we have a problem with a document that eBay needs from our tax office. I think we will be back online in about a week. We have more than 500 boards to sell. We have sold a lot of boards without any problems.

We don't use GPUs on these boards, because we used them only for CPU rendering. Maybe that is the issue?

Most of our mainboards run fine with BIOS version 3202. We did not upgrade the firmware. Before we send a replacement, we test the board for around 6 hours.

In my experience b7 is a problem with the ram.
 

gb00s

Active Member
Jul 25, 2018
223
61
28
Malta
@trubok .... welcome here and I appreciate you came here to discuss the issues with the board. I assume you are ralle57 on eBay.

Most of our mainboards run fine with BIOS version 3202. We did not upgrade the firmware.
I have to disagree. All my purchased and 3 exchanged boards came with bios 3107, not 3202. I'm very sure about it. The board with the new issues, I'm not 100% sure about.
In my experience b7 is a problem with the ram.
There might be a misunderstanding. I'm describing an issue where a mainboard came in, bios version not 100% clear, with 4 thermal alerts in the logs. Board was updated to bios 3807 and firmware 1.14.

It was perfectly fine, except that it did not initialize with 4x 32GB DDR4 RAM installed on socket 1. It initialized successfully only with 2 slots populated, whether 1 or 2 CPUs were installed. 4 populated slots with 2 CPUs installed did not work either. The board was working fine, and after one day of running normal test scripts it shut off once. The board did not come up again until after several BIOS & firmware flashes and 50+ attempts to power it up. If the board does boot successfully at some point, it shuts off after 1+ hours. If you don't wait, and instead reset everything and re-flash the newest BIOS again, the time between a successful boot and the next shutdown gets shorter and shorter. I tested with 2x E5 2630v3s I bought from you. Both were cooled with the purchased Intel 2U fans and/or Noctua U9DX-4i. The CPUs have no obvious temperature issue.

RAM tested was (all on QVL list):

2x 32GB SAMSUNG DDR4 2133P M386A4G40DM0-CPB (LRDIMM)
2x 32GB SAMSUNG DDR4 2133P M386A4G40DM0-CPB0Q (LRDIMM)
4x 32GB SAMSUNG DDR4 2133P M393A4K40BB0-CPB (RDIMM)

SO WHICH RAM DID YOU USE?

CPUs tested:

4x E5 2630 v3
4x E5 2673 v3

All the same. Even with no peripherals installed, the board has the same issues. It only powers on up to specific points, as shown by the POST code.

Attempt - Postcode

1. doesn't boot up
2. 04
3. 19
4. b0 or 60 (but b0 seems more reasonable)
5-7. b0
8-9. b7 and/or b9
10. boots successfully ... sometimes it shuts down immediately after posting 'BMC is ready' and you need an 11th attempt.

I think this is what you got wrong. It doesn't always stop at b7/b9.
Before we send a replacement, we test the board for around 6 hours.
All problems started after running longer than 6 hrs.

Let us know what you think and discuss it with your team. I can make a whole video about everything so nobody can think I'm telling bs.

EDIT: Even a RAM issue doesn't explain why the board, after a shutdown, is not able to come up again and boot up to the next issue, or why the board loses the ability to initialize and is not even able to post anything. If it attempted to initialize the RAM and could not read it due to RAM issues, it would beep. It doesn't beep until it is able to get beyond POST code 04. And this is possible only with a reflash of any BIOS or endless power-on attempts.
 

trubok

New Member
Jun 21, 2020
5
1
3
We run the boards with a Xeon E5 2630v3 and 2630v4 without these problems.

Maybe we have two versions of BIOS and firmware, I will check this.

The E5 2673 v3 is not on the supported list: Z10PA-D8 CPU Support | Servers & Workstations | ASUS Global

I think there is a problem with the newer BIOS or firmware, because the other customers don't have these problems. I will upgrade a board to 3807 and firmware 1.14 and do some tests. I will let you know.
 

gb00s

Active Member
Jul 25, 2018
223
61
28
Malta
The E5 2673 v3 runs absolutely fine on other boards of yours.
I think there is a problem with the newer BIOS or firmware, because the other customers don't have these problems. I will upgrade a board to 3807 and firmware 1.14 and do some tests. I will let you know.
Current problem here is tied to any bios version from 3107 up to the current version. Tested with firmware 1.09 - 1.14.
 

trubok

New Member
Jun 21, 2020
5
1
3
Currently we still have 250 nodes running with the Z10PA-D8 and 2x 2630v4 without these problems. They run BIOS 3202 and firmware 1.12.
We have used them 24/7 since 2016/2017.

I can't say what the problem is, but we will check this. I think we need a few days to give an answer.
 

gb00s

Active Member
Jul 25, 2018
223
61
28
Malta
Ok and thank you. Please, can you provide some information for us all about what kind of RAM you are using? I have no RAM or CPU issues with the other boards you already exchanged. Again, I do not believe it's tied to BIOS, firmware, and/or CPU. It's already suspicious that 4 populated slots are not being initialized on both sockets, while 2 populated slots initialize just fine.
 

trubok

New Member
Jun 21, 2020
5
1
3
RAM Crucial RDIMM 16GB, DDR4-2133, CL15, reg ECC (CT16G4RFD4213)
RAM Kingston ValueRAM RDIMM 16GB, DDR4-2133, CL15-15-15, reg ECC (KVR21R15D4/16)
RAM Samsung RDIMM 32GB, DDR4-2133, CL15-15-15, reg ECC (M393A4K40BB0-CPB)
 
Reactions: gb00s

gb00s

Active Member
Jul 25, 2018
223
61
28
Malta
So all RDIMMs ... not a single LRDIMM, even though LRDIMMs are on the QVL list and listed in the specs (LRDIMM 32/64GB DIMMs, up to 512GB).
 

gb00s

Active Member
Jul 25, 2018
223
61
28
Malta
Ok, I tested RDIMM and LRDIMM on both specific 'failed' boards. I already tested RDIMMs from a good board on both failed boards, and it didn't go well. But these RDIMMs run well on the good boards. Let me test the LRDIMMs on the boards that run perfectly fine and report back. If the 'good' boards fail now, then it's the LRDIMMs. If the boards are still good with the LRDIMMs, then it's a board issue, as expected. Give me 24hrs to report.
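
The swap test described above can be written down as a small elimination rule. This is only a sketch: the function name `suspects` and the labels such as `board:failed` are made up for illustration, and the fourth run in each call is a hypothetical outcome, not a result reported in this thread. A part is treated as suspect if it shows up in at least one failing run and in no passing run.

```python
def suspects(runs):
    """Return the parts that appear in failing runs but never in a passing run.

    runs: iterable of (board, ram, passed) tuples. The labels are
    hypothetical placeholders, not real inventory names.
    """
    failed, passed = set(), set()
    for board, ram, ok in runs:
        (passed if ok else failed).update({board, ram})
    return failed - passed


# Runs already described in the post above.
known = [
    ("board:good", "ram:RDIMM", True),
    ("board:failed", "ram:RDIMM", False),
    ("board:failed", "ram:LRDIMM", False),
]

# Hypothetical outcome A: the LRDIMMs work on a good board.
print(suspects(known + [("board:good", "ram:LRDIMM", True)]))

# Hypothetical outcome B: the LRDIMMs fail on a good board too.
print(suspects(known + [("board:good", "ram:LRDIMM", False)]))
```

In outcome A only the failed board is left as a suspect. In outcome B the LRDIMMs become a suspect as well, but the failed board stays on the list because it also failed with the known-good RDIMMs.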
 

gb00s

Active Member
Jul 25, 2018
223
61
28
Malta
Just to test the influence of bios version 3202 ....

TEST 1 (base) ...

TEST 'flaky' mainboard tonight (home) w/2x 32GB SAMSUNG DDR4 2133P M386A4G40DM0-CPB0Q (LRDIMM)
Bios downgraded to v. 3202 / Firmware v. 1.14 / No ACPI activated in bios
Start: 8:31pm / Powered off: 5:19am (time when script stopped running)
No reboot.

TEST good mainboard tonight (company) w/4x 32GB SAMSUNG DDR4 2133P M393A4K40BB0-CPB (RDIMM)
Bios downgraded to v. 3202 / Firmware v. 1.14 / No ACPI activated in bios
Start: 8:32pm / Powered off: none (still running >> can still access it via SSH)
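
For comparison with the seller's roughly 6-hour burn-in, the runtime of the flaky board in Test 1 can be worked out from the two timestamps. A minimal sketch, assuming the power-off happened the morning after the start (less than 24 hours later); the helper name `runtime` is made up for illustration.

```python
from datetime import datetime, timedelta


def runtime(start: str, stop: str) -> timedelta:
    """Elapsed time between two 12-hour clock times like '8:31pm'.

    Assumption: if stop is not later than start on the clock,
    the stop time belongs to the following day.
    """
    fmt = "%I:%M%p"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(stop, fmt)
    if t1 <= t0:
        t1 += timedelta(days=1)
    return t1 - t0


elapsed = runtime("8:31pm", "5:19am")
print(elapsed)                        # 8:48:00
print(elapsed > timedelta(hours=6))   # True
```

So the flaky board ran for about 8 h 48 min before powering off, which is beyond the 6-hour window the seller uses for testing.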

TEST 2 (RAM switch LRDIMM vs RDIMM) ...

later today
 

gb00s

Active Member
Jul 25, 2018
223
61
28
Malta
UPDATE

After two sleepless days & nights of testing, I finally came to the conclusion that the last two failing boards are just 'done' == kaput. I tested all CPUs, DIMMs, and PSUs, switching them from board to board and socket to socket.

Conclusions

  1. Contrary to what I remembered, it seems I did not update the BIOS of 2 mainboards. Surprisingly, these boards had BIOS v. 3201 on them. According to a loose Asus contact, BIOS version 3201 is not an official BIOS. It is not, and never was, available for download through the Asus support website, nor is it mentioned anywhere. The official BIOS is version 3202 from May 2016, which was released as an upgrade with crucial changes related to RAM configurations and compatibility.

    [Attached image: bios3201_1024x768.jpg]

  2. On 3 of my 5 boards with BIOS version < 3801, my LRDIMMs and those from a close contact here in Malta (all 32GB) did not initialize in slots E1-H1 if 2 CPUs are installed. All models are on the Asus QVL list and therefore are officially authorized to be used on this board. Boot hangs at the second E7 POST code after E9 and gives Memory_Train_ERR (OEM specific) on all these slots. If you click on the error in the reports, it gives you some other error I did not record here. I will deliver the exact error message this afternoon. It was related to the offset settings for the RAM not being set correctly in the BIOS. I used standard BIOS settings only. There was just a single Google search result (an official Lenovo statement) for this specific error, which pointed me to a possible issue of this kind with a fix in a different BIOS version.

  3. One board showed the same error even with BIOS version 3801. It was not related to the DIMMs, as the error moved back and forth between slots E1-H1 and A1-D1 (in dual-socket configuration) when I swapped the CPUs (2630v3) between sockets. I did not realize this earlier because the board had been running in single-socket configuration only, with all 4 slots on socket 1 populated. At first, I thought this might be a CPU issue. But the issue did not follow the CPUs to a different board when I moved them to the other boards. So I installed other CPUs (2666v3 & 2673v3) in the same socket, and the issue just moved back to E1-H1. The DIMMs were kept in their slots. When the DIMMs were changed, the error kept showing up in E1-H1 and moved with the 2630v3s only. So it is not DIMM-related either. Maybe this is a RAM controller issue, as the 2630v3 is rated for 1600-1866MHz DDR4 only. But I can't find DDR4 with this specification.

  4. The failed board, which was already a replacement, was tested in several configurations again: with RDIMMs and LRDIMMs, and with 3 different processor models (2630v3, a friend's 2666v3, and 2673v3) in single- and dual-socket configuration. It appears the board runs longer in single-socket configuration than in dual-socket configuration. What differs from all the other boards is the VCORE reading for socket 2 if 2 CPUs are installed. While all other boards show 1.792 - 1.802 vCore voltage on both sockets (surely these readings are not accurate in absolute terms), the failing board shows only 1.576 - 1.600 vCore voltage. The temps on its socket 2 are ~6-8C higher no matter which CPU is installed. Tested with BIOS v. 3107 to 3801. As I wrote in other posts, the board still shuts off at some point. It doesn't show any errors, or logs cannot be written to the board. You cannot get the board running again after it has powered down by itself until you reflash the BIOS and reset all settings. The more often you do this, the shorter the board keeps running the next time, until you let it rest for 1+ hours. A faulty PSU can be ruled out as the cause, as the board was tested with 3 different PSUs (one of them new). If it threw error messages, that would be helpful. But it just doesn't show anything.

  5. All boards where I did not clear the previous logs showed CPU2 temperature warnings, PMBPower issues (power supply), and fan issues.
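
To put the vCore gap from point 4 into a single number, the midpoints of the two reported ranges can be compared. This is just a sketch using the readings quoted above; taking the midpoint of each range is my own simplification, and the absolute values are probably not accurate sensor readings anyway, so only the relative difference is of interest.

```python
def midpoint(low: float, high: float) -> float:
    """Midpoint of a reported reading range."""
    return (low + high) / 2


# Ranges as reported in point 4 (raw sensor readings in volts).
healthy = midpoint(1.792, 1.802)   # all other boards, both sockets
failing = midpoint(1.576, 1.600)   # failing board, socket 2

# Relative shortfall of the failing board's reading, in percent.
shortfall_pct = (healthy - failing) / healthy * 100
print(f"socket 2 reads about {shortfall_pct:.1f}% lower on the failing board")
```

The failing board's socket 2 reading comes out between 11% and 12% below the other boards, which is a far larger gap than the spread within either range.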

All in all, 4 of my 5 boards (3 of them replacements) are running, though still throwing memory errors from the slots assigned to socket 2, and I can live with that for the moment. Only the last failed board has to be replaced again. I'm just worried, as all issues and all temperature warnings in the logs are related to socket 2. Glad the seller is always immediately available for help and support and easy with replacements. There's nothing to say against it.
 