Gigabyte MJ11-EC1 EPYC 3151 Mystery


ynikolov

Hello everybody. I was so excited reading the thread that I could not wait long before getting one of those boards from the same seller. I do not have much experience with server-grade hardware, but I have been using much older and cheaper setups to run Proxmox/TrueNAS/OpenMediaVault at home.
I found two 32 GB ECC RDIMM modules and ran memtest+ on them for about 2 hours without a single issue. So I plugged in the bootable Proxmox USB stick, and this is where the good things stopped.
I started to experience strange behavior, which I will list below hoping for any clues:
1. Booting/initializing the BMC takes more than a minute (which may be expected) before I get the option to choose the boot order (F10). However, the board quite frequently resets itself even though I press F10 or DEL to enter the BIOS. It normally takes an AC power cycle to get out of this loop.
2. When I am able to boot from the USB stick, the installation runs extremely slowly. It is so laggy that installing Proxmox takes more than 30 min. It is as if the CPU is stuck in a loop and lacks performance: on the power meter the draw goes to around 100W, and even 120W when all 8 logical cores are enabled.
3. After several reset/boot attempts I was able to install Proxmox on a SATA SSD. After a restart, Proxmox starts booting but hangs at the "loading initial ramdisk" message. I have read a few threads about this, but nothing helped.

I tested TrueNAS and Debian 12 as well. They were not even able to start the installation process. Maybe if I had had the patience to wait longer the installation would have started, but this is not normal at all.

I tried restoring the BIOS default values and clearing the CMOS, both via the jumper and by removing the battery; no effect.
When I checked the BMC web GUI I saw more than a thousand log entries with some kind of ECC correction events.

Could it be a board or RAM failure, or a misconfiguration?

Regards
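
To judge whether "more than a thousand" ECC entries point at one module or are spread across both, it can help to tally the log per sensor. The following is a minimal sketch, assuming the event log has been exported as text in the pipe-separated style that `ipmitool sel elist` prints. The sample lines and the sensor names (`DIMM_P0_A0`, `DIMM_P0_B0`) are hypothetical and not taken from this board.

```python
from collections import Counter

# Hypothetical sample in the pipe-separated style of `ipmitool sel elist`.
# The sensor names and timestamps are made up for illustration.
SAMPLE_SEL = """\
   1 | 10/25/2023 | 18:02:11 | Memory DIMM_P0_A0 | Correctable ECC | Asserted
   2 | 10/25/2023 | 18:02:12 | Memory DIMM_P0_A0 | Correctable ECC | Asserted
   3 | 10/25/2023 | 18:02:15 | Memory DIMM_P0_B0 | Correctable ECC | Asserted
   4 | 10/25/2023 | 18:03:40 | Fan CPU_FAN | Lower Critical going low | Asserted
"""


def count_ecc_events(sel_text: str) -> Counter:
    """Count correctable-ECC entries per sensor field."""
    counts = Counter()
    for line in sel_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 5:
            continue  # not a complete log line
        sensor, event = fields[3], fields[4]
        # startswith() keeps "Uncorrectable ECC" out of the tally.
        if event.lower().startswith("correctable ecc"):
            counts[sensor] += 1
    return counts


if __name__ == "__main__":
    for sensor, n in count_ecc_events(SAMPLE_SEL).most_common():
        print(f"{sensor}: {n}")
```

If nearly all entries land on one sensor, swapping or removing that module is a cheap first test; an even spread would point more toward the board, the slots used, or RAM compatibility.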
 

mrgrinch

Which slots did you use for the modules? The documentation states that the blue slots have to be populated first.
 

studunihd

Is this RAM on the compatibility list? I also got this board to build a NAS. I initially tested it with non-ECC RAM I had lying around and with some ECC RAM not on the list, and had a lot of problems. After I got some ECC RAM that is on the list, all problems went away.
 

studunihd

Did anyone figure out how to reduce the idle fan RPM? I could reduce it a bit with a custom fan profile in the BMC but cannot get it lower than 50%. The sysfan connectors seem to be locked down as well: the corresponding sensor in the BMC stays disabled even when a fan is connected. Any ideas how to control those fans?
 

PeterF

I have the same board with exactly the same type of memory in the same slots. It is working without any problem. Have you checked that the memory is properly seated in the slots?
Is your motherboard installed in a case?
Could there be unused standoffs under it that are causing a short?

If you power on with the board completely off, it waits for the BMC to boot first, which takes a long time. I usually power it on from the BMC (Power Control); that goes relatively quickly.

BR
Peter
 

PeterF

I just tested this and created a fan profile that starts at 25%. That worked, but the resulting drop in fan RPM was not 50%. This will depend on the fan.
Instead, I replaced the fan with a 92mm one. It is only fixed with one screw and hangs out over the NVMe. I got lower noise, lower temperatures and a cold NVMe! I will try to make a bracket so I can fit a 120mm fan that also cools the DIMMs.

BR
Peter
 

ynikolov

I checked the RAM several times and cleaned the case beforehand. I will try swapping the PSU today. Could somebody measure the power draw during boot? I guess the consumption drops once the OS takes over, but do you hit 100W with 4 logical cores?

What is your BMC/BIOS firmware version? I have BIOS 12.49.06.
 

PeterF

I am on the same BMC version.
I just measured the boot time (booting from the BMC): it took 85 seconds until the OS started booting. The OS (OmniOS) then booted in about 10s.
I now use a 150W picoPSU. It has no problem booting the board and also two 3.5" HDDs.
I also tried a 90W PSU. It booted fine with an SSD, but the hard disks did not start.

BR
Peter
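
The 90W versus 150W observation can be sanity-checked with a rough peak-power budget. This is only a sketch under stated assumptions: the 70W base figure is the upper end of the 60-70W boot draw reported elsewhere in this thread (measured at the wall, so it includes PSU losses), and the 25W spin-up allowance per 3.5" HDD is an assumed typical value, not a measurement. Check the drive datasheet for real numbers. The function names are made up for illustration.

```python
def peak_power_w(base_boot_w: float, n_hdds: int,
                 spinup_w_per_hdd: float = 25.0) -> float:
    """Rough peak draw: board at boot plus all HDDs spinning up at once.

    spinup_w_per_hdd is an ASSUMED allowance per 3.5" drive.
    """
    return base_boot_w + n_hdds * spinup_w_per_hdd


def psu_ok(psu_rating_w: float, peak_w: float) -> bool:
    """True if the estimated peak fits within the PSU rating (no margin)."""
    return peak_w <= psu_rating_w


if __name__ == "__main__":
    peak = peak_power_w(base_boot_w=70, n_hdds=2)
    for rating in (90, 150):
        verdict = "ok" if psu_ok(rating, peak) else "too small"
        print(f"{rating}W PSU vs estimated {peak:.0f}W peak: {verdict}")
```

Under these assumptions two HDDs push the estimate past a 90W supply but not past a 150W one, while an SSD-only setup fits in 90W, which is consistent with what was observed above. A real build should leave some margin on top of the estimate.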
 

beijing

In my test configuration (2x 32 GB RDIMM ECC RAM, 1x SATA SSD, 1x USB Stick) the power draw was between 60-70W during boot process and ~30W in idle after booting TrueNAS. Measured at the power plug.

You should probably try different RAM, or test your sticks one at a time to see whether that changes your power draw.
 

ynikolov

It seems 100W is too much in my case, then. I will try other RAM.
It feels like the CPU is running at its maximum for some reason, which would explain both the slow/impossible installation and the high power draw. The interesting thing is that sometimes, in about 1 out of 20 power cycle attempts, I observe a power consumption of 60-70W and no lag whatsoever. The installation then completes normally and quickly, as expected. But after the next reset it goes bad again.
 

studunihd

Thank you. I looked at the profiles again as well. When I created my profile I copied the default profile, which had additional policies controlling the CPU fan. Some of them had a minimum setting of 50%. After deleting those policies the fan goes down to 25%.
This is much quieter, but the fan still seems to introduce a lot of vibration into the cooling block. I will also install a bigger fan.
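
The behavior described here, where a leftover policy with a 50% minimum overrides a custom 25% minimum, is what one would expect if the BMC applies the highest duty cycle requested by any policy in the profile. That rule is an assumption for illustration, not documented BMC behavior, and the `FanPolicy` fields below are hypothetical. It is also simplified so that all policies read the same temperature. A minimal sketch:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FanPolicy:
    """Hypothetical policy: linear ramp from min_duty to max_duty (in %)."""
    name: str
    min_duty: int
    max_duty: int
    start_temp: float  # at or below this temperature, min_duty applies
    full_temp: float   # at or above this temperature, max_duty applies

    def duty_at(self, temp_c: float) -> float:
        if temp_c <= self.start_temp:
            return float(self.min_duty)
        if temp_c >= self.full_temp:
            return float(self.max_duty)
        span = (temp_c - self.start_temp) / (self.full_temp - self.start_temp)
        return self.min_duty + span * (self.max_duty - self.min_duty)


def effective_duty(policies: Iterable[FanPolicy], temp_c: float) -> float:
    """ASSUMED rule: the highest duty requested by any policy wins."""
    return max((p.duty_at(temp_c) for p in policies), default=0.0)


if __name__ == "__main__":
    custom = FanPolicy("custom", 25, 100, 40.0, 80.0)
    inherited = FanPolicy("copied-from-default", 50, 100, 40.0, 80.0)
    print(effective_duty([custom, inherited], 30.0))  # inherited policy dominates
    print(effective_duty([custom], 30.0))             # after deleting it
```

Under this model, lowering the minimum in one policy has no effect while any other policy on the same fan still asks for more, which is why copying the default profile can silently keep the 50% floor.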

I also tried the sysfan connectors again. If I connect fans to the header closest to the CPU fan header the system does not boot (it restarts at boot code 99). The other sysfan header does not cause a loop but the corresponding sensor in the BMC remains disabled and the fan cannot be controlled.

Did you test additional fans on the sysfan headers?
 

beijing

This was mentioned on the 1st and 3rd pages, with the download link to the latest version:

I've installed the new BMC firmware version without any issue.
 

Gymnae

Hi there,

I just ordered this board to attach it to a fiber connection, with the goal of replacing rented VPS or cloud instances. I ordered 2x 32GB ECC RAM along with it.

Would this work as a game server for the likes of Counter-Strike or Minecraft? No more than 10 players per session.

In addition it will serve Nextcloud, Pi-hole, some other quality-of-life services and OMV, all via Proxmox.
 

saivert

I think the power socket is from the TE ELCON Micro Power series, specifically 1-2204801-8 (female) and 2204748-2 (male).

--


Finally received the plug and pins from Digikey. Soldered on the pins and pushed them inside the connector housing.
I took the female ATX 24 pin connector from an extension cable. (I bought the board from ebay from a Polish seller and the ATX to 4 pin adapter was not included so I had to make it myself.)

It works. Everything so far has worked ok. I installed Fedora Server to do some quick testing and reset the IPMI password and tried out the BMC and remote management which is working as well. I also adjusted the fan profile using the BMC.


Running memtest now. I bought some Mushkin - DDR4 - 32 GB - 2666 - CL - 19 - Single ECC/REG 2Rx4 (MPL4R266KF32G24) kit of two modules. EDIT: One run passed. Took a bit over an hour to complete.
 

prime420

Here is the guy who uploaded the current BMC firmware. The firmware 12.60.39 you posted isn't the current one. When I started writing this post, the current version was 12.61.01. While writing, I checked for a new version, and here we are: a new version, 12.61.17, was released on Nov. 15. Here is the link to the Gigabyte download.
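
With several version numbers in play (12.49.06, 12.60.39, 12.61.01, 12.61.17), comparing them as plain strings happens to work here but would break on something like "12.9.0" versus "12.61.17". A small sketch that compares the dotted versions numerically instead; the helper names are made up for illustration, and it assumes every version consists only of dot-separated integers:

```python
from typing import Iterable, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '12.61.01' into (12, 61, 1) so comparison is numeric."""
    return tuple(int(part) for part in text.strip().split("."))


def newest(versions: Iterable[str]) -> str:
    """Return the highest version string from the given ones."""
    return max(versions, key=parse_version)


if __name__ == "__main__":
    seen_in_thread = ["12.49.06", "12.60.39", "12.61.01", "12.61.17"]
    print(newest(seen_in_thread))
    print(parse_version("12.49.06") < parse_version("12.61.17"))
```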

If you want to update over IPMI, you have to use the file "rom.ima_enc" from the folder "fw".
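
Before uploading, it can be worth confirming that the extracted download really contains `fw/rom.ima_enc` and recording a hash of the exact file that gets flashed, for example to compare two downloads of the same version. A minimal sketch; the extraction directory is whatever you unpacked the archive into, and the function names are hypothetical. It does not talk to the BMC at all.

```python
import hashlib
from pathlib import Path


def find_bmc_image(extracted_dir: str) -> Path:
    """Return the path to fw/rom.ima_enc inside an extracted firmware archive."""
    image = Path(extracted_dir) / "fw" / "rom.ima_enc"
    if not image.is_file():
        raise FileNotFoundError(f"expected BMC image at {image}")
    return image


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Usage would look like `print(sha256_of(find_bmc_image("/path/to/extracted/archive")))`, with the path replaced by your own extraction directory.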

EDIT: I flashed 12.61.17 today and ran into problems. After flashing I couldn't log in anymore; I got a timeout every time. After a manual flash from Linux (Proxmox 8.0.x) with a full reset of the BMC, everything worked as expected. I think a reset during flashing via the web UI would prevent the problem.

In the changelog we can see some CVE security merges from AMI, which fix some security problems. I recommend using a current version.
Fixes and AMI merges since the version I last uploaded, 12.61.01:

[feature] Merge AMI Vulnerabilities for CVE-2023-34473, CVE-2023-34472, CVE-2023-34471, CVE-2023-34338, CVE-2023-34337, CVE-2023-34330, CVE-2023-34329, CVE-2023-28863, CVE-2022-40242, CVE-2022-40258, CVE-2022-26872
[feature] Add clear external SKU function in SKU update page.
[feature] Workaround for snmp config from old version BMC./
[feature] For specific customer request, to keep new threshold value after BMC power reset.
[defect] Fixed the issue where KCS would repeatedly the same session id command./
[feature] Add get BIOS1/2 BMC1/2 version use redfish command./
[feature] Change PSU information and let MFR_ID length to 64.
[feature] update redfish PSU info when detect sensor
[redfish] Add SKU item to display SKU version of /redifsh/v1/Systems/Self.
[feature] Syncronize Chassis serial number from BMC eeprom directly
[feature] Keep all information from FRU0 after SKU update
[feature] Make psu information from sensors can easy mapping to redfish PowerSupplies information.
12.61.15​
[feature] Support overheat auto power off by thresholds for T182-Z60/ZL0.
[feature] Merge AMI_MegaRAC_SPx_KVM_Lighttpd_Videocap_Vulnerabilities for CVE-2023-37293 CVE-2023-37297 CVE-2023-37296 CVE-2023-37295 CVE-2023-37294 CVE-2023-3043 CVE-2023-34333 CVE-2023-34332
[defect] fix PSU SERIAL/MODEL name are not correct.
[feature] sync V13 code to V12 to update read IPV4 and MAC info function
12.61.13​
[feature] Delete overheat auto power off for T182-Z60/ZL0.
12.61.11​
[feature] Add support of overheat auto power off and read adm1278 pin and iout for T182-ZL0.
12.61.09​
[feature] Add support read adm1278 pin and iout for T182-Z60.
12.61.07​
[feature] Add support of overheat auto power off for T182-Z60.
12.61.05​
[defect][CMC] Fix all node power loading will restore when PSU alert.
[defect] remove redfish Dre function
[defect] modify CMC get BMC ip/mac cmd to fix cmc can not read bmc ip/mac when bmc node in Dedicated mode
[defect]Hide HPM Firmware update support for CMC
[defect] Fix rtl8370 get ip fail when change dhcp speed.
[defect] fix BMC cannot get CMC INLET_TEMP
[feature] add CMC show node IPv6 ifno and modfiy node_bmc.c to use oem cmd to send each node and cmc ipv6 info to each node
Merged revision(s) 9092 from trunk/V12.2_20191206:/[feature] Add MI200 GPU power status sensor driver.
[defect] Dynamically change webui reset action text
[defect] add new device id HGX_H100 to pci_devices.json
[defect]CMC M lan ip did not change when change the DHCP server to another one.
[defect][CMC] Fix set time is incorrect by webUI on CMC platform./
[feature] Add CMC Updated Type and show CMC information in Firmware Update for CMC
[defect][CMC] Add CNBH40, CNBH41 to IPM support list./
[defect][CMC] Fix get node power status are incorrect at NDxx sensor./
[defect][CMC] Fix ND0x_Inlet_Temp will log low critical assert for H223./
[defect][CMC] When get midboard HDD status must disable SGPIO(GPIOJ0~J3) for CNBH40 and CNBH41./
[feature] add new pci device id V70 into pci_devices.json
[defect] fix CMC will send wrong psu status when eplist_read_cache dont get data.let CMC send invalid status info to bmc
[feature] add CMC sku/devmap for H263-S63-AAW1-ZB0-SW-015 based on H263-S63-AAW1-000
[feature] add CMC sku/devmap of H263-S66-AAW1-000 / H273-Z84-AAW1-000 based on H263-S65-AAW1-000 / H273-Z83-AAW1-000
[feature][CMC] Add new SKU for H262-002-H3_CMC and H262-003-H3_CMC/
[feature] Add MI200 GPU power status sensor.
[feature] add CMC sku/devmap of H263-S67-AAW1-000 / H273-Z85-AAW1-000 based on H263-S66-AAW1-000 / H273-Z84-AAW1-000
[defect] Enable cold redundant for H263-S60 / S61 / S62 / S63 / S65
[defect] add DS3231M for 3rd generation 2U4N CMC
[defect] need enable CR function before CMC follow node BMC cold redundant
[feature] add devmap of H223 H263
[feature] add devmap of H263-S62-LBW1-000_CMC.xml
[defect] commit fanprofile H223-S10-ABP1-000_CMC.json by marc.huang
[feature] Add RTC sensor on TO24-JDx /
[feature] add CMC sku/devmap of H263-V11-LAW1-000 / H223-V10-AAW1-000 based on H263-V11-AAW1-000
[defect][CMC] [defect][CMC] Change upper critical and upper non-critical in NDxx_CUR sensor for CNBH40./
Add pure code
[defect] remove Dre function
Add pure code for ldap./
[redfish] Let 'SSL' encryption typ not need to give certificate key when setting general LDAP configuration by redfish./
[defect] Let snmp can get correct sensor id.
12.61.03​
[defect] add H100_400W device id into pci_devices.json
[defect] add INLET_TEMP_2U4N SDR
[defect] Adjust web maxtimeout value./
[defect] update pci_devices.json to add new device L40S
[defect] Merge ami's patch: Addressed stack overflow and heap memory corruption in adviserd,Fix stack overflow in singleport plugin./
[defect] Merge ami's patch: Added fix to return error if request length size exceeds the expected limit
[defect] create new devmaps/MX33/R112-X30-00.xml and modify sku.h for this.
[defect] remove NDxx_M2_TEMP for H273-Z82-IAW1-000 CMC devmap
[defect] 1.modify P_3V3_SENSOR sensor for TO24 series 2. remove Expander_EP2, Expander_EP4, Expander_921 sensors for TO24-JD0-00
[defect][CMC] Change read BPB CPLD version from I2C.
[feature] add devmap of R282-Z95-SW-IT-003
[defect] commit fanprofile H263-S65-AAN1-000_CMC.json by kate.kung
[defect] commit fanprofile H263-S65-AAW1-000_CMC.json by kate.kung
[defect] modify G492-ZD2-00.xml GPU mapping
[defect] CMC inlet_temp sensor use INLET_TEMP_2U4N SDR
[defect] update INLET_TEMP sdr
[defect] commit fanprofile H252-3C0-00_CMC.json by marc.huang
[feature] add CMC sku/devmap for H273-Z83-ABN1/ABW1-000 based on H273-Z83-AAN1/AAW1-000
[defect] Merge ami's patch:Dynamic SSL Certificate Generation Script Changes
 

monotux

How are you people powering this? My needs are modest (one NVMe, two SSDs, two 3.5" HDDs), so I've looked into picoPSUs, but I'm open to ATX or SFX as well. I'm just not very well versed in comparing the alternatives, finding an efficient one and so forth.

Any recommendations?