Hello all at STH,
I have a Supermicro X10SDV-12C-TLN4F that has some issues. I'm trying to work out whether the board is dead or whether it can be fixed.
The main problem is that I can't use it with RAM in slot A1: I always get the error "Memory signal is too marginal DIMMA1". The RAM is not the problem. The sticks are Samsung M391A2K43BB1-CPB, approved by Supermicro for this board, and they work fine. I also tried identical sticks pulled from another working server, with the same result. I cleaned the RAM slots with an air blower, still the same. I tried lowering the RAM frequency to 1800 instead of 2133 MHz, no change.
So I removed the RAM from slot A1. With RAM in slot B1 only, I can sometimes boot, and sometimes I get the error "Memory signal is too marginal DIMMB1":
I managed to flash the BIOS to the latest version using recovery, but it didn't change anything:
I also flashed the BMC firmware to the latest version, still the same:
I also tried another PSU. The system is now running on a brand-new Seasonic 300W 80+ Gold, with no change at all.
I thought the issue might be CPU related, so I tried running the CPU with only a few cores enabled instead of the default 12. Even with a single core and CPU features such as Hyper-Threading and Turbo Boost disabled, the system was pretty slow but still had the same issues.
So the issue at startup is that sometimes the system will POST and sometimes it won't. It boots fine once in a while, and other times it gets stuck in a kind of boot loop at PEI--IPMI Initialisation and DXE-00B Data Initialisation:
I noticed that to get into the BIOS I have to put the JBR1 jumper in recovery mode, otherwise I can't get in. Once in the BIOS, I can make changes and save them, and that works. But when I reboot, the weird behavior is back: the system randomly hangs at POST or goes into a boot loop.
When I do manage to boot the system, it works fine. I installed FreeNAS on it and ran a Plex server encoding video on all cores for a whole night without any issue. As long as it stays powered on, the system keeps running flawlessly for days.
The issue is at POST or right after. Once the OS is booted there's no issue anymore, and from the tests I made this Xeon D-1557 is as powerful as my Xeon E3-1245-V5 for video encoding, while running cooler and drawing less power. This is why I really like this little SoC and want to save it.
I still get a lot of memory errors in the IPMI logs:
My guess is that the board has one defective component, most likely one of these:
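For anyone wanting to compare how these errors split between the two slots, here is a minimal Python sketch that tallies memory events per DIMM from a text dump of the event log. It assumes the log was exported with `ipmitool sel elist` in its usual pipe-separated layout and that the DIMM label (e.g. DIMMA1) appears somewhere on the line. The sample lines are made up for illustration and are not taken from this board's log, and `count_memory_events` and `SAMPLE` are hypothetical names.

```python
import re
from collections import Counter

# Assumption: DIMM labels look like "DIMMA1" / "DIMMB1" somewhere in the line.
DIMM_RE = re.compile(r"DIMM\s*([A-Z]\d)", re.IGNORECASE)


def count_memory_events(sel_text):
    """Count memory-related SEL entries per DIMM label.

    Assumes pipe-separated lines: id | date | time | sensor | event | state.
    Lines with fewer fields, or whose sensor field does not mention
    memory, are skipped.
    """
    counts = Counter()
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # not a recognisable SEL entry
        if "memory" not in fields[3].lower():
            continue  # some other sensor (fan, temperature, ...)
        match = DIMM_RE.search(line)
        slot = "DIMM" + match.group(1).upper() if match else "unknown"
        counts[slot] += 1
    return counts


# Made-up sample in the assumed layout, for illustration only.
SAMPLE = """\
1 | 01/02/2020 | 10:00:01 | Memory | Correctable ECC (DIMMA1) | Asserted
2 | 01/02/2020 | 10:00:05 | Memory | Correctable ECC (DIMMB1) | Asserted
3 | 01/02/2020 | 10:01:00 | Fan FAN1 | Lower Critical going low | Asserted
4 | 01/02/2020 | 10:02:00 | Memory | Correctable ECC (DIMMA1) | Asserted
"""

if __name__ == "__main__":
    for slot, n in count_memory_events(SAMPLE).most_common():
        print(slot, n)
```

If one slot clearly dominates the count, that would point more toward that channel (slot, traces or the memory controller pins behind it) than toward the BIOS chip, though this is only a rough indicator.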
- CPU
- BIOS chip
RAM issues with known-good RAM sticks and clean RAM slots may come from a CPU problem. But once the system is booted the CPU performs very well, and even running on one core only the board behaves just the same. The CPU performs so well when the server is booted that it's hard to believe it could be dead, but who knows? Maybe a bad solder joint on the BGA?
The weird behavior that happens randomly at POST could be related to a bad BIOS chip too.
If the CPU is dead, well, it's BGA and Xeon-D chips are not sold at retail, so the board can't be fixed. But if the issue is only the BIOS chip, it could be replaced at low cost. I just have no real idea whether I should try swapping the BIOS chip, or whether this is a known issue with the CPU. I saw that the "Memory signal is too marginal DIMM" error seems common among X10SDV owners. Do you have any idea what it could be?
If anyone has any idea, I'd be happy to save this board. Many thanks, any help is really welcome.