Hardware Failures in 2021 - Post yours!

RageBone

Active Member
Jul 11, 2017
552
138
43
On my 128GB ES stick, there is a Winbond SPI flash chip under the heatsink.
It's a 25Q80DVJG (W25Q80DV, 8 Mbit).
It might contain the borked FW.

You could try to dump it from a good stick and flash that image to the bad sticks.

The only issue is that I can't see any obvious, usable debug headers or pads for that.
So far, removing the heatsink also seems to brick the sticks? At least that's what Patrick experienced, if I remember correctly.
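If the chip can be reached in-circuit, the dump-and-restore idea could look something like this flashrom sketch. Everything here is an assumption: a CH341A programmer with a SOIC-8 clip, the file names, and whether in-circuit reads work at all while the chip sits on the DIMM.

```shell
# Sketch only: dump the SPI flash from a known-good stick and write it to a
# bad one, using flashrom with a cheap CH341A programmer and a SOIC-8 clip.
# In-circuit access is unverified; other components on the DIMM may load
# the SPI bus and corrupt reads.

# Read the good stick's chip twice and confirm the dumps match
flashrom -p ch341a_spi -c W25Q80.V -r good_a.bin
flashrom -p ch341a_spi -c W25Q80.V -r good_b.bin
cmp good_a.bin good_b.bin

# Back up the bad stick's contents before overwriting anything
flashrom -p ch341a_spi -c W25Q80.V -r bad_backup.bin

# Write the good image to the bad stick (flashrom verifies after writing)
flashrom -p ch341a_spi -c W25Q80.V -w good_a.bin
```

Note the image may well contain per-stick data (serials, calibration), in which case a straight copy from another stick would not be enough.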

EDIT:
@oddball as a suggestion
 

Marjan

New Member
Nov 6, 2016
18
2
3
Just took out 3 HDDs from the drawer, they were sitting there for like 2 years. They are dead. I guess they died of boredom.
 

edge

Active Member
Apr 22, 2013
177
55
28
Just took out 3 HDDs from the drawer, they were sitting there for like 2 years. They are dead. I guess they died of boredom.
I have needed to give disks that have been off for a long time a quick horizontal twist to get them to spin up (it doesn't always work). Spin-up failure is a favorite of old drives.
 

oddball

Active Member
May 18, 2018
204
97
28
40
No, they're non-ES (normal) DIMMs.

What's frustrating is they all worked for months before the FW update.

Because of this thread I went back through a stack of 15 yesterday and tested one by one. Looks like three are totally borked, and four are visible to the server but report a size 0 and FW version of 0.0.0.0. The rest work!

If anyone wants to mess with them I'd be happy to part with them for something nominal.

One thing I've found with these is that if there is a single bad stick, the machine will think all of them are bad. What I do is remove them all and put them back in one at a time: allow the machine to POST, make sure the BIOS can see the stick, make sure the OS sees the memory capacity, then shut down and try the next. Running like this is totally unsupported (these are supposed to be installed in sets), but it DOES work. I'd never run a workload like this, but it's a nice way to verify the hardware is viable.
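If the server runs Linux and Intel's management tools are available (an assumption; the post doesn't say), per-stick triage can be quicker than pure swap-and-POST testing:

```shell
# Sketch, assuming Linux with Intel's ipmctl and ndctl tools installed.
# List every Optane PMem module with capacity, health state and firmware
# version; a stick reporting 0 capacity or FW 0.0.0.0 should stand out.
ipmctl show -dimm

# More detail on a single module (DimmIDs come from the first command)
ipmctl show -a -dimm 0x0001

# Cross-check what the kernel itself sees, including disabled DIMMs
ndctl list -DHi
```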

This is a really finicky technology, as Patrick noted in his article. We love the potential, but getting it set up is a bit of a knife's edge. We run these in production in a 50/50 split: 50% persistent (App Direct) and 50% in Memory Mode. On the persistent portion we store databases and staging files for an ETL process. For that purpose these things are incredible: latency is near-RAM, random read/write speeds are impressive, and it presents as a drive to the OS, so no weird tricks.
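For reference, a 50/50 split like the one described can be provisioned with Intel's ipmctl and ndctl tools. This is a sketch under the assumption of a Linux host; the region and device names are illustrative defaults, not taken from the post.

```shell
# Sketch: provision the 50/50 split described above.
# 50% Memory Mode (volatile, acts as RAM), remainder App Direct (persistent).
# A reboot is required for the BIOS to apply the goal.
ipmctl create -goal MemoryMode=50 PersistentMemoryType=AppDirect

# After reboot: expose the App Direct region as a DAX-capable namespace,
# then format and mount it so it appears as an ordinary drive to the OS
ndctl create-namespace --mode=fsdax --region=region0
mkfs.ext4 /dev/pmem0
mount -o dax /dev/pmem0 /mnt/pmem
```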

We have two new hypervisors with 1TB and 1.5TB of Optane installed in Memory Mode. We're going to see how they do with a bunch of VMs jammed on there. In theory this tech is a game changer, but it seems underutilized. We have some VMs that need 512GB of RAM for in-memory databases (Solr) where only a portion of the DB is 'hot', and Optane is perfect for this.
 

RageBone

Active Member
Jul 11, 2017
552
138
43
I'd love to be able to finagle with them, but I have neither the budget to compensate you nor a Cascade Lake machine to test them in.
And I work very intermittently on projects, so it would take ages anyway.
And i work very intermittently on projects so it would take ages anyway.
 

technovelist

New Member
Dec 26, 2021
18
0
1
The Optane DIMMs themselves. I'm not sure if they didn't flash correctly or didn't activate correctly, but somehow the process just hung. We let it run for days and it appeared finished; I think the controller software just gave up. When the DIMMs are inserted into a machine it never POSTs, it just hangs testing the memory forever. Literally days.

Any suggestions?
Have you contacted Intel support? They have been helpful in the past when I've had questions and/or problems with their hardware.
 

technovelist

New Member
Dec 26, 2021
18
0
1
No, they're non-ES (normal) DIMMs.

What's frustrating is they all worked for months before the FW update.

Because of this thread I went back through a stack of 15 yesterday and tested one by one. Looks like three are totally borked, and four are visible to the server but report a size 0 and FW version of 0.0.0.0. The rest work!

If anyone wants to mess with them I'd be happy to part with them for something nominal.

One thing I've found with these is that if there is a single bad stick, the machine will think all of them are bad. What I do is remove them all and put them back in one at a time: allow the machine to POST, make sure the BIOS can see the stick, make sure the OS sees the memory capacity, then shut down and try the next. Running like this is totally unsupported (these are supposed to be installed in sets), but it DOES work. I'd never run a workload like this, but it's a nice way to verify the hardware is viable.

This is a really finicky technology, as Patrick noted in his article. We love the potential, but getting it set up is a bit of a knife's edge. We run these in production in a 50/50 split: 50% persistent (App Direct) and 50% in Memory Mode. On the persistent portion we store databases and staging files for an ETL process. For that purpose these things are incredible: latency is near-RAM, random read/write speeds are impressive, and it presents as a drive to the OS, so no weird tricks.

We have two new hypervisors with 1TB and 1.5TB of Optane installed in Memory Mode. We're going to see how they do with a bunch of VMs jammed on there. In theory this tech is a game changer, but it seems underutilized. We have some VMs that need 512GB of RAM for in-memory databases (Solr) where only a portion of the DB is 'hot', and Optane is perfect for this.
Yes, they are an amazing technology. Probably the reason they are underutilized is that Intel has no idea how to market them.
Of course there is the real problem that there isn't much software that exploits them properly, so very few people have them.
But the reason there isn't much software is that very few people have them!

The way around this is for someone to write the software on spec, and that's what my company, 2Misses Corp., has done. We have the fastest key-value store (hash-table based) in the world.

What's your exact database scenario? We might be able to help you get the most out of your expensive Optane pmem.
 

Evan

Well-Known Member
Jan 6, 2016
3,346
584
113
Nothing spectacular, but I have seen a lot of enterprise mixed-use SSDs with zero percent life left recently. They are on appropriate firmware levels.
They are still running, so I will investigate further in the new year to check actual writes and usage (I don't have access to the OS on those systems myself).
Maybe the workload is just really high, but I am surprised; at less than two years old, it's not that easy to kill MU drives. (I haven't checked the supplier either, but I will report back early January, as I am really curious. This wasn't at all expected.)
 

Evan

Well-Known Member
Jan 6, 2016
3,346
584
113
Nothing spectacular, but I have seen a lot of enterprise mixed-use SSDs with zero percent life left recently. They are on appropriate firmware levels.
They are still running, so I will investigate further in the new year to check actual writes and usage (I don't have access to the OS on those systems myself).
Maybe the workload is just really high, but I am surprised; at less than two years old, it's not that easy to kill MU drives. (I haven't checked the supplier either, but I will report back early January, as I am really curious. This wasn't at all expected.)
Samsung PM1643 mostly, 960GB read-intensive (1 DWPD for 5 years), and sure enough they have written 1700+ TB.
It looks like the HP utilities just look at that figure and nothing else when determining health, so they may still be perfectly good SSDs; I just can't tell.

So those systems have been writing about 3TB a day onto a single RAID 1 pair. An 800GB, or better a 1600GB, mixed-use drive (3 DWPD) would have been fine for the life of the system. But it's not often you see that much write activity.
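The endurance math above checks out. A quick sanity check (the ~19 months of service is an assumed figure; the post only says "less than 2 years"):

```shell
# Endurance sanity check: a 960GB drive rated at 1 DWPD over a 5-year
# warranty is good for roughly 960GB x 365 days x 5 years of writes.
rated_tbw=$(( 960 * 365 * 5 / 1000 ))   # in TB
echo "rated endurance: ${rated_tbw} TB"  # 1752 TB, so 1700+ TB written is right at the limit

# Implied write rate, assuming ~19 months (~570 days) in service
echo "daily writes: $(( 1700 * 1000 / (19 * 30) )) GB/day"  # ~2982 GB, close to 3TB/day
```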
 

Borromini

New Member
Aug 25, 2021
12
1
3
An aging PSU here took out the home server (Celeron G1610). It took a while to diagnose, unfortunately; it didn't die outright but made the server hang or reboot repeatedly. I found out systemd can show (re)boots with # journalctl --list-boots, and that revealed a boatload of reboots, even on a single day :(. It was in my brother's home office; he wasn't amused (small business, all his data on that server).
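For anyone chasing the same symptom, a sketch of turning the journal into a per-day reboot tally (this assumes a persistent journal and journalctl's usual --list-boots layout, where the date is the fourth field; adjust the awk field if your output differs):

```shell
# List all recorded boots (needs Storage=persistent in journald.conf,
# otherwise the journal is lost on every reboot)
journalctl --list-boots

# Tally boots by date of first journal entry to spot the worst day
journalctl --list-boots --no-pager | awk '{print $4}' | sort | uniq -c | sort -rn
```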

We have another cheapo (AMD 5150) system in place for now, but once the new office is ready we'll be moving to an overkill 19" 1U rack server with redundant PSUs :D.
 

oddball

Active Member
May 18, 2018
204
97
28
40
Yes, they are an amazing technology. Probably the reason they are underutilized is that Intel has no idea how to market them.
Of course there is the real problem that there isn't much software that exploits them properly, so very few people have them.
But the reason there isn't much software is that very few people have them!

The way around this is for someone to write the software on spec, and that's what my company, 2Misses Corp., has done. We have the fastest key-value store (hash-table based) in the world.

What's your exact database scenario? We might be able to help you get the most out of your expensive Optane pmem.
Fascinating. We're using them in three scenarios:
1) Adding cheap RAM capacity to hypervisors, especially for VMs that don't have consistent RAM needs.
2) SQL Server TempDB space / fast disk. We split the Optane 50/50: half is a disk and half is RAM that stores indexes.
3) Analytics processing as a fast disk. We process thousands to tens of thousands of small zipped text files a day, and those small latency gains add up very quickly.

What's the use case for your key-value store?
 

technovelist

New Member
Dec 26, 2021
18
0
1
Fascinating. We're using them in three scenarios:
1) Adding cheap RAM capacity to hypervisors, especially for VMs that don't have consistent RAM needs.
2) SQL Server TempDB space / fast disk. We split the Optane 50/50: half is a disk and half is RAM that stores indexes.
3) Analytics processing as a fast disk. We process thousands to tens of thousands of small zipped text files a day, and those small latency gains add up very quickly.

What's the use case for your key-value store?
Sorry I didn't see this earlier.

The use case is random access by key to very large databases, e.g. billions of records. We are getting microsecond or faster retrieval times regardless of the number of records, tested up to tens of billions. Rehashing is also fast and does not affect retrieval times, no matter how many times the table has been rehashed.

If you would like more info, drop me a line at sheller@2misses.com.