WD RED HDD failed, but can PCB replacement fix it?


cromo

Member
Jun 6, 2019
97
28
18
My WD RED WD80EFAX HDD suddenly died last week: I shut down my Proxmox server, booted it up again and the drive started "clicking". It was clicking for a while, until it stopped and no longer does that. I did not receive any SMART warnings ahead of time, and looking back at the /var/lib/smartmontools/ attrlog, I don’t think there was anything to worry about there:

date                 SMART attribute ID  current  raw
2023-10-24 09:34:51  1                   100      0
2023-10-24 09:34:51  2                   128      116
2023-10-24 09:34:51  3                   253      2031728
2023-10-24 09:34:51  4                   99       6689
2023-10-24 09:34:51  5                   100      0
2023-10-24 09:34:51  7                   100      0
2023-10-24 09:34:51  8                   128      18
2023-10-24 09:34:51  9                   95       41823
2023-10-24 09:34:51  10                  100      0
2023-10-24 09:34:51  12                  100      2276
2023-10-24 09:34:51  22                  100      100
2023-10-24 09:34:51  192                 93       9251
2023-10-24 09:34:51  193                 93       9251
2023-10-24 09:34:51  194                 127      279174185011
2023-10-24 09:34:51  196                 100      0
2023-10-24 09:34:51  197                 100      0
2023-10-24 09:34:51  198                 100      0
2023-10-24 09:34:51  199                 200      0

Compare that with the first values recorded in that log file:

date                 SMART attribute ID  current  raw
2022-04-15 15:52:32  1                   100      0
2022-04-15 15:52:32  2                   128      116
2022-04-15 15:52:32  3                   151      8617263560
2022-04-15 15:52:32  4                   100      584
2022-04-15 15:52:32  5                   100      0
2022-04-15 15:52:32  7                   100      0
2022-04-15 15:52:32  8                   128      18
2022-04-15 15:52:32  9                   96       28636
2022-04-15 15:52:32  10                  100      0
2022-04-15 15:52:32  12                  100      557
2022-04-15 15:52:32  22                  100      100
2022-04-15 15:52:32  192                 99       1794
2022-04-15 15:52:32  193                 99       1794
2022-04-15 15:52:32  194                 144      279174185005
2022-04-15 15:52:32  196                 100      0
2022-04-15 15:52:32  197                 100      0
2022-04-15 15:52:32  198                 100      0
2022-04-15 15:52:32  199                 200      0
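
Comparing the two snapshots by eye is error-prone, so here is a minimal Python sketch that diffs them. It assumes the smartd attrlog layout is a timestamp followed by tab-separated `id;normalized;raw;` triplets (check a real line from your own log before relying on this). The sample lines are shortened and re-typed from the two tables above, and `parse_attrlog_line` and `diff_snapshots` are hypothetical helper names, not part of smartmontools.

```python
from datetime import datetime


def parse_attrlog_line(line):
    """Return (timestamp, {attribute_id: (normalized, raw)}) for one log line.

    Assumed layout (verify against your own file):
    'YYYY-MM-DD HH:MM:SS;<tab>ID;NORM;RAW;<tab>ID;NORM;RAW;...'
    """
    fields = [f.strip() for f in line.split(";")]
    fields = [f for f in fields if f]  # drop the empty field after the last ';'
    timestamp = datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S")
    numbers = [int(f) for f in fields[1:]]
    attrs = {}
    # Walk the remaining fields three at a time: id, normalized, raw.
    for i in range(0, len(numbers) - 2, 3):
        attrs[numbers[i]] = (numbers[i + 1], numbers[i + 2])
    return timestamp, attrs


def diff_snapshots(old, new):
    """Return {attribute_id: (normalized_delta, raw_delta)} for changed attributes."""
    changes = {}
    for attr_id in sorted(old.keys() & new.keys()):
        d_norm = new[attr_id][0] - old[attr_id][0]
        d_raw = new[attr_id][1] - old[attr_id][1]
        if d_norm or d_raw:
            changes[attr_id] = (d_norm, d_raw)
    return changes


# Shortened sample lines, re-typed from the two tables above.
FIRST = "2022-04-15 15:52:32;\t4;100;584;\t5;100;0;\t9;96;28636;\t12;100;557;\t193;99;1794;\t197;100;0;"
LAST = "2023-10-24 09:34:51;\t4;99;6689;\t5;100;0;\t9;95;41823;\t12;100;2276;\t193;93;9251;\t197;100;0;"

t0, first = parse_attrlog_line(FIRST)
t1, last = parse_attrlog_line(LAST)
changes = diff_snapshots(first, last)
elapsed_hours = (t1 - t0).total_seconds() / 3600

for attr_id, (d_norm, d_raw) in changes.items():
    print(f"attribute {attr_id:>3}: normalized {d_norm:+d}, raw {d_raw:+d}")
print(f"wall-clock hours between snapshots: {elapsed_hours:.0f}")
```

With these sample values, the raw power-on hours (attribute 9) grow by 13187 over about 13362 wall-clock hours, which fits a drive that was running almost continuously, while the reallocated and pending sector counts (attributes 5 and 197) do not move at all.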

The HDD was connected through an external USB enclosure, so I first tested with another USB enclosure to check whether the problem persists, and unfortunately it does. What I am seeing in dmesg is:

Code:
[25343.421737] usb 2-3: new SuperSpeed USB device number 8 using xhci_hcd
[25343.442848] usb 2-3: New USB device found, idVendor=152d, idProduct=1561, bcdDevice= 1.04
[25343.442854] usb 2-3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[25343.442857] usb 2-3: Product: SABRENT
[25343.442858] usb 2-3: Manufacturer: SABRENT
[25343.442860] usb 2-3: SerialNumber: DB98765432143
[25343.446053] scsi host1: uas
[25343.446591] scsi 1:0:0:0: Direct-Access     SABRENT                   0104 PQ: 0 ANSI: 6
[25343.448532] sd 1:0:0:0: Attached scsi generic sg0 type 0
[25353.377987] sd 1:0:0:0: [sda] 1953506646 4096-byte logical blocks: (8.00 TB/7.28 TiB)
[25353.378144] sd 1:0:0:0: [sda] Write Protect is off
[25353.378147] sd 1:0:0:0: [sda] Mode Sense: 53 00 00 08
[25353.378427] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[25353.378658] sd 1:0:0:0: [sda] Preferred minimum I/O size 32768 bytes
[25353.378662] sd 1:0:0:0: [sda] Optimal transfer size 268431360 bytes not a multiple of preferred minimum block size (32768 bytes)
[25384.996385] sd 1:0:0:0: [sda] tag#22 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD IN
[25384.996393] sd 1:0:0:0: [sda] tag#22 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00
[25385.016413] scsi host1: uas_eh_device_reset_handler start
[25385.148590] usb 2-3: reset SuperSpeed USB device number 8 using xhci_hcd
[25385.174465] scsi host1: uas_eh_device_reset_handler success
[25417.783354] scsi host1: uas_eh_device_reset_handler start
[25417.783528] sd 1:0:0:0: [sda] tag#24 uas_zap_pending 0 uas-tag 1 inflight: CMD
[25417.783535] sd 1:0:0:0: [sda] tag#24 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00
[25417.915763] usb 2-3: reset SuperSpeed USB device number 8 using xhci_hcd
[25417.937381] scsi host1: uas_eh_device_reset_handler success
[25450.530389] scsi host1: uas_eh_device_reset_handler start
[25450.530552] sd 1:0:0:0: [sda] tag#26 uas_zap_pending 0 uas-tag 1 inflight: CMD
[25450.530556] sd 1:0:0:0: [sda] tag#26 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00
[25450.658774] usb 2-3: reset SuperSpeed USB device number 8 using xhci_hcd
[25450.680523] scsi host1: uas_eh_device_reset_handler success
[25453.039632] sd 1:0:0:0: [sda] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=99s
[25453.039639] sd 1:0:0:0: [sda] tag#9 Sense Key : Aborted Command [current]
[25453.039641] sd 1:0:0:0: [sda] tag#9 Add. Sense: No additional sense information
[25453.039644] sd 1:0:0:0: [sda] tag#9 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00
[25453.039646] I/O error, dev sda, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[25453.039650] Buffer I/O error on dev sda, logical block 0, async page read
[25483.301277] sd 1:0:0:0: [sda] tag#10 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD IN
[25483.301299] sd 1:0:0:0: [sda] tag#10 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00
[25483.345279] scsi host1: uas_eh_device_reset_handler start
[25483.477571] usb 2-3: reset SuperSpeed USB device number 8 using xhci_hcd
[25483.499402] scsi host1: uas_eh_device_reset_handler success
While the disk appears to report its capacity (7.28 TiB), I cannot get smartctl to show anything at all: it gets stuck at -c, -i and, obviously, -a. The disk does, however, "tick" rhythmically and rather quietly while smartctl remains stuck, but it is not the "clicking" sound.
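
To see what the failing commands in the dmesg output actually ask for, the CDB bytes can be decoded by hand. The following is a small Python sketch of that decoding plus the capacity arithmetic, using only the numbers printed in the log above; `decode_read10` and `capacity_bytes` are hypothetical helper names chosen for illustration.

```python
def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) command descriptor block")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),              # bytes 2-5: logical block address
        "transfer_blocks": int.from_bytes(cdb[7:9], "big"),  # bytes 7-8: number of blocks
    }


def capacity_bytes(logical_blocks, block_size):
    """Total capacity from the block count and block size the kernel printed."""
    return logical_blocks * block_size


cmd = decode_read10("28 00 00 00 00 00 00 00 01 00")
total = capacity_bytes(1953506646, 4096)

print(cmd)
print(f"{total} bytes = {total / 10**12:.2f} TB = {total / 2**40:.2f} TiB")
```

Every command that times out is a read of a single block at LBA 0, so the enclosure gets a capacity answer from the drive but never receives the very first sector. That alone does not say whether the cause is mechanical or electronic.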

I also tried connecting it via SATA through a borrowed PCI extension card, since my server is a Lenovo Tiny and does not come with a regular SATA connector. There, I kept getting 'sata link down' errors, although I cannot be 100% sure they weren’t caused by the PCI extension card itself, since I didn’t have another disk to test with it to rule out a false negative. I'll see if I can test it again in some other system just to be sure.

Lastly, I removed the PCB and did not see any immediate damage to it. I also cleaned it up a bit, but that didn't do anything.

At this point I am wondering if smartctl failing to report anything and the SATA link errors could be indicative of a PCB failure? It would be an odd one, since the disk *does* spin up and partially report itself via USB enclosure, so it's not *completely* broken.

I am slightly confused because I am not sure if I should go through that effort. The replacement PCBs for this model are readily available on Aliexpress at a reasonable price, but it would take quite some work to re-solder the BIOS SMD chip.

P.S. that HDD contained backups only, so nothing critical but I would still prefer to retain the data. And I was also going to set up RAID for it (the main SSD with OS is already RAID1), it just wasn't a priority.
 

DavidWJohnston

Active Member
Sep 30, 2020
274
232
43
It sounds like a mechanical issue rather than electronic, but it's hard to say. From your description it's probably not worth the soldering effort to try replacing the board, but that's your call.

For sure try with a SATA-to-USB adapter before trying anything more drastic. It's a handy piece of kit to have around. You can also tell if the platters are spinning by gently tilting the drive and feeling for the gyroscopic forces. If it's not spinning, this might help:

You can try putting it in the freezer for a few hours, then use a USB-to-SATA adapter to mount the drive and copy the most important data off as fast as possible. Sometimes freezing temporarily revives the drive long enough to copy some data off. This is especially effective on a drive which has been in-service for a long time, has degraded lubricants, suddenly turned off, and now has stiction. The freezing contracts the metal in the bearing just enough to start spinning again for a while before it seizes completely.

If that doesn't work, you could plug it in then LIGHTLY tap the sides of the drive with a mallet to un-stick whatever might be stuck, if that is the issue. Just lightly go around the perimeter and feel for the gyro forces starting, then you know it's spinning. Don't overdo it!

If that doesn't work and you've got nothing else to try, you can open the top cover of the drive in the cleanest place you have. Power it on and connect it, and if the platters spin up, but the heads are stuck in park (inner-most position, or stuck on the ramp if present), just help it a little by pushing it slightly, and see if the heads jump out onto the platters and start working. Sometimes a head will get stuck to the platter and prevent it from spinning. Sometimes you can remedy that with a thin piece of plastic like overhead projector acetate.

If you do take the cover off the drive, place it back on ASAP if it starts to work, then copy as fast as possible. The drive will degrade fast, and it can never be used again once opened, so only do it if you're out of options.

Good luck!
 

cromo

Apologies for a late response. I very much appreciate your thorough response, although most of it assumes the drive isn't spinning — however it does spin up each time. Which is why I will abstain from freezing the drive just yet or disassembling it.

I tested with yet another system and am getting different SATA errors again, which I think could be further evidence of the electronics failing. Correct me if I am wrong, but errors such as "failed to enable AA", "Read log 0x00 page 0x00 failed", etc. suggest there's a communication error with the disk at the SATA protocol level:


Code:
2023-11-28T20:36:43.823710+01:00 proxmox kernel: [ 1687.533870] ata6: link is slow to respond, please be patient (ready=0)
2023-11-28T20:36:48.059737+01:00 proxmox kernel: [ 1691.769848] ata6: COMRESET failed (errno=-16)
2023-11-28T20:36:49.027742+01:00 proxmox kernel: [ 1692.737841] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
2023-11-28T20:36:49.027770+01:00 proxmox kernel: [ 1692.739092] ata6.00: failed to read native max address (err_mask=0x100)
2023-11-28T20:36:49.027773+01:00 proxmox kernel: [ 1692.739769] ata6.00: HPA support seems broken, skipping HPA handling
2023-11-28T20:36:54.607898+01:00 proxmox kernel: [ 1698.317931] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
2023-11-28T20:36:54.607905+01:00 proxmox kernel: [ 1698.318902] ata6.00: ATA-9: WDC WD80EFAX-68LHPN0, 83.H0A83, max UDMA/133
2023-11-28T20:36:54.607906+01:00 proxmox kernel: [ 1698.319966] ata6.00: failed to enable AA (error_mask=0x1)
2023-11-28T20:36:54.611989+01:00 proxmox kernel: [ 1698.322029] ata6.00: Read log 0x00 page 0x00 failed, Emask 0x1
2023-11-28T20:36:54.611993+01:00 proxmox kernel: [ 1698.322786] ata6.00: NCQ Send/Recv Log not supported
2023-11-28T20:36:54.611994+01:00 proxmox kernel: [ 1698.323423] ata6.00: Read log 0x00 page 0x00 failed, Emask 0x40
2023-11-28T20:36:54.611994+01:00 proxmox kernel: [ 1698.324056] ata6.00: NCQ Send/Recv Log not supported
2023-11-28T20:36:54.611994+01:00 proxmox kernel: [ 1698.324707] ata6.00: Read log 0x00 page 0x00 failed, Emask 0x40
2023-11-28T20:36:54.611995+01:00 proxmox kernel: [ 1698.325353] ata6.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 32)
2023-11-28T20:36:54.611995+01:00 proxmox kernel: [ 1698.326043] ata6.00: failed to set xfermode (err_mask=0x40)
2023-11-28T20:36:54.615803+01:00 proxmox kernel: [ 1698.326712] ata6: limiting SATA link speed to 3.0 Gbps
2023-11-28T20:36:54.615824+01:00 proxmox kernel: [ 1698.327359] ata6.00: limiting speed to UDMA/133:PIO3
2023-11-28T20:37:00.063709+01:00 proxmox kernel: [ 1703.776061] ata6: SATA link down (SStatus 0 SControl 320)
2023-11-28T20:37:00.063743+01:00 proxmox kernel: [ 1703.776851] ata6.00: disable device
2023-11-28T20:37:00.803705+01:00 proxmox kernel: [ 1704.513792] ata6: SATA link down (SStatus 0 SControl 300)
So I am leaning towards replacing the drive's PCB in the hope that this helps, but it will have to wait until January.
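
The SStatus and SControl numbers in the log above can be split into their fields to see what the link itself reported. This is a hedged Python sketch: it assumes the kernel prints both registers in hexadecimal and that the fields follow the usual SATA/AHCI layout (DET in bits 0-3, SPD in bits 4-7, IPM in bits 8-11). `decode_link_line` is a hypothetical helper name.

```python
import re

# Device detection values of the SStatus DET field (assumed standard meanings).
SSTATUS_DET = {
    0: "no device, no PHY communication",
    1: "device present, no PHY communication",
    3: "device present, PHY communication established",
    4: "PHY offline",
}
# SPD field: negotiated speed in SStatus, allowed maximum in SControl.
LINK_SPEED = {0: "none", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}

LINK_RE = re.compile(r"SStatus ([0-9A-Fa-f]+) SControl ([0-9A-Fa-f]+)")


def split_fields(value):
    """Split a register value into its DET, SPD and IPM nibbles."""
    return {"det": value & 0xF, "spd": (value >> 4) & 0xF, "ipm": (value >> 8) & 0xF}


def decode_link_line(line):
    """Decode the SStatus/SControl pair from one 'SATA link up/down' line."""
    match = LINK_RE.search(line)
    if match is None:
        return None
    sstatus = split_fields(int(match.group(1), 16))
    scontrol = split_fields(int(match.group(2), 16))
    return {
        "device": SSTATUS_DET.get(sstatus["det"], "reserved"),
        "negotiated_speed": LINK_SPEED.get(sstatus["spd"], "reserved"),
        "speed_limit": LINK_SPEED.get(scontrol["spd"], "reserved"),
    }


up = decode_link_line("ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)")
down = decode_link_line("ata6: SATA link down (SStatus 0 SControl 320)")
print(up)
print(down)
```

Read this way, the first two resets end with a full-speed link and a detected device, and only after the kernel caps the link at 3.0 Gbps does the drive stop answering at the PHY level.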
 

oneplane

Well-Known Member
Jul 23, 2021
874
532
93
Clicking is pretty much always a mechanical and/or head defect, not a PCB or controller issue. The more you run it, the harder recovery will be.
Recovery (with clicking) is much cheaper than you might think: getting about 98% of the data back from a disk with a broken head that caused a single dead spot on a platter cost about €230. If no head stack transfer had been needed, just a dead spot bypass, it would have been closer to €150.

Every time you turn it on, you risk damaging the platters (since the drive won't know if a head got stuck or came off). If you live around UTC+1 (that's what your timestamp suggests), there are a few companies I could recommend, there's one in the UK where I had my most recent disk failures handled and it was always really affordable and professional.

Anecdote:

Had a customer ask for recovery on a dead external disk, turns out they kept trying to run it for over a week, when the drive got to the recovery company the inside was filled with black metal dust because one of the heads had gotten stuck on the disk but didn't come off of the actuator; result was essentially the entire top layer of the platter scratched away and turned into dust. That UK-based company had a bit of fun with that one, and we ended up not trying any sort of recovery, it costs you shipping and not much more to find out it was that bad.
 

cromo

Clicking is pretty much always a mechanical and/or head defect, not a PCB or controller issue.
For the record, it doesn't click anymore. I heard merely a couple of consecutive clicks after I first powered up the system and witnessed the issue. That lasted for some seconds and then stopped. Couldn't this just as well be, e.g., some circuitry dying on the HDD's system board? All I hear now is the disk spinning up and doing, say, half of its typical head noise, but it isn't the complete start-up noise it used to make. Then it tries again for a shorter time, which is consistent with the "SATA link up" messages showing up twice in the logs. At that point all noise stops because the system gives up on initializing the disk (ata6.00: disable device). I am aware, though, that this could be a broken head failing to move into position.

One thing I wish I had a more concrete answer to is what SATA errors look like for disks that have a mechanical failure. Do they also report SATA issues like mine does? If a disk cannot spin up properly, should it still at least establish working SATA communication with the system, such that e.g. SMART can be read? How would a manufacturer design their diagnostic software if a mechanical issue prevented the disk from even showing up in the system? (I tried on Windows with HDD tooling; the disk doesn't show up in the OS there, either.) Surely the disk's firmware knows more about what's going on with its hardware, and that is what the diagnostic software would capture and report back to the user?

I haven't had a disk fail on me in many years, so these are all assumptions, but that's why I assume this is HDD's system board issue, i.e. that some SATA communication should be possible in any situation barring electronics failure. Also, the SATA errors above are commonly reported on the Internet in the context of controller/cabling/power supply issues.

EDIT:
Well, interestingly, I had ChatGPT help me out with these questions and it was quite insightful:


In situations where you're encountering errors immediately after connecting a drive, especially during the initialization phase, the kernel messages you provided can indeed offer some clues. However, interpreting these messages might still require additional context and investigation. Here are some insights into the specific errors you mentioned:

  1. COMRESET failed (errno=-16):
    • This error indicates that a communication reset (COMRESET) operation, which is used to reset the SATA link, failed. It suggests an issue with the communication between the system and the connected drive.
  2. failed to read native max address (err_mask=0x100):
    • This error suggests a problem reading the maximum address supported by the drive. It could be indicative of electronic or communication issues, but it may not provide a clear distinction between mechanical and electronic failures.
  3. HPA support seems broken, skipping HPA handling:
    • HPA (Host Protected Area) is a feature that allows limiting the visible capacity of a drive. This message indicates an issue with handling HPA and might be related to electronic or communication problems.
  4. failed to enable AA (error_mask=0x1):
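    • This error indicates that the kernel failed to enable AA, which in the Linux libata driver refers to the SATA DMA Setup FIS Auto-Activate optimization rather than asynchronous notification. It could be related to communication issues or problems with the drive's firmware.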
  5. Read log 0x00 page 0x00 failed, Emask 0x40:
    • This error indicates a failure to read a log page from the drive. It might be related to communication or firmware issues.
  6. failed to set xfermode (err_mask=0x40):
    • This error suggests a failure to set the transfer mode. It could be indicative of communication issues between the drive and the SATA controller.
In situations where a drive fails to initialize, and you're unable to retrieve SMART information, it's challenging to definitively determine whether the issue is mechanical or electronic without additional diagnostic steps. Here are some suggestions:

  • Check Cables and Connectors: Ensure that the SATA cables and connectors are in good condition. Try using different cables and SATA ports on the motherboard.
  • Try on Another System: Connect the problematic drive to another system to see if the issue persists. This can help rule out issues with the specific hardware configuration.
  • Drive Manufacturer Diagnostics: Use diagnostic tools provided by the drive manufacturer to perform tests and gather more detailed information about the drive's health.
  • Professional Assistance: If the data on the drive is critical, consider seeking assistance from professional data recovery services. They may have specialized tools and expertise to diagnose and recover data from failing drives.
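
The err_mask / Emask values in those kernel messages are bit flags, so they can be decoded more precisely than the generic descriptions above. Below is a minimal Python sketch. The flag table is written from memory of the `AC_ERR_*` definitions in the Linux kernel's `include/linux/libata.h`, so treat it as an assumption and verify it against the source of your kernel version. `decode_err_mask` is a hypothetical helper name.

```python
import errno

# Assumed libata error-mask bits (AC_ERR_* in include/linux/libata.h).
AC_ERR_FLAGS = {
    0x001: ("AC_ERR_DEV", "device reported error"),
    0x002: ("AC_ERR_HSM", "host state machine violation"),
    0x004: ("AC_ERR_TIMEOUT", "timeout"),
    0x008: ("AC_ERR_MEDIA", "media error"),
    0x010: ("AC_ERR_ATA_BUS", "ATA bus error"),
    0x020: ("AC_ERR_HOST_BUS", "host bus error"),
    0x040: ("AC_ERR_SYSTEM", "system error"),
    0x080: ("AC_ERR_INVALID", "invalid argument"),
    0x100: ("AC_ERR_OTHER", "unknown"),
    0x200: ("AC_ERR_NODEV_HINT", "polling device detection hint"),
    0x400: ("AC_ERR_NCQ", "marker for offending NCQ command"),
}


def decode_err_mask(mask):
    """Return the names of all flags set in a libata err_mask value."""
    names = [name for bit, (name, _desc) in AC_ERR_FLAGS.items() if mask & bit]
    unknown = mask & ~sum(AC_ERR_FLAGS)  # bits not covered by the table
    if unknown:
        names.append(f"unknown bits 0x{unknown:x}")
    return names


# Masks taken from the kernel log earlier in the thread.
for mask in (0x1, 0x40, 0x100):
    print(f"0x{mask:x}: {decode_err_mask(mask)}")

# errno=-16 from the COMRESET line; on Linux errno 16 is EBUSY.
print(errno.errorcode.get(16, "unknown"))
```

Under that assumption, 0x1 means the drive itself answered the command with an error status, so a full command and response round trip happened over the link, while 0x40 and 0x100 are generic catch-all categories that say little about the root cause.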
 

CyklonDX

Well-Known Member
Nov 8, 2022
1,219
422
83
The clicking is the heads trying to reset or load onto the platters and failing.

The head is moving back and forth here, making the clicking sound. It can't reset itself or go onto the platters. (Sometimes, if the plastic is faulty, a head can literally fall off and get stuck on the other side. Sometimes, in the worst case, the heads make it onto the platter but one of them is dead, so the drive keeps resetting them and potentially damages the platter(s) every time.)
[Attached images: 1701812120693.png, 1701812475082.png]
After a certain number of errors like that, the disk's board disables the mechanical post to ensure no further damage is done, which keeps recovery possible.
To fix this you need to open your disk and replace the head, and pray the platters have no damage.

The plastic failure is the most common reason for this to happen.
 

oneplane

The SATA stack in the HDD firmware might simply not work well if the rest of the SoC is too busy with other things. IIRC most modern drives have at least three CPU cores in their controller where the SATA stack (and serial and at least one of the debug ports) run on a different core vs. the supervisor and the data/mechanical core. The mechanics themselves are on yet another device, usually a combination motor controller/amp (with or without pre-amp, the heads usually even have some sort of in-line pre-pre-amp as well, and all sorts of tuning electronics...).

Then there's the firmware and bootloader and the multiple stages; AFAIK it's still a base ROM, a local unpersonalised firmware, a personalisation ROM or flash identity, a disk-map flash (not about LBAs but about tuning data), and then additional blocks stored on the disk platters themselves. Technically the controller could also just store everything in local flash (internal to the controller SoC or on an SPI chip), but they don't always do that.

As for PCB swaps: this is also why a swap generally isn't an easy fix, since you have to transplant the flash chips and sometimes the controller before it will work. If you don't, it will either do nothing or just destroy your data. PCBs aren't "without data". So even if you were going to try it, you need a PCB of the same version, same revision, same controller and preferably same assembly timeframe, and then also swap some chips before you'd have a chance. (i.e. WD80EFAX-68LHPN0, 0J45413 BA6159A, 2W10209, Western Digital SATA 3.5 PCB)

Edit: if you look at something like this: WD80EFAX-68LHPN0 006-0A90561 0J45413 Circuit Board Repair for HDD data recovery | eBay you can see the chip with the thermal pad, and the one next to it. The first one is the controller, the one next to it is either DRAM (for the cache) or flash, but I think that's DRAM (didn't bother to look up that part number). A little bit below that is a PMIC or DC controller, and to the side on the left is the analog controller (or motor driver/coil driver). The rest of the multi-pin devices seem to be FETs or diodes or DC converters, not SPI flash chips. So that big controller chip would need to be desoldered and put on the new PCB. But if the controller is bad, transplanting it to a new board fixes nothing. If it's just electrical (that's still possible, you could post pictures of the PCB, both sides), you'd most likely see a blown fuse or blown diode. You can see some of those near the SATA power connector. Edit (3): there seems to be a flash chip right at the top right of the middle screw hole, so maybe no need to transplant that fat controller chip, but just the small QFN package. Neat! But that still only helps if the board is the issue.

On top of all of this, the SATA protocol was never really designed for in-depth operational telemetry; there are custom commands that can sometimes do more, but real drive debugging usually requires loading debug keys or debug firmware first, and sometimes requires entering a key over some serial pins (usually on or near the jumper block) before you get any actual debug data in and out. If the SoC were to be having issues, the next thing on top of that yet again would be SWD or JTAG which is sometimes a single chain with all chips or multiple separate chains you have to connect.

This gets way too detailed, but the gist is, it hasn't been simple for a while as increasing capacity and tighter tolerances require more advanced mechanics and hardware and software.

There are specific forums (like HDDGuru) where they can go in more depth, and there are vendors like the one for PC-3000 that sell the required hardware to actually get low-level access to a drive and its controller (a normal SATA bus won't be able to).
 

cromo

Thanks for the input.

I already looked into replacing the PCB, and for this specific board only a flash memory chip needs to be transplanted, according to a few sources that have already done that. This isn’t too complicated in itself, but it will still cost me some time, which is why I was trying to find out upfront whether there’s a way to confirm this is not actually a mechanical failure, or at least that it is most likely a PCB failure.
 

oneplane

In essence, if the drive does a correct "IDENTIFY" response, it's probably not the PCB. That means if the drive Model/Capacity/Serial all show up in stuff like SMART the electronics part is working.

If it's the digital logic, you'd see a generic placeholder, no data about the disk itself (no serial number, no model number, no capacity). If you see only part of that data but information about the capacity and DCO/HPA is missing, that smells like a bad actuator/coil/head to me.

Technically it's possible that just the SoC is bad, but that would normally either break SATA completely, break it by displaying defaults or break it by only supporting low speed interface rates. In your case the only defects it reports are slowness to respond (because it keeps waiting for the disk) and not knowing how to respond to DCO/HPA queries.
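
The rule of thumb above (does the drive answer IDENTIFY with its real model, firmware and capacity, and what happens to the link afterwards?) can be checked mechanically against a kernel log. The following Python sketch does only that bookkeeping. The regular expressions are assumptions based on the message wording in the log posted earlier in this thread, `summarize_ata_log` is a hypothetical helper name, and the result is a summary of symptoms, not a diagnosis.

```python
import re

# Patterns modelled on the kernel messages quoted earlier in the thread.
IDENTIFY_RE = re.compile(r"ata\d+\.\d+: ATA-\d+: (?P<model>[^,]+), (?P<firmware>[^,]+), max")
SECTORS_RE = re.compile(r"ata\d+\.\d+: (?P<sectors>\d+) sectors")


def summarize_ata_log(lines):
    """Collect identity data and link events from libata kernel messages."""
    summary = {
        "model": None,
        "firmware": None,
        "sectors": None,
        "link_down_events": 0,
        "device_disabled": False,
    }
    for line in lines:
        match = IDENTIFY_RE.search(line)
        if match:
            summary["model"] = match.group("model").strip()
            summary["firmware"] = match.group("firmware").strip()
        match = SECTORS_RE.search(line)
        if match:
            summary["sectors"] = int(match.group("sectors"))
        if "SATA link down" in line:
            summary["link_down_events"] += 1
        if "disable device" in line:
            summary["device_disabled"] = True
    return summary


LOG = [
    "[ 1698.318902] ata6.00: ATA-9: WDC WD80EFAX-68LHPN0, 83.H0A83, max UDMA/133",
    "[ 1698.325353] ata6.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 32)",
    "[ 1703.776061] ata6: SATA link down (SStatus 0 SControl 320)",
    "[ 1703.776851] ata6.00: disable device",
    "[ 1704.513792] ata6: SATA link down (SStatus 0 SControl 300)",
]

summary = summarize_ata_log(LOG)
print(summary)
```

For the log in this thread, the summary shows a correct model, firmware version and sector count, followed by two link-down events and the kernel disabling the device.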

If you unscrew the PCB and post pictures we can relatively easily see if the motor controller or DC supply are burnt out.
 

cromo

In essence, if the drive does a correct "IDENTIFY" response, it's probably not the PCB. That means if the drive Model/Capacity/Serial all show up in stuff like SMART the electronics part is working.
Just for the record, that disk does not appear as a /dev device at all, as it fails initialization altogether:
Code:
2023-11-28T20:37:00.063743+01:00 proxmox kernel: [ 1703.776851] ata6.00: disable device
2023-11-28T20:37:00.803705+01:00 proxmox kernel: [ 1704.513792] ata6: SATA link down (SStatus 0 SControl 300)
As such, there's no SMART communication with it, either.


Technically it's possible that just the SoC is bad, but that would normally either break SATA completely, break it by displaying defaults or break it by only supporting low speed interface rates. In your case the only defects it reports are slowness to respond (because it keeps waiting for the disk) and not knowing how to respond to DCO/HPA queries.

If you unscrew the PCB and post pictures we can relatively easily see if the motor controller or DC supply are burnt out.
I checked the PCB first thing, there's no visible damage. But I assumed the physical damage could as well be down to e.g. a cold joint, which could potentially leave it half wrecked?

That said, the eBay repair service you referred to claims the following:

Typical bad hard drive circuit board symptoms:
  • No power, No Spin, No Sound.
  • Burning smell.
  • The circuit board is extremely hot.
  • Power surge, or the hard drive was connected to the wrong power supply.
  • Obvious damage on the circuit board.
For the following situations, the circuit board is most likely OK. We do not recommend sending the board to us:
  • The hard drive spins properly.
  • It makes a clicking sound.
  • It was dropped.
  • It spins well but cannot be detected.
Which would indicate that this drive is most likely damaged mechanically. I will actually ask them for their input, since if anyone has experience with failing electronics, it's a business like theirs.
 

oneplane

I must have blanked out those last lines or something, too focussed on the
Code:
2023-11-28T20:36:54.607905+01:00 proxmox kernel: [ 1698.318902] ata6.00: ATA-9: WDC WD80EFAX-68LHPN0, 83.H0A83, max UDMA/133
part :p

That's one of the identify things that would indicate that it's working, but if the link goes down later that does seem like the SoC crashes or the power has a brownout. It's too broad to pinpoint it, perhaps with some power and current measurement it would be easier, same with monitoring local buses like SPI and I2C or clock lines and reset lines.

Depending on where you are in the world, PC Image (in the UK) is what I would suggest.

Edit: you know what, I think if you have the time and the tools, just swapping the board for the sake of it might be a cheap and fun experiment, as long as you don't strictly need the data back.
 

cromo

I must have blanked out those last lines or something, too focussed on the
Code:
2023-11-28T20:36:54.607905+01:00 proxmox kernel: [ 1698.318902] ata6.00: ATA-9: WDC WD80EFAX-68LHPN0, 83.H0A83, max UDMA/133
part :p
That's one of the identify things that would indicate that it's working, but if the link goes down later that does seem like the SoC crashes or the power has a brownout. It's too broad to pinpoint it, perhaps with some power and current measurement it would be easier, same with monitoring local buses like SPI and I2C or clock lines and reset lines.
This is precisely why it's so puzzling to me and why I had so many questions: can a disk report itself partially but still be electronically damaged? There are multiple chips on the PCB, so I was thinking that it certainly could. The fact that the disk behaves differently in different systems could indicate a communication issue, as you suggest.

That said, I asked the same question on the HDDGuru forum, since it was mentioned here as a better place to ask:


Depending on where you are in the world, PC Image (in the UK) is what I would suggest.

Edit: you know what, I think if you have the time and the tools, just swapping the board for the sake of it might be a cheap and fun experiment, as long as you don't strictly need the data back.
I am in Poland, but it's exactly as you said yourself: the data is not critical at all, it was backups only. Still, the disk itself is quite expensive, even second-hand, so if I can salvage it by spending $15-20 on the board and resoldering the chip myself, it can be a fun experiment. I just wanted to avoid doing that if it was already possible to tell at this point that the damage is 100% mechanical.

In any case, I appreciate your help here a lot!
 

oneplane

It's indeed a bit of a chicken-and-egg issue without expensive recovery hardware to load debug firmware into the device.

As WebClaw wrote on HDDGuru, it is indeed likely that the SoC and the flash chip are fine, the SoC is fine because the SATA stack and SPI access are working, the flash chip is fine because it still holds the correct model ID, version and capacity data, but everything dynamic is wrong. HDDs tend to put all the dynamic stuff like extra firmware modules and runtime metrics (like SMART) on the platters and not on the chips, and since SATA is pretty limited (it's mostly just I/O and a tiny bit of command/metrics) every SATA transaction would likely make the SoC try to read the platters, and if it times out the entire SATA stack might get reset.

Anyhow, you can turn this around: now you have a working PCB to sell to someone else for $20 :p
 

cromo

It's indeed a bit of a chicken-and-egg issue without expensive recovery hardware to load debug firmware into the device.

As WebClaw wrote on HDDGuru, it is indeed likely that the SoC and the flash chip are fine, the SoC is fine because the SATA stack and SPI access are working, the flash chip is fine because it still holds the correct model ID, version and capacity data, but everything dynamic is wrong. HDDs tend to put all the dynamic stuff like extra firmware modules and runtime metrics (like SMART) on the platters and not on the chips, and since SATA is pretty limited (it's mostly just I/O and a tiny bit of command/metrics) every SATA transaction would likely make the SoC try to read the platters, and if it times out the entire SATA stack might get reset.
That's indeed what I understood from further researching the modern HDDs, i.e. that a lot of data is stored on the HDD by the firmware for its internal use, and without a mechanically functional disk it may not establish SATA communication at all. Which sucks because what's the point of SMART if it can't actually be used for diagnosis when real trouble happens? ;)

Either way, thanks for the input and yeah, I guess I can sell that PCB. But first I wonder if freezing it or any other low-skill hack, including opening it and maybe manually fixing heads could help?
 

oneplane

Freezing is mostly a myth. It does shrink metal parts, which might come loose if they were seized and immobile, but if you have clicks and spin-up, that means all moving parts are indeed moving.

That leaves a few options:

- Physical head damage
- Physical platter damage
- Actuator damage

All of them require a cleanroom (or clean box) and spare parts (well, technically not all variations of platter damage require new parts), and specialised hardware and firmware modules.

You can open it up, but once you do it will never work again. It can, however, be useful to see what has actually happened. If you see a tiny bit of metal dangling off an arm or sitting on the platter, you don't need special tools or a microscope to see that a head is gone.
 

Railgun

Active Member
Jul 28, 2018
150
57
28
Take this as you will. I’ve swapped HDD internals once successfully and once not. No, I was not in a clean room and it worked long enough to get stuff off. I don’t recall details as it was over 15 years ago now.

It can be a DIY job, but YMMV. If it’s critical, don’t try it yourself; get it done professionally.