Hi, I have 10 ST4000LM016 drives in a btrfs RAID and some of them are failing. The weird thing is that a failing disk will drop off the controller entirely. I then connect it to the USB adapter it came with, run the SeaTools long test, and the disk checks out fine again. I can run a read/write test for 4-5 days without a single error, but as soon as I put it back into the server it fails again within a couple of hours. What gives?
This is pretty simple - do you remember the whole TLER fiasco back when WD Red drives first became a thing? These drives aren't meant for RAID, so when one goes into deep error recovery after hitting a bad block, it can take up to 7-10 minutes to complete the recovery cycle. During that time the drive is effectively unresponsive to the controller, and the controller (correctly) kicks it out of the array as a failed drive.
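You can confirm this by asking each drive whether it supports SCT ERC (the vendor-neutral name for what WD brands TLER) - consumer drives like these typically don't, or ship with it disabled. A minimal sketch using smartctl, assuming it's installed and that the sda-sdj device paths match your ten array members:

```python
#!/usr/bin/env python3
"""Query SCT ERC (a.k.a. TLER) support on each array member via smartctl."""
import subprocess

# Device paths are placeholders - substitute your own btrfs members.
DEVICES = [f"/dev/sd{c}" for c in "abcdefghij"]

for dev in DEVICES:
    # `smartctl -l scterc <dev>` prints the current ERC read/write timers,
    # or reports that the SCT ERC command is not supported by the drive.
    result = subprocess.run(
        ["smartctl", "-l", "scterc", dev],
        capture_output=True, text=True,
    )
    print(f"--- {dev} ---")
    print(result.stdout.strip())
```

If the drive did support it, the same flag (`smartctl -l scterc,70,70`) would cap recovery at 7 seconds - but on these models you'll most likely see "not supported", which is exactly why the workaround below goes the other way and raises the OS timeout instead.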
[Edit: Whether you're running a hardware RAID controller or a software btrfs RAID setup, the result is the same - but on a Linux software setup you may be able to work around it by raising the kernel's per-device command timeout from the default 30 seconds to something like 5-7 minutes, so the OS waits out the recovery cycle instead of dropping the drive.]
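On Linux that knob lives in sysfs, one file per block device. A minimal sketch of bumping it - device names are placeholders, it needs root, and the value resets at reboot (a udev rule is the usual way to make it persistent):

```python
#!/usr/bin/env python3
"""Raise the kernel's per-device command timeout so a drive stuck in
deep error recovery isn't declared dead after the default 30 seconds."""
from pathlib import Path

# 420 s (~7 min) comfortably exceeds the worst-case recovery cycle
# described above; tune to taste. Device names are placeholders.
TIMEOUT_SECONDS = 420
DEVICES = ["sda", "sdb"]  # your btrfs member disks

for dev in DEVICES:
    sysfs = Path(f"/sys/block/{dev}/device/timeout")
    old = sysfs.read_text().strip()
    sysfs.write_text(str(TIMEOUT_SECONDS))  # requires root
    print(f"{dev}: timeout {old}s -> {TIMEOUT_SECONDS}s")
```

Bear in mind this only stops the dropouts; the array will still stall for minutes every time a drive goes into recovery, which is miserable but better than a degraded array.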
The drive eventually completes its bad-block recovery, and then the long test you run externally gives it a chance to remap a few more bad sectors - but it's only a matter of time before it hits another one once you put it back into the array and start rebuilding/scrubbing data.
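You can actually watch this cycle in the numbers by checking the remap-related SMART counters before and after each "repair" - if Reallocated_Sector_Ct keeps climbing, the drive is on its way out. A rough sketch, again assuming smartctl is installed and the device paths are yours:

```python
#!/usr/bin/env python3
"""Print the SMART attributes that track bad-block remapping (plus
drive age), so you can see whether the counts keep climbing."""
import subprocess

WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
         "Offline_Uncorrectable", "Power_On_Hours")

for dev in ["/dev/sda", "/dev/sdb"]:  # placeholder device paths
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    print(f"--- {dev} ---")
    for line in out.splitlines():
        if any(attr in line for attr in WATCH):
            fields = line.split()
            # Second column is the attribute name, last is the raw value.
            print(fields[1], fields[-1])
```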
Short version: this happens because the drives are worn out (too many bad blocks accumulating), and it's a sign you should replace them. How old are yours? The 4TB units seem to be much less durable than the 5TB ones.