We had a physical problem in our backup server (we suspect an electrical shock from a thunderstorm). The system is Debian Linux on one disk, and the storage is a ZFS (ZFS On Linux) RAID-1 pool of two 4 TB disks. The first symptom we discovered was a frozen system. After multiple erratic boots we could no longer get past the BIOS. So we moved the system disk to another computer, which booted without problem and seemed stable, but when we tried to move the ZFS storage into it we discovered that only one disk was still detected as part of a ZFS pool; that one could be loaded/mounted with zpool and the data were there (lsblk -f simply indicated that the other disk is not partitioned). After several attempts to load the second disk, the first one became unloadable as well and was also detected as unpartitioned.

Note: the commands tested and their results are given below.
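One check we still plan to run before anything else: reading the ZFS labels directly from the raw devices with zdb, which should tell us whether the pool metadata itself survived or only the partitioning is gone (a sketch; device names are from our setup):

Code:
# ZFS keeps 4 label copies: 2 at the start and 2 at the end of the vdev.
# If zdb can still print them, the pool data is probably intact and only
# the partition table is missing.
zdb -l /dev/sda

# If the pool was created on a partition rather than the whole disk,
# the labels live on that partition instead (if it still exists):
zdb -l /dev/sda1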
So we tried to test the health of the two disks with the SMART tools (smartctl), but nothing wrong was reported; the disks seemed operational. Then we tried to read the data with dd, with success, since no read error was returned. Next we tried badblocks, which also indicated that everything was OK. Finally we tried gpart, which has so far found one possible empty Windows NT/W2K partition, but the scan is not finished yet as the disks are big.

For now the only problem observed is the missing MBR, but we have not found a tool to recover a ZFS MBR. How can we do this?
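One thing we realized while reading around (please correct us if wrong): when ZFS On Linux is given a whole disk it normally writes a GPT (a large ZFS data partition plus a small reserved partition 9), not a classic MBR, and GPT keeps a backup header and table in the last sectors of the disk. So our current plan is to try gdisk's recovery menu, which can rebuild the main table from that backup; a sketch, not yet run on these disks:

Code:
# Read-only look first: gdisk reports whether the main and/or backup
# GPT structures are present and valid.
gdisk -l /dev/sda

# Interactive recovery: 'r' opens the recovery menu, 'b' and 'c' rebuild
# the main header/table from the backup copies, 'p' to review, 'w' to write.
gdisk /dev/sda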
Additionally, since we have an outdated external clone disk (every month we swap one disk of the pool for another, which resilvers itself, so we can store the replaced disk off-site), we asked ourselves whether we could copy its MBR over the one on the faulty disks. We are not sure whether the MBRs are exactly identical on a disk belonging to a ZFS pool and on its mirror, or whether differences appear only after the MBR is executed. If cloning it is possible, how can we do it with dd?
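As far as we understand it, on a GPT disk the first 512 bytes contain only a generic "protective" MBR, so a raw dd of that sector from the clone would not bring back the partition entries; the real table (and its per-disk GUIDs) lives in the GPT structures. Here is what we are considering instead (a sketch; /dev/sdc stands for the clone disk and /dev/sda for a faulty one, the names are examples):

Code:
# Raw copy of the first sector only; on GPT this copies just the
# protective MBR and is probably not enough by itself:
dd if=/dev/sdc of=/dev/sda bs=512 count=1

# Replicate the whole partition table from the clone onto the faulty
# disk, then randomize disk and partition GUIDs to avoid duplicates:
sgdisk --replicate=/dev/sda /dev/sdc
sgdisk --randomize-guids /dev/sda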
The tests and the results
lsblk => no ZFS Filesystem detected

Code:
root@CZ-LIVE:~# lsblk -o NAME,SIZE,FSTYPE
NAME SIZE FSTYPE
...
sda 3.6T
...
smartctl => nothing special returned by the disk's internal components

Code:
root@CZ-LIVE:~# smartctl -t long /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-2-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 54 minutes for test to complete.
Test will complete after Fri Sep 9 17:24:14 2022 UTC
Use smartctl -X to abort test.
root@CZ-LIVE:~# smartctl -l selftest /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-2-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 4660 -
root@CZ-LIVE:~# smartctl -A /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-2-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0003 100 100 006 Pre-fail Always - 0
3 Spin_Up_Time 0x0003 100 100 000 Pre-fail Always - 16
4 Start_Stop_Count 0x0002 100 100 020 Old_age Always - 100
5 Reallocated_Sector_Ct 0x0003 100 100 036 Pre-fail Always - 0
9 Power_On_Hours 0x0003 100 100 000 Pre-fail Always - 1
12 Power_Cycle_Count 0x0003 100 100 000 Pre-fail Always - 0
190 Airflow_Temperature_Cel 0x0003 069 069 050 Pre-fail Always - 31 (Min/Max 31/31)
root@CZ-LIVE:~# smartctl -A /dev/sda | \
grep -iE "Power_On_Hours|G-Sense_Error_Rate|Reallocated|Pending|Uncorrectable"
5 Reallocated_Sector_Ct 0x0003 100 100 036 Pre-fail Always - 0
9 Power_On_Hours 0x0003 100 100 000 Pre-fail Always - 1
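Two more smartctl views we intend to check for completeness (a sketch; output not included here):

Code:
smartctl -H /dev/sda        # overall health self-assessment (PASSED/FAILED)
smartctl -l error /dev/sda  # ATA device error log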
dd, which shows if there are read errors (source) => no read errors
Code:
root@CZ-LIVE:~# dd if=/dev/sda of=/dev/null bs=64k conv=noerror status=progress
4000784842752 bytes (4.0 TB, 3.6 TiB) copied, 104555 s, 38.3 MB/s
61047148+1 records in
61047148+1 records out
4000785948160 bytes (4.0 TB, 3.6 TiB) copied, 104556 s, 38.3 MB/s
badblocks => no block error

Code:
root@CZ-LIVE:~# date ; badblocks -svn /dev/sda ; date
Fri 16 Sep 2022 17:00:06 UTC
Checking for bad blocks in non-destructive read-write mode
From block 0 to 3907017526
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern: done
Pass completed, 0 bad blocks found. (0/0/0 errors)
Sun 18 Sep 2022 01:54:49 UTC
gpart => an empty partition ... not detected as ZFS

Code:
root@CZ-LIVE:~# gpart /dev/sda
Begin scan...
Possible partition(Windows NT/W2K FS), size(0mb), offset(345079mb)
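While gpart finishes its scan, we also plan a quick read-only probe for any surviving filesystem signatures on the raw disks (a sketch; both commands only read):

Code:
# List any signatures libblkid still recognizes (without -a, wipefs
# only prints them and wipes nothing):
wipefs /dev/sda

# Low-level superblock probing; prints the type and offset if found:
blkid -p /dev/sda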