Napp-IT Replication Integrity After NIC Failure


sonoracomm

New Member
Feb 10, 2017
7
0
1
67
Hi all,

We have been using the Napp-IT Replication Feature for a long time as a (multi-level) backup. We've been lucky and haven't needed to restore from it, so I'm not well versed in this feature/technology.

We had an incident where a NIC failed after moving a storage server (third-level backup) to another location with much slower connectivity (VPN) to the two source servers.

Obviously there were a few Replication jobs that failed or were incomplete.

Do I need to do anything to verify or repair the replicated datasets after the same/existing jobs have resumed successful scheduled operation with no ongoing errors?

Thanks in advance,

G
 

gea

Well-Known Member
Dec 31, 2010
3,431
1,335
113
DE
The key point is that you need identical snapshot pairs on source and destination (same repli_nn snap number) to continue incremental replications. With a correct snapshot pair you can simply restart or reverse a replication (set the old destination filesystem to rw and create a new replication job there with the same job id).

If an incremental replication fails, e.g. due to a network error, you can just restart/retry. In rare cases the last destination snap is damaged. As a napp-it replication preserves at least the last three snap pairs, you can destroy the newest destination snap; the next job run will then be based on the previous pair.

If you do not have a snap pair with the same number, rename the destination filesystem, e.g. to filesystem.old, and start the job for a new full initial transfer. After success, destroy filesystem.old, which you had kept simply as a backup.

If you (re)run a replication without errors, it was checksum protected and is ok. There is no need to verify or repair anything because of former errors.
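As a rough sketch, those three recovery steps map to standard ZFS commands. Pool/filesystem names (tank/data, backup/data) and the repli_nn snapshot numbers below are placeholders, not your actual job names:

```shell
# Compare replication snapshots on source and destination to find
# the newest common snap pair (names here are examples).
zfs list -t snapshot -o name -s creation tank/data | grep repli
zfs list -t snapshot -o name -s creation backup/data | grep repli

# If the newest destination snap is suspected damaged, destroy it so
# the next job run falls back to the previous snap pair:
zfs destroy backup/data@repli_101

# If no common snap pair exists at all: keep the old destination as a
# backup and let the job do a new full initial transfer.
zfs rename backup/data backup/data.old
# ...run the replication job; after it succeeds:
zfs destroy -r backup/data.old
```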
 

evawillms

New Member
Oct 6, 2023
2
0
1
How do I know if DFS replication is working?
 

gea

Well-Known Member
Dec 31, 2010
3,431
1,335
113
DE
ZFS takes care of data integrity; a distributed filesystem organizes shares.
If the first is intact, the second should be as well.
 

gea

Well-Known Member
Dec 31, 2010
3,431
1,335
113
DE
With ZFS it is quite simple.
Either a replication finishes without an error and a new destination snap is created, or it does not (replication failed).
In the latter case (network or any other problem), just restart to retry.
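That check can be sketched from the shell; the dataset name is an assumption, substitute your own:

```shell
# Show the newest snapshot on the destination filesystem; after a
# successful run it carries the new replication snap number.
zfs list -t snapshot -o name,creation -s creation backup/data | tail -1
```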
 

Lother00001

New Member
Dec 5, 2024
1
0
1
Great to see Napp-IT replication in action! If the jobs resumed with no errors, the datasets should be fine, since ZFS ensures integrity. For extra peace of mind, run zfs diff to check for discrepancies and consider a periodic scrub to keep backups solid.
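A sketch of those two checks, with placeholder pool/dataset/snapshot names:

```shell
# Show what changed between the last two replication snaps on the destination
zfs diff backup/data@repli_100 backup/data@repli_101

# A scrub re-reads every block and verifies it against its checksum
zpool scrub backup
zpool status backup   # shows scrub progress and any errors found
```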
 

gea

Well-Known Member
Dec 31, 2010
3,431
1,335
113
DE
An incremental replication based e.g. on common snap nr 100 creates a new snap 101 on the source to transfer the difference. If snap 101 appears on the destination, the replication was successful.
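Under the hood this corresponds to an incremental zfs send/receive. A minimal sketch with assumed names (tank/data, backup/data, backuphost):

```shell
# Create the new source snap and send only the delta since the common snap
zfs snapshot tank/data@repli_101
zfs send -i tank/data@repli_100 tank/data@repli_101 | \
    ssh backuphost zfs receive backup/data

# Success check: snap 101 now exists on the destination
ssh backuphost zfs list -t snapshot backup/data | grep repli_101
```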
 

gea

Well-Known Member
Dec 31, 2010
3,431
1,335
113
DE
ZFS replications are checksum protected. If a replication runs through and a new base snap appears on the destination, it finished successfully; otherwise, on any error (LAN or other), the replication stops with an error and no new base snap is generated on the destination. This is not napp-it specific; this is how ZFS replication works.

When you start an incremental replication, a rollback of the destination filesystem to the last common base snap (the result of a former successful replication) is always done, so a former unsuccessful replication run, or any other modification or write access to the destination filesystem, does not matter: it is discarded by the rollback.

Resumable transfers are a method to reduce the amount of data that must be retransferred after errors, mostly relevant for large initial transfers; the rollback is the method that protects incremental replications.

System checkpoints are a method to go back to a former pool/vdev state, e.g. prior to adding a vdev; they are not related to replications.

Redundant hardware (psu, network, hba, sas, raid etc.) improves availability. ZFS data protection works without it.
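For illustration, both mechanisms exist as stock OpenZFS flags; dataset names are placeholders:

```shell
# receive -F rolls the destination back to the last common snap before
# applying the incremental stream, discarding any interim writes there
zfs send -i tank/data@repli_100 tank/data@repli_101 | \
    zfs receive -F backup/data

# Resumable transfer (useful for large initial sends): receiving with -s
# stores a resume token on the destination if the stream is interrupted
zfs send tank/data@repli_100 | zfs receive -s backup/data
# after an interruption, read the token and resume from where it stopped:
zfs get -H -o value receive_resume_token backup/data
zfs send -t <token> | zfs receive -s backup/data
```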