More NappIT replication errors

Bob T Penguin

Member
Dec 16, 2015
55
1
8
44
@gea, please help.

I'm getting replication errors again between my Omnios servers.

Both servers are running OmniosCE r151030cm and NappIT 21.01b4.
The machines are called ServerA and ServerB.
I had been having problems with replication jobs from ServerA to ServerB. Please see this forum post:

I thought I had solved the problem by limiting the ZFS send stream to 30Mb/s.
However the failed replication jobs have returned.
So I have upgraded the storage pool on ServerB from onboard SATAII to a Broadcom 9300-8i and SATAIII disks.
But the upgraded pool has not stopped the failed replications.

These errors seem to have restarted after I upgraded both machines to r151030cm and NappIT 21.01b4 but I can't be sure.

I've attached the NappIT logs for the most recent failed replication job, please can you help me fix what ever is going wrong.

ServerA is the source.
ServerB is the destination

If I start the failed replication job manually it will complete OK.

ServerA storage pool has 2.04TB available and 604GB in use, there are 337 snapshots.
ServerB storage pool has 2.14TB available and 501GB in use, there are 217 snapshots.

Please note, the source snap tank/storage/beth@1451725974_repli_zfs_ServerB_nr_11217 exists on the source machine.

ServerA remote log
zfs remote log for 1451725974 newest entry on top


## 20:20 04 grouplib 1878: call job-repli-send.pl for id 1451725974 on interface /usr/bin/nc -w 40 192.168.1.9 56499


## 20:20 03 grouplib 1810: zfs snapshot tank/storage/beth@1451725974_repli_zfs_ServerB_nr_11217


## 20:20 03 grouplib 1763: remote call zfs send 1451725974 initiated
_________________________________________________________


## 20:20 01 grouplib 2018: remote call destroy snap : zfs destroy tank/storage/beth@1451725974_repli_zfs_ServerB_nr_11214

ServerB Job Log
Log: Job 1451725974 (newest first)
error time: 2021.03.01.20.20.43 line: 2258 my_log end 1551: 887 next destination snap tank/backup/beth@1451725974_repli_zfs_ServerB_nr_11217 from ServerA was not created

error check network, systemlog, poolstate, capacity, timeouts
on recursive transfers optionally delete newest target snap and retry with the former error check network, systemlog, poolstate, capacity, timeouts
on recursive transfers optionally delete newest target snap and retry with the former
end receive time: 2021.03.01.20.20.37 line: 1014 zfs receive 1451725974 running 35 s
receiver message
time: 2021.03.01.20.20.37 line: 1002
receive finished with warning 'failed to read from stream',
can be a message only: check if target snap is created
time: 2021.03.01.20.20.37 line: 992
receiver message time: 2021.03.01.20.20.37 line: 985 cannot receive: failed to read from stream
error, monitor info time: 2021.03.01.20.20.37 line: 333 replication terminated: local receive=1, remote send=0 - check zpool status
incremental send time: 2021.03.01.20.20.03 line: 1135 ServerA: zfs send -i tank/storage/beth@1451725974_repli_zfs_ServerB_nr_11216 tank/storage/beth@1451725974_repli_zfs_ServerB_nr_11217 | /var/web-gui/data/tools/nc/nc -b 131072 -w 40 192.168.1.9 56499
start receiver time: 2021.03.01.20.20.02 line: 960 /var/web-gui/data/tools/nc/nc -b 131072 -d -l -p 56499 | /usr/sbin/zfs receive -Fv tank/backup/beth 2>&1
zfs destroy time: 2021.03.01.20.20.02 line: 904 tank/backup/beth@1451725974_repli_zfs-kh2021022816_ServerB_nr_11210
request remote snap destroy time: 2021.03.01.20.20.01 line: 855 tank/storage/beth@1451725974_repli_zfs_ServerB_nr_11214
next replication started time: 2021.03.01.20.20.00 line: 136

ServerB Last run Log
og: Last run of Job 1451725974 (newest first)
2021.03.01.20.20.43 &my_log_last line 2259
<- &my_log line 2017
<- &my_end_all line 1986
<- &my_fast_end line 1065
<- &my_remote_delta_replication line 442
error;
2021.03.01.20.20.41 &my_log_last line 1063
<- &my_remote_delta_replication line 442
<- &my_run line 172
887 next destination snap tank/backup/beth@1451725974_repli_zfs_ServerB_nr_11217 from ServerA was not created

2021.03.01.20.20.39 &my_log_last line 2464
<- &my_postreplication line 1022
<- &my_remote_delta_replication line 442
<- &my_run line 172
end my postreplication
2021.03.01.20.20.39 &my_log_last line 2453
<- &my_postreplication line 1022
<- &my_remote_delta_replication line 442
<- &my_run line 172
my postreplication, zfs set readonly=on tank/backup/beth
2021.03.01.20.20.39 &my_log_last line 2393
<- &my_postreplication line 1022
<- &my_remote_delta_replication line 442
<- &my_run line 172
zfs set nfs/smb shares=off tank/backup/beth
2021.03.01.20.20.39 &my_log_last line 2373
<- &my_postreplication line 1022
<- &my_remote_delta_replication line 442
<- &my_run line 172
zfs set readonly=off tank/backup/beth
2021.03.01.20.20.39 &my_log_last line 2345
<- &my_postreplication line 1022
<- &my_remote_delta_replication line 442
<- &my_run line 172
1522 update id.par: last=01.mar_20_20
2021.03.01.20.20.39 &my_log_last line 2325
<- &my_postreplication line 1022
<- &my_remote_delta_replication line 442
<- &my_run line 172
next_src_snap from host ServerA
tank/storage/beth@1451725974_repli_zfs_ServerB_nr_11217 0 - 1.42G
2021.03.01.20.20.38 &my_log_last line 2313
<- &my_postreplication line 1022
<- &my_remote_delta_replication line 442
<- &my_run line 172
begin my_postreplication
2021.03.01.20.20.38 &my_log_last line 2090
<- &my_end_proc line 335
<- &my_monitor line 1183
<- &my_remote_delta_replication line 442
<- &my_run line 172
kill proc 19410 ? S 00:00 /usr/sbin/zfs receive -Fv tank/backup/beth (19410)
2021.03.01.20.20.37 &my_log_last line 1015
<- &my_remote_delta_replication line 442
<- &my_run line 172
end receive 35 s
2021.03.01.20.20.37 &my_log_last line 984
<- &my_remote_delta_replication line 442
<- &my_run line 172
error;, receiver message
cannot receive: failed to read from stream
2021.03.01.20.20.37 &my_log_last line 2064
<- &my_end_proc line 335
<- &my_monitor line 1183
<- &my_remote_delta_replication line 442
<- &my_run line 172
kill proc 19409 ? S 00:00 /var/web-gui/data/tools/nc/nc -b 131072 -d -l -p 56499 (19409)
2021.03.01.20.20.37 &my_log_last line 2064
<- &my_end_proc line 335
<- &my_monitor line 1183
<- &my_remote_delta_replication line 442
<- &my_run line 172
kill proc 19408 ? S 00:00 sh -c /var/web-gui/data/tools/nc/nc -b 131072 -d -l -p 56499 | /usr/sbin/zfs re (19408)
2021.03.01.20.20.37 &my_log_last line 334
<- &my_monitor line 1183
<- &my_remote_delta_replication line 442
<- &my_run line 172
error;, monitor info, replication terminated: local receive=1, remote send=0 - check zpool status
2021.03.01.20.20.07 &my_log_last line 313
<- &my_monitor line 1183
<- &my_remote_delta_replication line 442
<- &my_run line 172
Monitor: Remote proc:
remote nc and zfs send not running or finished
2021.03.01.20.20.04 &my_log_last line 1154
<- &my_remote_delta_replication line 442
<- &my_run line 172
source snap1,2: size=0,0

2021.03.01.20.20.03 &my_log_last line 1138
<- &my_remote_delta_replication line 442
<- &my_run line 172
start sender
id=1451725974
src_interface=/usr/bin/nc -w 40 192.168.1.9 56499
src_interface2=/var/web-gui/data/tools/nc/nc -b 131072 -w 40 192.168.1.9 56499
snap_name1=tank/storage/beth@1451725974_repli_zfs_ServerB_nr_11216
snap_name2=tank/storage/beth@1451725974_repli_zfs_ServerB_nr_11217
send_inc=-i
snap_rec=
send_rec=
send_dedup=
send_i=-i
transfer_zero=
send_option=
pv_limit=-L 30m
2021.03.01.20.20.02 &my_log_last line 963
<- &my_remote_delta_replication line 442
<- &my_run line 172
start receiver
/var/web-gui/data/tools/nc/nc -b 131072 -d -l -p 56499 | /usr/sbin/zfs receive -Fv tank/backup/beth 2>&1
2021.03.01.20.20.00 &my_log_last line 776
<- &my_remote_delta_replication line 442
<- &my_run line 172
start remote incremental replication
2021.03.01.20.20.00 main line 76 start job-replicate with parameter:
id=1451725974, action=run, par='run_1451725974

many thanks
bob
 

gea

Well-Known Member
Dec 31, 2010
2,649
908
113
DE
hello bob
The replication starts the zfs receive then tries to start the zfs send remotely. The monitor process then indicates a problem with send and the zfs receive finally ends with a

receiver message time: 2021.03.01.20.20.37 line: 985 cannot receive: failed to read from stream

This indicates that either the send does not start correctly, there was a zfs problem on either side or there was a transmission problem.

Check first if remote control works properly. Goto menu Extensions > Appliance group (receiver side) and klick on zfs and snaps in the line of the sender machine. This should give a zfs and snap list. Check for a long delay of listings. Is there a long delay on a local listing when you click on localhost.

Check this menu on both sides for duplicate entries ex host names and different ip

Check send process log on sender side for send process in menu Jobs > Remote log (enter job id)

check network for Jumbo Frames. On any network problens first deactivate Jumbo

Reboot both servers and retry

What irritates me a little is that a manual start works and not the auto start as the process is quite identical. Is this the only job that is started at that time.

To rule out a timing problem with 21.01 you may downgrade napp-it to the former release.
 
Last edited:

Bob T Penguin

Member
Dec 16, 2015
55
1
8
44
Hello Gea, thank you for the reply, I hope you're well.

Check first if remote control works properly. Goto menu Extensions > Appliance group (receiver side) and klick on zfs in the line of the sender machine. This should give a zfs list.
Done. The appliance groups on both machines show up correctly, all green dots. I've clicked the ZFS list button and they work properly.
Neither machines show duplicate listings.

Check send process log on sender side for send process in menu Jobs > Remote log (enter job id)
I did that, I attached the log to my first post.
I'll copy the entry here...
ServerA remote log
zfs remote log for 1451725974 newest entry on top

## 20:20 04 grouplib 1878: call job-repli-send.pl for id 1451725974 on interface /usr/bin/nc -w 40 192.168.1.9 56499
## 20:20 03 grouplib 1810: zfs snapshot tank/storage/beth@1451725974_repli_zfs_ServerB_nr_11217
## 20:20 03 grouplib 1763: remote call zfs send 1451725974 initiated
_________________________________________________________
## 20:20 01 grouplib 2018: remote call destroy snap : zfs destroy tank/storage/beth@1451725974_repli_zfs_ServerB_nr_11214

check network for Jumbo Frames. On any network problens first deactivate Jumbo
The Nics on both machines are running 1500 MTU

Reboot both servers and retry
I've restarted the destination server, ServerB, but not the sending server. I will restart the sending server tomorrow.

Is this the only job that is started at that time.
The destination machine is only running one job when the failed replication occurs.
The sending machine had a snapshot job running when this replication job failed.
There are other jobs on the destination machine that fail, it doesn't seem to be one specific job that fails.

To rule out a timing problem with 21.01 you may downgrade napp-it to the former release.
I will try this.

thank you for the reply,

regards
Bob
 

Bob T Penguin

Member
Dec 16, 2015
55
1
8
44
...both systems are now running NappIT 20.06a3 Pro, I used the "activate last" option.
I have not restarted the sending machine.
I have confirmed the Applicance Groups are working OK on NappIT 20.06a3 Pro.

regards
Bob
 

gea

Well-Known Member
Dec 31, 2010
2,649
908
113
DE
If the problem remains, please update to 21.dev where I have increased waiting time for the monitoring process. This should avoid problems on a slower pool or with many snaps/filesystems.
 
Last edited:

Bob T Penguin

Member
Dec 16, 2015
55
1
8
44
Hello Gea,
Switching to NappIT 20.06a3 seems to be OK. 10 replications have completed successfully. I'll report back if they start to fail again.
Many thanks
Bob
 

gea

Well-Known Member
Dec 31, 2010
2,649
908
113
DE
Main difference between 20.06 and 21.01 regarding replication is a rewrite of the keep/hold behaviour of target replication snaps like keep a snap per hour/day/month/year per last n days/months/years. This can increase the time that is needed for postprocessing. On slow systems with low ram or many snaps/filesystems it may happen that the monitoring process to avoid endless waits cancels a replication too early or the default wait of netcat cancels a slow send too early. In the 21.dev I have increased timer settings and wait timers for netcat transfers.
 

Bob T Penguin

Member
Dec 16, 2015
55
1
8
44
Hello Gea, 20.06a3 is still working OK.

You often mention "Slow systems with low ram, many snaps/filesystems".

My ServerB system (the replication destination) is a Xeon E3 1230 with 32gig ram and a 9300-8i HBA with x3 SATAIII disks.
The system has 224 snapshots and 11 filesystems.
Would you expect this to be a slow system?

It may be important for you to know that this system is also hosting x2 Bhyve VM's with their disks on the pool attached to the 9300-8i. The Bhyve VMs consume 12gig of RAM.

Many thanks
bob
 

Bob T Penguin

Member
Dec 16, 2015
55
1
8
44
Gea, Are there any system tests I could run to help you understand how my systems respond? It may be possible to create some sort of replication benchmark that could be used to fine tune the timers within NappIt.

Regards
Bob
 

gea

Well-Known Member
Dec 31, 2010
2,649
908
113
DE
Hello Bob
I have identified a possible problem when starting replications remotely. I have modified this in current napp-it 21.01 b5. The 21.dev includes this as well and some more.

May you update (source and destination) servers to 21.01b5 or 21.dev and check if your replications fail. The problem may not the replication itself but the remote start.

Main logs to understand replikation are
destination Replication Log and Last Log (menu Jobs)

Source
remote log (Menu Jobs) I have improved this log in 21.dev what helped to find this possible problem

The logs give infos about how long the replication lasts, the transfer rate and the time of postprocessing (mainly the keep hold process ).
 
Last edited:
  • Like
Reactions: Bob T Penguin

Bob T Penguin

Member
Dec 16, 2015
55
1
8
44
Hello Gea, i have updated both systems to 21.dev 04.mar.2021

I also noticed the remote log did not contain an entry for starting the sender.

Regards
Bob
 

gea

Well-Known Member
Dec 31, 2010
2,649
908
113
DE
A dev is like OmniOS bloody work in progress. Current dev should look like (with the send command).
You can supress successfull sends (skip ok=yes) or show only selected log entries, divided per job

21.dev.png
 

Bob T Penguin

Member
Dec 16, 2015
55
1
8
44
Gea,
Here are the log entries for the first replication job with 21.dev
This job completed successfully.

Thank you for investigating this issue.

regards
Bob


zfs remote log for 1451663337 newest entry on top

log job 1451663337

xxx repli-send 217: 2021.03.04.20.05.06 repli-send.pl id 1451663337
remote send finished after 0 s
zfs send (tank/tempstorage): ok

42.1KiB [4.32MiB/s]

xxx repli-send 198: 2021.03.04.20.05.06 repli-send.pl id 1451663337
NAME USED AVAIL REFER MOUNTPOINT
snap1: tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11234 0 - 25.5G -
snap2: tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11235 0 - 25.5G -
/usr/sbin/zfs send -i tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11234 tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11235 | /usr/bin/pv -L 30m -frb -i 10000 2> /tmp/53856.log | /var/web-gui/data/tools/nc/nc -w 900 -b 131072 192.168.1.9 53856 2>&1

xxx repli-send 165: 2021.03.04.20.05.06 repli-send.pl id 1451663337 cmd:
/usr/sbin/zfs send -i tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11234 tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11235 | /usr/bin/pv -L 30m -frb -i 10000 2> /tmp/53856.log | /var/web-gui/data/tools/nc/nc -w 900 -b 131072 192.168.1.9 53856 2>&1

xxx repli send 259: par 1451663337:
id=1451663337, send_dedup=, send_option=, pv_limit=-L 30m, send_rec=, send_inc=-i, snap_name1=tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11234, snap_name2=tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11235, src_interface=/usr/bin/nc -w 900 192.168.1.9 53856, send_ip=, port=, transport=, buffer=, send_i=-i,

xxx repli-send 43: zfs send initiated: remote_1451663337

## 20:05 06 grouplib 1881: job-repli-send.pl for id 1451663337 on interface /usr/bin/nc -w 900 192.168.1.9 53856 initiated from 192.168.1.9

## 20:05 06 grouplib 1810: 192.168.1.9 zfs snapshot tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11235 initiated 192.168.1.9

## 20:05 06 grouplib 1763: 192.168.1.9 remote call zfs send 1451663337 initiated from 192.168.1.9
_________________________________________________________

## 20:05 02 grouplib 2021: remote call destroy snap : zfs destroy tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11232

Log: Job 1451663337 (newest first)
ok time: 2021.03.04.20.05.10 line: 2256 my_log end 1551: ok incremental remote replication nr 11235 finished (time: 3 s)
new snap tank/backup/tempstorage@1451663337_repli_zfs_ServerB_nr_11235, refer: 24.6G
end receive time: 2021.03.04.20.05.06 line: 1027 zfs receive 1451663337 running 3 s
receiving incremental stream of tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11235 into
tank/backup/tempstorage@1451663337_repli_zfs_ServerB_nr_11235
received 41.8KB stream in 1 seconds (41.8KB/sec)
receiver message time: 2021.03.04.20.05.06 line: 995 receiving incremental stream of tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11235 into
tank/backup/tempstorage@1451663337_repli_zfs_ServerB_nr_11235
received 41.8KB stream in 1 seconds (41.8KB/sec)
incremental send time: 2021.03.04.20.05.05 line: 1154 ServerA: zfs send -i tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11234 tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11235 | /var/web-gui/data/tools/nc/nc -b 131072 -w 900 192.168.1.9 53856
start receiver time: 2021.03.04.20.05.03 line: 973 /var/web-gui/data/tools/nc/nc -b 131072 -d -l -p 53856 | /usr/sbin/zfs receive -Fv tank/backup/tempstorage 2>&1
zfs destroy time: 2021.03.04.20.05.02 line: 919 tank/backup/tempstorage@1451663337_repli_zfs-kh24_ServerB_nr_11228
zfs destroy time: 2021.03.04.20.05.02 line: 919 tank/backup/tempstorage@1451663337_repli_zfs-km02_ServerB_nr_11045
request remote snap destroy time: 2021.03.04.20.05.02 line: 870 tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11232
next replication started time: 2021.03.04.20.05.01 line: 130

Log: Last run of Job 1451663337 (newest first)
2021.03.04.20.05.10 &my_log_last line 2257
<- &my_log line 2015
<- &my_end_all line 1984
<- &my_fast_end line 1079
<- &my_remote_delta_replication line 456
ok
2021.03.04.20.05.09 &my_log_last line 1076
<- &my_remote_delta_replication line 456
<- &my_run line 166
ok incremental remote replication nr 11235 finished (time: 3 s)
new snap tank/backup/tempstorage@1451663337_repli_zfs_ServerB_nr_11235, refer: 24.6G
2021.03.04.20.05.09 &my_log_last line 309
<- &my_monitor line 1202
<- &my_remote_delta_replication line 456
<- &my_run line 166
Monitor: Remote proc:
remote nc and zfs send not running or finished, wait for postprocessing
2021.03.04.20.05.07 &my_log_last line 2460
<- &my_postreplication line 1035
<- &my_remote_delta_replication line 456
<- &my_run line 166
end my postreplication
2021.03.04.20.05.07 &my_log_last line 2449
<- &my_postreplication line 1035
<- &my_remote_delta_replication line 456
<- &my_run line 166
my postreplication, zfs set readonly=on tank/backup/tempstorage
2021.03.04.20.05.07 &my_log_last line 2389
<- &my_postreplication line 1035
<- &my_remote_delta_replication line 456
<- &my_run line 166
zfs set nfs/smb shares=off tank/backup/tempstorage
2021.03.04.20.05.07 &my_log_last line 2369
<- &my_postreplication line 1035
<- &my_remote_delta_replication line 456
<- &my_run line 166
zfs set readonly=off tank/backup/tempstorage
2021.03.04.20.05.07 &my_log_last line 2341
<- &my_postreplication line 1035
<- &my_remote_delta_replication line 456
<- &my_run line 166
1522 update id.par: last=04.mar_20_05
2021.03.04.20.05.07 &my_log_last line 2321
<- &my_postreplication line 1035
<- &my_remote_delta_replication line 456
<- &my_run line 166
next_src_snap from host ServerA
tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11235 0 - 25.5G
2021.03.04.20.05.06 &my_log_last line 2309
<- &my_postreplication line 1035
<- &my_remote_delta_replication line 456
<- &my_run line 166
begin my_postreplication
2021.03.04.20.05.06 &my_log_last line 1028
<- &my_remote_delta_replication line 456
<- &my_run line 166
end receive 3 s
2021.03.04.20.05.06 &my_log_last line 994
<- &my_remote_delta_replication line 456
<- &my_run line 166
receiver message
receiving incremental stream of tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11235 into
tank/backup/tempstorage@1451663337_repli_zfs_ServerB_nr_11235
received 41.8KB stream in 1 seconds (41.8KB/sec)
2021.03.04.20.05.06 &my_log_last line 1173
<- &my_remote_delta_replication line 456
<- &my_run line 166
source snap1,2: size=0,0

2021.03.04.20.05.05 &my_log_last line 1157
<- &my_remote_delta_replication line 456
<- &my_run line 166
start sender
id=1451663337
src_interface=/usr/bin/nc -w 900 192.168.1.9 53856
src_interface2=/var/web-gui/data/tools/nc/nc -b 131072 -w 900 192.168.1.9 53856
snap_name1=tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11234
snap_name2=tank/tempstorage@1451663337_repli_zfs_ServerB_nr_11235
send_inc=-i
snap_rec=
send_rec=
send_dedup=
send_i=-i
transfer_zero=
send_option=
pv_limit=-L 30m
2021.03.04.20.05.03 &my_log_last line 976
<- &my_remote_delta_replication line 456
<- &my_run line 166
start receiver
/var/web-gui/data/tools/nc/nc -b 131072 -d -l -p 53856 | /usr/sbin/zfs receive -Fv tank/backup/tempstorage 2>&1
2021.03.04.20.05.01 &my_log_last line 790
<- &my_remote_delta_replication line 456
<- &my_run line 166
start remote incremental replication: read snaps
2021.03.04.20.05.01 main line 76 start job-replicate with parameter:
id=1451663337, action=run, par='run_1451663337
 

gea

Well-Known Member
Dec 31, 2010
2,649
908
113
DE
Fine that it works now for you.

btw.
The reorganisation of the remote log where I display logs now per job in 21.dev gave me a hint about a possible timing problem. For a final test I started all replications on my three backup server and around 12 source machines at the same time (over 100 replications) and they ended without a problem.