TrueNAS Replication Issue - checksum mismatch or incomplete stream

Hi,

Apologies if I have posted this in the wrong location or in a bad format. I have come here because the Lawrence Systems YouTube videos have been my bible for everything TrueNAS, and I am a little stuck with a replication issue.

To give you some background, I have two TrueNAS systems running on Dell R510 servers. They are on separate sites, connected via a VPN, and both run TrueNAS-12.0-U6.1. I am trying to set up replication jobs between them. For the sake of this explanation, the machine on site A is called Asgard and the one on site B is called Hephaestus.

The intention is to have the production datasets on Asgard replicated to Hephaestus and the production datasets on Hephaestus replicated to Asgard. My original plan was to run both as push jobs, so that each production system is responsible for pushing its own data to the other location.

However, the job that replicates data from Hephaestus → Asgard keeps failing with the error “checksum mismatch or incomplete stream” after 242MB of data has been transferred. This happens consistently at the same point, even after I delete and recreate the replication destination on Asgard. When I try to resume the job, it fails with the same error before any data can be copied.

Strangely, the job from Asgard → Hephaestus is working as intended, which in my mind eliminates networking/hardware issues as a cause.

I have also tried pulling the data from Hephaestus → Asgard instead of using the original push job, but it fails in the same way.

When looking for information about this online, I found people suggesting that this can happen when the pool hasn’t been scrubbed for a while. I have scrubbed it to be sure, but it doesn’t appear to have made a difference.

Any troubleshooting advice would be appreciated.

I tried to include more pictures than this but am limited as I am a new user.
Here is a picture of the replication settings page on Hephaestus:

The log from Hephaestus when attempting a push:
"
[2021/11/15 22:03:50] INFO [Thread-138] [zettarepl.paramiko.replication_task__task_1] Connected (version 2.0, client OpenSSH_8.4-hpn14v15)
[2021/11/15 22:03:51] INFO [Thread-138] [zettarepl.paramiko.replication_task__task_1] Authentication (publickey) successful!
[2021/11/15 22:03:51] INFO [replication_task__task_1] [zettarepl.paramiko.replication_task__task_1.sftp] [chan 5] Opened sftp connection (server version 3)
[2021/11/15 22:03:51] INFO [replication_task__task_1] [zettarepl.replication.run] For replication task ‘task_1’: doing push from ‘Forge/Forge’ to ‘Asgard/Hephaestus’ of snapshot=‘auto-2020-08-11_23-00’ incremental_base=None receive_resume_token=None encryption=False
[2021/11/15 22:03:52] INFO [replication_task__task_1] [zettarepl.transport.ssh_netcat] Automatically chose connect address ‘192.168.60.2’
[2021/11/15 22:07:22] ERROR [replication_task__task_1] [zettarepl.replication.run] For task ‘task_1’ unhandled replication error SshNetcatExecException(ExecException(1, ‘checksum mismatch or incomplete stream.\nPartially received snapshot is saved.\nA resuming stream can be generated on the sending system by running:\n zfs send -t 1-f17f55bb9-f0-789c636064000310a500c4ec50360710e72765a52697303024f141d460c8a7a515a796806472e0f26c48f2499525a9c52036bf3f1f36fd25f9e9a599290c0c5b9f444905bc960b724092e704cbe725e6a63230b8e517a5a7ea834987c4d2927c5d230323035d030b5d43c37823635d03030684fbb81910fe49cecf2d284a2d2ececf6680030090372036\n’), ExecException(1, ‘I/O error\n’))
Traceback (most recent call last):
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py”, line 164, in run_replication_tasks
retry_stuck_replication(
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/stuck.py”, line 18, in retry_stuck_replication
return func()
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py”, line 165, in <lambda>
lambda: run_replication_task_part(replication_task, source_dataset, src_context, dst_context,
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py”, line 258, in run_replication_task_part
run_replication_steps(step_templates, observer)
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py”, line 592, in run_replication_steps
replicate_snapshots(step_template, incremental_base, snapshots, encryption, observer)
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py”, line 687, in replicate_snapshots
run_replication_step(step, observer)
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py”, line 764, in run_replication_step
ReplicationProcessRunner(process, monitor).run()
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/process_runner.py”, line 33, in run
raise self.process_exception
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/process_runner.py”, line 37, in _wait_process
self.replication_process.wait()
File “/usr/local/lib/python3.9/site-packages/zettarepl/transport/ssh_netcat.py”, line 198, in wait
raise SshNetcatExecException(connect_exec_error, self.listen_exec_error) from None
zettarepl.transport.ssh_netcat.SshNetcatExecException: checksum mismatch or incomplete stream.
Partially received snapshot is saved.
A resuming stream can be generated on the sending system by running:
zfs send -t 1-f17f55bb9-f0-789c636064000310a500c4ec50360710e72765a52697303024f141d460c8a7a515a796806472e0f26c48f2499525a9c52036bf3f1f36fd25f9e9a599290c0c5b9f444905bc960b724092e704cbe725e6a63230b8e517a5a7ea834987c4d2927c5d230323035d030b5d43c37823635d03030684fbb81910fe49cecf2d284a2d2ececf6680030090372036
I/O error
"
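The error output above embeds a resume token after `zfs send -t`. For anyone who wants to compare tokens across failed attempts, here is a minimal Python sketch for pulling them out of a saved log. The function name `extract_resume_tokens` and the assumption that a token consists only of hex digits and hyphens are illustrative, based on the tokens shown in this thread; this is not part of zettarepl or ZFS.

```python
import re
from typing import List

# Assumption: a resume token is made of hex digits separated by hyphens,
# as in the tokens shown in the logs in this thread.
RESUME_TOKEN_RE = re.compile(r"zfs send -t\s+([0-9a-f]+(?:-[0-9a-f]+)+)")


def extract_resume_tokens(log_text: str) -> List[str]:
    """Return the distinct resume tokens in log_text, in order of first appearance."""
    tokens: List[str] = []
    for token in RESUME_TOKEN_RE.findall(log_text):
        # The same token is usually printed twice (once inside the exception
        # repr, once in the traceback), so skip repeats.
        if token not in tokens:
            tokens.append(token)
    return tokens
```

If two consecutive failures yield different tokens, that suggests the receive side saved a different partial state each time; identical tokens suggest the retry made no progress.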

The log from Asgard when attempting a pull:
"
[2021/11/15 00:00:01] INFO [Thread-25] [zettarepl.paramiko.replication_task__task_3] Connected (version 2.0, client OpenSSH_8.4-hpn14v15)
[2021/11/15 00:00:01] INFO [Thread-25] [zettarepl.paramiko.replication_task__task_3] Authentication (publickey) successful!
[2021/11/15 00:00:02] INFO [replication_task__task_3] [zettarepl.replication.run] Resuming replication for destination dataset ‘Asgard/Hephaestus’
[2021/11/15 00:00:02] INFO [replication_task__task_3] [zettarepl.replication.run] For replication task ‘task_3’: doing pull from ‘Forge/Forge’ to ‘Asgard/Hephaestus’ of snapshot=None incremental_base=None receive_resume_token=‘1-10a72f70db-f0-789c636064000310a500c4ec50360710e72765a52697303024f141d460c8a7a515a796806472e0f26c48f2499525a9c540ba4222980f9bfe92fcf4d2cc140686ad4fa2a4025ecb053920c97382e5f31273531918dcf28bd253f5c1a443626949beae91819181ae8185aea161bc91b1ae810103c27ddc0c08ff24e7e71614a51617e76733c00100d48220b3’ encryption=False
[2021/11/15 00:00:03] INFO [replication_task__task_3] [zettarepl.paramiko.replication_task__task_3.sftp] [chan 5] Opened sftp connection (server version 3)
[2021/11/15 00:00:03] INFO [replication_task__task_3] [zettarepl.transport.ssh_netcat] Automatically chose connect address ‘192.168.60.2’
[2021/11/15 00:00:13] WARNING [replication_task__task_3.stdout_copy] [zettarepl.transport.base_ssh.root@192.168.60.2.shell.5.async_exec.2910] Copying stdout from <paramiko.ChannelFile from <paramiko.Channel 6 (open) window=65536 -> <paramiko.Transport at 0x79fdf10 (cipher aes128-ctr, 128 bits) (active; 2 open channel(s))>>> failed: timeout()
[2021/11/15 00:01:32] ERROR [replication_task__task_3] [zettarepl.replication.run] For task ‘task_3’ unhandled replication error SshNetcatExecException(ExecException(1, ‘checksum mismatch or incomplete stream.\nPartially received snapshot is saved.\nA resuming stream can be generated on the sending system by running:\n zfs send -t 1-10972d90d7-f0-789c636064000310a500c4ec50360710e72765a52697303024f141d460c8a7a515a796806472e0f26c48f2499525a9c540da4036940f9bfe92fcf4d2cc140686ad4fa2a4025ecb053920c97382e5f31273531918dcf28bd253f5c1a443626949beae91819181ae8185aea161bc91b1ae810103c27ddc0c08ff24e7e71614a51617e76733c00100b0ed2072\n’), ExecException(1, ‘’))
Traceback (most recent call last):
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py”, line 164, in run_replication_tasks
retry_stuck_replication(
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/stuck.py”, line 18, in retry_stuck_replication
return func()
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py”, line 165, in <lambda>
lambda: run_replication_task_part(replication_task, source_dataset, src_context, dst_context,
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py”, line 253, in run_replication_task_part
resumed = resume_replications(step_templates, observer)
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py”, line 451, in resume_replications
run_replication_step(step_template.instantiate(receive_resume_token=receive_resume_token), observer,
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py”, line 764, in run_replication_step
ReplicationProcessRunner(process, monitor).run()
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/process_runner.py”, line 33, in run
raise self.process_exception
File “/usr/local/lib/python3.9/site-packages/zettarepl/replication/process_runner.py”, line 37, in _wait_process
self.replication_process.wait()
File “/usr/local/lib/python3.9/site-packages/zettarepl/transport/ssh_netcat.py”, line 198, in wait
raise SshNetcatExecException(connect_exec_error, self.listen_exec_error) from None
zettarepl.transport.ssh_netcat.SshNetcatExecException: checksum mismatch or incomplete stream.
Partially received snapshot is saved.
A resuming stream can be generated on the sending system by running:
zfs send -t 1-10972d90d7-f0-789c636064000310a500c4ec50360710e72765a52697303024f141d460c8a7a515a796806472e0f26c48f2499525a9c540da4036940f9bfe92fcf4d2cc140686ad4fa2a4025ecb053920c97382e5f31273531918dcf28bd253f5c1a443626949beae91819181ae8185aea161bc91b1ae810103c27ddc0c08ff24e7e71614a51617e76733c00100b0ed2072
Command failed with code 1
"

Not an issue I have run into before; you should probably post this in the TrueNAS Community forums.

Appreciate the reply. Yes, that was the plan. Thanks.
Keep up the awesome videos!
