Encrypted ZFS plus zfs send not fit for production?

Hi all,

I spotted this on r/zfs last week.
https://www.reddit.com/r/zfs/comments/1aowvuj/psa_zfs_has_a_data_corruption_bug_when_using/

From the post:
"There are known data corruption bug(s) when using zfs's native encryption feature along with zfs send/recv. In particular, 'zfs send' on an encrypted dataset can cause one or more snapshots to report errors. Sometimes, deleting the affected snapshot(s) then scrubbing twice appears to resolve the situation, but this is little solace if the corrupted portion of the snapshot has some data that you need."
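
For reference, the cleanup described there would look roughly like this (pool and snapshot names are made up, and zpool wait -t scrub needs OpenZFS 2.0 or later; on older versions you'd watch zpool status between scrubs):

    # destroy the snapshot(s) that report errors, then scrub twice
    zfs destroy tank/data@broken-snap
    zpool scrub tank
    zpool wait -t scrub tank    # block until the first scrub finishes
    zpool scrub tank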

It seems that 'everyone knows' about this bug, which has been around for years. I did not know. I recently set up an immutable ZFS repository for Veeam. My plan had been to send the encrypted snapshots to our hosted storage via site-to-site VPN, knowing they couldn't be mounted by anyone but us.
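
For anyone unfamiliar, the pipeline I had in mind is a raw send, which ships the encrypted blocks as-is so the receiving side never needs the key (dataset, host, and pool names below are placeholders):

    # initial raw replication of an encrypted dataset
    zfs snapshot tank/backups@2024-02-01
    zfs send --raw tank/backups@2024-02-01 | \
        ssh backup-host zfs receive -u -s remotepool/backups

    # later, incremental raw sends
    zfs send --raw -i @2024-02-01 tank/backups@2024-02-08 | \
        ssh backup-host zfs receive -u -s remotepool/backups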

Unless I'm reading this wrong, I need to keep pointing Veeam directly at the offsite environment instead of pushing the local backups out, if I want to use ZFS at that site. Either that, or I decrypt my local ZFS pool and rely on Veeam's encryption, which I don't want to do because I would much rather use open-source encryption standards.

Does anyone here use zfs send to push data off-site in production? What strategies do you have in place to encrypt the data?

S.

I have been using replication and zfs send for a long time, and the only issue I have run into was a bug, since fixed, where replicating from a TrueNAS SCALE system to a TrueNAS Core system would cause the Core system to kernel panic.

A quick read through Proposal: Consider adding warnings against using zfs native encryption along with send/recv in production · Issue #494 · openzfs/openzfs-docs · GitHub shows that someone has mentioned the edge case that sometimes triggers the issue:

  • a quick succession of zfs snapshot + zfs send (without --raw) of the same or different datasets (see the sketch below)
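
In other words, the reported pattern is roughly the following (a sketch only, since nobody has turned this into a reliable reproducer, and the dataset names are invented; note that a non-raw send of an encrypted dataset requires the key to be loaded, because the stream is decrypted on the way out):

    # rapid snapshot + non-raw send cycles on an encrypted dataset
    zfs load-key tank/enc
    for i in 1 2 3 4 5; do
        zfs snapshot tank/enc@run$i
        zfs send tank/enc@run$i > /dev/null    # no --raw: plaintext stream
    done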

The reason this issue has been lingering for a while is that it's so rare: no one has been able to trigger it in a repeatable way.

I am still using encryption and replication.

Thanks, Tom. I appreciate your real-world experience of this. Weighing the risks against the benefits, I will probably go ahead with the replication of encrypted snapshots, given that the likelihood that I will need a particular replicated copy of a snapshot (rather than the local backup copy), and that this copy then happens to have this corruption, is vanishingly small over the lifetime of the backups. FWIW, the issue seems to be unrelated to the SCALE/Core issue: there's evidence of it occurring with Linux-to-Linux replication on SPARC too.

If I want to eliminate this risk, I can run test mounts on the remote server after every send, although this means my keys would need to be on the remote machine, which is one of the things I wanted to avoid.
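
Concretely, that verification pass on the remote box would be something like the following (paths and names are hypothetical; zfs load-key -L lets the key live somewhere temporary rather than in the dataset's configured keylocation, so it only has to be present for the duration of the check):

    # on the remote server, after each receive
    zfs load-key -L file:///run/backup.key remotepool/backups
    zfs mount remotepool/backups
    ls -R /remotepool/backups > /dev/null    # force reads as a crude sanity check
    zfs unmount remotepool/backups
    zfs unload-key remotepool/backups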

Regards,
S.