What happens if SLOG Fails

Hello,

I’m running TrueNAS, which provides iSCSI storage, and I have the zvols configured for synchronous writes. The array has 4 mechanical drives arranged as 2 mirrored vdevs. Read performance is fine, but write performance is terrible, so I’m considering an SSD for SLOG.
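For context, a layout like the one described would have been built with something along these lines; the pool, zvol, and device names are just placeholders, not my actual ones:

    # four mechanical drives as two mirrored vdevs
    zpool create tank mirror da0 da1 mirror da2 da3
    # a zvol to share over iSCSI
    zfs create -V 500G tank/iscsi-vol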

I’ve read the best practice is to use 2 SLOG devices configured as a mirror. However, if these are SSDs then they will likely wear out around the same time, since exactly the same writes go to both. Assuming that happens, and both drives die at the same time but the system does not lose power or crash, will TrueNAS switch back to the ZIL on the storage pool, like it is right now, without data loss? And what about the data that might still be on the SLOG disks?

If the drives fail in any way that ZFS can detect (falling offline, failing to acknowledge a write as committed, going read-only), then no data is lost and ZFS goes back to the in-pool ZIL. The ZIL is only ever read from when booting up after a sudden shutdown or crash. The data written to the ZIL is also kept in memory, and it is written to the pool from memory; without some protocol to send data directly between drives, it has to be in memory to be written anyway. The maximum ZIL usage is roughly the base storage’s maximum write throughput multiplied by the 5-second transaction group interval. Having a SLOG just gives ZFS a chance to treat writes as more sequential than random, without the risk of losing data.
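For what it’s worth, this is easy to watch from the command line; the pool name below is just a placeholder for yours:

    # a failed or removed SLOG shows up under the "logs" section here,
    # but the pool stays online and sync writes fall back to the in-pool ZIL
    zpool status tank
    # quick health summary across all pools
    zpool status -x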


Nicely worded. It’s not an SSD cache like some say.
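For anyone skimming later, the difference even shows up in how the two kinds of device are added; pool and device names here are just placeholders:

    # SLOG: a separate, preferably mirrored, home for the ZIL (a sync write log, not a cache)
    zpool add tank log mirror nvme0n1 nvme1n1
    # L2ARC: an actual read cache, which is what "SSD cache" usually means
    zpool add tank cache nvme2n1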

Hi, thanks for the explanation so far. That sounds like how asynchronous writes work, though. I thought synchronous (zvol sync=always) works like this; please correct me where needed:

With SLOG:
Host sends writes to TrueNAS and waits
TrueNAS writes to the SLOG
TrueNAS sends an acknowledgement to the host that it succeeded
Host assumes success and moves on
Some time later, TrueNAS copies the pending writes in the SLOG to the zvol

Without SLOG:
Host sends writes to TrueNAS and waits
TrueNAS writes to the ZIL, which exists on the same disks as the zvol
TrueNAS copies the writes from the ZIL to the zvol
TrueNAS sends an acknowledgement to the host that it succeeded
Host assumes success and moves on

Your understanding is incorrect based on one major assumption, and also on one slightly pedantic point of terminology.

First, the terminology: the ZIL, or ZFS Intent Log, stores the same data and is operated the same way regardless of whether it lives on the pool’s main disks or on SLOG (Separate Log) device(s). The only thing that changes between the two cases is which disks hold the ZIL. This is also true regardless of whether you are forcing all transactions to be synchronous or allowing a mix of sync and async transactions; the ZIL is only ever used for sync transactions.
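A couple of commands make the point concrete; the names below are placeholders:

    # the ZIL is only used for sync transactions, controlled per dataset/zvol
    zfs get sync tank/iscsi-vol
    # a SLOG can be removed at any time; the ZIL simply moves back onto the main disks
    zpool remove tank mirror-2    # use the log vdev's name shown by 'zpool status tank'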

It makes no sense to copy data from the ZIL to the real storage, regardless of where the ZIL lives, because a computer never copies directly from one storage device to another. A copy (or move) always means reading the data into memory and then writing it to the new location, so it is simply faster not to have purged the data from memory in the first place when it was written to the ZIL. In fact, data being processed passes through the CPU’s caches and registers, so you would much rather the data stay in L2/L3 or at worst RAM than have to fetch it from disk again: reading it back from the ZIL would take milliseconds on spinning disks, versus nanoseconds from cache or RAM. If you want to see this for yourself, run “zpool iostat -v 5” and watch the operations on the SLOG (you can’t see this with an in-pool ZIL, unfortunately): it will show writes but no reads.
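Here is the exact invocation if you want to watch it yourself (the pool name is a placeholder; the trailing number is the refresh interval in seconds):

    # the SLOG line(s) will show write operations but essentially zero reads
    zpool iostat -v tank 5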

In normal operation of local storage, the process performing a storage operation decides whether each transaction is sync or async. With network storage protocols, however, things get messy. This next part is partly from memory and I’m not going to double-check it, but even if I have a detail wrong, the overall gist, that different protocols do different things, is correct. iSCSI largely passes through what the client requests, so with default settings most writes end up treated as async unless the initiator explicitly asks for a flush, because iSCSI was intended to look like a local disk. NFS, at least as hypervisors use it, tends to make everything synchronous, because it was designed for file-server operations where the server is expected to stand behind any write it has acknowledged. A VM disk is the worst case for losing unacknowledged writes: bits are being changed all over the place, and corruption, especially of metadata, can be catastrophic, far worse than losing one or two ordinary files. Setting sync=always is the workaround that makes iSCSI zvols safe for VM disks no matter what the client asks for. Again, if I have something wrong, it’s along the lines of putting the wrong protocol name with the wrong description.
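The workaround itself is a one-liner on the zvol (or dataset) backing the VM disks; the names here are placeholders:

    # force every write through the ZIL, regardless of what the client requested
    zfs set sync=always tank/iscsi-vol
    # verify, and see what everything else in the pool is set to
    zfs get -r sync tank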

Now that we have a foundation about SLOG, ZIL, sync, and async, here’s the flow when ZFS is handed a transaction:

  1. VM sends the write to its disk, which it may or may not know is virtual, and even if it knows it is virtual it still has no idea whether the disk is local or remote, or which protocol is in use. If the write was async, it waits for an acknowledgement of receipt; if it was sync, it waits for acknowledgements of both receipt and storage. While waiting on the acknowledgement of receipt it keeps the data in memory for a possible resend (this happens in the kernel code that handles I/O).
  2. Hypervisor sends write to storage server via defined protocol. It will proxy back all acknowledgements and/or error responses.
  3. Storage server protocol (the process responding to NFS/iSCSI/etc.) sends the write to ZFS, which it might be accessing through the normal POSIX filesystem layer (storing VM disks as files) or as block devices (storing VM disks as zvols); the behavior is the same for the purposes of this discussion. If the protocol is one that overrides the client’s sync/async intent, this is where that takes effect (it could also happen on the hypervisor side, but the point is that it happens before the transaction goes any further).
  4. When the full write transaction is within the ZFS part of kernel memory: if the transaction is async as far as ZFS has been told, and “sync=always” is not set, ZFS sends an acknowledgement of receipt back up the chain. If the transaction was received as sync, or sync=always is set, no acknowledgement is sent yet.
  5. If an acknowledgement of receipt is sent, and the storage server protocol had converted a sync to async, then the storage server protocol will also acknowledge storage up the chain. This releases the VM, but the protocol should keep the data in its own memory until the real storage acknowledgement is sent.
  6. If the transaction is sync, or you told ZFS to always sync, ZFS writes the transaction to the ZIL as fast as it can, not caring where the ZIL is stored. Once the data is in the ZIL, it acknowledges storage. The difference in write latency between a SLOG (especially an SSD) and the main disks you are already trying to write to is why a SLOG speeds up sync writes.
  7. As soon as it is optimal, but no more than 5 seconds later (ZFS throttles incoming data if it falls behind; upstream processes get frozen by the kernel when they try to hand data to ZFS while its incoming buffer is full), ZFS writes the data to the main storage. If the transaction was sync, ZFS sends a single metadata update to the ZIL marking the transaction as complete (those ZIL blocks are then treated as empty and available for other sync writes, but are not actively erased). If the transaction was async as far as ZFS was concerned, this is when it sends the acknowledgement of storage up the chain. That 5-second window is the transaction group timeout; see the commands just after this list.
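If you are curious about that 5-second window, you can read the transaction group timeout directly; which command applies depends on whether you are on SCALE (Linux) or CORE (FreeBSD):

    # Linux / TrueNAS SCALE
    cat /sys/module/zfs/parameters/zfs_txg_timeout
    # FreeBSD / TrueNAS CORE
    sysctl vfs.zfs.txg.timeout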

Thank you for taking the time to explain. That was very good and well written! I get it now, and the question that came to mind after reading was “Why mirror SLOG then if everything is kept in RAM anyway until committed to the ZFS pool?”

If anyone else comes across this post and has the same question, read this:

Thanks everyone, I now have enough information to make decisions.
