What are the best practices to keep ZFS from being too fragmented

What are the best practices to keep ZFS from being too fragmented due to ZFS’ COW architecture?

I understand the general “best practice” of not filling a ZFS pool beyond ~80%, so that ZFS has spare space to work with and fragmentation is kept to a minimum, but I would also think that, by virtue of ZFS being copy-on-write, it will invariably and inevitably become highly fragmented anyway, by design. No?

Any best practices/resources that people are aware of that they can share, would be greatly appreciated.

Thank you.

I may be wrong, but I think the daily/weekly scrub will fix this.


If that’s the case, then that should be doable.

(Well…maybe not weekly, but maybe monthly. Having ZFS scrub through > 100 TB of data weekly would likely equate to it effectively being a perpetual scrubbing of the ZFS pool.)
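(For reference, kicking a scrub off by hand and checking on it would just be the following, using my pool as an example:)

zpool scrub share
zpool status share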

ZFS performance tanks when there is little free space. Fragmentation could contribute but that’s a smaller problem than high utilization. You can use the command: zpool list

NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
Dozer  12.2T  2.71T  9.49T        -         -     3%    22%  1.13x    ONLINE  /mnt

to show the fragmentation of the system. Scrubbing verifies integrity of the data but this is not your old DOS FAT file system where fragmentation of the data is as big of an issue.
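If you just want those two numbers, you can also query the properties directly (same example pool):

zpool get fragmentation,capacity Dozer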


Thank you, Tom.

As you can see from the output below, I’m currently only at about 6% used capacity, but I still have about 9% fragmentation.

NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
share  65.5T  4.20T  61.3T        -         -     9%     6%  1.00x    ONLINE  /mnt

It’s coming down a bit as I am currently evacuating the data off of the system; the data is being migrated over to my Proxmox server, and the TrueNAS system is being decommissioned entirely.

My concern is that as new data is written and changed over time, the copy-on-write nature of ZFS is going to result in greater and greater levels of fragmentation.

The system will eventually be hosting 32 drive spindles (mechanically rotating hard drives), and the second part of the concern is that as fragmentation increases, the system will have to, quote/unquote, “look harder” (across the spindles) to locate the fragmented pieces of data written by ZFS’ copy-on-write, which will eventually degrade performance.

It is my understanding that for SSDs, fragmentation is less of an issue (unless you overwhelm the flash translation layer, the modern version of the “file allocation table” that maps requested data onto the logical and/or physical block addresses of the NAND flash cells), because seek times on SSDs are significantly faster than on HDDs. So higher levels of fragmentation on SSDs are not as big of a deal, but on HDDs they can become (over time) a bigger and bigger deal.

And from googling this topic, it would appear that the most common recommendation is to offload the data and then write it back to the ZFS pool in a more sequential manner, in order to “defrag” the pool.
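(For what it’s worth, the sketch of that approach that I keep seeing is a snapshot plus zfs send/receive to a staging pool and back; the dataset and pool names below are made up:)

zfs snapshot -r share/mydata@offload
zfs send -R share/mydata@offload | zfs receive stagingpool/mydata
# after verifying the copy, destroy the original and send it back so the
# data lands as one (mostly) sequential write:
zfs destroy -r share/mydata
zfs send -R stagingpool/mydata@offload | zfs receive share/mydata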

But for a > 100TB ZFS pool, that’s just not (always) possible.

Thank you.

To be more clear, the “fragmentation” in ZFS is NOT the same as in the days of FAT32/NTFS. It has nothing to do with files and used space, only free space. It does not have the same performance impact, but running over 80% full will. You can do more reading here on the fragmentation topic: Chris's Wiki :: blog/solaris/ZFSZpoolFragmentationDetails
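If you want to see the raw data behind that metric, I believe zdb can dump the per-metaslab free-space histograms that the post discusses (read-only, but it can take a while on a large pool):

zdb -mm Dozer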

Thank you for the link.

I read through his blog post, and if the FRAG % means that “a high fragmentation metric equates to most of the free space being comprised of small segments,” then wouldn’t the more “traditional” way of thinking about the fragmentation problem with a copy-on-write filesystem like ZFS still apply?

i.e. think about it this way:

If the ashift of the vdev is 12 (i.e. 2^12 bytes = 4096 bytes), and Chris also states that “The bucket numbers printed represent raw powers of two, eg a bucket number of 10 is 2^10 bytes or 1 KB; this implies that you’ll never see a bucket number smaller than the vdev’s ashift,” and each bucket is assigned a fragmentation percentage (“To simplify slightly, each spacemap histogram bucket is assigned a fragmentation percentage, ranging from ‘0’ for the 16 MB and larger buckets down to ‘100’ for the 512 byte bucket”), then combining the two ought to imply that with ashift=12, a fragmentation percentage of ‘100’ corresponds to free segments that are only 4096 bytes in size.

Therefore, if you have a file that is larger than 4096 bytes, ZFS has to find enough available blocks to hold it, and if the free space is highly fragmented (again, per the blog, you can have 80% of the pool free but have it all in tiny, tiny chunks), you are still going to face the more “traditional” version of the fragmentation problem: any file larger than your ashift, and therefore larger than the smallest unit of free space, would be “shotgunned” across your pool, which for mechanically rotating hard drives means ZFS has to command the drives to seek to and read all of those dispersed blocks in order to read the file.

In other words, the “traditional” sense of the problem still exists even if you stay under the 80% utilization guideline AND your fragmentation report shows ‘100’, because all that means is that all of the free space is available in segments equal to your vdev’s ashift size.

Please correct and/or educate me if I have misread and/or misinterpreted what Chris’ blog is saying.

But if I read and understood it correctly, you can basically create this situation, albeit relatively slowly, by just doing random writes to the HDD where the size of each random write is equal to the vdev’s ashift.

(i.e. if your vdev’s ashift=12, then you can create this condition/scenario, meeting the 80% utilization guideline AND hitting a ‘100’ fragmentation value, via 4 kiB random writes; once those conditions are met and you write a file that is larger than 4 kiB, ZFS has to find all of those shotgunned 4 kiB free blocks and write your new, bigger file as a dispersed file in those 4 kiB chunks.)
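(A small-scale sketch of that experiment on a throwaway test dataset might look something like this; the dataset name and file count are made up, and I’m assuming that lots of small files with every other one deleted behave roughly like the 4 kiB random writes described above:)

zfs create -o recordsize=4K share/fragtest
for i in $(seq 1 50000); do dd if=/dev/urandom of=/mnt/share/fragtest/f$i bs=4k count=1 2>/dev/null; done
# delete every other file so the freed 4 kiB holes end up interleaved with data
for i in $(seq 1 2 50000); do rm /mnt/share/fragtest/f$i; done
zpool get fragmentation share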

So long as the remaining free space ISN’T contiguous, you’re still going to run into the “traditional” fragmentation problem (which reduces performance due to an increase in access time/latency).

From Chris’ blog (post spacemap histogram):
“This table defines a segment size based fragmentation metric that will allow each metaslab to derive its own fragmentation value. This is done by calculating the space in each bucket of the spacemap histogram and multiplying that by the fragmentation metric in this table. Doing this for all buckets and dividing it by the total amount of free space in this metaslab (i.e. the total free space in all buckets) gives us the fragmentation metric. This means that a high fragmentation metric equates to most of the free space being comprised of small segments. Conversely, if the metric is low, then most of the free space is in large segments. A 10% change in fragmentation equates to approximately double the number of segments.”
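(Just to convince myself of the arithmetic, that weighted average works out like this; the amounts of free space and the per-bucket percentages below are made up for illustration rather than taken from the real table, other than the 16 MB+ bucket scoring ‘0’:)

awk 'BEGIN {
    # made-up free space per segment-size bucket (bytes) and the fragmentation
    # percentage assigned to that bucket
    free["small"]  = 3e12; pct["small"]  = 100;   # tiny segments
    free["medium"] = 1e12; pct["medium"] = 25;
    free["large"]  = 1e12; pct["large"]  = 0;     # 16 MB and larger segments
    for (b in free) { total += free[b]; weighted += free[b] * pct[b]; }
    printf "fragmentation metric: %.0f%%\n", weighted / total;   # prints 65%
}'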

If your vdev is comprised of 4 kiB blocks (i.e. ashift=12, which I believe is the default in TrueNAS Core), and you are sitting right at the 80% utilization recommendation while still hitting the 100 fragmentation value, then what it means is that every 5th block is a free block.

data-data-data-data-free (repeat)

You have that block pattern across your entire HDD, and you satisfy both conditions: no more than 80% utilization AND 100% fragmentation.

Thus, when you write a 16 kiB file, you will need to write it on blocks 5, 10, 15, and 20.

(if you numbered the blocks with the first block of the vdev starting from 1)

Staying at or under 80% utilization isn’t going to help with this problem.

Per Chris’ blog, having a lower fragmentation value just means that the free space is in larger chunks.

But if you have a repeating pattern as shown above, you can satisfy both conditions AND still have a corresponding increase in access time/latency, which will adversely affect your ZFS pool’s performance; thus you still end up with the same, more “traditional” fragmentation problem.

I need to try and come back to this when I have time; I’m guessing my big server will be in the double digits for fragmentation. I also have not seen a performance issue, because I think we are under 20% utilization right now.


If fragmentation is an issue due to access time, wouldn’t adding a metadata cache help remedy this?

Not necessarily.

I wouldn’t think so.

If the ZFS pool has an ashift=12 (i.e. a block = 4 kiB)

And you have a block pattern like this:

data-data-data-data-free (rinse. repeat. for the entire map of the platter)

You’d still meet the 80% utilization “rule of thumb,” and if I understood Chris’ blog correctly, a free segment can’t be smaller than the block size (which in this case, with ashift=12, is 4 kiB); therefore a pool whose free space consists entirely of 4 kiB segments would report a fragmentation value of ‘100’, since, again per Chris’ blog, the reported value isn’t the percentage (or some other quantitative representation) of the number of file fragments, but rather a measure of how fragmented the free space is, i.e. how small its largest contiguous segments are.

Thus, if you write a new file that is > 4 kiB in size (let’s take for example an 8 kiB file or a 16 kiB file), your new data pattern will become:

data-data-data-data-newdata-data-data-data-data-newdata

And to read the newdata blocks, you’d have to skip over four other data blocks.

And if they are written linearly like that, access time might be less of an issue.

But if the newdata blocks are now spread over multiple physical drives, and multiple new files are being written at the same time so that newdata1 ends up interspersed with, say, newdata2, then you’d have to skip over even more data and newdata(i) blocks to get to the newdata(i) of interest.

A metadata cache isn’t going to do much about the interspersion of the newdata itself.

I hope this makes sense.

(And the interspersion can occur naturally on ZFS due to its copy-on-write nature: again, you can start out with the data-data-data-data-free (repeat) block pattern, and then, by virtue of CoW, ZFS will try to find a new, large-enough contiguous region for your new (or changed) file, and when it can’t find one (due to the aforementioned block layout/pattern), it has to intersperse the write across the available free blocks, which are themselves already interspersed.)

If you have a 16 MiB ZFS pool on a slow HDD (e.g. a 4200 rpm 2.5" laptop HDD), write random data to it, and then “prune” it by deleting every 5th block, you can create and simulate this.

And then when you want to write changed data to it, the math, on the surface, ought to be able to show the increase in access time that arises as a result of the ZFS CoW nature.
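(A rough sketch of how I’d set up that throwaway pool; note that ZFS wants at least 64 MiB per vdev, so the backing file here is a bit bigger than the 16 MiB mentioned above, and the names/sizes are just examples:)

truncate -s 128M /tmp/fragtest.img
zpool create -o ashift=12 fragtest /tmp/fragtest.img
# fill it with 4 kiB writes, delete every 5th block's worth, rewrite some data,
# and watch what happens to the FRAG column and the access patterns
zpool get fragmentation,capacity fragtest
zpool destroy fragtest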

And this, in turn, is meant to show what happens if you have a fair bit of dynamically changing data being written/changed on the disk for a decent portion of any given day.

(i.e. you have a VM with 64 GB of RAM running Windows. The swap space is part of a provisioned VM disk file (regardless of whether it’s thick or thin provisioned), and as swap keeps doing its thing, you will eventually cycle through enough data in swap that you end up in this situation. This is how I ended up killing my Intel 750 Series 400 GB NVMe 3.0 x4 AIC SSD. On average, I was only writing about 2.29 GiB/day, but because it was ALL random writes, it burned through the write endurance limit of the NAND flash memory cells/modules in about 3-4 years, which resulted in me RMAing the SSD back to Intel for a warranty replacement. Now that I have my system running as a VM, all of the swap happens inside the VM disk image file, which sits on top of ZFS, which is CoW. The 16 MiB test sample is just so that dd if=/dev/urandom of=swap bs=4k count=1024 doesn’t take forever (vs. writing out a 64 GB swap file and testing this with that).)