Thank you for the link.
I read through his blog post. If FRAG % means, in his words, “This means that a high fragmentation metric equates to most of the free space being comprised of small segments.”, then wouldn’t the more “traditional” way of thinking about the fragmentation problem still apply to a copy-on-write filesystem like ZFS?
Think about it this way:
Say the ashift of the vdev is 12 (i.e. 2^12 bytes = 4096 bytes). In Chris’ blog, he also states “The bucket numbers printed represent raw powers of two, eg a bucket number of 10 is 2^10 bytes or 1 KB; this implies that you’ll never see a bucket number smaller than the vdev’s ashift.” So 4096 bytes is the minimum size of a bucket on that vdev. He also says “To simplify slightly, each spacemap histogram bucket is assigned a fragmentation percentage, ranging from ‘0’ for the 16 MB and larger buckets down to ‘100’ for the 512 byte bucket”. Combining the two ought to imply that if the vdev’s ashift=12, then a fragmentation percentage of ‘100’ means the available free space is in segments of 4096 bytes.
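To make the power-of-two arithmetic concrete, here is a minimal sketch. The helper name `bucket_bytes` is made up for illustration and is not part of any ZFS tool.

```python
def bucket_bytes(bucket: int) -> int:
    """Size in bytes of a bucket whose number is a raw power of two."""
    return 1 << bucket


ashift = 12  # assumed vdev ashift, as in the example above

print(bucket_bytes(10))      # bucket 10 from the blog's example
print(bucket_bytes(ashift))  # smallest bucket you would see on this vdev
```

With ashift=12 the smallest bucket works out to 4096 bytes, which is the figure used in the rest of this post.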
Therefore, if you have a file that is larger than 4096 bytes, ZFS has to find enough free blocks to hold it. If the free space is highly fragmented (again, per the blog, i.e. you can have 80% free space in the pool, but it is all in tiny, tiny chunks), you still have the more “traditional” fragmentation problem: any file that is larger than your ashift, and therefore larger than the smallest block of free space, gets “shotgunned” across your pool. For mechanically rotating hard drives, that means ZFS has to command the drives to find and read all of those dispersed blocks in order to read the file.
In other words, the “traditional” sense of the problem still exists even if you meet the 80% free space requirement AND your fragmentation report shows ‘100’, because all that tells you is that the free space is in blocks equal to your vdev’s ashift size.
Please correct and/or educate me if I have misread and/or misinterpreted what Chris’ blog is saying.
But if I read and understood it correctly, you can basically achieve this, albeit relatively slowly, by just doing random writes to the HDD where the size of each random write equals the vdev’s ashift size.
(i.e. if your vdev’s ashift=12, you can create this scenario, where you meet the 80% free space requirement AND have a “100” fragmentation value, via 4 KiB random writes. Once those conditions are met and you then write a file that is larger than 4 KiB, ZFS would have to find all of those shotgunned 4 KiB blocks and write your new, bigger file dispersed across those 4 KiB chunks.)
So long as the 80% free space ISN’T contiguous, you’re still going to run into the “traditional” fragmentation problem (which reduces performance due to an increase in access time latency).
From Chris’ blog (post spacemap histogram):
“This table defines a segment size based fragmentation metric that will allow each metaslab to derive its own fragmentation value. This is done by calculating the space in each bucket of the spacemap histogram and multiplying that by the fragmentation metric in this table. Doing this for all buckets and dividing it by the total amount of free space in this metaslab (i.e. the total free space in all buckets) gives us the fragmentation metric. This means that a high fragmentation metric equates to most of the free space being comprised of small segments. Conversely, if the metric is low, then most of the free space is in large segments. A 10% change in fragmentation equates to approximately double the number of segments.”
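The calculation described in that quote can be sketched roughly as below. This is only an illustration, not the OpenZFS code. Only the two endpoints of `FRAG_TABLE` (0 for 16 MB and larger, 100 for 512 bytes) come from the quoted text; the intermediate percentages are assumptions and should be checked against the real table. Treating every segment in a bucket as exactly 2^bucket bytes is also a simplification.

```python
# Fragmentation percentage per bucket, keyed by power-of-two exponent.
# Endpoints are from the blog; values in between are ASSUMED for illustration.
FRAG_TABLE = {
    9: 100, 10: 100, 11: 98, 12: 95, 13: 90, 14: 80, 15: 70, 16: 60,
    17: 50, 18: 40, 19: 30, 20: 20, 21: 15, 22: 10, 23: 5, 24: 0,
}


def fragmentation(histogram: dict) -> float:
    """Space-weighted fragmentation percentage for one metaslab.

    histogram maps bucket exponent -> number of free segments in that bucket.
    """
    weighted = 0
    total = 0
    for bucket, segments in histogram.items():
        space = segments * (1 << bucket)  # approximate free space in bucket
        # Clamp to the table range: 512 B at the low end, 16 MB and up at the top.
        pct = FRAG_TABLE[max(9, min(bucket, 24))]
        weighted += space * pct
        total += space
    return weighted / total if total else 0.0


print(fragmentation({12: 1000}))           # all free space in 4 KiB segments
print(fragmentation({12: 1000, 24: 100}))  # mostly large segments
```

Under these assumed values, free space made up entirely of 4 KiB segments scores just under 100 rather than exactly 100; the exact figure depends on the real table, but it does not change the argument here.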
If your vdev is comprised of 4 KiB blocks (i.e. ashift=12, which I think is the default in TrueNAS Core), and you still want to meet the 80% recommendation but still hit the 100% fragmentation value, then what it means is that every 5th block is a free block (strictly speaking, that is 80% used and 20% free):
data-data-data-data-free (repeat)
With that block pattern across your entire HDD, you will be able to satisfy both conditions: the 80% recommendation AND 100% fragmentation.
Thus, when you write a 16 KiB file, you will need to write it on blocks 5, 10, 15, and 20
(if you numbered the blocks with the first block of the vdev starting from 1).
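Here is a toy model of that layout, just to check the block numbers. `free_blocks` and `place_file` are hypothetical helpers using a naive first-fit rule; this is not how the real ZFS allocator picks blocks.

```python
def free_blocks(total_blocks: int, period: int = 5) -> list:
    """1-based numbers of the free blocks when every `period`-th block is free."""
    return [b for b in range(1, total_blocks + 1) if b % period == 0]


def place_file(file_bytes: int, block_bytes: int, free: list) -> list:
    """Naive first-fit: take the first free blocks needed to hold the file."""
    needed = -(-file_bytes // block_bytes)  # ceiling division
    if needed > len(free):
        raise ValueError("not enough free blocks")
    return free[:needed]


free = free_blocks(100)                      # data-data-data-data-free, repeated
print(place_file(16 * 1024, 4096, free))     # blocks used by a 16 KiB file
print(len(free) / 100)                       # fraction of blocks that are free
```

None of the four blocks are adjacent, so reading the file back means four separate accesses on a spinning disk.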
Having 80% free space isn’t going to help with this problem.
Per Chris’ blog, having a lower fragmentation value just means that the free space is in larger chunks.
But if you have a repeating pattern as shown above, you can satisfy both conditions AND still have a corresponding access time/latency increase, which will adversely affect your ZFS pool’s performance. So you still end up with the same, more “traditional” fragmentation problem.
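As a rough back-of-the-envelope for the latency point, here is a very simplified model. The 8 ms per-extent seek cost and 150 MB/s throughput are made-up placeholder numbers for a generic HDD, not measurements, and the model ignores caching, prefetch and anything ZFS does to batch I/O.

```python
def read_time_ms(file_bytes: int, extents: int,
                 seek_ms: float = 8.0, throughput_mb_s: float = 150.0) -> float:
    """Estimated read time: one seek per extent plus sequential transfer time."""
    transfer_ms = file_bytes / (throughput_mb_s * 1_000_000) * 1000
    return extents * seek_ms + transfer_ms


file_bytes = 16 * 1024
contiguous = read_time_ms(file_bytes, extents=1)  # one 16 KiB extent
scattered = read_time_ms(file_bytes, extents=4)   # four separate 4 KiB blocks

print(f"contiguous: {contiguous:.2f} ms, scattered: {scattered:.2f} ms")
```

Whatever the real numbers are for a given drive, the seek term scales with the number of extents, which is the “traditional” fragmentation cost I’m talking about.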