Permanent errors have been detected in the following files: <metadata>:<0x1b>

Well, that's not the email you want to get at the completion of a resilver onto new, larger drives…

This is quite confusing, since I have never had any errors in my array previously and I run weekly scrubs. I recently added an L2ARC drive, set it to cache metadata only, and run this pre-init script: echo 0 > /sys/module/zfs/parameters/l2arc_headroom
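(For reference, the pre-init script is just that one-liner; as I understand it, `l2arc_headroom` limits how far the L2ARC feed thread scans, and setting it to 0 removes the limit so the whole ARC is eligible for caching, a common tweak to fill the cache device faster:)

```shell
#!/bin/sh
# l2arc_headroom limits how far the L2ARC feed thread scans past the
# ARC tail on each pass; 0 removes the limit so the entire ARC is
# eligible to be cached, letting the L2ARC device fill faster.
echo 0 > /sys/module/zfs/parameters/l2arc_headroom
```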

I also added a SLOG. Both the L2ARC and SLOG devices are used enterprise SAS SSDs. I can't say they are in perfect shape (though they aren't reporting any errors), but SLOG and L2ARC aren't pool-critical, so even if something was wonky with them, would that result in metadata errors on a scrub/resilver?
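(For context, both were added as the usual auxiliary vdevs, something like the following; the by-id paths are placeholders, not my actual devices:)

```shell
# SLOG and L2ARC are auxiliary vdevs, added alongside the raidz2 data vdev.
# Replace the by-id paths below with your actual SSDs.
zpool add pergamum log /dev/disk/by-id/scsi-SLOG-SSD
zpool add pergamum cache /dev/disk/by-id/scsi-L2ARC-SSD
```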

Last night I popped in 2 new 8 TB WD Reds (both verified via a badblocks run, no errors found) to replace 2 of my 4 TB drives (I am working through replacing the 4 TB drives with 8 TB ones), and this is the error I woke up to. Prior to the resilver last night, zpool status showed no errors. After the resilver, I am seeing:

  pool: pergamum
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 4.65T in 09:12:38 with 3 errors on Mon Sep  8 09:50:43 2025
config:

	NAME                                      STATE     READ WRITE CKSUM
	pergamum                                  ONLINE       0     0     0
	  raidz2-0                                ONLINE       0     0     0
	    ab0351e8-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     6
	    670dfb97-13fc-4611-bb0f-6680649d4089  ONLINE       0     0     0
	    8c42800d-40d9-432f-b918-bd4138714187  ONLINE       0     0     0
	    6ebdcf54-ac93-11ec-b2a3-279dd0c48793  ONLINE       0     0     6
	    72baec5e-e358-4bbe-a8b0-dd75494f725d  ONLINE       0     0     6
	    8a6e6dd2-465c-4311-b62e-cce797796faf  ONLINE       0     0    12
	    7a9b8d5e-a28d-11ee-aaf2-0002c95458ac  ONLINE       0     0     6
	    d9238765-4851-48c5-b3cc-1650c8de1364  ONLINE       0     0     0
	    d3a5a104-011f-4602-ab04-90149d8863e8  ONLINE       0     0     6
	    b1d949c1-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     6
	logs
	  d4c96b7f-9ca8-46ab-836a-ca387309ac56    ONLINE       0     0     0
	cache
	  8e380a80-b813-448b-9704-ed5689983c76    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x1b>

At this point, I am not entirely sure what to do. I run ECC RAM and an LSI SAS9305-16I in an HL15 case (so the drives are all plugged into a backplane; the SAS cables have not been physically touched or adjusted in months).

Any thoughts on how I should proceed? My instinct is to shut the system down, but I really don't want to do anything that could result in further damage.


I am not clear on what you mean here: do you have a single drive set up for metadata? If so, that is really not a good idea.

I believe that error means ZFS has detected permanent corruption in its own metadata, which it cannot repair because no good copy of that block remains to restore from.


Thanks for the reply Tom. I added an L2ARC device and set the ZFS property (if that's the correct terminology) secondarycache=metadata, basically the less risky, less performant alternative to a metadata special vdev. It only helps with read performance, but on the recommendation of many folks it seemed a good fit for my needs. And since it's only an L2ARC device, it isn't pool-critical and doesn't need to be redundant. All metadata is still written to the main pool.
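(For anyone searching later, that property is set like this; `pergamum` is my pool from the status output above, and child datasets inherit the setting unless overridden:)

```shell
# Cache only metadata (not file data) on the L2ARC device for this pool.
zfs set secondarycache=metadata pergamum

# Confirm the setting and where it is inherited from:
zfs get secondarycache pergamum
```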

After a few more hours of research and some investigation, it appears the culprit was an overheating HBA. Adding the SSDs to the array blocked more of the airflow from the HL15 fans, and even though I have a 120mm fan over the PCIe cards, the HBA was fairly hot to the touch even with no load on it. I have a script that ramps the fan speed as hard-drive and CPU temperatures climb, but it appears those stayed within reason while the HBA got fairly toasty. I turned my Noctuas up to 100%, re-ran a scrub overnight, and everything now appears happy: no new errors are being reported, and the permanent error is gone.
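(In case it helps anyone hitting the same thing: after fixing the cooling, a sketch of the clear-and-rescrub sequence, using my pool name from the status output above:)

```shell
# Reset the error counters and the logged permanent-error list.
zpool clear pergamum

# Run a full scrub; check progress afterwards with `zpool status -v`.
zpool scrub pergamum
```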
