Fresh hardware build: TrueNAS Scale crashing after a few days

Hi there!

I’ve recently finished a new hardware build. Everything works fine, except that TrueNAS keeps crashing after running for a few days.

I’m running my boot-pool off a single NVMe drive (Transcend 128GB NVMe PCIe Gen3 x4 MTE110S M.2 SSD, TS128GMTE110S, if that matters).

HW build details:

  • Fractal Design Node 804 Black Window Aluminum/Steel Micro ATX Cube Computer Case ×1
  • ASRock Rack D1541D4U-2O8R Server Motherboard (Intel Xeon D-1541, SFP+, DDR4 ECC DIMM) ×1
  • WD Red Plus 12TB NAS Hard Disk Drive (7200 RPM Class, SATA 6Gb/s, CMR, 256MB Cache, 3.5 Inch) - WD120EFBX ×8
  • CORSAIR RMe Series RM750e 80 PLUS Gold Fully Modular Low-Noise ATX 3.0 and PCIe 5.0 Power Supply, Black ×1
  • Corsair Dual SSD Mounting Bracket (3.5" Internal Drive Bay to 2.5", Easy Installation), Black ×1
  • Corsair CP-8920186 Premium Individually Sleeved SATA Cable (for Corsair PSUs, 29.5 inches), Black ×1
  • SAMSUNG Electronics 870 EVO 2TB 2.5 Inch SATA III Internal SSD (MZ-77E2T0B/AM) ×2
  • Noctua NF-A14 iPPC-2000 PWM, Heavy Duty Cooling Fan, 4-Pin, 2000 RPM (140mm, Black) ×1
  • Transcend 128GB NVMe PCIe Gen3 x4 MTE110S M.2 SSD (TS128GMTE110S) ×1
  • Noctua NF-P12 redux-1700 PWM, High Performance Cooling Fan, 4-Pin, 1700 RPM (120mm, Grey) ×3

When a crash occurs, I see the following in the syslog:

2023-08-28T16:34:14.000-07:00 truenas zed[2303824]: eid=220 class=io_failure pool='boot-pool'
2023-08-28T16:34:14.000-07:00 truenas zed[2303822]: eid=219 class=io_failure pool='boot-pool'
2023-08-28T16:34:14.000-07:00 truenas kernel: WARNING: Pool 'boot-pool' has encountered an uncorrectable I/O failure and has been suspended.
2023-08-28T16:34:14.000-07:00 truenas zed[2303820]: eid=218 class=data pool='boot-pool' priority=0 err=6 flags=0x8881 bookmark=158:59988:0:0
2023-08-28T16:34:14.000-07:00 truenas zed[2303817]: eid=217 class=data pool='boot-pool' priority=0 err=6 flags=0x808881 bookmark=158:54802:0:0
2023-08-28T16:34:14.000-07:00 truenas kernel: WARNING: Pool 'boot-pool' has encountered an uncorrectable I/O failure and has been suspended.
2023-08-28T16:34:14.000-07:00 truenas kernel: zio pool=boot-pool vdev=/dev/nvme0n1p3 error=5 type=1 offset=11820916736 size=12288 flags=180880
2023-08-28T16:34:14.000-07:00 truenas kernel: nvme0n1: detected capacity change from 250069680 to 0
2023-08-28T16:34:14.000-07:00 truenas kernel: nvme nvme0: Removing after probe failure status: -19
2023-08-28T16:34:14.000-07:00 truenas kernel: nvme 0000:08:00.0: can't change power state from D3cold to D0 (config space inaccessible)
2023-08-28T16:34:14.000-07:00 truenas kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff

Is this just a sign of a buggy NVMe drive, or is there something else at play?
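For what it’s worth, when the box is wedged but still reachable over SSH, a couple of quick checks confirm whether the controller has actually fallen off the PCIe bus, rather than ZFS just suspending the pool (the PCI address below is taken from the log above; adjust as needed):

# Is the controller still visible at its PCI address (0000:08:00.0 per the log)?
lspci -s 08:00.0
# Most recent nvme driver messages
dmesg | grep -i nvme | tail -20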

I have two NVMe slots on the motherboard - one of them holds the boot device, the other currently houses an Intel Optane used as the SLOG for the spinning-rust pool (not shown in the pictures below; I haven’t had a chance to take updated ones yet).


Sounds like a buggy NVMe to me.

It’s possible your boot-pool drive has issues. You could try booting into a Linux live distro and testing the drive.
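Something along these lines from the live environment would be a reasonable start (smartmontools and nvme-cli assumed available; /dev/nvme0 is a placeholder):

# Controller identity, SMART health and error counters
smartctl -a /dev/nvme0
nvme smart-log /dev/nvme0
nvme error-log /dev/nvme0
# Kick off a short self-test, if the drive supports device self-tests
smartctl -t short /dev/nvme0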

If you have room to mount another drive, toss one in there and mirror the boot drive onto it. Then you can remove the NVMe drive and see if the system stays up longer.
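On TrueNAS the boot-pool screen in the UI can attach the second device for you (and takes care of the boot partitions); at the zpool level the mirror-then-remove dance is roughly this sketch, with placeholder device names:

# Identify the current boot-pool member
zpool status boot-pool
# Attach a second device to turn the single disk into a mirror
zpool attach boot-pool nvme0n1p3 /dev/sdb3
# Once the resilver finishes, drop the suspect NVMe out of the mirror
zpool detach boot-pool nvme0n1p3

Note that a raw zpool attach alone won’t copy the EFI/boot partitions, so lean on the UI if you want the new disk to actually be bootable.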

If it stays running, then you know something is wrong with that drive: it could be defective, it could be a driver-support problem, or it could just need a firmware update.
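For the firmware angle, checking the currently installed revision is quick if nvme-cli is around (/dev/nvme0 is again a placeholder):

# Model, serial and firmware revision of every NVMe device
nvme list
# Or just the model ("mn") and firmware ("fr") fields for one controller
nvme id-ctrl /dev/nvme0 | grep -E "^(mn|fr)"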

Thanks everyone for chiming in!

I do have space for an additional 3.5" drive - I can use it to host a new cage holding two SSDs for the boot pool, and the motherboard has four more SATA connectors available.

However, I want to get to the bottom of this. After some lurking in the interwebs, I found a similar issue reported against Linux 5.x kernels: [Kernel-packages] [Bug 1942624] Re: NVME "can't change power state from D3Cold to D0 (config space inaccessible)"

Not sure if the TrueNAS SCALE kernel (5.15.107+truenas) has this patch or not - if someone has an idea where the sources for the TrueNAS SCALE kernel live, please point me in the right direction.
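Assuming the tree turns up, checking for the fix should just be a search over the git history of the NVMe PCI driver; the grep terms below are guesses based on the bug title, not the actual patch subject:

# From a checkout of the kernel tree the build is based on
git log --oneline drivers/nvme/host/pci.c | grep -iE "d3cold|power state"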

If this is something power-saving related, I’d likely hit the same issue with a different brand of NVMe, so instead I adjusted the mode from AHCI to RAID for that NVMe slot and will continue monitoring.
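The usual suspect in that bug thread is NVMe APST (autonomous power state transitions), so I’ll also look at that angle. Here is a sketch of how to inspect it, plus the workaround commonly suggested there (untested on my box; /dev/nvme0 is a placeholder):

# Does the controller advertise APST, and is the feature enabled?
nvme id-ctrl /dev/nvme0 | grep -i apsta
nvme get-feature /dev/nvme0 -f 0x0c -H
# Commonly suggested workaround: disable APST via the kernel command line
# by appending this parameter to the boot options, then rebooting
nvme_core.default_ps_max_latency_us=0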
Here are the related BIOS settings I’m running with at the moment:


[Screenshot 2023-08-28 at 11.05.50 PM: BIOS power-management settings]

If you need to boot into Windows to apply new firmware, I’d suggest using this:
https://www.hirensbootcd.org/

Copy your drivers into the drivers folder and your firmware utilities into the root. I have SD cards set up like this in my HP DL360p G8 servers so I can deal with BIOS updates and other firmware tools. The card gets skipped at boot unless you explicitly tell the server to boot from that device. Worked out fine when I first got them.
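If you’re staging the card from a Linux box, the layout is just this sort of thing (mount point, partition name, and file names are all placeholders):

# Mount the Hiren's stick
mkdir -p /mnt/hirens
mount /dev/sdX1 /mnt/hirens
# Drivers into the drivers folder, firmware utilities into the root
cp -r ~/nic-drivers/* /mnt/hirens/drivers/
cp ~/firmware-update.exe /mnt/hirens/
umount /mnt/hirens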