So I’ve got 3 MS-01 servers (firmware 1.22) running in a single XCP-ng pool with storage provided by a TrueNAS device. I’ve finally (after months) got time to resolve issues, and the resolution isn’t going as I expected.
Problems I’ve been seeing
- I get random reboots of servers every few weeks or months. All the servers are at risk of this.
- All were initially installed with md RAID mirrors of Samsung 990 Pro NVMe drives (all with firmware 4B2QJXD7), but all are degraded because nvme1n1 disappeared on all the servers. It’s just not seen.
- I’ve needed to upgrade XCP-ng anyway
The Plan
- Fix the degraded RAID first. Odd that all 3 identical drives failed in the same slot, but oh well.
- Upgrade firmware on the MS-01 servers
- Upgrade XCP-ng
What Happened
- I brought down the first machine, replaced the failed NVMe, rebooted and saw the new NVMe in the list of devices. Added it to md127 and watched it rebuild until about 50% complete. Temperature never got above 45 °C.
- At that point the rebuild failed, and nvme1n1 again disappeared.
That means it’s probably firmware, right?
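For anyone wanting to watch for the same failure, the degraded state and rebuild progress described above can be read out of /proc/mdstat. Here is a minimal Python sketch, assuming the usual mdstat layout. The sample text is illustrative and was not captured from these servers, and `parse_mdstat` is a hypothetical helper name, not part of any tool.

```python
import re

# Illustrative sample only (an assumption, not captured from the servers in this post).
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md127 : active raid1 nvme1n1[2] nvme0n1[0]
      1000072192 blocks super 1.2 [2/1] [U_]
      [==========>..........]  recovery = 50.1% (501036096/1000072192) finish=40.2min speed=206000K/sec

unused devices: <none>
"""


def parse_mdstat(text):
    """Return a dict of array name -> members, device counts, and rebuild progress."""
    arrays = {}
    current = None
    for line in text.splitlines():
        # An array stanza starts with a line such as "md127 : active raid1 ...".
        header = re.match(r"^(md\w+)\s*:\s*(.*)$", line)
        if header:
            current = {
                # Member names appear as "nvme0n1[0]"; keep only the device name.
                "members": re.findall(r"(\w+)\[\d+\]", header.group(2)),
                "expected": None,
                "active": None,
                "degraded": False,
                "recovery_pct": None,
            }
            arrays[header.group(1)] = current
            continue
        if current is None:
            continue
        if not line.strip():
            # A blank line ends the current stanza.
            current = None
            continue
        # "[2/1] [U_]" means 2 devices expected, 1 currently active.
        counts = re.search(r"\[(\d+)/(\d+)\]\s+\[[U_]+\]", line)
        if counts:
            current["expected"] = int(counts.group(1))
            current["active"] = int(counts.group(2))
            current["degraded"] = current["active"] < current["expected"]
        # Rebuild progress shows up as "recovery = 50.1%" (or "resync = ...").
        progress = re.search(r"(?:recovery|resync)\s*=\s*([\d.]+)%", line)
        if progress:
            current["recovery_pct"] = float(progress.group(1))
    return arrays


for name, info in parse_mdstat(SAMPLE_MDSTAT).items():
    print(name, info)
```

On a real host the same function could be fed the contents of /proc/mdstat instead of the sample. A member that drops out mid-rebuild would show up as a shorter member list, with the array still flagged degraded.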
Firmware upgrade
- Upgraded to the latest 1.27 firmware, and when booting XCP-ng got an error a few minutes into the boot. I’ll detail this below
- Reverted to 1.26, and it still won’t boot, failing with the same error.
- Back to firmware 1.27, and after toggling a few settings hoping that might make a difference I gave up and came here.
Current Error Message
Booting XCP-ng in safe mode, the logs look like:
[ OK ] Mounting Configuration File System
[ OK ] Started Show Plymouth Boot Screen
[ OK ] Reached Target Paths
[ OK ] Reached Target Basic System
[ 144 ] Mounted Configuration File System
[144.267384 ] dracut-initqueue[294]: Warning: dracut-initqueue timeout - starting timeout scripts
That last one repeats lots of times. Eventually I get this and it dumps me to a prompt where the keyboard doesn’t respond:
[ 205.134740 ] dracut-initqueue[407]: Warning: Could not boot.
[ 205.153475 ] dracut-initqueue[407]: Warning: /dev/disk/by-label/root-phrnjf does not exist
Starting Dracut Emergency Shell
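Since the emergency shell is unusable here, one way to narrow this down would be from rescue media: check whether the label dracut is waiting for is visible at all. This is a rough Python sketch, assuming the rescue environment has Python and populates /dev/disk/by-label. `check_label` is a hypothetical helper, and the label is the one from the messages above.

```python
import os


def check_label(label, by_label_dir="/dev/disk/by-label"):
    """Return (found, resolved_target, available_labels) for a filesystem label."""
    try:
        available = sorted(os.listdir(by_label_dir))
    except FileNotFoundError:
        # The directory is missing when no labelled filesystems were detected.
        return False, None, []
    if label not in available:
        return False, None, available
    # Entries are symlinks to the real block device, e.g. an md or nvme node.
    target = os.path.realpath(os.path.join(by_label_dir, label))
    return True, target, available


found, target, available = check_label("root-phrnjf")
if found:
    print("label resolves to", target)
else:
    print("label not found; labels present:", available)
```

If the label is missing but the NVMe devices are listed, the array holding the root filesystem is probably not being assembled. If the devices themselves are absent, the problem sits below the OS, in firmware settings or hardware.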
What to do next?
All that changed between a system that booted and one that won’t was the firmware update, so I figure there’s gotta be some setting in there that’s toggled differently that I can’t find. Any ideas?
If not, does it make more sense to just do a fresh install of the newest XCP-ng and start a new pool, then remove old servers from the old pool using the same process and migrate them over as well?
I’m really at a loss here.
I’d consider doing an upgrade to the newest XCP-ng on the affected server, hoping it would take the old settings and make the system bootable, but I know that’s the wrong way to upgrade XenServer, and I have no idea what additional problems that would cause between powering it up and promoting it to be pool master. If I could even get that far.
I doubt a screenshot helps, but here’s what I see in safe mode when the problem starts: