Weird issues with MS-01 and XCP-ng - scratching my head

So I’ve got 3 MS-01 servers (firmware 1.22) running in a single XCP-ng pool with storage provided by a TrueNAS device. I’ve finally (after months) got time to resolve issues, and the resolution isn’t going as I expected.

Problems I’ve been seeing

  • I get random reboots of servers every few weeks or months. All the servers are at risk of this.
  • All were initially installed with md RAID mirrors of Samsung 990 Pro NVMe drives (all with firmware 4B2QJXD7), but all are degraded because nvme1n1 disappeared on all the servers. It’s just not seen.
  • I’ve needed to upgrade XCP-ng anyway

The Plan

  • Fix the degraded RAID first. Odd that all 3 identical drives failed in the same slot, but oh well.
  • Upgrade firmware on the MS-01 servers
  • Upgrade XCP-ng

What Happened

  • I brought down the first machine, replaced the failed NVMe, rebooted, and saw the new NVMe in the list of devices. Added it to md127 and watched it rebuild until about 50% complete. Temperature never got above 45 °C.
  • At that point the rebuild failed, and nvme1n1 again disappeared.
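For anyone wanting to track this kind of failure across several hosts, here is a rough sketch that reads /proc/mdstat and reports arrays that are missing members. It is only an illustration: the sample text is made up to resemble a degraded two-disk mirror (the partition name nvme0n1p1 is an assumption, not taken from these servers), and the parser assumes arrays are named like md127.

```python
import os
import re


def degraded_arrays(mdstat_text):
    """Return {array: (expected_members, active_members)} for degraded arrays."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        # Array header line, e.g. "md127 : active raid1 nvme0n1p1[0]"
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status line carries "[expected/active]", e.g. "[2/1] [U_]"
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if current and counts:
            expected, active = int(counts.group(1)), int(counts.group(2))
            if active < expected:
                result[current] = (expected, active)
            current = None
    return result


# Hypothetical sample: a mirror with one of two members missing.
SAMPLE = """\
Personalities : [raid1]
md127 : active raid1 nvme0n1p1[0]
      976630464 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    path = "/proc/mdstat"
    if os.path.exists(path):
        with open(path) as fh:
            text = fh.read()
    else:
        text = SAMPLE  # fall back to the sample on machines without md
    print(degraded_arrays(text))
```

Run from cron on each host, something like this would at least timestamp when the second drive drops out, which could then be lined up against the random reboots.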

That means it’s probably firmware, right?

Firmware upgrade

  • Upgraded to the latest 1.27 firmware, and when booting XCP-ng got an error a few minutes into the boot. I’ll detail this below.
  • Reverted to 1.26, and it still won’t boot, with the same error.
  • Back to firmware 1.27, and after toggling a few settings, hoping that might make a difference, I gave up and came here.

Current Error Message

Booting XCP-ng in safe mode, the logs look like:

[ OK ] Mounting Configuration File System
[ OK ] Started Show Plymouth Boot Screen
[ OK ] Reached Target Paths
[ OK ] Reached Target Basic System
[ 144 ] Mounted Configuration File System
[144.267384 ] dracut-initqueue[294]: Warning: dracut-initqueue timeout - starting timeout scripts

That last one repeats lots of times. Eventually I get this and it dumps me to a prompt where the keyboard doesn’t respond:

[ 205.134740 ] dracut-initqueue[407]: Warning: Could not boot.
[ 205.153475 ] dracut-initqueue[407]: Warning: /dev/disk/by-label/root-phrnjf does not exist
Starting Dracut Emergency Shell
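The key line is the one saying the root device does not exist: dracut waited for a filesystem with that label and never found it. As a small sketch (the function name is just for illustration), this pulls the missing device paths out of a captured boot log so they can be compared against the labels the disks actually carry:

```python
import re


def missing_devices(log_text):
    """Return device paths that dracut reported as missing, in order seen."""
    pattern = re.compile(r"Warning:\s+(/dev/\S+) does not exist")
    found = []
    for match in pattern.finditer(log_text):
        path = match.group(1)
        if path not in found:  # the warning can repeat; keep one copy
            found.append(path)
    return found


# The two lines quoted from the failed boot above.
LOG = """\
[ 205.134740 ] dracut-initqueue[407]: Warning: Could not boot.
[ 205.153475 ] dracut-initqueue[407]: Warning: /dev/disk/by-label/root-phrnjf does not exist
"""

if __name__ == "__main__":
    for path in missing_devices(LOG):
        print(path)
```

The label here is a property of the filesystem on the disk, so a missing by-label path usually points at the disk or array not being visible at all in the initramfs, rather than at the label itself having changed.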

What to do next?

All that changed from a booting system to one that wouldn’t was the firmware update, so I figure there’s gotta be some setting in there that’s toggled different that I can’t find. Any ideas?

If not, does it make more sense to just do a fresh install of the newest XCP-ng and start a new pool, then remove old servers from the old pool using the same process and migrate them over as well?

I’m really at a loss here.

I’d consider doing an upgrade to the newest XCP-ng on the affected server, hoping it would take the old settings and make the system bootable, but I know that’s the wrong way to upgrade XenServer, and I have no idea what additional problems that would cause between powering it up and then promoting it to pool master. If I could even get that far.

I doubt a screenshot helps, but here’s what I see in safe mode when the problem starts:

Have you tested these in another system? Maybe they are not compatible.

Maybe the MS-01s just don’t like Samsung 990 Pros.

Replacing with something different is likely wise.

There are more than a few reports of problems with the Minisforum machines in threads on the ServeTheHome forums. Not enough to be consistent, but enough for me to stay away. I think you are going to need to work with the manufacturer to find the problem and get it fixed.

Or get them working, sell them, and buy something else, like a Lenovo Tiny with a similar processor.
