XCP-NG Host Console Frozen - VMs Still Running

Hi, I have an XCP-NG 8.2.1 server whose host console has completely frozen up. All the VMs on the host are still running and can be accessed, but the management interface is frozen.

I can’t ping it and can’t SSH in. The IPMI console and a physical keyboard and monitor both show the same stuck screen asking for any key to be pressed to access the console, which of course doesn’t work. I can still toggle Num Lock and Caps Lock on the keyboard, though.

Xen Orchestra can’t connect, giving the error:
connect EHOSTUNREACH XXX.XXX.XXX.XXX:443

My plan is to remote into each running VM, shut them down one by one, and then reboot the server, but I wanted to

  1. make sure that this isn’t the type of error that lights a RAID array on fire during a reboot and

  2. check that I’m not missing an obvious solution.
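
For the Linux guests that just means the usual in-guest shutdown, e.g.

sudo shutdown -h now

and the Windows guests get shut down from inside the OS, after which I’d power-cycle the host from the IPMI.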

Editing in the “solution” for posterity: the mdadm --assemble command in my reply further down is what finally fixed the array.

Weird issue; that process sounds good. Shut down the running VMs and reboot the host.

How many Ethernet connections are you using? Is it possible that you have the VMs on one set of NICs and management on another set, and that the management network or NICs are down?

Well, it turned out that one of the disks in my RAID 5 local storage got a new name (sdf instead of sdd) and now the RAID won’t work.

What’s my best move from here?
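
I’m assuming the first step is to see what the kernel actually thinks is there before I touch anything, something along the lines of:

cat /proc/mdstat
mdadm --detail /dev/md0
mdadm --examine /dev/sda /dev/sdb /dev/sdc /dev/sdf

but I don’t want to make it worse.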

You might get lucky: remove any drives that have recently been added to the host and it may go back.

Do you know whether it’s possible to tell XCP-ng to use the new disk ID for the same array?
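
From what I’ve read, mdadm identifies array members by the UUID in their superblocks rather than by device name, so in theory a scan-based assemble shouldn’t care that sdd became sdf (I haven’t tried this yet):

mdadm --examine --scan
mdadm --assemble --scan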

I also noticed that it is showing up as an inactive RAID 0 array rather than a degraded RAID 5 array. Does that give anyone any clues?

I think I need more information. Did you create a RAID with mdadm or did you create a ZFS pool?

I created an mdadm RAID array, RAID 5 with 4 disks. Now the command mdadm -D /dev/md0 is showing a 3-device RAID 0 in an inactive state.
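
If nobody has a better idea, my understanding is that the usual next step is to stop the half-assembled array and try assembling it again from the actual member disks, roughly:

mdadm --stop /dev/md0
mdadm --assemble --verbose /dev/md0 /dev/sda /dev/sdb /dev/sdc /dev/sdf

Happy to be corrected before I run anything I can’t undo.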

I don’t know much about mdadm, but this looks like what you are going through. This might help:

software raid - MDADM says can't assemble RAID5 because missing disks but all disks are there - Super User

Well, I’m not sure what caused it or why, but asking XCP-ng to reassemble the array, listing the drive previously at /dev/sdd (now /dev/sdf) as the last member, with the command:

mdadm --assemble --verbose /dev/md0 /dev/sda /dev/sdb /dev/sdc /dev/sdf

made it happy and the array is rebuilt.
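
For anyone hitting this later: a quick cat /proc/mdstat will show whether the RAID 5 is resyncing or clean, and it’s probably also worth writing the array definition into the config so it gets found by UUID rather than device name on the next boot, e.g. (untested on my box; /etc/mdadm.conf is the usual path on the CentOS-style dom0 XCP-ng uses):

mdadm --detail --scan >> /etc/mdadm.conf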

This is the thanks I get for making the local storage repo a RAID 5 instead of the default RAID 0. Remember, kids: backups, backups, backups, BACKUPS, backups.

Build a TrueNAS machine and run NFS for your VMs?
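
Once the NFS export exists, pointing XCP-ng at it is basically a one-liner from the host CLI (server name and path below are placeholders; the New SR dialog in Xen Orchestra does the same thing):

xe sr-create type=nfs content-type=user shared=true name-label="TrueNAS NFS" device-config:server=truenas.example.lan device-config:serverpath=/mnt/tank/vm-storage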

What would happen if you built a RAID 5 in hardware and just exposed a volume to XCP-NG local storage? That might be an answer too.

Have to agree with Greg_E, hardware RAID is probably the best solution for XCP-NG with local drives.

Seems to be the way to go. I like having the OS on local storage and data on NFS, and I’ve never been an enormous fan of hardware RAID, but if that’s the best practice I’m willing to do that instead.

Have a look at this thread - Suggestions for new servers | XCP-ng and XO forum

Hardware RAID is the way to go.

If the server has a management card (iDRAC, iLO, or IPMI) you can get alerts on issues with the RAID / hard drives.

TrueNAS is totally different; it can handle the ZFS RAID configurations.

I recommend you watch these. Hardware RAID isn’t a good idea nowadays.