Recover downed XCP-NG host

Hi.

After a lot of Googling I did not find (maybe for lack of Googling skills) any step-by-step guide for restoring an XCP-NG pool with one host, where you have to reinstall XCP-NG and recover the storage repositories after the boot disk failed.

Setup:
Single disk 256G for XCP-NG boot/Local Storage (Local storage not used for anything important)
Two RAID arrays handled by a RAID controller, each about 3 TB of storage: one array with SSDs and one with HDDs.

Pool with single host.

I know it would be better to use, for example, a mirrored boot disk.
Even so, I first thought it would be as simple as connecting the fresh install of XCP-NG to XOA and restoring from backup. That did not work, as the host got a new UUID. I found a post on the XCP-ng forum, but it just said to replace the backup UUID with the new one. (Restoring a downed host ISNT easy | XCP-ng and XO forum)
I did not manage to get that to work, so I started testing how I could get the SRs and the VMs back up and running. After a lot of trial and error, I got it working. There is probably an easier way, so I would appreciate any input.

This assumes you have a pool metadata backup, either taken manually from the command line or from an XOA backup:

### Dump the pool metadata (done while the old host was still running)
xe pool-dump-database file-name=backup
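If you want to keep periodic dumps around rather than a single file, a minimal sketch of a dated file name (the /root/pool-backups directory and naming scheme are my assumptions, not from the post):

```shell
# Build a dated file name for the dump; the directory is a hypothetical choice.
backup="/root/pool-backups/pool-db-$(date +%Y%m%d)"
echo "$backup"
# On the host you would then run the same command as above:
#   xe pool-dump-database file-name="$backup"
```

Dropping that in a cron job gives you a metadata dump per day to restore from.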
### On the fresh install, identify the RAID volumes and the volume groups on them
ll /dev/disk/by-id/ | grep sc
vgs

### Re-introduce the HDD SR under its original UUID, then create and plug a PBD for it
xe sr-introduce content-type=user name-label="Dell HDD" shared=false uuid=ee4d5892-da0a-FFFF-0283-9c339309ba69 type=lvm
xe pbd-create host-uuid=59fc5441-5835-405c-ba57-8ffd6bb7b78e sr-uuid=ee4d5892-da0a-FFFF-0283-9c339309ba69 device-config:device=/dev/disk/by-id/scsi-3644a842006a56e002cf0cdb71541FFFF
xe pbd-plug uuid=fec33099-8a25-7383-069e-9e0f6e6142e3

### Same steps for the SSD SR
xe sr-introduce content-type=user name-label="Dell SSD" shared=false uuid=ceafe196-ebef-FFFF-1ef7-913579eca68c type=lvm
xe pbd-create host-uuid=59fc5441-5835-405c-ba57-8ffd6bb7b78e sr-uuid=ceafe196-ebef-FFFF-1ef7-913579eca68c device-config:device=/dev/disk/by-id/scsi-3644a842006a56e002cf0d2b00aabFFFF
xe pbd-plug uuid=9fd88523-8794-b947-d22c-a5434f3c2006
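If you do not have the original SR UUIDs written down, you can usually recover them from the `vgs` output above: an LVM SR creates a volume group named `VG_XenStorage-<sr-uuid>`. A minimal sketch (the sample VG name stands in for a real line of `vgs --noheadings -o vg_name` output):

```shell
# Sample VG name standing in for one line of `vgs --noheadings -o vg_name` output.
vg_name="VG_XenStorage-ee4d5892-da0a-FFFF-0283-9c339309ba69"
# Strip the prefix to get the UUID to pass to `xe sr-introduce uuid=...`.
sr_uuid="${vg_name#VG_XenStorage-}"
echo "$sr_uuid"
# → ee4d5892-da0a-FFFF-0283-9c339309ba69
```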

### Upload the backup file. I uploaded the file from an XOA backup.
### Dry-run first, then restore for real
xe pool-restore-database file-name=data dry-run=true
xe pool-restore-database file-name=data --force

### Host will reboot

### The local storage SR for the old boot disk will not work
### We then need to detach it and re-attach the correct volume
xe pbd-list 
xe pbd-unplug uuid=2e59bc55-cfe1-e240-2c5f-22403622bb34 
xe sr-forget uuid=f617db0d-d15d-ca7f-efed-dfcf4b0058b1 

ll /dev/disk/by-id/ | grep sdd
vgs
xe sr-introduce content-type=user name-label="Local Storage" shared=false uuid=35f12fc0-d581-FFFF-48a2-992096ce47be type=lvm
xe pbd-create host-uuid=59fc5441-5835-405c-ba57-8ffd6bb7b78e sr-uuid=35f12fc0-d581-FFFF-48a2-992096ce47be device-config:device=/dev/disk/by-id/wwn-0x5002538e4019bff9
xe pbd-plug uuid=009b255a-3309-3d4d-f9dd-476b75ea8fc2 

Now all SRs and VMs are back up and running.

Hope this can help others that have a similar problem.


Thanks, this is a process I have been thinking about doing a write-up on, and filing a feature request to fix, because as you noted, it's a bit of a pain to do. XO has a backup procedure, but restoring after a fresh install poses these issues and requires the steps listed above.

I had to reinstall XCP-ng a little while back and found that it was easier to install XO from source and then do the restore in the GUI. Then once everything was restored I went back and removed the orphaned VM from the restore.

It was not a fun experience but it worked.

A little update, @LTS_Tom.

I am now confident this works. I had to split the HDD pool: I transferred everything over to the SSD pool, did a fresh install, and followed the steps. No data loss. :slight_smile:
I had to create a new pool for the HDD array, but that is just done the usual way.

We are also trying to document recovery of a single-host pool.
Lots of experience with VMware and KVM/QEMU, new to XCP-ng.
My test case assumes host mainboard failure. I've installed XCP-ng on a new host. It's different hardware: same socket and chipset, but that may be about it. Figure if a host dies four years after deployment, we'd prefer not to be searching eBay for an identical replacement.
Pool metadata restore failed with 'RESTORE_TARGET_MISSING_DEVICE(eth1)'. So I added a PCIe 2-port NIC and confirmed the host now has eth0, eth1, and eth2. The restore still fails with the same error. And quite possibly this is just the first of many?
These are fairly small deployments: single host, internal storage, 2 to 5 VMs, one to three XO users. Not impossible to rebuild from scratch, but there's got to be a better way. Looking forward to hearing more on the topic.
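One way to see which interfaces the restore expects is to grep the dump itself: the pool database dump is XML and records each PIF's device name. A minimal sketch, where the sample rows are hypothetical and the exact attribute layout may differ between versions:

```shell
# Two hypothetical rows standing in for PIF entries in a pool database dump.
printf '%s\n' '<row device="eth0" MAC="aa:bb:cc:dd:ee:00"/>' \
              '<row device="eth1" MAC="aa:bb:cc:dd:ee:01"/>' > /tmp/dump-sample.xml
# List the NIC device names the restore will look for on the new host.
grep -o 'device="eth[0-9]*"' /tmp/dump-sample.xml | sort -u
```

If the restore still fails with the device present, comparing the MAC addresses recorded in the dump against the new host's NICs might show why, but that part is speculation on my end.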

If it's a different host and you don't even have the same network interfaces, you can just reload the host clean, set up networking, and then import the VMs from backup.

Tom, thanks for the reply. Correct me if I misunderstand. I'm just having a hard time accepting a plan for host failure that involves a restore... I know you said "import", but if the OS and data volume contents are coming from backup media, it is potentially missing four to sixteen hours of new data and involves moving terabytes of data from one medium to another. That's a restore in my mind, and not something I'd willingly do when the original (up-to-date) OS and data volumes are perfectly healthy.
At the very least, and I've done this with KVM/QEMU migrations, I'd try creating an identical new VM on the new host, but direct its storage to the original virtual disk files.
I do understand that XCP-ng may be far better at making frequent backups than other platforms, and perhaps that makes a restore more palatable.
Another thought I had: if we were to use shared network storage for the VMs, would it be possible to configure a second host that we don't plan to use? That is, stage two hosts and shared storage, but only deploy one host and the storage, leaving the other on the shelf as a spare. And could it be the spare for multiple unrelated deployments, even if that means dedicated XCP-ng boot media for each deployment?

I don't understand how a total host failure would not need a restore.

If a host in a pool with other hosts fails, you can just remove it from the pool and restore the VMs that were on that host's local storage. If the VMs on that host were using shared pool storage, you would not need to restore them, just start them on a working host in the pool.

If a pool with a single host completely fails, just reload a new host with XCP-ng, load XO, connect to the backups, and restore the VMs.
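For the single-host case, the "restore the VMs" step above boils down to running `xe vm-import` over each exported .xva. A minimal sketch of the loop; the /tmp/xva-demo directory and file names are placeholders I made up, and the echo stands in for the real command:

```shell
# Stand-in exports so the loop has something to walk over.
mkdir -p /tmp/xva-demo
touch /tmp/xva-demo/app1.xva /tmp/xva-demo/db1.xva
# On a real host, the echoed line is the command to run: xe vm-import filename="$xva"
for xva in /tmp/xva-demo/*.xva; do
    echo "xe vm-import filename=$xva"
done
```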

Solved: It is possible to move an internal SR to a fresh XCP-ng install on different hardware and start those VMs on the new host.
But Why?
We understand both VMware and XCP-ng are designed around multi-host deployments on shared storage.
Small businesses, like Dentists, Optometrists, Chiropractors, other shops having two to 20 users, could run on a single bare-metal server. We choose to use a hypervisor for easier backup, migration, upgrade, recovery, and for the economy of being able to run additional VMs on the same host.
Our single-host hypervisor deployments all boot from a single SSD and use RAID1 internal storage for virtual machine storage. While failures are extremely rare, this leaves a few single points of failure, including the mainboard, CPU, RAM, hypervisor boot media, and RAID controller (if hardware RAID). On VMware and KVM/QEMU, we can pick up our internal guest storage hardware, move it to a new host on different hardware, and recover the guests w/o resorting to re-writing virtual disk files from backup.
In the case of a host hardware failure where the storage and virtual disks are found to be perfectly healthy, we find it ridiculous to resort to re-writing entire virtual disks from backups that may be two to sixteen hours old. If forced to do so, we'd probably replace the physical media with new (more cost and delay) so we retain the ability to recover something if necessary.
Solutions:
Citrix Knowledge Center article CTX120962 is brief but covers re-introducing the storage to a new host.
Citrix support article CTX136342 adds some important details about installing a replacement XCP-ng host, and about having saved a VM export from the original host while it was still running. CTX136342 is available to Citrix customers, but perhaps it can be found elsewhere. The document title is 'How to Reinstall a XenServer Host and Preserve Virtual Machines on the Local Storage'.
If your NICs are different you'll still need to reassign networking for VMs, but if running a single-host pool, you likely have only a few VMs to fix.
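Reassigning a VM's interface to a network on the new host is typically done by destroying the old VIF and creating a new one. A minimal sketch of the xe side, shown as echoed commands with placeholder UUIDs so it stays self-contained; on a real host you would get the actual UUIDs from `xe vif-list` and `xe network-list`:

```shell
# Placeholder UUIDs, not real ones.
vm_uuid="11111111-2222-3333-4444-555555555555"
net_uuid="66666666-7777-8888-9999-000000000000"
# The usual pattern: destroy the old VIF, then create one on the new network.
echo "xe vif-destroy uuid=<old-vif-uuid>"
echo "xe vif-create vm-uuid=$vm_uuid network-uuid=$net_uuid device=0"
```

With only a few VMs per host, doing this by hand per VIF is quick; the GUI in XO can do the same thing from each VM's Network tab.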