Recover downed XCP-NG host

Hi.

After a lot of Googling I did not find (maybe for lack of Googling skills) any step-by-step guide for restoring an XCP-NG pool with one host, where you have to reinstall XCP-NG and recover the storage repositories after the boot disk failed.

Setup:
Single 256 GB disk for XCP-NG boot/local storage (the local storage is not used for anything important).
Two RAID arrays handled by a RAID controller, each about 3 TB of storage: one array with SSDs and one with HDDs.

Pool with single host.

I know it would be better to use, for example, a mirrored boot disk.
Even so, at first I thought it would simply be a matter of connecting the fresh install of XCP-NG to XOA and restoring from backup. That did not work, as the host got a new UUID. I found a post on the XCP-NG forum, but it just said to replace the backup UUID with the new one. (Restoring a downed host ISNT easy | XCP-ng and XO forum)
I did not manage to get that to work, so I started testing how I could get the SRs and the VMs back up and running. After a lot of trial and error, I got it working. There is probably an easier way, so I would appreciate any input.

This assumes you have a pool database backup, either taken manually from the command line (the xe command below) or exported from an XOA backup.

xe pool-dump-database file-name=backup
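A hedged sketch of taking that manual dump with a date-stamped name and getting it off the boot disk (the remote mount point is an assumption, adjust to your setup):

```shell
# Sketch only: build a date-stamped dump name and copy it somewhere that
# survives a boot-disk failure. /mnt/backup-nfs is a hypothetical mount.
backup="pool-db-$(date +%F).dump"
echo "Dumping pool database to /root/${backup}"
# On the host:
# xe pool-dump-database file-name="/root/${backup}"
# cp "/root/${backup}" /mnt/backup-nfs/
```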
### Find the disk device paths and confirm the volume groups are visible
ll /dev/disk/by-id/ | grep sc
vgs
xe sr-introduce content-type=user name-label="Dell HDD" shared=false uuid=ee4d5892-da0a-FFFF-0283-9c339309ba69 type=lvm
xe pbd-create host-uuid=59fc5441-5835-405c-ba57-8ffd6bb7b78e sr-uuid=ee4d5892-da0a-FFFF-0283-9c339309ba69 device-config:device=/dev/disk/by-id/scsi-3644a842006a56e002cf0cdb71541FFFF
xe pbd-plug uuid=fec33099-8a25-7383-069e-9e0f6e6142e3 

xe sr-introduce content-type=user name-label="Dell SSD" shared=false uuid=ceafe196-ebef-FFFF-1ef7-913579eca68c type=lvm
xe pbd-create host-uuid=59fc5441-5835-405c-ba57-8ffd6bb7b78e sr-uuid=ceafe196-ebef-FFFF-1ef7-913579eca68c device-config:device=/dev/disk/by-id/scsi-3644a842006a56e002cf0d2b00aabFFFF
xe pbd-plug uuid=9fd88523-8794-b947-d22c-a5434f3c2006 
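The SR UUIDs above come from the vgs output: an LVM SR names its volume group VG_XenStorage-&lt;sr-uuid&gt;, so you can read the UUID straight off the VG name. A small sketch (the sample VG name is the one from this post):

```shell
# LVM SRs name their volume group VG_XenStorage-<sr-uuid>; strip the prefix
# to recover the SR UUID to pass to sr-introduce.
vg_name="VG_XenStorage-ee4d5892-da0a-FFFF-0283-9c339309ba69"  # from `vgs`
sr_uuid="${vg_name#VG_XenStorage-}"
echo "$sr_uuid"
# On the host, to list all SR UUIDs at once:
# vgs --noheadings -o vg_name | sed -n 's/^ *VG_XenStorage-//p'
```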

### Upload the backup file to the host. I uploaded the file from an XOA backup.
xe pool-restore-database file-name=data dry-run=true
xe pool-restore-database file-name=data --force

### Host will reboot

### The local storage SR for the old boot disk will not work
### We then need to detach it and reattach the correct volume
xe pbd-list 
xe pbd-unplug uuid=2e59bc55-cfe1-e240-2c5f-22403622bb34 
xe sr-forget uuid=f617db0d-d15d-ca7f-efed-dfcf4b0058b1 

### Find the by-id path of the local storage disk
ll /dev/disk/by-id/ | grep sdd
vgs
xe sr-introduce content-type=user name-label="Local Storage" shared=false uuid=35f12fc0-d581-FFFF-48a2-992096ce47be type=lvm
xe pbd-create host-uuid=59fc5441-5835-405c-ba57-8ffd6bb7b78e sr-uuid=35f12fc0-d581-FFFF-48a2-992096ce47be device-config:device=/dev/disk/by-id/wwn-0x5002538e4019bff9
xe pbd-plug uuid=009b255a-3309-3d4d-f9dd-476b75ea8fc2 
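To confirm all the PBDs actually attached, a hedged check; the sample text below is just a stand-in for real `xe pbd-list` output:

```shell
# Count unattached PBDs. The sample variable mimics what
# `xe pbd-list params=currently-attached` prints; on the host, pipe the real
# xe command into the same grep instead.
sample='currently-attached ( RO): true
currently-attached ( RO): true
currently-attached ( RO): true'
unattached=$(printf '%s\n' "$sample" | grep -c 'false' || true)  # || true: grep exits 1 on zero matches
echo "PBDs not attached: ${unattached}"
```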

Now all SRs and VMs are back up and running.

Hope this can help others that have a similar problem.


Thanks, this is a process I have been thinking about doing a write-up on, and a feature request to fix, because as you noted it's a bit of a pain to do. XO has a backup procedure, but restoring after a fresh install poses issues and requires the steps outlined above.

I had to reinstall XCP-ng a little while back and found that it was easier to install XO from source and then do the restore in the GUI. Once everything was restored, I went back and removed the orphaned VM used to do the restore.

It was not a fun experience but it worked.

A little update @LTS_Tom .

I am now confident this works. I had to split the HDD pool: I transferred everything over to the SSD pool, did a fresh install, and followed the steps. No data loss. :slight_smile:
I had to create a new storage pool for the HDDs, but that is just done the usual way.

We are also trying to document recovery of a single-host pool.
Lots of experience with VMware and KVM/QEMU, new to XCP-ng.
My test case assumes host mainboard failure. I've installed XCP-ng on a new host. It's different hardware: same socket and chipset, but that may be about it. We figure if a host dies four years after deployment, we'd prefer not to be searching eBay for an identical replacement.
Pool metadata restore failed with 'RESTORE_TARGET_MISSING_DEVICE(eth1)'. So I added a PCIe 2-port NIC and confirmed the host now has eth0, eth1 and eth2. The restore still fails with the same error. And quite possibly this is just the first of many?
These are fairly small deployments: single host, internal storage, 2 - 5 VMs, one to three XO users. Not impossible to rebuild from scratch, but there's got to be a better way. Looking forward to hearing more on the topic.

If it’s a different host and you don’t even have the same network interfaces you can just reload the host clean, setup networking, and then import the VM’s from backup.

Tom, thanks for the reply. Correct me if I misunderstand. I'm just having a hard time accepting a plan for host failure that involves a restore. I know you said "import", but if the OS and data volume contents are coming from backup media, it is potentially missing four to sixteen hours of new data and involves moving terabytes of data from one medium to another; that's a restore in my mind, and not something I'd willingly do when the original (up-to-date) OS and data volumes are perfectly healthy.
At the very least, and I've done this with KVM/QEMU migrations, I'd try creating an identical new VM on the new host, but direct its storage to the original virtual disk files.
I do understand that XCP-ng may be far better at making frequent backups than other platforms, and perhaps that makes a restore more palatable.
Another thought I had: if we were to use shared network storage for the VMs, would it be possible to configure a second host that we don't plan to use? That is, stage two hosts and shared storage, but only deploy one host and the storage, leaving the other on the shelf as a spare. And could it be the spare for multiple unrelated deployments, even if that means dedicated XCP-ng boot media for each deployment?

I don’t understand how a total host failure would not need a restore.

If a host in a pool with other hosts fails, you can just remove it from the pool and restore the VMs that were on that host's local storage. If the VMs on that host were using shared pool storage, then you would not need to restore them; just start them on a working host in the pool.

If a pool with a single host completely fails, just reload a new host with XCP-ng, load XO, connect to the backups, and restore the VMs.

Solved: It is possible to move an internal SR to a fresh XCP-ng install on different hardware and start those VMs on the new host.
But Why?
We understand both VMware and XCP-ng are designed around multi-host deployments on shared storage.
Small businesses, like Dentists, Optometrists, Chiropractors, other shops having two to 20 users, could run on a single bare-metal server. We choose to use a hypervisor for easier backup, migration, upgrade, recovery, and for the economy of being able to run additional VMs on the same host.
Our single host hypervisor deployments all boot from a single SSD and use RAID1 internal storage for virtual machine storage. While extremely rare, this leaves a few single points of failure including: Mainboard, CPU, RAM, hypervisor boot media, RAID Controller (if hardware RAID). On VMware and KVM/QEMU, we can pick up our internal guest storage hardware, move it to a new host on different hardware, and recover the guests w/o resorting to re-writing virtual disk files from backup.
In the case of a host hardware failure, where the storage and virtual disks are found to be perfectly healthy, we find it ridiculous to resort to re-writing entire virtual disks from backups that may be two to sixteen hours old. If forced to do so, we'd probably replace the physical media with new ones (more cost and delay) so we retain the ability to recover something if necessary.
Solutions:
Citrix Knowledge Center article CTX120962 is brief but covers re-introducing the storage to a new host.
Citrix support article CTX136342 adds some important details about installing a replacement XCP-ng host, and about having saved a VM export from the original host while it was still running. CTX136342 is available to Citrix customers, but perhaps it can be found elsewhere. The document title is 'How to Reinstall a XenServer Host and Preserve Virtual Machines on the Local Storage'.
If your NICs are different you’ll still need to reassign networking for VMs, but if running a single host pool, you likely have only a few VMs to fix.
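For that NIC fix-up, a hedged sketch of repointing a VM's VIF at a network on the new host; both UUIDs below are placeholders, and the xe sequence is commented because it is destructive:

```shell
# Placeholder UUIDs; read the real ones from `xe vm-list` / `xe network-list`.
vm_uuid="11111111-2222-3333-4444-555555555555"
net_uuid="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
echo "vif-create vm-uuid=${vm_uuid} network-uuid=${net_uuid} device=0"
# On the host:
# xe vif-list vm-uuid="$vm_uuid"            # note the old VIF's device number
# xe vif-destroy uuid=<old-vif-uuid>
# xe vif-create vm-uuid="$vm_uuid" network-uuid="$net_uuid" device=0 mac=random
```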

Hi @LTS_Tom, I am probably missing something here, but I am not sure what you mean by "import the VM's from backup".

I have my VMs backed up to my QNAP using the backup option in Xen Orchestra. The Import VM option imports OVA and XVA files, which the backups are not.

As I now have a new host, I can no longer see any of my backups via the Xen dashboard so cannot use that restore option.

As explained in my XCP-ng Backups video above, you set up the "Remotes" and then you can use the restore option.


I'm new to the forums and XCP-NG and have a similar issue to the OP. The SSD in my host is failing and I'll be wanting to install a replacement.

I have two VMs that are located on my NAS via NFS (with backups), so I'm assuming the SSD only has the files needed by XCP-NG? If I do a metadata backup from the host configuration menu (to the NAS) before replacing the SSD, will I be able to simply restore the metadata backup (after re-creating the NFS SR) to get things working again?

Yes, that should work. The only challenge with restoring is that the local storage mounts will be different. I would still make sure you have a proper backup of the VMs as well.
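For the NFS part, a hedged sketch of re-attaching the NAS export on the fresh install before restoring the metadata; the server address and export path are placeholders for your own NAS:

```shell
# Placeholders for the NAS; real values come from your NFS export settings.
nfs_server="192.168.1.50"
nfs_path="/export/xcp-sr"
echo "Attaching NFS SR ${nfs_server}:${nfs_path}"
# On the host:
# xe sr-create type=nfs shared=true content-type=user name-label="NAS NFS" \
#   device-config:server="$nfs_server" device-config:serverpath="$nfs_path"
```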
