Recover downed XCP-NG host

Hi.

After a lot of Googling I did not find (maybe due to my lack of Googling skills) any step-by-step guide for restoring an XCP-NG pool with a single host, where you have to reinstall XCP-NG and recover the storage repositories after the boot disk fails.

Setup:
Single 256G disk for XCP-NG boot/local storage (local storage not used for anything important)
Two RAID arrays handled by a RAID controller, each about 3TB of storage: one array with SSDs and one with HDDs.

Pool with single host.

I know it would be better to use, for example, a mirrored boot disk.
Even so, at first I thought it would be as simple as connecting the fresh install of XCP-NG to XOA and restoring from backup. That did not work, as the host got a new UUID. I found a forum post on the XCP-NG forum, but it just said to replace the backup UUID with the new one. (Restoring a downed host ISNT easy | XCP-ng and XO forum)
I did not manage to get this to work, so I started testing how I could get the SRs and the VMs back up and running. After a lot of trial and error, I got it working. There is probably an easier way, so I would appreciate any input.

This is based on you having a pool metadata backup, either one taken manually from the command line (the command below) or one taken by XOA:

xe pool-dump-database file-name=backup
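The one-off dump above only helps if it exists somewhere before the boot disk dies. One way to keep a recent copy around is a scheduled dump; this is a hypothetical sketch (the cron schedule, file paths, and file names are my own choices, not from the original post, and it writes the cron entry to /tmp purely for illustration rather than installing it in /etc/cron.d):

```shell
# Hypothetical: a cron entry that dumps the pool database nightly,
# so a recent copy can be synced off the boot disk. Written to /tmp
# here for illustration; on a real host it would go in /etc/cron.d.
cat > /tmp/pool-db-dump.cron <<'EOF'
0 2 * * * root xe pool-dump-database file-name=/var/backup/pool-db-backup
EOF
cat /tmp/pool-db-dump.cron
```

The dump file itself should of course be copied to storage that survives the boot disk, e.g. one of the RAID arrays or a remote machine.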
### Find the device paths and the volume groups for the RAID arrays
ll /dev/disk/by-id/ | grep sc
vgs
### Re-introduce the HDD SR, recreate its PBD and plug it
xe sr-introduce content-type=user name-label="Dell HDD" shared=false uuid=ee4d5892-da0a-FFFF-0283-9c339309ba69 type=lvm
xe pbd-create host-uuid=59fc5441-5835-405c-ba57-8ffd6bb7b78e sr-uuid=ee4d5892-da0a-FFFF-0283-9c339309ba69 device-config:device=/dev/disk/by-id/scsi-3644a842006a56e002cf0cdb71541FFFF
xe pbd-plug uuid=fec33099-8a25-7383-069e-9e0f6e6142e3 

### Same procedure for the SSD SR
xe sr-introduce content-type=user name-label="Dell SSD" shared=false uuid=ceafe196-ebef-FFFF-1ef7-913579eca68c type=lvm
xe pbd-create host-uuid=59fc5441-5835-405c-ba57-8ffd6bb7b78e sr-uuid=ceafe196-ebef-FFFF-1ef7-913579eca68c device-config:device=/dev/disk/by-id/scsi-3644a842006a56e002cf0d2b00aabFFFF
xe pbd-plug uuid=9fd88523-8794-b947-d22c-a5434f3c2006 
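If you do not have the old SR UUIDs written down, they can be recovered from the `vgs` output above: XCP-NG names each LVM SR's volume group `VG_XenStorage-<sr-uuid>`, so the UUID that `sr-introduce` needs is embedded in the VG name. A minimal sketch of pulling it out (the sample `vgs` output below is made up to match the UUIDs in this post, standing in for a live system):

```shell
# Sample `vgs` output; on a real host you would pipe `vgs` directly.
vgs_output='  VG                                                 #PV #LV #SN Attr   VSize VFree
  VG_XenStorage-ee4d5892-da0a-FFFF-0283-9c339309ba69   1  12   0 wz--n- 2.73t 1.02t'

# Strip the VG_XenStorage- prefix from the VG name to get the SR UUID.
sr_uuid=$(printf '%s\n' "$vgs_output" | awk '/VG_XenStorage-/ {sub(/^.*VG_XenStorage-/, "", $1); print $1}')
echo "$sr_uuid"
```

On a live host, `vgs --noheadings -o vg_name | awk ...` with the same pattern would list every SR UUID present on the attached disks.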

### Upload the backup file to the host. I uploaded the file from an XOA backup.
xe pool-restore-database file-name=data dry-run=true
xe pool-restore-database file-name=data --force

### Host will reboot

### The local storage SR for the new boot disk will not work,
### since the restored database references the old disk.
### We then need to forget it and re-attach the correct volume
xe pbd-list 
xe pbd-unplug uuid=2e59bc55-cfe1-e240-2c5f-22403622bb34 
xe sr-forget uuid=f617db0d-d15d-ca7f-efed-dfcf4b0058b1 

### Find the device path and the volume group for the new boot disk
ll /dev/disk/by-id/ | grep sdd
vgs
xe sr-introduce content-type=user name-label="Local Storage" shared=false uuid=35f12fc0-d581-FFFF-48a2-992096ce47be type=lvm
xe pbd-create host-uuid=59fc5441-5835-405c-ba57-8ffd6bb7b78e sr-uuid=35f12fc0-d581-FFFF-48a2-992096ce47be device-config:device=/dev/disk/by-id/wwn-0x5002538e4019bff9
xe pbd-plug uuid=009b255a-3309-3d4d-f9dd-476b75ea8fc2 
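After the last plug it is worth confirming that no PBD was left detached before declaring victory. The attachment state can be read from `xe pbd-list`; a sketch of filtering for detached PBDs (the sample output below is made up, with a fake UUID, standing in for `xe pbd-list params=uuid,currently-attached` on a live host):

```shell
# Sample output standing in for:
#   xe pbd-list params=uuid,currently-attached
# The second UUID is a made-up placeholder for a detached PBD.
pbd_list='uuid ( RO)              : fec33099-8a25-7383-069e-9e0f6e6142e3
    currently-attached ( RO): true

uuid ( RO)              : 00000000-1111-2222-3333-444444444444
    currently-attached ( RO): false'

# Remember each uuid line, print it whenever the following
# currently-attached line says false.
detached=$(printf '%s\n' "$pbd_list" | awk '/^uuid/ {u=$NF} /currently-attached/ && /false/ {print u}')
echo "$detached"
```

An empty result means every PBD is plugged; any UUID printed is an SR that still needs attention.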

Now all SRs and VMs are back up and running.

Hope this can help others that have a similar problem.


Thanks, this is a process I have been thinking about doing a write-up on, and a feature request to file, because as you noted it’s a bit of a pain. XO has a backup procedure, but restoring after a fresh install poses these issues and requires the steps above.

I had to reinstall XCP-NG a little while back and found that it was easier to install XOA from source and then do the restore in the GUI. Then once everything was restored, I went back and removed the orphaned VM used to do the restore.

It was not a fun experience but it worked.

A little update @LTS_Tom .

I am now confident this works. I had to split the HDD pool: I transferred everything over to the SSD pool, did a fresh install, and followed the steps. No data loss. :slight_smile:
I had to create a new pool for the HDDs, but that is just done the usual way.