After a lot of Googling I did not find (maybe due to my lack of Googling skills) any step-by-step guide for restoring an XCP-ng pool with one host, where you have to reinstall XCP-ng and recover the storage pools after the boot disk fails.
Setup:
Single disk 256G for XCP-NG boot/Local Storage (Local storage not used for anything important)
Two RAID arrays handled by a RAID controller, each about 3 TB of storage: one array with SSDs and one with HDDs.
Pool with single host.
I know it would be better to use, for example, a mirrored boot disk.
Even so, at first I thought it would just be a matter of connecting the fresh install of XCP-ng to XOA and restoring from backup. That did not work, as the host got a new UUID. I found a forum post on the XCP-ng forum, but it just said to replace the backup UUID with the new one. (Restoring a downed host ISNT easy | XCP-ng and XO forum)
I did not manage to get this to work, so I started testing how I could get the SRs and the VMs back up and running. After a lot of trial and error, I got it working. There is probably an easier way, so I would appreciate any input.
This assumes that you have a pool backup, either made manually from the command line or taken from an XOA backup:
xe pool-dump-database file-name=backup
### Identify the RAID array device paths and their volume groups
ll /dev/disk/by-id/ | grep sc
vgs
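For reference, XCP-ng names each LVM SR's volume group `VG_XenStorage-<SR-UUID>`, so the SR UUIDs used in the commands below can be read straight off the `vgs` output. A minimal sketch, using the HDD SR's UUID from this post as a sample string:

```shell
# XCP-ng names each LVM SR's volume group VG_XenStorage-<SR-UUID>,
# so the SR UUID can be stripped out of the VG name shown by vgs.
vg_name="VG_XenStorage-ee4d5892-da0a-FFFF-0283-9c339309ba69"  # sample vgs entry
sr_uuid="${vg_name#VG_XenStorage-}"
echo "$sr_uuid"  # -> ee4d5892-da0a-FFFF-0283-9c339309ba69
```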
### Re-introduce the HDD SR and attach it to this host
xe sr-introduce content-type=user name-label="Dell HDD" shared=false uuid=ee4d5892-da0a-FFFF-0283-9c339309ba69 type=lvm
xe pbd-create host-uuid=59fc5441-5835-405c-ba57-8ffd6bb7b78e sr-uuid=ee4d5892-da0a-FFFF-0283-9c339309ba69 device-config:device=/dev/disk/by-id/scsi-3644a842006a56e002cf0cdb71541FFFF
xe pbd-plug uuid=fec33099-8a25-7383-069e-9e0f6e6142e3
### Re-introduce the SSD SR and attach it to this host
xe sr-introduce content-type=user name-label="Dell SSD" shared=false uuid=ceafe196-ebef-FFFF-1ef7-913579eca68c type=lvm
xe pbd-create host-uuid=59fc5441-5835-405c-ba57-8ffd6bb7b78e sr-uuid=ceafe196-ebef-FFFF-1ef7-913579eca68c device-config:device=/dev/disk/by-id/scsi-3644a842006a56e002cf0d2b00aabFFFF
xe pbd-plug uuid=9fd88523-8794-b947-d22c-a5434f3c2006
### Upload the backup file. I uploaded the file from an XOA backup.
xe pool-restore-database file-name=data dry-run=true
xe pool-restore-database file-name=data --force
### Host will reboot
### The local storage for the boot disk will not work
### We then need to detach it and reattach the correct volume
xe pbd-list
xe pbd-unplug uuid=2e59bc55-cfe1-e240-2c5f-22403622bb34
xe sr-forget uuid=f617db0d-d15d-ca7f-efed-dfcf4b0058b1
ll /dev/disk/by-id/ | grep sdd
vgs
### Re-introduce the local storage SR on the new boot disk
xe sr-introduce content-type=user name-label="Local Storage" shared=false uuid=35f12fc0-d581-FFFF-48a2-992096ce47be type=lvm
xe pbd-create host-uuid=59fc5441-5835-405c-ba57-8ffd6bb7b78e sr-uuid=35f12fc0-d581-FFFF-48a2-992096ce47be device-config:device=/dev/disk/by-id/wwn-0x5002538e4019bff9
xe pbd-plug uuid=009b255a-3309-3d4d-f9dd-476b75ea8fc2
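The introduce/create/plug sequence used above can be wrapped in a small helper so it is easy to repeat per SR. This is just a sketch (the function name is mine); set `XE=echo` to preview the commands before running them for real:

```shell
# Sketch of the re-attach sequence used above. Run with XE=echo first to
# preview the xe commands before executing them for real.
XE="${XE:-xe}"

reattach_lvm_sr() {
    sr_uuid="$1"; host_uuid="$2"; device="$3"; label="$4"
    # Re-register the SR's metadata, then bind and plug it on this host.
    $XE sr-introduce content-type=user name-label="$label" shared=false \
        uuid="$sr_uuid" type=lvm
    pbd_uuid=$($XE pbd-create host-uuid="$host_uuid" sr-uuid="$sr_uuid" \
        device-config:device="$device")
    $XE pbd-plug uuid="$pbd_uuid"
}
```

For example: `reattach_lvm_sr ee4d5892-da0a-FFFF-0283-9c339309ba69 59fc5441-5835-405c-ba57-8ffd6bb7b78e /dev/disk/by-id/scsi-... "Dell HDD"`. This works because `xe pbd-create` prints the new PBD's UUID on success, so it can be captured and fed straight into `pbd-plug` instead of copying it by hand.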
Now all SRs and VMs are back up and running.
Hope this can help others that have a similar problem.
Thanks, this is a process I have been thinking about writing up, along with a feature request to fix it, because as you noted, it's a bit of a pain to do. XO has a backup procedure, but restoring after a fresh install poses these issues and requires the steps listed above.
I had to reinstall XCP-ng a little while back and found that it was easier to install XOA from source and then do the restore in the GUI. Then, once everything was restored, I went back and removed the orphaned VM used to do the restore.
I am now confident this works. I had to split the HDD pool: I transferred everything over to the SSD pool, did a fresh install, and followed the steps. No data loss.
I had to create a new pool for the HDD array, but that is just done the usual way.
We are also trying to document recovery of a single host pool.
Lots of experience with VMware and KVM/QEMU, new to XCP-ng.
My test case assumes host mainboard failure. I've installed XCP-ng on a new host. It's different hardware: same socket and chipset, but that may be about it. Figure if a host dies four years after deployment, we'd prefer not to be searching eBay for an identical replacement.
Pool metadata restore failed with "RESTORE_TARGET_MISSING_DEVICE(eth1)". So I added a PCIe 2-port NIC and confirmed the host now has eth0, eth1 & eth2. The restore still fails with the same error. And quite possibly, this is just the first of many?
These are fairly small deployments: single host, internal storage, 2-5 VMs, and one to three XO users. Not impossible to rebuild from scratch, but there's got to be a better way. Looking forward to hearing more on the topic.
If it's a different host and you don't even have the same network interfaces, you can just reload the host clean, set up networking, and then import the VMs from backup.
Tom, thanks for the reply. Correct me if I misunderstand. I'm just having a hard time accepting a plan for host failure that involves a restore... I know you said "import", but if the OS and data volume contents are coming from backup media, are potentially missing four to sixteen hours of new data, and involve moving terabytes of data from one medium to another, that's a restore in my mind, and not something I'd willingly do when the original (up-to-date) OS and data volumes are perfectly healthy.
At the very least, and I've done this with KVM/QEMU migrations, I'd try creating an identical new VM on the new host, but direct its storage to the original virtual disk files.
I do understand that XCP-ng may be far better at making frequent backups than other platforms, and perhaps that makes a restore more palatable.
Another thought I had: if we were to use shared network storage for the VMs, would it be possible to configure a second host that we don't plan to use? That is, stage two hosts and shared storage, but only deploy one host and the storage, leaving the other on the shelf as a spare. And could it be the spare for multiple unrelated deployments, even if that means dedicated XCP-ng boot media for each deployment?
I don't understand how a total host failure would not need a restore.
If a host in a pool with other hosts fails, you can just remove it from the pool and restore the VMs that were on that host's local storage. If the VMs on that host were using shared pool storage, you would not need to restore them; just start them on a working host in the pool.
If a pool with a single host completely fails, just reload a new host with XCP-ng, load XO, connect to the backups, and restore the VMs.
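If the backups are plain .xva exports reachable from the new host, the manual equivalent of that last step is roughly the following sketch (the backup directory, SR UUID, and function name here are placeholders; `XE=echo` previews the commands instead of running them):

```shell
# Sketch: import every .xva export from a backup directory into a target SR.
# Placeholder paths/UUIDs; set XE=echo to preview instead of executing.
XE="${XE:-xe}"

restore_all_xvas() {
    backup_dir="$1"; target_sr="$2"
    for xva in "$backup_dir"/*.xva; do
        [ -e "$xva" ] || continue   # skip if the glob matched nothing
        $XE vm-import filename="$xva" sr-uuid="$target_sr"
    done
}
```

For example: `restore_all_xvas /mnt/backups 35f12fc0-d581-FFFF-48a2-992096ce47be`, where the second argument is the SR the restored disks should land on.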
Solved: It is possible to move an internal SR to a fresh XCP-ng install on different hardware and start those VMs on the new host.
But Why?
We understand both VMware and XCP-ng are designed around multi-host deployments on shared storage.
Small businesses (dentists, optometrists, chiropractors, and other shops with two to 20 users) could run on a single bare-metal server. We choose to use a hypervisor for easier backup, migration, upgrade, and recovery, and for the economy of being able to run additional VMs on the same host.
Our single-host hypervisor deployments all boot from a single SSD and use internal RAID1 storage for virtual machines. While failures are extremely rare, this leaves a few single points of failure, including: mainboard, CPU, RAM, hypervisor boot media, and RAID controller (if hardware RAID). On VMware and KVM/QEMU, we can pick up our internal guest storage hardware, move it to a new host on different hardware, and recover the guests without resorting to rewriting virtual disk files from backup.
In the case of a host hardware failure where the storage and virtual disks are found to be perfectly healthy, we find it ridiculous to resort to rewriting entire virtual disks from backups that may be two to sixteen hours old. If forced to do so, we'd probably replace the physical media with new (more cost and delay) so we retain the ability to recover something if necessary.
Solutions:
Citrix Knowledge Center article CTX120962 is brief but covers re-introducing the storage to a new host.
Citrix support article CTX136342 adds some important details about installing a replacement XCP-ng host and about having saved a VM export from the original host while it was still running. CTX136342 is available to Citrix customers, but perhaps it can be found elsewhere. The document title is "How to Reinstall a XenServer Host and Preserve Virtual Machines on the Local Storage".
If your NICs are different, you'll still need to reassign networking for the VMs, but if you are running a single-host pool, you likely have only a few VMs to fix.