Migrating VMs from Dead Host to Live Host with XCP-ng/XO

Donovan · March 22, 2023, 9:37pm

I’ve been experimenting with XCP-ng and XO over the last few weeks and have run into a scenario I can’t find a solution for. I’m trying to simulate the loss of a host in a pool and then start its VMs on another host in the pool without HA enabled. My test lab is a single pool of 2 identical hosts, with all VM storage on a remote NFS share.

When I unplug the member host (not the master), XO shows it as halted after approximately 10 minutes, but the VM that was running on that host continues to show as running. When I try to migrate that VM to the other host in the pool, it fails because the “host is offline”. I assumed that because the VM disk was on a remote NFS share, I could simply migrate it to another host in the pool.

Any advice on what I’m doing incorrectly or not understanding?

LTS_Tom · March 23, 2023, 11:16am

If the host you lose is set as Master then the pool will lose track of those VMs. You have to promote the other host to master, as it does have a copy of the VM metadata, and you may also have to update the status of the VM from running to not running in that database so it can start. The command to reset the VM power state looks like this:

xe vm-reset-powerstate vm=29384515-9728-261d-a278-a66b1e0eddd3 --force

Just replace the UUID with your VM UUID that needs to be reset.
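
As a rough sketch of the two steps Tom describes, the helper functions below wrap the promotion and the power-state reset. The function names, the run wrapper, and the DRY_RUN switch are illustrative assumptions, not part of XCP-ng; pool-emergency-transition-to-master and pool-recover-slaves are the xe commands usually used to promote a surviving member, and they only apply if the dead host really was the master. By default the script prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: with DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run on the surviving host, and only when the dead host was the pool master.
promote_self_to_master() {
    run xe pool-emergency-transition-to-master
    run xe pool-recover-slaves
}

# Mark one VM as halted in the pool database.
# $1 is the UUID of the VM that still shows as running.
reset_vm_powerstate() {
    vm_uuid="$1"
    run xe vm-reset-powerstate vm="$vm_uuid" --force
}
```

Calling reset_vm_powerstate with your VM UUID prints the same command shown above; set DRY_RUN=0 once the printed commands look right.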

Donovan · March 23, 2023, 4:06pm

Thanks Tom. I followed your instructions which did put the VM in a halted state, but when I try to power it on I get this error:

“SR_BACKEND_FAILURE_46(, The VDI is not available [opterr=[‘HOST_OFFLINE’, ‘OpaqueRef:b25f90a4-c65c-4fcb-aaa8-26e80255302c’]], )”

The VDI is on a Synology NFS share which is still online, so I’m not sure what is hanging things up. I’ll keep searching, or post this error on the XO forums in case it is related to my environment.

But just for clarity, normally when a host goes down in a pool, it is just a matter of making sure a live host is set to master, resetting the power state of the dead host’s VM, and then starting the VM on another host. Is this correct?

LTS_Tom · March 23, 2023, 11:49pm

Yup, that “SR_BACKEND_FAILURE” error makes me think the working host has lost connection to the storage.
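
One way to test that hypothesis is to look at the SR’s PBDs, which record whether each host currently has the storage attached. This is a minimal sketch, assuming you already know the SR UUID; the function names and the DRY_RUN wrapper are made up for illustration, while pbd-list and pbd-plug are standard xe commands.

```shell
#!/bin/sh
# Sketch only: with DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Show, per host, whether the SR is currently attached.
# $1 is the SR UUID (from xe sr-list).
list_sr_attachments() {
    sr_uuid="$1"
    run xe pbd-list sr-uuid="$sr_uuid" params=uuid,host-uuid,currently-attached
}

# Re-attach the SR on a live host if its PBD shows currently-attached: false.
# $1 is the PBD UUID taken from the listing above.
replug_pbd() {
    pbd_uuid="$1"
    run xe pbd-plug uuid="$pbd_uuid"
}
```

If the live host’s PBD reports currently-attached as true, the storage connection is probably not the cause and the error is coming from somewhere else.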

Donovan · March 25, 2023, 8:26pm

Updating my post as I found a working solution.

The “SR_BACKEND_FAILURE… HOST_OFFLINE” error was because XCP-ng was trying to start the VM on the dead host. I’m sure this is by design to prevent corruption issues. In addition to resetting the power state of the VM as Tom suggested, I ran commands to remove the dead host and its SRs from the pool. Here are all the steps I followed to get the VM started on another host in the pool.

1. Reset the power state of each VM on the dead host. You can find the UUID of your VMs with xe vm-list.
   xe vm-reset-powerstate vm=the-uuid-of-your-vm --force
2. Forget the dead host. You can find the UUID of your hosts with xe host-list.
   xe host-forget uuid=the-uuid-of-your-host --force
3. Forget the SRs such as remote storage, DVD storage, and removable storage. You can find the UUID of your SRs with xe sr-list. They’ll be easy to spot as they’ll be listed as “Not in database”.
   xe sr-forget uuid=the-uuid-of-your-sr

Once these steps are completed, you can start the VMs and they’ll start on a live host in the pool.
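
The steps above could be strung together roughly as follows. This is a sketch under assumptions, not a tested recovery tool: the function names, the comma-separated argument format, and the DRY_RUN wrapper are all illustrative, and the UUIDs must come from your own xe vm-list, xe host-list, and xe sr-list output. It prints the commands by default so you can review them before anything is forgotten from the pool.

```shell
#!/bin/sh
# Sketch only: with DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: UUID of the dead host
# $2: comma-separated UUIDs of the VMs that were running on it
# $3: comma-separated UUIDs of the dead host's SRs (may be empty)
recover_from_dead_host() {
    host_uuid="$1"
    vm_uuids="$2"
    sr_uuids="$3"

    # Step 1: mark each stranded VM as halted.
    for vm in $(echo "$vm_uuids" | tr ',' ' '); do
        run xe vm-reset-powerstate vm="$vm" --force
    done

    # Step 2: remove the dead host from the pool database.
    run xe host-forget uuid="$host_uuid" --force

    # Step 3: forget the SRs that belonged only to the dead host.
    for sr in $(echo "$sr_uuids" | tr ',' ' '); do
        run xe sr-forget uuid="$sr"
    done
}

# Start the recovered VMs; the pool picks a live host for each.
# $1: comma-separated VM UUIDs
start_vms() {
    for vm in $(echo "$1" | tr ',' ' '); do
        run xe vm-start uuid="$vm"
    done
}
```

For example, recover_from_dead_host the-uuid-of-your-host "vm-uuid-1,vm-uuid-2" "sr-uuid-1" prints two reset commands, one host-forget, and one sr-forget, in that order. Re-run with DRY_RUN=0 only after checking that none of the listed SRs is the shared NFS SR that still holds the VM disks.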

If the dead host is repaired and will be re-added to the pool, do a fresh install of XCP-ng on the host before adding it back to the pool. Otherwise, you’ll get an error.

Solution was found in this old blog post from 2013. Worked like a charm.