Making "simple" XO pool more resilient?

Hi all, had a bit of a pucker moment this morning. Received notifications that two of my VMs were down due to no ICMP response, so I jumped on the VPN to see what was going on. Fortunately I have Xen Orchestra running as a VM on my TrueNAS machine, so I can independently monitor the two Protectli FW2Bs that are in the XCP-ng pool together.

The VMs were indeed down, but one of the XCP-ng hosts was also not responding to Xen Orchestra commands, and it happened to be the pool master…uh oh. I effectively had no control of the pool, so I couldn’t change the pool master to the second machine. The VMs reside on the local storage of each host, so normally migrating takes a few minutes but it’s no big deal. But since I basically didn’t have a pool at that moment, I couldn’t migrate anything, and because the master was acting up, I couldn’t seem to do much with VMs on the second host either.

SSH’d into the master to see what was up…basically any command I ran produced an “Input/output error”, including “reboot”. Major pucker moment. Ended up following this guide to force a reboot by echoing characters into /proc, and after several nail-biting minutes, the pool and the master came back up in XO. I immediately swapped the master to the secondary host and migrated all my critical VMs to that host.
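For anyone curious, the forced reboot was the magic SysRq trick; my exact steps may have differed slightly from the guide, but it was roughly this:

# enable the magic SysRq interface in case it isn't already
echo 1 > /proc/sys/kernel/sysrq
# s = sync disks, u = remount filesystems read-only, b = reboot immediately
echo s > /proc/sysrq-trigger
echo u > /proc/sysrq-trigger
echo b > /proc/sysrq-trigger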

Looking at the logs in the “bad” host, I don’t see anything out of the ordinary. It had been up and running for like two months before this. I did recently make some networking changes but that’s all.

My point here, ultimately, is to ask how I can make this setup more resilient in the future? I’m trying to avoid making it too complex by, for example, using external storage for my SRs. I did make the mistake of setting up the old host with thick provisioning; I set up the new one with thin provisioning. I’m thinking my next task will be to reinstall or reconfigure XCP-ng to use thin provisioning on the local SR. Beyond that, I wanted to know if there was anything else I could do to avoid this kind of major failure in the future. I’m especially worried because the AD I plan to deploy (NethServer) doesn’t really do primary and secondary servers; there’s a hot sync option but it still requires some manual intervention. Really don’t want to lose AD (and DNS, since it will be the primary DNS server)!
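From what I've read so far, the reconfigure route (instead of a full reinstall) would be roughly the following from the host's CLI; the UUIDs and device path are placeholders, and this wipes the SR, so everything has to be migrated off it first:

# identify the existing thick (LVM) local SR and its PBD
xe sr-list name-label="Local storage"
xe pbd-list sr-uuid=<SR UUID>
# detach and forget the old SR (anything still on it is lost)
xe pbd-unplug uuid=<PBD UUID>
xe sr-forget uuid=<SR UUID>
# recreate it as an EXT (thin-provisioned) SR on the same disk/partition
xe sr-create host-uuid=<host UUID> content-type=user shared=false type=ext device-config:device=/dev/sdX name-label="Local storage"

Happy to be corrected if there's a cleaner way to do that.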

Probably not the answer you want - have a dedicated server for XCP-NG, do not virtualize it.

Yes, you can run virtual machines on TrueNAS, but it is not really designed for hosting virtual machines - especially when that VM is what manages your VM hosting server (XCP-NG).

You could set up the existing server as XCP-NG and virtualize TrueNAS, if you can set up passthrough so TrueNAS can access the drives directly - but before you do this, there are plenty of horror stories where people have lost all of their TrueNAS data with this setup.

Your best solution is one server for XCP-NG and another one for TrueNAS.

Running your XO VM on TrueNAS in this case is OK. I’m not sure how those Protectli boxes are with XCP-NG, but I think there is a partnership between them and Vates so it should be fine. I would confirm that the SSD on the bunk host is healthy first and foremost. Then perform all the updates; there were a few in the last month that you’re probably missing if you have 2+ months of uptime.

As for resiliency, you could create a Disaster Recovery (DR) backup job to replicate VMs between the local storage on both hosts. In the event of a host failure you could SSH into the remaining host and start the replicas manually. Though this isn’t automatic, it is an option.
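Roughly speaking, from the surviving host's console it would be something like this (the UUID is a placeholder):

# list halted VMs, which will include the DR replicas
xe vm-list power-state=halted
# start the replica you need
xe vm-start uuid=<VM UUID>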

You could also look into HA but there are some caveats there.

Dedicated servers for XCP-NG, a storage pool or two, and run the hosts in HA mode (HA requires a storage pool for the heartbeat). It seems like you have the dedicated servers for the hosts, is this correct? Since you have a TrueNAS, you could use it for the VM storage pool, assuming the disk space and network saturation work out.

My intended production layout for work with XCP-NG is 3 servers for the hosts, at least one TrueNAS for VM storage, an ISO share, and a third storage target for the heartbeat. I intend to run this with HA, which isn’t strictly required for most systems, but letting the pool elect a new master and automatically restart any VMs that crashed when the previous host failed is something I’ve wanted for a long time.
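If it helps, the CLI side of enabling HA is roughly this once you have a shared SR for the heartbeat (UUIDs are placeholders, and each VM has to be marked for auto-restart individually):

# enable HA on the pool, using a shared SR for the heartbeat/statefile
xe pool-ha-enable heartbeat-sr-uuids=<shared SR UUID>
# tolerate one host failure
xe pool-param-set uuid=<pool UUID> ha-host-failures-to-tolerate=1
# mark a VM to be restarted automatically if its host fails
xe vm-param-set uuid=<VM UUID> ha-restart-priority=restart order=1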

Also, I think you did one thing very well: you put your XO on a different machine. My test lab has XO on a little HP T630 running Debian, precisely because it is separate from the XCP-NG system. If the system is down, you can’t even look at it with XO if XO is stored on the same XCP-NG system that crashed. You can always run the paid version in the hypervisor and the “community” version on a different computer. To me it just feels right to have XO live off of the hypervisor, or live in two places.

This would be a better solution than mine if the 1gbe uplink is sufficient for the use case. Alternatively, you could use local storage for non-critical VMs and shared storage + HA for critical VMs.

Your XCP-NG proposed setup is very similar to mine and it’s been working perfectly!

DR seems like a good next step. The only issue is that if the pool master goes sideways, I seem to lose communication with the second (non-master) host for some reason. SSH with xe commands seems like the only way to bring up my DR image, I guess?

To be clear, Xen Orchestra is a VM on TrueNAS; XCP-ng runs on two pieces of dedicated hardware (two hosts in one pool).

I believe any solution will require some CLI. You could force the second node to become the pool master after a failure, but that would require SSHing into the node. This post has the needed commands with some more context, but TLDR it’s:

# run on the surviving non-master host to promote it to pool master
xe pool-emergency-transition-to-master
xe-toolstack-restart
# once everything is reachable again, designate the preferred host as master
xe pool-designate-new-master host-uuid=<Slave UUID>
xe-toolstack-restart

You might have some luck adding both hosts into XO; you’ll get a warning on the second one that “this pool is already connected”, but you could disable and re-enable it while the master is down and it might assume the role.

Otherwise, yes, you would have to SSH / CLI into the remaining hosts and bring the VMs online with XE commands.


I haven’t had much luck adding the other hosts to XO; once the master goes down I still lose connection to the other hosts. This is currently a non-HA configuration. HA was starting to get in the way while I was making changes and getting things set up. On my production system I think HA will be valuable for getting servers back up within a few minutes of a host crash. Not sure how this will affect running updates, but I guess I’ll find out soon.