XO Scheduled snapshots failing due to 10 second timeout?

Hi,

I’ve been running XO and XCPng since about 2019. Recently, I decided to set up a secondary location to run some of the VMs. I set up a similar scenario to my main office: an EPYC machine for the XCPng host and TrueNAS for data storage using NFS shares. The two sites are connected via a WireGuard peer-to-peer setup with pfSense on both ends. One side has a 500 Mbps fiber WAN connection, the other a 1 Gbps fiber connection. The XO instance runs at the main site and can see both hosts, and I can ping the XCPng server at the secondary office from the XO machine at the main office in only about 45 ms.

I was able to migrate a few VMs over to the secondary site and, initially, everything was going fine. The VMs I have running there are still running with no issues and have been for a couple of weeks. The only problem is that scheduled snapshot jobs have suddenly started failing for the machines hosted on the secondary site. XO keeps telling me: “Error: Connect Timeout Error (attempted address: [IP of secondary XCPng host]:443, timeout: 10000ms)”. If I go to an individual VM and create a snapshot manually, it works fine. Only the scheduled snapshots are failing, and I can’t figure out why. I’ve tried the workaround of adding the config.httpInactivityTimeout.toml file under /etc/xo-server/ (which didn’t exist). I also tried it under /etc/xo/xo-server/ and under another spot I can’t remember right now, made sure to restart the xo-server service, and even rebooted XO altogether, but I continue to get the 10000ms error.
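For reference, the workaround file I tried contained roughly this (written from memory, so treat the table name and value as approximate; also, now that I look at it, this tunes an *inactivity* timeout while the error above is a *connect* timeout, which might be why it had no effect):

```toml
# /etc/xo-server/config.httpInactivityTimeout.toml  (path and key from memory)
[xapiOptions]
httpInactivityTimeout = 3600000  # milliseconds, i.e. 1 hour
```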

I did more digging today and found I’m also getting the same error when XO tries to check time consistency for the secondary site:

“method”: “host.isHostServerTimeConsistent”,
“result”: { “name”: “ConnectTimeoutError”, “code”: “UND_ERR_CONNECT_TIMEOUT” }

I checked the pfSense logs and nothing’s being blocked. I made sure to set up Hybrid Outbound NAT so I could force static ports for connections between the two ends of the Wireguard tunnel.

I’m not sure where else to look or why it just started to be an issue seemingly out of nowhere after operating fine for 2 weeks. Does anyone have any ideas?

I use my local XO to manage other systems over a VPN, but I don’t have any auto snapshots set up. If no one here has any suggestions, I would post in the XCP-ng forums.

You don’t? I’ve followed all your XCPng videos and you’ve always talked about scheduled backup jobs, scheduled rolling snapshots, etc. These are what I’m referring to - specifically the rolling snapshots in this case. Sorry if my initial post was confusing on that front.

Or, if there’s a different way of handling this, maybe I’m confused on how to go about it. But, I thought having one instance of XO controlling multiple hosts at different locations was the right way to go about managing them. To clarify, the VMs at the different locations are stored on the NAS at their respective location and the rolling snapshots are not being transferred over VPN, only being called to run by the XO instance at the main office since that’s the unified control plane.

Have you checked if there is any time drift on each of the servers? I would check each server via SSH rather than using Xen Orchestra, since it’s clearly having trouble.
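A quick sketch of what I mean, assuming SSH access to both hosts (host names are placeholders):

```shell
# Sketch: report the clock drift between two epoch timestamps. In practice
# you'd pull each timestamp from a host over ssh, e.g. (placeholder names):
#   t1=$(ssh root@xcp-main 'date +%s')
#   t2=$(ssh root@xcp-remote 'date +%s')
drift() { echo $(( $1 > $2 ? $1 - $2 : $2 - $1 )); }
drift "$(date +%s)" "$(date +%s)"   # same clock here, so drift is ~0 seconds
```

Anything more than a couple of seconds between the two hosts would be worth chasing.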

Also, I don’t use it myself, but I wonder if this is what XO Proxy is for?

I’m going to be setting up two sites like this soon, so I’m interested in how this goes.

For some VMs I have the backup keep the snapshots, but not for all, since VMs with a high level of write activity will lose some performance with snapshots (this is not an XCP-ng issue; it’s a delta-tracking issue in other hypervisors as well).

All backups should be done using either a local XO or a remote XO that talks to an XO Proxy that is local to the backups. I only use a remotely connected XO for general VM management.

Awesome, I’ll check this out. I was unaware of the XO Proxy concept. Thanks!

I did check the time, but both match and I made sure to have both of them tied to NTP of their respective pfSense routers which, in turn, both reference the same external NTP source. I think, as you and Tom suggested, the XO Proxy is what I’m going to need to configure. I’ll let you know how it goes. Thanks for the suggestions.

I don’t need it, but this seems to be what XO Proxy was built to do, and it should be a better first step in getting your system doing what you need.

So, after setting up a local XO instance on the secondary site, I began looking into XO Proxy. I didn’t realize it was a paid feature requiring ~$4700 a year, since it also requires at least the XOA Essential+ subscription if I want all the features I already have (for free), including Smart Backup, etc. It may be even more, since that only includes one XO Appliance and “Extra Xen Orchestra Appliance” is listed as “optional”. I’ll probably just manage them separately, and if I still have issues, it will have to come down to something misconfigured on the secondary site.

I haven’t tried it, but it looks like you can use the same script used to make a Xen Orchestra install to make an XO Proxy install.


I think I found the problem, and it’s nothing to do with the VPN. It’s something to do with my backup Remotes in XO. I have a 10G direct connection between the XCPng and TrueNAS servers, which the VMs run over, and I access them on my LAN through a standard 1 Gbps connection. I also set the backup Remote to go through the 10G connection, and that’s where the timeouts are occurring. Haven’t found out why yet, as the hosted VMs are running over that connection just fine. But when I test the Remote, it just spins forever.

I found that at my main location, where I have the same setup, I forgot to change the Remotes over to the 10G connection when I set it up, but if I try changing to it now, I get the same timeout issue. I’m guessing it’s a permissions issue.

Seems it is a permissions issue, but I’m still not 100% on understanding what’s going on. I ended up having to set the NFS share permissions in TrueNAS to root for ‘Mapall User’ and ‘Mapall Group’ and add a dataset permission giving root ‘Full Control’. Not sure why that wasn’t the case to begin with.

Anyway, once that was done, I was able to set up my Remote from XO over the 10G interface…for one of the sub-folders of the share, but not more than one. The first one works every time, but the second one (yes, I’ve triple-checked the permissions at the folder level via the command line) simply will not mount and times out. But for now, whatever, at least I’ve got one folder. It would be nice to separate things out for organizational purposes, but I’ve spent all day on this, so at least part of it is working.

I still have this same issue on my main setup - I can mount multiple Remotes in XO over the 1G interface instantly, but not over the 10G interface, even though the VM Storage Repositories are already working off the 10G interface just fine. I don’t get it. I’ve made sure TrueNAS allows the subnet for the 10G interface alongside the 1G allowance, but when testing the mounts, I can get one to mount (which takes over a full minute to test successful), but not more than one. That doesn’t seem to make sense on a direct 10G connection (CAT6, no switch, static IPs set on both interfaces, /30 subnet). This all seems pretty straightforward, but I must be missing something.

If it helps, I also time out when trying to run sudo showmount -e [IP of the 10G interface], but get an immediate response with the same command against the IP of the 1G interface. And yet I can still mount the one Remote on the 10G interface. It lags for a couple of minutes before it connects, but once connected it responds immediately. Maybe it’s a NIC problem? I’m using one of the built-in 10G interfaces on the ASRock BERGAMOD8-2L2T, and the other is a built-in 10G interface on the Supermicro AS-2014CS-TR. I thought these should be fine without having to resort to discrete NICs.

Are the interfaces each on their own subnet?

And I have a video here on NFS & XCP-ng where I talk about permissions

Yeah, the main access 1G management interface is on the 10.22.3.0/24 subnet. The 10G interface where the VM Storage Repositories and attempted backup Remotes live is on a point-to-point (no switch, directly connected) 10.0.24.0/30 subnet.

Something of note - if I ssh into the XO server and run showmount -e to the 1G interface of the TrueNAS I get all the NFS exports listed immediately. If I try the same showmount -e to the 10G interface of the TrueNAS, it just sits there for a few minutes and then errors out with “clnt_create: RPC: Timed out”.
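In case anyone wants to reproduce what I’m checking, this is roughly the test I run against each interface (the address is a placeholder; showmount depends on rpcbind on port 111, and the mount itself on nfsd on port 2049):

```shell
# Sketch: check whether the NFS stack answers on a given interface address.
# showmount talks to rpcbind (port 111); the actual mount needs nfsd (2049).
check_nfs() {
  addr="$1"
  for port in 111 2049; do
    if nc -z -w 3 "$addr" "$port" 2>/dev/null; then
      echo "$addr:$port open"
    else
      echo "$addr:$port unreachable"
    fi
  done
}
check_nfs 10.0.24.1   # placeholder: TrueNAS 10G address on the /30
```

If 111 is unreachable on the 10G address but open on the 1G address, that would line up with showmount timing out on one interface only.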

The weird part is that the VMs running on the SRs over the 10G interface are running fine. It’s only the Remotes that get funky. I tried shutting everything down last night and rebooting the TrueNAS and then the XCPng server, and then the one Remote I had working stopped working altogether, so for now I have it set back on the 1G interface and it’s functioning, but I would prefer it to operate at 10G. I’ll check out the video again to see if I missed something.

A thought, and maybe I’m confused, but does XO also need to bind to the 10G interface even though it’s just controlling the XCPng server’s actions? Maybe the Remotes for backups aren’t simply controlling what XCPng does, but are tied directly to the XO instance? I’d have to pass it through from the XCPng host.

Okay, sorry for the long delay on status, but I finally got time to travel to my other location.

Turns out my suspicion was correct. I had to do PCI passthrough of a dedicated 10G NIC from XCPng to the XO VM, and now the backup Remote works perfectly with no timeout issues. This is a separate 10G NIC from the one I’m using between TrueNAS and XCPng for the VM Storage Repositories, which I didn’t have to dedicate and pass through to XO.

In the process, I also learned that I had to adjust the MTU on my site-to-site WireGuard connection down to 1420 to keep a stable connection between sites, but that’s a different issue. Just throwing that in here as a nugget of information in case it helps anyone, though I realize it’s a random addendum to this thread.
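For anyone curious how I landed on 1420: WireGuard adds roughly 60 bytes of per-packet overhead over IPv4 (80 over IPv6), so a 1500-byte WAN MTU leaves about 1420 for the tunnel, and I confirmed it with don’t-fragment pings across the VPN. Roughly like this (the peer address is a placeholder, and -M do is Linux ping’s don’t-fragment mode):

```shell
# Sketch: verify a candidate tunnel MTU by pinging the far side with the
# Don't Fragment bit set. ICMP payload = MTU - 28 (20-byte IP + 8-byte ICMP).
mtu=1420
payload=$(( mtu - 28 ))
echo "testing MTU $mtu with payload $payload"
# ping -c 3 -M do -s "$payload" 10.22.4.1   # placeholder: far-side address
```

If the ping fails with a “message too long” error, the candidate MTU is too big and you step it down until it goes through.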

Anyway, thanks everyone for your guidance along the way!
