This is a bit hard to describe in a title, so I’m hoping to clarify here.
I have an XCP-NG host with a few VMs that use SRs that are connected to the XCP-NG host via NFS. Meaning: to the VM, it’s just a normal disk, but the host connects to the physical storage device via NFS.
Occasionally, some issue will occur and the VM loses connectivity to the NFS-backed SR for a period of time. (Note that these are Debian Bullseye VMs.) When this happens, the VM usually remounts the filesystem on that disk read-only. In all cases, the NFS SRs are secondary disks I use for storing larger files (e.g. the Nextcloud data directory), so it’s acceptable to attempt to remount them read-write while the VM is up. However, this never works, at least in my attempts so far.
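For monitoring this from inside the guest, one option is to watch `/proc/mounts` for the `ro` flag appearing on the data mount. A minimal sketch (the `/mnt/data` mount point is a hypothetical example, not from the original post):

```shell
#!/bin/sh
# Report "ro" if the given mount point has been flipped read-only,
# e.g. after ext4 hits an I/O error on the NFS-backed disk.
# /mnt/data below is an example name; substitute your real mount point.
is_readonly() {
    # $1 = mount point, $2 = mounts file (defaults to /proc/mounts)
    awk -v mp="$1" '
        $2 == mp {
            n = split($4, o, ",")          # 4th field is the option list
            for (i = 1; i <= n; i++)
                if (o[i] == "ro") { print "ro"; exit }
        }' "${2:-/proc/mounts}"
}

# Example: alert when the data disk has gone read-only
[ "$(is_readonly /mnt/data)" = "ro" ] && echo "data disk is read-only"
```

This could be wired into cron or a monitoring check so the read-only flip is noticed before Nextcloud goes dark.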
- Can I tweak the VM’s fstab to force it to automatically remount read-write?
- Can I also give the VM more “tolerance” for errors on this disk, similar to how a normal NFS mount will retry operations for a while before it considers the server failed?
- Is there anything I should do on the XCP-NG host or the storage device (TrueNAS) to mitigate these issues? They don’t happen often, but I frequently don’t notice them until Nextcloud or some other piece of infrastructure suddenly goes offline.
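On the first two questions, a couple of knobs worth knowing about. Inside the guest, ext4’s `errors=` mount option controls what happens on an I/O error: the common `errors=remount-ro` is exactly what flips the disk read-only, while `errors=continue` avoids the flip at the cost of corruption risk. The NFS retry behavior (`hard`/`soft`, `timeo`, `retrans`) applies on the XCP-NG host’s mount of the SR, not in the guest, since the guest only sees a block device. A sketch, with `/dev/xvdb` and `/mnt/data` as hypothetical example names:

```shell
# Hypothetical /etc/fstab entry in the guest for the NFS-backed data disk.
# errors=remount-ro is the behavior described above; errors=continue would
# keep the fs writable through errors, but risks silent corruption:
#   /dev/xvdb  /mnt/data  ext4  defaults,errors=remount-ro  0  2

# Once the SR is reachable again, a read-write remount can be attempted.
# This only succeeds if the kernel hasn't recorded filesystem errors;
# otherwise the safer path is unmount, fsck, and a fresh mount:
mount -o remount,rw /mnt/data \
    || { umount /mnt/data && fsck -y /dev/xvdb && mount /mnt/data; }
```

There is no fstab option that retries the remount automatically; if that behavior is wanted, the remount attempt above would have to run from cron or a systemd timer.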
XCP-NG, like most hypervisors, expects its storage to be persistent. I have never tried to mitigate storage-loss issues at the hypervisor level; we generally try to make the storage itself more robust instead.
I have a backup TrueNAS server and a PowerShell script that, if it sees the service (in your case, the cloud) down for 30 seconds, switches DNS records on all DCs and GCs. Enterprise servers fail very rarely. I also have a co-loc that I replicate changes to for DR (living in Florida, hurricanes are a threat). My backup server is a VM, and I have another server that I can just move my 24 SSD drives to and upload my backup config. After almost 2 years I have never had a failure at the host level; only one memory stick failed due to ECC errors.

But I’m redundant at every single point in the company’s internal mini datacenter, and if a hurricane comes I switch to DR, where I have every user workstation and the co-loc machine constantly updated. I also do backups via Backup Exec to the Windows shares daily and replicate the small amount of daily changes, and it goes to tape every 2–3 days. I also do snapshots every 4 hours to protect against ransomware attacks, which haven’t happened.

I’m using old gear (R720xd) for the TrueNAS servers, but the disks are fast and the data is almost all text, not bigger stuff like video editing, etc. Always be redundant. I don’t run VMs on TrueNAS either; I have 2 host servers where I split my DCs and other servers of importance.
Thanks Tom. To be fair, I think my problem is that my UPS batteries are 6 years old. I did a quick “pull the plug” test the other day and went from 100% to 68% in 9 seconds! I’m thinking the UPS self-test is causing my primary issue.