Dan,
Thanks for posting about this; I too was plagued by this NFS SR issue for a while and found it hard to describe (…not anymore). After some digging in the XCP-ng NFS Python code (/opt/xensource/sm/nfs.py), I found that there are two main "other-config" options you can pass to an NFS-based SR mount at the hypervisor level: an "nfs-timeout" value (which corresponds to the NFS "timeo" mount option) and/or an "nfs-retrans" value (which corresponds to the NFS "retrans" mount option).
The defaults in XCP-ng should be:
- nfs-timeout (timeo) = 200 (this is in tenths of a second, i.e. 20 seconds)
- nfs-retrans (retrans) = 4
If these defaults have not been changed, you should see "…,timeo=200,retrans=4,…" in the NFS options of your current NFS mounts at the command line (just type mount).
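For example, to pull just the relevant options out of the mount output:
mount | grep timeo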
You can change either of these values to make an NFS-based SR more resilient to NFS/network outages. The math/algorithm behind the settings is documented in the same nfs.py code, in the header comments between lines ~27 and 46. I used this info and ChatGPT to compute tables for a fixed nfs-timeout with varying nfs-retrans values, and for a fixed nfs-retrans with varying nfs-timeout values.
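To give a feel for the numbers: my reading of those comments is that the client waits at least timeo deciseconds for the initial request and again for each of the retrans retries before declaring a major timeout (any retry backoff only stretches the window further), so as a conservative lower bound an SR should ride out roughly (timeo / 10) × (retrans + 1) seconds of outage. Treat these as minimums, not exact figures:
- nfs-timeout = 200 (20 s), nfs-retrans = 4 (the defaults) -> at least 20 × 5 = 100 s
- nfs-timeout = 200, nfs-retrans = 24 -> at least 20 × 25 = 500 s
- nfs-timeout = 4200 (420 s), nfs-retrans = 4 -> at least 420 × 5 = 2100 s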
To gauge how long an SR backend would need to wait for the target to reboot/recover, I rebooted all of my FreeNAS/TrueNAS appliances one by one and timed how long each took for a full recovery (then added about 10% for good measure). We have several older servers that we use for testing/experimenting, and the slowest reboot was 6 min 30 sec. So my nfs-timeout and nfs-retrans settings are based on bridging a 7-minute outage, i.e., 420 seconds or 4200 tenths of a second.
I tested two configurations: (a) nfs-timeout = 4200 with the default nfs-retrans = 4, and (b) the default nfs-timeout = 200 with nfs-retrans increased to 24. Both worked! None of our VM disks using the rebooted NFS-backed SR went into a read-only state. You may see some processes/operations stall during the outage period (e.g., no new ssh logins to an impacted VM, though existing sessions stayed open), but once the NFS target comes back, all queued-up operations are flushed and continue. I did not need to reboot any of my VMs.
You should still see one entry in /var/log/kern.log indicating that the NFS server is not responding, e.g.,
==> /var/log/kern.log <==
DATE HYPERVISOR kernel: [#] nfs: server IP not responding, timed out
but you should NOT see any ERROR entries in /var/log/daemon.log indicating an I/O issue, e.g.,
==> /var/log/daemon.log <==
DATE HYPERVISOR tapdisk[#]: ERROR: errno -5 at __tapdisk_vbd_complete_td_request: … - Input/output error
To get a list of your NFS-based SRs, you can use:
xe sr-list type=nfs
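For each NFS SR, the output will look something like this (the UUID and names here are just placeholders):
uuid ( RO)                : 01234567-89ab-cdef-0123-456789abcdef
          name-label ( RW): NFS-SR-EXAMPLE
    name-description ( RW): NFS SR on our test filer
                host ( RO): &lt;shared&gt;
                type ( RO): nfs
        content-type ( RO): user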
To list the current parameters of an NFS SR, you can use:
xe sr-param-list uuid=INSERT_YOUR_SR_UUID_HERE
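If you only want the other-config map by itself, xe sr-param-get can pull that single parameter:
xe sr-param-get uuid=INSERT_YOUR_SR_UUID_HERE param-name=other-config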
Note the "other-config" line in the output. If you want to match my settings, you can use the following two commands. (I set both nfs-retrans and nfs-timeout explicitly, just in case the XCP-ng defaults change between updates.)
xe sr-param-set other-config:nfs-retrans=120 uuid=INSERT_YOUR_SR_UUID_HERE
xe sr-param-set other-config:nfs-timeout=200 uuid=INSERT_YOUR_SR_UUID_HERE
NOTE: For the changes to take effect, you will need to shut down all VMs using the NFS SR and then unplug/replug the SR (see the PBD commands below), or simply reboot the hypervisor(s). Afterwards, you can verify that the new other-config settings were applied with the mount command, e.g., "…,timeo=200,retrans=120,…".
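If you go the unplug/replug route instead of a full reboot, the SR's PBDs can be cycled with xe (again, with all VMs on the SR shut down first); the UUIDs below are placeholders:
xe pbd-list sr-uuid=INSERT_YOUR_SR_UUID_HERE
xe pbd-unplug uuid=INSERT_YOUR_PBD_UUID_HERE
xe pbd-plug uuid=INSERT_YOUR_PBD_UUID_HERE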
A copy of my helper script to change all NFS SRs on a hypervisor to these settings can be found at the GitHub link below.
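For reference, the core of such a script is just a loop over the NFS SR UUIDs; a minimal sketch (not the full script) looks like:
for sr_uuid in $(xe sr-list type=nfs --minimal | tr ',' ' '); do
    xe sr-param-set other-config:nfs-retrans=120 uuid="$sr_uuid"
    xe sr-param-set other-config:nfs-timeout=200 uuid="$sr_uuid"
done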