AMD-Vi: IO_PAGE_FAULT - Network Freeze - XCPNG

Here is my home server specification

XCP-ng 8.2 running on
Ryzen 7 2700, MSI Tomahawk Max motherboard
48GB of DDR4 Corsair memory
3 x 4TB Seagate IronWolf HDDs (passed through to the TrueNAS VM)
240 GB Samsung SSD as boot and ISO store
3 x TP-LINK TG-3468 NICs

I run pfSense, TrueNAS, and Home Assistant virtualized.

This setup has been up and running for the past 3 years without any glitches. Recently, though, my server's network has started freezing randomly during heavy network activity. The server itself stays up and running, but I am not able to connect to it or to any of the VMs. The only way to get connectivity back is to hard restart the server.

I started looking into the logs. I checked kern.log, boot.log, dmesg and /xen/hypervisor.log, and found the following errors being logged whenever I read or write a huge file on my TrueNAS Samba share:

[2023-01-07 13:09:27] (XEN) [ 585.899984] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x301, fault address = 0xfffffffdf8000000, flags = 0x8
[2023-01-07 13:10:00] (XEN) [ 618.410794] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x301, fault address = 0xfffffffdf8000000, flags = 0x8
[2023-01-07 13:10:46] (XEN) [ 664.593883] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x301, fault address = 0xfffffffdf8000000, flags = 0x8
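
For anyone trying to work out which device is faulting, here is a minimal Python sketch that pulls the fields out of lines like the ones above and turns the device id into PCI bus:device.function form. It assumes the device id Xen prints is the usual 16-bit PCI requester ID (bus in the top 8 bits, device in the next 5, function in the low 3). The regex is based only on the three lines quoted here, and the helper names are made up for illustration.

```python
import re

# Pattern derived from the log lines quoted in this post (an assumption,
# other Xen versions may format these messages differently).
FAULT_RE = re.compile(
    r"AMD-Vi: IO_PAGE_FAULT: domain = (\d+), "
    r"device id = (0x[0-9a-fA-F]+), "
    r"fault address = (0x[0-9a-fA-F]+), "
    r"flags = (0x[0-9a-fA-F]+)"
)


def decode_device_id(device_id: int) -> str:
    """Convert a 16-bit PCI requester ID to bus:device.function notation."""
    bus = (device_id >> 8) & 0xFF   # bits 15..8
    dev = (device_id >> 3) & 0x1F   # bits 7..3
    fn = device_id & 0x7            # bits 2..0
    return f"{bus:02x}:{dev:02x}.{fn:x}"


def parse_fault(line: str):
    """Return the fault fields as a dict, or None if the line does not match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    domain, dev_id, addr, flags = m.groups()
    return {
        "domain": int(domain),
        "bdf": decode_device_id(int(dev_id, 16)),
        "fault_address": int(addr, 16),
        "flags": int(flags, 16),
    }


if __name__ == "__main__":
    sample = (
        "(XEN) [ 585.899984] AMD-Vi: IO_PAGE_FAULT: domain = 0, "
        "device id = 0x301, fault address = 0xfffffffdf8000000, flags = 0x8"
    )
    print(parse_fault(sample))
```

Under that assumption, device id 0x301 decodes to 03:00.1, which can be compared against the output of `lspci -s 03:00.1` in dom0 to see which device it is.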

When the network activity stops, the errors stop appearing too. The network does not go down as soon as an IO_PAGE_FAULT appears, though; it drops at seemingly random times (maybe under heavy load?). The last time it went down, I saw a CPU iowait of 60 in glances and immediately lost connectivity to the server.

So far I have tried the following:

  • Tried an Intel I350 T4 Gigabit NIC for the TrueNAS VM, assuming the problem might be the Realtek driver the TP-Link NICs use. No luck.

  • I also added an NVMe SSD a few months back, and I suspect the problems started around then, but I can't be sure: I removed the NVMe today, tried to reproduce the error, and still got the faults.

  • Upgraded the BIOS firmware - No Luck

  • Tried reseating all RAM sticks - No Luck

  • Tried Unplugging and replugging SATA cables - No Luck

  • Ran iperf against the TrueNAS VM, and that did not trigger any IO page faults, so it might not be the network, the NIC, or the memory (memtest came out good).

  • Dumped smartctl -a for all HDDs in my TrueNAS pool and found nothing obvious.

Finally, I tried writing a 20GB file into the ZFS dataset using dd from within the TrueNAS shell (to test a file write without using the network), and I saw the IO page faults popping up.
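
If anyone wants to repeat that kind of local write test without dd, something like the following Python sketch should do roughly the same job. It is not the command I ran; the function name and the example path are hypothetical. It writes random data rather than zeros, on the assumption that ZFS compression could otherwise shrink the write and put less load on the disks.

```python
import os
import time


def write_test_file(path: str, total_bytes: int, chunk_bytes: int = 1024 * 1024) -> float:
    """Write total_bytes of random data to path, fsync it, return elapsed seconds."""
    chunk = os.urandom(chunk_bytes)  # one random chunk, reused for every write
    written = 0
    start = time.monotonic()
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # force the data out to the pool before timing stops
    return time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical dataset path; replace with a real path on the pool.
    target = "/mnt/tank/iofault-test.bin"
    size = 20 * 1024 ** 3  # 20 GiB, similar in size to the dd test
    seconds = write_test_file(target, size)
    print(f"wrote {size} bytes in {seconds:.1f} s")
```

Watching the hypervisor log while this runs shows whether the faults track disk writes alone, with no network traffic involved.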

So my guess is that one of the disks is failing and the SMART data has not picked it up yet. But I'm not sure whether AMD-Vi: IO_PAGE_FAULT is even related to disk errors.

I’ll be grateful if anyone could provide me a solution.

Not an issue I have encountered. You might want to also post in the XCP-ng and XO forum.

Yeah, I have posted in the XCP-ng forum as well: AMD-Vi: IO_PAGE_FAULT - Network Freeze | XCP-ng and XO forum

Hoping for a solution!