The strangest TCP Out-of-order, Dup ACK and retransmission on virtualized pfSense

Hi All,

I have the strangest network issue that I cannot seem to fix after many many hours of debugging. First (the relevant part of) my setup:

  • Dell R210ii
    • ESXi 6.7
      • pfSense virtualized
      • many VMs (all Debian/Ubuntu based)

The network is segmented with VLANs.

Now the problem. When I do a pcap (promiscuous mode) in pfSense on the LAN interface (which is the parent interface of all the VLANs), I get many, many TCP retransmissions, TCP Out-of-order segments and TCP Dup ACKs.
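
For reference, these labels come from the capture analyzer's sequence-number bookkeeping, not from the hosts themselves. Below is a minimal Python sketch of that kind of heuristic for one direction of one flow. The `Segment` type and the `classify` and `count_dup_acks` functions are hypothetical names for illustration, and Wireshark's real analysis uses more conditions (timing, SACK, window updates) than shown here:

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int     # first sequence number carried by the segment
    length: int  # payload bytes


def classify(segments):
    """Label each data segment of one flow direction, in capture order."""
    labels = []
    next_expected = None  # highest (seq + length) seen so far
    seen = set()          # (seq, length) pairs already captured
    for s in segments:
        key = (s.seq, s.length)
        if next_expected is None or s.seq == next_expected:
            label = "ok"
        elif s.seq > next_expected:
            label = "gap"  # earlier data is missing from the capture
        elif key in seen:
            label = "retransmission"  # exact same bytes seen before
        else:
            label = "out-of-order"    # older data arriving late
        seen.add(key)
        end = s.seq + s.length
        if next_expected is None or end > next_expected:
            next_expected = end
        labels.append(label)
    return labels


def count_dup_acks(ack_numbers):
    """Count pure ACKs that repeat the ACK number of the previous one."""
    dups = 0
    prev = None
    for ack in ack_numbers:
        if ack == prev:
            dups += 1
        prev = ack
    return dups
```

Under this simplified heuristic, any segment that reaches the capture point twice with the same sequence number and length is labeled a retransmission, whether or not the sender actually resent it.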

The problem happens across many devices (all?), but to track down the issue I focused on just 2 machines. Both are Debian-based VMs, and each is in a different VLAN, so traffic between them is forced to route through pfSense (note that all data flows over a virtual/SW switch in ESXi). I logged data in one virtual machine and in pfSense simultaneously, tracked the packets down in Wireshark and “synced” them to show them side by side:

As you can see, all data inside the VM is perfectly fine. Also, ethtool (in that same VM) shows no issues at all. Not even a dropped packet.
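
The side-by-side comparison can also be done mechanically. A rough Python sketch, assuming each capture has already been exported to a list of tuples (for example via a CSV export from Wireshark). The tuple layout and the `extra_copies` function name are assumptions for illustration, not something pfSense or Wireshark provides:

```python
from collections import Counter


def extra_copies(vm_packets, fw_packets):
    """Return packets seen more often in the firewall capture than in the VM capture.

    Each packet is a hashable tuple such as
    (src_ip, dst_ip, src_port, dst_port, seq, ack, payload_len).
    The result maps each such packet to the number of surplus copies.
    """
    vm_counts = Counter(vm_packets)
    fw_counts = Counter(fw_packets)
    return {
        pkt: count - vm_counts[pkt]
        for pkt, count in fw_counts.items()
        if count > vm_counts[pkt]
    }
```

Any packet in the result appeared at the firewall capture point more often than the VM sent or received it, which narrows down where the extra copies show up.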

While doing a lot of reading I tried things like:

  • Disable all HW offloading options in pfSense (was already the case)
  • Change network adapters in ESXi/VMware from VMXNET3 to E1000e (both VMs)
  • Disable TSO and TX checksumming using ethtool in one of the VMs
  • Disable LRO in ESXi using Net.Vmxnet3HwLRO and Net.TcpipDefLROEnabled
  • Disable TSO in ESXi using Net.UseHwTSO
  • Explicitly set MTU to 1500 on all interfaces in ESXi, pfSense and checked the VMs

All without any success or improvement.
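
To double-check that the offload settings in a VM really took effect, the output of `ethtool -k <iface>` can be parsed. A small Python sketch follows. The `offloads_still_on` helper and its default feature list are illustrative assumptions, and it only relies on the `name: on|off` line format that `ethtool -k` prints:

```python
def offloads_still_on(ethtool_k_output,
                      features=("tcp-segmentation-offload", "tx-checksumming")):
    """Return the listed offload features that are still reported as 'on'."""
    still_on = []
    for line in ethtool_k_output.splitlines():
        name, sep, value = line.strip().partition(":")
        if not sep:
            continue  # not a "name: value" line
        # Values can carry a suffix such as "off [fixed]"; keep the first word only.
        parts = value.split()
        state = parts[0] if parts else ""
        if name in features and state == "on":
            still_on.append(name)
    return still_on
```

The text to feed in could come from something like `subprocess.run(["ethtool", "-k", "eth0"], capture_output=True, text=True).stdout`, where `eth0` is a placeholder for the VM's actual interface name.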

Hence I’m out of options. Since no networking HW seems to be involved (only the vSwitch), I suspect an issue in the hypervisor config or the VMs.

I really hope somebody can help. Maybe the packet capture screenshot gives a hint?

I seem to have the exact same problem. Did you find any solution??

I switched to Proxmox, which seemed to solve it, although I’m not 100% sure ESXi was the problem.