Inter-VLAN Communication Issues with XCP-NG, pfSense VM, and Unifi Setup

chrizz · December 1, 2023, 11:03am

Hello everyone,

I’ve recently encountered a challenging issue with my network setup involving VLANs, and I’m hoping to get some insights or solutions from this knowledgeable community. Here’s a brief overview of my current setup and the specific problem I’m facing.

Host System:
XCP-NG 8.3-beta1 running on a cwwk mini-PC with an Intel i3-N305 CPU and eight 2.5G NICs (i226-V). 6 LAN Firewall Appliance 2.5G Router 12th Gen Intel i3-N305/N100 DDR5 – cwwk

The reason for running 8.3-beta1 is support for the NICs and graphics on this PC; 8.2 does not install at all.

pfSense VM (version 2.7.1) with 4 VIFs configured as follows:

vif0: External (eth0)
vif1: Internal (eth1, 192.168.88.1/24)
vif2: Guests (eth1, 192.168.55.1/24, VLAN 55)
vif3: SFW (eth1, 172.16.0.1/24, VLAN 20)
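
As a rough illustration of the addressing above, the following sketch maps an IP address to the pfSense interface whose subnet contains it, which makes it easy to see when two hosts have to be routed through pfSense. The subnets are taken from the list; the helper names `interface_for` and `crosses_vlans` are hypothetical and only used here.

```python
import ipaddress

# Interface addresses as listed in the post (gateway address / prefix length).
INTERFACES = {
    "vif1 Internal": ipaddress.ip_interface("192.168.88.1/24"),
    "vif2 Guests (VLAN 55)": ipaddress.ip_interface("192.168.55.1/24"),
    "vif3 SFW (VLAN 20)": ipaddress.ip_interface("172.16.0.1/24"),
}


def interface_for(address):
    """Return the name of the interface whose subnet contains address, or None."""
    ip = ipaddress.ip_address(address)
    for name, iface in INTERFACES.items():
        if ip in iface.network:
            return name
    return None


def crosses_vlans(src, dst):
    """True if src and dst fall in different listed subnets (traffic must be routed)."""
    return interface_for(src) != interface_for(dst)


if __name__ == "__main__":
    # The two hosts that appear in the capture further down the thread.
    print(interface_for("172.16.0.104"))
    print(interface_for("192.168.88.5"))
    print(crosses_vlans("172.16.0.104", "192.168.88.5"))
```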

Network Setup:
VLANs are defined in XCP-NG with Xen Orchestra for the pool (one XCP-NG host). TX offloading is disabled for all VIFs connected to pfSense and other FreeBSD VMs.

I have an 8-port Unifi switch (US-8-60W) connected to the host system, and a couple of Unifi APs connected to the switch. These are controlled via the Unifi controller running on a Debian VM. In the Unifi controller I have configured three networks and three WiFi networks: default (no VLAN), SFW (VLAN 20), and Guest (VLAN 55).

The Issue:
While general browsing from my laptop works fine over all WiFi networks (and wired), I’m experiencing packet loss and SSH connection drops when communicating between VLANs. When I browse Xen Orchestra from the default network via its SFW (VLAN 20) IP address, I get logged out after a short while. It works fine if I’m within the same VLAN or stay within the default network’s IP range. This issue didn’t exist in my previous setup on a Supermicro server (XCP-NG 8.2), from which the same VMs were migrated.

Specific Scenario:
When I’m connected to the SFW VLAN (172.16.0.0/24) over WiFi and SSH into a Debian server on the default network (192.168.88.0/24), the connection initially works but hangs after about 15-30 seconds, then drops: “client_loop: send disconnect: Broken pipe”

Observations and troubleshooting done so far:

SSH connection to a server within the same VLAN is stable.
Issue persists across different laptops and both WiFi and direct connections to switch.
Created a bridge in pfSense to be able to connect directly to the SFW VLAN via cable. This was stable.
Bound a port on the Unifi switch to the SFW VLAN to connect directly via cable. This was NOT stable.
Replacing the switch with a Netgear GS108 did not resolve anything.
Fresh installation and configuration of pfSense did not resolve anything.
Disabled specific rules in pfSense to allow unrestricted inter-VLAN communication.
Wireshark captures show “Spurious Retransmissions” and “Duplicate ACKs” right before the disconnect.

Wireshark log just before and when the SSH connection hangs:

No. Time Source Destination Protocol Length Info

468 17.397715 192.168.88.5 172.16.0.104 SSH 166 Server: Encrypted packet (len=100)

469 17.397718 192.168.88.5 172.16.0.104 SSH 102 Server: Encrypted packet (len=36)

470 17.397959 172.16.0.104 192.168.88.5 TCP 66 64059 → 22 [ACK] Seq=1 Ack=2449 Win=2048 Len=0 TSval=2342211817 TSecr=1567302730

550 18.421836 192.168.88.5 172.16.0.104 SSH 166 Server: Encrypted packet (len=100)

551 18.421839 192.168.88.5 172.16.0.104 SSH 102 Server: Encrypted packet (len=36)

552 18.422079 172.16.0.104 192.168.88.5 TCP 66 64059 → 22 [ACK] Seq=1 Ack=2585 Win=2048 Len=0 TSval=2342212841 TSecr=1567303754

560 18.465260 192.168.88.5 172.16.0.104 TCP 102 [TCP Spurious Retransmission] 22 → 64059 [PSH, ACK] Seq=2549 Ack=1 Win=501 Len=36 TSval=1567303797 TSecr=2342211817

561 18.465433 172.16.0.104 192.168.88.5 TCP 78 [TCP Dup ACK 552#1] 64059 → 22 [ACK] Seq=1 Ack=2585 Win=2048 Len=0 TSval=2342212884 TSecr=1567303797 SLE=2549 SRE=2585

564 18.685180 192.168.88.5 172.16.0.104 TCP 202 [TCP Spurious Retransmission] 22 → 64059 [PSH, ACK] Seq=2449 Ack=1 Win=501 Len=136 TSval=1567304017 TSecr=2342211817

565 18.685384 172.16.0.104 192.168.88.5 TCP 78 [TCP Dup ACK 552#2] 64059 → 22 [ACK] Seq=1 Ack=2585 Win=2048 Len=0 TSval=2342213104 TSecr=1567304017 SLE=2449 SRE=2585

574 19.125225 192.168.88.5 172.16.0.104 TCP 202 [TCP Spurious Retransmission] 22 → 64059 [PSH, ACK] Seq=2449 Ack=1 Win=501 Len=136 TSval=1567304457 TSecr=2342211817

575 19.125462 172.16.0.104 192.168.88.5 TCP 78 [TCP Dup ACK 552#3] 64059 → 22 [ACK] Seq=1 Ack=2585 Win=2048 Len=0 TSval=2342213544 TSecr=1567304457 SLE=2449 SRE=2585

585 20.083755 192.168.88.5 172.16.0.104 TCP 202 [TCP Spurious Retransmission] 22 → 64059 [PSH, ACK] Seq=2449 Ack=1 Win=501 Len=136 TSval=1567305353 TSecr=2342211817

586 20.083957 172.16.0.104 192.168.88.5 TCP 78 [TCP Dup ACK 552#4] 64059 → 22 [ACK] Seq=1 Ack=2585 Win=2048 Len=0 TSval=2342214503 TSecr=1567305353 SLE=2449 SRE=2585

615 21.728818 172.16.0.104 192.168.88.5 WebSocket 80 WebSocket Text [FIN] [MASKED]

616 21.733162 192.168.88.5 172.16.0.104 WebSocket 76 WebSocket Text [FIN]

617 21.733325 172.16.0.104 192.168.88.5 TCP 66 64164 → 80 [ACK] Seq=1019 Ack=419 Win=131328 Len=0 TSval=2478428867 TSecr=1567307065

618 21.781205 192.168.88.5 172.16.0.104 TCP 202 [TCP Spurious Retransmission] 22 → 64059 [PSH, ACK] Seq=2449 Ack=1 Win=501 Len=136 TSval=1567307113 TSecr=2342211817

619 21.781357 172.16.0.104 192.168.88.5 TCP 78 [TCP Dup ACK 552#5] 64059 → 22 [ACK] Seq=1 Ack=2585 Win=2048 Len=0 TSval=2342216200 TSecr=1567307113 SLE=2449 SRE=2585
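
A capture like the one above can be summarized with a short script. This is only a sketch and assumes the rows follow the column layout shown in the header line (No., Time, Source, Destination, Protocol, Length, Info); the function name `summarize` is hypothetical. It counts spurious retransmissions and duplicate ACKs per sending host and collects the TSecr values echoed in the retransmissions. In the excerpt, every retransmission from 192.168.88.5 echoes TSecr=2342211817, the TSval of frame 470, which may indicate that the client's later ACKs never reached the server.

```python
import re
from collections import Counter

# One row of the packet list as pasted above:
# No. Time Source Destination Protocol Length Info
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+\.\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(.*)$"
)


def summarize(lines):
    """Count retransmissions and duplicate ACKs per sending host.

    Returns (counts, echoed), where counts maps (source, kind) to a count and
    echoed is the set of TSecr values seen in spurious retransmissions.
    """
    counts = Counter()
    echoed = set()
    for line in lines:
        match = ROW.match(line)
        if not match:
            continue  # header row, blank line, or unexpected layout
        src, info = match.group(3), match.group(7)
        if "Spurious Retransmission" in info:
            counts[(src, "spurious_retransmission")] += 1
            tsecr = re.search(r"TSecr=(\d+)", info)
            if tsecr:
                echoed.add(tsecr.group(1))
        elif "Dup ACK" in info:
            counts[(src, "dup_ack")] += 1
    return counts, echoed


if __name__ == "__main__":
    sample = [
        "No. Time Source Destination Protocol Length Info",
        "560 18.465260 192.168.88.5 172.16.0.104 TCP 102 [TCP Spurious Retransmission] "
        "22 > 64059 [PSH, ACK] Seq=2549 Ack=1 Win=501 Len=36 TSval=1567303797 TSecr=2342211817",
        "561 18.465433 172.16.0.104 192.168.88.5 TCP 78 [TCP Dup ACK 552#1] "
        "64059 > 22 [ACK] Seq=1 Ack=2585 Win=2048 Len=0 TSval=2342212884 TSecr=1567303797",
    ]
    counts, echoed = summarize(sample)
    for key, value in counts.items():
        print(key, value)
    print(echoed)
```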

I’ve spent considerable time troubleshooting this without success. Any insights, suggestions, or similar experiences would be greatly appreciated.

Thanks in advance!

neogrid · December 1, 2023, 11:17am

The obvious thing to try is to install pfSense on the bare-metal box and see if you get the same issues. If so, then it’s the hardware or config; if not, then it’s the virtualisation.

If you have a spare HDD, this should be quick and easy to test.

Or how about trying out Proxmox that’s based on the latest Debian release, I’d guess it has the drivers you need, see what happens.

My guess is the virtualisation is the source of your issues.

LTS_Tom · December 1, 2023, 11:20am

Did you turn off checksum-offload on the virtual xen interfaces per the documentation?

chrizz · December 1, 2023, 1:02pm

Yes, this could be a way to find out for sure. I’ll see if it’s feasible to test this shortly if nothing else comes up.

Like I wrote in my first post, the issue exists when connected to the VLAN via a switch, but not when connected to the host directly (to a free port that I’ve bridged to a VLAN interface in pfSense). Not sure if this proves anything, but I thought it was worth mentioning again.

chrizz · December 1, 2023, 1:03pm

Yes. Checksum offload is turned off on the virtual Xen interfaces. I currently have it turned off from Xen Orchestra. However, I noticed that Xen Orchestra sets ethtool-tx="false" (that’s the current setting I have in other-config), so I’ve also tried ethtool-tx="off", as the documentation recommends.

I’ve also tried this: Common Problems | XCP-ng Documentation

I did have a lot of checksum errors on the PIFs prior to following the recommendations there.

My initial issue still remains though.

One observation I forgot to mention: if I run tcpdump -i -v -nn |grep incorrect on the Debian VM I’m testing with, I get a lot of checksum errors. I run this test on Debian, not on XCP-ng.

chrizz · December 14, 2023, 3:12pm

Today I reinstalled the system with 8.2.1 LTS, as the installation media has recently been updated to support i226 NICs and more (the previous ISO did not install at all on this machine). The installation now works perfectly.

Unfortunately, the issue remains.

chrizz · December 31, 2023, 11:29am

I had some time today to investigate further, so I figured I’d give Proxmox a go. However, before I got as far as installing Proxmox on the CWWK mini PC, I moved back to my previous setup with pfSense running on the old Supermicro server (also XCP-ng).

To my surprise the issue is still there! That means the new mini-server hardware is not at fault at all. The issue probably lies in the switch and how I configured the VLANs in the Unifi controller, or in how this works together with XCP-ng. Here are all the relevant settings I could think of. Does this look correct?

Screenshot 2023-12-31 at 11.14.00

chrizz · January 21, 2024, 7:26am

@LTS_Tom I’ve ruled out XCP-ng as the cause, as the problem also exists with Proxmox.

However, it might be pfSense related, or at least pfSense causes the timeout. I tried changing the state timeout “TCP Opening” from the default 30s to 60s, and the connection drop was delayed to 60s. So this timeout is at the root of the issue. No idea why this happens though. Any clues?

My setup is pretty basic at the moment: one Proxmox host with a pfSense VM and a Debian VM (using two NICs, one internal and one external), a Unifi switch connected to the host via the internal port, and two Unifi APs connected to the switch. That’s it currently.

If I connect to the VLAN 20 WiFi (an Ethernet cable to VLAN 20 does the same thing) and SSH to a VM on the default network, the connection times out after the “TCP Opening” timeout setting.

LTS_Tom · January 21, 2024, 10:54am

Not sure; try running pfSense on real hardware / bare metal and see if the problem persists.

chrizz · January 21, 2024, 11:36am

I’ll try that. Thanks.

chrizz · January 30, 2024, 5:36pm

Fixed! The issue was asymmetric routing. That’s something I had already investigated previously, but apparently not well enough.
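
For readers wondering how asymmetric routing fits the earlier observation that the drop followed the “TCP Opening” timeout, here is a deliberately simplified toy model. It is not pf's actual state machine, and the thread does not say which direction bypassed the firewall. The assumptions are: the firewall only ever sees one direction of the flow, a state that never sees a reply stays in an "opening" phase, and that phase expires a fixed time after the state was created. The class `ToyFirewall`, its parameters, and the 86400s established timeout are all illustrative choices; only the 30s and 60s values come from the thread.

```python
from dataclasses import dataclass


@dataclass
class State:
    initiator: str
    created: float
    last_seen: float
    seen_reply: bool = False


class ToyFirewall:
    """Toy stateful filter: a flow with no observed reply expires after opening_timeout."""

    def __init__(self, opening_timeout=30.0, established_timeout=86400.0):
        self.opening_timeout = opening_timeout
        self.established_timeout = established_timeout
        self.states = {}

    def packet(self, now, src, dst, is_syn=False):
        """Return True if the packet is passed, False if it is dropped."""
        key = frozenset((src, dst))
        state = self.states.get(key)
        if state is not None and self._expired(state, now):
            del self.states[key]
            state = None
        if state is None:
            if not is_syn:
                return False  # no matching state and not a connection opener
            self.states[key] = State(initiator=src, created=now, last_seen=now)
            return True
        if src != state.initiator:
            state.seen_reply = True  # reply seen: treat the flow as established
        state.last_seen = now
        return True

    def _expired(self, state, now):
        if state.seen_reply:
            return now - state.last_seen >= self.established_timeout
        # Assumption chosen to match the observation in the thread: while no
        # reply has been seen, the timer runs from state creation.
        return now - state.created >= self.opening_timeout


if __name__ == "__main__":
    fw = ToyFirewall(opening_timeout=30.0)
    # Asymmetric case: replies take another path, so the firewall never sees them.
    print(fw.packet(0, "client", "server", is_syn=True))
    for t in (10, 29, 31):
        print(t, fw.packet(t, "client", "server"))
```

In this model, raising `opening_timeout` to 60 simply moves the first dropped packet to about 60 seconds, while a flow whose replies do pass through the firewall keeps working.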