Help with routing between vlans - asymmetric routing problem

kevdog · May 19, 2024, 11:15pm

SUMMARY OF THE PROBLEM FOR THOSE WHO WANT TO SKIP DETAILS

I’m really new to networking, and I’m having an asymmetric routing issue that I can’t quite figure out. I identified it as asymmetric routing by consulting the pfSense documentation. In the pfSense firewall logs I see packets passed with TCP:S, but I also get a lot of blocked packets with TCP:PA and TCP:A. In all honesty I don’t know a ton about the TCP handshake, but from my reading on asymmetric routing it appears that the SYN is travelling over one route and the later PA/A packets over a different route. To confirm this was indeed the issue, I implemented the “Manual Fix” documented on the pfSense documentation page “Troubleshooting Asymmetric Routing”, and this resolved my issue.

So I guess the purpose of this post is actually to kind of figure out what the heck is going on since I’m not trying to asymmetrically route. My test setup is the following:

HARDWARE SETUP

Virtualized pfSense (within xcp-ng)
Virtualized arch linux (within xcp-ng)
MacBook connected wirelessly through a Unifi AC6Pro access point and a Unifi 8-port 150W switch. I’m self-hosting the controller. The Unifi switch ports are set up as default trunk ports with the default VLAN untagged and VLAN 40 tagged. The access point is set up to broadcast a default network and a VLAN 40 network.

NETWORK SETUP
The LAN network configuration consists of untagged network (10.1.0.0/23) and tagged VLAN 40 (10.1.40.0/24).

Per the official xcp-ng documentation on VLANs in virtualized router environments (VLAN Trunking in a VM | XCP-ng Documentation), I am using the multiple-VIF method, as this is the method that is officially supported by xcp-ng.

pfSense - I’m presenting pfSense with two VIFs: an untagged VIF and a VIF tagged for VLAN 40. All seems well, in that pfSense detects both VIFs and assigns each a network (10.1.0.0/23 for the untagged network and 10.1.40.0/24 for the tagged network).

Arch Linux VM - I’ve created an Arch Linux VM within xcp-ng and presented it with two virtual interfaces as well. Using systemd-networkd, I’ve configured each interface to obtain an IP address using DHCP. The network cards receive IP addresses 10.1.1.200 and 10.1.40.200. Because both interfaces are configured by DHCP, the routing table has two default routes, which may be the source of the problem. More below.

My MacBook Pro is connected wirelessly to the untagged wireless network and is assigned an IP address of 10.1.1.11.

WHEN THE ERROR OCCURS

I can ssh into the Arch Linux installation from my MacBook via two methods (over port 22):
MacBook (10.1.1.11 ----> ssh ----> 10.1.1.200) and (10.1.1.11 ----> ssh ----> 10.1.40.200). The ssh connection on the untagged network works without issues, since no routing is involved when both machines are on the same network segment. In the second scenario (where routing has to occur from the 10.1.0.0/23 network to the 10.1.40.0/24 network), I can initially ssh into the machine, but after maybe 10-30 seconds the connection is broken. pfSense shows the following in its firewall logs (snippet):

Default allow LAN to any rule (100000101)	  10.1.1.11:60991	  10.1.40.200:22	TCP:S
...
...
Default deny rule IPv4 (1000000103)|  10.1.1.11:60742|  10.1.40.200:22|TCP:PA
Default deny rule IPv4 (1000000103)|  10.1.1.11:60742|  10.1.40.200:22|TCP:A
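
To make the log pattern easier to reason about, here is a toy model of a stateful firewall that only ever sees one direction of a connection. It is a sketch under loud assumptions: `ToyFirewall`, `OPENING_TIMEOUT`, and the rule "only a bare SYN may create state" are all hypothetical simplifications, not pfSense's actual implementation or its real timeout values.

```python
from dataclasses import dataclass, field

# Hypothetical placeholder: how long a half-open state survives in this toy model.
# This is NOT a value read from pfSense.
OPENING_TIMEOUT = 30.0


@dataclass
class ToyFirewall:
    # Maps a (src, dst) flow key to [established flag, expiry time].
    states: dict = field(default_factory=dict)

    def handle(self, now, src, dst, flags):
        """Return "pass" or "block" for one packet seen at time `now`."""
        key, reverse = (src, dst), (dst, src)
        # Forget any state whose expiry time has passed.
        self.states = {k: v for k, v in self.states.items() if v[1] > now}
        if reverse in self.states:
            # Reply traffic was seen: mark the connection established.
            # Simplification: established states never expire in this model.
            self.states[reverse] = [True, float("inf")]
            return "pass"
        if key in self.states:
            return "pass"
        if flags == "S":
            # Assumption: only a bare SYN is allowed to create a new state.
            self.states[key] = [False, now + OPENING_TIMEOUT]
            return "pass"
        # No state and not a SYN: falls through to the default deny rule.
        return "block"


mac, vm = "10.1.1.11:60991", "10.1.40.200:22"

# Asymmetric case: the firewall never sees the VM's replies.
fw = ToyFirewall()
print(fw.handle(0, mac, vm, "S"))    # SYN creates a half-open state
print(fw.handle(5, mac, vm, "A"))    # still inside the timeout window
print(fw.handle(40, mac, vm, "PA"))  # half-open state has expired
```

In this model the asymmetric case passes the SYN, keeps working for a short while, and then starts blocking ACK and PSH/ACK packets once the half-open state expires. That is at least consistent with a connection that works for 10-30 seconds and then drops, but the real cause should be confirmed against the pfSense state table rather than inferred from this sketch.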

PROBABLE SOURCE OF THE ERROR
I believe the ultimate source of the error likely is the routing table created within arch linux.

default via 10.1.40.1 dev eth1.40 proto dhcp src 10.1.40.200 metric 1000
default via 10.1.0.1 dev eth1.0 proto dhcp src 10.1.0.200 metric 1000
10.1.0.0/23 dev eth1.0 proto kernel scope link src 10.1.0.200 metric 1000
10.1.0.1 dev eth1.0 proto dhcp scope link src 10.1.0.200 metric 1000
10.1.40.0/24 dev eth1.40 proto kernel scope link src 10.1.40.200 metric 1000
10.1.40.1 dev eth1.40 proto dhcp scope link src 10.1.40.200 metric 1000

Here is my network configuration for any of those that are interested:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth1.40: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether d2:14:29:b2:d8:05 brd ff:ff:ff:ff:ff:ff
    inet 10.1.40.200/24 metric 1000 brd 10.1.40.255 scope global dynamic eth1.40
       valid_lft 6690sec preferred_lft 6690sec
3: eth1.0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 62:d0:5d:0c:3f:24 brd ff:ff:ff:ff:ff:ff
    inet 10.1.0.200/23 metric 1000 brd 10.1.1.255 scope global dynamic eth1.0
       valid_lft 2496sec preferred_lft 2496sec

The routing table shows two default gateways with the same metric, which from my understanding is not recommended. I don’t know specifically how to test this assumption, but I believe the TCP:S handshake may be established over one of the network interfaces (or one route), while later packets are sent back to the router through the other network card or the other route.
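
One way to test the assumption on paper is to replay the route lookup for a reply addressed to the MacBook. The sketch below is a simplified longest-prefix-match model, assuming no policy routing rules are configured, so the lookup depends only on the destination address. The `ROUTES` list is copied from the routing table above; the `lookup` helper is hypothetical and ignores metrics and source addresses.

```python
import ipaddress

# Routes copied from the `ip route` output in this post: (prefix, device, gateway).
# A gateway of None means the destination is reached directly on that link.
ROUTES = [
    ("0.0.0.0/0", "eth1.40", "10.1.40.1"),
    ("0.0.0.0/0", "eth1.0", "10.1.0.1"),
    ("10.1.0.0/23", "eth1.0", None),
    ("10.1.0.1/32", "eth1.0", None),
    ("10.1.40.0/24", "eth1.40", None),
    ("10.1.40.1/32", "eth1.40", None),
]


def lookup(dst):
    """Return the most specific route whose prefix contains `dst`."""
    addr = ipaddress.ip_address(dst)
    matches = [r for r in ROUTES if addr in ipaddress.ip_network(r[0])]
    return max(matches, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)


# Where does a reply to the MacBook go, whichever local address it is sent from?
prefix, dev, gateway = lookup("10.1.1.11")
print(prefix, dev, gateway)
```

Under this model the reply to 10.1.1.11 matches the connected 10.1.0.0/23 route and leaves on eth1.0 with no gateway, so neither default route is consulted. If the real kernel behaves the same way here, replies from 10.1.40.200 would reach the MacBook directly on the untagged network and bypass pfSense, which would fit the one-sided traffic in the firewall logs. Running `ip route get 10.1.1.11 from 10.1.40.200` on the VM should show what the kernel actually chooses.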

Perhaps I’m attempting to do something here that I shouldn’t, which is to access the Linux VM over two separate networks and expect replies to route back through the correct network. Any insights someone may have on the issue would be great, since I’m clearly not a networking guru.