Matrix Synapse federated room is killing pfSense WAN
I’m hosting a federated Matrix Synapse instance, and the Synapse Admin room takes down my WAN connection if it hasn’t been visited in roughly 24 hours. No other room does this, but this room is typically full of dead Synapse servers, so it generates a large spike in traffic when my server catches up.
The problem is, I can’t figure out what exactly is breaking. Here’s a breakdown; hopefully I haven’t forgotten anything.
Setup
(2) pfSense VMs on Proxmox - high-availability pair
8 GB memory
i9-11900K
Hardware checksum offloading disabled
WAN: 1 Gbps up/down
Behavior
DNS queries spike to about 160k per minute (see the note after this list)
Utilization on the interface still looks low during a spike (about 500 Kbps) relative to the available bandwidth
DNS resolution to anything outside the LAN fails
LAN continues to function normally
Tracepaths either time out or show extremely high latency reaching the WAN’s gateway, and everything beyond it fails
WAN latency to the gateway climbs sharply
Pings continue to work on LAN and WAN
System resources (CPU, memory, state table, MBUF) are stable - maybe 25% utilization on any particular resource
pfSense graphs show what I’d consider minimal bandwidth usage
No errors in the pfSense logs aside from DNS resolution failures and bandwidth-related entries
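A note on that DNS spike: when Synapse catches up on a big federated room, it re-runs server discovery for every homeserver still named in the room, which (roughly, per the Matrix spec) means a .well-known fetch plus an SRV lookup per server name - so a room full of dead servers turns into a flood of lookups. One discovery round looks something like this (example.org is a placeholder):

    curl -s https://example.org/.well-known/matrix/server
    dig +short SRV _matrix._tcp.example.org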
Troubleshooting
Replaced network cables
Replaced network cards (initially had Realtek, but switched to Intel)
Replaced SFP cables
Implemented a limiter capping the Matrix instance at 25% of available bandwidth
Packet capture (this is not my strong suit at all); the best I was able to determine is that there are a lot of TCP retransmissions, but that seems more like a symptom than a cause (see the filter sketch after this list)
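If anyone wants to poke at a similar capture, a one-liner like this should count the retransmissions and group them by destination - a sketch assuming the capture was saved as capture.pcap:

    tshark -r capture.pcap -Y "tcp.analysis.retransmission" -T fields -e ip.dst | sort | uniq -c | sort -rn | head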
I’m open to any suggestions on other things to look into, because I’ve exhausted my knowledge of things to try.
I don’t know what version you’re running in your environment, but it looks like you aren’t alone. I don’t run this setup myself, but the software seems poorly optimized. This isn’t a pfSense issue.
I should have included that: I’m on the latest version of Synapse. Unfortunately this has been happening since I originally set it up, and while I had some DNS issues to iron out early on, at this point I’ve essentially rebuilt my entire network short of replacing the switches.
Maybe increasing the Firewall Maximum Table Entries by an order of magnitude or two would help?
Another workaround could be tunneling the Matrix traffic over a self-hosted VPN (or just hosting the server in the cloud).
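A minimal WireGuard sketch of what I mean, assuming a cheap VPS as the tunnel endpoint (every name and key below is a placeholder):

    [Interface]
    Address = 10.8.0.2/24
    PrivateKey = <client-private-key>

    [Peer]
    PublicKey = <vps-public-key>
    Endpoint = vps.example.com:51820
    AllowedIPs = 0.0.0.0/0    # or narrow this to just the Matrix traffic
    PersistentKeepalive = 25

The nice side effect is that everything then rides a single long-lived UDP flow to the VPS, no matter how many federation connections Synapse opens.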
Matrix itself is broken beyond repair by design, and Synapse is overcomplex. The only compelling reason to use Matrix is the availability of all sorts of bridges that let you integrate all your chat accounts into one Matrix account. The moment some system with a sane design offers the same capability, I’ll have my Synapse homeserver dismantled before you can blink twice.
So I think the issue I was having is NAT exhaustion on my ISP’s modem. I recently found that my modem’s interface has a section for its NAT table, which shows total sessions (8,192) and sessions in use, among other details. Whenever that Synapse room took my WAN down, the sessions in use were hitting that ceiling.
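The back-of-envelope math makes the failure mode obvious. A rough sketch, assuming a typical ~30-second UDP timeout on the modem’s NAT table (an assumption on my part - check what your modem actually uses) and the 160k-per-minute DNS peak from my graphs:

    # rough sketch; the 30 s UDP timeout is an assumption, not from my modem's docs
    queries_per_minute = 160_000    # peak from the pfSense DNS graphs
    udp_nat_timeout_s = 30          # assumed modem UDP session timeout
    concurrent = queries_per_minute / 60 * udp_nat_timeout_s
    print(f"~{concurrent:,.0f} concurrent sessions vs. an 8,192-entry table")
    # -> ~80,000 concurrent sessions vs. an 8,192-entry table

Even if most of those queries are answered from the resolver’s cache, only a small fraction has to actually leave the WAN before the table fills.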
Since I have an upstream reverse proxy, I’ve lowered the maximum number of connections to the Matrix backend, and so far I haven’t had any WAN drops.
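For reference, the change amounts to something like this - a sketch assuming nginx as the reverse proxy, with the server name, cert paths, and the 64-connection cap all being placeholders to tune:

    upstream synapse_backend {
        # cap the simultaneous connections nginx will hold open to Synapse
        # (max_conns requires nginx >= 1.11.5)
        server 127.0.0.1:8008 max_conns=64;
    }

    server {
        listen 8448 ssl;
        server_name matrix.example.com;
        ssl_certificate     /etc/ssl/certs/matrix.pem;      # placeholder paths
        ssl_certificate_key /etc/ssl/private/matrix.key;

        location / {
            proxy_pass http://synapse_backend;
        }
    }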
If you have a virtual pfSense, what’s the point of running the modem in router mode? Can you switch it to bridge mode and let pfSense establish the PPPoE link?
That would solve the problem of the low NAT connection limit in the modem.