Having issues with Layer 3 Routing on UniFi Switch

I have a network configuration that ran fine for years using a Cisco SG350 (and before that, an SG300) as the core switch. The core switch is responsible for Layer 3 routing between some of my VLANs (6, 7, and 31 in my diagram); my router sits on the other side of it and only sees a single VLAN (4040 now, due to UniFi requirements). I originally implemented this because when I tried having the router (at the time pfSense, now OPNsense) manage all of the VLAN routing, I was unable to saturate a 1Gbps link when traffic was transiting between two VLANs.

More recently, I “upgraded” my network to a UniFi Pro Max 24 PoE as the core switch. I already had UniFi access points and some auxiliary UniFi switches (an Enterprise 8 PoE on my desk and a Mini behind my TV), so I wanted to consolidate the rest of my network into a single UI for management. Here’s a simplified view of what my network currently looks like (hopefully you can read my handwriting). I am running a self-hosted controller on an XCP-NG VM, which has behaved quite well in general.

Ever since I installed the Pro Max 24 PoE, I have had sporadic performance issues with any traffic that needs to transit the Layer 3 routes in my core switch. For management isolation reasons (which are probably overkill, but oh well), I do have some non-routed VLANs that my Mac Mini shares with the switches; they are not accessible unless you are connected directly to a port that carries that VLAN. When I perform a speed test across that VLAN to the Plex server or UniFi controller, it consistently saturates the slowest link (usually 1Gbps) in the path. When I run the same test, across the same physical wires, but via a route that goes through the core switch’s Layer 3 routing (so VLAN 6 to VLAN 31), I get absurdly slow speeds, usually <1Mbps. This isn’t completely consistent. I also have issues with websites on the internet, though sometimes those issues appear to affect one device and not others (most often WiFi but not wired).
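In case anyone wants to reproduce the comparison, the test I’m describing is nothing more exotic than a raw TCP throughput check between two hosts. Here’s a minimal sketch using only Python’s standard library (iperf3 does the same job better, this is just to show the idea); the port is arbitrary, and you run the server on one end (e.g. the Plex box) and the client on the other (e.g. the Mac Mini), once over the non-routed VLAN and once over the routed VLAN 6 to VLAN 31 path.

```python
# Minimal TCP throughput test (sketch). Run the server on one host and the
# client on another, once over the flat/non-routed VLAN and once over the
# routed path (VLAN 6 <-> VLAN 31). Port and buffer sizes are arbitrary.
import socket
import sys
import time

CHUNK = 64 * 1024      # 64 KiB per send/recv
DURATION = 10          # seconds the client transmits for
PORT = 5201            # arbitrary; happens to match iperf3's default

def server(bind_ip="0.0.0.0"):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((bind_ip, PORT))
    srv.listen(1)
    conn, addr = srv.accept()
    total, start = 0, time.time()
    while True:
        data = conn.recv(CHUNK)
        if not data:
            break
        total += len(data)
    elapsed = time.time() - start
    conn.close()
    srv.close()
    print(f"received {total * 8 / elapsed / 1e6:.1f} Mbit/s from {addr[0]}")

def client(server_ip):
    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    cli.connect((server_ip, PORT))
    payload = b"\x00" * CHUNK
    total, start = 0, time.time()
    while time.time() - start < DURATION:
        cli.sendall(payload)
        total += len(payload)
    elapsed = time.time() - start
    cli.close()
    print(f"sent {total * 8 / elapsed / 1e6:.1f} Mbit/s to {server_ip}")

if __name__ == "__main__":
    # usage: python3 tput.py server
    #        python3 tput.py client <server-ip>
    if sys.argv[1] == "server":
        server()
    else:
        client(sys.argv[2])
```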

Other troubleshooting I have done includes looking at the UniFi Controller logs, but those don’t seem to indicate any problems. I’ve also run Wireshark on a computer while some of its traffic was being slow and saw what appears to be a rather high number of TCP Dup ACK and TCP Retransmission packets, plus the occasional other TCP error.
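For anyone who wants to do the same kind of check without clicking through Wireshark, here’s a rough sketch of how the retransmissions can be counted from a saved capture with scapy. It only approximates Wireshark’s tcp.analysis.retransmission heuristic (it flags any data segment whose sequence number has already been seen in the same flow), and the capture filename below is made up.

```python
# Rough retransmission counter for a saved capture (sketch).
# Flags any data-bearing segment whose (flow, seq) has already been seen,
# which approximates Wireshark's "tcp.analysis.retransmission".
# Requires scapy (pip install scapy); the capture filename is made up.
from scapy.all import rdpcap, IP, TCP

packets = rdpcap("slow-vlan6-to-vlan31.pcap")

seen = set()
retrans = 0
total_data = 0

for pkt in packets:
    if not (IP in pkt and TCP in pkt):
        continue
    if len(bytes(pkt[TCP].payload)) == 0:
        continue  # skip pure ACKs
    total_data += 1
    flow = (pkt[IP].src, pkt[IP].dst, pkt[TCP].sport, pkt[TCP].dport)
    key = (flow, pkt[TCP].seq)
    if key in seen:
        retrans += 1
    else:
        seen.add(key)

print(f"{retrans} apparent retransmissions out of {total_data} data segments")
```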

I don’t think I have a topology that is particularly unusual, and it performed very well when I was using a Cisco switch, which makes me think it has to be related to the UniFi switch itself. Searching online has mostly turned up complaints about Layer 3 features, not performance, so I can’t tell if this is a common thing or not. It seems like if it were, Layer 3 would be completely unusable, and I do see reports of other, more specific issues, so people are presumably using it?

Does anyone have any ideas of what might be happening here? Is this really just my config somehow, or is UniFi hardware really that incapable of performing this function? I wouldn’t expect to need an Enterprise switch to do this when the Pro Max also advertises Layer 3 capability…

I suggest using draw.io to produce a diagram; this is difficult for me to read.

Ok, here’s a quick attempt at creating this in draw.io (it wasn’t loading for me earlier).

A post elsewhere suggested turning on flow control since I have ports operating at different speeds. I have done so and it appears to have improved a couple of clients, but others are still experiencing significant trouble. Monitoring both ends of the traffic between two nodes on different VLANs (Mac Mini to Plex), I can clearly see where packets were dropped. The Mac Mini performs retransmits, and the other end only sees a single packet arrive. It does look like the packet is finally received after the 3rd retransmit, but that created over 1 second of delay in that simple exchange. It’s difficult to compare every packet exchange, but that behavior appears multiple times in the short (<20 second) capture that I have.
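If anyone wants to automate that two-sided comparison, one way (a sketch, not what I actually ran; the capture file names are made up and both captures are assumed to cover the same window) is to take the data segments seen on the sending side, match them against the receiving side by flow and sequence number, and list the ones that never showed up:

```python
# Sketch: find TCP data segments captured on the sender (Mac Mini) that
# never appear in the capture taken on the receiver (Plex box).
# Matching is by (src, dst, sport, dport, seq); file names are made up and
# both captures are assumed to cover the same time window.
from scapy.all import rdpcap, IP, TCP

def data_segments(path):
    keys = set()
    for pkt in rdpcap(path):
        if IP in pkt and TCP in pkt and len(bytes(pkt[TCP].payload)) > 0:
            keys.add((pkt[IP].src, pkt[IP].dst,
                      pkt[TCP].sport, pkt[TCP].dport, pkt[TCP].seq))
    return keys

sent = data_segments("macmini-side.pcap")
received = data_segments("plex-side.pcap")

missing = sent - received
print(f"{len(missing)} of {len(sent)} data segments never arrived, e.g.:")
for key in list(missing)[:10]:
    print("  ", key)
```

Segments that were retransmitted but eventually delivered won’t show up as missing here, so what’s left really is the traffic that disappeared somewhere between the two capture points.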

I guess I’m just amazed that the layer 3 behaviors of these devices can be this bad, given that I see information about them being used in large deployments. My understanding is that offloading layer 3 routing of VLANs to a switch is a common topology, so I would have expected that the basic aspects of forwarding packets between VLANs (which I think just requires updating the MAC address) would work in a reliable fashion, even if more advanced features like port-specific ACLs and BGP aren’t implemented.

One other interesting quirk. Both the Pro Max PoE and the Enterprise 8 PoE are currently set up to route specific networks (non-conflicting networks that are not meant to be able to communicate with each other). The route from the OPNsense router to the Enterprise 8 PoE goes through the Pro Max PoE first, but according to the gateway monitoring in OPNsense, the Enterprise 8 PoE consistently has a faster ping response time.

SWCORE_GW is the Pro Max 24 PoE
The other one is the Enterprise 8 PoE (name redacted because it wasn’t generic)

Maybe this is just a CPU capability thing, but it does seem unexpected to me.

Hey @Andne - did you get any further with this?

I’m in a very similar situation - my UniFi upgrade plans for a (voluntarily maintained, but very much) production network got accelerated at short notice when the L3 3000 series Catalyst switch unexpectedly died. We still have a Cisco ISR router handling NAT and a couple of DMZ subnets (+ VPNs etc).

I too replaced the L3/core switching with UniFi - also a Pro Max (a 48 in my case). There’s a separate Pro 24 hung off of it for additional access.

Same as you (I think), I had long since migrated the WiFi to UniFi - which has been rock solid.

The symptoms I’m experiencing are sometimes similar - terrible (and I mean truly terrible) throughput, what feels like dropped packets, etc. But it’s seemingly random - I can’t find any pattern at all. It can be fine for a few days, or it can fail within 10 minutes of recovery-by-reboot.

The Cisco ISR is doing nothing spectacular or novel. There’s a static route to 10.255.253.2 via VLAN 4040.

Seemingly related to my issue, the Pro Max appears to “forget” its 10.255.253.2 address … reloading the Cisco config (oddly - but I have an unproven suspicion as to why - see below) seems to fix it, at least for a while, as does rebooting the Pro Max itself.

I have an unproven suspicion that the Pro Max is “losing sight” (for some definition of that phrase) of the Cisco ISR on 10.255.253.1. Pinging 10.255.253.1 from the UniFi switch seems to wake everything back up.

Even more oddly, when the L3 switching grinds to a terrible rate (as you describe - or goes as far as dropping out almost entirely…), some subnets remain routable via the UniFi Pro Max. Again, I can’t find any rhyme or reason to it. Additionally, if I try an ICMP ping from something on the OPNsense side of your example network (e.g. 10.7.90.2, as a made-up example), I can see it reach the “inside” of the UniFi Pro Max OK … but the replies get lost somewhere in transit.

If it falls over again when I’m available to try sniffing all the traffic, I’ll see if I can capture where it’s going wrong.

I’m ripping my hair out here. The Cisco kit, despite being prehistoric in IT terms … and second/third hand at that, has been absolutely rock solid. The L3 switching on the Pro Max has felt like nothing but an unreliable and forsaken nuisance from day 1.

I struggle to believe Ubiquiti would ship something either “so untested” or “knowingly broken” - but I’m at a total loss as to what I’m doing wrong.

Any suggestions would be welcome … I will happily post and dump any logs etc. Are you getting the same experience that “everything is fine” for a period of time?

The only other “thread of suspicion” I have is that the management IP of the UniFi switch (the core/L3 switch specifically) is currently set to an IP and subnet on the “Cisco side” of the network … exactly as (I think) you’ve got yours. It’s an unproven stretch of a hypothesis - but I’m wondering if the UniFi switch is suddenly deciding to route 0.0.0.0 via that instead, and my ISR (or your OPNsense) ends up mishandling the responses.

I’m still working through this. I have an open ticket with Ubiquiti, and they’ve recommended factory resetting the switch and readopting it. However, that’s about as messy as one would expect, so I need to make specific plans. I also haven’t yet had the chance to set up my network so I can reach the controller UI without that switch in place (which is probably a capability I should have anyway, now that I think about it), so I haven’t been able to test whether it helps.

All of that said, I think I’m still experiencing behavior similar to yours: I get ~50 hours of uptime on the switch before it starts to degrade, then several more hours before the degradation becomes significant enough to be a problem. I have noticed some messages in the switch’s kernel output (dmesg) that appear to start around the time the degradation starts (though it’s difficult to correlate that directly). The switch log is receiving lots of “unknown rt error code f028” messages while it’s in a degraded state. This is what Ubiquiti is pointing to as the reason I need to reset and readopt the switch.
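In case anyone else wants to watch for the same messages, one way to keep an eye on it (a sketch, not exactly what I’m running) is to SSH into the switch and grep dmesg periodically, so the first appearance of those lines can be lined up against uptime. The IP, credentials and interval below are placeholders, and device SSH has to be enabled in the controller for this to work at all.

```python
# Sketch: poll the switch's dmesg over SSH and log when the
# "unknown rt error code ..." messages start appearing, so their onset can
# be correlated with uptime / degradation. Hostname and credentials are
# placeholders; device SSH must be enabled in the controller.
# Requires paramiko (pip install paramiko).
import time
import paramiko

SWITCH = "10.255.255.2"              # placeholder management IP
USER, PASSWORD = "admin", "device-ssh-password"   # placeholders
INTERVAL = 300                       # seconds between polls

def count_rt_errors():
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(SWITCH, username=USER, password=PASSWORD, timeout=10)
    try:
        _, stdout, _ = client.exec_command("dmesg")
        output = stdout.read().decode(errors="replace")
    finally:
        client.close()
    return sum(1 for line in output.splitlines() if "rt error" in line)

last = 0
while True:
    try:
        count = count_rt_errors()
        if count != last:
            print(f"{time.ctime()}: rt error lines in dmesg: {count}")
            last = count
    except Exception as exc:
        print(f"{time.ctime()}: poll failed: {exc}")
    time.sleep(INTERVAL)
```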

I have noticed that during a reboot, the switch responds with the wrong MAC address for 10.255.255.x. During normal operation, the MAC address for that IP (as seen by my router) ends in :d6. The management IP has a MAC address ending in :d4. During a boot cycle of the switch, the router picks up the :d4 MAC address in its ARP table and I lose all internet connectivity until that entry expires and is updated correctly again. This may explain your “fails within 10 minutes” issue? I’ve otherwise not experienced anything like that.
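If you want to catch that flip on your side, watching the ARP table of the router (or any host on the same segment) for the entry changing is enough. A rough sketch is below; the IP is a placeholder, and it assumes `arp -an` is available (FreeBSD/OPNsense, or Linux with net-tools installed).

```python
# Sketch: watch the local ARP table and log when the MAC for the switch's
# IP changes (e.g. flipping from the :d6 address to the :d4 management MAC
# during a reboot). The target IP is a placeholder; assumes `arp -an`
# exists (FreeBSD/OPNsense, or Linux with net-tools).
import re
import subprocess
import time

TARGET_IP = "10.255.255.2"   # placeholder: the switch IP the router ARPs for
INTERVAL = 5                 # seconds between polls

def current_mac(ip):
    out = subprocess.run(["arp", "-an"], capture_output=True, text=True).stdout
    for line in out.splitlines():
        if f"({ip})" in line:
            m = re.search(r"([0-9a-fA-F]{1,2}(?::[0-9a-fA-F]{1,2}){5})", line)
            if m:
                return m.group(1).lower()
    return None

last = None
while True:
    mac = current_mac(TARGET_IP)
    if mac != last:
        print(f"{time.ctime()}: ARP entry for {TARGET_IP} is now {mac}")
        last = mac
    time.sleep(INTERVAL)
```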

Thank-you. Yeah, factory resetting and re-adopting doesn’t feel entirely sane to me … at least without an explanation of what’s wrong and why a factory reset will fix it (and prevent a recurrence).

I had wondered about an ARP table issue too - I’m curious about your observation of the hardware address(es). If it does repeat for me, I’ll see if I can rig a way of sniffing the traffic.

Whether it’s merely “too soon to tell” or not - I have set up a cron job pinging the Cisco’s 10.255.253.1 IP 3 times every minute from “inside” the network (as in, from a VM connected to the problematic Pro Max switch). So far I’ve had 2.5 days of uptime without issue. It’s a terrible attempt at a band-aid for a terrible bug - my hypothesis is that the continual traffic may keep the ARP table “fresh”, avoiding the issue.
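For what it’s worth, the cron job is nothing clever; the same band-aid as a small standalone loop (if you’d rather not touch crontab) would look roughly like this - the address is the Cisco’s inside IP in my network, so adjust for yours:

```python
# Sketch: the same "keep the ARP entry warm" band-aid as the cron job,
# as a standalone loop. Pings the Cisco's inside IP three times a minute;
# the address is specific to my network.
import subprocess
import time

GATEWAY = "10.255.253.1"   # the Cisco ISR's inside address in my setup

while True:
    # -c 3: three echo requests, mirroring the cron job
    subprocess.run(["ping", "-c", "3", GATEWAY],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    time.sleep(60)
```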

I’ll take a look at the switch’s dmesg log as well, if it reoccurs.

Thank-you for replying as well. It helps not feeling like I’m completely on my own, if nothing else!

Well, I performed the reset. Since I used the physical reset switch, my controller just recognized it and adopted it without much intervention (beyond messing with port VLANs so it could be discovered while not configured). So now I’m waiting to see if the issue starts again. I have asked Ubiquiti to share some more details about what those errors mean and which component they would be coming from, but I don’t know if they will or not.

I don’t think the ARP table is directly related to the connectivity issue. I actually still get pretty good internet performance on certain devices (which the ARP table would impact) while, at the same time, those same devices struggle to talk to things in another VLAN. Meanwhile, other devices will barely be able to do anything - either internet or local. That said, it may be another symptom of the same underlying problem, or the fact that it exists at all may indicate a configuration issue that is leading to the problem.

I forgot to mention last time: I also have an Enterprise 8 PoE that, as best I can tell, is running the same version of software, and it does not exhibit this MAC address issue. It’s still configured for L3 routing because it was my test bed for how such a configuration works before I bought the Pro Max 24 PoE. Guess I should have stress tested that more before I fully converted.

Boom. See screenshot.

May I ask if UniFi have given any reason for hope since your factory reset and your conversations with them?

Possible progress - I guess only time will tell. UniFi have taken a look at my logs and support file etc - and have instructed me to raise an RMA. The engineer suspects it’s a hardware fault based on what they’ve seen(?).

On the one hand, I’m suspicious that we’ve both had the same error and similar symptoms with similar topologies - on the other hand, it would be a reasonable explanation for why the only references to this error on the internet appear to be the two of us - if it is indeed a hardware fault!

shrug. I can’t fault the support experience at least … they were quick and thorough - I just hope it does prove a hardware fault and I don’t just get this again with the replacement! Watch this space…

No, it’s back in the bad state again. I hadn’t been around much until today, so hadn’t noticed anything, but it’s definitely having the same issues again (one system trying to pull from the web was getting about 1Mbps download, while another one on a different VLAN saw its usual ~30Mbps download of the same file, limited by the source server I think). I checked the switch, and those same errors are occurring again.

RMA might help? But I suspect the device is fine hardware-wise and there’s a firmware issue with not handling something from the ASIC correctly. My Enterprise 8 PoE has been doing similar work for yet another VLAN (yes, I need to simplify parts of my network), though at much lower bandwidth, today and on other days, and it has been running for months with no issues, so dunno…

I’m really hoping I don’t have to send the current switch in first for the RMA; losing it completely would be extremely disruptive to my network. Maybe I should first buy an Enterprise 24 PoE and see if there really is just something different between the two…

Yeah, I too am half skeptical. There is, possibly coincidentally, an error caught from the switch in our support file - which shows a paging fault that caused a kernel panic … so maybe I’ve got two problems - only one of which is physical! Who knows.

Just very tired of it all atm, if I’m honest - a lot of late nights lost to it. If the replacement arrives and proves to solve the problem, I will let you know.

They found an issue in my latest support file and have instructed me to RMA my switch as well. As long as I can get the shipping done in the right order (send me new switch, then I’ll send back old switch) this isn’t the worst next step. If I have to send back in the bad switch first, that’s a whole different mess and I’ll need to think hard on how I want to move forward (and likely on the long-term viability of Ubiquiti hardware in my network).

I contacted UniFi by emailing rma@ … and explained my situation (that it is a mission critical switch, and I can’t simply send it off without a replacement already in hand).

Within 48 hours they acknowledged my email and have agreed without hesitation to send me one ahead. Only it’s currently on back order - doh!

But, otherwise, I can’t complain (given I’m not paying for any sort of support with them beyond the capital cost). YMMV - but I’d say it’s worth trying a polite but firm email, pointing out it’s core to your network - and you’re not “just some home user”.

Out of curiosity - do you know what they found? I did some digging in mine - there seems to have been a kernel panic relating to virtual memory paging … so I’m guessing it’s faulty memory(?)