None of the Unifi switches priced below $3999 support MLAG, so I’m considering building a legacy-style layer 2 network that uses RSTP for fault tolerance. I think I’m comfortable with how I’d connect all of the switches together, but I’m stuck on how I’d connect an HA pair of UDM Pros to the rest of the network. I know I could connect each UDM to one aggregation switch, but if that aggregation switch went down, the UDMs wouldn’t fail over, and the network would lose internet connectivity and inter-VLAN routing. I’d like to connect each UDM Pro to both agg switches so the network can tolerate the failure of any single piece of equipment, but I’m unsure whether this is possible. Can the UDMs be configured so that both 10G ports are in the same VLAN, so they can be connected to both agg switches with RSTP blocking one of the connections? I don’t currently own a UDM, so I can’t test this myself. I would greatly appreciate any feedback on this design.
Thank you!
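To make the single-failure goal in the question concrete, here is a rough Python sketch that checks physical connectivity only. It says nothing about whether RSTP actually converges. The device names (`udm-a`, `agg-1`, `access-1`, and so on), the three access switches, and the rule that the standby UDM takes over only when the primary itself fails are all assumptions made for illustration, based on the premise in the post.

```python
from collections import deque

def reachable(links, start, down):
    """Return the set of nodes reachable from start, skipping failed nodes."""
    if start in down:
        return set()
    adjacency = {}
    for a, b in links:
        if a in down or b in down:
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

def single_points_of_failure(links, access, primary, standby):
    """List devices whose failure cuts a surviving access switch off from the active gateway."""
    devices = {n for link in links for n in link}
    bad = []
    for failed in sorted(devices):
        # Assumption from the post: the standby takes over only if the primary itself fails.
        active = standby if failed == primary else primary
        seen = reachable(links, active, {failed})
        if any(sw != failed and sw not in seen for sw in access):
            bad.append(failed)
    return bad

# Hypothetical topology: two agg switches, three dual-homed access switches.
ACCESS = ["access-1", "access-2", "access-3"]
CORE = [("agg-1", "agg-2")] + [(sw, agg) for sw in ACCESS for agg in ("agg-1", "agg-2")]

# Each UDM connected to one agg switch versus each UDM connected to both.
single_homed = CORE + [("udm-a", "agg-1"), ("udm-b", "agg-2")]
dual_homed = CORE + [(udm, agg) for udm in ("udm-a", "udm-b") for agg in ("agg-1", "agg-2")]

print(single_points_of_failure(single_homed, ACCESS, "udm-a", "udm-b"))
print(single_points_of_failure(dual_homed, ACCESS, "udm-a", "udm-b"))
```

Under these assumptions the single-homed layout reports `agg-1` as a single point of failure, while the dual-homed layout reports none. Whether the UDMs can actually be cabled and configured that way is the open question.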
You will see some weird network issues. I tried to do the same thing with the 8-port LAGG switches UniFi offers, and I don’t recommend doing it this way. MCLAG is the proper way to handle this. Sometimes doing it right costs money, and this is a case where I would strongly recommend against relying on RSTP.
The main issue with your plan is that the Unifi gateways don’t run RSTP. They also seem to block the RSTP packets from passing between interfaces that are on the same internal virtual switch, so they end up preventing RSTP from disabling one of the ports. I’m not sure what is different about an MC-LAG pair that lets it detect the loop and block ports via STP, as the Unifi documentation on MC-LAG describes for use with a pair of non-EFG gateways.
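A toy model of why that matters, as a sketch only: it treats spanning tree as "block any link that closes a cycle STP can see" and assumes a device that bridges data frames but drops BPDUs is invisible to that process. The node names and the `bpdu_blind` parameter are hypothetical, and the gateway behavior modeled is the observation described above, not documented Unifi behavior.

```python
def find(parent, node):
    """Union-find lookup with path compression."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node

def forwarding_links(links, bpdu_blind):
    """Links left forwarding after a simplified spanning-tree pass.

    Links touching a device in bpdu_blind are never blocked, because no
    BPDU crosses that device to reveal the redundant path.
    """
    nodes = {n for link in links for n in link}
    parent = {n: n for n in nodes}
    forwarding = []
    for a, b in links:
        if a in bpdu_blind or b in bpdu_blind:
            forwarding.append((a, b))  # invisible to STP, stays up
            continue
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b
            forwarding.append((a, b))  # tree link, stays up
        # otherwise the link is redundant and the model blocks it
    return forwarding

def has_loop(links):
    """True if the forwarding topology still contains a cycle."""
    nodes = {n for link in links for n in link}
    parent = {n: n for n in nodes}
    for a, b in links:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a == root_b:
            return True
        parent[root_a] = root_b
    return False

# Hypothetical triangle: one gateway ("gw") wired to both agg switches.
LINKS = [("agg-1", "agg-2"), ("gw", "agg-1"), ("gw", "agg-2")]

print(has_loop(forwarding_links(LINKS, bpdu_blind=set())))
print(has_loop(forwarding_links(LINKS, bpdu_blind={"gw"})))
```

When the gateway takes part in spanning tree, the model blocks one gateway link and no loop remains. When the gateway drops BPDUs, both of its links stay forwarding and the triangle through the two agg switches is still a loop.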
Now just for my understanding: we used to build networks like this, no? I know this configuration can become problematic in larger networks, but what is the problem in a smaller network with maybe three access switches? Is it a Unifi-specific bug or limitation? Can you share the weird issues you saw?
There is too much to unpack with this. The TL;DR: the devices you need HA on need to support a load-balancing mode that is not LACP. @jmarmorato nailed it: RSTP will not give you proper failover, and that is where the weirdness comes in. If you reboot a switch or any other network device in your stack, all of them start going wonky while they try to decide where to send traffic. Then your backend storage stops working properly, and the network to your VMs also has issues.