Strange router behavior every time I reboot the main switch

I have a client with a Netgate 6100 and a bunch of UniFi equipment, the main switch being a USW48 POE. Both are on the latest firmware, and this happens every time I update the firmware on the switch or reboot the switch for any reason. Anytime I reboot the USW48POE, everything initially appears to come back up, other switches begin reporting back to the UniFi controller, and then bam - everything goes offline, including the router. I can’t VPN into it, all reporting back to Zabbix and the UniFi controller stops, all internet traffic is dead, and it stays offline until I have someone power cycle the router. I checked the Zabbix logs and the only thing I can see is that, right before everything goes offline, the router’s CPU utilization skyrockets from its usual ~10% to around 280% and the state table jumps from its normal 5k to 7k entries to double that in about 2 minutes. I’ve never seen this behavior at any other site. Does anyone have any idea what’s going on? I’m totally at a loss and this is really annoying my client because every time there’s an update for the switch, which I do after-hours, of course, they have no internet in the morning. Rebooting other switches in the organization, further down the line, doesn’t cause any issue. It’s only the main switch that causes this.

Do you have the proper STP priority set on your switches? Have you looked to see if there are any loops?

As far as I’ve been able to understand RSTP, I’m pretty sure I’ve got them all set up correctly, with the main switch at 4096. They’ve only got 6 other switches (set at 16384 or 32768), all of which are connected directly to the main switch with a home run; none wired to each other. There are no loops and, even if there were, I’ve seen UniFi handle that with RSTP before where it will show a port with a warning about there being a potential loop and having that port temporarily blocking until the issue is resolved. I don’t see that happening here. The router connects to the switch with a single CAT5e cable, so no loop there, either. Once it does this, it never comes back until the router is power cycled.

You might want to look at the logs on pfSense and see if there is something out of place.

Here’s a condensed version of what I see in the pfSense logs after the switch reset:

Nov 10 18:53:30 php-fpm 3981 /rc.linkup: The command ‘/usr/local/sbin/unbound -c /var/unbound/unbound.conf’ returned exit code ‘1’, the output was ‘[1731282810] unbound[13748:0] error: bind: address already in use [1731282810] unbound[13748:0] fatal error: could not open ports’
Nov 10 18:53:30 php-fpm 42671 /rc.linkup: The command ‘/usr/local/sbin/unbound -c /var/unbound/unbound.conf’ returned exit code ‘1’, the output was ‘[1731282810] unbound[17860:0] error: bind: address already in use [1731282810] unbound[17860:0] fatal error: could not open ports’
Nov 10 18:53:30 php-fpm 592 /rc.linkup: The command ‘/usr/local/sbin/unbound -c /var/unbound/unbound.conf’ returned exit code ‘1’, the output was ‘[1731282810] unbound[17563:0] error: bind: address already in use [1731282810] unbound[17563:0] fatal error: could not open ports’
Nov 10 18:53:30 php-fpm 81763 /rc.linkup: The command ‘/usr/local/sbin/unbound -c /var/unbound/unbound.conf’ returned exit code ‘1’, the output was ‘[1731282810] unbound[17186:0] error: bind: address already in use [1731282810] unbound[17186:0] fatal error: could not open ports’

Nov 10 18:55:39 php-fpm 591 /rc.newwanip: Gateway, NONE AVAILABLE

Nov 10 18:55:44 php-fpm 591 /rc.newwanip: Netgate pfSense Plus package system has detected an IP change or dynamic WAN reconnection - 10.16.0.1 → 10.16.0.1 - Restarting packages.

Nov 10 18:55:49 php-fpm 592 /rc.newwanip: Netgate pfSense Plus package system has detected an IP change or dynamic WAN reconnection - 172.16.19.1 → 172.16.19.1 - Restarting packages.

Nov 10 18:55:57 php-fpm 42671 /rc.newwanip: Netgate pfSense Plus package system has detected an IP change or dynamic WAN reconnection - 172.16.3.1 → 172.16.3.1 - Restarting packages.

Nov 10 18:56:00 php-fpm 78433 /rc.newwanip: Netgate pfSense Plus package system has detected an IP change or dynamic WAN reconnection - 172.16.22.1 → 172.16.22.1 - Restarting packages.

Nov 10 18:56:07 php-fpm 5975 /rc.newwanip: Netgate pfSense Plus package system has detected an IP change or dynamic WAN reconnection - 192.168.19.1 → 192.168.19.1 - Restarting packages.

Nov 11 07:06:31 dpinger 37708 WAN_DHCP [public IP redacted]: sendto error: 65

It looks odd to me that it’s trying to associate each of the local VLANs to the WAN, but that could certainly be a problem, unless I’m interpreting the logs incorrectly. If I’m not, why would it try to do that? The fiber ONT doesn’t go offline during this process and is only directly connected to the router, so I don’t get why the router would lose the Gateway connection in the first place.
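One way to sanity-check that reading of the logs is to pull the old and new addresses out of each rc.newwanip entry and compare them. The sketch below is only an illustration: the function name and regex are my own, and the sample lines are copied from the excerpt above. If every pair is identical, that would suggest the script is being re-run for each interface after a link event rather than any address actually changing.

```python
import re

# Captures the "old → new" address pair; accepts "→" or "->" as the arrow.
PAIR = re.compile(
    r"reconnection - (\d{1,3}(?:\.\d{1,3}){3}) (?:→|->) (\d{1,3}(?:\.\d{1,3}){3})"
)


def address_pairs(log_text):
    """Return (old, new, changed) for each rc.newwanip 'IP change' entry."""
    pairs = []
    for line in log_text.splitlines():
        match = PAIR.search(line)
        if match:
            old, new = match.groups()
            pairs.append((old, new, old != new))
    return pairs


# Two entries copied from the excerpt above.
SAMPLE = """\
Nov 10 18:55:44 php-fpm 591 /rc.newwanip: Netgate pfSense Plus package system has detected an IP change or dynamic WAN reconnection - 10.16.0.1 → 10.16.0.1 - Restarting packages.
Nov 10 18:56:07 php-fpm 5975 /rc.newwanip: Netgate pfSense Plus package system has detected an IP change or dynamic WAN reconnection - 192.168.19.1 → 192.168.19.1 - Restarting packages.
"""

if __name__ == "__main__":
    for old, new, changed in address_pairs(SAMPLE):
        print(f"{old} -> {new}: {'changed' if changed else 'unchanged'}")
```

Running the same function over the full system log would show quickly whether any entry ever reports a genuinely different address.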

What is interesting is the unbound errors. It’s like something on your network already has this IP assigned rather than your firewall interface addresses.

Which makes sense, because if the switch reboots and something on your network has a static IP that matches an interface IP on your firewall, then your firewall will flip out that it can’t bind to that IP.

Then it comes back when you reboot the firewall because the interface IP will be first again and will allow it to bind.
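For context, "bind: address already in use" is the error a process gets from the OS when it tries to bind a local address and port that something on the same machine already holds. A minimal sketch of that condition is below, using two UDP sockets on localhost. It illustrates the error class only, not how unbound or rc.linkup work internally, and the function name is hypothetical. Since several rc.linkup processes logged the same failure within the same second, it may be worth checking whether an unbound instance was already running on the firewall at that moment, alongside the duplicate-IP theory.

```python
import errno
import socket


def second_bind_errno():
    """Bind two UDP sockets to the same local address/port; return the second bind's errno."""
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        first.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))  # same address and port again
        except OSError as exc:
            return exc.errno
        return None  # no error: the second bind unexpectedly succeeded
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    code = second_bind_errno()
    print("EADDRINUSE" if code == errno.EADDRINUSE else code)
```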

That makes sense, but I can’t imagine what would have it set static. I set up the network myself quite a while ago, and I wouldn’t have done that. But there have been other vendors for security cams and VoIP, so maybe one of them screwed something up. I’ll have to figure out how to narrow it down. Maybe some sort of ARP scan or something.
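A rough sketch of the ARP idea: take a few `arp -an` snapshots from a host on the affected VLAN and flag any IP that shows up with more than one MAC address. Everything below is illustrative. The function name is hypothetical, the sample snapshots are made up (the MACs are from the documentation range), and the regex assumes the common `? (IP) at MAC ...` output format.

```python
import re
from collections import defaultdict

# Matches "(192.168.19.1) at 00:00:5e:00:53:01" as printed by `arp -an`.
ENTRY = re.compile(
    r"\((\d{1,3}(?:\.\d{1,3}){3})\) at ([0-9a-fA-F]{2}(?::[0-9a-fA-F]{2}){5})"
)


def find_conflicts(snapshots):
    """Return {ip: [macs]} for every IP seen with more than one MAC."""
    seen = defaultdict(set)
    for text in snapshots:
        for ip, mac in ENTRY.findall(text):
            seen[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in seen.items() if len(macs) > 1}


# Hypothetical snapshots taken a few minutes apart (not real data from this site).
SNAP_1 = """\
? (192.168.19.1) at 00:00:5e:00:53:01 on em0 expires in 1200 seconds [ethernet]
? (192.168.19.50) at 00:00:5e:00:53:32 on em0 expires in 900 seconds [ethernet]
"""
SNAP_2 = """\
? (192.168.19.1) at 00:00:5e:00:53:99 on em0 expires in 1190 seconds [ethernet]
? (192.168.19.50) at 00:00:5e:00:53:32 on em0 expires in 600 seconds [ethernet]
"""

if __name__ == "__main__":
    for ip, macs in find_conflicts([SNAP_1, SNAP_2]).items():
        print(f"{ip} answered from multiple MACs: {', '.join(macs)}")
```

If a gateway address like 192.168.19.1 ever maps to a MAC that isn't the firewall's, the vendor prefix of the unexpected MAC is usually a good hint as to which device it is.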

This happens to me as well. I have a slightly different setup than you, but still have a Unifi switch plugged into a 6100. Every time I patch my unifi switch it seems that the 6100 just locks right up - after I power cycle it, it’s fine, however it’s a giant pain.

I haven’t consoled into it to look at the logs while it’s happening, though, to try and narrow down what’s going on - I thought it was just because of something in my setup, but now you’re seeing the same thing with a 6100. Time to do some digging, I guess.

The last line of the logs you provided shows no communication with your WAN gateway (dpinger). As for the other entries, I have seen those errors on other systems after the 24.03 update, if that is what you are on. When the trunked connection on the switch side reboots, pfSense detects this and flaps those interfaces.

I am trying to remember; I think I have seen this before. I think it had something to do with the DHCP settings on the WAN interface and the flushing of states on gateways. What are your settings like in those areas?

Glad I’m not alone. I’ll certainly post anything I figure out which I can confirm resolves it.

Those are the current settings regarding the gateway, and “Reset all states if WAN IP address changes” is unchecked. Should I change the settings to match your example?

Yes, I would give it a shot. I think that is what helped when I was working on something similar in the past. It should also help with the CPU spiking if it resets those states accordingly.

Lastly, look carefully at the WAN DHCP logs and settings to see what pfSense is doing with the ISP connection during these events.

I will give it a shot and report back. I still don’t understand why the router would kill off the WAN connection just because the LAN side was temporarily unreachable during the switch reboot. Doesn’t make sense to me that the WAN side would be affected at all, being that it’s a separate interface. Hopefully this does the trick, anyway.

This is from memory, but I believe what actually happened in my similar scenario was that it did not really kill off the WAN side. I think my situation was an issue where the firewall/router just hung, and only a reboot would bring it back up. At first I had the same thinking as you, because the logs were showing the ISP/WAN connection having no gateway access. It wasn’t until it happened again that I was able to identify the hang. While it was happening, I tried connecting directly to the pfSense device through the serial console, and that would not connect either. Based on what the logs showed right before access was lost, I figured it must be some bug in the way it handles the interface flapping that overloads it. Tweaking the settings I mentioned before, plus some on the ISP/DHCP side, seemed to resolve my issue.

I was also a little worried at first; I thought the hardware was failing. I also noticed this symptom on the 6100 shortly after the latest update, so I was hoping it might have been a bug or a change between versions. It’s been many months since this issue and the device is still working fine, so I am assuming it was something that changed in the update.

Yeah, I’m pretty sure when thinking back, that it just hangs. Even though there are log entries showing it’s still doing something in the background, the entire thing doesn’t respond to anything on the LAN or WAN side. Not sure about the console, as I’m hardly ever on-site there. Hopefully the setting change fixes it. I haven’t tried yet because I’m trying not to annoy the client or alarm them that their network equipment might be failing. I’m on the latest firmware and patches, so that’s not going to fix it, but hopefully the settings will. I might try rebooting the switch over the weekend and I’ll report back here. Thanks for your help so far.

Well, that didn’t work, but this time it hung, not because of a switch reboot, but apparently (maybe) because of a pfBlockerNG update. It does these every night, but for some reason last night it dropped the WAN connection, never to return until someone rebooted the router again this morning. The logs show the pfBlockerNG update and then about 10 minutes later, it decided to flap all the interfaces on the router and then lose the gateway. I don’t understand this.

Nov 23 00:15:49 php-fpm 591 /rc.openvpn: Gateway, NONE AVAILABLE
Nov 23 00:15:47 check_reload_status 645 Reloading filter
Nov 23 00:15:47 check_reload_status 645 Restarting OpenVPN tunnels/interfaces
Nov 23 00:15:47 check_reload_status 645 Restarting IPsec tunnels
Nov 23 00:15:47 check_reload_status 645 updating dyndns WAN_DHCP
Nov 23 00:15:47 rc.gateway_alarm 78434 >>> Gateway alarm: WAN_DHCP (Addr:[redacted] Alarm:1 RTT:.503ms RTTsd:.259ms Loss:21%)
Nov 23 00:14:20 php_pfb 38415 [pfBlockerNG] filterlog daemon started
Nov 23 00:14:20 vnstatd 39132 Monitoring (16): pfsync0 (1000 Mbit) pflog0 (1000 Mbit) ovpns1 (1000 Mbit) ix3 (10000 Mbit) ix2 (1000 Mbit) ix1 (10000 Mbit) ix0 (10000 Mbit) igc3 (10 Mbit) igc2 (10 Mbit) igc1 (10 Mbit) igc0.3 (1000 Mbit) igc0.22 (1000 Mbit) igc0.19 (1000 Mbit) igc0.16 (1000 Mbit) igc0 (1000 Mbit) enc0 (1000 Mbit)
Nov 23 00:14:20 vnstatd 39132 Data retention: 48 5MinuteHours, 4 HourlyDays, 62 DailyDays, 25 MonthlyMonths, -1 YearlyYears, 20 TopDayEntries
Nov 23 00:14:20 vnstatd 39132 vnStat daemon 2.11 (pid:39132 uid:0 gid:0, SQLite 3.44.0)
Nov 23 00:14:20 vnstatd 40025 Error: pidfile /var/run/vnstat/vnstat.pid lock failed (Resource temporarily unavailable), exiting.
Nov 23 00:14:19 tail_pfb 38093 [pfBlockerNG] Firewall Filter Service started
Nov 23 00:14:19 php_pfb 35757 [pfBlockerNG] filterlog daemon stopped
Nov 23 00:14:19 tail_pfb 35180 [pfBlockerNG] Firewall Filter Service stopped
Nov 23 00:14:19 vnstatd 25860 SIGTERM received, exiting.
Nov 23 00:14:09 vnstatd 25374 Error: pidfile /var/run/vnstat/vnstat.pid lock failed (Resource temporarily unavailable), exiting.
Nov 23 00:14:02 php 8806 /usr/local/sbin/acbupload.php: Skipping ACB backup for (system): pfblockerng: saving dnsbl changes.
Nov 23 00:13:51 php-fpm 48849 /rc.start_packages: Beginning configuration backup to https://acb.netgate.com/save
Nov 23 00:13:51 check_reload_status 645 Syncing firewall
Nov 23 00:13:51 php-fpm 48849 /rc.start_packages: Configuration Change: (system): pfBlockerNG: saving DNSBL changes
Nov 23 00:13:50 php-fpm 48849 /rc.start_packages: Restarting/Starting all packages.
Nov 23 00:13:50 snmpd 84413 disk_OS_get_disks: adding device ‘mmcsd0boot0’ to device list
Nov 23 00:13:50 snmpd 84413 disk_OS_get_disks: adding device ‘mmcsd0boot1’ to device list
Nov 23 00:13:49 check_reload_status 645 Reloading filter
Nov 23 00:13:49 check_reload_status 645 Reloading filter
Nov 23 00:13:49 check_reload_status 645 Starting packages
Nov 23 00:13:49 php-fpm 591 /rc.newwanip: Netgate pfSense Plus package system has detected an IP change or dynamic WAN reconnection - 10.16.0.1 → 10.16.0.1 - Restarting packages.
Nov 23 00:13:47 php-fpm 591 /rc.newwanip: Creating rrd update script
Nov 23 00:13:47 php-fpm 591 /rc.newwanip: Resyncing OpenVPN instances for interface PUBLICWIFI.
Nov 23 00:13:39 php-fpm 591 /rc.newwanip: Gateway, NONE AVAILABLE
Nov 23 00:13:24 php-fpm 94925 /rc.newwanip: rc.newwanip: on (IP address: 172.16.22.1) (interface: VOIPPHONES[opt10]) (real interface: igc0.22).
Nov 23 00:13:24 php-fpm 94925 /rc.newwanip: rc.newwanip: Info: starting on igc0.22.
Nov 23 00:13:24 php-fpm 58870 /rc.newwanip: rc.newwanip: on (IP address: 192.168.19.1) (interface: MAINLAN[lan]) (real interface: igc0).
Nov 23 00:13:24 php-fpm 58870 /rc.newwanip: rc.newwanip: Info: starting on igc0.
Nov 23 00:13:24 php-fpm 43191 /rc.newwanip: rc.newwanip: on (IP address: 172.16.19.1) (interface: SURVEILLANCE[opt8]) (real interface: igc0.19).
Nov 23 00:13:24 php-fpm 43191 /rc.newwanip: rc.newwanip: Info: starting on igc0.19.
Nov 23 00:13:24 php-fpm 51831 /rc.newwanip: rc.newwanip: on (IP address: 172.16.3.1) (interface: CREDITCARD[opt9]) (real interface: igc0.3).
Nov 23 00:13:24 php-fpm 51831 /rc.newwanip: rc.newwanip: Info: starting on igc0.3.
Nov 23 00:13:24 php-fpm 591 /rc.newwanip: rc.newwanip: on (IP address: 10.16.0.1) (interface: PUBLICWIFI[opt7]) (real interface: igc0.16).
Nov 23 00:13:24 php-fpm 591 /rc.newwanip: rc.newwanip: Info: starting on igc0.16.
Nov 23 00:13:23 check_reload_status 645 rc.newwanip starting igc0.22
Nov 23 00:13:23 php-fpm 43191 /rc.linkup: HOTPLUG: Triggering address refresh on opt10 (igc0.22)
Nov 23 00:13:23 php-fpm 43191 /rc.linkup: DEVD Ethernet attached event for opt10
Nov 23 00:13:23 check_reload_status 645 rc.newwanip starting igc0
Nov 23 00:13:23 php-fpm 51831 /rc.linkup: HOTPLUG: Triggering address refresh on lan (igc0)
Nov 23 00:13:23 php-fpm 51831 /rc.linkup: DEVD Ethernet attached event for lan
Nov 23 00:13:23 check_reload_status 645 rc.newwanip starting igc0.19
Nov 23 00:13:23 php-fpm 39064 /rc.linkup: HOTPLUG: Triggering address refresh on opt8 (igc0.19)
Nov 23 00:13:23 php-fpm 39064 /rc.linkup: DEVD Ethernet attached event for opt8
Nov 23 00:13:23 check_reload_status 645 rc.newwanip starting igc0.3
Nov 23 00:13:23 php-fpm 592 /rc.linkup: HOTPLUG: Triggering address refresh on opt9 (igc0.3)
Nov 23 00:13:23 php-fpm 592 /rc.linkup: DEVD Ethernet attached event for opt9
Nov 23 00:13:23 check_reload_status 645 Reloading filter
Nov 23 00:13:23 php-fpm 43191 /rc.linkup: Hotplug event detected for VOIPPHONES(opt10) static IP address (4: 172.16.22.1)
Nov 23 00:13:23 check_reload_status 645 Reloading filter
Nov 23 00:13:23 check_reload_status 645 rc.newwanip starting igc0.16
Nov 23 00:13:23 php-fpm 591 /rc.linkup: HOTPLUG: Triggering address refresh on opt7 (igc0.16)
Nov 23 00:13:23 php-fpm 591 /rc.linkup: DEVD Ethernet attached event for opt7
Nov 23 00:13:23 php-fpm 592 /rc.linkup: Hotplug event detected for CREDITCARD(opt9) static IP address (4: 172.16.3.1)
Nov 23 00:13:23 php-fpm 39064 /rc.linkup: Hotplug event detected for SURVEILLANCE(opt8) static IP address (4: 172.16.19.1)
Nov 23 00:13:23 php-fpm 591 /rc.linkup: Hotplug event detected for PUBLICWIFI(opt7) static IP address (4: 10.16.0.1)
Nov 23 00:13:23 php-fpm 51831 /rc.linkup: Hotplug event detected for MAINLAN(lan) static IP address (4: 192.168.19.1)
Nov 23 00:13:22 kernel igc0.22: link state changed to UP
Nov 23 00:13:22 kernel igc0.3: link state changed to UP
Nov 23 00:13:22 kernel igc0.19: link state changed to UP
Nov 23 00:13:22 kernel igc0.16: link state changed to UP
Nov 23 00:13:22 kernel igc0: link state changed to UP
Nov 23 00:13:22 check_reload_status 645 Linkup starting igc0.22
Nov 23 00:13:22 check_reload_status 645 Linkup starting igc0.3
Nov 23 00:13:22 check_reload_status 645 Linkup starting igc0.19
Nov 23 00:13:22 check_reload_status 645 Linkup starting igc0.16
Nov 23 00:13:22 check_reload_status 645 Linkup starting igc0
Nov 23 00:11:53 check_reload_status 645 Reloading filter
Nov 23 00:11:52 php-fpm 43191 /rc.linkup: The command ‘/usr/local/sbin/unbound -c /var/unbound/unbound.conf’ returned exit code ‘1’, the output was ‘[1732338712] unbound[465:0] error: bind: address already in use [1732338712] unbound[465:0] fatal error: could not open ports’
Nov 23 00:11:51 check_reload_status 645 Reloading filter
Nov 23 00:11:51 check_reload_status 645 Reloading filter
Nov 23 00:11:51 php-fpm 592 /rc.linkup: The command ‘/usr/local/sbin/unbound -c /var/unbound/unbound.conf’ returned exit code ‘1’, the output was ‘[1732338711] unbound[91775:0] error: bind: address already in use [1732338711] unbound[91775:0] fatal error: could not open ports’
Nov 23 00:11:51 php-fpm 39064 /rc.linkup: The command ‘/usr/local/sbin/unbound -c /var/unbound/unbound.conf’ returned exit code ‘1’, the output was ‘[1732338711] unbound[90716:0] error: bind: address already in use [1732338711] unbound[90716:0] fatal error: could not open ports’
Nov 23 00:11:50 php-fpm 591 /rc.linkup: The command ‘/usr/local/sbin/unbound -c /var/unbound/unbound.conf’ returned exit code ‘1’, the output was ‘[1732338710] unbound[83036:0] error: bind: address already in use [1732338710] unbound[83036:0] fatal error: could not open ports’
Nov 23 00:11:24 check_reload_status 645 Reloading filter
Nov 23 00:11:24 check_reload_status 645 Reloading filter
Nov 23 00:11:21 php-fpm 58870 /rc.linkup: DEVD Ethernet detached event for lan
Nov 23 00:11:21 php-fpm 43191 /rc.linkup: DEVD Ethernet detached event for opt10
Nov 23 00:11:21 php-fpm 592 /rc.linkup: DEVD Ethernet detached event for opt9
Nov 23 00:11:21 php-fpm 39064 /rc.linkup: DEVD Ethernet detached event for opt8
Nov 23 00:11:21 php-fpm 591 /rc.linkup: DEVD Ethernet detached event for opt7
Nov 23 00:11:21 php-fpm 592 /rc.linkup: Hotplug event detected for CREDITCARD(opt9) static IP address (4: 172.16.3.1)
Nov 23 00:11:21 php-fpm 39064 /rc.linkup: Hotplug event detected for SURVEILLANCE(opt8) static IP address (4: 172.16.19.1)
Nov 23 00:11:21 php-fpm 58870 /rc.linkup: Hotplug event detected for MAINLAN(lan) static IP address (4: 192.168.19.1)
Nov 23 00:11:21 php-fpm 591 /rc.linkup: Hotplug event detected for PUBLICWIFI(opt7) static IP address (4: 10.16.0.1)
Nov 23 00:11:21 php-fpm 43191 /rc.linkup: Hotplug event detected for VOIPPHONES(opt10) static IP address (4: 172.16.22.1)
Nov 23 00:11:20 kernel igc0.22: link state changed to DOWN
Nov 23 00:11:20 kernel igc0.3: link state changed to DOWN
Nov 23 00:11:20 kernel igc0.19: link state changed to DOWN
Nov 23 00:11:20 kernel igc0.16: link state changed to DOWN
Nov 23 00:11:20 kernel igc0: link state changed to DOWN
Nov 23 00:11:20 check_reload_status 645 Linkup starting igc0.22
Nov 23 00:11:20 check_reload_status 645 Linkup starting igc0.3
Nov 23 00:11:20 check_reload_status 645 Linkup starting igc0.19
Nov 23 00:11:20 check_reload_status 645 Linkup starting igc0.16
Nov 23 00:11:20 check_reload_status 645 Linkup starting igc0
Nov 23 00:01:03 php 52448 [pfBlockerNG] No changes to Firewall rules, skipping Filter Reload
Nov 23 00:01:00 php 8507 /usr/local/sbin/acbupload.php: Skipping ACB backup for (system): pfblockerng: saving dnsbl changes.
Nov 23 00:00:09 php 52448 /usr/local/www/pfblockerng/pfblockerng.php: Beginning configuration backup to https://acb.netgate.com/save
Nov 23 00:00:09 check_reload_status 645 Syncing firewall
Nov 23 00:00:09 php 52448 /usr/local/www/pfblockerng/pfblockerng.php: Configuration Change: (system): pfBlockerNG: saving DNSBL changes
Nov 23 00:00:00 php 52448 [pfBlockerNG] Starting cron process.

First, I’m sorry to hear you are still battling this issue. Just to cover all the bases, can you clarify what you mean when you say someone rebooted the router?

  • Is there an actual router in front of the firewall, or by “rebooted the router” do you mean the pfSense firewall?

Another thing that stands out is the timeline: roughly 15 minutes between your pfBlockerNG process and the firewall reporting the WAN going down and staying down seems kind of long to link the two together.
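To put numbers on that timeline, here is a small sketch that computes the gaps between four events taken from the posted log. The event labels and helper names are my own, and the year is an assumption (2024, which matches the epoch stamps in the unbound lines); only the timestamps come from the log itself.

```python
from datetime import datetime

YEAR = 2024  # assumption: consistent with the epoch stamps in the unbound errors

# Timestamps copied from the posted log; labels are informal descriptions.
EVENTS = [
    ("pfBlockerNG cron start", "Nov 23 00:00:00"),
    ("igc0 link DOWN", "Nov 23 00:11:20"),
    ("igc0 link UP", "Nov 23 00:13:22"),
    ("WAN_DHCP gateway alarm", "Nov 23 00:15:47"),
]


def parse(stamp):
    """Parse a syslog-style stamp, supplying the assumed year."""
    return datetime.strptime(f"{YEAR} {stamp}", "%Y %b %d %H:%M:%S")


def gaps(events):
    """Return (from_label, to_label, seconds) for each consecutive pair of events."""
    result = []
    for (name_a, ts_a), (name_b, ts_b) in zip(events, events[1:]):
        delta = parse(ts_b) - parse(ts_a)
        result.append((name_a, name_b, int(delta.total_seconds())))
    return result


if __name__ == "__main__":
    for start, end, seconds in gaps(EVENTS):
        print(f"{start} -> {end}: {seconds} s")
```

On these entries the gaps are 680 s, 122 s, and 145 s: the LAN-side link on igc0 dropped about 11 minutes after the cron job started, stayed down for roughly two minutes, and the gateway alarm followed about two and a half minutes after the link came back.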

Can we determine whether you can actually log into the console when this happens? That should tell us if the firewall hardware is actually hanging or if something else is going on, maybe with the ISP/WAN side, or with the router if there is one in front of it. These are just some thoughts based on your response, but I think they are pretty critical because you need to determine which section of your network is actually acting up. That should help point you in a better direction for solving this issue.

The pfSense is being used as the router & firewall, getting the WAN connection directly from the fiber ONT’s Ethernet port. The ONT is bridged so the pfSense is getting a public IP.

The 15-minute period you reference is a potentially unique situation from that early morning, which I posted in case it might add to the analysis of this issue. The usual sequence is that when the main switch is rebooted, the pfSense loses connectivity shortly after the switch comes back up; I’d say within the first minute or two of the switch coming back online.

I’m rarely on site with this client, so determining if the NG6100 responds via the console port during an event like this will be something I’ll have to schedule for after business hours on a day when I’m already there, which may not occur until sometime in May or June at this point.