pfSense issues when circuit is bouncing

Hello, so now that everyone is home from school and working from home, Comcast is having some real capacity issues in my area. When that happens it takes down my WAN link, and DNS traffic seems to bind up. I restart the unbound and DHCP services, but I cannot get traffic to resolve by FQDN/URL until I give the firewall a full reboot.

I have also cleared the ARP table and reset the firewall states.
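For reference, here is a sketch of those same recovery steps from the pfSense console or SSH, which is less invasive than a full reboot. The `pfSsh.php playback svc` restarts are my assumption of the cleanest way to bounce the services from the shell; the GUI Status -> Services page does the same thing.

```shell
# Flush all firewall states (same as Diagnostics -> States -> Reset States):
pfctl -F states

# Clear the ARP table:
arp -da

# Restart the resolver and DHCP services (assumes the svc playback script,
# present on recent pfSense releases):
pfSsh.php playback svc restart unbound
pfSsh.php playback svc restart dhcpd
```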

The DNS logs just show that it cannot resolve some of the URLs for my filters. A packet capture on the LAN side shows DNS requests hitting the gateway IP, but the WAN side shows no outbound DNS attempts at all. I ran the capture in full mode for about two minutes and tried to ping cnn.com. It looks like the DNS service hangs when a bounce happens.
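A quick way to reproduce that capture from the shell is to watch port 53 on both sides at once. The interface names below (igb1 for LAN, igb0 for WAN) are placeholders; check your actual assignments with `ifconfig` or under Interfaces -> Assignments.

```shell
# LAN side: client queries should appear here regardless of WAN state
tcpdump -ni igb1 -c 50 port 53

# WAN side: if nothing shows up here while LAN queries are arriving,
# unbound is accepting queries but not recursing/forwarding out
tcpdump -ni igb0 -c 50 port 53
```

LAN traffic with a silent WAN, as described above, points at the resolver process on the firewall rather than the upstream path.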

I can ping by IP, but DNS seems broken. I have also tried overriding the DNS server IPs, among other things.

Does anyone know a less invasive way to get traffic back up?

Some traffic recovers on its own, like the Comcast app on my smart TV…

Try changing the monitor IP under System -> Routing -> Gateways to something like 1.1.1.1, then check the gateway logs for the drops to see if that is where the issue is. You can also reset the state tables under Diagnostics -> States -> Reset States, which clears out states that may be stale due to dropped connections. They do time out on their own eventually.
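If you want to watch this happen live from the shell, something like the following works. The log path is how recent pfSense releases store it; older releases wrap the logs with clog, so adjust accordingly.

```shell
# Watch the dpinger gateway monitor entries in real time during a bounce:
tail -f /var/log/gateways.log

# Check how large the state table currently is before/after a reset:
pfctl -si | grep -i current
```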

I updated the monitor IP to 1.1.1.1, and while in the hung state I have tried resetting the states and restarting the unbound and dhcpd services.

I was able to get it to recover once without requiring a reboot, but I'm not sure what I did to cause it.

It bounced again. My SIP is working and my Comcast smart TV app is working, but my IP cams are offline, along with all DNS/FQDN traffic.

Apr 6 09:16:04 dpinger WAN_DHCP 1.1.1.1: Clear latency 10437us stddev 5705us loss 6%
Apr 6 09:08:22 dpinger OPENVPN_VPNV4 1.XXXXXXXX7: Alarm latency 158256us stddev 3533us loss 21%
Apr 6 09:08:22 dpinger WAN_DHCP 1.1.1.1: Alarm latency 9484us stddev 2038us loss 21%
Apr 6 07:46:13 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr XXXXXX.7 bind_addr 10.35.10.6 identifier "OPENVPN_VPNV4 "
Apr 6 07:46:13 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 1.1.1.1 bind_addr XXXXXX identifier "WAN_DHCP "
Apr 6 07:40:40 dpinger OPENVPN_VPNV4 10.35.10.5: Alarm latency 0us stddev 0us loss 100%

RESOLVE: Cannot resolve host address: us-east.privateinternetaccess.com:1197 (hostname nor servname provided, or not known)

Reset states.

ping 8.8.8.8

Pinging 8.8.8.8 with 32 bytes of data:
Reply from 8.8.8.8: bytes=32 time=28ms TTL=53
Reply from 8.8.8.8: bytes=32 time=24ms TTL=53

Ping statistics for 8.8.8.8:
Packets: Sent = 2, Received = 2, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 24ms, Maximum = 28ms, Average = 26ms
ping cnn.com
Ping request could not find host cnn.com. Please check the name and try again.

ping 151.101.129.67

Pinging 151.101.129.67 with 32 bytes of data:
Reply from 151.101.129.67: bytes=32 time=27ms TTL=57
Reply from 151.101.129.67: bytes=32 time=25ms TTL=57
Reply from 151.101.129.67: bytes=32 time=30ms TTL=57

Ping statistics for 151.101.129.67:
Packets: Sent = 3, Received = 3, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 25ms, Maximum = 30ms, Average = 27ms

It still seems to be a DNS issue on the firewall: pings to raw IPs succeed, but name resolution fails.
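To confirm that split, it may help to query the firewall's resolver and an upstream server separately, bypassing unbound in the second case. 192.168.1.1 below is a placeholder for the pfSense LAN address; `drill` ships in the FreeBSD base that pfSense is built on.

```shell
# Query unbound on the firewall directly:
drill cnn.com @192.168.1.1   # failure here means unbound itself is wedged

# Query an upstream server, bypassing unbound entirely:
drill cnn.com @1.1.1.1       # success here means the WAN path for DNS is fine
```

On a Windows LAN client, `nslookup cnn.com 1.1.1.1` is the equivalent of the second test. If the bypass query works while the local one hangs, that narrows it to the unbound process rather than the circuit.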

I have a similar pfSense problem after rebooting my fibre modem (WAN). It happens every time. It might be rash to assume that Comcast dropping your WAN would trigger the same error… but it sounds too close to ignore.

After rebooting my fibre modem, pfSense appears to truck on for a while, but it always fails within an hour (or two). The log entries explode with "failed" messages. It seems there is a cached resource that expires and fails to re-establish.

The forums are flooded with this exact report, so it's very real.

One comment: I had a vague suspicion it started around the time I added OpenVPN. Can't say for sure… just one of those niggles you get when you try to remember "what has changed since it worked".

PS. I'm running an SG-3100, always kept current.

I am not sure which step "fixed" it, but everything I did here improved the stability a lot. It's not perfect, but I have not had any real issues since. At some point I am going to lab this out with some VMs and find the real fix. Hope my thread over at Netgate helps.