Unsolvable network issue

I know what you’re thinking, but it’s true. I think it’s best if I give only limited information to start with and we build on it here, if anyone thinks they can help.

Issue

Maybe once every three weeks, maybe two days in a row (or we could go a good month without it), we get a short disruption in internet traffic that at face value looks like DNS, but it does not affect all websites. OneDrive, for example, fails with a DNS error and happens to be the most common and disruptive case, while other sites keep working. Another problem site seems to be bbc.co.uk (UK news/TV/radio). Traffic to other sites is fast and works without issue. I think YouTube has also been on the list. 3, 5 or 10 minutes later it’s all over.

In the past we have bypassed the Smoothwall and that seems to have worked, but it might just be the disconnection/reconnection that fixed it. The problem is there is no time to start testing when it happens. It feels like the Smoothwall, but support have given it a clean bill of health, and the unit has since been upgraded by chance, although the config was exported and imported.

I did put a PC on the network between the Smoothwall and the Draytek so I could test north of the Smoothwall, but I can’t remember the result. Also, it’s not predictable.

We have been looking at this on and off for a LONG time.

Setup

  • Largish network, 10-ish VLANs, mostly running /20 with a few /24, ACLs in play on a fibre HP 3800 switch (gateway)
  • 1500 users, half of them with phones and/or laptops as well

Internet

The 3800 is the endpoint for all VLANs. Traffic is then routed on 10.0.99.2/24 via a bridged monitoring appliance called a Smoothwall, then to a Draytek router, then the Juniper ISP router, then gone…

Secret

We do have a 2nd site that is routed from the 3800. They sometimes get the same issue at the same time. I wasn’t going to mention it, as the issue is happening downstream, but in case it sparks any ideas about a DNS/routing issue back to the 2nd site, I will include it here.

If someone picks this up in its basic form, I will dig a bit deeper and get some screenshots.
I also have SNMP.

That is a lot of words and not a lot of details. Let’s start with this: what exactly do the DNS errors say, and who provides your DNS service? Follow the DNS server logs, look for errors, and if it’s an upstream DNS provider then consider finding a new one. Also, please keep in mind that this forum is mostly full of people using pfSense and Unifi equipment, not Smoothwall.

How do your internal clients get their DNS servers? I’m assuming from DHCP, but let me know. Also, it’d be nice to know your results north of the firewall. But if I were you, here’s how I’d test.

Grab 3 Raspberry Pis. Put one on the production network south of your firewall, one north, and one at that remote site you mentioned. Write a script that runs an nslookup of your problem domains and alerts you when it cannot resolve them. It’d also be helpful to have it include the DNS server it queried, to make sure something isn’t hijacking traffic. This will at least let you know whether you’re looking at an issue with upstream DNS or something internal to your network, and give you a better idea of where to look next.
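
Something along these lines might do as a starting point for that script. It is only a rough sketch in Python using the standard library: instead of shelling out to nslookup it sends a plain DNS A-record query over UDP to a server you name, so each log line records which server was asked and which address the reply came from. The server address, the domain list and the polling interval are placeholders I made up, so swap in your own. Note that the reply address is only a basic sanity check, since a transparent interceptor could answer with the expected source address.

```python
import random
import socket
import struct
import time
from datetime import datetime

# Placeholder values (assumptions): replace with your own.
DNS_SERVER = "10.0.0.53"                 # hypothetical internal DNS server
DOMAINS = ["bbc.co.uk", "youtube.com"]   # problem sites named in the thread
INTERVAL_SECONDS = 30


def build_query(name, query_id):
    """Build a minimal DNS query packet asking for an A record."""
    # Flags 0x0100 = standard query with recursion desired; one question.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def check_response(packet, query_id):
    """Inspect the response header only; return (ok, detail)."""
    if len(packet) < 12:
        return False, "short packet"
    rid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", packet[:12])
    if rid != query_id:
        return False, "mismatched id"
    rcode = flags & 0x000F
    if rcode != 0:
        return False, f"rcode={rcode}"
    if ancount == 0:
        return False, "no answers"
    return True, f"{ancount} answer(s)"


def probe(name, server, timeout=3.0):
    """Query one name against one server; return (ok, detail, responder)."""
    query_id = random.randint(0, 0xFFFF)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name, query_id), (server, 53))
            packet, (responder, _port) = sock.recvfrom(4096)
        except OSError as exc:  # includes timeouts and unreachable errors
            return False, f"{type(exc).__name__}: {exc}", None
    ok, detail = check_response(packet, query_id)
    return ok, detail, responder


def watch():
    """Loop until interrupted, printing a timestamped line per failure."""
    while True:
        for name in DOMAINS:
            ok, detail, responder = probe(name, DNS_SERVER)
            if not ok:
                stamp = datetime.now().isoformat(timespec="seconds")
                print(f"{stamp} FAIL {name} via {DNS_SERVER} "
                      f"(reply from {responder}): {detail}", flush=True)
        time.sleep(INTERVAL_SECONDS)

# To start monitoring, call watch(); it runs until interrupted.
```

Run the same thing on all three boxes with the output redirected to a file, then compare timestamps after the next blip. If the box north of the firewall stays clean while the one south logs failures, that points at something between them rather than upstream DNS.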

Yes, DHCP server.
I did what you said and had it running for weeks with no hit. Took it all down and then we got another blip…
I used PCs and VMs to complete the task.

I’ll dig out some of the error messages and put them on here. I will also get some up-to-date logs from DNS.