Weird UniFi problem


We deployed 80 UAP-HD access points here on site and are facing a really weird issue. We contacted UniFi support, which seems uninterested at best in helping. So I’m reaching out here, hoping that, given the widespread use of UniFi products in your customer base, you might be able to give me a pointer on where to look.

We had a professional site survey done, which determined the placement and channel/power selection for the 80 UAP-HD UniFi APs. We successfully run a scaled-up version of the controller, so it is not freaking out over the number of APs. We configured it with 802.1x and WPA Enterprise, and everything worked until we started seeing weird issues with disconnecting clients, which are as follows:

  • out of the blue all clients on a specific AP (not always the same one) suddenly lose IP connectivity
  • it affects all clients across all SSIDs at once, but only on the AP which shows the problem at that moment
  • you can see about 10 pings dropping before communication suddenly resumes
  • this happens infrequently and if it happens it happens a few times within about 15 minutes
  • 99% of the clients are connected to 5GHz
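To pin down when these bursts occur and how long they last, a small script over a ping log can help. The following is only a sketch: it assumes you already have one success/failure flag per ping in send order, and the `find_outages` helper and its `min_lost` threshold are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    start_seq: int  # index of the first lost ping in the burst
    length: int     # number of consecutive lost pings


def find_outages(results, min_lost=5):
    """Return runs of at least `min_lost` consecutive lost pings.

    `results` holds one boolean per ping in send order; True means a
    reply was received.
    """
    outages = []
    run_start = None
    for seq, ok in enumerate(results):
        if not ok:
            if run_start is None:
                run_start = seq  # a new run of losses begins here
        else:
            if run_start is not None and seq - run_start >= min_lost:
                outages.append(Outage(run_start, seq - run_start))
            run_start = None
    # Handle a run of losses that lasts until the end of the log.
    if run_start is not None and len(results) - run_start >= min_lost:
        outages.append(Outage(run_start, len(results) - run_start))
    return outages
```

Running this against logs from several clients on the same AP would show whether they all lose connectivity during the same seconds.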

What we already tried (with no effect whatsoever):

  • installing a complete new controller with bare minimum configuration and putting four of the APs on that
  • tweaking min RSSI and radio parameters (different channels, DFS, non DFS, etc)
  • analyzing every log available - nothing is ever logged. The clients also don’t seem to drop the wifi connection - they just don’t receive any data anymore
  • tcpdumps on athX, eth0 and using a wifi sniffer
  • disabling affected APs - but that didn’t do anything. The problem just appeared somewhere else
  • switching clients to 2.4GHz
  • another partial site survey, which confirmed that we actually have good wifi signal strength (> -60 dBm) and good SNR

If anyone has any idea or any hint about which direction to look in, I would be very happy to try it. I’m totally running out of ideas.

That is very odd, have you disabled broadcast?

Also, what does the dump show? Are there retries? Connection Drops?

Yes, we engaged broadcast storm control. Interestingly, the dump I have shows that the ping was sent out from the client to the AP, which received it on the ath1 interface AND sent out the reply, which reached neither the client nor the sniffer.

There are retries and tx errors - few to no drops

You didn’t describe the PoE / Network hookup. 80 units across what kind of network gear?

Only because I had a similar experience in the past…APs would be “up” and some clients would disconnect and the AP would randomly “heal” and then another AP might drop clients. I found it was client load on the AP and the wattage load on the switch. I was exceeding the internal wattage capacity across ports…I failed to RTFM…and went by the advertised wattage. In the end the APs would all work with low client load, but once more clients connected…it always tended to be the 6th port…but it was random at times…clients would just drop connections…but I could ping the AP from my end.

A quick test: if you got the PoE injectors with the units, try putting 1 or 2 APs on the bricks and see if the behavior changes. If you have good switches that show PoE draw then you might not even need to look at this being the issue. I had Netgear and Linksys switches without much per-port info.
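The budget check I should have done up front is simple arithmetic. Here is a rough sketch of it; the 60 W budget and 11 W per-port draw are made-up illustrative numbers, not specs of any real switch or AP, and `first_overloaded_port` is a hypothetical helper name.

```python
def poe_headroom(budget_watts, per_port_draw_watts):
    """Remaining PoE budget in watts; negative means oversubscribed."""
    return budget_watts - sum(per_port_draw_watts)


def first_overloaded_port(budget_watts, per_port_draw_watts):
    """Return the 1-based port at which cumulative draw first exceeds
    the switch's total PoE budget, or None if the budget is never exceeded."""
    running = 0.0
    for port, draw in enumerate(per_port_draw_watts, start=1):
        running += draw
        if running > budget_watts:
            return port
    return None


# Hypothetical example: 8 ports, each AP drawing 11 W under client load,
# on a switch whose total PoE budget is 60 W.
draws = [11.0] * 8
print(poe_headroom(60.0, draws))
print(first_overloaded_port(60.0, draws))
```

The point is to use the draw under load against the switch's total budget from the manual, rather than the advertised per-port wattage.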

It’s all hooked up to enterprise-level Cisco gear. PoE is not an issue; we already checked.

Well, I don’t think it is a controller issue (I know you tried that already). I have UniFis that I set up with my laptop and they never see it again until I need to change them…they keep working with no controller communication.

The fact that it moves around leads me to believe it is something beyond just AP/WiFi, but we have to start somewhere :slight_smile:

I know these will seem very basic…and I’m not sure how well they will work for troubleshooting since you don’t have a consistent AP issue, but one that migrates itself to various APs at various times.

  • Put some APs on a crappy layer 2 switch in front of the Cisco stuff (if you have one)
  • Can you try a group without WPA Enterprise and if so does the issue go away for those APs?
  • Try an old or newer firmware on a couple APs
  • Do you have any other APs? Could you stand up an old AP where the issue is most consistent and see if it appears?
  • If the issue seems to move within a single switch…can you swap switches…and see if the issue moves with it?
  • Firmware / hardware issues between the client wireless NICs and the APs. Not sure how homogeneous your device platform is, though in that case the issue shouldn’t move around but be everywhere.

Other than that I don’t know what else to suggest, as I don’t “know” your environment like you do, and I am not that heavy a user of UniFi. Since you had a site survey and adjusted the radios, my usual answers (channel interference, overlap, bleed, etc.) are hopefully all handled already. 802.11r maybe?

You obviously manage something large with 80 APs, so I’m not trying to insult you with stupid/dumb ideas, just trying to offer things I have tried that have helped…and ideas that might spark a clue to something.

Did you use 23 gauge cable or 24 gauge? I only ask because when I first started doing PoE cameras an Axis camera guy told me that the increased resistance of the 24 gauge cable - although it should work - had the long term effect of burning up some cameras, especially on long runs.

I know you have enterprise Cisco switches and in theory power shouldn’t be an issue, but for troubleshooting, you might want to try an injector near the AP.

Like others have said, your setup sounds solid, so it’s going to be something odd. I’d lean towards troubleshooting power-related issues though, since the controller isn’t needed all the time unless you’re doing captive portal or something.

Thanks for your suggestions. We cannot put a dumb switch in front because we extensively use VLANs and RADIUS. We have an open guest network which also shows the same behaviour. As mentioned, if an AP decides to go AWOL it hits all SSIDs on that particular AP. We also have old Aerohive APs which never showed any issue like that over years and years. The issue really wanders but seems to be hitting APs with more (15 lol) users on them. The general load of the network is pretty low (about 50Mbps over all 80 APs). It also affects different switches. We have a test setup running another firmware - same issues showing. 802.11r has never been enabled.

I have now put a PoE injector right next to the AP. Let’s see if that changes anything. The switch is not terribly far away from that particular AP, but the cabling might not be the best.

I’m also running one of the old Aerohives in spectrum analyzer mode to see if there is an anomaly on the band during the outages. Sadly I have not been able to witness a problem yet while the analyzer is running.

So the SA graphed while two incidents happened. The screenshot shows two disturbances on channel 36/40 which correlate directly to the packet drops.

Btw. that’s also the AP with the PoE injector attached to it, so I think we can rule out power problems.

Anyone any idea?

Well I think you have narrowed it way down and the WiFi analysis is kinda key (and sorry I gave you so many hardware ideas now). That roaming thing really made it seem hardware…as interference tends to be location specific, time specific, or device specific.

Did you load the Aerohive with clients and test it? You said they used to work fine; if they also have the issue, then perhaps you have something new pushing out interference affecting everything. If the A.H. clients are unaffected then I would see if you can find a loaner UniFi of a different model and see how that performs.

That will at least give you a clue whether it is a UniFi HD issue, a UniFi line issue, or a vendor-agnostic issue that you have to handle in your building(s).

Like perhaps you have a new clock, PA, or alarm system that is wireless and the interference is from that. Maybe its spectrum/channels or broadcast times can be adjusted if it is something like that.

There could be a gamut of devices causing that interference, so stick with the simplest task since you had a working A.H. environment…see if you can test another UniFi model. If it is more resilient than the HD…you have an RMA/Exchange battle. If not, then you are kinda stuck finding the interference.

Again, just my “brain” giving you something to go off of. I am admitting right now, spectrum analysis is out of my wheelhouse. I am just giving you my best educated guess ideas. Maybe someone will find this thread and be like Shane is a moron, here is the 2 lines of code to SSH and fix the issue :slight_smile:

Yeah - I am kind of looking more into a traffic issue now. We configured a span port (port mirror) and looked at the AP traffic, and there are huge spikes (>12000 packets/s) in the Wireshark graph. All of them are ACK packets… Something is weird.
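To line these spikes up with the outages, the capture can be reduced to a packets-per-second count. A minimal sketch follows, assuming the packet timestamps have been exported from the capture as epoch seconds (for example via a Wireshark/tshark export); `find_spikes` is a hypothetical helper name and the default threshold simply mirrors the 12000 packets/s figure above.

```python
from collections import Counter


def packets_per_second(timestamps):
    """Count packets in each whole-second bucket.

    `timestamps` is an iterable of non-negative epoch times in seconds
    (floats), one per captured packet.
    """
    return Counter(int(ts) for ts in timestamps)


def find_spikes(timestamps, threshold=12000):
    """Return the sorted list of seconds whose packet count exceeds `threshold`."""
    counts = packets_per_second(timestamps)
    return sorted(sec for sec, n in counts.items() if n > threshold)
```

Comparing the returned seconds against the times of the ping drops would show whether the ACK bursts coincide with the outages or merely happen nearby.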

Well, good luck. I will keep checking here, but I think you are beyond my scope until I get through my Cisco books :wink: . That’s why I went with Meraki and Ubiquiti: I can do the basics visually…but nowadays I have issues myself that even Meraki doesn’t log…so I am studying.

I don’t think this is your issue, but it reminds me of when I had a user with a laptop that had a wired and wireless connection active. Every day when he came in to the office and fired up his laptop it would cause a segment of the network to go down. Disabled proxy ARP and the problem went away. That was tough to troubleshoot.