Weird issue with UniFi 10Gb, VLAN tagging and VMware

Has anyone else had issues specifically with VLAN tagging between VMware and UniFi switches when using 10Gb switches or 10Gb switch ports?

I currently have a dialogue going with Ubiquiti over this, since we’re deploying a new 10Gb network along with new VMware ESXi 7.0.1 hosts.
I’ve never had any issues using Cisco with VMware, and at 1Gb I have no issues with VMware and UniFi either. However, I seem to have hit a weird edge case when using 10Gb between the VMware hosts and the Ubiquiti 10Gb switches (US-16-XG).

If I don’t do any tagging and just use the native VLAN on the port, everything works fine and we can ping the default gateway from any VM running on the host. But as soon as I change the port profile on the port to one with tagging on it and move the VM to the tagged network within VMware, only Layer 2 traffic seems to be coming over onto the switch, and nothing at Layer 3 or above.
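For context, on the ESXi side the tagged network is just a port group with a VLAN ID set on it. Roughly like this, assuming a standard vSwitch rather than a distributed one (the port group name and VLAN ID below are only placeholders, not my actual values):

```
# List the existing port groups and the VLAN ID on each (ESXi shell)
esxcli network vswitch standard portgroup list

# Placeholder example: set VLAN 20 on a port group called "VM-VLAN20"
esxcli network vswitch standard portgroup set -p "VM-VLAN20" --vlan-id 20
```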

We can see the virtual machines’ MAC addresses in the MAC table on the switch, so we know Layer 2 is at least sort of working, although the MAC addresses appear in both the native and the tagged VLAN, which I find odd, and that’s even after I’ve cleared the MAC table from the previous setup.

However, we can’t ping out of the VM. Yet if I drop the connection down to 1Gb full duplex using the same hardware and the same ports on the switch, it all works fine, so the config itself is correct.
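If anyone wants to reproduce that step, forcing the link speed from the ESXi side looks roughly like this (the vmnic name is just a placeholder for whichever uplink is in use):

```
# Check the current speed/duplex of the uplinks (ESXi shell)
esxcli network nic list

# Placeholder example: force one uplink down to 1Gb full duplex
esxcli network nic set -n vmnic2 -S 1000 -D full

# And back to auto-negotiation afterwards
esxcli network nic set -n vmnic2 -a
```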

We’ve tried different 10Gb network cards in the hosts, different SFP+ modules and even the 10Gb RJ45 SFP+ modules too with a small length of CAT6 between, and i’ve even tried using the 10Gb ports on the USW-PRO-48-POE switch and the same happens.
Also tried different config on the switch ports, such as setting the MTU to 1500, turning off VLAN ingress filter etc.

What is really odd is that I have exactly the same hardware running two pfSense servers, and those work perfectly fine with the tagging. So we know it’s not a hardware compatibility issue between the switch and the physical server hardware.
It’s almost like VMware is using a different 802.1Q standard to what the switch is expecting and not splitting the VLAN traffic out properly.

We even tried port mirroring the 10Gb port and inspecting the traffic in Wireshark, and we only see ARP; very little TCP/IP or ICMP traffic, even while we’re running a ping on the server.
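For anyone wanting to do the same kind of check from the host side instead of a mirror port, something along these lines should work (the vmnic name and output file are placeholders):

```
# Capture frames leaving the 10Gb uplink while running a ping from the VM (ESXi shell)
# --dir 1 = transmit direction; stop the capture with Ctrl-C
pktcap-uw --uplink vmnic2 --dir 1 -o /tmp/uplink-tx.pcap
```

The resulting .pcap can then be copied off the host and opened in Wireshark to see whether the frames actually carry the expected 802.1Q tag as they leave the host.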

Any insight anyone may have on this would be appreciated.
I think it may be a really obscure edge-case issue with the UniFi firmware, but I’m interested to see if anyone else has run into it with a similar setup.

That is not something I have witnessed before, but have you tried different firmware versions on the UniFi? Perhaps an older version?

Thanks Tom,
Yes, we originally had the issue out of the box with brand-new UniFi units, so we were on 4.3.something on the switches and the latest 5.x release on the controller; I can’t remember the exact versions. I held off upgrading to the 6.0 train on your advice, but Ubiquiti told us to update to the latest firmware on the controller and switches just after you announced you were starting to roll out 6.0 anyway, so we’re now on 6.0.41 on the controller and 5.43.18.12487 on the switches.
Since we’ve been having this issue there have been two firmware releases for the switches, which I’ve upgraded to along the way in the hope that one of them was Ubiquiti releasing a patch for my issue.

I dare say that if we were using XCP-ng instead of VMware we probably wouldn’t be having this issue either. I suspect this is very much a compatibility issue between VMware and UniFi, and I’m not sure how many people use those two products together, since most people running VMware are probably on Cisco. We were too, but 10Gb on Cisco was too expensive, whereas Ubiquiti was just under $400 for a 16-port 10Gb switch.

Ubiquiti have gone quiet on me now too. I haven’t heard from them for a few days, so I might have to gently nudge them and find out what can be done to move forward.
I’d be happy to let them have access to our environment to test on since it’s not in production yet.

Ironically we have an order on hold for another 15 switches once this issue is resolved too, so I really am hoping they get to the bottom of it.

I was hoping someone else might have also experienced this, to put my mind at ease that it’s not just me going mad… lol :slight_smile:

I am wondering if you ever found a solution to this problem. I am running into a VLAN tagging issue with VoIP phones and firmware 5.43.18.12487, but only on one of my switches. I was wondering if the problem was related.

The issue I am having is on a US-48-750W over 1Gb copper, but that switch is downlinked over 10Gb SFP+ from a US-16-XG. I have my ESXi 6.7 hosts plugged into 10Gb SFP+ ports on the US-16-XG switches, so it may or may not be a related issue.

A few days after upgrading to 5.43.18.12487, some of my Cisco phones started acting up, rebooting all the time. I could not figure it out, so I replaced them with Yealink T23Gs, and those phones then seemed to work fine for the most part. However, when I take a Cisco phone down to my office and plug it into the US-8-150W switch in the IT office, on the same VLAN, it works fine with no reboots. If I take it back to its original location, I run into the issue again.

However, when I replaced a Cisco earlier today with a Yealink that had worked fine in my office, it took the phone a long time to get an IP, and it would behave strangely, sometimes only allowing one-way audio (something we see with external phones that have NAT issues, but never on the local network).

After reading this comment I removed the LLDP VLAN profile and rebooted the phone. Suddenly the phone was working correctly again, albeit on the wrong VLAN/subnet. What is really strange is that I have this port profile on every switch port with a phone connected, across many switches, and only this one seems to be having issues.

Did your 10Gb VLAN tagging problem ever get solved?

I managed to solve the issue in the end, but it was weird. It turns out the Emulex cards I was using in the VMware hosts were the cause of the issue. I switched them for Intel cards and the problem went away. Oddly, the same cards work fine on FreeBSD, as I’m using them for both pfSense and FreeNAS.

That all seems to suggest it’s driver related. I suspect the driver used on VMware might have issues, and it’s not isolated to just Emulex but affects Broadcom as well.
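If anyone wants to check which driver and firmware their card is actually using on the host, this is the sort of thing I was looking at (the vmnic name is just a placeholder):

```
# Show the driver, driver version and firmware version for an uplink (ESXi shell)
esxcli network nic get -n vmnic2
```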

My advice would be to swap your card out for an Intel-based one, as I’ve found you can’t beat Intel cards for driver support across almost all platforms, and they’re pretty much a staple card in almost all systems.
Off the top of my head, I think the cards I used were Intel X520s, although I used the IBM-embedded ones; there is a PCIe version of the same card, and retail on the PCIe ones is about $70.
