Weird network issue with TrueNAS and a bridge

I’m looking for a little advice. I just converged 3 much smaller 4-disk physical NAS boxes (previously running TrueNAS SCALE 24.10) into a single refurbished 24-bay Supermicro SSG-6028R-E1CR24N running TrueNAS SCALE 25.04.0. I did this primarily for the built-in 10GbE SFP+ ports and to consolidate physical space, management, and rack-mounted goodness. Everything is fine and the host IS working at 10GbE on both SFP+ ports. Other hardware in the mix is a TP-Link Omada SG3428X switch and a Netgate SG-4100 pfSense appliance handling all the VLAN and network traffic; everything is flashed with its latest release firmware.

So, here’s my problem: I’m trying to change the network on the new-to-me TrueNAS host from 2 discrete 10GbE connections (ens5f0 & ens5f1) to a bridged network connection (no LAGG groups, just a br0) so I can run a single VM on the TrueNAS host that can communicate with the TrueNAS host itself. I don’t have any issue creating the virtual bridge on the host; what happens is, as soon as I tell TrueNAS to “test” the new configuration, my entire network goes down. By entire network, I’m talking about IoT devices on VLANs that don’t connect to TrueNAS at all, desktops, wifi, even some smart appliances (also on their own VLAN). The network DOES come back up after 2 or 3 minutes, once TrueNAS determines that the “test” phase wasn’t successful and takes down the newly configured br0. I’ve reached out in the TrueNAS forums: Link to thread on Truenas forum, but so far everything is focused on the br0 creation itself and not any other potential causes for this weird behavior.

Other steps I’ve taken: removing one of the SFP+ transceivers to use a single port (same results: the network dies for that same 2-3 minutes), and even disconnecting the host from the network entirely, creating the bridge via the BMC/IPMI console & CLI tools on TrueNAS, then plugging the SFP+ transceivers back in to bring it up on the network. Again, the network dies as soon as I plug in the first SFP+ transceiver.

My question is, what other places should I start looking at? Port settings on the switch? Something in pfSense? Could it even be the SFP+ modules themselves? Is there something TrueNAS does differently with the br0 network that isn’t done on the (for lack of a better term) raw ens5f0/ens5f1 interfaces directly?

Any help steering me in any direction that could help identify where/what might be happening would result in me owing you a full case of !

Thanks in advance for ANY suggestions!

Hi, that’s an interesting one.

What I’d do first is try to get more information. Plug in something like a laptop running Wireshark and get some packet captures going to see what is happening. Is it flooding the network? Is it something else? Running a capture on the TrueNAS bridge itself might also provide a lot of info.

Excellent suggestions! I’ll give that a try this afternoon. I’ll caveat, though, that since the entire network goes down, I’m not sure how Wireshark will react when/if the packet stream gets disconnected… we’ll see!

It definitely is a loop! I had a pause in work so I fired up a VM and ran dumpcap from it, then made the TrueNAS changes and clicked Test… the dumpcap terminated when the terminal session was lost, but I did manage to capture this sanitized output:

1834 74.125629164 SuperMic_0f:3d:a4 → Broadcast    ARP 60 Who has 10.X.X.X? Tell 10.X.X.X (duplicate use of 10.X.X.X detected!)
 1842 74.281123944   10.X.X.X → 224.0.0.251  MDNS 452 Standard query 0x0000 ANY 4.a.d.3.f.0.e.f.f.f.b.6.f.1.e.a.0.0.0.0.0.0.0.0.0.0.0.0.0.8.e.f.ip6.arpa, "QM" question ANY truenas.local, "QM" question ANY X.X.X.10.in-addr.arpa, "QM" question ANY truenas._smb._tcp.local, "QM" question ANY truenas._http._tcp.local, "QM" question ANY truenas._device-info._tcp.local, "QM" question A 10.X.X.X PTR truenas.local SRV 0 0 445 truenas.local TXT SRV 0 0 80 truenas.local TXT SRV 0 0 9 truenas.local TXT AAAA fe80::ae1f:6bff:fe0f:3da4 PTR truenas.local
 1844 74.384679801 SuperMic_0f:3d:a4 → Broadcast    ARP 60 Who has 10.X.X.X? Tell 10.X.X.X (duplicate use of 10.X.X.X detected!)
...
...lots of duplicates of those previous lines...
...sanitized cruft out so this didn't turn into a massive mess...
...
 4254 122.988116170 Netgear_10:4d:2b → Broadcast    RLDP 60 Network Loop Detection

So, if I had to guess, it almost looks like the ens5f0 NIC isn’t releasing its IP address before it gets assigned to the new br0. But I’m deleting that static IP alias on ens5f0 before assigning it to the new br0.
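
For anyone wanting to quantify that kind of repetition from a saved text capture, here is a minimal Python sketch. It assumes tshark-style summary lines shaped like the sanitized output above (frame number, timestamp, source, arrow, destination, protocol); the `tally` helper and the `sample` list are made up for illustration and are not part of any capture tool.

```python
import re
from collections import Counter

# Assumed line shape: "<frame> <time> <source> → <destination> <protocol> ..."
LINE_RE = re.compile(
    r"^\s*(?P<num>\d+)\s+(?P<time>\d+\.\d+)\s+(?P<src>\S+)\s+(?:→|->)\s+"
    r"(?P<dst>\S+)\s+(?P<proto>\S+)\s"
)


def tally(lines):
    """Count frames per (source, destination, protocol); skip lines that do not parse."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[(m["src"], m["dst"], m["proto"])] += 1
    return counts


# Lines taken from the sanitized capture in this thread (the MDNS line is shortened).
sample = [
    "1834 74.125629164 SuperMic_0f:3d:a4 → Broadcast    ARP 60 Who has 10.X.X.X? Tell 10.X.X.X (duplicate use of 10.X.X.X detected!)",
    " 1842 74.281123944   10.X.X.X → 224.0.0.251  MDNS 452 Standard query 0x0000",
    " 1844 74.384679801 SuperMic_0f:3d:a4 → Broadcast    ARP 60 Who has 10.X.X.X? Tell 10.X.X.X (duplicate use of 10.X.X.X detected!)",
    " 4254 122.988116170 Netgear_10:4d:2b → Broadcast    RLDP 60 Network Loop Detection",
    "...lots of duplicates of those previous lines...",
]

counts = tally(sample)
for (src, dst, proto), n in counts.most_common():
    print(n, src, dst, proto)
```

On a full capture, one source dominating the broadcast counts within a short time window would be consistent with a loop, though a tally like this cannot prove one by itself.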

SUCCESS!!! I found the problem!

The issue boiled down to this: the switch port that TrueNAS is plugged into had a custom profile with RSTP disabled. (Why I did that, I haven’t a clue, so I’m publicly face-palming myself.) The second port, for ens5f1, was on the default switch-port profile with RSTP enabled, so if I had used ens5f1 instead of ens5f0, it would have worked on the first try. Once I enabled RSTP on that switch port, the bridge came up without any issue, so I’ve reverted that port to the default profile and all is good.

Moral of the story: If you experience something similar to my issue, make sure Spanning Tree Protocol (STP/RSTP) is enabled on the switch ports your bridge is plugged into!
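
To illustrate why spanning tree matters here, below is a toy Python model of broadcast flooding. It is only a sketch under assumptions: the two-cable topology, the `flood` function, and the `max_hops` cutoff are invented for illustration, and it is not a reconstruction of the poster's exact setup, which the thread does not fully describe. Each node floods a frame out of every active link except the one it arrived on, which is roughly how a bridge handles broadcasts.

```python
from collections import deque


def flood(links, origin, blocked=frozenset(), max_hops=20):
    """Flood one broadcast frame from `origin` and count how often it is forwarded.

    links:    list of (node_a, node_b) cables; parallel cables are allowed.
    blocked:  indexes of links that spanning tree has put into a blocking state.
    max_hops: artificial cutoff so the toy model terminates even with a loop.
    """
    active = [(i, a, b) for i, (a, b) in enumerate(links) if i not in blocked]
    forwards = 0
    # Each entry: (node holding a copy, link index it arrived on, hops so far)
    queue = deque([(origin, None, 0)])
    while queue:
        node, came_in, hops = queue.popleft()
        if hops >= max_hops:
            continue
        for i, a, b in active:
            # Never send a frame back out the link it arrived on.
            if i == came_in or node not in (a, b):
                continue
            peer = b if node == a else a
            forwards += 1
            queue.append((peer, i, hops + 1))
    return forwards


# Hypothetical topology: a switch and a bridge joined by two cables, forming a loop.
links = [("switch", "br0"), ("switch", "br0")]

looped = flood(links, "switch")                # no link blocked
pruned = flood(links, "switch", blocked={1})   # one link blocked, loop broken
print("forwards with loop:", looped)
print("forwards with one link blocked:", pruned)
```

In the looped case the forward count keeps growing with `max_hops`, meaning the frame would circulate indefinitely on real hardware; with one link blocked the frame is forwarded once and stops.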
