Telemetry/NMS: what do y'all like?

Afternoon all,

So, network telemetry, i.e. data from your network infrastructure recording health and consumption metrics. Things like interface utilization, interface errors, interface drops, and link health (like fiber TX and RX light levels). Also application-layer tests measuring basic application latency.
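For anyone wondering how a number like interface utilization falls out of raw counters, here's a rough Python sketch. It assumes you've polled an octet counter (for example IF-MIB `ifInOctets` or `ifHCInOctets`) twice and know the link speed. The function name and parameters are made up for illustration and aren't from any particular NMS.

```python
def utilization_percent(octets_start, octets_end, interval_s, speed_bps, counter_bits=64):
    """Percent utilization of a link between two octet-counter samples."""
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval_s and speed_bps must be positive")
    delta = octets_end - octets_start
    if delta < 0:
        # Counter wrapped between samples (mostly a concern with 32-bit counters).
        delta += 2 ** counter_bits
    bits_per_second = delta * 8 / interval_s
    return 100.0 * bits_per_second / speed_bps


# Hypothetical example: a 1 Gb/s port polled 300 seconds apart.
print(utilization_percent(0, 3_750_000_000, 300, 1_000_000_000))
```

The wrap handling only covers a single wrap between polls, which is one reason to poll fast links often or use the 64-bit counters.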

In my experience I’ve found two tools invaluable, both FOSS: Smokeping for application-layer monitoring and Observium for NMS.

Smokeping. What a great tool. You can set up a TON of various tests, but I’ve found DNS, HTTPS and PING tests to large cloud providers to be a great canary in the coal mine as to the health of my network (at least from my SP system to WAN to cloud). Just recently my home network was behaving poorly. While things worked, it was very clunky. Zoom sessions were taking longer than usual to start. Initial webpage loads were also horrible and all over the place. Smokeping tests showed crazy latency and dropped packets. A gateway reboot fixed it. But 12 days later it’s back, and I had the data to show it. Now AT&T can review these graphs and data in my trouble ticket to replace my fiber gateway.
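If you want a feel for what a test like this measures, here's a minimal Python sketch of the idea: send a handful of probes per round, then report loss and the median round-trip time. This is not Smokeping's code or config. `tcp_connect_ms` and `summarize_round` are names I made up, and timing a TCP connect is just a stand-in probe that doesn't need raw-socket privileges the way ICMP does.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port=443, timeout=2.0):
    """Time one TCP connect in milliseconds; return None if it fails or times out."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None


def summarize_round(rtts_ms):
    """Summarize one probe round; None marks a probe that got no reply."""
    replies = sorted(r for r in rtts_ms if r is not None)
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    return {
        "sent": sent,
        "loss_pct": loss_pct,
        "median_ms": statistics.median(replies) if replies else None,
        "min_ms": replies[0] if replies else None,
        "max_ms": replies[-1] if replies else None,
    }
```

Something like `summarize_round([tcp_connect_ms("example.com") for _ in range(10)])` run every few minutes and logged gives you the same kind of before/after evidence, just without the nice graphs.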

Observium. Less for home, more for work. In one environment, they had zero NMS. No visibility at all. Stood this up and spent the next 4 weeks RMAing TONS of power supplies (Cisco Catalyst) which had failed without anyone knowing. Also found more than a few transceivers showing low light levels and interfaces with discards/drops. The causes were a mix of bad transceivers, fibers in need of cleaning, missing VLAN pruning and lastly some MTU mismatches. In one month had that whole company network hitting on all 8 cylinders.
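On the light-level point, the check itself is simple once you have the readings. A rough Python sketch, assuming you already have RX power per interface in milliwatts: the interface names and the -14 dBm warning threshold below are hypothetical, and real thresholds should come from the transceiver's own DOM values or datasheet.

```python
import math


def mw_to_dbm(power_mw):
    """Convert optical power from milliwatts to dBm (0 mW maps to -inf)."""
    if power_mw <= 0:
        return float("-inf")
    return 10.0 * math.log10(power_mw)


def flag_low_rx(rx_power_mw_by_port, low_warn_dbm):
    """Return the ports whose RX power is below the warning threshold, sorted by name."""
    return sorted(
        port
        for port, mw in rx_power_mw_by_port.items()
        if mw_to_dbm(mw) < low_warn_dbm
    )


# Hypothetical readings: the second port is receiving very little light.
readings = {"xe-0/1/0": 0.25, "xe-0/1/1": 0.01}
print(flag_low_rx(readings, low_warn_dbm=-14.0))
```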

Also these are free tools for anyone :slight_smile:

Those are my favs, but what about y’all? What software have I not played with that I should?

Smokeping. See where the gateway reboot occurs?

Observium looking at a Juniper EX2200.

I use Zabbix. It ticks all the boxes for metrics and statuses for services.

Nice :slight_smile: Will have to take a look. Would you have any anecdotes as to how it has helped you? If they mirror the stories above, let me know. Part of this is not just about software but curiosity as to what segment of IT finds this info valuable. As a network practitioner I find this telemetry akin to a spare tire. Most of the time it's in the background. But when issues arise, it's the most important thing in the world, only to be put back in the trunk once issues are corrected.

  1. It monitors VM metrics. (Hard drive space, network bandwidth utilization, CPU utilization, memory utilization, per-service CPU and memory utilization)

  2. Supports SNMP so I can get metrics from any device that supports that.

  3. It’s got alerting for many different platforms like Slack, MS Teams, email and so on.

  4. Can monitor HTTP/HTTPS sites to make sure they are up and test latency.

  5. Can query a DB and can do alerting on values in a table.
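To make item 5 a bit more concrete, here's a plain-Python sketch of the general idea of alerting on values in a table. It is not Zabbix's API or configuration. The `queue_depth` table, its columns, and the threshold of 500 are all hypothetical, and it uses an in-memory SQLite database so it runs anywhere.

```python
import sqlite3


def rows_over_threshold(conn, threshold):
    """Return (name, depth) rows whose depth exceeds the threshold."""
    cur = conn.execute(
        "SELECT name, depth FROM queue_depth WHERE depth > ? ORDER BY name",
        (threshold,),
    )
    return cur.fetchall()


# Hypothetical table standing in for whatever the application writes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue_depth (name TEXT, depth INTEGER)")
conn.executemany(
    "INSERT INTO queue_depth VALUES (?, ?)",
    [("orders", 12), ("billing", 950), ("email", 3)],
)

for name, depth in rows_over_threshold(conn, 500):
    # A real setup would notify Slack, Teams, email, etc. instead of printing.
    print(f"ALERT: {name} depth={depth}")
```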

This helps so much when something goes wrong. I can go back and drill down into all the resources at the specified time and try to figure out what happened. It’s really good when you have higher-up management wanting answers and you have data to fall back on.

Also helps when writing up retrospectives on downtime.