Hi All,
I am new to this forum and I work for a small MSP in Sweden.
We are migrating to Zabbix from another tool and it seems to progress as planned.
However, there is one issue I would like some insight into from other Zabbix users: how do you manage the system when you have a lot of hosts?
We are just getting started and so far have 3 customers and about 150 hosts in the system.
I think we will end up at about 1500 hosts when we are done.
Already now the host list is hard to read and I haven’t found any way of creating some form of hierarchy.
Option 1 is Host Groups. A host can be in multiple groups, so for example I have Windows, Linux, OtherOS, then one group per customer, and for some things I might also make one group per location (datacenter, cloud, main office, etc.).
So Host #1 is in the groups Linux, CustomerA, and CustomerA-Datacenter, and Host #100 is in the groups Windows, CustomerC, and AWS.
Host groups can be used to filter the hosts and alerts lists, and can be used for user access permissions too (CustomerC-readonly can see the CustomerC Host Group and nothing else, which also means that if you enable alerts for that user/group they’ll get alerts for their hosts and nothing else).
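To make the group idea a bit more concrete, here is a minimal Python sketch of how group membership can double as a hierarchy. It assumes the groups are named with `/` separators (for example `CustomerA/Datacenter` rather than `CustomerA-Datacenter`), which Zabbix treats as nested host groups. The host names and the `hosts_in_group` and `render_tree` helpers are made up for illustration; this only models the naming scheme locally and does not talk to Zabbix.

```python
from collections import defaultdict

# Hypothetical inventory: host name -> host groups it belongs to.
HOSTS = {
    "host-001": {"Linux", "CustomerA", "CustomerA/Datacenter"},
    "host-100": {"Windows", "CustomerC", "CustomerC/AWS"},
    "host-101": {"Linux", "CustomerC", "CustomerC/AWS"},
}


def hosts_in_group(hosts, group):
    """Return sorted host names in `group` or in any group nested below it."""
    prefix = group + "/"
    return sorted(
        name
        for name, groups in hosts.items()
        if any(g == group or g.startswith(prefix) for g in groups)
    )


def render_tree(hosts):
    """Render the groups as an indented tree with hosts listed under each group."""
    members = defaultdict(list)
    for name, groups in hosts.items():
        for group in groups:
            parts = group.split("/")
            # Make sure every ancestor group appears, even with no direct hosts.
            for i in range(1, len(parts)):
                members.setdefault("/".join(parts[:i]), [])
            members[group].append(name)

    lines = []
    # Sort by path components so children always follow their parent.
    for group in sorted(members, key=lambda g: g.split("/")):
        depth = group.count("/")
        lines.append("  " * depth + group.split("/")[-1])
        for name in sorted(members[group]):
            lines.append("  " * (depth + 1) + "- " + name)
    return "\n".join(lines)


if __name__ == "__main__":
    print(hosts_in_group(HOSTS, "CustomerC"))
    print(render_tree(HOSTS))
```

The same prefix rule is what would let a read-only customer user group be scoped to `CustomerC` and everything nested under it, if the groups are named that way.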
Option 2 is tagging, which is much more complicated but has the potential to be much more powerful.
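As a rough illustration of what tag-based filtering looks like, the sketch below builds a JSON-RPC request body for the Zabbix API `host.get` method using its `tags` and `evaltype` parameters. The tag names (`customer`, `site`) and the `build_host_get_by_tags` helper are assumptions for the example, authentication is left out because it differs between Zabbix versions, and nothing is actually sent to a server.

```python
import json

# Tag operator value from the host.get documentation: 1 means "Equals".
OPERATOR_EQUALS = 1


def build_host_get_by_tags(tags, request_id=1):
    """Build a host.get request body matching hosts that carry all given tags.

    `tags` is a dict of tag name -> tag value (hypothetical names here).
    """
    return {
        "jsonrpc": "2.0",
        "method": "host.get",
        "params": {
            "output": ["hostid", "host"],
            "evaltype": 0,  # 0 = And/Or: different tag names are ANDed together
            "tags": [
                {"tag": name, "value": value, "operator": OPERATOR_EQUALS}
                for name, value in tags.items()
            ],
        },
        "id": request_id,
    }


if __name__ == "__main__":
    payload = build_host_get_by_tags({"customer": "CustomerC", "site": "AWS"})
    print(json.dumps(payload, indent=2))
```

The body would be POSTed to the frontend's `api_jsonrpc.php` endpoint with whatever authentication your Zabbix version expects.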
I use both Host Groups and Tags to try to keep the system in order, but the problem is that I can’t see the order I create.
In this picture you can see what it looks like in our old system. An “Agent” is similar to a Zabbix Proxy.
With this hierarchy it becomes very easy to keep the customers’ (and our own) equipment apart in the same system, and it is easy to keep it tidy.
In this old system there is also the possibility to use a host list view, similar to the host list in Zabbix, but I don’t think I ever used it.
I guess I might be old school and I like hierarchy because it makes it easy for me to see where everything goes.
Zabbix can handle plenty of CIs to monitor, along with their items. You can run an HA configuration (2+ Zabbix servers) so that no collected data is missed. Agents can buffer events and collected data locally for a certain amount of time, and it’s up to you to configure how much. Out of the box the buffer is sufficient if you, let’s say, reboot the server or there is a short interruption in the communication with the Zabbix server(s).
On top of that, you can install proxy server(s) in the customer environment and have them use an active connection to your Zabbix server(s). With version 7.0 (I am referring to LTS versions only) you can have proxy groups and load-balance the traffic as well. Because the connection is initiated from the client side, you don’t even have to open firewall ports or set up NAT on the customer side. Note: don’t forget to enable encryption between the Zabbix proxies and the Zabbix server(s) if you don’t have a site-to-site VPN connection.
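For the encryption note, one common option is a pre-shared key (PSK). The sketch below generates a key and renders the matching lines for an active proxy's `zabbix_proxy.conf`. The parameter names (`ProxyMode`, `Server`, `Hostname`, `TLSConnect`, `TLSPSKIdentity`, `TLSPSKFile`) are standard Zabbix 7.0 proxy settings, but the server address, proxy name, identity and file path are placeholders, and the helper functions are invented for this example.

```python
import secrets


def generate_psk(bits=256):
    """Return a random PSK as a hex string (Zabbix accepts 128 to 2048 bits)."""
    if bits % 8 or not 128 <= bits <= 2048:
        raise ValueError("PSK size must be a multiple of 8 between 128 and 2048 bits")
    return secrets.token_hex(bits // 8)


def render_proxy_conf(server, proxy_name, psk_identity, psk_file):
    """Render the config lines for an active proxy that connects using a PSK."""
    lines = [
        "ProxyMode=0",  # 0 = active: the proxy initiates the connection
        f"Server={server}",
        f"Hostname={proxy_name}",
        "TLSConnect=psk",
        f"TLSPSKIdentity={psk_identity}",
        f"TLSPSKFile={psk_file}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    print(generate_psk())
    print(render_proxy_conf(
        server="zabbix.example.com",
        proxy_name="customerA-proxy-01",
        psk_identity="customerA-proxy-01-psk",
        psk_file="/etc/zabbix/proxy.psk",
    ))
```

The generated key goes into the PSK file on the proxy, and the same identity and key are entered in the proxy's encryption settings on the server side.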
As for weird devices or services: if you can communicate with it, Zabbix can monitor it. Just create a template with the items you want to monitor (if one isn’t already available in the Hub repository or published somewhere by someone else).
Oh, one more thing - if you have 1500 hosts/devices to monitor, make sure you have plenty of space to store history data (if you keep it for a year), since that table will grow big (there is a housekeeper process to make sure old records are removed from the DB).
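To get a feel for the numbers, here is a back-of-the-envelope sketch of the history sizing. It follows the usual approach of values per second times retention times bytes per value. The roughly 90 bytes per numeric value is the ballpark figure quoted in the Zabbix documentation and depends on the database engine; the 100 items per host and the 60 second interval are pure assumptions, so plug in your own.

```python
SECONDS_PER_DAY = 24 * 3600


def history_bytes(hosts, items_per_host, interval_s, retention_days, bytes_per_value=90):
    """Estimate history storage as stored values times bytes per value."""
    values = hosts * items_per_host * retention_days * SECONDS_PER_DAY // interval_s
    return values * bytes_per_value


if __name__ == "__main__":
    # 1500 hosts, 100 items each (assumed), polled every 60 s, kept for a year.
    estimate = history_bytes(1500, 100, 60, 365)
    print(f"{estimate / 1e12:.1f} TB")
```

With those assumptions the estimate lands at roughly 7 TB for a year of history, which is why the retention period and update intervals are worth thinking about early. Trends and events add to that.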
Make sure that you use the group parameter (host groups). From there you can filter devices by group (apply a maintenance period to the entire group, or just list them). Tags are also helpful when navigating.
Pretty much that’s it. It’s not rocket science.