ruminations : network monitoring (data collection part one)

network monitoring is, in theory, a relatively simple thing : watch X for Y conditions, and tell Z what you find. most systems break this down into collection, analysis, and alerting. today i’m looking at data collection for monitoring systems.

SNMP is the most common collection system in use thanks to how easy it is to set up and use. infrastructure equipment can often handle a ton of SNMP requests with minimal performance impact. SNMP is even flexible and extensible enough that it can be installed and configured on systems that don’t natively have it, notably servers, and still provide almost any info you could want.

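SNMP data is usually collected by polling the target device at regular intervals, hopefully matching your collection needs. SNMP also has the trap mechanism, where the device sends a notification on its own when something happens instead of waiting to be polled, which can make critical data collection even faster.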
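a rough sketch of what polling at regular intervals looks like, with the actual SNMP request left out. `poll_fn` is a hypothetical stand-in for whatever fetches the value (an SNMP GET in practice), and the `clock` and `sleep` parameters are my own additions so the loop can be exercised without actually waiting. scheduling each poll against the start time, rather than sleeping a flat interval after each one, is one way to keep a slow poll from pushing every later sample back.

```python
import time


def poll_loop(poll_fn, interval, count, clock=time.monotonic, sleep=time.sleep):
    """Call poll_fn() `count` times, aiming for one call every `interval` seconds.

    poll_fn is a hypothetical placeholder for the real collection call.
    Returns a list of (scheduled_offset_seconds, value) tuples.
    """
    start = clock()
    samples = []
    for i in range(count):
        due = start + i * interval      # slot is fixed relative to the start
        delay = due - clock()
        if delay > 0:
            sleep(delay)                # wait for the next slot
        # if delay <= 0 the previous poll overran its slot, so poll right away
        samples.append((i * interval, poll_fn()))
    return samples
```

if a poll regularly takes longer than the interval, this loop never sleeps and the samples drift anyway, which is usually a sign the interval is too aggressive for the collector.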

the downside to SNMP is that you can’t always get what you want in the way you want it. the majority of the time you can probably get your data, but it’s the unusual condition that always ends up being important. to deal with this, most systems have a plugin architecture that allows you, or more commonly a vendor, to handle those special cases.

regardless of which collection method is being used (TL1, SNMP, port scans, telnet scraping, etc), there are some things that need to be kept in mind :

1. polling interval. does the data need to be collected every second, every 5 minutes, once a day?
critical data points, such as interface state, probably need to be polled as often as practical in order to give whoever is watching that data the information they need to act on a failure state.
data that isn’t critical, such as disk usage and bandwidth stats, is usually graphed in some fashion and should be polled at whatever rate is equal to the smallest logical display interval. so if your graphs show bandwidth in 5 minute chunks, then you don’t need to poll every minute. in fact, some routers and switches will give you a 5 minute SNMP OID for that kind of thing, which is super helpful.
always keep in mind that you need to make sure that the system collecting the data can do so in a timely manner and doesn’t overload itself with so much work that it can’t reliably get the data needed in the defined time frame.
the collection target should also not be polled so often that the polling interferes with the primary task of the device.

2. storing the data. once your system has grabbed the data needed, it needs to be kept in such a way that it can be retrieved and analyzed. many systems do the collection and analysis as part of the same action, which means that there isn’t a way to manually review the data. if you never get to look at the raw data, then you’re putting a lot of faith in your monitoring system. if nothing else, you need to be able to see the collection trends to make sure you’re at least getting good data at good intervals.

3. can you see behind the curtain? many monitoring vendors provide some basic plugins for various equipment vendors and models, which is great, but you need to be able to know what those plugins are doing in order to better plan how to take full advantage of them. if a plugin is collecting more data than you actually need, then you’re wasting cycles and bandwidth for at least 2 systems.
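two of the checks above are easy to rough out in code. this is only a sketch under my own assumptions: the function names, the serial-polling model, and the 50% gap tolerance are all made up for illustration and don’t come from any particular monitoring product. the first function estimates whether a collector can finish its polls inside the interval (point 1), and the second looks through stored sample timestamps for holes in the collection (point 2).

```python
def collector_utilization(num_targets, seconds_per_poll, interval, workers=1):
    """Fraction of each polling interval the collector spends polling.

    Assumes every poll takes about `seconds_per_poll` and the work is
    spread evenly over `workers`. A result above 1.0 means the polls
    cannot all finish before the next round is due.
    """
    if interval <= 0 or workers < 1:
        raise ValueError("interval must be positive and workers at least 1")
    return (num_targets * seconds_per_poll) / (workers * interval)


def find_gaps(timestamps, interval, tolerance=0.5):
    """Return (previous, current) timestamp pairs that are too far apart.

    Two consecutive samples count as a gap when they are more than
    interval * (1 + tolerance) seconds apart. The tolerance is an
    arbitrary illustrative default.
    """
    limit = interval * (1 + tolerance)
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# 200 targets at half a second each on a 5 minute interval: about a third busy
print(collector_utilization(200, 0.5, 300))

# samples every 300 seconds with two polls missing after t=600
print(find_gaps([0, 300, 600, 1500, 1800], 300))
```

a utilization creeping toward 1.0 is the cue to lengthen the interval, drop data points, or add collectors, and any gaps that show up in stored data are worth chasing before trusting the graphs built on it.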
