Monitoring Sucks. Latency Sucks More.

This post is part 1 of 3 in a series on monitoring scalability.

The Monitoring Sucks conversation has been an awesome step in the right direction for defining a common language for describing monitoring concepts and documenting the available tools.

The reasons monitoring sucks are many and varied - poor configuration, poor visualisation, poor scalability, poor data retention. There is a lot of well-founded hate for the available tools (some of which I have authored!)

I want to take a closer look into a problem I grapple with on a daily basis as part of my job: monitoring scalability.

What do I mean by “monitoring scalability”?

For a monitoring system to be considered scalable, I would expect it to execute large volumes of monitoring checks under a variety of conditions (good + bad) with a consistent throughput.

Why is monitoring scalability a problem? Are there deeper, subtler problems that underlie monitoring system architectures in general?

Nagios handles 6000+ checks like a champ. I say this with a completely straight face. At Bulletproof, we have several large instances of Nagios that have been running for years with thousands of checks.

There is one caveat, and it is pretty massive - if your monitoring checks take a variable amount of time to return a result (they have high check latency), you will get reduced throughput, and thus your incident response times become unreliable. This leads to a lack of trust in the monitoring system, which can kill you operationally if you don't nip it in the bud.

Let’s work through some of the scalability problems by looking at a hypothetical and simplified monitoring system:

Imagine you have a very small monitoring system with 150 checks running. The type of check is irrelevant (in Nagios parlance they could be “service” or “host” checks); each check is scheduled to be executed every 300 seconds (for the sake of argument, let's just ignore that a 300 second interval is way too long).

To simplify this hypothetical, let’s posit that all the checks are running serially in a single thread, and each check takes 1 second to execute and return a result.

At this point, you’re golden. All checks are executing in 150 seconds, well within the 300 second window.

Now double the number of checks to 300.

That’s one check executed every second. All the checks execute within the execution window, but things are getting tight, and you don’t have any spare capacity to add more checks.

Worst of all: what happens when the check response time goes up to 2 seconds? Now you can only execute 50% of your checks within the 300 second window, and your monitoring is 300 seconds “behind”.
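The arithmetic behind all three scenarios can be sketched in a few lines. This is a quick Python illustration of the hypothetical above, not a real scheduler - the function names are mine:

```python
# Numbers from the hypothetical scenario above; a sketch, not a benchmark.
CHECK_INTERVAL = 300  # seconds between scheduled runs of each check

def cycle_time(num_checks: int, seconds_per_check: float) -> float:
    """Time for a single-threaded scheduler to run every check once."""
    return num_checks * seconds_per_check

def lag(num_checks: int, seconds_per_check: float) -> float:
    """How far behind schedule each cycle falls (0 if it fits the window)."""
    return max(0.0, cycle_time(num_checks, seconds_per_check) - CHECK_INTERVAL)

print(lag(150, 1))  # 0.0   - 150s cycle, comfortably inside the window
print(lag(300, 1))  # 0.0   - 300s cycle, exactly at capacity
print(lag(300, 2))  # 300.0 - 600s cycle, monitoring is 300 seconds "behind"
```

Note that the lag compounds: every cycle that overruns pushes the next one back further, so the scheduler never catches up on its own.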

Now you’re suffering from check latency - a world of pain filled with plenty of insidious edge cases to cut yourself on.

My favourite edge case is when a service failure occurs just after a check has executed and returned an OK result. In the above hypothetical, you would be unaware of the failure for 599 seconds. In a monitoring system suffering heavily from check latency, that period of time could be much, much longer. Furthermore, the problem is amplified when you're using soft/hard states to eliminate false-positives.
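To make that 599 seconds concrete, here's a small Python sketch of the worst-case detection delay. In the overloaded scenario above, each check effectively runs every 600 seconds (300 checks at 2 seconds each), and the `hard_state_attempts` parameter is my stand-in for a Nagios-style max_check_attempts setting:

```python
def worst_case_detection(effective_interval: float, check_duration: float,
                         hard_state_attempts: int = 1) -> float:
    """Seconds a failure can go unnoticed when it begins just after an OK result.

    With soft/hard states, the failure must be observed hard_state_attempts
    times in a row before an alert fires, multiplying the blind spot.
    """
    return effective_interval * hard_state_attempts - check_duration

# 300 checks at 2s each: the 300s interval stretches to an effective 600s.
print(worst_case_detection(600, 1))     # 599.0 - the figure quoted above
print(worst_case_detection(600, 1, 3))  # 1799.0 - requiring 3 consecutive failures
```

With three attempts required before a hard state, a failure can sit undetected for nearly half an hour - all before anyone is even paged.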

The above hypothetical is a tad contrived, as pretty much all monitoring systems execute checks in parallel, but it illustrates the scalability challenges even in a simple scenario.

Executing checks in parallel certainly helps stave off this type of bottleneck, but as you increase the number of checks and the parallelism of your monitoring system, you start running into operating system limits: context-switching overhead, memory exhaustion (if you use a language that gobbles up memory), or simply running out of CPU time to execute all the checks.
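One common way to get parallelism without unbounded process or thread growth is a fixed-size worker pool. This Python sketch uses a capped pool; `run_check`, the check names, and the pool size of 20 are all hypothetical:

```python
# A sketch of bounded parallel check execution using a worker pool.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_check(name: str) -> tuple:
    # In a real system this would shell out to a plugin or probe a service.
    return (name, "OK")

CHECKS = ["check-%d" % i for i in range(300)]

# Capping the pool avoids launching 300 simultaneous checks and slamming
# into the context-switch, memory, and CPU limits described above.
with ThreadPoolExecutor(max_workers=20) as pool:
    futures = [pool.submit(run_check, c) for c in CHECKS]
    results = dict(f.result() for f in as_completed(futures))

print(len(results))  # 300 results, but at most 20 checks in flight at once
```

The trade-off is the same as before, just shifted: with 20 workers and 2-second checks, 300 checks still take 30 seconds per cycle, and the pool size itself becomes the knob you tune against the host's resources.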

The other enormous gotcha is that when catastrophic failures happen, it's very common to have monitoring checks that simply time out because various network resources between your monitoring server and the machine you're checking are down or misbehaving.
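This is why per-check timeouts matter: a hung check should cost you a bounded amount of time, not stall the whole scheduling loop. A minimal Python sketch, with deliberately tiny durations so it runs quickly - the timeout value you'd use in production is a tuning decision, not something prescribed here:

```python
# Sketch: bound a check with a timeout so an unreachable host can't
# stall the scheduler indefinitely.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def slow_check() -> str:
    time.sleep(0.5)  # simulates a check hanging on a dead network path
    return "OK"

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_check)
    try:
        status = future.result(timeout=0.1)  # give up after the deadline
    except TimeoutError:
        status = "UNKNOWN (timed out)"

print(status)  # UNKNOWN (timed out)
```

Note the catch: during a catastrophic failure, hundreds of checks may each burn their full timeout, so even bounded timeouts inflate the cycle time exactly when you need fresh data most.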

The last thing you want in an emergency situation is delayed alerts that may hide the root cause or feed you bad information.

So how do you mitigate check latency problems to improve your monitoring scalability?

In the next post in this series, we’ll look at monitoring systems as a type of complex web application, and investigate some performance optimisation techniques you can apply.