Monitoring dependency graphs are fine for small environments, but they are not a good fit for nested complex environments, like those that make up modern web infrastructures.

Directed acyclic graphs (DAGs) are a very alluring data structure for representing monitoring relationships, but they fall down once you start using them at scale:

  • There is an assumption that a direct causal link exists between the nodes an edge connects. It’s very tempting to believe that you can trace a failure from one end of the graph to the other. But failures in one part of a complex system all too often have weird effects, and induce failures in other components of the same system that are quite far removed from one another.

  • Complex systems are almost impossible to model. With time and an endless stream of money you can sufficiently model the failure modes within complex systems in isolation, but fully understanding and predicting how complex systems interact and relate with one another is almost impossible. The only way to model this effectively is to have a closed system with very few external dependencies, which is the opposite of the situation every web operations team is in.

  • The cost of maintaining the graph is non-trivial. You could employ a team of extremely skilled engineers to understand and model the relationships between each component in your infrastructure, but their work would never be done. On top of that, given the sustained growth most organisations experience, whatever you model will likely change within 12-18 months. Fundamentally, it would not provide a good return on investment.

check_check

This isn’t a new problem.

Jordan Sissel wrote a great post as part of Sysadvent almost three years ago about check_check.

His approach is simple and elegant:

  • Configure checks in Nagios, but configure a contact that drops the alerts
  • Read Nagios’s state out of a file + parse it
  • Aggregate the checks by regex, and alert if a percentage is critical
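
As a rough illustration of the last step, here is a minimal Python sketch. It is not check_check's actual code: the `Check` record, the `host/service` matching key, and the function names are all hypothetical, and it assumes the Nagios state has already been parsed into records. The only detail borrowed from Nagios is the plugin convention that state 2 means CRITICAL.

```python
import re
from collections import namedtuple

# Hypothetical record for one parsed check result. The field names are
# illustrative and are not taken from check_check or Nagios.
Check = namedtuple("Check", ["host", "service", "state"])

CRITICAL = 2  # Nagios plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN


def critical_ratio(checks, pattern):
    """Return the fraction of checks matching `pattern` that are CRITICAL."""
    regex = re.compile(pattern)
    matched = [c for c in checks if regex.search(f"{c.host}/{c.service}")]
    if not matched:
        return 0.0  # nothing matched, so there is nothing to alert on
    critical = sum(1 for c in matched if c.state == CRITICAL)
    return critical / len(matched)


def should_alert(checks, pattern, threshold):
    """True when at least `threshold` (0.0 to 1.0) of matching checks are CRITICAL."""
    return critical_ratio(checks, pattern) >= threshold


sample = [
    Check("web1", "HTTP", 2),
    Check("web2", "HTTP", 2),
    Check("web3", "HTTP", 0),
    Check("db1", "MySQL", 2),
]

# Two of the three web HTTP checks are critical, so a 50% threshold trips
# while a 70% threshold does not.
alert_at_half = should_alert(sample, r"^web\d+/HTTP$", 0.5)
alert_at_seventy = should_alert(sample, r"^web\d+/HTTP$", 0.7)
```

The point of the design is that one alert covers the whole group of matching checks, rather than one alert per failing check.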

It’s a godsend for people who manage large Nagios instances, but it starts falling down if you’ve got multiple independent Nagios instances (shards) that are checking the same thing.

You still end up with a situation where every one of your shards alerts when the shared entity they’re monitoring fails.

Flapjack

This is the concrete use case behind why we’re rebooting Flapjack - we want to stream the event data from all Nagios shards to Flapjack, and do smart things around notification.

The approach we’re looking at in Flapjack is pretty similar to check_check - set thresholds on the number of failure events we see for particular entities - but we want to take it one step further.

Entities in Flapjack can be tagged, so we automatically create “failure counters” for each of those tags.

When checks on those entities fail, we simply increment each of those failure counters. Then we can set thresholds on each of those counters (based on absolute value like > 30 entities, or percentage like > 70% of entities), and perform intelligent actions like:

  • Send a single notification to on-call with a summary of the failing tag counters
  • Rate limit alerts and provide summary alerts to customers
  • Wake up the relevant owners of the infrastructure that is failing
  • Trigger a “workaround engine” that attempts to resolve the problem in an automated way

The result of this is that on-call aren’t overloaded with alerts, we involve the people who can fix the problems sooner, and it all works across multiple event sources.
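
To make the tag counter idea concrete, here is a minimal Python sketch. It is not Flapjack's implementation or API: the `TagCounters` class, its method names, and the example tags are hypothetical. One assumption goes beyond the description above: each counter is kept as a set of failing entities rather than a bare integer, so the same failure reported by two shards is only counted once.

```python
from collections import defaultdict


class TagCounters:
    """Hypothetical sketch of per-tag failure counters (not Flapjack's API)."""

    def __init__(self, entity_tags):
        # entity_tags: dict mapping entity name -> set of tags
        self.entity_tags = entity_tags
        self.failing = defaultdict(set)  # tag -> entities currently failing
        self.totals = defaultdict(int)   # tag -> number of entities with that tag
        for tags in entity_tags.values():
            for tag in tags:
                self.totals[tag] += 1

    def record(self, entity, ok):
        """Apply one event from any source, for example any Nagios shard."""
        for tag in self.entity_tags.get(entity, ()):
            if ok:
                self.failing[tag].discard(entity)
            else:
                self.failing[tag].add(entity)

    def breached(self, max_count=None, max_fraction=None):
        """Return {tag: failing count} for tags over either threshold."""
        result = {}
        for tag, entities in self.failing.items():
            count = len(entities)
            fraction = count / self.totals[tag]
            over_count = max_count is not None and count > max_count
            over_fraction = max_fraction is not None and fraction > max_fraction
            if over_count or over_fraction:
                result[tag] = count
        return result


counters = TagCounters({
    "web1": {"web", "syd"},
    "web2": {"web", "syd"},
    "web3": {"web", "mel"},
    "db1": {"db", "syd"},
})
counters.record("web1", ok=False)
counters.record("web1", ok=False)  # same failure seen by a second shard
counters.record("web2", ok=False)

# Two of three "web" entities and two of three "syd" entities are failing,
# so both tags are over a 60% threshold.
over_threshold = counters.breached(max_fraction=0.6)
```

Whatever consumes `breached()` can then decide what to do with the result: send one summary notification, page the owners of a tag, or rate limit customer alerts.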

One note on complexity: I am not convinced that automated systems that try to derive meaning from relationships in a graph (or even tag counters) and present the operator with a conclusion are going to provide anything more than a best-guess abstraction of the problem. In the real world, that best guess is most likely wrong.

We need to provide better rollup capabilities that give the operator a summarised view of the current facts, and allow the operator to do their own investigation untainted by the assumptions of the programmer who wrote the inaccurate heuristic.

Flapjack’s (and check_check’s) approach also minimises the maintenance burden, as tagging entities becomes the only thing required to build smarter aggregation + analysis tools. This information can easily be pulled out of configuration management.

More metadata == more granularity == faster resolution times.