Rethinking monitoring post-Monitorama PDX

The two key take-home messages from Monitorama PDX are these:

  • We are mistakenly developing monitoring tools for ops people, not the developers who need them most.
  • Our over-reliance on strip charts as a method for visualising numerical data is hurting ops as a craft.

Death to strip charts

Two years ago when I received my hard copy of William S. Cleveland’s The Elements of Graphing Data, I eagerly opened it and scoured its pages for content on how to better visualise time series data. There were a few interesting methods for improving the visual perception of data in strip charts (banking to 45˚, limiting the colour palette), but to my disappointment no more than ~30 pages of the 297-page tome addressed visualising time series data.
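For what it’s worth, banking to 45˚ is easy to approximate with everyday tooling. Here’s a minimal sketch of the idea (my own, assuming Python and matplotlib, not anything from the book): pick the plot’s aspect ratio so the median segment slope renders at roughly 45˚.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(42)
    t = np.arange(200)
    y = np.cumsum(rng.normal(size=t.size))          # a synthetic time series

    slopes = np.abs(np.diff(y) / np.diff(t))        # per-segment slopes in data units
    median_slope = np.median(slopes[slopes > 0])

    fig, ax = plt.subplots()
    ax.plot(t, y)
    # set_aspect(a) draws one y unit 'a' times as long as one x unit, so a segment
    # with data slope m renders at screen slope m * a; choosing a = 1/m banks the
    # median segment to roughly 45 degrees.
    ax.set_aspect(1.0 / median_slope)
    plt.show()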

In his talk at Monitorama PDX, Neil Gunther takes a whirlwind tour of the data ops look at every day, visualised with tools other than time series strip charts. By ignoring time, looking at the distribution, and applying various transformations to the axes (linear-log, log-log, log-linear), Neil demonstrates how you can expose patterns in data (like power law distributions) that are simply invisible in the traditional linear time series form.
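To make that concrete, here’s a minimal sketch (mine, not Neil’s) in Python/matplotlib: the same heavy-tailed data looks like noise on a strip chart, but plotting its distribution (here the complementary CDF) on log-log axes makes the power law show up as a roughly straight line.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    latencies = rng.pareto(a=2.0, size=100_000) + 1.0   # heavy-tailed synthetic "latencies"

    # Empirical complementary CDF, P(X > x)
    x = np.sort(latencies)
    ccdf = 1.0 - np.arange(1, x.size + 1) / x.size

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].plot(latencies[:2000])                      # the familiar strip chart view
    axes[0].set_title("strip chart (linear axes)")
    axes[1].loglog(x[:-1], ccdf[:-1])                   # the distribution on log-log axes
    axes[1].set_title("CCDF on log-log: power law as a straight line")
    plt.tight_layout()
    plt.show()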

Neil’s talk explains why Cleveland’s Elements devotes so little space to time series strip charts: they are a limited tool that obscures any data that doesn’t fit a very narrow set of patterns.

Strip charts are the PHP Hammer of monitoring.

(Image: the infamous PHP hammer)

We have been conditioned to accept strip charts as the One True Way to visualise time series data, and it is fucking us over without us even realising it. Time series strip charts are the single biggest engineering problem holding monitoring as a craft back.

It’s time to shape our future by building new tools and extending existing ones to visualise data in different ways.

This requires improving the statistical and visual literacy of both the tool developers (who provide the generalised tools for visualising the data) and the people who use those graphs to solve problems.

There is another problem here, which Rashid Khan touched on during his time on stage: many people use Logstash & Kibana directly and avoid numerical metric summaries of log data, because those numbers are just an abstraction of an abstraction.

The textual logs provide far more insight into what’s happening than numbers:

Stacktrace or GTFO
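A contrived Python sketch of the difference (names are illustrative): the error counter only ever tells you that something broke; the log line carries the stack trace that tells you what broke and where.

    import logging

    logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("checkout")

    error_count = 0   # the metric: an abstraction of an abstraction

    def charge(order_id):
        # stand-in for a real payment call that blows up
        raise ValueError("card declined for order %s" % order_id)

    try:
        charge("ord-1234")
    except Exception:
        error_count += 1                    # all the metric will ever say is "1"
        log.exception("charge failed")      # the log keeps the full stack trace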

As an ops team, you have one job: provide a platform that app developers can wire up logs, checks, and metrics to (in that order), and expose that data back to them in a meaningful way for analysis later on.

The real target audience for monitoring (or, How You Can Make Money In The Monitoring Space)

Adrian Cockcroft made a great point in his keynote: we are building monitoring tools for ops people, not the developers who need them most. This is a piercing insight that fundamentally reframes the problem domain for people building monitoring tools.

Building monitoring tools and clean integration points for developers is the most important thing we can do if we want to actually improve the quality of people’s lives on a day to day basis.

Help your developers ship a Sensu config & checks as part of their app. You can even leverage existing testing frameworks they are already familiar with.
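A minimal sketch of what that could look like, assuming the app repo carries a Sensu-1.x-style JSON check definition at monitoring/checks.json (a path I’ve made up) and the developers already use pytest — the same test run that gates the app also gates its monitoring:

    import json
    from pathlib import Path

    CHECKS_FILE = Path("monitoring/checks.json")   # hypothetical location inside the app repo

    def load_checks():
        return json.loads(CHECKS_FILE.read_text())["checks"]

    def test_every_check_has_a_command_and_a_sane_interval():
        for name, check in load_checks().items():
            assert check.get("command"), "%s has no command" % name
            assert 10 <= check.get("interval", 0) <= 3600, "%s has an unreasonable interval" % name

    def test_every_check_has_subscribers():
        for name, check in load_checks().items():
            assert check.get("subscribers"), "%s will never be scheduled anywhere" % name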

This puts the power & responsibility of monitoring applications into the hands of the people closest to the app. Ops still provide value: delivering a scalable monitoring platform, and working with developers to instrument & check their apps. This reduces duplication of effort and frees up time to educate non-ops people on how to get the best insight into what’s happening.

There is still room for monitoring tools as we’ve traditionally used them, but that’s mostly limited to providing insight into the platforms & environments that ops provide for developers to run their applications on.

The majority of application developers don’t care about the internal functioning of the platform though, and they almost certainly don’t want to be alerted about problems within the platform, other than “the platform has problems, we’re working on fixing them”.

The money in the monitoring industry is in building tools that eliminate the friction stopping developers from getting better insight into how their applications perform and behave in the real world. New Relic is living proof of this, but the market is far larger than what New Relic currently caters to, and it dwarfs the ops tools market because developers are much more willing to adopt new tools, experiment, and tinker.

If you can provide a method for developers to expose application state meaningfully while lowering the barrier to entry, they will jump at it.
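Often the lowest possible barrier is just an HTTP endpoint that dumps application state as JSON, which any check (or curious human) can poll. A standard-library-only Python sketch — the endpoint name and fields are illustrative, not from any particular tool:

    import json
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    START_TIME = time.time()

    class StateHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/state":
                self.send_error(404)
                return
            # In a real app these values would come from the app itself, not a global.
            body = json.dumps({
                "uptime_seconds": round(time.time() - START_TIME, 1),
                "healthy": True,
            }).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), StateHandler).serve_forever()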

So are you building monitoring tools for the future?