Rebooting Flapjack

This is the first time I’ve actually blogged about Flapjack.

The past

In 2008 I started talking with Matt Moor about building a “next generation monitoring system” that would be simple to set up and operate, and provide obvious paths to scale.

In 2009 I started hacking on Flapjack while backpacking, and by mid 2009 I had a working prototype running basic monitoring checks.

The fundamental idea was simple: decouple the check execution from the alerting and notification, and use message queues to distribute the check execution across lots of machines.
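That decoupling can be sketched as a producer/consumer pair around a shared queue. This is a minimal illustration only, not Flapjack’s code: an in-memory deque stands in for the message queue (in practice a Redis list or similar broker), and the event field names are assumptions for the sake of the example.

```python
import json
from collections import deque
from time import time

# Stand-in for a shared message queue (e.g. a Redis list); in the real
# system many check workers push onto it from many machines.
events = deque()

def execute_check(entity, check):
    """A check worker: run the check and enqueue the result as an event."""
    state = "critical"  # pretend the check failed
    event = {"entity": entity, "check": check, "state": state,
             "summary": "connection refused", "time": int(time())}
    events.append(json.dumps(event))

def process_events():
    """The processor: consume events and decide which warrant alerts,
    completely independently of how or where the checks ran."""
    alerts = []
    while events:
        event = json.loads(events.popleft())
        if event["state"] in ("warning", "critical"):
            alerts.append(f"{event['entity']}:{event['check']} is {event['state']}")
    return alerts

execute_check("app-01.example.com", "http")
print(process_events())  # → ['app-01.example.com:http is critical']
```

Because the two halves only share the queue, you can scale check execution across lots of machines without the alerting side knowing or caring.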

It seems simple and obvious now, but at the time nobody was really talking about doing this, so Flapjack gathered a reasonable amount of attention relatively quickly after I started talking about it at conferences.

2010 rolled around and, due to some fairly significant life changes, I was unable to maintain a good development pace or hold the attention gained by talking at conferences. Pretty much all of my open source projects suffered, and within the space of 12 months Flapjack’s development ground to a halt.

There were plenty of other interesting projects like Sensu that were achieving similar goals excellently, so while winding up Flapjack was a source of bitter personal disappointment, it was offset by seeing other people doing awesome work in the monitoring space.

The present

Mid last year, an interesting problem arose at work:

In a modern “monitoring system”, how do you:

  • Notify a dynamic group of people on a variety of media based on monitoring events? Bulletproof has thousands of people who may need to be notified by our monitoring system, depending on which monitoring checks are failing. While the thresholds on each monitoring check are universal, each of these people can have different notification settings based on the time of day or week, the type of service affected, or the severity of the failure.

  • Dampen or roll up common events so on-call isn’t bombarded during outages? When one system deep in the stack fails, it has significant flow-on effects on everything that depends on it. This generally manifests as thousands (or, in extremely bad cases, tens of thousands) of alerts being sent to on-call in a very short period of time (<60 seconds). Obviously this is bad; we want to detect cases like these and wake only the people involved in the incident response process.

  • Do the above in an API driven way? We need to solve both problems in a way that works in a multitenant environment with strong segregation between customers, and integrates with an existing monitoring & customer self-service stack.
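To make the first of those points concrete, here’s a minimal sketch (in Python, for illustration, not Flapjack’s actual Ruby implementation) of per-contact notification rules matched on severity and time of day. The rule structure, contact names, and media are all invented for this example; Flapjack’s real rule schema differs.

```python
from datetime import datetime

# Hypothetical per-contact notification rules. The check thresholds are
# universal, but each contact filters on severity and time of day.
RULES = [
    {"contact": "ada", "media": "sms", "severities": {"critical"},
     "hours": range(0, 24)},   # ada is on call around the clock
    {"contact": "brian", "media": "email", "severities": {"warning", "critical"},
     "hours": range(9, 18)},   # brian only during business hours
]

def recipients(severity, at):
    """Return the (contact, media) pairs whose rules match this event."""
    return [(r["contact"], r["media"]) for r in RULES
            if severity in r["severities"] and at.hour in r["hours"]]

# A 3am critical should reach only the 24x7 on-call contact:
print(recipients("critical", datetime(2013, 4, 10, 3, 0)))  # → [('ada', 'sms')]
```

The important property is that the rules live with the contacts, not with the checks, which is what lets thousands of people each carry their own settings against a shared set of thresholds.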

Thus, Flapjack was rebooted with a significantly altered focus:

  • Event processing
  • Correlation & rollup
  • API driven configuration
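As a sketch of what “correlation & rollup” means in practice: once a burst of related alerts crosses a threshold, collapse it into a single summary notification rather than paging once per event. This is illustrative only; Flapjack’s real rollup logic is per-contact and stateful.

```python
def rollup(alerts, threshold=3):
    """Collapse an alert burst into one summary once it crosses the
    threshold, so on-call gets a single page instead of thousands.
    (Illustrative only; not Flapjack's actual rollup implementation.)"""
    if len(alerts) <= threshold:
        return alerts  # small enough to send individually
    return [f"{len(alerts)} alerts firing, e.g. {alerts[0]}"]

# A failure deep in the stack fans out into hundreds of check failures:
burst = [f"web-{n:02d}:http is critical" for n in range(250)]
print(rollup(burst))  # one summary line instead of 250 pages
```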

We’ve been actively working on the reboot since July last year, and have been sending alerts from Flapjack to customers since January.

We’re developing Flapjack as a fully Open Source, composable platform that you can adapt and build on to suit your organisation’s needs, by hooking it into your existing check execution infrastructure (we ship a Nagios event processor) and your self-service and provisioning automation tools.

Because we care deeply about people integrating Flapjack into their existing environments, we have invested a lot of time and energy into writing quality documentation that covers working with the API, debugging production issues, and the data structures used behind the scenes. That’s all on top of the usage documentation, of course.

Flapjack is built on Redis, and funnily enough R.I. Pienaar did a post earlier this year that investigates using Redis to solve the same problem in an extremely similar way. R.I.’s post provides a good primer on some of the thinking behind Flapjack, so I recommend giving it a read.

The future

Fundamentally, Flapjack is trying to plug a notification hole in the monitoring ecosystem that I don’t believe is being adequately addressed by other tools, but the key to doing this is to play nicely with other tools and build a composable pipeline.

The above is merely a glimpse of Flapjack that leaves quite a few questions unanswered (e.g. “Why aren’t you using $x feature of $y check execution engine to do roll-up?”, “Do Flapjack and Riemann play nicely with one another?”), so stay tuned for more:

more waffles