The How and Why of Flapjack
"I've got a working Nagios (and maybe Pagerduty) setup at the moment. Why and how should I go about integrating Flapjack?"
Flapjack will be immediately useful to you if:
- You want to identify failures faster by rolling up your alerts across multiple monitoring systems.
- You monitor infrastructures that have multiple teams responsible for keeping them up.
- Your monitoring infrastructure is multitenant, and each customer has a bespoke alerting strategy.
- You want to dip your toe in the water and try alternative check execution engines like Sensu, Icinga, or cron in parallel to Nagios.
The double-edged Nagios sword (or why monolithic monitoring systems hurt you in the long run)
One short-term advantage of Nagios is how much it can do for you out of the box. Check execution, notification, downtime, acknowledgements, and escalations can all be handled by Nagios if you invest a small amount of time understanding how to configure it.
This short-term advantage can turn into a long-term disadvantage: because Nagios does so much out of the box, you heavily invest in a single tool that does everything for you. When you hit cases that fit outside the scope of what Nagios can do for you easily, the cost of migrating away from Nagios can be quite high.
The biggest killer when migrating away from Nagios is that you have to either:
- Find a replacement tool that matches Nagios's feature set very closely (or at least the subset of features you're using)
- Find a collection of tools that integrate well with one another
Given the composable monitoring world we live in, the second option is preferable, but not always possible.
Flapjack aims to be a flexible notification system that handles:
- Alert routing (determining who should receive alerts based on interest, time of day, scheduled maintenance, etc)
- Alert summarisation (with per-user, per-media summary thresholds)
- Your standard operational tasks (setting scheduled maintenance, acknowledgements, etc)
Flapjack sits downstream of your check execution engine (like Nagios, Sensu, Icinga, or cron), processing events to determine if a problem has been detected, who should know about the problem, and how they should be told.
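The three questions above (has a problem been detected, who should know, and how should they be told) can be pictured with a minimal sketch. This is not Flapjack's implementation or data model: the `Event` and `Contact` classes, the `route` function, and the "entity:check" key format are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One check result from an upstream check execution engine (hypothetical shape)."""
    entity: str   # the thing being monitored, e.g. "web-01"
    check: str    # the check name, e.g. "HTTP"
    state: str    # "ok", "warning", or "critical"


@dataclass
class Contact:
    """A person and the media they can be reached on (hypothetical shape)."""
    name: str
    media: dict                                # medium name -> address
    checks: set = field(default_factory=set)   # "entity:check" keys of interest


def is_problem(event):
    # Assumption: anything other than "ok" counts as a problem.
    return event.state != "ok"


def route(event, contacts, in_maintenance=frozenset()):
    """Return (contact name, medium, address) tuples for a single event."""
    key = f"{event.entity}:{event.check}"
    # No notifications for healthy checks or checks in scheduled maintenance.
    if not is_problem(event) or key in in_maintenance:
        return []
    return [
        (contact.name, medium, address)
        for contact in contacts
        if key in contact.checks
        for medium, address in contact.media.items()
    ]
```

The check execution engine only has to produce events. Everything about who is interested and how they are reached lives downstream, which is why the engine can be swapped without touching notification settings.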
A team player (composable monitoring pipelines)
Flapjack aims to be composable - you should be able to easily integrate it with your existing monitoring check execution infrastructure.
There are three immediate benefits you get from Flapjack's composability:
- You can experiment with different check execution engines without needing to reconfigure notification settings across all of them. This helps you be more responsive to customer demands and try out new tools without completely writing off your existing monitoring infrastructure.
- You can scale your Nagios horizontally. Nagios can be really performant if you don't use notifications, acknowledgements, downtime, or parenting. Nagios executes static groups of checks efficiently, so scale the machines you run Nagios on horizontally and use Flapjack to aggregate events from all your Nagios instances and send alerts.
- You can run multiple check execution engines in production. Nagios is well suited to some monitoring tasks. Sensu is well suited to others. Flapjack makes it easy for you to use both, and keep your notification settings configured in one place.
While you're getting familiar with how Flapjack and Nagios play together, you can even do a side-by-side comparison of how Flapjack and Nagios alert by configuring them both to alert at the same time.
If you work for a service provider, you almost certainly run shared infrastructure to monitor the status of the services you sell your customers.
Exposing the observed state to customers from your monitoring system can be a real challenge - most monitoring tools simply aren't built for this particular requirement.
Bulletproof spearheaded the reboot of Flapjack because multitenancy is a core requirement of Bulletproof's monitoring platform - we run a shared monitoring platform, and we have very strict requirements about segregating customers and their data from one another.
To achieve this, we keep the security model in Flapjack extraordinarily simple - if you can authenticate against Flapjack's HTTP APIs, you can perform any action.
Flapjack pushes authorization complexity to the consumer, because every organisation is going to have very particular security requirements, and Flapjack wants to make zero assumptions about what those requirements are going to be.
If you're serious about exposing this sort of data and functionality to your customers, you will need to do some grunt work to provide it through whatever customer portals you already run. We provide a very extensive Ruby API client to help you integrate with Flapjack, and Bulletproof has been using this API client in production for over a year in our customer portal.
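As a rough illustration of what "authorization in the consumer" can look like, here is a minimal sketch of a portal-side gateway that checks tenancy before forwarding a request. Everything in it is an assumption made for illustration: `PortalGateway`, `flapjack_call`, and the customer-to-entity mapping are hypothetical, and the callable stands in for whatever API client your portal actually uses.

```python
class PortalGateway:
    """Enforces per-customer access before anything reaches Flapjack (hypothetical)."""

    def __init__(self, flapjack_call, entities_by_customer):
        # flapjack_call: hypothetical callable that performs the real API request.
        # entities_by_customer: customer id -> set of entity names they own.
        self._call = flapjack_call
        self._entities = entities_by_customer

    def _authorize(self, customer, entity):
        if entity not in self._entities.get(customer, set()):
            raise PermissionError(f"{customer} may not access {entity}")

    def acknowledge(self, customer, entity, check):
        # The authorization decision is made here, in the portal.
        # The backend call itself is fully trusted once it is made.
        self._authorize(customer, entity)
        return self._call("acknowledge", entity, check)
```

The portal holds the only credentials that can reach Flapjack, so customers never talk to Flapjack directly and each organisation can encode its own rules in `_authorize`.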
One shortfall of Flapjack right now is that we perhaps take multitenancy a little too seriously - the Flapjack user experience for single tenant users still needs a little work.
In particular, there are some inconsistencies and behaviours in the Flapjack APIs that make sense in a multitenant context, but are pretty surprising for single tenant use cases.
One other killer feature of Flapjack that's worth mentioning: updating any setting via Flapjack's HTTP API doesn't require any sort of restart of Flapjack.
This is a significant improvement over tools like Nagios that require full restarts for simple notification changes.
Flapjack is useful for organisations that segregate responsibility for different systems across different teams, in much the same way it is useful in a multitenant context. For example:
- Your organisation has two on-call rosters - one for customer alerts, and one for internal infrastructure alerts.
- Your organisation is product focused, with dedicated teams owning the availability of those products end-to-end.
You can feed all your events into Flapjack so operationally you have a single aggregated source of truth of monitoring state, and use the same multitenancy features to create custom alerting rules for individual teams.
We're starting to experiment with this at Bulletproof as development teams start owning the availability of products end-to-end.
Probably the most powerful Flapjack feature is alert summarisation. Alerts can be summarised on a per-media, per-contact basis.
What on earth does that mean?
Contacts (people) are associated with checks. When a check alerts, a contact can be notified on multiple media (Email, SMS, Jabber, PagerDuty).
Each medium has a summarisation threshold that lets a contact specify when alerts should be "rolled up", so the contact doesn't receive multiple alerts during incidents.
If you've used PagerDuty before, you've almost certainly experienced similar behaviour when you have multiple alerts assigned to you at a time.
Summarisation is particularly useful in multitenant environments where contacts only care about a subset of things being monitored, and don't want to be overwhelmed with alerts for each individual thing that has broken.
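The per-media threshold described above can be sketched in a few lines. This is an illustration under stated assumptions rather than Flapjack's actual logic: the `build_notifications` function is hypothetical, and so is the choice to roll up once the number of failing checks reaches the threshold (rather than exceeds it).

```python
def build_notifications(failing_checks, thresholds):
    """Decide, per medium, whether to send individual alerts or one summary.

    failing_checks: list of "entity:check" strings alerting for one contact.
    thresholds: medium name -> rollup threshold, or None to never roll up.
    Returns a dict of medium name -> list of message strings.
    """
    notifications = {}
    count = len(failing_checks)
    for medium, threshold in thresholds.items():
        # Assumption: roll up when the failure count reaches the threshold.
        if threshold is not None and count >= threshold:
            summary = f"{count} checks are failing: " + ", ".join(sorted(failing_checks))
            notifications[medium] = [summary]
        else:
            notifications[medium] = [f"{check} is failing" for check in failing_checks]
    return notifications
```

With three failing checks, a contact whose SMS threshold is 2 and whose email threshold is 5 would get a single summary SMS but three separate emails, so a noisy incident doesn't flood the more intrusive medium.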
To generalise, large numbers of alerts indicate either a total system failure of the thing being monitored or false positives in the monitoring system.
In either case, nobody wants to receive a deluge of alerts.
Mitigating the effects of monitoring false positives is especially important when you consider how failures in the monitoring pipeline cascade into surrounding stages of the pipeline.
Monitoring alert recipients generally don't care about the extent of a monitoring system failure (how many things are failing simultaneously, as evidenced by an alert for each thing); they care that the monitoring system can't be trusted right now (at least until the underlying problem is fixed).
What Flapjack is not
- Check execution engine. Sensu, Nagios, and cron already do a fantastic job of this. You still need to configure a tool to run your monitoring checks - Flapjack just processes events generated elsewhere and does notification magic.
- PagerDuty replacement. Flapjack and PagerDuty complement one another. PagerDuty has excellent on-call scheduling and escalation support, which is something that Flapjack doesn't try to go near. Flapjack can trigger alerts in PagerDuty.
At Bulletproof we use Flapjack to process events from Nagios, and work out if our on-call or customers should be notified about state changes. Our customers receive alerts directly from Flapjack, and our on-call receive alerts from PagerDuty, via Flapjack's PagerDuty gateway.
The Flapjack PagerDuty gateway has a neat feature: it polls the PagerDuty API for alerts it knows are unacknowledged, and will update Flapjack's state if it detects alerts have been acknowledged in PagerDuty.
This is super useful for eliminating the double handling of alerts, where an on-call engineer acknowledges an alert in PagerDuty, and then has to go and acknowledge the alert in Nagios.
In the Flapjack world, the on-call engineer acknowledges the alert in PagerDuty, Flapjack notices the acknowledgement in PagerDuty, and Flapjack updates its own state.
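The acknowledgement flow above amounts to a polling loop. The sketch below shows the shape of that loop only: `sync_acknowledgements`, `fetch_status`, and `acknowledge` are hypothetical names, the status string "acknowledged" is an assumption, and no real PagerDuty or Flapjack API call is shown.

```python
def sync_acknowledgements(unacknowledged, fetch_status, acknowledge):
    """Mirror acknowledgements made in the paging service into local state.

    unacknowledged: mutable set of alert ids believed to be unacknowledged.
    fetch_status: hypothetical callable, alert id -> remote status string.
    acknowledge: hypothetical callable that records the acknowledgement locally.
    Returns the list of alert ids that were synced on this pass.
    """
    synced = []
    # Only alerts still believed unacknowledged are polled, which keeps
    # the number of remote requests proportional to open incidents.
    for alert_id in sorted(unacknowledged):
        if fetch_status(alert_id) == "acknowledged":
            acknowledge(alert_id)
            synced.append(alert_id)
    # Stop polling alerts that are now acknowledged.
    unacknowledged.difference_update(synced)
    return synced
```

Run on a timer, a loop like this means the on-call engineer acknowledges once, in the paging tool, and the monitoring state catches up on the next pass.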
How do I get started?
Follow the quickstart guide to get Flapjack running locally using Vagrant.
The quickstart guide will take you through basic Flapjack configuration, pushing check results from Nagios into Flapjack, and configuring contacts and entities.
Examining the Puppet module will give you a good starting point for rolling out Flapjack into your monitoring environment.
Where to next?
We're gearing up to release Flapjack 1.0.