Flapjack, heartbeating, and one-off events
Flapjack assumes a constant stream of events from upstream event producers, and this is fundamental to Flapjack’s design.
Flapjack asks a fundamentally different question to other notification and alerting systems: “How long has a check been failing for?” Flapjack cares about the elapsed time, not the number of observed failures.
Alerting systems that depend on counting the number of observed failures to decide whether to send an alert suffer problems when the observation interval is variable.
Take this scenario with a LAMP stack running in a large cluster:
- Nagios detects a single failure in the database layer. It increments the soft state by 1.
- Nagios detects every service that depends on the database layer is now failing due to timeouts. It increments the soft state by 1 for each of these services.
- The timeouts for each of these services cause the next recheck of the original database layer check to be delayed (e.g. after an additional 3 minutes). When it is eventually checked, its soft state is incremented.
- The timeouts for the other services get bigger, causing the database layer check to be delayed further.
- Eventually the original database layer check enters a hard state and alerts.
The above example is a little exaggerated, but the problem with using observed failure counts as a basis for alerting is clear.
Control theory gives us a lot of practical tools for modelling scenarios like these, and the answer is never pretty: if you rely on the number of times you’ve observed a failure to determine if you need to send an alert, your alerting effectiveness is limited by any latency in your checkers.
By looking at how long something has been failing for, Flapjack limits the effects of latency in the observation interval, and provides alerts to humans about problems faster.
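To make the difference concrete, here is a minimal sketch comparing the two rules on the same run of failures. It is not Flapjack's or Nagios's actual logic: the function names, the threshold of 3 failures, and the sample timestamps are all assumptions chosen for illustration, and the 30 second duration is borrowed from the figure quoted later in this post.

```python
def count_based_alert_time(failure_times, threshold=3):
    """Time at which a count-based rule alerts, or None if it never does.

    failure_times: ascending timestamps (seconds) of consecutive failures.
    threshold: illustrative number of observed failures needed to alert.
    """
    if len(failure_times) < threshold:
        return None
    return failure_times[threshold - 1]


def elapsed_based_alert_time(failure_times, min_duration=30):
    """Time at which an elapsed-time rule alerts, or None if it never does.

    Alerts at the first observation showing the check has been failing
    for at least min_duration seconds since the first failure.
    """
    if not failure_times:
        return None
    first = failure_times[0]
    for t in failure_times:
        if t - first >= min_duration:
            return t
    return None


# A healthy checker observes the failure every 10 seconds.
regular = [0, 10, 20, 30, 40]
# A congested checker has its rechecks delayed by timeouts elsewhere.
delayed = [0, 180, 420]

for name, times in (("regular", regular), ("delayed", delayed)):
    print(name,
          "count-based:", count_based_alert_time(times),
          "elapsed-based:", elapsed_based_alert_time(times))
```

With the delayed observations, the count-based rule has to wait for the third observation, while the elapsed-time rule can alert as soon as the second observation shows the failure has persisted long enough.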
This leads to an interesting question though - can I send a one-off event to Flapjack?
Technically you can - Flapjack just won’t notify anyone until:
- Two events (or more) have been received by Flapjack.
- 30 seconds have elapsed between the first event received by Flapjack and the latest.
This is due to the aforementioned heartbeating behaviour that is baked into Flapjack’s design.
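The two conditions above can be sketched as a single predicate. This is only an illustration of the rule as described here, assuming both conditions must hold together; `ready_to_notify` and its parameters are hypothetical names, not part of Flapjack.

```python
def ready_to_notify(event_times, min_elapsed=30.0):
    """Return True once both conditions described above hold.

    event_times: timestamps (seconds) of events for one check, in the
    order they were received.
    """
    if len(event_times) < 2:
        # A single one-off event can never satisfy the rule on its own.
        return False
    return event_times[-1] - event_times[0] >= min_elapsed


print(ready_to_notify([0.0]))              # one-off event
print(ready_to_notify([0.0, 10.0]))        # two events, too close together
print(ready_to_notify([0.0, 10.0, 30.0]))  # 30 seconds between first and latest
```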
As more people use Flapjack, we are seeing increasing demand for one-off event submission. There are two key cases:
- Arbitrary event submission via HTTP
- Routing CloudWatch alarms via Flapjack
One way to solve this would be to build a bridge that accepts one-off events, and periodically dispatches a cached value for these events to Flapjack.
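A rough sketch of what such a bridge might look like follows. Everything in it is an assumption made for illustration: the `OneOffBridge` class, the event field names (`check`, `state`, `summary`, `time`), the 10 second interval, and the `send` callable, which stands in for whatever mechanism actually delivers an event to Flapjack.

```python
import threading
import time


class OneOffBridge:
    """Caches the latest one-off event per check and resends it periodically."""

    def __init__(self, send, interval=10.0):
        self._send = send          # hypothetical: delivers one event to Flapjack
        self._interval = interval  # seconds between re-dispatches (illustrative)
        self._latest = {}          # check name -> most recent cached event
        self._lock = threading.Lock()

    def submit(self, check, state, summary=""):
        """Accept a one-off event, replacing any cached value for that check."""
        with self._lock:
            self._latest[check] = {"check": check, "state": state, "summary": summary}

    def dispatch_once(self, now=None):
        """Send every cached event with a fresh timestamp; return how many were sent."""
        now = time.time() if now is None else now
        with self._lock:
            snapshot = list(self._latest.values())
        for event in snapshot:
            self._send({**event, "time": int(now)})
        return len(snapshot)

    def run_forever(self):
        """Turn the cached one-off events into a steady stream."""
        while True:
            self.dispatch_once()
            time.sleep(self._interval)


# Usage: collect dispatched events in a list instead of sending them anywhere.
sent = []
bridge = OneOffBridge(send=sent.append, interval=10.0)
bridge.submit("app-1:cloudwatch-cpu", "critical", "CPU alarm fired")
bridge.dispatch_once(now=0)
bridge.dispatch_once(now=10)
print(len(sent), "events dispatched from a single one-off submission")
```

The cached value keeps being resent until a newer event for the same check replaces it, so a real bridge would also need some policy for expiring stale entries.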
Flapjack will definitely close this gap in the future.