If alarms are more often false than true, a culture emerges on the unit in that staff may delay response to alarms, especially when staff are engaged in other patient care activities, and more important critical alarms may be missed.
One of the most difficult challenges we face in the operations field right now is “alert fatigue”. Alert fatigue is a term the tech industry has borrowed from a similar term used in the medical industry, “alarm fatigue” - a phenomenon of people being so desensitised to the alarm noise from monitors that they fail to notice or react in time.
In an on-call scenario, I posit two main factors contribute to alert fatigue:
- The accuracy of the alert.
- The volume of alerts received by the operator.
Alert fatigue can manifest itself in many ways:
- Operators delaying a response to an alert they’ve seen before because “it’ll clear itself”.
- Impaired reasoning and creeping bias, due to physical or mental fatigue.
- Poor decision making during incidents, due to an overload of alerts.
Earlier this year a story popped up about a Boston hospital that silenced alarms to improve the standard of care. It sounded counter-intuitive, but in the context of the alert fatigue problems we’re facing, I wanted to get a better understanding of what they actually did, and how we could potentially apply it to our domain.
When rolling out new cardiac telemetry monitoring equipment in 2008 to all adult inpatient clinical units at Boston Medical Center (BMC), a Telemetry Task Force (TTF) was convened to develop standards for patient monitoring. The TTF was a multidisciplinary team drawing people from senior management, cardiologists, physicians, nursing practitioners and directors, clinical instructors, and a quality and patient safety specialist.
BMC’s cardiac telemetry monitoring equipment provides configurable limit alarms (we know this as “thresholding”), with four alarm levels: message, advisory, warning, and crisis. These alarms can be either visual or auditory.
As part of the rollout, TTF members observed nursing staff responding to alarms from equipment configured with factory default settings. The TTF members observed that alarms were frequently ignored by nursing staff, but for a good reason - the alarms would self-reset and stop firing.
To frame this behaviour from an operations perspective, this is like a Nagios check passing a threshold for a CRITICAL alert to fire, the on-call team member receiving the alert, sitting on it for a few minutes, and the alert recovering all by itself.
When the nursing staff were questioned about this behaviour, they reported that more often than not the alarms self-reset, and answering every alarm pulled them away from looking after patients.
Fast forward 3 years, and in 2011 BMC started an Alarm Management Quality Improvement Project that experimented with multiple approaches to reducing alert fatigue:
- Widen the acceptable thresholds for patient vitals so alarms would fire less often.
- Eliminate all levels of alarms except “message” and “crisis”. Crisis alarms would emit an audible alert, while message history would build up on the unit’s screen for the next nurse to review.
- Alarms that had the ability to self-reset (recover on their own) were disabled.
- If false positives were detected, nursing staff were required to tune the alarms as they occurred.
The approaches were applied over the course of 6 weeks, with buy-in from all levels of staff, most importantly with nursing staff who were responding to the alarms.
Results from the study were clear:
- The number of total audible alarms decreased by 89%. This should come as no surprise, given the alarms were tuned to not fire as often.
- The number of code blues decreased by 50%. This indicates that eliminating the constant alarms reduced workload and freed up nurses to provide more proactive care, and that alarms for the precursor problems to code blues were more likely to get a response.
- The number of Rapid Response Team activations on the unit stayed constant. It’s reasonable to assert that the operational effectiveness of the unit was maintained even though alarms fired less often.
- Anonymous surveys of nurses on the unit showed an increase in satisfaction with the level of noise on the unit, with night staff reporting they “kept going back to the central station to reassure themselves that the central station was working”. One anonymous comment stated “I feel so much less drained going home at the end of my shift”.
At the conclusion of the study, the nursing staff requested that the previous alarming defaults not be restored.
The approach outlined in the study is pretty simple: change the default alarm thresholds so they don’t fire unless action must be taken, and give the operator the power to tune the alarms if the alarm is inaccurate.
Alerts should exist in two states: nothing is wrong, and the world is on fire.
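To make the two-level idea concrete for an operations setting, here is a minimal sketch of routing alerts when only “message” and “crisis” exist. The `Severity` and `Router` names are my own inventions for illustration, not part of the study or of any monitoring tool. A crisis interrupts the on-call person, and everything else accumulates quietly for review.

```python
from enum import Enum


class Severity(Enum):
    MESSAGE = "message"  # held for later review, never interrupts anyone
    CRISIS = "crisis"    # the world is on fire: interrupt the on-call person


class Router:
    """Hypothetical two-level alert router (illustrative only)."""

    def __init__(self):
        self.pages = []       # alerts that would page the on-call person
        self.review_log = []  # alerts that build up for the next review

    def route(self, severity, text):
        """Send crisis alerts to the pager and everything else to the log."""
        if severity is Severity.CRISIS:
            self.pages.append(text)
        else:
            self.review_log.append(text)


router = Router()
router.route(Severity.CRISIS, "site cannot take orders")
router.route(Severity.MESSAGE, "one core pegged on a batch server")
```

There is deliberately no middle tier here. Anything not worth a page lands in the review log, much like the message history building up on the unit’s screen for the next nurse.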
But the elimination of alarms that have the ability to recover is a really surprising solution. Can we apply that to monitoring in an operations domain?
Two obvious methods to make this happen:
- Remove checks that have the ability to self-recover.
- Redesign checks so they can’t self-recover.
For redesigning checks, I’ve yet to encounter a check designed to not recover when thresholds are no longer exceeded. That would be very surprising alerting behaviour to stumble upon in the wild, and most operators, myself included, would likely attribute it to a bug in the check. Socially, a check redesign like that would break many fundamental assumptions operators have about their tools.
From a technical perspective, a non-recovering check would require the check having some sort of memory about its previous states and acknowledgements, or at least have the alerting mechanism do this. This approach is totally possible in the realm of more modern tools, but is not in any way commonplace.
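As a rough sketch of what that memory could look like, assume a simple numeric threshold check. The `LatchingAlert` class below is hypothetical and not taken from any existing monitoring tool. It stays raised once the threshold is crossed, and only an explicit acknowledgement clears it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LatchingAlert:
    """Hypothetical alert that does not recover on its own."""

    threshold: float
    raised: bool = False
    acknowledged_by: Optional[str] = None
    history: List[float] = field(default_factory=list)  # every observed value

    def observe(self, value: float) -> bool:
        """Record a check result and latch the alert if it exceeds the threshold."""
        self.history.append(value)
        if value > self.threshold:
            self.raised = True
            self.acknowledged_by = None
        # Deliberately no "else" branch: a value dropping back under the
        # threshold does not clear the alert.
        return self.raised

    def acknowledge(self, operator: str) -> None:
        """Only a human acknowledgement clears the latched state."""
        self.acknowledged_by = operator
        self.raised = False


alert = LatchingAlert(threshold=90.0)
alert.observe(95.0)  # crosses the threshold, alert is raised
alert.observe(40.0)  # back to normal, but the alert stays raised
alert.acknowledge("on-call engineer")
```

Whether the latch lives in the check itself or in the alerting layer is a design choice. This sketch keeps it in one object purely for brevity.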
Regardless of the problems above, I believe adopting this approach in an operations domain would be achievable and I would love to see data and stories from teams who try it.
As for removing checks, that’s actually pretty sane! The typical CPU/memory/disk utilisation alerts engineers receive can be handy diagnostics during outages, but in almost all modern environments they are terrible indicators for anomalous behaviour, let alone something you want to wake someone up about. If my site can take orders, why should I be woken up about a core being pegged on a server I’ve never heard of?
Looking deeper though, the point of removing alarms that self-recover is to eliminate the background noise of alarms that are ignorable. This ensures every alarm that fires actually requires action, and is either investigated and acted upon or tuned.
This is only possible if the volume of alerts is low enough, or there are enough people to distribute the load of responding to alerts. Ops teams that meet both of these criteria do exist, but they’re in the minority.
Another consideration is that checks for operations teams are cheap, but physical equipment for nurses is not. I can go and provision a couple of thousand new monitoring checks in a few minutes and have them alert me on my phone, and do all that without even leaving my couch. There are capacity constraints on the telemetry monitoring in hospitals - budgets limit the number of potential alarms that can be deployed and thus fire, and a person physically needs to move and act on a check to silence it.
Also consider that hospitals are dealing with pets, not cattle. Each patient is a genuine snowflake, and the monitoring equipment has to be tuned for size, weight, and health. We are extremely lucky in that most modern infrastructure is built from standard, similarly sized components. The approach outlined in this study may be more applicable to organisations that are still looking after pets.
There are constraints and variations in physical systems like hospitals that simply don’t apply to the technical systems we’re nurturing, but there is a commonality between the fields: thinking about the purpose of the alarm, and how people are expected to react to it firing, is an extremely important consideration when designing the interaction.
One interesting anecdote from the study was that extracting alarm data was a barrier to entry, as manufacturers often don’t provide mechanisms to easily extract data from their telemetry units. We have a natural advantage in operations in that we tend to own our monitoring systems end-to-end and can extract that data, or have access to APIs to easily gather the data.
The key takeaway the authors of the article make clear is this:
Review of actual alarm data, as well as observations regarding how nursing staff interact with cardiac monitor alarms, is necessary to craft meaningful quality alarm initiatives for decreasing the burden of audible alarms and clinical alarm fatigue.
Regardless of whether you think any of the methods employed above make sense in the field of operations, it’s difficult to argue against collecting and analysing alerting data.
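One simple analysis to start with is how often each check fires and then recovers without anybody touching it. The sketch below assumes you can export alert history as records with a check name and an acknowledgement flag. The `AlertRecord` shape and the `self_recovery_rates` function are assumptions of mine, not the export format of any particular monitoring system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRecord:
    """One alert that fired and later recovered (hypothetical export format)."""

    check: str          # name of the check that fired
    acknowledged: bool  # did a human acknowledge or act on it before recovery?


def self_recovery_rates(records):
    """Return {check: fraction of its alerts that recovered with no human action}."""
    fired = defaultdict(int)
    self_recovered = defaultdict(int)
    for record in records:
        fired[record.check] += 1
        if not record.acknowledged:
            self_recovered[record.check] += 1
    return {check: self_recovered[check] / fired[check] for check in fired}


def noisiest_checks(records, min_rate=0.5):
    """List checks whose self-recovery rate is at least min_rate, worst first."""
    rates = self_recovery_rates(records)
    noisy = [(check, rate) for check, rate in rates.items() if rate >= min_rate]
    return sorted(noisy, key=lambda pair: pair[1], reverse=True)
```

Checks near the top of `noisiest_checks` are the first candidates for tuning or removal. The 0.5 cut-off is an arbitrary starting point, not a recommendation from the study.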
The thing that excites me so much about this study is there is actual data to back the proposed techniques up! This is something we really lack in the field of operations, and it would be amazing to see more companies publish studies analysing different alert management techniques.
Finally, the authors lay out some recommendations other institutions can use to reduce alarm fatigue without requiring additional resources or technology.
To adapt them to the field of operations:
- Establish a multidisciplinary alerting work group (dev, ops, management).
- Extract and analyse alerting data from your monitoring system.
- Eliminate alerts that are inactionable, or are likely to recover themselves.
- Standardise default thresholds, but allow local variations to be made by people responding to the alerts.
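The last recommendation maps neatly onto configuration. Here is a minimal sketch of standard defaults with per-host overrides. The check names, host names, and threshold values are all made up for illustration.

```python
# Organisation-wide defaults (illustrative values only).
DEFAULT_THRESHOLDS = {
    "disk_used_pct": 90.0,
    "error_rate_pct": 5.0,
}

# Local variations, set by the people who respond to the alerts.
LOCAL_OVERRIDES = {
    "batch-01": {"disk_used_pct": 97.0},  # hypothetical host that runs hot by design
}


def thresholds_for(host, defaults=DEFAULT_THRESHOLDS, overrides=LOCAL_OVERRIDES):
    """Return the defaults with any local overrides for this host applied."""
    local = overrides.get(host, {})
    unknown = set(local) - set(defaults)
    if unknown:
        # An override for a check with no default is probably a typo.
        raise KeyError(f"no default threshold for: {sorted(unknown)}")
    return {**defaults, **local}


web_thresholds = thresholds_for("web-01")      # no overrides, gets the defaults
batch_thresholds = thresholds_for("batch-01")  # disk threshold raised locally
```

Keeping overrides in one reviewable place means local tuning stays visible to the whole work group instead of silently drifting host by host.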