The questions that should have been asked after the RMS outage

Routine error caused NSW Roads and Maritime outage

The NSW Roads and Maritime Services’ driver and vehicle registration service suffered a full-day outage on Wednesday due to human error during a routine exercise, an initial review has determined.

Insiders told iTnews that the outage, which affected services for most of Wednesday, was triggered by an error made by a database administrator employed by outsourced IT supplier, Fujitsu.

The technician had made changes to what was assumed to be the test environment for RMS’ Driver and Vehicle system (DRIVES), which processes some 25 million transactions a year, only to discover the changes were being made to a production system, iTnews was told.

There is a lot to digest here, so let’s start our analysis with two simple and innocuous words in the opening paragraph: “routine exercise”.

If the exercise is routine, how often is that release routine followed? Once a day? Once a week? Once a month? Once a year?

The article provides some insight into this:

“The activity on Tuesday night was carried out ahead of a standard quarterly release of the DRIVES system,” a spokesman for Service NSW said.

The statement suggests releases are being done every 3 months. By government standards, RMS’s schedules are likely quite progressive, given their public track record for smooth IT operations and an innovative IT procurement strategy.

But it’s still a large delta between releases. Given many organisations are moving to daily and even hourly releases to reduce the risk of failures, three-month release cycles are relatively archaic.

Think about everything that can change in three months. There will be big changes that are low impact. There will be small changes that are high impact. There will be everything in between. There will be changes that are determined to be low impact & low risk, but in hindsight will be considered high impact & high risk.

Now think about releasing all those changes at once. The longer you wait between releases, the greater the risk something will go wrong.

What are the organisational factors that make three-monthly releases acceptable? Are people within RMS aware of the pain their current practices are causing? Are those who are aware of the pain in management, or are they on the front line?

Are either of these groups pushing for more frequent releases? Do any of those people have the power within RMS to make those changes happen? What are the channels for driving organisational change and process improvement?

If those channels don’t exist, or when those channels fail, how does the organisation react? How do people within the organisation react?

These are all interesting questions that will go a long way to uncovering the extent of the problem, and give people a starting point to address those problems.

But that’s rarely the type of coverage you get in the media. The typical narrative in the media, and in organisations that aren’t learning from their mistakes, is very simple:

Attribute blame to bad apples.
Find the scapegoat.
Excise the perpetrator.

Discussion of these complex issues focuses exclusively on “human error”.

But what happens if you replace the human in that situation with another? The answer is almost certainly going to be “exactly the same outcome”.

Humans are just actors in a complex system, or complex systems nested in other complex systems. They are locally rational. They are doing their best based on the information they have at hand. Nobody wakes up in the morning with the intention of causing an accident.

We have the facts (or at least a media-manipulated interpretation of them). We know the outcome of the “bad actions”. Our knowledge of the outcome (a day-long outage) taints any interpretation of the events from an “objective” point of view. This is hindsight bias in its rawest form.

We use hindsight to pass judgement on people who were closest to the thing that went wrong. Dekker says “hindsight converts a once vague, unlikely future into an immediate, certain path”. After an accident we draw a line in the sand and say:

“There! They crossed the line! They should have known better!”

But in the fog of war those actors in a complex system were making what they considered to be rational decisions based on the information they had at hand. Before the accident, the line is a band of grey, and people in the system are drifting within that band. After an accident, that band of grey rapidly consolidates into a thin dark line that the people closest to the accident are conveniently on the other side of.

Being mindful of our own hindsight bias, it’s critical we start at the very beginning: What was the operator thinking when they were performing this routine exercise? What information did the operator have at hand that informed their judgements?

How many times had the operator performed that “routine” exercise?

If they had performed that exercise before, what was different about this instance of the exercise?

If they hadn’t performed that exercise before, what training had they received? What support were they provided? Was someone double-checking every item on their checklist?

What types of behaviour does the organisation incentivise? Does it reward people who take risks, improve processes, and improvise to get the job done? Or does it reward people who don’t rock the boat, who shut up and do their work — no questions asked?

If the incentive is not to rock the boat, do people have the power to put up their hands when their workload is becoming unmanageable? How does the organisation react to people who identify and raise problems with workload? Is their workload managed to an achievable level, or are they told to suck it up?

Are the powers to flag excessive workload extended to people who work with the organisation, but aren’t necessarily members of the organisation — like contractors, or outsourced suppliers?

And most importantly of all: after an accident, what message do the words and actions of those in management send to employees? What effect do they have on supplier relationships?

The message in RMS’s case is pretty clear:

“If you make a mistake we’ll publicly hang you out to dry.”

A culture that prioritises blaming individuals over identifying and improving systemic flaws is not a culture I would choose to be part of.


Lindsay Holmwood
