
Lindsay Holmwood is an engineering manager living in the Australian Blue Mountains. He is the creator of Visage, Flapjack & cucumber-nagios, and organises the Sydney DevOps Meetup.

It's not a promotion - it's a career change

The biggest misconception engineers have when thinking about moving into management is they think it's a promotion.

Management is not a promotion. It is a career change.

If you want to do your leadership job effectively, you will be exercising a vastly different set of skills on a daily basis from those you exercise as an engineer. Skills you likely haven't developed, and don't even know you're missing.

Your job is not to be an engineer. Your job is not to be a manager. Your job is to be a multiplier.

You exist to remove roadblocks and eliminate interruptions for the people you work with.

You exist to listen to people (not just hear them!), to build relationships and trust, to deliver bad news, to resolve conflict in a just way.

You exist to think about the bigger picture, ask provoking and sometimes difficult questions, and relate the big picture back to something meaningful, tangible, and actionable to the team.

You exist to advocate for the team, to promote the group and individual achievements, to gaze into unconstructive criticism and see the underlying motivations, and sometimes even to give up control and make sacrifices you are uncomfortable with or disagree with.

You exist to make systemic improvements with the help of the people you work with.

Does this sound like engineering work?

The truth of the matter is this: you are woefully unprepared for a career in management, and you are unaware of how badly unprepared you are.

There are two main contributing factors that have put you in this position:

  • The Dunning-Kruger effect
  • Systemic undervaluation of non-technical skills in tech

Systemic undervaluation of non-technical skills

Technical skills are emphasised above all in tech. It is part of our mythology.

Technical skill is the dominant currency within our industry. It is highly valued and sought after. If you haven't read all the posts on the Hacker News front page today, or you're not running the latest releases of all your software, or you haven't recently pulled all-nighter coding sessions to ship that killer feature, you're falling behind bro.

Naturally, for an industry so unhealthily focused on technical skills, they tend to be the deciding factor for hiring people.

Gaps in non-technical skills like teamwork, conflict resolution, listening, and co-ordination are often overlooked and excused away in engineering circles. These skills are seen as less important than technical skills, so organisations frequently compensate for their absence, minimise their effects, and downplay their importance.

If you really want to see where our industry places value, just think about the terms "hard" and "soft" we use to describe and differentiate between the two groups of skills. What sort of connotations do each of those words have, and what implicit biases do they feed into and trigger?

If you're an engineer thinking about going into management, you are a product of this culture.

There are a handful of organisations that create cultural incentives to develop these non-technical skills in their engineers, but these organisations are, by and large, unicorns.

And if you want to lead people, you're in for a rude shock if you haven't developed those non-technical skills.

Because guess what - you can't lead people in the same way you write code or manage machines. If you could, management would have been automated a long time ago.

The Dunning-Kruger effect

The identification of the Dunning-Kruger effect is one of the most interesting developments of modern psychology, and one of the most revelatory insights available to our industry.

In 1999, David Dunning and Justin Kruger published the results of their experiments on people's ability to self-assess their own competence.

Dunning and Kruger proposed that, for a given skill, incompetent people will:

  • tend to overestimate their own level of skill
  • fail to recognize genuine skill in others
  • fail to recognize the extremity of their inadequacy
  • recognize and acknowledge their own previous lack of skill, if they are exposed to training for that skill

If you've had a career in tech without any leadership responsibilities, you've likely had thoughts like:

  • "Managing people can't be that hard."
  • "My boss has no idea what they are doing."
  • "I could do a better job than them."

Congratulations! You've been partaking in the Dunning-Kruger effect.

The bad news: Dunning-Kruger is exacerbated by the systemic devaluation of non-technical skills within tech.

The good news: soon after going into leadership, the scope of your lack of skill, and unawareness of your lack of skill, will become plain for you to see.

Also, everyone else around you will see it.

Multiplied impact

This is the heart of the matter: by being elevated into a position of leadership, you are being granted a responsibility over people's happiness and wellbeing.

Mistakes made due to lack of skill and awareness can cause people irreparable damage and create emotional scar tissue that will stay with people for years, if not decades.

Conversely, by developing skills and helping your team row in the same direction, you can also create positive experiences that will last with people their entire careers.

The people in your team will spend a lot of time looking up at you - far more than you realise. Everything you do will be analysed and dissected, sometimes fairly, sometimes not.

If you're not willing to push yourself, develop the skills, and fully embrace the career change, maybe you should stay on the engineering career development track.

But it's not all doom and gloom.

By striving to be a multiplier, the effects of the hard work you and the team put in can be far greater than what you can achieve individually.

You only reap the benefits of this if you shift your measure of job satisfaction from your own performance to the group's.

"Real work"

Many engineers who change into management feel disheartened because they're not getting as much "real work" done.

If you dig deeper, "real work" is always linked to their own individual performance. Of course you're not going to perform to the same level as an engineer - you're working towards the same goals, but you are each working on fundamentally different tasks to get there!

Focusing on your own skills and performance can be a tough loop to break out of - individual achievement is bound up in the same mythology as technical skills - it's something highly prized and disproportionately incentivised in much of our culture.

If you've decided to undertake this career change, it's important to treat your lack of skill as a learning opportunity, develop a hunger for learning and developing your skills, routinely reflect on your experiences, and compare yourself to your cohort.

None of these things are easy - I struggled with feelings of inadequacy in meeting the obligations of my job for the first 3 years of being in a leadership position. Once I worked out that I was tying job satisfaction to engineering performance, it was a long and hard struggle to re-link my definition of success to group performance.

If everything you've read here hasn't scared you, and you've committed to the change to management, there are three key things you can do to start skilling up:

  1. Do professional training.
  2. Get mentors.
  3. Educate yourself.


Tech has a bias against professional training that doesn't come from universities. Engineering organisations tend to value on-the-job experience over training and certification. A big part of that comes from a lot of technical training outside of universities being a little bit shit.

Our experience of bad training in the technical domain doesn't apply to management - there is plenty of quality short-course management training available, which other industries have been funding the development of for the last couple of decades.

In Australia, AIM provide several courses ranging from introductory to advanced management and leadership development.

Do your research, ask around, find what people would recommend, then make the case for work to pay for it.


Find other people in your organisation you can talk to about the challenges you're facing developing your non-technical skills. This person doesn't necessarily need to be your boss - in fact, diversifying your mentors is important for developing the skill of entertaining multiple perspectives on the same situation.

If you're lucky, your organisation assigns new managers a buddy to act as a mentor, but professional development maturity for management skills varies widely across organisations.

If you don't have anyone in your organisation to act as a mentor or buddy, then seek out old bosses and see if they'd be willing to chat for half an hour every few weeks.

I have semi-regular breakfast catchups with a former boss from very early on in my career that are always a breath of fresh air - to the point where my wife actively encourages me to catch up because of how much less stressed I am afterwards.

Another option is to find other people in your organisation also going through the same transition from engineer to manager as you. You won't have all the answers, but developing a safe space to bounce ideas around and talk about problems you're struggling with is a useful tool.


I spend a lot of time reading and sharing articles on management and leadership - far more time than I spend on any technical content.

At the very beginning of your journey it's difficult to identify what is good and what is bad, what is gold and what is fluff. I have read a lot of crappy advice, but four years into the journey my barometer for advice is becoming more accurate.

Also, be careful of only reading things that reinforce your existing biases and leadership knowledge. If there's a particular article I disagree with, I'll often spend five minutes jotting down a brief critique. I'll either get better at articulating to others why that idea is flawed, or my perspective will become more nuanced.

It's also pertinent to note how the article made you feel, and reflect for a moment on what about the article made you feel that way.

If you're scratching your head for where to start, I recommend Bob Sutton's "The No Asshole Rule", then "Good Boss, Bad Boss". Sutton's work is rooted in evidence-based management (he's not talking out of his arse - he's been to literally thousands of companies and observed how they work), but he writes in an engaging and entertaining way.

Almost four years into my career change, I can say that it's been worth it. It has not been easy. I have made plenty of mistakes, have prioritised incorrectly, and hurt people accidentally.

But so has everyone else. Nobody else has this nailed. Even the best managers are constantly learning, adapting, improving.

Think about it this way: you're going to accumulate leadership skills faster than people who made the change long ago, because you're starting with nothing. What separates you from them is the nuance and tact that comes with experience, something you can develop by sticking with your new career.

This will only happen when you fully commit to your new career, and you change your definition for success to meet your new responsibilities as a manager.


Applying cardiac alarm management techniques to your on-call

If alarms are more often false than true, a culture emerges on the unit in that staff may delay response to alarms, especially when staff are engaged in other patient care activities, and more important critical alarms may be missed.

One of the most difficult challenges we face in the operations field right now is "alert fatigue". Alert fatigue is a term the tech industry has borrowed from a similar term used in the medical industry, "alarm fatigue" - a phenomenon of people being so desensitised to the alarm noise from monitors that they fail to notice or react in time.

In an on-call scenario, I posit two main factors contribute to alert fatigue:

  • The accuracy of the alert.
  • The volume of alerts received by the operator.

Alert fatigue can manifest itself in many ways:

  • Operators delaying a response to an alert they've seen before because "it'll clear itself".
  • Impaired reasoning and creeping bias, due to physical or mental fatigue.
  • Poor decision making during incidents, due to an overload of alerts.

Earlier this year a story popped up about a Boston hospital that silenced alarms to improve the standard of care. It sounded counter-intuitive, but in the context of the alert fatigue problems we're facing, I wanted to get a better understanding of what they actually did, and how we could potentially apply it to our domain.


Rethinking monitoring post-Monitorama PDX

The two key take home messages from Monitorama PDX are this:

  • We are mistakenly developing monitoring tools for ops people, not the developers who need them most.
  • Our over-reliance on strip charts as a method for visualising numerical data is hurting ops as a craft.

Death to strip charts

Two years ago when I received my hard copy of William S. Cleveland's The Elements of Graphing Data, I eagerly opened it and scoured its pages for content on how to better visualise time series data. There were a few interesting methods to improve the visual perception of data in strip charts (banking to 45˚, limiting the colour palette), but to my disappointment there were no more than ~30 pages in the 297 page tome that addressed visualising time series data.


Flapjack, heartbeating, and one off events

Flapjack assumes a constant stream of events from upstream event producers, and this is fundamental to Flapjack's design.


Flapjack asks a fundamentally different question to other notification and alerting systems: "How long has a check been failing for?". Flapjack cares about the elapsed time, not the number of observed failures.

Alerting systems that depend on counting the number of observed failures to decide whether to send an alert suffer problems when the observation interval is variable.
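The difference between the two approaches can be sketched minimally. This is an illustration of the idea only, with hypothetical names - it is not Flapjack's actual API:

```ruby
# A count-based alerter fires after N consecutive failed observations,
# so its time-to-alert stretches with the check interval.
def count_based_alert?(consecutive_failures, threshold: 3)
  consecutive_failures >= threshold
end

# A time-based alerter compares the elapsed time since the first
# observed failure, so the delay before alerting stays constant no
# matter how often events arrive.
def time_based_alert?(seconds_failing, threshold_seconds: 180)
  seconds_failing >= threshold_seconds
end

# With 1-minute checks, both fire roughly 3 minutes into an outage.
# Stretch the interval to 5 minutes and the count-based approach now
# takes 15 minutes to page anyone, while the time-based one still
# fires after 3 minutes of continuous failure.
```

The same outage, observed less frequently, produces fewer failure events - which is exactly why counting events gives you a variable alerting delay, while measuring elapsed time does not.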

Take this scenario with a LAMP stack running in a large cluster:


Data driven alerting with Flapjack + Puppet + Hiera

On Monday I gave a talk at Puppet Camp Sydney 2014 about managing Flapjack data (specifically: contacts, notification rules) with Puppet + Hiera.

There was a live demo of some new Puppet types I've written to manage the data within Flapjack. This is incredibly useful if you want to configure how your on-call are notified from within Puppet.



The code is a little rough around the edges, but you can try it out in the puppet-type branch on vagrant-flapjack.


The questions that should have been asked after the RMS outage

Routine error caused NSW Roads and Maritime outage

The NSW Roads and Maritime Services' driver and vehicle registration service suffered a full-day outage on Wednesday due to human error during a routine exercise, an initial review has determined.

Insiders told ITnews that the outage, which affected services for most of Wednesday, was triggered by an error made by a database administrator employed by outsourced IT supplier, Fujitsu.

The technician had made changes to what was assumed to be the test environment for RMS' Driver and Vehicle system (DRIVES), which processes some 25 million transactions a year, only to discover the changes were being made to a production system, iTnews was told.

There is a lot to digest here, so let's start our analysis with two simple and innocuous words in the opening paragraph: "routine exercise".


The How and Why of Flapjack

In October @rodjek asked on Twitter:

"I've got a working Nagios (and maybe Pagerduty) setup at the moment. Why and how should I go about integrating Flapjack?"

Flapjack will be immediately useful to you if:

  • You want to identify failures faster by rolling up your alerts across multiple monitoring systems.
  • You monitor infrastructures that have multiple teams responsible for keeping them up.
  • Your monitoring infrastructure is multitenant, and each customer has a bespoke alerting strategy.
  • You want to dip your toe in the water and try alternative check execution engines like Sensu, Icinga, or cron in parallel to Nagios.


CLI testing with RSpec and Cucumber-less Aruba

At Bulletproof, we are increasingly finding home brew systems tools are critical to delivering services to customers.

These tools are generally wrapping a collection of libraries and other general Open Source tools to solve specific business problems, like automating a service delivery pipeline.

Traditionally these systems tools tend to lack good tests (or simply any tests) for a number of reasons:

  • The tools are quick and dirty
  • The tools model business processes that are often in flux
  • The tools are written by systems administrators

Sysadmins don't necessarily have a strong background in software development. They are likely proficient in Bash, and have hacked a little Python or Ruby. If they've really gotten into the infrastructure as code thing, they might have delved into the innards of Chef and Puppet and been exposed to those projects' respective testing frameworks.

In a lot of cases, testing is seen as "something I'll get to when I become a real developer".
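The core pattern behind CLI testing is simple: run the command, capture its output and exit status, and assert on them. Here's a dependency-free sketch of that pattern using Ruby's standard library - Aruba wraps this same idea in RSpec- and Cucumber-friendly helpers:

```ruby
require 'open3'

# Run a command and capture everything a test might want to assert on:
# stdout, stderr, and the exit status.
def run_cli(*cmd)
  stdout, stderr, status = Open3.capture3(*cmd)
  { stdout: stdout, stderr: stderr, exit_status: status.exitstatus }
end

result = run_cli('echo', 'hello')
# In an RSpec example you would write expectations against the result:
#   expect(result[:stdout]).to eq("hello\n")
#   expect(result[:exit_status]).to eq(0)
```

Once your home brew tool is exercised this way, "quick and dirty" stops being an excuse - the test harness is a few lines of stdlib.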


Just post mortems

Earlier this week I gave a talk at Monitorama EU on psychological factors that should be considered when designing alerts.

Dave Zwieback pointed me to a great blog post of his on managing the human side of post mortems, which bookends nicely with my talk:

Imagine you had to write a postmortem containing statements like these:

We were unable to resolve the outage as quickly as we would have hoped because our decision making was impacted by extreme stress.

We spent two hours repeatedly applying the fix that worked during the previous outage, only to find out that it made no difference in this one.

We did not communicate openly about an escalating outage that was caused by our botched deployment because we thought we were about to lose our jobs.

While the above scenarios are entirely realistic, it's hard to find many postmortem write-ups that even hint at these "human factors." Their absence is, in part, due to the social stigma associated with publicly acknowledging their contribution to outages.

Dave's third example dovetails well with some of the examples in Dekker's Just Culture.


Counters not DAGs

Monitoring dependency graphs are fine for small environments, but they are not a good fit for nested complex environments, like those that make up modern web infrastructures.

DAGs are a very alluring data structure to represent monitoring relationships, but they fall down once you start using them to represent relationships at scale:

  • There is an assumption of a direct causal link between nodes of the graph. It's very tempting to believe that you can trace failure from one edge of the graph to another. Failures in one part of a complex system all too often have weird effects, inducing failure in components of the same system that are quite removed from the original fault.

  • Complex systems are almost impossible to model. With time and an endless stream of money you can sufficiently model the failure modes within complex systems in isolation, but fully understanding and predicting how complex systems interact and relate with one another is almost impossible. The only way to model this effectively is to have a closed system with very few external dependencies, which is the opposite of the situation every web operations team is in.

  • The cost of maintaining the graph is non-trivial. You could employ a team of extremely skilled engineers to understand and model the relationships between each component in your infrastructure, but their work would never be done. On top of that, given the sustained growth most organisations experience, whatever you model will likely change within 12-18 months. Fundamentally, it would not provide a good return on investment.
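The alternative the title points to - counting failures rather than modelling dependencies - can be sketched minimally. This is a hypothetical illustration, not any real tool's API: instead of tracing causality through a graph, count failing checks per group and alert when the proportion crosses a threshold.

```ruby
# Roll up check results per group: no dependency modelling, just a
# counter and a threshold on the proportion of failing checks.
def rollup_alert?(failing, total, threshold: 0.25)
  return false if total.zero?
  (failing.to_f / total) >= threshold
end

# 3 of 40 web servers failing: likely isolated noise, no page.
# 15 of 40 failing: something systemic is happening, page someone.
```

The counter knows nothing about why the checks are failing, which is the point: it stays cheap to maintain as the infrastructure grows, where the graph's maintenance cost only compounds.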