Lindsay Holmwood - Fractional 2015-05-24T04:55:28+00:00 Lindsay Holmwood CD for infrastructure services 2015-05-22T00:00:00+00:00 <p>For the last 6 months I&#39;ve been consulting on a project to build a monitoring metrics storage service to store several hundred thousand metrics that are updated every ten seconds. We decided to build the service in a way that could be continuously deployed and use as many existing Open Source tools as possible.</p> <p>There is a <a href="">growing body</a> of evidence to show that continuous deployment of applications lowers defect rates and improves software quality. However, the significant corpus of literature and talks on continuous delivery and deployment is primarily focused on applications - there is scant information available on applying these CD principles to the work that infrastructure engineers do every day.</p> <p>Through the process of building a monitoring service with a continuous deployment mindset, we&#39;ve learnt quite a bit about how to structure infrastructure services so they can be delivered and deployed continuously. In this article we&#39;ll look at some of the principles you can apply to your infrastructure to start delivering it continuously.</p> <!-- excerpt --> <h2>How to CD your infrastructure successfully</h2> <p>There are two key principles for doing CD with infrastructure services successfully:</p> <ol> <li><strong>Optimise for fast feedback.</strong> This is essential for quickly validating that your changes match the business requirements, and eliminating technical debt and sunk cost before it spirals out of control.</li> <li><strong>Chunk your changes.</strong> A CD mindset forces you to think about creating the shortest <em>and smoothest</em> path to production for changes to go live. Anyone who has worked on public facing systems knows that many big changes made at once rarely result in happy times for anyone involved. 
Delivering infrastructure services continuously doesn&#39;t absolve you from good operational practice - it&#39;s an opportunity to create a structure that reinforces such practices.</li> </ol> <h2>Definitions</h2> <ul> <li>Continuous Delivery is different from Continuous Deployment in that in Continuous Delivery there is some sort of human intervention required to promote a change from one stage of the pipeline to the next. In Continuous Deployment no such breakpoint exists - changes are promoted automatically. The speed of Continuous Deployment comes at the cost of potentially pushing a breaking change live. Discussion of &quot;CD&quot; rarely qualifies which of the two terms is meant.</li> <li>An infrastructure service is a configuration of software and data that is consumed by other software - not by end users themselves. Think of them as “the gears of the internet”. Examples of infrastructure services include DNS, databases, Continuous Integration systems, or monitoring.</li> </ul> <h2>What the pipeline looks like</h2> <ol> <li><strong>Push.</strong> An engineer makes a change to the service configuration and pushes it to a repository. There may be ceremony around how the changes are reviewed, or they could be pushed directly into <code>master</code>.</li> <li><strong>Detect and trigger.</strong> The CI system detects the change and triggers a build. This can be through polling the repository regularly, or a hosted version control system (like GitHub) may call out via a webhook.</li> <li><strong>Build artifacts.</strong> The build sets up dependencies and builds any required software artifacts that will be deployed later.</li> <li><strong>Build infrastructure.</strong> The build talks to an IaaS service to build the necessary network, storage, compute, and load balancing infrastructure. 
The IaaS service may be run by another team within the business, or an external provider like AWS.</li> <li><strong>Orchestrate infrastructure.</strong> The build uses some sort of configuration management tool to string the provisioned infrastructure together to provide the service.</li> </ol> <p>There is a testing step between almost all of these steps. Automated verification of the changes about to be deployed and the state of the running service after the deployment is crucial to doing CD effectively. Without it, CD is just a framework for continuously shooting yourself in the foot faster and not learning to stop. <em>You will fail if you don&#39;t build feedback into every step of your CD pipeline.</em></p> <h2>Defining the service for quality feedback</h2> <ul> <li><strong>Decide what guarantees you are providing</strong> to your users. A good starting point for thinking about what those guarantees should be is the CAP theorem. Decide if the service you&#39;re building is an AP or CP system. Infrastructure services generally tend towards AP, but there are cases where CP is preferred (e.g. databases).</li> <li><strong>Define your SLAs.</strong> This is where you quantify the guarantees you&#39;ve just made to your users. These SLAs will relate to service throughput, availability, and data consistency (note the overlap with CAP theorem). <em>95th percentile response time for monitoring metric queries in a one hour window is &lt; 1 second</em>, and <em>a single storage node failure does not result in graph unavailability</em> are examples of SLAs.</li> <li><strong>Codify your SLAs as tests and checks.</strong> Once you&#39;ve quantified your guarantees as SLAs, this is how you get automated feedback throughout your pipeline. These tests must be executed while you&#39;re making changes. 
Use your discretion as to whether you run all of the tests after every change, or a subset.</li> <li><strong>Define clear interfaces.</strong> It&#39;s extremely rare you have a service that is one monolithic component that does everything. Infrastructure services are made of multiple moving parts that work together to provide the service, e.g. multiple PowerDNS instances fronting a MySQL cluster. Having clear, well defined interfaces is important for verifying expected interactions between parts before and after changes, as well as during the normal operation of the service.</li> <li><strong>Know your data.</strong> Understanding where the data lives in your service is vital to understanding how failures will cascade throughout your service when one part fails. Relentlessly eliminate state within your service by pushing it to one place and front access with horizontally scalable immutable parts. Your immutable infrastructure is then just a stateless application.</li> </ul> <h2>Making it fast</h2> <p><strong>Getting iteration times down</strong> is the most important goal for achieving fast feedback. From pushing a change to version control to having the change live should take less than 5 minutes (excluding cases where you&#39;ve gotta build compute resources). Track execution time on individual stages in your pipeline with <code>time(1)</code>, logged out to your CI job&#39;s output. Analyse this data to determine the min, max, median and 95th percentile execution time for each stage. Identify what steps are taking the longest and optimise them.</p> <p><strong>Get your CI system close to the action.</strong> One nasty aspect of working with infrastructure services is the latency between where you are making changes from, and where the service you&#39;re making changes to is hosted. 
By moving your CI system into the same point of presence as the service, you minimise latency between the systems.</p> <p>This is especially important when you&#39;re interacting with an IaaS API to inventory compute or storage resources at the beginning of a build. Before you can act on any compute resources to install packages or change configuration files you need to ensure those compute resources exist, either by building up an inventory of them or creating them and adding them to said inventory.</p> <p>Every time your CD pipeline runs it has to talk to your IaaS provider to do these three steps:</p> <ol> <li>Does the thing exist?</li> <li>Maybe make a change to create the thing</li> <li>Get info about the thing</li> </ol> <p>Each of these steps requires sending and receiving often non-trivial amounts of data that will be affected by network and processing latency.</p> <p>By moving your CI close to the IaaS API, you get a significant boost in run time performance. By doing this on the monitoring metrics storage project we reduced the CD pipeline build time from 20 minutes to 5 minutes.</p> <p><strong>Push all your changes through CI.</strong> It&#39;s tempting when starting out your CD efforts to push some changes through the pipeline, but still make ad-hoc changes outside the pipeline, say from your local machine.</p> <p>This results in several problems:</p> <ul> <li>You don&#39;t receive the latency reducing benefits of having your CI system close to the infrastructure.</li> <li>You limit visibility to other people in your team as to what changes have actually been made to the service. That quick fix you pushed from your local machine might contribute to a future failure that your colleagues will have no idea about. The team as a whole benefits from having an authoritative log of all changes made.</li> <li>You end up with divergent processes - one for ad-hoc changes and another for Real Changes™. 
Now you&#39;re optimising two processes, and those optimisations will likely clobber one another. Have fun.</li> <li>You reduce your confidence that changes made in one environment will apply cleanly to another. If you&#39;re pushing changes through multiple environments before they are applied to your production environment, you reduce the certainty that one-off changes in one environment won&#39;t cause changes to pass there but fail elsewhere.</li> </ul> <p>There&#39;s no point in lying: <em>pushing all changes through CI is hard</em> but worth it. It requires thinking about changes differently and embracing a different way of working.</p> <p>The biggest initial pushback you&#39;ll probably get is having to context switch between your terminal where you&#39;re making changes and the web browser where you&#39;re tracking the CI system output. This context switch sounds trivial but I dare you to try it for a few hours and not feel like you&#39;re working more slowly.</p> <p>Netflix Skunkworks&#39; <a href="">jenkins-cli</a> is an absolute godsend here - it allows you to start, stop, and tail jobs from your command line. Your workflow for making changes now looks something like this:</p> <div class="highlight"><pre><code class="language-bash" data-lang="bash">git push <span class="o">&amp;&amp;</span> jenkins start <span class="nv">$job</span> <span class="o">&amp;&amp;</span> jenkins tail <span class="nv">$job</span> </code></pre></div> <p>The <code>tail</code> is the real killer feature here - you get the console output from Jenkins on your command line without the need to switch away to your browser.</p> <h2>Chunking your changes</h2> <p><strong>Change one, test one</strong> is a really important way of thinking about how to apply changes so they are more verifiable. 
When starting out CD the easiest path is to make all your changes and then test them straight away, e.g.</p> <blockquote> <ul> <li>Change app</li> <li>Change database</li> <li>Change proxy</li> <li>Test app</li> <li>Test database</li> <li>Test proxy</li> </ul> </blockquote> <p>What happens when your changes cause multiple tests to fail? You&#39;re faced with having to debug multiple moving parts without solid information on what is contributing to the failure.</p> <p>There&#39;s a very simple solution to this problem - make one change at a time, and test immediately after you make each change:</p> <blockquote> <ul> <li>Change app</li> <li>Test app</li> <li>Change database</li> <li>Test database</li> <li>Change proxy</li> <li>Test proxy</li> </ul> </blockquote> <p>When you make changes to the app that fail the tests, you&#39;ll get fast feedback and automatically abort all the other changes until you debug and fix the problem in the app layer.</p> <p>If you were applying changes by hand you would likely be doing something like this anyway, so encode that good practice into your CD pipeline.</p> <p><strong>Tests must finish quickly</strong>. If you&#39;ve worked on a code base with good test coverage you&#39;ll know that slow tests are a huge productivity killer. Exactly the same applies here - the tests should be a help not a hindrance. Aim to keep each test executing in under 10 seconds, preferably under 5 seconds.</p> <p>This means you must make compromises in what you test. Test for really obvious things like <em>“Is the service running?”</em>, <em>“Can I do a simple query?”</em>, <em>“Are there any obviously bad log messages?”</em>. You&#39;ll likely see the crossover here with &quot;traditional&quot; monitoring checks. You know, those ones railed against as being bad practice because they don&#39;t sufficiently exercise the entire stack.</p> <p>In this case, they are a pretty good indication your change has broken something. 
Aim for &quot;good enough&quot; fast coverage in your CD pipeline which complements your longer running monitoring checks to verify things like end-to-end behaviour.</p> <p><a href="">Serverspec</a> is your friend for quickly writing tests for your infrastructure.</p> <p><strong>Make the feedback visual</strong>. The raw data is cool, but graphs are better. If you&#39;re doing a simple threshold check and you&#39;re using something like Librato or Datadog, link to a dashboard.</p> <p>If you want to take your visualisation to the next level, use gnuplot&#39;s <a href="">dumb</a> <a href="">terminal</a> <a href="">output</a> to graph metrics on the command line:</p> <div class="highlight"><pre><code class="language-text" data-lang="text"> 1480 ++---------------+----------------+----------------+---------------** + + + + ************** + 1460 ++ ******* ## | ******* #### | 1440 ++ ***************** ####### ++ | *** ## | 1420 ******************* # ++ | # | 1400 ++ ## ++ | #### | | ### | 1380 ++ ### ++ | ## | 1360 ++ ##### ++ | #### | 1340 ++ ####### ++ | ### | 1320 ++ ####### ++ ############ + + + + 1300 ++---------------+----------------+----------------+---------------++ 0 5 10 15 20 CRITICAL: Deviation (116.55) is greater than maximum allowed (100.00) </code></pre></div> <h2>Conclusion</h2> <p>CD of infrastructure services is possible provided you stick to the two guiding principles:</p> <ol> <li>Optimise for fast feedback.</li> <li>Chunk your changes.</li> </ol> <p>Focus on constantly identifying and eliminating bottlenecks in your CD pipeline to get your iteration time down.</p> Why do you want to lead people? 2014-10-03T00:00:00+00:00 <p>Understanding your motivations for a career change into management is vitally important to understanding what kind of manager you want to be.</p> <p>When I made the transition into management, I didn&#39;t have a clear idea of what my motivations were. I had vague feelings of wanting to explore the challenges of managing people. 
I also wanted to test myself and see if I could do as good a job as role models throughout my career.</p> <p>But all of this was vague, unquantifiable feelings that took a while to get a handle on. Understanding, questioning, and clarifying my motivations was something I put a lot of thought into in the first year of my career change.</p> <p>People within your teams will spend much more time than you realise looking at and analysing what you are doing, and they will pick up on what your motivations are, and where your priorities lie.</p> <p>They will mimic these behaviours and motivations, both positive and negative. You are a signalling mechanism to the team about what&#39;s important and what&#39;s not.</p> <p>This is a huge challenge for people making the career change! You&#39;re still working all this shit out, and you&#39;ve got the ever gazing eye of your team examining and dissecting all of your actions.</p> <p>These are some of the motivations I&#39;ve picked up on in myself and others when trying to understand what drew me to the management career change.</p> <!-- excerpt --> <h2>Money</h2> <p>It is undeniable that there is a pay bump when moving to management. In most organisations, the pay ceiling is much higher in management than in engineering.</p> <p>Many engineers who rise through the ranks get to a point where the only way they will earn more is if they switch from engineering to management, so that becomes the primary motivation.</p> <p>The pay is higher for a good reason though - it&#39;s actually difficult to do the job well! Management looks easy from the outside, but it&#39;s difficult on the inside. 
Again, our friends <a href="">Dunning and Kruger</a> posit that for a given skill, incompetent people will:</p> <blockquote> <ul> <li>tend to overestimate their own level of skill</li> <li>fail to recognize genuine skill in others</li> <li>fail to recognize the extremity of their inadequacy</li> <li>recognize and acknowledge their own previous lack of skill, if they are exposed to training for that skill</li> </ul> </blockquote> <p>Poor decisions are obvious and easy to criticise. Because we spend a lot of time looking at those in our organisation above us, we&#39;re finely attuned to mistakes and inadequacies, and tend to gloss over the good things they do.</p> <p>Understanding what about those decisions and behaviours makes sense to the people making them is difficult but vital to effectively working with others, regardless of whether you&#39;re in management, engineering, sales, finance, or operations.</p> <p>More often than not, there are good reasons behind bad decisions. We are all <a href="">locally rational</a>.</p> <p>The pay bump has strings attached - you&#39;re going to be making plenty of decisions, both good and bad, and wearing the consequences of them.</p> <p>You are being paid to be empathetic - to understand how people are feeling, how implementing change will affect people, how to keep them motivated and working towards the big picture goal. None of these tasks are simple!</p> <p>If you&#39;re primarily motivated to move into management by better pay, then you need to seriously consider how that motivation will affect the people that report to you, how mimicry of those motivations and behaviours by people in your team flows on to other teams you work with, and what you need to do to meet the commitments you have to your team.</p> <p>Will you be doing the bare minimum to collect your paycheck? What’s stopping you from becoming an example of the <a href="">Peter Principle</a>? 
What skills do you need to develop to meet your people&#39;s needs and expectations?</p> <p>The hard problems in tech are not technology, they&#39;re people. That is why management pays more.</p> <h2>Influence</h2> <p>Being in management grants you power and influence in your organisation to build and run things as you see fit.</p> <p>This is often a key motivation for people who want to transition from engineering to management - they have a clarity of vision and they want the power to mandate how things should be built, and implement that vision.</p> <p>The motivation is always rooted in good intentions (“things could be so much more efficient if everyone just listened and did what I said”), and often results in an industrialist approach to managing people - “manager smart, worker stupid”.</p> <h3>The influence trap</h3> <p>Your influence can be wielded as a lever (leadership) or as a vise (management). Levers are useful at moving heavy objects but lack precision. Vises are very precise but a weight too heavy will slip from them.</p> <p>Vises are an alluring way for first time managers to work. The vise management style is prescriptive, centrally co-ordinated, command and control. And if you watch carefully you&#39;ll soon realise it limits the potential of the team.</p> <p>Prescriptive, vise-like management assumes you are the smartest person in the room, and know best how things should be done.</p> <p>It doesn&#39;t multiply the team&#39;s effectiveness. The point of being a manager is to be a lever that multiplies the effectiveness of the team - to synthesise different and conflicting ideas to come to decisions and solutions nobody could have anticipated or come up with by themselves. 
This is near impossible if you solely wield your influence as a vise.</p> <p>Studies show people&#39;s individual problem-solving performance lifts after <a href="">being exposed to teamwork situations and training</a>.</p> <p>Prescriptive management increases the gap between <em>Work As Imagined</em> vs <em>Work As Done</em>. While conceptually you may have a great idea about how to solve a problem or operate a system daily, the people implementing your plans always discover gaps between the concept and implementation. Over time these gaps become larger, to the point you have a distorted view of how work is being done compared to how it&#39;s actually being carried out.</p> <p>You optimise the effectiveness of the system by having tight feedback loops and open communication channels where people are rewarded for providing both negative and positive feedback about the design and operation of the system. As a manager, this means you need to be actively engaging with the people in your team - finding out what they think and feel about the work.</p> <p>Finally, prescriptive management is an empirically bad way of retaining creative talent. Constant overruling and minimisation of feedback is a great way to piss people off. If you hire creative, intelligent, capable people and keep them locked in a box, they&#39;re going to break out.</p> <h3>Multiplying, trust, and happiness</h3> <p>Maybe you are the smartest person in the room, but others will bring knowledge and experience to the table you simply don&#39;t have.</p> <p>You get the best out of the team by creating a safe space for people to put forward ideas, argue them without recriminations, and build consensus.</p> <p>The goal for people leading high performing teams should be to have the output of the team be greater than the sum of the individual efforts of people in the team.</p> <p>Your status as a manager grants you power within your organisation. That power must be wielded responsibly. 
You won&#39;t know if you&#39;re wielding that power responsibly in the first 12 months of the career change, at best.</p> <p>You must constantly assess whether the decisions you&#39;re making are the best for the people who report to you. It&#39;s a constant tightrope act to balance the needs of your people over the needs of the business.</p> <p>It&#39;s easy to pass the policy buck and say “I’m just following orders” when implementing unpopular changes, but you do have a responsibility to identify and push back on change that negatively affects people before you roll it out, and minimise the unavoidable negative effects of that change.</p> <p>It does not take long for things to come apart when you take your eye off the ball and stop looking out for the team. Trust is hard to build, and easy to lose. People spend a lot of time looking at you and analysing your behaviour. They will notice much earlier than you realise when you take your eye off the ball.</p> <p>It takes <a href="">at least 5 positive interactions</a> to start re-establishing trust after you&#39;ve breached it.</p> <p>Being in a management position grants you the power to shape how people within your organisation do their work. This means you have a direct influence over their happiness and wellbeing. Blindly implementing policy and not empathising with the people in your team can cause irreparable damage and create emotional scar tissue that will stay with people for years, if not decades.</p> <p>Your power must be wielded responsibly. Do not fuck this up. When you do (and don&#39;t worry, you will, we all have), own your mistakes, apologise, and rebuild the trust.</p> <h2>Personal development / Career change</h2> <p>Personal development is a pretty good motivation for a career change to management! 
You want to challenge yourself to do a better job than those before and around you.</p> <p>A huge personal motivation for me when moving into management was to treat others better than I had been treated throughout my career until that point.</p> <p>Working in environments where the happiness of people was not the primary concern of those in charge is not a fun experience. Shared negative and stressful experiences helped me form close bonds and develop a camaraderie with the people I worked with. I couldn&#39;t say the same about the people I worked for.</p> <p>Those relationships are something I value, but I wouldn&#39;t want anyone else to have to go through what we did just to obtain that sort of relationship.</p> <p>The challenge for me was clear: was it possible to develop that camaraderie within the team I lead through purely positive experiences?</p> <p>Looking back at how particular decisions and behaviours I experienced affected me and other people in the teams I worked in in these stressful environments, there were some obvious things that I could improve on.</p> <p>There were other decisions I considered to be poor at the time, but after finding myself in similar positions I made similar choices.</p> <p>I failed fairly terribly at the transition during the first 12 months of my career change. Someone in my team described my management style as “absent father”. That really put into perspective that my priorities were misplaced, and I needed to focus on the team and not my own individual performance.</p> <p>My first experience working in tech was overwhelmingly positive. The working environment and management I experienced on a daily basis in the first 3 years of working in tech is the experience I aspire to create for people in the teams I lead every day.</p> <p>The times I had a “good boss” are some of my best memories in my career. 
I was focused on the work, consistently delivered things I was excited about, and rarely worried about troubles elsewhere in the business (and it turned out there were a lot of them).</p> <p>The enduring attitude from that time is the feeling of working with, not for my manager. We worked as a team to solve problems together, not as individuals off doing our own thing. That&#39;s the feeling I want to create in the teams I lead.</p> <hr> <p>Understanding what motivates your career change is not an easy task.</p> <p>At the end of the first year of my career change, my motivations lay somewhere between influence and personal development.</p> <p>These motivations have morphed over time. Today, my focus is the happiness of the people I work with.</p> <p>You need to undertake a constant process of self-reflection and a space to develop your understanding of your motivations. It’s important you create the time and space to do this!</p> <p>The simplest trap to fall into in your first year is to be focused on the daily grind, the tactical details, and not think about the bigger picture.</p> <p>This is something that affects experienced and novice managers alike, and it’s important to establish good personal habits early on so you have time to reflect on what motivates you, and what sort of leader you’re going to be.</p> It's not a promotion - it's a career change 2014-09-19T00:00:00+00:00 <p>The biggest misconception engineers have when thinking about moving into management is they think it&#39;s a promotion.</p> <p>Management is not a promotion. It is a career change.</p> <p>If you want to do your leadership job effectively, you will be exercising a vastly different set of skills on a daily basis to what you are exercising as an engineer. Skills you likely haven&#39;t developed and are unaware of.</p> <p>Your job is not to be an engineer. Your job is not to be a manager. 
Your job is to <a href="">be a multiplier</a>.</p> <p>You exist to remove roadblocks and eliminate interruptions for the people you work with.</p> <p>You exist to listen to people (not just hear them!), to build relationships and trust, to deliver bad news, to resolve conflict in a just way.</p> <p>You exist to think about the bigger picture, ask provoking and sometimes difficult questions, and relate the big picture back to something meaningful, tangible, and actionable to the team.</p> <p>You exist to advocate for the team, to promote the group and individual achievements, to gaze into unconstructive criticism and see underlying motivations, and sometimes even give up control and make sacrifices you are uncomfortable or disagree with.</p> <p>You exist to make systemic improvements with the help of the people you work with.</p> <p>Does this sound like engineering work?</p> <p>The truth of the matter is this: you are woefully unprepared for a career in management, and you are unaware of how badly unprepared you are.</p> <p>There are two main contributing factors that have put you in this position:</p> <ul> <li>The Dunning-Kruger effect</li> <li>Systemic undervaluation of non-technical skills in tech</li> </ul> <!-- excerpt --> <h3>Systemic undervaluation of non-technical skills</h3> <p>Technical skills are emphasised above all in tech. It is <a href="">part of our mythology</a>.</p> <p>Technical skill is the dominant currency within our industry. It is highly valued and sought after. 
If you haven&#39;t read all the posts on the Hacker News front page today, or you&#39;re not running the latest releases of all your software, or you haven&#39;t recently pulled all-nighter coding sessions to ship that killer feature, you&#39;re falling behind bro.</p> <p>Naturally, for an industry so unhealthily focused on technical skills, they tend to be the deciding factor for hiring people.</p> <p>Non-technical skills that are lacking, like teamwork, conflict resolution, listening, and co-ordination, are often overlooked and excused away in engineering circles. They are seen as being of <a href="">lesser importance</a> than technical skills, and organisations frequently compensate for, minimise the effects of, and downplay the importance of these skills.</p> <p>If you really want to see where our industry places value, just think about the terms &quot;hard&quot; and &quot;soft&quot; we use to describe and differentiate between the two groups of skills. What sort of connotations do each of those words have, and what implicit biases do they feed into and trigger?</p> <p>If you&#39;re an engineer thinking about going into management, you are a product of this culture.</p> <p>There are a handful of organisations that create cultural incentives to develop these non-technical skills in their engineers, but these organisations are, by and large, unicorns.</p> <p>And if you want to lead people, you&#39;re in for a rude shock if you haven&#39;t developed those non-technical skills.</p> <p>Because guess what - you can&#39;t lead people in the same way you write code or manage machines. 
If you could, management would have been automated a long time ago.</p> <h3>The Dunning-Kruger effect</h3> <p>The identification of the Dunning-Kruger effect is one of the most interesting developments of modern psychology, and one of the most revelatory insights available to our industry.</p> <p>In 1999 David Dunning and Justin Kruger started publishing the results of experiments on the ability of people to <a href="">self-assess competence</a>:</p> <blockquote> <p>Dunning and Kruger proposed that, for a given skill, incompetent people will:</p> <ul> <li>tend to overestimate their own level of skill</li> <li>fail to recognize genuine skill in others</li> <li>fail to recognize the extremity of their inadequacy</li> <li>recognize and acknowledge their own previous lack of skill, if they are exposed to training for that skill</li> </ul> </blockquote> <p>If you&#39;ve had a career in tech without any leadership responsibilities, you&#39;ve likely had thoughts like:</p> <ul> <li>&quot;Managing people can&#39;t be that hard.&quot;</li> <li>&quot;My boss has no idea what they are doing.&quot;</li> <li>&quot;I could do a better job than them.&quot;</li> </ul> <p>Congratulations! 
You&#39;ve been partaking in the Dunning-Kruger effect.</p> <p>The bad news: Dunning-Kruger is exacerbated by the systemic devaluation of non-technical skills within tech.</p> <p>The good news: soon after going into leadership, the scope of your lack of skill, and unawareness of your lack of skill, will become plain for you to see.</p> <p>Also, everyone else around you will see it.</p> <h3>Multiplied impact</h3> <p>This is the heart of the matter: by being elevated into a position of leadership, you are being granted a responsibility over people&#39;s happiness and wellbeing.</p> <p>Mistakes made due to lack of skill and awareness can cause people irreparable damage and create emotional scar tissue that will stay with people for years, if not decades.</p> <p>Conversely, by developing skills and helping your team row in the same direction, you can also create positive experiences that will stay with people for their entire careers.</p> <p>The people in your team will spend a lot of time looking up at you - far more time than you realise. Everything you do will be analysed and dissected, sometimes fairly, sometimes not.</p> <p>If you&#39;re not willing to push yourself, develop the skills, and fully embrace the career change, maybe you should stay on the engineering career development track.</p> <p>But it&#39;s not all doom and gloom.</p> <p>By striving to be a multiplier, the effects of the hard work you and the team put in can be far greater than what you can achieve individually.</p> <p>You only reap the benefits of this if you shift your measure of job satisfaction from your own performance to the group&#39;s.</p> <h3>&quot;Real work&quot;</h3> <p>Many engineers who change into management feel disheartened because they&#39;re not getting as much &quot;real work&quot; done.</p> <p>If you dig deeper, &quot;real work&quot; is always linked to their own individual performance. 
Of course you&#39;re not going to perform to the same level as an engineer - you&#39;re working towards the same goals, but you are each working on fundamentally different tasks to get there!</p> <p>Focusing on your own skills and performance can be a tough loop to break out of - individual achievement is bound up in the same mythology as technical skills - it&#39;s something highly prized and disproportionately incentivised in much of our culture.</p> <p>If you&#39;ve decided to undertake this career change, it&#39;s important to treat your lack of skill as a learning opportunity, develop a hunger for learning more and developing your skills, routinely reflect on your experiences, and compare yourself to your cohort.</p> <p>None of these things are easy - I struggled with feelings of inadequacy in meeting the obligations of my job for the first 3 years of being in a leadership position. Once I worked out that I was tying job satisfaction to engineering performance, it was a long and hard struggle to re-link my definition of success to group performance.</p> <p>If everything you&#39;ve read here hasn&#39;t scared you, and you&#39;ve committed to the change to management, there are three key things you can start doing to start skilling up:</p> <ol> <li>Do professional training.</li> <li>Get mentors.</li> <li>Educate yourself.</li> </ol> <h3>Training</h3> <p>Tech has a bias against professional training that doesn&#39;t come from universities. Engineering organisations tend to value on-the-job experience over training and certification. 
A big part of that comes from a lot of technical training outside of universities being a little bit shit.</p> <p>Our experience of bad training in the technical domain doesn&#39;t apply to management - there is plenty of quality short-course management training available, the development of which other industries have been financing for the last couple of decades.</p> <p>In Australia, <a href="">AIM</a> provide several courses ranging from introductory to advanced management and leadership development.</p> <p>Do your research, ask around, find what people would recommend, then make the case for work to pay for it.</p> <h3>Mentors</h3> <p>Find other people in your organisation you can talk to about the challenges you are facing developing your non-technical skills. A mentor doesn&#39;t necessarily need to be your boss - in fact diversifying your mentors is important for developing skills to entertain multiple perspectives on the same situation.</p> <p>If you&#39;re lucky, your organisation assigns new managers a buddy to act as a mentor, but professional development maturity for management skills varies widely across organisations.</p> <p>If you don&#39;t have anyone in your organisation to act as a mentor or buddy, then seek out old bosses and see if they&#39;d be willing to chat for half an hour every few weeks.</p> <p>I have semi-regular breakfast catchups with a former boss from very early on in my career that are always a breath of fresh air - to the point where my wife actively encourages me to catch up because of how much less stressed I am afterwards.</p> <p>Another option is to find other people in your organisation also going through the same transition from engineer to manager as you. 
You won&#39;t have all the answers, but developing a safe space to bounce ideas around and talk about problems you&#39;re struggling with is a useful tool.</p> <h3>Self-education</h3> <p>I spend a lot of time reading and sharing articles on management and leadership - far more time than I spend on any technical content.</p> <p>At the very beginning of your journey it&#39;s difficult to identify what is good and what is bad, what is gold and what is fluff. I have read a lot of crappy advice, but four years into the journey my barometer for advice is becoming more accurate.</p> <p>Also, be careful of only reading things that re-inforce your existing biases and leadership knowledge. If there&#39;s a particular article I disagree with, I&#39;ll often spend 5 minutes jotting a brief critique. I&#39;ll either get better at articulating to others what about that idea is flawed, or my perspective will become more nuanced.</p> <p>It&#39;s also pertinent to note how the article made you feel, and reflect for a moment on what about the article made you feel that way.</p> <p>If you&#39;re scratching your head for where to start, I recommend Bob Sutton&#39;s &quot;The No Asshole Rule&quot;, then &quot;Good Boss, Bad Boss&quot;. Sutton&#39;s work is rooted in evidence-based management (he&#39;s not talking out of his arse - he&#39;s been to literally thousands of companies and observed how they work), but he writes in an engaging and entertaining way.</p> <hr> <p>Almost four years into my career change, I can say that it&#39;s been worth it. It has not been easy. I have made plenty of mistakes, have prioritised incorrectly, and hurt people accidentally.</p> <p>But so has everyone else. Nobody else has this nailed. Even the best managers are constantly learning, adapting, improving.</p> <p>Think about it this way: you&#39;re going to accumulate leadership skills faster than people who have already made the change because you&#39;re starting with nothing. 
The difference is nuance and tact that comes from experience, something you can develop by sticking with your new career.</p> <p>This will only happen when you fully commit to your new career, and you change your definition for success to meet your new responsibilities as a manager.</p> Applying cardiac alarm management techniques to your on-call 2014-08-26T00:00:00+00:00 <blockquote> <p>If alarms are more often false than true, a culture emerges on the unit in that staff may delay response to alarms, especially when staff are engaged in other patient care activities, and more important critical alarms may be missed.</p> </blockquote> <p>One of the most difficult challenges we face in the operations field right now is &quot;alert fatigue&quot;. Alert fatigue is a term the tech industry has borrowed from a similar term used in the medical industry, &quot;alarm fatigue&quot; - a phenomenon of people being so desensitised to the alarm noise from monitors that they fail to notice or react in time.</p> <p>In an on-call scenario, I posit two main factors contribute to alert fatigue:</p> <ul> <li>The accuracy of the alert.</li> <li>The volume of alerts received by the operator.</li> </ul> <p>Alert fatigue can manifest itself in many ways:</p> <ul> <li>Operators delaying a response to an alert they&#39;ve seen before because &quot;it&#39;ll clear itself&quot;.</li> <li>Impaired reasoning and creeping bias, due to physical or mental fatigue.</li> <li>Poor decision making during incidents, due to an overload of alerts.</li> </ul> <p>Earlier this year a story <a href="">popped up</a> about a Boston hospital that silenced alarms to improve the standard of care. 
It sounded counter-intuitive, but in the context of the alert fatigue problems we&#39;re facing, I wanted to get a better understanding of <a href="">what they actually did</a>, and how we could potentially apply it to our domain.</p> <!-- excerpt --> <h3>The Study</h3> <p>When rolling out new cardiac telemetry monitoring equipment in 2008 to all adult inpatient clinical units at Boston Medical Center (BMC), a Telemetry Task Force (TTF) was convened to develop standards for patient monitoring. The TTF was a multidisciplinary team drawing people from senior management, cardiologists, physicians, nursing practitioners and directors, clinical instructors, and a quality and patient safety specialist.</p> <p>BMC&#39;s cardiac telemetry monitoring equipment provides configurable limit alarms (we know this as &quot;thresholding&quot;), with alarms for four levels: message, advisory, warning, crisis. These alarms can either be visual or auditory.</p> <p>As part of the rollout, TTF members observed nursing staff responding to alarms from equipment configured with factory default settings. 
The TTF members observed that alarms were frequently ignored by nursing staff, but for a good reason - the alarms would self-reset and stop firing.</p> <p>To frame this behaviour from an operations perspective, this is like a Nagios check passing a threshold for a <code>CRITICAL</code> alert to fire, the on-call team member receiving the alert, sitting on it for a few minutes, and the alert recovering all by itself.</p> <p>When the nursing staff were questioned about this behaviour, they reported that more often than not the alarms self-reset, and answering every alarm pulled them away from looking after patients.</p> <p>Fast forward 3 years, and in 2011 BMC started an Alarm Management Quality Improvement Project that experimented with multiple approaches to reducing alert fatigue:</p> <ul> <li>Widen the acceptable thresholds for patient vitals so alarms would fire less often.</li> <li>Eliminate all levels of alarms except &quot;message&quot; and &quot;crisis&quot;. Crisis alarms would emit an audible alert, while message history would build up on the unit&#39;s screen for the next nurse to review.</li> <li>Alarms that had the ability to self-reset (recover on their own) were disabled.</li> <li>If false positives were detected, nursing staff were required to tune the alarms as they occurred.</li> </ul> <p>The approaches were applied over the course of 6 weeks, with buy-in from all levels of staff, most importantly with nursing staff who were responding to the alarms.</p> <p>Results from the study were clear:</p> <ul> <li>The number of total audible alarms decreased by 89%. This should come as no surprise, given the alarms were tuned to not fire as often.</li> <li>The number of <a href="">code blues</a> decreased by 50%. 
This indicates that the reduction of work from the elimination of constant alarms freed up nurses to provide more proactive care, and that lower priority alarms for precursor problems for code blues are more likely to be responded to.</li> <li>The number of Rapid Response Team activations on the unit stayed constant. It&#39;s reasonable to assert that the operational effectiveness of the unit was maintained even though alarms fired less often.</li> <li>Anonymous surveys of nurses on the unit showed an increase in satisfaction with the level of noise on the unit, with night staff reporting they &quot;kept going back to the central station to reassure themselves that the central station was working&quot;. One anonymous comment stated &quot;I feel so much less drained going home at the end of my shift&quot;.</li> </ul> <p>At the conclusion of the study, the nursing staff requested that the previous alarming defaults not be restored.</p> <h3>Analysis</h3> <p>The approach outlined in the study is pretty simple: change the default alarm thresholds so they don&#39;t fire unless action <em>must</em> be taken, and give the operator the power to tune the alarms if the alarm is inaccurate.</p> <p>Alerts should exist in two states: nothing is wrong, and the world is on fire.</p> <p>But the elimination of alarms that have the ability to recover is a really surprising solution. Can we apply that to monitoring in an operations domain?</p> <p>Two obvious methods to make this happen:</p> <ul> <li>Remove checks that have the ability to self-recover.</li> <li>Redesign checks so they can&#39;t self-recover.</li> </ul> <p>For redesigning checks, I&#39;ve yet to encounter a check designed to <em>not</em> recover when thresholds are no longer exceeded. That would be a very surprising alerting behaviour to stumble upon in the wild, one that most operators, myself included, would likely attribute to a bug in the check. 
Socially, a check redesign like that would break many fundamental assumptions operators have about their tools.</p> <p>From a technical perspective, a non-recovering check would require the check having some sort of memory about its previous states and acknowledgements, or at least have the alerting mechanism do this. This approach is totally possible in the realm of more <a href="">modern tools</a>, but is not in any way commonplace.</p> <p>Regardless of the problems above, I believe adopting this approach in an operations domain would be achievable and I would love to see data and stories from teams who try it.</p> <p>As for removing checks, that&#39;s actually pretty sane! The typical CPU/memory/disk utilisation alerts engineers receive can be handy diagnostics during outages, but in almost all modern environments they are terrible indicators for anomalous behaviour, let alone something you want to wake someone up about. If my site can take orders, why should I be woken up about a core being pegged on a server I&#39;ve never heard of?</p> <p>Looking deeper though, the point of removing alarms that self-recover is to <em>eliminate the background noise of alarms that are ignorable</em>. This ensures each and every alarm that fires actually requires action, is investigated, acted upon, or is tuned.</p> <p>This is only possible if the volume of alerts is low enough, or there are enough people to distribute the load of responding to alerts. Ops teams that meet both of these criteria do exist, but they&#39;re in the minority.</p> <p>Another consideration is that checks for operations teams are cheap, but physical equipment for nurses is not. I can go and provision a couple of thousand new monitoring checks in a few minutes and have them alert me on my phone, and do all that without even leaving my couch. 
There are capacity constraints on the telemetry monitoring in hospitals - budgets limit the number of potential alarms that can be deployed and thus fire, and a person physically needs to move and act on a check to silence it.</p> <p>Also consider that hospitals are dealing with <a href="">pets, not cattle</a>. Each patient is a genuine snowflake, and the monitoring equipment has to be tuned for size, weight, health. We are extremely lucky in that most modern infrastructure is built from standard, similarly sized components. The approach outlined in this study may be more applicable to organisations who are still looking after pets.</p> <p>There are constraints and variations in physical systems like hospitals that simply don&#39;t apply to the technical systems we&#39;re nurturing, but there is a commonality between the fields: thinking about the purpose of the alarm, and how people are expected to react to it firing, is an extremely important consideration when designing the interaction.</p> <p>One interesting anecdote from the study was that extracting alarm data was a barrier to entry, as manufacturers often don&#39;t provide mechanisms to easily extract data from their telemetry units. 
We have a natural advantage in operations in that we tend to own our monitoring systems end-to-end and can extract that data, or have access to APIs to easily gather the data.</p> <p>The key takeaway the authors of the article make clear is this:</p> <blockquote> <p>Review of actual alarm data, as well as observations regarding how nursing staff interact with cardiac monitor alarms, is necessary to craft meaningful quality alarm initiatives for decreasing the burden of audible alarms and clinical alarm fatigue.</p> </blockquote> <p>Regardless of whether you think any of the methods employed above make sense in the field of operations, it&#39;s difficult to argue against collecting and analysing alerting data.</p> <p>The thing that excites me so much about this study is there is actual data to back the proposed techniques up! This is something we really lack in the field of operations, and it would be amazing to see more companies publish studies analysing different alert management techniques.</p> <p>Finally, the authors lay out some recommendations other institutions can use to improve alarm fatigue without requiring additional resources or technology.</p> <p>To adapt them to the field of operations:</p> <ul> <li>Establish a multidisciplinary alerting work group (dev, ops, management).</li> <li>Extract and analyse alerting data from your monitoring system.</li> <li>Eliminate alerts that are inactionable, or are likely to recover themselves.</li> <li>Standardise default thresholds, but allow local variations to be made by people responding to the alerts.</li> </ul> Rethinking monitoring post-Monitorama PDX 2014-05-10T00:00:00+00:00 <p>The two key take home messages from <a href="">Monitorama PDX</a> are these:</p> <ul> <li>We are mistakenly developing monitoring tools for ops people, not the developers who need them most.</li> <li>Our over-reliance on strip charts as a method for visualising numerical data is hurting ops as a craft.</li> </ul> <h3>Death to strip 
charts</h3> <p>Two years ago when I received my hard copy of William S. Cleveland&#39;s <a href="">The Elements of Graphing Data</a>, I eagerly opened it and scoured its pages for content on how to better visualise time series data. There were a few interesting methods to improve the visual perception of data in strip charts (banking to 45˚, limiting the colour palette), but to my disappointment there were no more than ~30 pages in the 297 page tome that addressed visualising time series data.</p> <!-- excerpt --> <p>In his talk at Monitorama PDX, <a href="">Neil Gunther</a> <a href="">goes on a whirlwind tour</a> of visualising data used by ops daily with visual tools other than time series strip charts. By ignoring time, looking at the distribution, and applying various transformations to the axes (linear-log, log-log, log-linear), Neil demonstrates how you can expose patterns in data (like power law distributions) that were simply invisible in the traditional linear time series form.</p> <p>Neil&#39;s talk explains why Cleveland&#39;s <em>Elements</em> gives so little time to time series strip charts - they are a limited tool that obfuscates data that doesn&#39;t match a very limited set of patterns.</p> <p>Strip charts are the PHP Hammer of monitoring.</p> <p><a href=""> <img src="" class="img-responsive" alt="the infamous php hammer"> </a></p> <p>We have been conditioned to accept strip charts as the One True Way to visualise time series data, and it is fucking us over without us even realising it. 
<strong>Time series strip charts are the single biggest engineering problem holding monitoring as a craft back.</strong></p> <p>It&#39;s time to shape our future by building new tools and extending existing ones to visualise data in different ways.</p> <p>This requires improving the statistical and visual literacy of tool developers (who are providing the generalised tools to visualise the data), and the people who are using the graphs to solve problems.</p> <p>There is another problem here, which <a href="">Rashid Khan</a> touched on during his time on stage: many people are using <a href="">logstash</a> &amp; <a href="">Kibana</a> directly and avoid numerical metric summaries of log data because that numerical data is just an abstraction of an abstraction.</p> <p>The textual logs provide far more insight into what&#39;s happening than numbers:</p> <p><img src="" class="img-responsive" alt="Stacktrace or GTFO"></p> <p>As an ops team, you have one job: provide a platform app developers can wire up logs, checks, and metrics to (in that order). Expose that to them in a meaningful way for analysis later on.</p> <h3>The real target audience for monitoring (or, How You Can Make Money In The Monitoring Space)</h3> <p><a href="">Adrian Cockcroft</a> made a great point in his keynote: we are building monitoring tools for ops people, not the developers who need them most. This is a piercing insight that fundamentally reframes the problem domain for people building monitoring tools.</p> <p>Building monitoring tools and clean integration points for developers is the most important thing we can do if we want to actually improve the quality of people&#39;s lives on a day to day basis.</p> <p>Help your developers ship a Sensu config &amp; checks as part of their app. 
You can even leverage <a href="">existing testing frameworks</a> they are already familiar with.</p> <iframe src="" width="427" height="356" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px 1px 0; margin-bottom:5px; max-width: 100%;" allowfullscreen> </iframe> <p>This puts the power &amp; responsibility of monitoring applications into the hands of people who are closest to the app. Ops still provide value: delivering a scalable monitoring platform, and working with developers to instrument &amp; check their apps. You are reducing duplication of effort and have time to educate non-ops people on how to get the best insight into what&#39;s happening.</p> <p>There is still room for monitoring tools as we&#39;ve traditionally used them, but that&#39;s mostly limited to providing insight into the platforms &amp; environments that ops are providing to developers to run their applications.</p> <p>The majority of application developers don&#39;t care about the internal functioning of the platform though, and they almost certainly don&#39;t want to be alerted about problems within the platform, other than &quot;the platform has problems, we&#39;re working on fixing them&quot;.</p> <p>The money in the monitoring industry is in building monitoring tools to eliminate the friction for developers to get better insight into how their applications are performing and behaving in the real world. 
New Relic is living proof of this, but the market is far larger than what New Relic is currently catering to, and it&#39;s a far larger market than the ops tools market because developers are much more willing to adopt new tools, experiment, and tinker.</p> <p>If you can provide a method for developers to expose application state in a meaningful way while lowering the barrier of entry, they will jump at it.</p> <p>So are you building monitoring tools for the future?</p> Flapjack, heartbeating, and one off events 2014-05-09T00:00:00+00:00 <p>Flapjack assumes a constant stream of events from upstream event producers, and this is fundamental to Flapjack&#39;s design.</p> <p><img src="" alt="a beating heart"></p> <p>Flapjack asks a fundamentally different question to other notification and alerting systems: &quot;How long has a check been failing for?&quot;. Flapjack cares about the elapsed time, not the number of observed failures.</p> <p>Alerting systems that depend on counting the number of observed failures to decide whether to send an alert suffer problems when the observation interval is variable.</p> <p>Take this scenario with a LAMP stack running in a large cluster:</p> <!-- excerpt --> <blockquote> <ol> <li>Nagios detects a single failure in the database layer. It increments the soft state by 1.</li> <li>Nagios detects every service that depends on the database layer is now failing due to timeouts. It increments the soft state by 1 for each of these services.</li> <li>The timeouts for each of these services cause the next recheck of the original database layer check to be delayed (e.g. after an additional 3 minutes). 
When it is eventually checked, its soft state is incremented.</li> <li>The timeouts for the other services get bigger, causing the database layer check to be delayed further.</li> <li>Eventually the original database layer check enters a hard state and alerts.</li> </ol> </blockquote> <p>The above example is a little exaggerated, however the problems with using observed failure counts as a basis for alerting are obvious.</p> <p><a href="">Control theory</a> gives us a lot of practical tools for modelling scenarios like these, and the answer is never pretty - if you rely on the number of times you&#39;ve observed a failure to determine if you need to send an alert, your alerting effectiveness is limited by any latency in your checkers.</p> <p>By looking at how long something has been failing for, Flapjack limits the effects of latency in the observation interval, and provides alerts to humans about problems faster.</p> <p>This leads to an interesting question though - <strong>can I send a one-off event to Flapjack?</strong></p> <p>Technically you can - Flapjack just won&#39;t notify anyone until:</p> <ul> <li>Two events (or more) have been received by Flapjack.</li> <li>30 seconds have elapsed between the first event received by Flapjack and the latest.</li> </ul> <p>This is due to the aforementioned heartbeating behaviour that is baked into Flapjack&#39;s design.</p> <p>As more people are using Flapjack we are seeing increasing demand for one-off event submission. 
There are two key cases:</p> <ul> <li>Arbitrary event submission via HTTP</li> <li>Routing CloudWatch alarms via Flapjack</li> </ul> <p>One way to solve this would be to build a bridge that accepts one-off events, and periodically dispatches a cached value for these events to Flapjack.</p> <p>Flapjack will definitely close this gap in the future.</p> Data driven alerting with Flapjack + Puppet + Hiera 2014-02-12T00:00:00+00:00 <p>On Monday I gave a talk at <a href="">Puppet Camp Sydney 2014</a> about managing Flapjack data (specifically: contacts, notification rules) with Puppet + Hiera.</p> <p>There was a live demo of some new Puppet types I&#39;ve written to manage the data within Flapjack. This is incredibly useful if you want to configure how your on-call staff are notified from within Puppet.</p> <p><a href="">Video</a>:</p> <iframe width="770" height="577" src="//" frameborder="0" allowfullscreen></iframe> <p><a href="">Slides</a>:</p> <script async class="speakerdeck-embed" data-id="a16e2070743d0131f1361a9a8f72571c" data-ratio="1.33333333333333" src="//"></script> <p>The code is a little rough around the edges, but you can try it out in the <a href=""><code>puppet-type</code> branch on vagrant-flapjack</a>.</p> The questions that should have been asked after the RMS outage 2014-01-20T00:00:00+00:00 <blockquote> <h3><a href=",routine-error-caused-nsw-roads-and-maritime-outage.aspx">Routine error caused NSW Roads and Maritime outage</a></h3> <p>The <a href="">NSW Roads and Maritime Services</a>&#39; driver and vehicle registration service suffered a full-day outage on Wednesday due to human error during a routine exercise, an initial review has determined.</p> <p>Insiders told ITnews that the outage, which affected services for most of Wednesday, was triggered by an error made by a database administrator employed by outsourced IT supplier, Fujitsu.</p> <p>The technician had made changes to what was assumed to be the test environment for RMS&#39; Driver and 
Vehicle system (DRIVES), which processes some 25 million transactions a year, only to discover the changes were being made to a production system, iTnews was told.</p> </blockquote> <p>There is a lot to digest here, so let&#39;s start our analysis with two simple and innocuous words in the opening paragraph: &quot;routine exercise&quot;. </p> <!-- excerpt --> <p>If the exercise is routine, how frequently is that release routine followed? Once a day? Once a week? Once a month? Once a year? </p> <p>The article provides some insight into this: </p> <blockquote> <p>&quot;The activity on Tuesday night was carried out ahead of a standard quarterly release of the DRIVES system,&quot; a spokesman for Service NSW said. </p> </blockquote> <p>The statement suggests releases are being done every 3 months. By government standards, RMS&#39;s schedules are likely quite progressive, given their public track record for smooth IT operations and <a href="">an innovative IT procurement strategy</a>. </p> <p>But it&#39;s still a large delta between releases. Given many organisations are moving to daily and even hourly releases to reduce the risk of failures, 3-month release cycles are relatively archaic. </p> <p>Think about everything that can change in three months. There will be big changes that are low impact. There will be small changes that are high impact. There will be everything in between. There will be changes that are determined to be low impact &amp; low risk, but in hindsight will be considered high impact &amp; high risk.</p> <p>Now think about releasing all those changes at once. The longer you wait between releases, the greater the risk something will go wrong.</p> <p>What are the organisational factors that make three monthly releases acceptable? Are people within RMS aware of the pain their current practices are causing? Are those aware of the pain in management or are they on the front line? 
</p> <p>Are either of these groups pushing for more frequent releases? Do any of those people have the power within RMS to make those changes happen? What are the channels for driving organisational change and process improvement?</p> <p>If those channels don&#39;t exist, or when those channels fail, how does the organisation react? How do people within the organisation react?</p> <p>These are all interesting questions that will go a long way to uncovering the extent of the problem, and give people a starting point to address those problems. </p> <p>But that&#39;s rarely the type of coverage you get in the media. The typical narrative in the media and organisations that aren&#39;t learning from their mistakes is very simple:</p> <ul> <li><strong>Attribute</strong> blame to bad apples. </li> <li><strong>Find</strong> the scapegoat. </li> <li><strong>Excise</strong> the perpetrator. </li> </ul> <p>Discussion of these complex issues focuses exclusively on &quot;human error&quot;. </p> <p>But what happens if you replaced the human in that situation with another? The answer is almost certainly going to be &quot;exactly the same outcome&quot;. </p> <p>Humans are just actors in a complex system, or complex systems nested in other complex systems. They are locally rational. They are doing their best based on the information they have at hand. Nobody wakes up in the morning with the intention of causing an accident. </p> <p>We have the facts (or at least a media-manipulated interpretation of them). We know the outcome of the &quot;bad actions&quot;. Our knowledge of the outcome (a day-long outage) taints any interpretation of the events from an &quot;objective&quot; point of view. This is <a href="">hindsight bias</a> in its rawest form. </p> <p>We use hindsight to pass judgement on people who were closest to the thing that went wrong. <a href="">Dekker</a> says <em>&quot;hindsight converts a once vague, unlikely future into an immediate, certain path&quot;</em>. 
After an accident we draw a line in the sand and say:</p> <blockquote> <p><em>&quot;There! They crossed the line! They should have known better!&quot;</em></p> </blockquote> <p>But in the fog of war those actors in a complex system were making what they considered to be rational decisions based on the information they had at hand. Before the accident, the line is a band of grey, and people in the system are drifting within that band. After an accident, that band of grey rapidly consolidates into a thin dark line that the people closest to the accident are conveniently on the other side of. </p> <p>Being mindful of our own hindsight bias, it&#39;s critical we start at the very beginning: What was the operator thinking when they were performing this routine exercise? What information did the operator have at hand that informed their judgements?</p> <p>How many times had the operator performed that &quot;routine&quot; exercise? </p> <p>If they had performed that exercise before, what was different about this instance of the exercise? </p> <p>If they hadn&#39;t performed that exercise before, what training had they received? What support were they provided? Was someone double checking every item on their checklist?</p> <p>What types of behaviour does the organisation incentivise? Does it reward people who take risks, improve processes, and improvise to get the job done? Or does it reward people who don&#39;t rock the boat, who shut up and do their work &mdash; no questions asked?</p> <p>If the incentive is not to rock the boat, do people have the power to put up their hands when their workload is becoming unmanageable? How does the organisation react to people who identify and raise problems with workload? 
Is their workload managed to an achievable level, or are they told to suck it up?</p> <p>Are the powers to flag excessive workload extended to people who work with the organisation, but aren&#39;t necessarily members of the organisation &mdash; like contractors, or outsourced suppliers? </p> <p>And most importantly of all &mdash; after an accident, what message do the words and actions of those in management send to employees? What effect do they have on supplier relationships? </p> <p>The message in RMS&#39;s case is pretty clear:</p> <blockquote> <p><strong>&quot;If you make a mistake we&#39;ll publicly hang you out to dry.&quot;</strong></p> </blockquote> <p>A culture that prioritises blaming individuals over identifying and improving systemic flaws is not a culture I would choose to be part of.</p> The How and Why of Flapjack 2014-01-03T00:00:00+00:00 <p>In October <a href="">@rodjek</a> <a href="">asked on Twitter</a>:</p> <blockquote> <p>&quot;I&#39;ve got a working Nagios (and maybe Pagerduty) setup at the moment. Why and how should I go about integrating Flapjack?&quot;</p> </blockquote> <p>Flapjack will be immediately useful to you if:</p> <ul> <li>You want to <strong>identify failures faster</strong> by rolling up your alerts across multiple monitoring systems.</li> <li>You monitor infrastructures that have <strong>multiple teams</strong> responsible for keeping them up.</li> <li>Your monitoring infrastructure is <strong>multitenant</strong>, and each customer has a <strong>bespoke alerting strategy</strong>.</li> <li>You want to dip your toe in the water and try alternative check execution engines like Sensu, Icinga, or cron in parallel to Nagios.</li> </ul> <!-- excerpt --> <h3>The double-edged Nagios sword (or why monolithic monitoring systems hurt you in the long run)</h3> <p>One short-term advantage of Nagios is how much it can do for you out of the box. 
Check execution, notification, downtime, acknowledgements, and escalations can all be handled by Nagios if you invest a small amount of time understanding how to configure it.</p> <p>This short-term advantage can turn into a long-term disadvantage: because Nagios does so much out of the box, you heavily invest in a single tool that does everything for you. When you hit cases that fit outside the scope of what Nagios can do for you easily, the cost of migrating away from Nagios can be quite high.</p> <p>The biggest killer when migrating away from Nagios is that you have to either:</p> <ul> <li>Find a replacement tool that matches Nagios&#39;s feature set very closely (or at least the subset of features you&#39;re using)</li> <li>Find a collection of tools that integrate well with one another</li> </ul> <p>Given the composable monitoring world we live in, the second option is preferable, but not always possible.</p> <h3>Enter Flapjack</h3> <p><img src="" alt="flapjack logo"></p> <p>Flapjack aims to be a flexible notification system that handles:</p> <ul> <li>Alert routing (determining who should receive alerts based on interest, time of day, scheduled maintenance, etc)</li> <li>Alert summarisation (with per-user, per-media summary thresholds)</li> <li>Your standard operational tasks (setting scheduled maintenance, acknowledgements, etc)</li> </ul> <p>Flapjack sits downstream of your check execution engine (like Nagios, Sensu, Icinga, or cron), processing events to determine if a problem has been detected, who should know about the problem, and how they should be told.</p> <h3>A team player (composable monitoring pipelines)</h3> <p>Flapjack aims to be composable - you should be able to easily integrate it with your existing monitoring check execution infrastructure.</p> <p>There are three immediate benefits you get from Flapjack&#39;s composability:</p> <ul> <li><strong>You can experiment with different check execution engines</strong> without needing to reconfigure 
notification settings across all of them. This helps you be more responsive to customer demands and try out new tools without completely writing off your existing monitoring infrastructure.</li> <li><strong>You can scale your Nagios horizontally.</strong> Nagios can be really performant if you don&#39;t use notifications, acknowledgements, downtime, or parenting. Nagios executes static groups of checks efficiently, so scale the machines you run Nagios on horizontally and use Flapjack to aggregate events from all your Nagios instances and send alerts.</li> <li><strong>You can run multiple check execution engines in production.</strong> Nagios is well suited to some monitoring tasks. Sensu is well suited to others. Flapjack makes it easy for you to use both, and keep your notification settings configured in one place.</li> </ul> <p>While you&#39;re getting familiar with how Flapjack and Nagios play together, you can even do a side-by-side comparison of how Flapjack and Nagios alert by configuring them both to alert at the same time.</p> <h3>Multitenant monitoring</h3> <p>If you work for a service provider, you almost certainly run shared infrastructure to monitor the status of the services you sell your customers.</p> <p>Exposing the observed state to customers from your monitoring system can be a real challenge - most monitoring tools simply aren&#39;t built for this particular requirement.</p> <p><a href="">Bulletproof</a> spearheaded the reboot of Flapjack because multitenancy is a core requirement of Bulletproof&#39;s monitoring platform - we run a shared monitoring platform, and we have very strict requirements about segregating customers and their data from one another.</p> <p>To achieve this, we keep the security model in Flapjack extraordinarily simple - if you can authenticate against Flapjack&#39;s HTTP APIs, you can perform any action.</p> <p>Flapjack pushes authorization complexity to the consumer, because every organisation is going to have very 
particular security requirements, and Flapjack wants to make zero assumptions about what those requirements are going to be.</p> <p>If you&#39;re serious about exposing this sort of data and functionality to your customers, you will need to do some grunt work to provide it through whatever customer portals you already run. We provide a <a href="">very extensive Ruby API client</a> to help you integrate with Flapjack, and Bulletproof has been using this API client in production for over a year in our customer portal.</p> <p>One shortfall of Flapjack right now is we perhaps take multitenancy a little too seriously - the Flapjack user experience for single tenant users still needs a little work.</p> <p>In particular, there are some inconsistencies and behaviours in the Flapjack APIs that make sense in a multitenant context, but are pretty surprising for single tenant use cases.</p> <p>We&#39;re <a href="">actively</a> <a href="">improving</a> <a href="">the single tenant user experience</a> for the Flapjack <a href="">1.0 release</a>.</p> <p>One other killer feature of Flapjack that&#39;s worth mentioning: updating any setting via Flapjack&#39;s HTTP API doesn&#39;t require any sort of restart of Flapjack.</p> <p>This is a significant improvement over tools like Nagios that require full restarts for simple notification changes.</p> <h3>Multiple teams</h3> <p>Flapjack is useful for organisations that segregate responsibility for different systems across different teams, much in the same way Flapjack is useful in a multitenant context.</p> <p>For example:</p> <ul> <li>Your organisation has two on-call rosters - one for customer alerts, and one for internal infrastructure alerts.</li> <li>Your organisation is product focused, with dedicated teams owning the availability of those products end-to-end.</li> </ul> <p>You can feed all your events into Flapjack so operationally you have a single aggregated source of truth of monitoring state, and use the same multitenancy 
features to create custom alerting rules for individual teams.</p> <p>We&#39;re starting to experiment with this at Bulletproof as development teams start owning the availability of products end-to-end.</p> <h3>Summarisation</h3> <p>Probably the most powerful Flapjack feature is alert summarisation. Alerts can be summarised on a per-media, per-contact basis.</p> <p>What on earth does that mean?</p> <p>Contacts (people) are associated with checks. When a check alerts, a contact can be notified on multiple media (Email, SMS, Jabber, PagerDuty).</p> <p>Each medium has a summarisation threshold that allows a contact to specify when alerts should be &quot;rolled up&quot; so the contact doesn&#39;t receive multiple alerts during incidents.</p> <p>If you&#39;ve used <a href="">PagerDuty</a> before, you&#39;ve almost certainly experienced similar behaviour when you have multiple alerts assigned to you at a time.</p> <p>Summarisation is particularly useful in multitenant environments where contacts only care about a subset of things being monitored, and don&#39;t want to be overwhelmed with alerts for each individual thing that has broken.</p> <p>To generalise, large numbers of alerts indicate either a total system failure of the thing being monitored, or false positives in the monitoring system.</p> <p>In either case, nobody wants to receive a deluge of alerts.</p> <p>Mitigating the effects of monitoring false positives is especially important when you consider how failures in the <a href="">monitoring pipeline</a> cascade into surrounding stages of the pipeline.</p> <p>Monitoring alert recipients generally don&#39;t care about the extent of a monitoring system failure (how many things are failing simultaneously, as evidenced by an alert for each thing), they care that the monitoring system can&#39;t be trusted right now (at least until the underlying problem is fixed).</p> <h3>What Flapjack is not</h3> <ul> <li><strong>Check execution engine.</strong> Sensu, Nagios, 
and cron already do a fantastic job of this. You still need to configure a tool to run your monitoring checks - Flapjack just processes events generated elsewhere and does notification magic.</li> <li><strong>PagerDuty replacement.</strong> Flapjack and PagerDuty <em>complement</em> one another. PagerDuty has excellent on-call scheduling and escalation support, which is something that Flapjack doesn&#39;t try to go near. Flapjack can trigger alerts in PagerDuty.</li> </ul> <p>At Bulletproof we use Flapjack to process events from Nagios, and work out if our on-call or customers should be notified about state changes. Our customers receive alerts directly from Flapjack, and our on-call receive alerts from PagerDuty, via Flapjack&#39;s PagerDuty gateway.</p> <p>The Flapjack PagerDuty gateway has a neat feature: it polls the PagerDuty API for alerts it knows are unacknowledged, and will update Flapjack&#39;s state if it detects alerts have been acknowledged in PagerDuty.</p> <p>This is super useful for eliminating the double handling of alerts, where an on-call engineer acknowledges an alert in PagerDuty, and then has to go and acknowledge the alert in Nagios.</p> <p>In the Flapjack world, the on-call engineer acknowledges the alert in PagerDuty, Flapjack notices the acknowledgement in PagerDuty, and Flapjack updates its own state.</p> <h3>How do I get started?</h3> <p>Follow the <a href="">quickstart guide</a> to get Flapjack running locally using Vagrant.</p> <p>The quickstart guide will take you through basic Flapjack configuration, pushing check results from Nagios into Flapjack, and configuring contacts and entities.</p> <p>Once you&#39;ve finished the tutorial, check out the <a href="">Flapjack Puppet module</a> and <a href="">manifest that sets up</a> the Vagrant box.</p> <p>Examining the Puppet module will give you a good starting point for rolling out Flapjack into your monitoring environment.</p> <h3>Where to next?</h3> <p>We&#39;re gearing up to 
release Flapjack 1.0.</p> <p>If you take a look at Flapjack in the next little while, please let us know any feedback you have on the <a href="!forum/flapjack-project">Google group</a>, or ping <a href="">@auxesis</a> or <a href="">@jessereynolds</a> on Twitter.</p> <p><a href="">Jesse</a> and I are also <a href="">running a tutorial at 2014</a> in Perth next Wednesday, and we&#39;ll make the slides available online.</p> <p>Happy Flapjacking!</p> CLI testing with RSpec and Cucumber-less Aruba 2013-12-06T00:00:00+00:00 <p>At <a href="">Bulletproof</a>, we are increasingly finding home brew systems tools are critical to delivering services to customers.</p> <p>These tools generally wrap a collection of libraries and other general Open Source tools to solve specific business problems, like automating a service delivery pipeline.</p> <p>Traditionally these systems tools tend to lack good tests (or simply any tests) for a number of reasons:</p> <ul> <li>The tools are quick and dirty</li> <li>The tools model business processes that are often in flux</li> <li>The tools are written by systems administrators</li> </ul> <p>Sysadmins don&#39;t necessarily have a strong background in software development. They are likely proficient in Bash, and have hacked a little Python or Ruby. If they&#39;ve really gotten into the <a href="">infrastructure as code</a> thing they might have delved into the innards of Chef and Puppet and been exposed to those projects&#39; respective testing frameworks.</p> <p>In a lot of cases, testing is seen as <em>&quot;something I&#39;ll get to when I become a real developer&quot;</em>.</p> <!-- excerpt --> <p>The success of technical businesses can be tied to the <a href="">quality of their tools</a>.</p> <p>Ask any software developer how they&#39;ve felt inheriting an untested or undocumented code base, and you&#39;ll likely hear wails of horror. 
Working with such a code base is a painful exercise in frustration.</p> <p>And this is what many sysadmins are doing on a daily basis when hacking on their janky scripts that have <a href="">evolved to send and read email</a>.</p> <p>So let&#39;s build better systems tools:</p> <!-- excerpt --> <ul> <li>We want to ensure our systems tools are of a consistent high quality</li> <li>We want to ensure new functionality doesn&#39;t break old functionality</li> <li>We want to verify we don&#39;t introduce regressions</li> <li>We want to streamline peer review of changes</li> </ul> <p>We can achieve much of this by skilling up sysadmins on how to write tests, adopting a developer mindset to write system tools, and providing them with a good framework that helps frame questions that can be answered with tests.</p> <p>We want our engineers to feel confident their changes are going to work, and they are consistently meeting our quality standards.</p> <h2>But what do you test?</h2> <p>We&#39;ve committed to testing, but what exactly do we test?</p> <p><a href="">Unit</a> and <a href="">integration</a> tests are likely not relevant unless the cli tool is large and unwieldy.</p> <p><strong>The user of the tool doesn&#39;t care whether the tool is tested. 
The user cares whether they can achieve a goal.</strong> Therefore, the tests should verify that the user can achieve those goals.</p> <p><a href="">Acceptance tests</a> are a good fit because we want to treat the cli tool as a black box and test what the user sees.</p> <p>Furthermore, we don&#39;t care how the tool is actually built.</p> <p>We can write a generic set of high level tests that are decoupled from the language the tool is implemented in, and refactor the tool to a more appropriate language once we&#39;re more familiar with the problem domain.</p> <h2>How do you test command line applications?</h2> <p><a href="">Aruba</a> is a great extension to <a href="">Cucumber</a> that helps you write high level acceptance tests for command line applications, regardless of the language those cli apps are written in.</p> <p>There are actually two parts to Aruba:</p> <ol> <li>Pre-defined Cucumber steps for running + verifying behaviour of command line applications locally</li> <li>An API to perform the actual testing, that is called by the Cucumber steps</li> </ol> <div class="highlight"><pre><code class="language-cucumber" data-lang="cucumber"><span class="k">Scenario:</span><span class="nf"> create a file</span> <span class="k"> Given </span><span class="nf">a file named &quot;</span><span class="s">foo/bar/example.txt</span><span class="nf">&quot; with:</span> <span class="nf"> </span><span class="k">&quot;&quot;&quot;</span><span class="s"></span> <span class="s"> hello world</span> <span class="s"> </span><span class="k">&quot;&quot;&quot;</span><span class="nf"></span> <span class="nf"> </span><span class="k">When </span><span class="nf">I run `cat foo/bar/example.txt`</span> <span class="nf"> </span><span class="k">Then </span><span class="nf">the output should contain exactly &quot;</span><span class="s">hello world</span><span class="nf">&quot;</span> </code></pre></div> <p>The other player in the command line application testing game is <a 
href="">serverspec</a>. It can do very similar things to Aruba, and provides some fancy <a href="">RSpec</a> matchers and helper methods to make the tests look neat and elegant:</p> <div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="n">describe</span> <span class="n">package</span><span class="p">(</span><span class="s1">&#39;httpd&#39;</span><span class="p">)</span> <span class="k">do</span> <span class="n">it</span> <span class="p">{</span> <span class="n">should</span> <span class="n">be_installed</span> <span class="p">}</span> <span class="k">end</span> <span class="n">describe</span> <span class="n">service</span><span class="p">(</span><span class="s1">&#39;httpd&#39;</span><span class="p">)</span> <span class="k">do</span> <span class="n">it</span> <span class="p">{</span> <span class="n">should</span> <span class="n">be_enabled</span> <span class="p">}</span> <span class="n">it</span> <span class="p">{</span> <span class="n">should</span> <span class="n">be_running</span> <span class="p">}</span> <span class="k">end</span> <span class="n">describe</span> <span class="n">port</span><span class="p">(</span><span class="mi">80</span><span class="p">)</span> <span class="k">do</span> <span class="n">it</span> <span class="p">{</span> <span class="n">should</span> <span class="n">be_listening</span> <span class="p">}</span> <span class="k">end</span> </code></pre></div> <p>The cool thing about serverspec that sets it apart from Aruba is it can test things locally <em>and</em> remotely via SSH.</p> <p>This is useful when testing automation that creates servers somewhere: run the tool, connect to the server created, verify conditions are met.</p> <p>But what happens when we want to test the behaviour of tools that create things both locally and remotely? For local testing Aruba is awesome. For remote testing, serverspec is a great fit.</p> <p>But Aruba is Cucumber, and serverspec is RSpec. 
Does this mean we have to write and maintain two separate test suites?</p> <p>Given we&#39;re trying to encourage people who have traditionally never written tests before to write tests, we want to remove extraneous tooling to make testing as simple as possible.</p> <p>A single test suite is a good start.</p> <p>This test suite should be able to run both local + remote tests, letting us use the powerful built-in tests from Aruba, and the great remote tests from serverspec.</p> <p>There are two obvious ways to slice this:</p> <ol> <li>Use serverspec like Aruba - build common steps around serverspec matchers</li> <li>Use the Aruba API without the Cucumber steps</li> </ol> <p>We opted for the second approach - use the Aruba API from within RSpec, sans the Cucumber steps.</p> <p>Opinions on Cucumber within Bulletproof R&amp;D are split between love and loathing. There&#39;s a reasonable argument to be made that Cucumber adds a layer of abstraction to tests that increases maintenance of tests and slows down development. 
On the other hand, Cucumber is great for capturing high level user requirements in a format those users are able to understand.</p> <p>Again, given we are trying to keep things as simple as possible, eliminating Cucumber from the testing setup to focus purely on RSpec seemed like a reasonable approach.</p> <p>The path was pretty clear:</p> <ol> <li>Do a small amount of grunt work to allow the Aruba API to be used in RSpec</li> <li>Provide a small amount of coaching to developers on workflow</li> <li>Let the engineers run wild</li> </ol> <h2>How do you make Aruba work without Cucumber?</h2> <p>It turns out this was easier than expected.</p> <p>First you add Aruba to your Gemfile:</p> <div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># Gemfile</span> <span class="n">source</span> <span class="s1">&#39;;</span> <span class="n">group</span> <span class="ss">:development</span> <span class="k">do</span> <span class="n">gem</span> <span class="s1">&#39;rake&#39;</span> <span class="n">gem</span> <span class="s1">&#39;rspec&#39;</span> <span class="n">gem</span> <span class="s1">&#39;aruba&#39;</span> <span class="k">end</span> </code></pre></div> <p>Run the obligatory <code>bundle</code> to ensure all dependencies are installed locally:</p> <div class="highlight"><pre><code class="language-bash" data-lang="bash">bundle </code></pre></div> <p>Add a default Rake task to execute tests, to speed up the developer&#39;s workflow, and make tests easy to run from CI:</p> <div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># Rakefile</span> <span class="nb">require</span> <span class="s1">&#39;rspec/core/rake_task&#39;</span> <span class="no">RSpec</span><span class="o">::</span><span class="no">Core</span><span class="o">::</span><span class="no">RakeTask</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="ss">:spec</span><span class="p">)</span> <span 
class="n">task</span> <span class="ss">:default</span> <span class="o">=&gt;</span> <span class="o">[</span><span class="ss">:spec</span><span class="o">]</span> </code></pre></div> <p>Bootstrap the project with RSpec:</p> <div class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>rspec --init </code></pre></div> <p>Require and include the Aruba API bits in the specs:</p> <div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># spec/template_spec.rb</span> <span class="nb">require</span> <span class="s1">&#39;aruba&#39;</span> <span class="nb">require</span> <span class="s1">&#39;aruba/api&#39;</span> <span class="kp">include</span> <span class="no">Aruba</span><span class="o">::</span><span class="no">Api</span> </code></pre></div> <p>This pulls in <em>just</em> the API helper methods in the <code>Aruba::Api</code> namespace. These are what we&#39;ll be using to run commands, test outputs, and inspect files. The <code>include Aruba::Api</code> makes those methods available in the current namespace.</p> <p>Then we set up <code>PATH</code> so the tests know where executables are:</p> <div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># spec/template_spec.rb</span> <span class="nb">require</span> <span class="s1">&#39;pathname&#39;</span> <span class="n">root</span> <span class="o">=</span> <span class="no">Pathname</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="bp">__FILE__</span><span class="p">)</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">parent</span> <span class="c1"># Allows us to run commands directly, without worrying about the CWD</span> <span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;PATH&#39;</span><span class="o">]</span> <span class="o">=</span> <span class="s2">&quot;</span><span class="si">#{</span><span 
class="n">root</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s1">&#39;bin&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">to_s</span><span class="si">}#{</span><span class="no">File</span><span class="o">::</span><span class="no">PATH_SEPARATOR</span><span class="si">}#{</span><span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;PATH&#39;</span><span class="o">]</span><span class="si">}</span><span class="s2">&quot;</span> </code></pre></div> <p>The <code>PATH</code> environment variable is used by Aruba to find commands we want to run. We could specify a full path in each test, but by setting <code>PATH</code> above we can just call the tool by its name, completely pathless, like we would be doing on a production system.</p> <h2>How do you go about writing tests?</h2> <p>The workflow for writing stepless Aruba tests that still use the Aruba API is pretty straightforward:</p> <ol> <li>Find the relevant step from <a href="">Aruba&#39;s <code>cucumber.rb</code></a></li> <li>Look at how the step is implemented (what methods are called, what arguments are passed to the method, how output is captured later on, etc)</li> <li>Take a quick look at how the method is implemented <a href="">in Aruba::Api</a></li> <li>Write your tests in pure-RSpec</li> </ol> <p>Here&#39;s an example test:</p> <div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># spec/template_spec.rb</span> <span class="c1"># genud is the name of the tool we&#39;re testing</span> <span class="n">describe</span> <span class="s2">&quot;genud&quot;</span> <span class="k">do</span> <span class="n">describe</span> <span class="s2">&quot;YAML templates&quot;</span> <span class="k">do</span> <span class="n">it</span> <span class="s2">&quot;should emit valid YAML to STDOUT&quot;</span> <span class="k">do</span> <span class="n">fqdn</span> <span class="o">=</span> <span 
class="s1">&#39;;</span> <span class="c1"># Run the command with Aruba&#39;s run_simple helper</span> <span class="n">run_simple</span> <span class="s2">&quot;genud --fqdn </span><span class="si">#{</span><span class="n">fqdn</span><span class="si">}</span><span class="s2"> --template </span><span class="si">#{</span><span class="n">template</span><span class="si">}</span><span class="s2">&quot;</span> <span class="c1"># Test the YAML can be parsed</span> <span class="nb">lambda</span> <span class="p">{</span> <span class="n">userdata</span> <span class="o">=</span> <span class="no">YAML</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">all_output</span><span class="p">)</span> <span class="n">userdata</span><span class="o">.</span><span class="n">should_not</span> <span class="n">be_nil</span> <span class="p">}</span><span class="o">.</span><span class="n">should_not</span> <span class="n">raise_error</span> <span class="n">assert_exit_status</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="k">end</span> <span class="k">end</span> <span class="k">end</span> </code></pre></div> <h2>Multiple inputs, and DRYing up the tests</h2> <p>Testing multiple inputs and outputs of the tool is important for verifying the behaviour of the tool in the wild.</p> <p>Specifically, we want to know the same inputs create the same outputs if we make a change to the tool, and we want to know that new inputs we add are valid in multiple use cases.</p> <p>We also don&#39;t want to write test cases for each instance of test data - generating the tests automatically would be ideal.</p> <p>Our first approach at doing this was to glob a bunch of test data and test the behaviour of the tool for each instance of test data:</p> <div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># spec/template_spec.rb</span> <span class="n">describe</span> <span 
class="s2">&quot;genud&quot;</span> <span class="k">do</span> <span class="n">describe</span> <span class="s2">&quot;YAML templates&quot;</span> <span class="k">do</span> <span class="n">it</span> <span class="s2">&quot;should emit valid YAML to STDOUT&quot;</span> <span class="k">do</span> <span class="c1"># The inputs we want to test</span> <span class="n">templates</span> <span class="o">=</span> <span class="no">Dir</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="n">root</span> <span class="o">+</span> <span class="s1">&#39;templates&#39;</span> <span class="o">+</span> <span class="s2">&quot;*.yaml.erb&quot;</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">template</span><span class="o">|</span> <span class="n">fqdn</span> <span class="o">=</span> <span class="s1">&#39;;</span> <span class="c1"># Run the command with Aruba&#39;s run_simple helper</span> <span class="n">run_simple</span> <span class="s2">&quot;genud --fqdn </span><span class="si">#{</span><span class="n">fqdn</span><span class="si">}</span><span class="s2"> --template </span><span class="si">#{</span><span class="n">template</span><span class="si">}</span><span class="s2">&quot;</span> <span class="c1"># Test the YAML can be parsed</span> <span class="nb">lambda</span> <span class="p">{</span> <span class="n">userdata</span> <span class="o">=</span> <span class="no">YAML</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">all_output</span><span class="p">)</span> <span class="n">userdata</span><span class="o">.</span><span class="n">should_not</span> <span class="n">be_nil</span> <span class="p">}</span><span class="o">.</span><span class="n">should_not</span> <span class="n">raise_error</span> <span class="n">assert_exit_status</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="k">end</span> <span 
class="k">end</span> <span class="k">end</span> <span class="k">end</span> </code></pre></div> <p>This worked great provided all the tests were passing, but the tests themselves became a black box when one of the test data inputs caused a failure.</p> <p>The engineer would need to add a bunch of <code>puts</code> statements all over the place to determine which input was causing the failure. And even worse, early test failures mask failures in later test data.</p> <p>To combat this, we DRY&#39;d up the tests by doing the <code>Dir.glob</code> once in the outer scope, rather than in each test:</p> <div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># spec/template_spec.rb</span> <span class="n">describe</span> <span class="s2">&quot;genud&quot;</span> <span class="k">do</span> <span class="n">templates</span> <span class="o">=</span> <span class="no">Dir</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="n">root</span> <span class="o">+</span> <span class="s1">&#39;templates&#39;</span> <span class="o">+</span> <span class="s2">&quot;*.yaml.erb&quot;</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">template</span><span class="o">|</span> <span class="n">describe</span> <span class="s2">&quot;YAML templates&quot;</span> <span class="k">do</span> <span class="n">describe</span> <span class="s2">&quot;</span><span class="si">#{</span><span class="no">File</span><span class="o">.</span><span class="n">basename</span><span class="p">(</span><span class="n">template</span><span class="p">)</span><span class="si">}</span><span class="s2">&quot;</span> <span class="k">do</span> <span class="n">it</span> <span class="s2">&quot;should emit valid YAML to STDOUT&quot;</span> <span class="k">do</span> <span class="n">fqdn</span> <span class="o">=</span> <span class="s1">&#39;;</span> <span class="c1"># Run the command with Aruba&#39;s run_simple helper</span> 
<span class="n">run_simple</span> <span class="s2">&quot;genud --fqdn </span><span class="si">#{</span><span class="n">fqdn</span><span class="si">}</span><span class="s2"> --template </span><span class="si">#{</span><span class="n">template</span><span class="si">}</span><span class="s2">&quot;</span> <span class="c1"># Test the YAML can be parsed</span> <span class="nb">lambda</span> <span class="p">{</span> <span class="n">userdata</span> <span class="o">=</span> <span class="no">YAML</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">all_output</span><span class="p">)</span> <span class="n">userdata</span><span class="o">.</span><span class="n">should_not</span> <span class="n">be_nil</span> <span class="p">}</span><span class="o">.</span><span class="n">should_not</span> <span class="n">raise_error</span> <span class="n">assert_exit_status</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="k">end</span> <span class="k">end</span> <span class="k">end</span> <span class="k">end</span> <span class="k">end</span> </code></pre></div> <p>This produces a nice clean test output that decouples the tests from one another while providing the engineer more insight into what test data triggered a failure:</p> <div class="highlight"><pre><code class="language-text" data-lang="text">$ be rake genud YAML templates test.yaml.erb should emit valid YAML to STDOUT YAML templates test2.yaml.erb should emit valid YAML to STDOUT </code></pre></div> <h2>Where to from here?</h2> <p>The above test rig is a good first pass at meeting our goals for building systems tools:</p> <ul> <li>We want to ensure our systems tools are of a consistent high quality</li> <li>We want to ensure new functionality doesn&#39;t break old functionality</li> <li>We want to verify we don&#39;t introduce regressions</li> <li>We want to streamline peer review of changes</li> </ul> <p>&hellip; but we want to take it to the 
next level: integrating serverspec into the same test suite.</p> <p>Having a quick feedback loop to verify local operation of the tool is essential to engineer productivity, especially when remote operations of these types of system tools can take upwards of 10 minutes to complete.</p> <p>But we have to verify the output of local operation actually creates the desired service at the other end. serverspec will help us do this.</p> Just post mortems 2013-09-25T00:00:00+00:00 <p>Earlier this week I <a href="">gave a talk</a> at Monitorama EU on <a href="">psychological factors that should be considered when designing alerts</a>.</p> <p><a href="">Dave Zwieback</a> pointed me to a great blog post of his on <a href="">managing the human side of post mortems</a>, which bookends nicely with my talk:</p> <blockquote> <p>Imagine you had to write a postmortem containing statements like these:</p> <blockquote> <p>We were unable to resolve the outage as quickly as we would have hoped because our decision making was impacted by extreme stress.</p> <p>We spent two hours repeatedly applying the fix that worked during the previous outage, only to find out that it made no difference in this one.</p> <p>We did not communicate openly about an escalating outage that was caused by our botched deployment because we thought we were about to lose our jobs.</p> </blockquote> <p>While the above scenarios are entirely realistic, it&#39;s hard to find many postmortem write-ups that even hint at these &quot;human factors.&quot; Their absence is, in part, due to the social stigma associated with publicly acknowledging their contribution to outages.</p> </blockquote> <p>Dave&#39;s third example dovetails well with some of the examples in <a href="">Dekker&#39;s Just Culture</a>.</p> <!-- excerpt --> <p>Dekker posits that people fear the consequences of reporting mistakes because:</p> <ul> <li>They don&#39;t know what the consequences will be</li> <li>The consequences of reporting can be really 
bad</li> </ul> <p>The last point can be especially important when you consider how things like <a href="">hindsight bias</a> elevate the importance of proximity.</p> <p>Simply put: when looking at the consequences of an accident, we tend to blame people who were closest to the thing that went wrong.</p> <p>In the middle of an incident, unless you know your organisation has your back if you volunteer mistakes you have made or witnessed, you are more likely to withhold situationally helpful but professionally damaging information.</p> <p>This limits the team&#39;s operational effectiveness and perpetuates a culture of secrecy, thwarting any organisational learning.</p> <p>I think for Dave&#39;s first example to work effectively (<em>&quot;our decision making was impacted by extreme stress&quot;</em>), you would need to quantify what the causes and consequences of that stress are.</p> <p>At <a href="">Bulletproof</a> we are very open with customers in our problem analyses about the technical details of what fails, because our customers are deeply technical themselves, appreciate the detail, and would cotton on quickly if we were pulling the wool over their eyes.</p> <p>This works well for all parties because all parties have comparable levels of technical knowledge.</p> <p>There is risk when we talk about stress in general terms because psychological knowledge is not evenly distributed.</p> <p>Because every man and his dog has experienced stress, every man and his dog feels qualified to talk about and comment on other people&#39;s reactions to stress. Furthermore, it&#39;s a natural reaction to distance yourself from bad qualities you recognise in yourself by attacking and ridiculing those qualities in others.</p> <p>I&#39;d wager that outsiders would be more reserved in passing judgement when unfamiliar concepts or terminology are used (e.g. 
talking about <a href="">confirmation bias</a>, the <a href="">Semmelweis reflex</a>, etc).</p> <p>You could reasonably argue that by using those concepts or terminology you are deliberately using jargon to obfuscate information to those outsiders and <a href="">Cover Your Arse</a>. However, I would counter that it&#39;s a good opportunity to open a dialog with those outsiders on building just cultures, eschewing the use of labels like human error, and how cognitive biases are amplified in stressful situations.</p> Counters not DAGs 2013-07-17T00:00:00+00:00 <p>Monitoring dependency graphs are fine for small environments, but they are not a good fit for nested complex environments, like those that make up modern web infrastructures.</p> <p><a href="">DAGs</a> are a very alluring data structure to represent monitoring relationships, but they fall down once you start using them to represent relationships at scale:</p> <ul> <li><p><strong>There is an assumption there is a direct causal link between edges of the graph.</strong> It&#39;s very tempting to believe that you can trace failure from one edge of the graph to another. Failures in one part of a complex system all too often have weird effects and <a href="">induce failure on other components</a> of the same system that are quite removed from one another.</p></li> <li><p><strong>Complex systems are almost impossible to model.</strong> With time and an endless stream of money you can sufficiently model the failure modes within complex systems in isolation, but fully understanding and predicting how complex systems interact and relate with one another is almost impossible. 
The only way to model this effectively is to <a href="">have a closed system with very few external dependencies</a>, which is the opposite of the situation every web operations team is in.</p></li> <li><p><strong>The cost of maintaining the graph is non-trivial.</strong> You could employ a team of extremely skilled engineers to understand and model the relationships between each component in your infrastructure, but their work would never be done. On top of that, given the sustained growth most organisations experience, whatever you model will likely change within 12-18 months. <em>Fundamentally it would not provide a good return on investment</em>.</p></li> </ul> <!-- excerpt --> <h3>check_check</h3> <p>This isn&#39;t a new problem.</p> <p><a href="">Jordan Sissel</a> wrote a great post as part of Sysadvent almost three years ago <a href="">about check_check</a>.</p> <p>His approach is simple and elegant:</p> <ul> <li>Configure checks in Nagios, but configure a contact that drops the alerts</li> <li>Read Nagios&#39;s state out of a file + parse it</li> <li>Aggregate the checks by regex, and alert if a percentage is critical</li> </ul> <p>It&#39;s a godsend for people who manage large Nagios instances, but it starts falling down if you&#39;ve got multiple independent Nagios instances (shards) that are checking the same thing.</p> <p>You still end up with a situation where each of your shards alerts if the shared entity they&#39;re monitoring fails.</p> <h3>Flapjack</h3> <p>This is the concrete use case behind why <a href="">we&#39;re</a> <a href="">rebooting Flapjack</a> - we want to stream the event data from all Nagios shards to Flapjack, and do smart things around notification.</p> <p>The approach we&#39;re looking at in Flapjack is pretty similar to <code>check_check</code> - set thresholds on the number of failure events we see for particular entities - but we want to take it one step further.</p> <p>Entities in Flapjack <a href="">can be tagged</a>, so we 
automatically create &quot;failure counters&quot; for each of those tags.</p> <p>When checks on those entities fail, we simply increment each of those failure counters. Then we can set thresholds on each of those counters (based on absolute value like &gt; 30 entities, or percentage like &gt; 70% of entities), and perform intelligent actions like:</p> <ul> <li>Send a single notification to on-call with a summary of the failing tag counters</li> <li>Rate limit alerts and provide summary alerts to customers</li> <li>Wake up the relevant owners of the infrastructure that is failing</li> <li>Trigger a &quot;workaround engine&quot; that attempts to resolve the problem in an automated way</li> </ul> <p>The result of this is that on-call aren&#39;t overloaded with alerts, we involve the people who can fix the problems sooner, and it all works across multiple event sources.</p> <p><strong>One note on complexity</strong>: I am not convinced that automated systems that try to derive meaning from relationships in a graph (or even tag counters) and present the operator with a conclusion are going to provide anything more than a best-guess abstraction of the problem. <em>In the real world, that best guess is most likely wrong</em>.</p> <p>We need to provide better rollup capabilities that give the operator a summarised view of the current facts, and allow the operator to do their own investigation untainted by the assumptions of the programmer who wrote the inaccurate heuristic.</p> <p>Flapjack&#39;s (and <code>check_check</code>&#39;s) approach also minimises the maintenance burden, as tagging of entities becomes the only thing required to build smarter aggregation + analysis tools. 
This information can easily be pulled out of configuration management.</p> <p>More metadata == more granularity == faster resolution times.</p> How we do Kanban 2013-05-20T00:00:00+00:00 <p>At my <a href="">day job</a>, I run a <a href="">distributed team</a> of infrastructure coders spread across Australia + one in Vietnam. Our team is called the Software team, but we&#39;re more analogous to a product focused <a href="">Research &amp; Development</a> team.</p> <p>Other teams at Bulletproof are a mix of office and remote workers, but our team is a little unique in that we&#39;re fully distributed. We do daily standups using Google Hangouts, and try to do face to face meetups every few months at Bulletproof&#39;s offices in Sydney.</p> <p>Intra-team communication is something we&#39;re good at, but I&#39;ve been putting a lot of effort lately into improving how our team communicates with others in the business.</p> <p>This is a post I wrote on our internal company blog explaining how we schedule work, and why we work this way.</p> <hr> <p><img src="" alt="our physical wallboard in the office"></p> <h3>What on earth is this?</h3> <p>This is a <a href="">Kanban board</a>.</p> <!-- excerpt --> <p>A Kanban board is a tool for implementing Kanban. 
<a href="">Kanban</a> is a scheduling system developed at Toyota in the 70&#39;s as part of the broader <a href="">Toyota Production System</a>.</p> <p>Applied to <a href="">software development</a>, the top three things Kanban aims to achieve are:</p> <ul> <li><strong>Visualise</strong> the flow of work</li> <li><strong>Limit</strong> the Work-In-Progress (WIP)</li> <li><strong>Manage</strong> and optimise the flow of work</li> </ul> <h3>How does Kanban work for the Software team?</h3> <p>In practical terms, work tends to be tracked in:</p> <ul> <li><strong><a href="">RT tickets</a></strong>, as created using the standard request process, or escalated from other teams</li> <li><strong><a href="">GitHub issues</a></strong>, for product improvements, and work discovered while doing other work</li> <li><strong>Ad-hoc requests</strong>, through informal communication channels (IM, email)</li> </ul> <p>Because Software deals with requests from many audiences, we use a Kanban board to visualise work from request to completion across all these systems.</p> <h3>Managing flow</h3> <p>As of writing, we have 5 stages a task progresses through:</p> <p><img src="" alt="the board"></p> <ul> <li><strong>To Do</strong> - tasks <a href="">triaged</a>, and scheduled to be worked on next</li> <li><strong>Doing</strong> - tasks being worked on right now</li> <li><strong>Deployable</strong> - completed tasks that need to be released to production in the near future (generally during change windows)</li> <li><strong>Done</strong> - completed tasks</li> </ul> <p>That&#39;s only 4 - there is another stage called the Icebox. 
This is for tasks we&#39;re aware of, but haven&#39;t been triaged and aren&#39;t scheduled to be worked on yet.</p> <p>Done tasks are cleaned out once a week on Mondays, after the morning standup.</p> <p><strong>Triage</strong> is the process of taking a request and:</p> <ul> <li>Determining the business priority</li> <li>Breaking it up into smaller tasks</li> <li>(Tentatively) allocating it to someone</li> <li>Classifying the type of work (Internal, Customer, <a href="">BAU</a>)</li> <li>Estimating a task completion time</li> </ul> <p>We use the board exclusively to visualise the tasks - we don&#39;t communicate with the stakeholder through the board.</p> <p>Each task has a pointer to the system the request originated from:</p> <p><img src="" alt="detailed view"></p> <p>…and a little bit of metadata about the overall progress.</p> <p>Communication with the stakeholder is done through the RT ticket / GitHub issue / email.</p> <h3>Limiting WIP</h3> <p>The <a href="">WIP</a> Limit is an artificial limit on the number of tasks the whole team can work on simultaneously. We currently calculate the WIP as:</p> <blockquote> <p>(Number of people in Software) x 2</p> </blockquote> <p>The goal here is to ensure no one person is ever working on more than 2 tasks at once.</p> <p>I can hear you thinking <em>&quot;That&#39;s crazy and will never work for me! I&#39;m always dealing with multiple requests simultaneously&quot;</em>.</p> <p>The key to making the WIP Limit work is that <strong>tasks are never pushed</strong> through the system - <strong>they are pulled</strong> by the people doing the work. Once you finish your current task, you pull across the next highest priority task from the To Do column.</p> <p>The WIP Limit is particularly useful when coupled with visualising flow because:</p> <ul> <li>If people need to work on more than 2 things at once, it&#39;s indicative of a bigger scheduling contention problem that needs to be solved. 
We are likely context switching rapidly, which sharply reduces our delivery throughput.</li> <li>If the team is constantly working at the WIP limit, we need more people. We always aim to have at least 20% slack in the system to deal with ad-hoc tasks that bubble up throughout the day. If we&#39;re operating at 100% capacity, we have no room to breathe, and this severely reduces our operational effectiveness.</li> </ul> <h3>Visualising flow</h3> <p>Work makes its way from left to right across the board.</p> <p>This is valuable for communicating to people where their requests sit in the overall queue of work, but also in identifying bottlenecks where work isn&#39;t getting completed.</p> <p>The <a href="">Kanban tool</a> we use colour codes tasks based on how long they have been sitting in the same column:</p> <p><img src="" alt="colour coding of tasks"></p> <p>This is vital for identifying work that people are blocked on completing, and tends to be indicative of one of two things:</p> <ul> <li>Work that is too large and needs to be broken down into smaller tasks</li> <li>Work that is more complex or challenging than originally anticipated</li> </ul> <p>The latter is an interesting case, because it may require pulling people off other work to help the person assigned that task push through and complete that work.</p> <p>Normally as a manager this isn&#39;t easy to discover unless you are regularly polling your people about their progress, but that behaviour is incredibly annoying to be on the receiving end of.</p> <p>The board is updated in real time as people in the team do work, which means as a manager I can get out of their way and let them Get Shit Done while having a passive visual indicator of any blockers in the system.</p> Escalating Complexity 2013-05-15T00:00:00+00:00 <p>Back in 2009 when I was backpacking around Europe I remember waking up on the morning of June 1 and reading about how an Air France flight had disappeared somewhere over the Atlantic.</p> 
<p>The lack of information on what happened to the flight intrigued me, and given the traveling I was doing, I was left wondering &quot;what if I was on that plane?&quot;</p> <p>Keeping an ear out for updates, in December 2011 I stumbled upon the <a href="">Popular Mechanics article</a> describing the final moments of the flight. I was left fascinated by how a technical system so advanced could fail so horribly, apparently because of the faulty meatware operating it.</p> <!-- excerpt --> <p>Around the same time I began reading the works of <a href="">Sidney Dekker</a>. I was left in a state of cognitive dissonance, trying to reconcile the mainstream explanation of what happened in the final moments of AF447 (the pilots were poorly trained, inexperienced, and simply incompetent) with the New View that the operators were merely locally rational actors within a complex system, and that &quot;root cause is simply the place you stop looking further&quot; - with that cause far too commonly attributed to humans.</p> <p>I decided to do my own research, which resulted in me producing a talk that has received the strongest reaction of any talk I&#39;ve ever given.</p> <iframe width="560" height="315" src="" frameborder="0" allowfullscreen></iframe> <blockquote> <p>On June 1, 2009 Air France 447 crashed into the Atlantic ocean killing all 228 passengers and crew. The 15 minutes leading up to the impact were a terrifying demonstration of how thick the fog of war is in complex systems.</p> <p>Mainstream reports of the incident put the blame on the pilots - a common motif in incident reports that conveniently ignore a simple fact: people were just actors within a complex system, doing their best based on the information at hand.</p> <p>While the systems you build and operate likely don&#39;t control the fate of people&#39;s lives, they share many of the same complexity characteristics. 
Dev and Ops can learn an abundance from how the feedback loops between these aviation systems are designed and how these systems are operated.</p> <p>In this talk Lindsay will cover what happened on the flight, why the mainstream explanation doesn&#39;t add up, how design assumptions can impact people&#39;s ability to respond to rapidly developing situations, and how to improve your operational effectiveness when dealing with rapidly developing failure scenarios.</p> </blockquote> <iframe src="" width="427" height="356" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen webkitallowfullscreen mozallowfullscreen> </iframe> <p>The subject matter is heavy, and while it&#39;s something I&#39;m passionate about, it was an emotionally taxing talk to prepare, and a talk that angers me when presenting.</p> <p>Time to let it sit and rest.</p> Data failures, compartmentalisation challenges, monitoring pipelines 2013-03-25T00:00:00+00:00 <p>To recap, <a href="">pipelines are a useful way of modelling monitoring systems</a>.</p> <p>Each compartment of the pipeline manipulates monitoring data before making it available to the next.</p> <p>At a high level, this is how data flows between the compartments:</p> <p><img src="" alt="basic pipeline"></p> <p>This design gives us a nice separation of concerns that enables scalability, fault tolerance, and clear interfaces.</p> <h3>The problem</h3> <p>What happens when there is no data available for the checks to query?</p> <!-- excerpt --> <p>In this very concrete case, we can divide the problem into two distinct classes of failure:</p> <ul> <li><strong>Latency when accessing the metric storage layer</strong>, manifested as <a href="">checks timing out</a>.</li> <li><strong>Latency or failure when pushing metrics into the storage layer</strong>, manifested as checks being unable to retrieve fresh data.</li> </ul> <p>There are two outcomes 
from this:</p> <ul> <li>We need to provide clearer feedback to the people responding to alerts, to give them more insight into what&#39;s happening within the pipeline</li> <li>We need to make the technical system more robust when dealing with either of the above cases</li> </ul> <h3>Alerting severity levels aren&#39;t granular or accurate in a modern monitoring context</h3> <p>There are entire classes of monitoring problems (like the one we&#39;re dealing with here) that map poorly into the existing levels. This is an artefact of an industry wide cargo culting of the alerting levels from Nagios, and these levels may not make sense in a modern monitoring pipeline with distinctly compartmentalised stages.</p> <p>For example, the <a href="">Nagios plugin development guidelines</a> state that <code>UNKNOWN</code> from a check can mean:</p> <ul> <li>Invalid command line arguments were supplied to the plugin</li> <li>Low-level failures internal to the plugin (such as unable to fork, or open a tcp socket) that prevent it from performing the specified operation.</li> </ul> <p>&quot;Low-level failures&quot; is extremely broad, and it&#39;s important operationally to provide precise feedback to the people maintaining the monitoring system.</p> <p>Adding an additional level (or levels) with contextual debugging information would help close this feedback loop.</p> <p>In defence of the current practice, there are operational benefits to mapping problems into just 4 levels. For example, there are only ever 4 levels that an engineer needs to be aware of, as opposed to a system where there are 5 or 10 different levels that capture the nuance of a state, but engineers don&#39;t understand what that nuance actually is.</p> <h3>Compartmentalisation as the saviour and bane</h3> <p>The core idea driving the pipeline approach is compartmentalisation. 
We want to split out the different functions of monitoring into separate reliable compartments that have clearly defined interfaces.</p> <p>The motivation for this approach comes from the performance limitations of traditional monitoring systems where all the functions essentially live on a single box that can only be scaled vertically. Eventually you will reach the vertical limit of hardware capacity.</p> <p>This is bad.</p> <p><img src="" alt="a monolithic monitoring system"></p> <p>Thus the <a href="">pipeline approach</a>:</p> <blockquote> <p>Each stage of the pipeline is handled by a different compartment of monitoring infrastructure that analyses and manipulates the data before deciding whether to pass it onto the next compartment.</p> </blockquote> <p>This sounds great, except that now we have to deal with the relationships between each compartment both in the normal mode of operation (fetching metrics, querying metrics, sending notifications, etc) and during failure scenarios (one or more compartments being down, incorrect or delayed information passed between compartments, etc).</p> <p>The pipeline attempts to take this into account:</p> <blockquote> <p>Ideally, failures and scalability bottlenecks are compartmentalised.</p> <p>Where there are cascading failures that can&#39;t be contained, safeguards can be implemented in the surrounding compartments to dampen the effects.</p> <p>For example, if the data storage infrastructure stops returning data, this causes the check infrastructure to return false negatives. Or false positives. Or false UNKNOWNs. 
Bad times.</p> <p>We can contain the effects in the event processing infrastructure by detecting a mass failure and only sending out a small number of targeted notifications, rather than sending out alerts for each individual failing check.</p> </blockquote> <p>While the design is in theory meant to allow this containment, the practicalities of doing this are not straightforward.</p> <p>Some simple questions that need to be asked of each compartment:</p> <ul> <li>How does the compartment deal with a response it hasn&#39;t seen before?</li> <li>What is the <a href="">adaptive capacity</a> of each compartment? How robust is each compartment?</li> <li>Does a failure in one compartment cascade into another? How far?</li> </ul> <p>The initial answers won&#39;t be pretty, and the solutions won&#39;t be simple (ideal as that would be) or easily discovered.</p> <p>Additionally, the robustness of each compartment in the pipeline <em>will be different</em>, so making each compartment fault tolerant is a hard slog with unique challenges in each compartment.</p> <h3>How are people solving this problem?</h3> <p>Netflix recently <a href="">open sourced a project called Hystrix</a>:</p> <blockquote> <p>Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.</p> </blockquote> <p>Specifically, Netflix talk about how they make this happen:</p> <blockquote> <h4>How does Hystrix accomplish this?</h4> <ul> <li>Wrap all calls to external systems (dependencies) in a HystrixCommand object (command pattern) which typically executes within a separate thread.</li> <li>Time-out calls that take longer than defined thresholds. 
A default exists but for most dependencies is custom-set via properties to be just slightly higher than the measured 99.5th percentile performance for each dependency.</li> <li>Maintain a small thread-pool (or semaphore) for each dependency and if it becomes full commands will be immediately rejected instead of queued up.</li> <li>Measure success, failures (exceptions thrown by client), timeouts, and thread rejections.</li> <li>Trip a circuit-breaker automatically or manually to stop all requests to that service for a period of time if error percentage passes a threshold.</li> <li>Perform fallback logic when a request fails, is rejected, timed-out or short-circuited.</li> <li>Monitor metrics and configuration change in near real-time.</li> </ul> </blockquote> <h3>Potential Solutions</h3> <p>We can apply many of the strategies from Hystrix to the monitoring pipeline:</p> <ul> <li>Wrap all monitoring checks with a timeout that returns an <code>UNKNOWN</code> (assuming you stick with the existing severity levels)</li> <li>Add some sort of signalling mechanism to the checks so they fail faster, e.g. <ul> <li>Stick a load balancer like HAProxy or Nginx in front of the data storage compartment</li> <li>Cache the state of the data storage compartment that all monitoring checks check before querying the compartment</li> </ul></li> <li>Detect mass failures, and notify on-call and the monitoring system owners directly to shorten the <a href="">MTTR</a>. This is something <a href="">Flapjack</a> aims to do <a href="">as part of the reboot</a>.</li> </ul> <p>I don&#39;t profess to have all (or even any) of the answers. This is new ground, and I&#39;m very curious to hear how other people are solving this problem.</p>
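<p>As a rough illustration of the first and last strategies under Potential Solutions, here is a minimal Ruby sketch. The <code>run_check</code> and <code>mass_failure?</code> helpers, the 70% default threshold, and the <code>[status, message]</code> return shape are assumptions made up for this example; they are not part of Nagios, Flapjack, or Hystrix. The only borrowed detail is the conventional Nagios plugin exit codes (0 to 3).</p>

```ruby
require 'timeout'

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical helper: run a check block, but give up after
# limit_seconds and report UNKNOWN instead of hanging on a
# slow data storage compartment.
def run_check(limit_seconds)
  Timeout.timeout(limit_seconds) { yield }
rescue Timeout::Error
  [UNKNOWN, "check timed out after #{limit_seconds}s"]
end

# Hypothetical helper: treat the situation as a mass failure when
# more than `threshold` (a fraction between 0 and 1) of the check
# statuses are anything other than OK.
def mass_failure?(statuses, threshold: 0.7)
  return false if statuses.empty?

  failing = statuses.count { |status| status != OK }
  failing.to_f / statuses.size > threshold
end

# Usage: a check that takes too long comes back as UNKNOWN.
status, message = run_check(0.05) do
  sleep 1 # stand-in for a slow query against metric storage
  [OK, "fresh data"]
end
puts "#{status}: #{message}"
```

<p>In this sketch the event processing compartment would call something like <code>mass_failure?</code> over the most recent statuses and, if it returns true, send one targeted notification to on-call and the monitoring system owners rather than one alert per failing check.</p>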