I want you to think back to when you were a child learning to ride a bike. When you started out, I imagine you fell a few times. I am sure most of you learned that scraping your knee was a painful experience best avoided, and you tried not to do it again. If you were like me, you may even have given up, throwing your hands up in frustration and shouting that it was the bike's fault and that it was stupid. Unfortunately, to keep up with your friends (competitors) you had to learn on your own, in some cases the hard way, with its own list of trials and tribulations.
Incident management is similar in many ways. Some organizations experience issues, learn from them, and repeat this cycle until they get better over time. Other organizations give up trying and simply live in the firefighting mode that incidents bring to the forefront, until the teams decide there has to be a better way.
The bad news is that, whether we like it or not, we need incident management. For me, if we must have it, I want to make sure we squeeze some improvement out of each critical escalation and streamline the process wherever we can. In my opinion, the only part of an incident that has any real value is the opportunity to learn from it and make improvements so that it doesn’t happen again.
There are many different ways to review incidents after service has been restored, with just as many terms to describe them. Whatever you want to call them, you should look at a few of these areas when you are reviewing.
How was the incident escalated to your team? There may be several modes of escalation. The key question to ask: “Was the right amount of information delivered in order to escalate to support resources?” If the answer is no, we need to identify why that happened and find a way to make sure it does not happen again. For example, if the issue was reported over the phone, did the analyst get all the information that was required? No? Perhaps a list of minimum details is required, maybe in the form of a checklist. Fine-tuning this initial part of the escalation means you don’t need to keep going back to the person who is having issues to ask questions you should have asked in the first place. Doing so gives the perception that we don’t know what we are doing, and the time wasted on extra questions would be better spent resolving the incident itself.
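To make the checklist idea concrete, here is a minimal sketch in Python of how a service desk tool might flag missing details before an incident is escalated. The field names and the `missing_details` helper are my own assumptions for illustration; they are not taken from any particular ITSM product, and your own minimum list will differ.

```python
# Hypothetical minimum details an analyst should capture before escalating.
# These field names are illustrative assumptions; adjust them to your own desk.
REQUIRED_FIELDS = (
    "reporter",
    "contact",
    "affected_service",
    "symptoms",
    "started_at",
    "business_impact",
)


def missing_details(report: dict) -> list[str]:
    """Return the required fields that are absent or blank in the report."""
    return [
        name
        for name in REQUIRED_FIELDS
        if not str(report.get(name) or "").strip()
    ]


# Example: a phone report where the analyst did not ask about timing or impact.
report = {
    "reporter": "J. Smith",
    "contact": "ext. 1234",
    "affected_service": "email",
    "symptoms": "cannot send messages",
    "started_at": "",
}

gaps = missing_details(report)
if gaps:
    print("Ask before escalating:", ", ".join(gaps))
```

The point is not the tool but the habit: the analyst sees the gaps while the caller is still on the line, rather than discovering them after the escalation.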
Did all the right people get the right information to be able to restore service as quickly as possible? I have been in situations where an issue occurred and, after some looking around, it turned out that if we had only looped in “support group x” they could have helped cut the resolution time in half. Communication during these issues is crucial, so reviewing the mode of communication in the post-incident review is equally important.
Did I mention that this is important? The trick here is to ensure not only that your support teams know there is an issue and that someone is working on it, but also that your clients get a sense that their support resources are working to restore service, with timelines and enough content to show that the issue is being fixed. Quite often, senior leadership from your business will question IT leadership directly on the status. Making sure that your boss looks informed and can convey that the situation is under control will go a long way toward improving the business-IT relationship when these issues happen. If your leadership has no information, they look as though they are not in control of the situation, regardless of what is actually happening.
Obviously, one of the most important parts of the review is to talk about what broke and what caused it to break. In some cases the root cause may not be clear, so you will need to decide whether a problem record should be opened to investigate further. In other situations we might know what the cause was, and the learning outcome is that we need to address something specific to prevent another issue from happening in a similar way. Lastly, we have to find out whether we, the IT team, caused this through a change that we put in. If that is the case, we should identify what happened there and make sure that the change management team (who should have a presence at these reviews) documents not only the emergency fix that was used to resolve the situation, but also updates the record of the original change to reflect that it caused an issue.
Lastly you need to go over all the actions that need to happen to prevent these issues where you can, or at the very least streamline them so that the service can be restored with minimal impact to your business.
Incidents are like riding a bike: even once you are a skilled rider and going faster, you can still have an accident and fall off. Make sure your teams are in a position not only to pick themselves up, but also to improve and continue riding for many more years.