I want you to think back to when you were a
child learning to ride a bike. When you started out I can imagine that you fell
a few times. I am sure that most of you learned that scraping your knee was a
painful experience best avoided and that you learned not to do it again. If you
were like me you may have even given up, angrily throwing your hands up in
frustration and shouting that it was the bike’s fault and that it was stupid.
Unfortunately, to keep up with your friends (competitors) you would have to learn on your own,
in some cases the hard way, with its own list of trials and tribulations.
Incident management is very similar in many
ways in that some organizations will experience some issues, learn from them
and repeat this cycle until they get better over time. Other organizations may
even give up trying and simply live in the firefighting mode that incidents bring
to the forefront, until the point where teams decide there has to be a better way.
The bad news is that whether we like it or not
there is a need to have incident
management. For me, if we must have it, I want to make sure that we
squeeze some improvements out of each critical escalation and streamline the process
wherever we can. In my opinion the only part of an incident that has any real value
is the ability to learn from it and make improvements so that it
doesn’t happen again.
There are many different ways that incidents
can be reviewed after service has been restored, with just as many terms to
describe them. Whatever you want to call them you should look at a few of these
areas when you are reviewing.
Initial Escalation
How was the incident escalated to your team?
There may be several modes of escalation.
The key question to ask: “Was the right amount of information delivered
in order to escalate to support resources?” If the answer to this is no, we
need to identify why it happened and ensure we find a way for it not to happen
again. For example, if the issue was reported by phone, did the analyst get
all the information that was required? No? Perhaps a list of minimum details is
required, maybe in the form of a checklist. Fine-tuning this initial part of the
escalation ensures that you don’t need to keep going back to the person who is
having issues and asking questions you should have asked in the first place. Going
back repeatedly gives the perception that we don’t know what we are doing. The time
wasted on follow-up questions would be better spent resolving the
incident itself.
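To make the checklist idea a little more concrete, here is a minimal sketch in Python of how an analyst’s intake form could flag missing details before the incident is escalated. The field names and the `missing_details` helper are purely illustrative assumptions on my part; your own minimum details will depend on your services and your tooling.

```python
# Hypothetical minimum-details checklist for an incoming escalation.
# These field names are illustrative assumptions, not taken from any standard.
REQUIRED_FIELDS = [
    "reporter_contact",
    "affected_service",
    "symptoms",
    "time_first_noticed",
    "users_impacted",
]


def missing_details(ticket: dict) -> list:
    """Return the checklist fields that are absent or blank in the ticket."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = ticket.get(field)
        # A field counts as missing if it was never captured or is only whitespace.
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


# Example: a phone call where the analyst captured only part of the picture.
ticket = {
    "reporter_contact": "j.smith, ext. 1234",
    "affected_service": "email",
    "symptoms": "",
}
print(missing_details(ticket))
# ['symptoms', 'time_first_noticed', 'users_impacted']
```

If the list comes back empty the incident is ready to hand off; otherwise the analyst knows exactly what to ask while the caller is still on the line.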
Internal Escalations
Did all the right people get the right
information to be able to restore service as quickly as possible? I have been
in situations where an issue occurred and, after some looking around, it
turned out that if we had only looped in “support group x” they would have helped
cut the resolution time in half. Communication during these issues is crucial,
so reviewing the mode of communication in the post-incident review is equally important.
Communication
Did I mention that this is important? The trick
here is to ensure not only that your support teams know there is an issue
and that someone is working on it, but also that your clients get a sense that their
support resources are working to restore service, with some timelines and
content to inform them that the issue is being fixed. Quite often what happens
is that the senior leadership from your business may have a tendency to
question IT leadership directly on the status. Making sure that your boss looks
informed and can convey that the situation is under control will go a long way
in improving the business-IT relationship when these issues happen. If your
leadership has no information, they look as though they are not in control of
the situation, regardless of what might actually be happening.
What broke?
Obviously one of the most important parts of
the review is to talk about what broke, and what caused it to break. In some
cases the root cause may not be clear, so you will need to decide whether a problem
needs to be opened to investigate further. In other situations we might
know what the cause was, and the learning outcome is that we need to address
something specific to prevent another issue from happening in a similar way. We also
have to find out whether we, the IT team, caused this through a change that
we put in. If that is the case, we should identify what happened there and
make sure that the change management team (who should have a presence at these
reviews) documents not only the emergency fix that was employed to resolve
the situation but also the original change, so that its record
reflects that it caused an issue.
Going forward
Lastly you need to go over all the actions that
need to happen to prevent these issues where you can, or at the very least
streamline them so that the service can be restored with minimal impact to your
business.
Incidents are like riding a bike: even after you become
a skilled rider and get going faster, you can still have an accident
and fall off. Make sure your teams are in a position not only to pick themselves
up, but also to improve and continue riding for many more years.
Follow me on Twitter @ryanrogilvie or connect with me on LinkedIn