Thursday, 9 May 2013

Service Management metrics too good to be true

As a bit of a numbers junkie, I like to look at the stats and see what they tell me. As the old saying goes, if it seems too good to be true, then it probably is. So with that in mind, when our measurements are regularly accurate and we see something that can’t possibly be so, what do we do? The answer is that we may need a way to verify it against another set of information.
Here is an example:
We have received our monthly Change Management stats for the previous month, and they indicate that of the 250 changes we made we attained a 100% success rate. Fantastic, right? Immediately the skeptic in me wants to ask questions. Let’s assume for this example’s sake that in a normal month we average 2 or 3 failed changes, which result in a total of 10 reported incidents. This is where we tie in the other service management processes. From an Incident Management standpoint, are we relating every incident to its correct source? That is not the only challenge. The impact of a change may not reach customers until well after the implementation has been completed, and when incident staff attempt to tie an incident back, the description of the change may not be obvious enough to make an accurate correlation. Incidents which have no “apparent” root cause should at the very least end up in the lap of whoever is working your Problem Management angle.
Meanwhile, the implementer has completed the work and testing and has determined that the implementation was a success. They have likely closed off the change as successful, since nothing has shown otherwise. Nothing wrong with that, but once new information surfaces, the closure code should be updated to reflect the change in status.
That covers what to do when the numbers look too good to be true; in reality, however, we should have a checkpoint to ensure that this doesn’t happen in the first place. The first time that it does, your stakeholders, whether they are customers or IT, will start to question the validity of all statistics from that point forward. Uptime is a major example of this. Much like in the example above, if we report an astronomical number, our business may call us out and say there were several outages that may not have been tracked appropriately.
To remedy this, hold a regular (weekly) meeting between your Service Management functions and IT stakeholders to get alignment on the activities happening in your environments for a particular time frame.
This activity will allow your Incident team to see what changes are scheduled for the next week and ask appropriate questions about what might cause customers to call in with issues. It will also allow your Problem manager to determine whether there are new problems to focus on, or whether the ones currently under investigation should be reprioritized. And, as in this example, your Change manager will be able to identify whether any changes from the previous week inadvertently caused issues which could be avoided next time.
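
For the numbers junkies, here is a rough sketch of the kind of cross-check the Change manager could bring to that meeting: compare the reported success rate against the incidents that have since been linked to a change, re-code the affected changes, and list the incidents nobody has tied to anything. Everything in it is an assumption made for illustration. The `Change` and `Incident` records, the field names, the closure codes "successful" and "failed", and the sample data are all made up and do not come from any particular ITSM tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Change:
    change_id: str
    closure_code: str  # hypothetical values: "successful" or "failed"


@dataclass
class Incident:
    incident_id: str
    caused_by_change: Optional[str] = None  # change_id, once someone has linked it


def success_rate(changes):
    """Percentage of changes closed as successful."""
    if not changes:
        return 0.0
    ok = sum(1 for c in changes if c.closure_code == "successful")
    return 100.0 * ok / len(changes)


def reconcile(changes, incidents):
    """Return (changes to re-code, incidents with no linked change)."""
    blamed = {i.caused_by_change for i in incidents
              if i.caused_by_change is not None}
    to_recode = [c for c in changes
                 if c.closure_code == "successful" and c.change_id in blamed]
    unlinked = [i for i in incidents if i.caused_by_change is None]
    return to_recode, unlinked


# Made-up data echoing the example: 250 changes, all closed as successful.
changes = [Change(f"CHG{n:04d}", "successful") for n in range(1, 251)]
incidents = [
    Incident("INC001", "CHG0007"),
    Incident("INC002", "CHG0007"),
    Incident("INC003", "CHG0112"),
    Incident("INC004"),  # no apparent root cause: a Problem Management candidate
]

reported = success_rate(changes)
to_recode, unlinked = reconcile(changes, incidents)
for change in to_recode:
    change.closure_code = "failed"  # update the closure code with the new information
adjusted = success_rate(changes)

print(f"Reported: {reported:.1f}%  Adjusted: {adjusted:.1f}%")
print("Re-coded:", [c.change_id for c in to_recode])
print("Unlinked incidents:", [i.incident_id for i in unlinked])
```

With this sample data the "perfect" month turns out to have two changes that need their closure codes updated, plus one incident to hand over to Problem Management. The hard part in real life is the linking itself, which is exactly what the weekly conversation is for.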

Performing these checkpoints will allow your teams to be more confident that when the numbers look this good, they actually are.

Follow me on Twitter @ryanrogilvie or connect with me on LinkedIn


1 comment:

  1. I think this is the "quota" mindset Dr. Deming tried to warn us about. If you are 100% successful at change management, why do you need to improve anything? Isn't ITSM all about improving IT services anyway? Once the quota mindset takes hold (i.e. zero defects, etc.)...let the gam(ing) begin!

    We're learning that our whole ITSM focus is about continuously improving. If we do this, the results will take care of themselves.