Monday, 27 January 2014

Score a Service Management Touchdown through Incidents

Incidents will happen; this is the unfortunate reality. The key is to manage them effectively and, wherever possible, to prevent them from happening in the first place. Since it is almost Super Bowl time, I thought it fitting to use the example of a football team (yes, American football is played with our hands). The player responsible for facilitating and coordinating the plays on the field is the quarterback (QB). Their function is to move the football toward the end zone (business outcome) more often than the other team (your competition) does, in order to win the game. During the course of the game they will have to adjust for setbacks such as fumbles, interceptions and so on.

Think about your current service management organization. Does it have a centralized incident “player” like the QB? Or is this role picked up by your operations teams when incidents occur? How this is done is not nearly as important as the perceptions around the incident process. Quite often the resolution of incidents is met with acclaim for “the team that has restored service” (cue booming voice-over). People congratulate the firefighter, but think of it this way: if the fire department had to come out to put out a fire three times in a month, people would start to wonder what was going on. The same is true for your services.

Whoever is quarterbacking your incident team should understand that the value is not so much in the resolution of incidents as it is in understanding the root cause. Granted, root cause is better managed through problem management, but we all want the best service possible. The challenge is that culturally we tend to reward where we “see” value. You might hear “last month we reduced incidents,” or “our incident team helped facilitate a quick resolution on this issue.” While this is important, we also need to start building a capability to address the problems which are impacting our services.

The game of football also includes a halftime. This is when there is a halftime show, people get more beer, and so on. Its real purpose is to give the teams time to regroup in the locker room and review the game play from the first half. The coach will go over what went well and, in some cases, what didn't. Much like CSI, the team is always looking to improve its performance going into the second half of the game. It is equally important to set aside time for your incidents to be reviewed by your service management team, your operational teams, or both. Performing a review of this nature on a regular basis will better position your team to determine a strategy to reach your service management end zone.

Follow me on Twitter @ryanrogilvie

Monday, 20 January 2014

Communicating 9's of Uptime - Same Flavor Different Taste

If you were to tell someone in the business that their application uptime was 99.987% last month and 99.894% the month before, they would likely give you the thousand-mile stare, because those numbers mean little to them in terms of their service availability. Availability needs to be communicated in a way that has meaning. For example: your service was not available for 34 minutes last month.

Think about this in day-to-day terms…

If I were to tell you that the lights in your house had an uptime of 99%, what would that mean to you? After some mathematics it works out to roughly 88 hours of the year in which the lights would not ‘be available'. Pondering this for a moment you might decide, “Well, I don’t need them all the time; I am asleep 8 hours a day, so that shouldn't be an issue.” If the uptime were measured around your business hours (when you actually need your lights available), the potential impact would be greater.

So what do these 9’s of service really amount to from a time perspective? I have added a diagram to illustrate what this works out to based on a 24-hour business need.
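For anyone who wants to work the numbers themselves, here is a minimal sketch of the arithmetic behind the diagram, assuming a 24 x 7 service measured over a 30-day month and a 365-day year; the exact figures in the diagram may round slightly differently.

    MINUTES_PER_MONTH = 30 * 24 * 60    # 43,200 minutes in a 30-day month
    MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 minutes in a year

    def downtime_minutes(availability_pct, window_minutes):
        # Allowed downtime is the window multiplied by the unavailable fraction
        return window_minutes * (1 - availability_pct / 100)

    for nines in (99.0, 99.9, 99.99, 99.999):
        monthly = downtime_minutes(nines, MINUTES_PER_MONTH)
        yearly = downtime_minutes(nines, MINUTES_PER_YEAR)
        print(f"{nines}% uptime -> {monthly:8.2f} min/month, {yearly / 60:7.2f} h/year")

At 99% that is more than seven hours of downtime every month; at five 9’s it is well under a minute.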


Fundamentally we need to understand what makes our services tick. What hours comprise service availability? Do our customers need 24 x 7 availability, or are there specific business hours in which we must ensure operation? Much like the household example above, we might not be concerned (from an availability perspective) if the lights aren’t on during non-business hours.



Another consideration is how your infrastructure is built to support the service in question. Let’s assume your service is comprised of two web servers, as shown in the diagram. Gathering metrics from both, you have identified that component 'x' has an availability of 96% and component 'y' has 99%. It turns out that the availability we are seeing does not match the 99% of component y. How can this be, you might ask? There are many reasons this could occur. To start with, we didn’t specifically indicate how x and y work together. Beyond the web boxes themselves, we may have network latency that is impacting performance. Depending on the issue, it could be an inability to access the database, which is an outage we aren’t even accounting for in this example.

This is why we need to know all the pieces involved and determine a way to manage them effectively. We have identified when the service needs to be available and what is operating in the back end to deliver it, but we also need to think about the processes which can assist the operations team in delivering that service. By understanding what drives the service operationally we will be able to regularly perform assessments to target areas for continual service improvement.
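To make the math concrete, here is a minimal sketch of how component availabilities combine, assuming x and y sit in series (both must be up for the service to work), with a parallel (clustered) arrangement shown for contrast. Only the 96% and 99% figures come from the example above; nothing else is implied about the real architecture.

    def series_availability(*components):
        # Every component must be up, so multiply the individual availabilities
        result = 1.0
        for a in components:
            result *= a
        return result

    def parallel_availability(*components):
        # The service is up as long as at least one component is up
        all_down = 1.0
        for a in components:
            all_down *= (1 - a)
        return 1 - all_down

    x, y = 0.96, 0.99
    print(f"series (x and y both required): {series_availability(x, y):.4f}")    # ~0.9504
    print(f"parallel (clustered x and y):   {parallel_availability(x, y):.6f}")  # ~0.9996

In the series case the combined availability (about 95%) is worse than either component on its own, which is why the service never sees component y's 99%.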

In the end it can be tricky to report against the uptime for a service, because the final piece we may not even see is the downtime that is never reported in the first place. Telling the business that they had 2 hours of business outage time last month will only be as accurate as the input data. The customers may say that there were several hours of outage time that were not accounted for. Acknowledge this information and address where the lost time gaps are coming from so that you can keep improving the customer experience. We might not get it right on the first try, which is why continual service improvement is a cyclical process.

  
Follow me on Twitter @ryanrogilvie



Monday, 13 January 2014

Event Management – Who Sees What

It is likely that you or your IT operations are monitoring your infrastructure in some capacity. Whether it is network traffic, database activity, application health or a combination of these, your goal is to ensure stability.

The first question I ask is this: “How well is this working out?” Depending on your IT organizational structure, each “silo” may answer that it is working rather well, so I might adjust the question to ask, “What value is your monitoring adding?” From a pure monitoring perspective it is likely telling you exactly what it was meant to: whether the server is up or not, whether memory usage is high, and so on. But without some mechanism to address the baselines, and someone (or something) accountable for the output, we may never really leverage the potential of the tools we have at our disposal.

One of the first challenges is identifying all the tools we are using to do the job. Are there any overlaps in what these tools are doing? What costs are additionally associated with monitoring in these capacities? What are the operations teams doing with the alerts that they receive today? This last question takes a bit of professional honesty: if we are just opening the alert emails in bulk and acknowledging them in the monitoring tool, we need to know that. We all know it happens; sometimes alert thresholds are set up out of the box and we just never seem to have time to go back and configure them.

Who currently has visibility of what is monitored, and should anyone else have this visibility? Many tools have slick dashboards where execs can take a look at what is going on, while others drive workflows to create and escalate incidents to the appropriate teams. You need to outline the process that manages this and the outputs you need to achieve success.

Ultimately you should be able to match any incident with an alert, if one exists in your monitoring repertoire today. When they don’t line up, this can be an important indicator as well, and may suggest one of the following issues:

1.     If you see incidents with no corresponding events, your monitoring may have a gap or your thresholds may not be sensitive enough to catch the condition

2.     If you see events with no corresponding incidents, your thresholds may be too sensitive and generating noise

 

Either way, this allows you to manage the baselines of your services.
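As a minimal sketch of that matching exercise, the snippet below compares incident records against monitoring events for the same configuration item within a time window. The field names, the one-hour window and the sample records are all hypothetical; real data would come from your ITSM and monitoring tools.

    from datetime import datetime, timedelta

    # Hypothetical extracts from the monitoring tool and the incident queue
    events = [
        {"ci": "web01", "time": datetime(2014, 1, 13, 2, 15)},
    ]
    incidents = [
        {"ci": "web01", "time": datetime(2014, 1, 13, 2, 40)},
        {"ci": "db01", "time": datetime(2014, 1, 13, 9, 5)},
    ]

    WINDOW = timedelta(hours=1)  # how close an event must be to "explain" an incident

    def has_matching_event(incident):
        return any(ev["ci"] == incident["ci"] and
                   abs(ev["time"] - incident["time"]) <= WINDOW
                   for ev in events)

    unexplained = [inc for inc in incidents if not has_matching_event(inc)]
    print(f"{len(unexplained)} incident(s) with no corresponding event")  # possible monitoring gap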

Which brings us to what we are monitoring… and why

Just because we can monitor everything doesn’t necessarily mean we need to. We need to fully understand what services we are providing and which events are triggered when those services cannot be used.

For example, if we are monitoring the up/down state of a server, there are several considerations to think about from the business perspective:

·         Does the time of day make a difference to the use of the service? Is it 24x7 or not? Does an outage on Saturday make this less critical?
·         What is the impact of the service outage based on duration? Does the impact increase the longer the service is unavailable? Are there specific actions which need to be taken at intervals in time? If so, how often does the monitoring alert us, so that we can correct potential issues in a timely way?
·         Understanding the architecture of the service is equally important. If there is a clustered environment, for example, we may not have an outage should one server be unavailable. Despite the alert saying the server is down, we may see degraded performance of the service instead.

Some of these considerations play into availability, so we want to ensure we have enough ‘service’ for the business need. Remember, overdoing it won’t necessarily be cost effective.
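As an illustration of the time-of-day point above, here is a minimal sketch of a business-hours check that alert handling could apply before escalating. The supported hours, days and function name are hypothetical; they would come from your agreed service hours.

    from datetime import datetime

    # Hypothetical agreed service hours: Monday to Friday, 07:00 to 19:00
    SERVICE_DAYS = {0, 1, 2, 3, 4}   # Monday = 0
    SERVICE_START_HOUR = 7
    SERVICE_END_HOUR = 19

    def is_business_impacting(alert_time):
        # Treat an alert as business-impacting only inside the supported window
        in_days = alert_time.weekday() in SERVICE_DAYS
        in_hours = SERVICE_START_HOUR <= alert_time.hour < SERVICE_END_HOUR
        return in_days and in_hours

    # A Saturday-night alert would be logged, but not escalated as critical
    print(is_business_impacting(datetime(2014, 1, 18, 23, 30)))   # False
    print(is_business_impacting(datetime(2014, 1, 20, 9, 0)))     # True (Monday morning)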

Remember to work with your business to prioritize your services based on business need; don’t make assumptions about what you think they need. This is just one more place where you can discuss with the business what matters to them. Keep the dialogue open.


Follow me on Twitter @ryanrogilvie

Thursday, 9 January 2014

Reducing Incidents through Knowledge


How can you improve the customer experience through knowledge management, you might ask? From an incident perspective we should start with the Service Desk to determine exactly what types of calls are coming into the queue. A “top ten” of sorts; these may be common questions which, if handled through some form of self-service functionality, can reduce calls into the Service Desk right at the start. It is likely that your Service Desk has some of these posted already, but a review will ensure that we aren’t missing anything and that we remove any material that is no longer relevant.

Your Service Desk says, “It looks like we are doing better,” but how can we quantify that? Take a closer look at what each service call costs your organization: your cost per contact. As more knowledge articles become available, we will want a consistent location for customers to go, which ensures that your customers can easily find information rather than contacting your Service Desk. Enabling your customers in this way will drive your costs down by eliminating the “quick questions” and will allow your support staff to spend their time working on more complex issues, which improves their engagement.
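As a minimal sketch of the cost-per-contact idea, using entirely made-up numbers, the calculation looks something like this:

    # Hypothetical figures; substitute your own fully loaded costs and call volumes
    monthly_service_desk_cost = 40000.00   # staff, tooling, facilities
    contacts_last_month = 2500
    contacts_this_month = 2100             # after publishing self-service articles

    cost_per_contact = monthly_service_desk_cost / contacts_last_month
    avoided_contacts = contacts_last_month - contacts_this_month
    print(f"cost per contact:         ${cost_per_contact:.2f}")
    print(f"estimated monthly saving: ${avoided_contacts * cost_per_contact:.2f}")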

In our journey to reduce service-affecting incidents we also need to look at incident KPIs. For example, the current MTTR of incidents can help us determine a baseline for our resolution times, but we can also address areas where exceptions are noted. Was there information available that could have helped us resolve this incident more quickly? If not, do we have something now? Being able to produce information of this nature helps all support personnel work through troubleshooting quickly, so that less time is spent reinventing the wheel.

The Challenges
Like anything worth doing, there are going to be some obstacles. While the initial pieces are beneficial to the Service Desk, we need to start thinking long term about being a better business partner, which is where having a collaborative knowledge culture is critical. This part is not as easy. Some of the challenges may include:

Knowledge hoarding
There are people who believe that if they are in control of the knowledge, they are in a better position as the “go-to” folks. In reality this is not the case; they end up spending their time fixing the same issues over and over again rather than working on complex ones. In a collaborative space we should ensure that the information is not only available but relevant as well.

Making knowledge part of the way we do things
People can be change averse, so they will see this new activity as “more work,” or they might say, “I am not a writer.” We aren’t necessarily looking for Pulitzer material, but think of it in these terms: let’s suppose that you could avoid 10 calls a day by having a knowledge record available, and that each call for that same issue takes approximately 5 minutes. That works out to 50 minutes of work which could have been taken care of by spending 10 minutes writing a knowledge article – a 40-minute saving. Making knowledge a part of the incident, problem or even change process could have a domino effect on redundant work. It could be as simple as a rule that a record doesn’t close until the related knowledge record has been created.
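Extending that arithmetic over a working month shows why this matters; a minimal sketch, assuming the same 10 calls a day at 5 minutes each and roughly 20 working days:

    calls_avoided_per_day = 10
    minutes_per_call = 5
    minutes_to_write_article = 10
    working_days_per_month = 20

    first_day_saving = calls_avoided_per_day * minutes_per_call - minutes_to_write_article
    monthly_saving = calls_avoided_per_day * minutes_per_call * working_days_per_month - minutes_to_write_article
    print(f"first-day saving: {first_day_saving} minutes")
    print(f"monthly saving:   {monthly_saving} minutes (~{monthly_saving / 60:.1f} hours)")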

Reporting
This will all be for nothing if we cannot report on what we are seeing. We should be able to track, through our knowledge repository, which issues were corrected with knowledge records (no incident created) as well as how many incidents were resolved as a result of knowledge. Being able to quantify the success of the knowledge process will position us better to make further improvements in future activities.

Feedback
Gathering feedback on the information in your knowledge base can be easy, but managing the output of that feedback may be more difficult. If we are soliciting feedback, we may need a way to respond to those who are supplying it. Some foresight may need to be applied before proceeding, so my advice is to keep it simple to start.

In the end you will be in a position to review and make adjustments where needed. You will also be in a far better place to work with the customers on these issues. Further understanding their business will enable your other ITSM processes to make improvements as well.

 

Feel free to connect with me on Twitter @ryanrogilvie and/or on LinkedIn
If you like these articles please take a few minutes to share on social media or comment

Monday, 6 January 2014

ITSM New Year’s resolution - Knowledge Management

It is that time of year once again. I started to think about which part of continual service improvement I wanted to see implemented to its fullest. While there are likely many areas to consider, I always seem to come back to knowledge as the great equalizer, since it underpins improvement initiatives for so many other activities. It is likely that you have some form of knowledge management now. The question you need to ask yourself is whether or not you are gaining the value from it that you should be.


Much like many other New Year’s resolutions, you may start out with good intentions and then in time fall short of the goals you set out. While knowledge management could apply to various IT teams, in this post I am going to talk specifically about getting things started with the Service Desk team.

If you were to ask the Service Desk, “Do we currently have a place where our customers can find answers to questions themselves?” they may say, “We have some how-to reference material on an intranet site.” While this is a good start, the follow-up question should be, “Is that information still relevant, or are there more important things to have on there?” It is at this point that you might see some shoulder shrugs. Without a way to manage this, we tend to “dump stuff” on these sites to stop the same complaints that keep coming in. While this satisfies the influx of calls in the interim, it doesn’t really address serving the customer in ways that improve service.

It is likely that your services are ever changing, so it is important to have a knowledge process which is equally scalable to keep up with customer demand. Move things around. You might not need a quick reference card for the installation of Netscape any longer.

More and more people are turning to search engines to find relevant solutions to their problems. As a result, customer expectations for quick answers are increasing and their patience for waiting is dwindling.

In the beginning….
Ask your Service Desk manager what the top 10 calls are; you may find some low-hanging fruit where knowledge management can provide some remedies. Starting simple will allow you to get some momentum without overcomplicating things, and it will also give you the ability to report on what is going on. Remember, you can’t manage what you can’t measure. Make sure you have the ability to report on the improvements you make through some type of analytics; a simple sketch of such a report appears below.
Think about a way to gather feedback on the customer experience with the knowledge articles where it makes sense. Be careful with this, though, as you want to be sure that you can manage what comes through this type of information gathering. Ultimately you want to validate that the information is correct and provides the most value, and to point to other records where they could be of value. We don’t want to create a document locker of WORN (write once, read never) information.
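As a minimal sketch of the “top 10 calls” report mentioned above, the snippet below counts ticket categories from a hypothetical Service Desk export; the field names and sample data are illustrative only.

    from collections import Counter

    # Hypothetical export from your ITSM tool
    tickets = [
        {"id": 1, "category": "password reset"},
        {"id": 2, "category": "VPN access"},
        {"id": 3, "category": "password reset"},
        # ... the rest of last month's tickets
    ]

    top_ten = Counter(t["category"] for t in tickets).most_common(10)
    for category, count in top_ten:
        print(f"{count:4d}  {category}")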

Allowing your frontline staff to address more technical issues, while your knowledge repository takes care of the ‘easy stuff’, should improve the customer experience immensely. But this is just the start…..

Follow us on Twitter @ryanrogilvie