Think about this in day-to-day terms…
If I were to tell you that the lights in your house had an uptime of 99%, what would that mean to you? After some mathematics, it implies that for roughly 88 hours of the year (1% of 8,760 hours) the lights would not ‘be available’. Pondering this for a moment you might decide, “Well, I don’t need them all the time. I am asleep 8 hours a day, so that shouldn’t be an issue.” If the downtime fell within your business hours (when you actually need your lights available), the potential impact would be greater.
So what do these 9s of service really amount to from a time perspective? I have added a diagram to illustrate what this works out to, based on a 24-hour business need.
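The arithmetic behind those figures can be sketched in a few lines of Python. This is a minimal illustration rather than a reporting tool: the function name downtime_hours and the 365-day year are my own assumptions, and the service is assumed to be needed 24 hours a day.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)


def downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime implied by an availability percentage."""
    return period_hours * (1 - availability_pct / 100)


# Walk through the usual "nines" on a 24-hour business need.
for pct in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours(pct)
    print(f"{pct}% uptime -> {hours:.2f} hours ({hours * 60:.1f} minutes) down per year")
```

At 99% this works out to about 87.6 hours a year, and each extra 9 cuts the allowance by a factor of ten.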
Fundamentally, we need to understand what makes our services tick. Which hours count towards service availability? Do our customers need 24x7 availability, or are there specific business hours in which we must ensure operation? Much like the household example above, we might not be concerned (from an availability perspective) if the lights aren’t on outside business hours.
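To see how much the choice of service hours matters, here is a small sketch that measures the same outage against two different windows. The figures are assumptions for illustration only: a 30-day month of 720 hours for the 24x7 case, and 8 business hours over 22 working days (176 hours) for the business-hours case. The function name availability_pct is hypothetical.

```python
def availability_pct(service_hours, outage_hours):
    """Availability as a percentage of the agreed service hours."""
    return 100 * (service_hours - outage_hours) / service_hours


OUTAGE_HOURS = 2  # one outage that fell entirely inside business hours

# Assumed measurement windows for a single month.
ALWAYS_ON_HOURS = 30 * 24   # 24x7 service: 720 hours
BUSINESS_HOURS = 22 * 8     # 8 hours a day, 22 working days: 176 hours

print(f"Measured 24x7:           {availability_pct(ALWAYS_ON_HOURS, OUTAGE_HOURS):.2f}%")
print(f"Measured business hours: {availability_pct(BUSINESS_HOURS, OUTAGE_HOURS):.2f}%")
```

The same two hours look like a minor blip against a 24x7 window, but take a noticeably bigger bite out of the number when only business hours count.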
Another consideration might be how your infrastructure is built to support the service in question. Let’s assume your service is made up of two web servers, as shown in the diagram. Gathering metrics from both, you have identified that component 'x' has an availability of 96% and component 'y' has an availability of 99%. It turns out that the availability we are seeing for the service does not match the 99% of component y. How can this be, you might ask?
There are many possible reasons. To start with, we didn’t specifically indicate how x and y work together. If the service depends on both, for instance, it can be no more available than the weaker of the two. Beyond the web servers themselves, we may have some network latency that is impacting performance. The issue could also be an inability to access the database, which is an outage we aren’t even accounting for in this example. This is why we need to know all the pieces involved and determine a way to manage them effectively.
We have identified when the service needs to be available and what is operating in the back end to ensure it is delivered, but we also need to think of the processes that can assist the operations team in delivering that service. By understanding what drives the service operationally, we will be able to perform regular assessments and target areas for continual service improvement.
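Returning to components x and y, a rough sketch shows why the way they work together matters so much. It assumes the two components fail independently of each other, which real infrastructure rarely guarantees, and it ignores the network and database entirely. The function names serial_availability and parallel_availability are my own.

```python
from math import prod


def serial_availability(*components):
    """The service needs every component up, so the availabilities multiply."""
    return prod(components)


def parallel_availability(*components):
    """The service needs at least one component up (redundant pair)."""
    return 1 - prod(1 - a for a in components)


x, y = 0.96, 0.99  # component availabilities from the example above

print(f"Both required (serial):  {serial_availability(x, y):.2%}")
print(f"Either one is enough:    {parallel_availability(x, y):.2%}")
```

If both servers are required, the service lands at roughly 95%, below either component on its own. If either one can carry the load, it comes out closer to 99.96%. Neither figure is 99%, which is one way the observed number can differ from what a single component reports.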
In the end, it can be tricky to report on uptime for the service, because the final piece is the one we may not even see: the downtime that was never reported in the first place. Telling the business that they had 2 hours of outage during business hours last month is only as accurate as the input data. The customers may say that there were several hours of outage time that were not accounted for. Acknowledge this information and find out where the unreported time is coming from, so that you can keep improving the customer experience. We might not get it right on the first try, which is why continual service improvement is a cyclical process.
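The gap between what was logged and what customers experienced can be made visible with a simple comparison. Everything in this sketch is made up for illustration: the outage records, the "logged" flag, and the 176 service hours (8 business hours over 22 working days). Only the 2 logged hours come from the example above.

```python
# Hypothetical outage records for one month; durations are in hours.
# "logged" marks whether the outage made it into the operations report.
outages = [
    {"duration": 2.0, "logged": True},   # the 2 hours relayed to the business
    {"duration": 1.5, "logged": False},  # raised only by customers (made up)
    {"duration": 1.5, "logged": False},  # raised only by customers (made up)
]

SERVICE_HOURS = 176  # assumed: 8 business hours x 22 working days


def availability_pct(service_hours, outage_hours):
    """Availability as a percentage of the agreed service hours."""
    return 100 * (service_hours - outage_hours) / service_hours


logged_hours = sum(o["duration"] for o in outages if o["logged"])
total_hours = sum(o["duration"] for o in outages)

print(f"Reported availability:    {availability_pct(SERVICE_HOURS, logged_hours):.2f}%")
print(f"Experienced availability: {availability_pct(SERVICE_HOURS, total_hours):.2f}%")
print(f"Unaccounted outage time:  {total_hours - logged_hours:.1f} hours")
```

Tracking that last figure from month to month gives the improvement cycle something concrete to shrink.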
Follow me on Twitter @ryanrogilvie