The first question I ask is this… “How well is this working out?” Depending on your IT organizational structure, each “silo” may answer that it is working rather well, so I might adjust the question to ask, “What value is your monitoring adding?” From a pure monitoring perspective it is likely telling you exactly what it was meant to: whether the server is up or not, whether memory usage is high, and so on. But without some mechanism to manage the baselines, and someone (or something) accountable for the output, we may never really leverage the potential of the tools we have at our disposal.
One of the first challenges is identifying all the tools we are using to do the job. Are there any overlaps in what these tools are doing? What additional costs are associated with monitoring in these capacities? What are the Operations teams doing with the alerts that they receive today? This last question takes a bit of professional honesty: if we are just opening the alert emails in bulk and acknowledging them in the monitoring tool, we need to know that. We all know it happens; sometimes alert thresholds are set up out of the box and we just never seem to have time to go back and configure them.
Who currently has visibility into what is monitored, and should anyone else have this visibility? Many tools have slick dashboards where execs can take a look at what is going on, while others drive workflows to create and escalate Incidents to the appropriate teams. You need to outline the process that manages this and what outputs you need to achieve success.
Ultimately you should be able to match any incident with an alert, if one exists in your monitoring repertoire today. However, when they don’t line up this can be an important indicator as well, and may suggest one of the following issues:
1. If you see incidents and no events, your thresholds may be set too high, so monitoring never fires for real problems
2. If you see events and no incidents, your thresholds may be set too low, generating noise that no one acts on
Either way, this allows you to manage the baselines of your services.
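As a rough illustration of that matching exercise, here is a minimal sketch in Python that reconciles an incident export against an event export. The record fields, service names, and 30-minute matching window are all assumptions for illustration; a real exercise would pull these from your ITSM and monitoring tools.

```python
from datetime import datetime, timedelta

# Hypothetical exports from an ITSM tool and a monitoring tool;
# all field names and records below are illustrative assumptions.
incidents = [
    {"service": "email", "opened": datetime(2024, 5, 1, 9, 15)},
    {"service": "payroll", "opened": datetime(2024, 5, 1, 14, 0)},
]
events = [
    {"service": "email", "raised": datetime(2024, 5, 1, 9, 10)},
    {"service": "backup", "raised": datetime(2024, 5, 1, 2, 0)},
]

# How close (in time) an event must be to count as matching an incident
WINDOW = timedelta(minutes=30)

def incident_matched(incident, events):
    """True if an event for the same service fired shortly before the incident."""
    return any(
        e["service"] == incident["service"]
        and timedelta(0) <= incident["opened"] - e["raised"] <= WINDOW
        for e in events
    )

def event_matched(event, incidents):
    """True if an incident for the same service was opened shortly after the event."""
    return any(
        i["service"] == event["service"]
        and timedelta(0) <= i["opened"] - event["raised"] <= WINDOW
        for i in incidents
    )

# Incidents with no matching event, and events with no matching incident:
# both lists are worth reviewing against your thresholds
unmatched_incidents = [i for i in incidents if not incident_matched(i, events)]
unmatched_events = [e for e in events if not event_matched(e, incidents)]

print([i["service"] for i in unmatched_incidents])  # → ['payroll']
print([e["service"] for e in unmatched_events])     # → ['backup']
```

Even a crude report like this gives you a concrete list to review with the Operations teams, rather than a gut feel about whether monitoring and incidents line up.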
Which brings us to what we are monitoring… and why
Just because we can monitor everything doesn’t necessarily mean we need to. We need to fully understand what services we are providing, and how events are triggered when people are unable to use those services.
For example, if we are monitoring the up/down status of a server, there are several considerations from a business perspective:
· Does the time of day make a difference to the use of the service? Is it 24x7 or not? Does an outage on a Saturday make it less critical?
· What is the impact of the service outage based on duration? Does the impact increase the longer the service is unavailable? Are there specific actions which need to be taken at intervals in time? If so, how often does the monitoring alert us, so that we can correct potential issues in a timely way?
· Understanding the architecture of the service is equally important. In a clustered environment, for example, we may not have an outage if one server is unavailable. Despite the alert saying the server is down, we may see degraded performance of the service instead.
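The considerations above can be sketched as a simple rule that translates a raw node-down alert into a business-facing severity. This is only a sketch under assumed names: the service catalogue entry, its fields (business hours, days, cluster size), and the severity labels are all hypothetical.

```python
from datetime import datetime

# Hypothetical service catalogue entry; every field here is an
# illustrative assumption, not a real tool's schema.
SERVICES = {
    "email": {
        "hours": (8, 18),         # business hours; outside these, lower urgency
        "days": {0, 1, 2, 3, 4},  # Monday-Friday (datetime.weekday convention)
        "cluster_size": 3,        # nodes behind the service
    },
}

def severity(service, nodes_down, when):
    """Translate a raw node-down count into a business-facing severity."""
    cfg = SERVICES[service]
    in_hours = (
        when.weekday() in cfg["days"]
        and cfg["hours"][0] <= when.hour < cfg["hours"][1]
    )
    if nodes_down >= cfg["cluster_size"]:
        return "critical"  # whole cluster down: a real outage
    if nodes_down > 0:
        # The cluster absorbs the failure: degraded, not down,
        # and less urgent outside business hours
        return "major" if in_hours else "minor"
    return "ok"

# One node down on a Saturday morning: degraded service, off-hours
print(severity("email", 1, datetime(2024, 5, 4, 10, 0)))  # → minor
```

The point is not the specific labels but that the same raw alert (“server down”) maps to different business impact depending on time of day and architecture, which is exactly the context a bare up/down monitor lacks.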
Some of these considerations play into availability as well, so we want to ensure we have enough ‘service’ for the business need. Remember, overdoing it won’t necessarily be cost effective.
Remember to work with your business to prioritize your services based on business need; don’t make assumptions about what you think they need. This is just one more place where you can discuss with the business what matters to them. Keep the dialog open.
Follow me on Twitter @ryanrogilvie