Monitoring tens of thousands of pieces of infrastructure and applications in a meaningful manner is easier said than done for any large organization. At Go Daddy, we take our system monitoring very seriously as we strive to ensure the highest levels of availability for our customers.
One of the greatest challenges faced by any Network Operations Center (NOC) or similar entity is making sense of the data coming into the system in an actionable manner. Traditionally, most monitoring operations have relied on tried-and-true solutions such as Nagios, OpenNMS, Big Brother, and a myriad of commercial tools. The greatest challenge with these types of monitoring platforms is that they present a very flat view of the universe: you get screens and screens full of individual alarms without any context for the actual impact on the service or the customer.
That flat view is potentially an acceptable solution when dealing with smaller numbers of devices under management (hundreds), because most devices’ functions are well understood and the number of alarms is manageable. Now let’s scale that to something a bit larger… At Go Daddy, we look at over 4.5 million individual points of data per hour to ensure that our systems are performing. Even if just 0.01% of those data points come back in an error condition, that is 450 alarms per hour. You can imagine that investigating each and every one of those alarms and trying to understand the customer impact can be a daunting task.
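To make that scale concrete, here is the back-of-the-envelope math as a quick sketch. The 4.5 million figure is our monitoring volume from above; the 0.01% error rate is purely an illustrative assumption:

```python
# Illustrative arithmetic only; the error rate is an assumed example value.
data_points_per_hour = 4_500_000   # individual points of data checked per hour
error_rate = 0.0001                # 0.01% of checks returning an error condition

alarms_per_hour = data_points_per_hour * error_rate
print(f"{alarms_per_hour:.0f} alarms per hour")  # -> 450 alarms per hour
```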
Now let’s break down the above example and give some context to the alarms.
- A CRITICAL alarm on a device with a failed hard drive sounds pretty serious, right? Well, imagine the hard drive runs in an industry-standard RAID configuration and that the individual device is just one of many redundant systems. The event is critical to the individual server, but it’s actually not such a big deal when looked at as part of the whole.
- A MINOR alarm sounds pretty innocuous, right? How bad could something MINOR really be? Well, a minor amount of packet loss on a mission-critical network device could be devastating to service quality. Imagine losing 1% of your customer traffic because there is a minor alarm on your main Internet-facing router. Now that’s a BIG deal for any company.
So, how do we provide context to an alarm in an easy-to-understand manner? How do we know that a minor alarm is devastating to the whole, while a critical device alarm is just a sideshow? The answer: Service-Oriented Event Management and Monitoring.
In service-oriented monitoring, the primary concern is service delivery: we care about the overall health of the whole rather than the state of any one individual contributing component. This view is obtained by layering system components within a hierarchical model of system dependencies (you will need a quality CMDB for this, but that’s for another post). Once you have a clear picture of how a service functions, you can then apply rules for how to treat a single device-level alarm within the context of the whole.
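As a minimal sketch of what that layering might look like, here is one way to model it in Python. Everything here — the Severity scale, the Component and Service classes, and the two impact rules — is an illustrative assumption for this post, not our actual implementation:

```python
# A minimal, hypothetical sketch of service-oriented alarm evaluation.
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    """Ordered alarm severities; a higher value is more severe."""
    OK = 0
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass
class Component:
    name: str
    alarm: Severity = Severity.OK
    redundant_peers: int = 0               # healthy peers that can absorb this load
    single_point_of_failure: bool = False  # e.g., a main Internet-facing router


@dataclass
class Service:
    name: str
    components: list[Component] = field(default_factory=list)

    def impact_of(self, component: Component) -> Severity:
        """Translate a device-level alarm into service-level impact."""
        if component.alarm is Severity.OK:
            return Severity.OK
        # Rule 1: a failure inside a redundant pool is absorbed by healthy
        # peers, so the service-level impact is demoted to MINOR no matter
        # how severe the raw device alarm is.
        if component.redundant_peers > 0:
            return Severity.MINOR
        # Rule 2: any alarm on a single point of failure threatens the whole
        # service, so the impact is promoted to CRITICAL.
        if component.single_point_of_failure:
            return Severity.CRITICAL
        # Otherwise the device severity passes through unchanged.
        return component.alarm

    def health(self) -> Severity:
        """Overall service health is the worst service-level impact."""
        return max((self.impact_of(c) for c in self.components),
                   default=Severity.OK)
```

In practice the rules would come from the dependency model in your CMDB; the two shown here just capture the redundancy and single-point-of-failure cases from the examples above.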
Consider an example with three device alarms, and notice the following.
- The CRITICAL device alarm is the least important to fix.
- The MAJOR device alarm has only a MINOR impact on the Service.
- The MINOR device alarm has a CRITICAL impact and is the MOST important to fix.
It’s all about placing context around individual data points in relation to the big picture.
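Continuing the hypothetical sketch from above, feeding it three device alarms like those reproduces the same outcomes (the component names and counts are made up for illustration):

```python
web_hosting = Service("web-hosting", components=[
    Component("server-42", alarm=Severity.CRITICAL, redundant_peers=11),
    Component("db-replica-2", alarm=Severity.MAJOR, redundant_peers=2),
    Component("edge-router-1", alarm=Severity.MINOR, single_point_of_failure=True),
])

for c in web_hosting.components:
    print(f"{c.name}: device={c.alarm.name}, "
          f"service impact={web_hosting.impact_of(c).name}")
# server-42: device=CRITICAL, service impact=MINOR
# db-replica-2: device=MAJOR, service impact=MINOR
# edge-router-1: device=MINOR, service impact=CRITICAL

print(f"{web_hosting.name} health: {web_hosting.health().name}")  # -> CRITICAL
```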
With some hard work and a little luck, you can go from screens full of flat, context-free alarms to a service-oriented view, and finally, really understand what is important to fix in relation to what the customer is experiencing.