NetOps and the Holy Grail

  • 09 February 2018

IT Knowledge is IT Power

We all know that knowledge is power. Conversely, lack of knowledge leaves you powerless to manage your infrastructure.

Andrew Lerner at Gartner posted an article titled “Simplicity Should Break Ties,” noting that “automation is probably one of the best kept secrets in networking in terms of improving availability and reducing operational expense,” and I agree 100%.

Logs and other unstructured data are a vital component of technology optimization. Event metrics provide information on protocol changes, administrative activity, faults, security, and so much more. Furthermore, analyzing and correlating the data is a key part of root cause analysis. When data is collected over a period of time, it can provide important visibility into trends, recurring problems, or just the overall health of an infrastructure. Metrics of events are just as important as the contents of those events.

LogZilla goes beyond “just a log tool”. Anyone can make a dashboard, but a dashboard alone misses the point of Network Management. You need a solution that solves your pain, not a “tool” that points it out and emails you every 3 seconds. This is NetOps. This is LogZilla.

Reactive, Proactive, Preemptive

Networks generate millions (and for some, billions or even trillions) of log events per day. Quickly understanding which of them matter most is normally a difficult and time-consuming task. LogZilla saves time by quickly identifying the most actionable problems, even those rare problems that would otherwise be overlooked by legacy tools.


85% of the largest companies in the world still use event analysis in a reactive manner. This is partially due to the fear of “too much to look at” and the lack of a viable solution for removing the non-actionable events that clutter the user’s view.

Reactive analysis only serves to provide post-mortem information on why something went wrong. Ideally, this information should be recorded so that the next time the problem occurs, the lesson learned can be applied. Sadly, many companies fail to capture this knowledge and turn it into actionable information by storing it somewhere such as a “Known Error Database”, or KEDB.

Put plainly, a KEDB records lessons learned so that management software can apply automated actions the next time the same event occurs. In turn, that event becomes a “proactive” trigger for future occurrences.
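As a minimal sketch, a KEDB entry can map an event tag to its known cause, workaround, and an automated action. The structure and the `auto_fix_duplex` action name below are hypothetical illustrations, not LogZilla's actual schema:

```python
# Minimal sketch of a Known Error Database (KEDB). The structure and the
# "auto_fix_duplex" action name are hypothetical, not LogZilla's schema.
KEDB = {
    "CDP-4-DUPLEX_MISMATCH": {
        "cause": "Duplex setting differs between neighboring ports",
        "workaround": "Set both interfaces to full duplex",
        "action": "auto_fix_duplex",  # automated remediation to run
    },
}

def lookup_known_error(event_tag):
    """Return the recorded lesson for an event tag, or None if unknown."""
    return KEDB.get(event_tag)

entry = lookup_known_error("CDP-4-DUPLEX_MISMATCH")
print(entry["action"])  # -> auto_fix_duplex
```

The point is that the lookup is deterministic: once the lesson is recorded, every future occurrence of the same event can resolve straight to an action instead of a human investigation.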


In a proactive environment, the lessons learned from past mistakes (and recorded in a KEDB, or at least somewhere) are applied to events as they occur. This enables companies to avoid those past problems.

Let’s take a “Known Event” as an example. Almost every network engineer has seen this in some form or another (depending on the vendor hardware):

%CDP-4-DUPLEX_MISMATCH: duplex mismatch discovered on GigabitEthernet1/0/1 (not full duplex), with GigabitEthernet0/1 (full duplex).

The message indicates that the duplex configuration of an Ethernet port differs from the configuration of at least one neighboring port. This means that the users, servers, or whatever else is connected to that port are getting far less bandwidth than they should, along with collisions and interface errors.
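For illustration, a simple regular expression can pull the two interfaces and their duplex settings out of that message so an automated action knows what to act on (exact message wording varies somewhat by platform and software version):

```python
import re

MSG = ("%CDP-4-DUPLEX_MISMATCH: duplex mismatch discovered on "
       "GigabitEthernet1/0/1 (not full duplex), with "
       "GigabitEthernet0/1 (full duplex).")

# Capture the local interface/duplex and the neighbor interface/duplex.
PATTERN = re.compile(
    r"%CDP-4-DUPLEX_MISMATCH: duplex mismatch discovered on "
    r"(?P<local_if>\S+) \((?P<local_duplex>[^)]+)\), with "
    r"(?P<peer_if>\S+) \((?P<peer_duplex>[^)]+)\)"
)

m = PATTERN.search(MSG)
print(m.group("local_if"), "->", m.group("local_duplex"))
# GigabitEthernet1/0/1 -> not full duplex
```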

This is a good example of something that is quite simple to fix but tends to go unchecked in many environments, resulting in an increase of user complaints about slow network access. Why ignore it? Why not simply fix it? The answer lies in the sheer volume of them reported daily in large company networks. Fixing one requires a configuration change on the device, and some companies are (correctly) a bit apprehensive of change. The right answer is to follow change control procedures and actually fix it; don’t ignore it.

The next step in the proactive process is to notify: tell someone or something that it happened. But is this really being proactive? Alerting is only part of the story, and in large networks it doesn’t help much to just alert when bad things happen.


In a preemptive environment, we combine proactive knowledge with external knowledge to make informed decisions about how to automatically remediate known errors. Gathering intel on the affected entities allows for event enrichment using intelligence from multiple sources: KEDBs, SLAs, device locations, device importance, Network Configuration and Compliance/Change Management (NCCM), performance management, security, network and infrastructure diagrams, and even external sources such as local weather for that location, power outages, etc.

Let’s take that same “Known Event” used in the proactive model and extend it to an actionable, preemptive process.

%CDP-4-DUPLEX_MISMATCH: duplex mismatch discovered on GigabitEthernet1/0/1 (not full duplex), with GigabitEthernet0/1 (full duplex).

Now we know about the event, and now we know what it means (since we have it in our KEDB). But what do we do with it?

  • Where did it come from?
  • Who owns it?
  • Was this an authorized change?
  • What were the last 5 logins to that device?
  • What should that device’s configuration be?
  • Do we have a “Gold Standard” configuration database?
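Questions like the last two can be answered mechanically once a baseline exists: compare the running configuration against the “Gold Standard” entry for that interface. The data sources and field names below are hypothetical stand-ins, not an actual LogZilla or CMDB API:

```python
# Hypothetical enrichment lookups: these in-memory tables stand in for a
# real configuration database and gold-standard repository.
GOLD_STANDARD = {
    ("switch-access-01", "GigabitEthernet1/0/1"): {"duplex": "full"},
}

RUNNING_CONFIG = {
    ("switch-access-01", "GigabitEthernet1/0/1"): {"duplex": "half"},
}

def config_drift(device, interface):
    """Return {setting: (running, gold)} for every mismatched setting."""
    gold = GOLD_STANDARD.get((device, interface))
    running = RUNNING_CONFIG.get((device, interface))
    if gold is None or running is None:
        return None  # no baseline recorded: cannot decide automatically
    return {k: (running.get(k), v) for k, v in gold.items()
            if running.get(k) != v}

print(config_drift("switch-access-01", "GigabitEthernet1/0/1"))
# {'duplex': ('half', 'full')}
```

Returning `None` when no baseline exists matters: without a gold standard, automation should escalate to a human rather than guess.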

I’ll use the last two as an example, but throw in a bonus.

In LogZilla, we have a simple “trigger” that looks for this event.

Cisco Duplex Mismatch Auto Repair

Cisco Duplex Trigger

  • LogZilla detects the event within 1 second
  • LogZilla looks the device up in our configuration DB, along with that interface’s intended configuration.
  • 1 second later, LogZilla has logged into the device and it’s fixed.
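The three steps above can be sketched as an event-driven handler. The device record, `lookup_device`, and `push_config` hooks below are hypothetical stand-ins for LogZilla's actual trigger scripts and device access:

```python
# Sketch of the detect -> look up -> fix flow. All names here are
# illustrative; real remediation would push the commands over SSH.
def build_fix(interface, duplex="full"):
    """Generate the IOS config lines that correct the duplex setting."""
    return [f"interface {interface}", f"duplex {duplex}"]

def remediate(event, lookup_device, push_config):
    device = lookup_device(event["host"])       # step 2: config-DB lookup
    commands = build_fix(event["interface"])    # intended configuration
    push_config(device, commands)               # step 3: log in and fix
    return commands

# Example wiring with in-memory stand-ins:
pushed = []
cmds = remediate(
    {"host": "switch-access-01", "interface": "GigabitEthernet1/0/1"},
    lookup_device=lambda host: {"host": host, "device_type": "cisco_ios"},
    push_config=lambda dev, cfg: pushed.append((dev["host"], cfg)),
)
print(cmds)  # -> ['interface GigabitEthernet1/0/1', 'duplex full']
```

Keeping the device lookup and config push as injected callables is what makes the same handler testable offline and reusable across vendors.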

And, for a bonus, let’s tell someone what LogZilla did.

Bonus: What did you do?

Duplex to Slack
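A notification like this can be posted with a Slack incoming webhook. This is a generic sketch, not LogZilla's actual Slack integration; the webhook URL is supplied by the caller:

```python
import json
import urllib.request

# Hypothetical Slack incoming-webhook notification. Nothing here is
# LogZilla's actual integration; the caller supplies a real webhook URL.
def build_payload(device, interface):
    """Message describing what was auto-repaired."""
    return {"text": (f"LogZilla auto-repaired a duplex mismatch on "
                     f"{device} {interface}.")}

def notify_slack(webhook_url, device, interface):
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(build_payload(device, interface)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

print(build_payload("switch-access-01", "GigabitEthernet1/0/1")["text"])
```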

Our CEO did a short demo of this process a little while back which you can see here.

The LogZilla NetOps Platform delivers data enrichment, automation and simplicity, enables faster response, and preemptively identifies and resolves network problems before they become outages.

LogZilla is built By NetOps, For NetOps.