NetOps is a natural progression of the Network Operations paradigm, one that fosters efficiency and more resilient infrastructure through automation and intelligence. Automation has a huge impact on operational awareness and dramatically reduces Mean Time To Restore (MTTR) for services. Being able to use network information across functional organizations enhances the overall operation and eases the engineering bottleneck by capturing and using tribal knowledge, allowing both the Network Operations and Security Operations groups to gain visibility and actionable insight into their domains. NetOps platforms provide mechanisms for:
- Service Assurance
- Service Automation
- Event Enrichment
- Extensibility and Scale
- Agnostic Functions
Service Assurance is the result of the complete NetOps stack. Bringing your entire infrastructure’s telemetry under management in one place provides the ability to quickly identify actionable events. Until recently, it was not possible to keep up with the massive amount of data generated by so many disparate sources. This led to network management architectures built from multiple silos of information, which made it almost impossible to correlate and enrich data: each silo could see only part of the picture and sometimes had no visibility at all into service-affecting issues.
Many organizations still log in to a suspect resource and look at the log files only as the last step of the triage process. This is backwards, as system logs are almost always the light that shines on the truth of what went wrong. Consider this: Cisco’s Internetwork Operating System (IOS) has roughly 90 possible SNMP traps defined, but more than 40,000 possible log messages. Guess where the data required to solve most service-impacting incidents lies?
There is more to Network Operations than just collecting the data; you have to be able to automatically filter out the non-actionable events and do something with the actionable and unknown ones. This methodology will remove 90% of the junk messages and let you focus on what is important first. Once you have successfully and automatically identified something that needs action, the next steps are to automatically remediate and/or provide event enrichment.
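The filtering step described above can be sketched in a few lines of Python. This is only an illustration, not LogZilla's implementation: the pattern lists and the `classify` and `triage` names are assumptions made for the example, and real rules would be built from the messages seen in your own environment.

```python
import re

# Illustrative pattern lists (assumptions for this sketch, not a vendor rule set).
NON_ACTIONABLE = [re.compile(p) for p in (r"%SYS-5-CONFIG_I", r"logged (in|out)")]
ACTIONABLE = [re.compile(p) for p in (r"%LINK-3-UPDOWN", r"%OSPF-5-ADJCHG.*DOWN")]


def classify(message: str) -> str:
    """Return 'actionable', 'non-actionable', or 'unknown' for one log line."""
    if any(p.search(message) for p in ACTIONABLE):
        return "actionable"
    if any(p.search(message) for p in NON_ACTIONABLE):
        return "non-actionable"
    # Anything that matches no rule is kept, because unknown events may matter.
    return "unknown"


def triage(messages):
    """Drop non-actionable events; keep actionable and unknown ones with a label."""
    labelled = ((classify(m), m) for m in messages)
    return [(kind, m) for kind, m in labelled if kind != "non-actionable"]
```

The point of the design is that unknown events are never silently discarded: only messages that match an explicit non-actionable rule are filtered out.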
Service Automation is the unicorn riding on top of the rainbow that everyone talks about but very few people implement. I speak to clients every day who continue to manually remediate issues in their environment because they either don’t have the mechanisms to automate it, or they don’t realize that it can be automated. The scenarios are endless but the workflow is usually similar: you receive an actionable message, and it automatically triggers an action that logs in to a device and executes a command. The output provides information that you use to either execute an action or continue gathering data.
When the automation has completed, you will be notified that a corrective action has occurred, either by email, system notification or another messaging platform. One of our customers has an extremely large and dynamic network environment and when there is a problem, it causes major issues. A senior engineer can expect to spend anywhere from thirty minutes to several hours gathering the data required to resolve the problem and executing a solution. Less experienced engineers can take up to eight hours to fully resolve it. Any problem whose solution you can express as a workflow should be automated. This allows your best engineers to construct a trigger that will automatically execute and resolve problems in real time before anyone knows there was an issue. It also removes repetitive manual tasks, which eliminates human error. Not only are you assuring availability, but you are freeing up resources and allowing your best people to concentrate on their jobs instead of fighting fires all day. Once you have successfully implemented several resolutions, re-using the automations will allow for quick updates to the run-book.
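As a rough sketch of the workflow just described (trigger, gather data, act, notify), the function below takes the device access and the notification channel as parameters. Everything here is hypothetical: `remediate_interface`, the `run_command` and `notify` callables, the event fields, and the command strings are placeholders for whatever your own tooling provides, not a real platform or device API.

```python
from typing import Callable


def remediate_interface(
    event: dict,
    run_command: Callable[[str, str], str],
    notify: Callable[[str], None],
) -> str:
    """Gather data for an actionable event, act on it, and report the outcome."""
    device, interface = event["device"], event["interface"]

    # Step 1: log in to the device and gather data (placeholder command string).
    output = run_command(device, f"show-status {interface}")

    # Step 2: use the output to either execute an action or hand off to a human.
    if "err-disabled" in output:
        run_command(device, f"reset-port {interface}")  # placeholder, not a real CLI command
        outcome = f"{device} {interface}: err-disabled port reset automatically"
    else:
        outcome = f"{device} {interface}: no automated fix applied, escalating"

    # Step 3: tell someone what happened (email, chat, ticket, and so on).
    notify(outcome)
    return outcome
```

Passing `run_command` and `notify` in as arguments keeps the workflow testable: the same function can be exercised against a fake device before it is ever pointed at production equipment.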
Event enrichment adds a layer of intelligence to information about affected devices and is a vital component in making informed decisions about what to do during the automation process. This step of the information gathering process adds an average of one hour to triage when done by a human, as opposed to mere seconds for LogZilla’s NetOps Platform. When an event comes into the NetOps system, you can modify the payload, add tags, and query other sources of information for details such as device location, SLAs, Change Control policies or anything else that helps to group and identify it. This greatly reduces the time needed to investigate and correlate service-impacting events. I have a customer that used this tagging ability to identify all firewall-related events. They had many firewalls from multiple vendors, all with different messages. By tagging all events from each firewall as a security type, it was easy to build dashboards and automations for just the security messages regardless of where they came from. Additionally, when a message has been received, the ability to interface with other systems to gather more information allows them to open a trouble ticket that includes every piece of information known about the affected device. This automatic event enrichment is the single biggest contributor to decreased MTTR.
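To make the tagging example concrete, here is a minimal sketch of enrichment against an inventory lookup. The `INVENTORY` table, its hostnames and fields, and the `enrich` function are invented for illustration; in practice the lookup would go to a CMDB, asset database, or similar source, and this is not LogZilla's actual API.

```python
# Hypothetical inventory; a real deployment would query a CMDB or asset database.
INVENTORY = {
    "fw-edge-01": {"vendor": "VendorA", "role": "firewall", "location": "NYC", "sla": "gold"},
    "fw-dmz-02": {"vendor": "VendorB", "role": "firewall", "location": "LON", "sla": "silver"},
    "sw-core-01": {"vendor": "VendorC", "role": "switch", "location": "NYC", "sla": "gold"},
}


def enrich(event: dict) -> dict:
    """Return a copy of the event with inventory details and tags attached."""
    enriched = dict(event)
    details = INVENTORY.get(event.get("host"), {})

    # Attach lookup data such as location and SLA to the payload.
    enriched.update(details)

    # Tag every firewall event as security, regardless of vendor or message format.
    tags = set(event.get("tags", []))
    if details.get("role") == "firewall":
        tags.add("security")
    enriched["tags"] = sorted(tags)
    return enriched
```

With the tag in place, a dashboard or automation only needs to select on `"security" in event["tags"]` instead of knowing every vendor's message format.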
Extensibility and Scale allow the NetOps platform to provide value immediately as new telemetry types become available, and to do so across platforms. Being able to scale the platform provides the ability to deal with bursts of event streams when anomalous behavior happens. In a previous article, I wrote about a customer who was having a service-impacting failure in their environment, and the velocity of incoming data went from two thousand events per second to well over twenty-five thousand events per second. It is imperative that your NetOps platform can accommodate this level of increase without dropping a single message. LogZilla is the most scalable platform on the planet, capable of managing billions of events per day on just a single server, scaling up to 100k events per second, where other vendors fall short at around 7k events per second. Generally, one LogZilla server does the work of roughly ten servers from other vendors (a 1:10 server ratio).
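The figures above are easy to sanity-check with a little arithmetic. The snippet below is only a back-of-the-envelope calculation using the rates quoted in this article and assuming they are sustained for a full day; the function names are made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def events_per_day(events_per_second: int) -> int:
    """Events handled in one day at a constant sustained rate."""
    return events_per_second * SECONDS_PER_DAY


def burst_factor(baseline_eps: int, peak_eps: int) -> float:
    """How many times larger the peak rate is than the baseline rate."""
    return peak_eps / baseline_eps


print(events_per_day(100_000))       # 8640000000, i.e. 8.64 billion per day
print(events_per_day(7_000))         # 604800000, i.e. about 0.6 billion per day
print(burst_factor(2_000, 25_000))   # 12.5
```

In other words, 100k events per second sustained is where "billions of events per day" comes from, and the burst in the customer example was at least a 12.5x jump over the normal rate.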
Agnostic Functions allow different areas of the organization to use the platform without concern for operational effectiveness. Network Operations, Security Operations, Server Operations, Data Analytics…anything capable of sending a message can be used as a data source and can reap the benefits of automatically identifying actionable and unknown events, real-time automatic remediation, and assured availability. Role-based access control prevents users and groups from seeing things they should not have access to. There is another side to this as well, though. Take the case of your standard NOC operations team: many first-level support engineers lack the permissions to log in to individual devices and perform actions, whereas LogZilla’s NetOps platform will automatically provide them with the visibility they need for troubleshooting and triage in a matter of seconds. Giving operations this insight, coupled with automatic remediation and event enrichment, frees up your senior engineering staff to do their jobs instead of fielding questions all day.
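One simple way to picture role-based visibility is as a filter between roles and event tags. The role names, the `ROLE_TAGS` mapping, and the `visible_events` function below are assumptions for this sketch and do not describe how any particular product implements access control.

```python
# Hypothetical mapping of roles to the event tags they are allowed to see.
ROLE_TAGS = {
    "netops": {"network"},
    "secops": {"security"},
    "noc": {"network", "security"},
}


def visible_events(role: str, events: list) -> list:
    """Return only the events carrying at least one tag the role may see."""
    allowed = ROLE_TAGS.get(role, set())  # unknown roles see nothing
    return [e for e in events if allowed & set(e.get("tags", []))]
```

A first-level NOC engineer in this sketch sees both network and security events in the platform without ever needing a login on the devices themselves.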
LogZilla is the only NetOps platform that can operate across the stack, out of the box. Until now, there were no NetOps platforms capable of accommodating the volume of telemetry today’s large networks produce while returning actionable intel instantly. Using LogZilla as your front-line management platform ensures operational effectiveness, increased availability, and automatic remediation of common, repetitive tasks, streamlining NetOps in your organization so that it runs like a well-oiled machine. You will be amazed at what you have been missing.
LogZilla is built by NetOps for NetOps.