Common Concerns & Benefits of Network Automation
There are many tutorials, books, articles and educators out there dedicated to spreading the word of network automation, so why are more organizations not doing it? Recently, we broke down the basics of network automation in our blog Understanding Network Automation: Tools, Scripting & IaC. Here, I will address some of the main points of skepticism I have encountered over the years and drive home the benefits of implementing these technologies in your own network.
Repetitive Tasks
Whether performing regular maintenance or configuring the network to support a new technology, you still must manage network devices on a per-device basis. This is extremely time consuming, especially when it comes to doing the same thing on every single device. Simply logging into all these devices takes up time on its own, let alone the mundane task of copying and pasting text from a file into the terminal.
Not only does this eat up precious time, but it is error prone. How can you be sure you hit every single device when applying the same config to hundreds of devices? Ever run into the scenario where you’re managing a list of historical local passwords, because chances are one device was missed? Take that scenario: if you were tasked with updating the local credentials on all network devices, you could write the change as a Python script. Through scripting, not only is the same configuration applied to a large number of devices, but error handling and verification can be written into the script, producing a list of devices that need further attention. The organization can now have the peace of mind that not only are the devices more secure because the local password was rotated, but operationally it is known that all devices have the same password.
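As a rough illustration of that idea, here is a minimal Python sketch of a bulk credential rotation that builds in error handling and verification. The names rotate_credentials, apply_change and verify_change are hypothetical; in practice the two callables would wrap whatever SSH or API library you already use, and the stand-ins below exist only so the sketch runs without touching a network.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class RotationReport:
    """Outcome of a bulk change: what succeeded and what needs a human."""
    updated: list[str] = field(default_factory=list)
    needs_attention: dict[str, str] = field(default_factory=dict)


def rotate_credentials(
    hosts: Iterable[str],
    apply_change: Callable[[str], None],
    verify_change: Callable[[str], bool],
) -> RotationReport:
    """Apply the same change to every host, verify it, and record failures."""
    report = RotationReport()
    for host in hosts:
        try:
            apply_change(host)        # hypothetical: push the new local credential
            if verify_change(host):   # hypothetical: confirm the new credential works
                report.updated.append(host)
            else:
                report.needs_attention[host] = "verification failed"
        except Exception as exc:      # unreachable device, auth failure, timeout, ...
            report.needs_attention[host] = f"{type(exc).__name__}: {exc}"
    return report


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without any network access.
    def fake_apply(host: str) -> None:
        if host == "sw-03":
            raise ConnectionError("timed out")

    def fake_verify(host: str) -> bool:
        return host != "sw-02"

    result = rotate_credentials(["sw-01", "sw-02", "sw-03"], fake_apply, fake_verify)
    print("updated:", result.updated)
    print("needs attention:", result.needs_attention)
```

The point of the design is that one failing device never stops the run. Every device ends up in exactly one of the two lists, so nothing is silently missed.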
Doesn’t this mean automation will take my job? This is one of the major concerns from network professionals when it comes to implementing automation. The answer is straightforward and can be backed up: no, automation will not be taking your job. How many network teams are constantly back on their heels? How long is the list of tasks that you have been meaning to accomplish on your network, but just have not had time for? Is the network well-documented? Are standards and procedures clearly defined and published for new hires or for management’s review? Automation frees up time from mundane work for more meaningful tasks, such as focusing on design and proactively implementing solutions to increase the overall uptime of the network.
Lower Risk of Human Error
How can automation be less prone to error? What happens if the automation is wrong and automates an outage? Should I trust automation? These are all legitimate questions, and it is true – bad automation can speed up bad things. Luckily, there are things that can be done to reduce the blast radius and lower the risk of error.
First, a test environment should be established. This does not need to be a full-blown copy of the production data center; a simple network virtual machine or network simulator may be enough. Either way, code needs to be tested before running it against every network device. As with any configuration change, all code should also be peer-reviewed for consistency and potential errors. After the solution has been reviewed and tested, a rollout plan can be established. Once the rollout is complete, the code has been shown to be safe and can be reused in the future. No matter how many times a human performs a task, the risk always exists that something can be fat-fingered. Code is much less prone to errors after it has been proven to work.
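One way to picture a rollout plan that limits the blast radius is a staged deployment that stops as soon as a stage reports a failure. The Python sketch below is only an illustration under assumptions: staged_rollout and the deploy callable are hypothetical names, and deploy is assumed to return True when the change was applied and verified on a device.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    deploy: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Deploy stage by stage and stop before the next stage if any device fails.

    Returns (devices changed successfully, devices that failed).
    """
    succeeded: list[str] = []
    failed: list[str] = []
    for stage in stages:
        for device in stage:
            # deploy() is a hypothetical callable: True means applied and verified.
            (succeeded if deploy(device) else failed).append(device)
        if failed:
            break  # limit the blast radius: later stages are never touched
    return succeeded, failed


if __name__ == "__main__":
    plan = [["lab-sim-1"], ["site-a-sw1", "site-a-sw2"], ["site-b-sw1"]]
    ok, bad = staged_rollout(plan, deploy=lambda device: device != "site-a-sw2")
    print("succeeded:", ok)
    print("failed:", bad)
```

Putting the lab or simulator device in the first stage mirrors the test-first approach described above: production stages only run if the earlier ones came back clean.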
Build Your Own Tools
Let’s face it, no single commercial product has ever solved 100% of an organization’s network headaches. How can we expect it to? Having a single product solve all the problems for all organizations would be an unattainable goal, as each organization is vastly different. Rather than continuing to spend money on multiple solutions, each of which solves only a portion of the problems, why not identify the problem and develop an in-house solution to solve it? This may sound like a huge effort, but learning the fundamentals of some of these tools will show you how flexible they can be with minimal effort.
Telemetry streaming has become a hot buzzword in the world of networking, as it should. One of the drivers for APIs was the many shortfalls that SNMP has presented us with in the past. SNMP also relies on a system to poll the network device at given intervals. Telemetry, on the other hand, allows network operators to have the device push only the information requested to some kind of message bus, such as Apache Kafka, in real time. The information on the event bus can then be viewed using a graphical tool, such as Kibana.
Where to Go from Here
So how does an organization go about adopting some of these technologies for their network operations? In this section, we will provide a brief blueprint to get started. Not only will your organization benefit from it but so will the people who maintain the infrastructure. If you don’t start somewhere, you can’t build off it.
Gain Buy-In
This may be the biggest hurdle for those wishing to start automating their network. Not everyone is keen on introducing automation in their organization, whether they’re peer engineers worried about being phased out of the job or management who has seen networks manually configured by a human for the last 20+ years. Automation should not be your private project intended to be revealed from behind the curtain. Mistakes will be made, as they are when implementing any new technology. The important thing is that knee-jerk reactions don’t overpower the end goal of having a more stable, reliable and predictable infrastructure. If mistakes aren’t made, there is no way to learn from them and move forward. Has anyone been the victim of VTP bringing a network to its knees by inadvertently wiping VLANs from the VLAN database? If no one had, it would not be called out in the CCNA. Does this mean organizations stop plugging in network devices? Has a fat-fingered configuration ever brought down an entire site? Do we stop making configuration changes? All these scenarios call for the same approach to automation. Learn from your mistakes and push forward, never backwards.
Automate Repetitive Tasks
When it comes to implementing automation, we recommend starting small while you and the organization get more familiar with automation technologies. Identify the repetitive tasks that are performed manually today and get them automated. Introducing a new feature or system that requires all devices to be reconfigured? Automate the deployment of the configuration. The time saved and the reliability gained will present themselves quickly and hopefully provide some motivation for growing the culture of automation in the organization. If you have to do something once, go ahead and automate it for the future.
Gather some facts and analytics about your network and output them to reports. Using templates, you can write scripts or tools that produce consumable data about the network, whether it be a simple inventory file, a list of available interfaces or whatever you want! That is the power and flexibility of automation!
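To make the reporting idea concrete, here is a small Python sketch that renders gathered facts through a template using the standard library’s string.Template. The shape of the facts dictionaries and the rule that an interface in the "down" state counts as available are assumptions made for illustration; real facts would come from whatever collection tool you use.

```python
from string import Template

# One report line per device; the field names are illustrative assumptions.
REPORT_LINE = Template("$hostname  model=$model  free_ports=$free")


def render_inventory(facts: list[dict]) -> str:
    """Render one line per device, sorted by hostname."""
    lines = []
    for device in sorted(facts, key=lambda d: d["hostname"]):
        # Assumption: an interface reported as "down" is available for use.
        free = sum(1 for state in device["interfaces"].values() if state == "down")
        lines.append(
            REPORT_LINE.substitute(
                hostname=device["hostname"], model=device["model"], free=free
            )
        )
    return "\n".join(lines)


if __name__ == "__main__":
    sample_facts = [
        {"hostname": "sw-02", "model": "X", "interfaces": {"Gi0/1": "up", "Gi0/2": "down"}},
        {"hostname": "sw-01", "model": "Y", "interfaces": {"Gi0/1": "down", "Gi0/2": "down"}},
    ]
    print(render_inventory(sample_facts))
```

Because the layout lives in the template rather than in the logic, the same facts can feed a different report just by swapping the template string.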
Move Configuration to IaC
Start with building out the inventory file and learning how to group devices together. When the network is thought of holistically, intent-based networking will emerge. Start with a small site and build from there. Once IaC is fully implemented and the desired end state of the network is known, other tools can be introduced to help self-correct the network in the event of an outage. A CMDB can be automatically populated, as the configuration items are all stored in flat text files. Imagine a network that is predictable, self-healing and fully documented. This should be the end goal with IaC.
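A simplified way to think about a grouped inventory and a known desired end state is sketched below in Python. The inventory layout (groups holding shared vars and per-host overrides) and the function names expand_inventory and find_drift are assumptions for illustration, not the format of any particular tool. The drift report is the kind of input a self-correcting workflow could act on.

```python
def expand_inventory(groups: dict[str, dict]) -> dict[str, dict]:
    """Flatten {group: {"vars": {...}, "hosts": {host: {...}}}} into per-host state.

    Host-level values override group-level values.
    """
    desired: dict[str, dict] = {}
    for group in groups.values():
        for host, host_vars in group["hosts"].items():
            desired[host] = {**group.get("vars", {}), **host_vars}
    return desired


def find_drift(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, dict]:
    """Return {host: {setting: (desired value, actual value)}} for every mismatch."""
    drift: dict[str, dict] = {}
    for host, settings in desired.items():
        current = actual.get(host, {})
        diffs = {
            key: (value, current.get(key))
            for key, value in settings.items()
            if current.get(key) != value
        }
        if diffs:
            drift[host] = diffs
    return drift


if __name__ == "__main__":
    inventory = {
        "access": {
            "vars": {"ntp": "10.0.0.1"},
            "hosts": {"sw-01": {}, "sw-02": {"ntp": "10.0.0.2"}},
        }
    }
    # Hypothetical state as read back from the devices.
    running = {"sw-01": {"ntp": "10.0.0.1"}, "sw-02": {"ntp": "10.0.0.1"}}
    print(find_drift(expand_inventory(inventory), running))
```

Since the desired state is plain data in text files, the same structure can be version-controlled, reviewed and used to populate documentation.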
Deploy with Tools
Start using tools to deploy your configuration, whether it be a custom script, Ansible or Terraform. Many organizations today have adopted the public cloud and/or some type of hybrid infrastructure. One of the core concepts that makes a cloud work is automation, as it allows us to stand up entire infrastructure by deploying a single Terraform template. What stands out about Terraform is its ability to keep track of the changes made to the infrastructure. If Terraform deploys it, Terraform knows about it and therefore knows what to delete.
Let’s take the deployment of a cloud landing zone with a third-party NVA firewall to facilitate connectivity to on-premises. The desired end state of the infrastructure can be deployed with Terraform. If the deployment errors out or the end state is not what you expected, you can simply tear it all down with terraform destroy. No harm is done, since the infrastructure was still in development at that time and the template should not be considered ready for production until it has been tried and tested anyway. Although the configuration for the NVAs could be staged and applied by Terraform, if Ansible is already used for deploying configuration you could complement your deployment by applying a configuration template after the deployment is complete. Start using these tools to level up your skills with them.
Redefine Alerts
Alerts have historically been something that started with good intentions and morphed into something that is completely ignored. Does anyone pay attention to the mailbox folder for alerts when the monitoring system sends an email every time a fan runs just a little high? Alerts should be tailored to items that require investigation, especially if a 24/7 NOC is not in place. With all the technologies available today, email is not the only option for delivering alerts and network monitoring systems are not the only option for generating them. Alerts can take the form of an NMS dashboard, an email, a webhook that posts a message in a specific chat room, or even a text message.
By redefining and categorizing the alerts, workflows can be developed for the follow-on actions that need to be performed to remediate each alert. With the alert defined and a remediation workflow developed, why do we need to wait on a human to stop what they are doing, investigate the issue with manual commands and manually remediate a known process? Instead, perhaps the alert sends an event to an event bus, which in turn kicks off an automation script that gathers facts about the state of the device, remediates the issue, sends the results as a report to the network operations team and notifies them that the process was executed. The outage is remediated, analytics about the device are gathered to help identify the root cause and the necessary people are informed.
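The flow described above can be sketched in a few lines of Python. This is an in-memory stand-in rather than a real message bus: the EventBus class, the event fields and the remediate_interface_down handler are all hypothetical, and the fact gathering and remediation steps are placeholders for calls to your own tooling.

```python
from typing import Callable

Handler = Callable[[dict], dict]


class EventBus:
    """Tiny in-memory stand-in for a real message bus."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = {}

    def subscribe(self, alert_type: str, handler: Handler) -> None:
        self._handlers.setdefault(alert_type, []).append(handler)

    def publish(self, event: dict) -> list[dict]:
        # Alerts with no registered workflow produce no reports.
        return [handler(event) for handler in self._handlers.get(event["type"], [])]


def remediate_interface_down(event: dict) -> dict:
    """Hypothetical workflow: gather facts, remediate, and build a report."""
    interface = event["interface"]
    facts = {"interface": interface, "state_before": "down"}  # placeholder for real fact gathering
    action = f"bounced {interface}"                           # placeholder for the real remediation
    return {
        "device": event["device"],
        "facts": facts,
        "action": action,
        "notified": "network-ops",  # placeholder for sending the report to the team
    }


if __name__ == "__main__":
    bus = EventBus()
    bus.subscribe("interface_down", remediate_interface_down)
    alert = {"type": "interface_down", "device": "sw-01", "interface": "Gi0/1"}
    for report in bus.publish(alert):
        print(report)
```

Only alert types with a defined workflow trigger any action, which matches the idea of categorizing alerts first and automating the known remediations second.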
Conclusion
Network automation is an umbrella term covering a large pool of different technologies that work together to improve the overall operation, stability and predictability of our networks, from freeing up hours of redundant work to changing the way the network is configured with IaC. This article is only intended to scratch the surface of these technologies. Many resources exist today to help organizations adopt an automated approach to managing and operating the network. Whether there is a pain point your organization would love to put behind it or a culture shift toward a fully automated private cloud infrastructure, ivision is eager to help. Automation is a true passion I have gravitated to ever since I wrote my first Ducky Script for a Rubber Ducky USB, and our network experts share the same passion for improving your network experience. Give ivision a call to get started on your automation journey.