My Network Automation Journey

Submitted by zeroslash on Sat, 01/06/2018 - 16:01

Before I get into my story I just want to give an overview of what network automation is, as I understand not everybody who reads this will be familiar with it. If you are new to networking or have plans to jump on board, this is probably the most important thing you need to know. The whole networking industry has been undergoing a shift for the last couple of years. There is now a lot of focus on the programmability of the network, and we are entering an era where network engineers are also expected to know how to code. Yes, that's right, code, as in what programmers do. And no, you don't have to be an expert, but it is slowly becoming the norm. If you want to see for yourself, just search for network engineer positions at the top tech companies you know. As for why you should take this step, there are probably too many reasons to mention in this post alone. For now, read on and you will pick up some background on the challenges.


Early in my career, in 2008, I started out as a network engineer working for a large Australian ISP. We were a small team of 5-6 engineers in Manila, with about the same number in Sydney. Both teams worked on the same backbone but had slightly different responsibilities. In the very beginning the workload was okay, but it quickly increased as the network doubled in size year after year. We also went through two big acquisitions during the five-plus years I was there. It was quite a ride trying to scale the network while keeping up with the demand.


This was, of course, good for the business, and the only challenge was the amount of work and backlog we had to go through. As an engineer, you are presented with problems that, if you fail to solve them, will sooner or later strike back with a wide blast radius of damage. However you look at it, the point is that your responsibility covers a wide area, and people depend on what you do: other teams and departments, and most importantly the vast number of customers. If you're a small team taking care of a rapidly growing network, then apart from trying to scale the network, you had better start thinking about how to scale yourself first.


This is where I started to find ways to cut down on time and work a little more efficiently, a little smarter. I had some programming knowledge, although it was mostly academic and had never been used professionally. But hey, this is the real world. Pick up whatever weapon you need to get the job done. The thing is, I couldn't be coding in Java for the use case I was trying to solve. I learned it in school and I sucked at it anyway, but let's not get into that. I needed something I could pick up quickly and start applying: a good scripting language. It was around 2010, and at that time Perl was still a pretty good choice. This was before we migrated our NMS; we were still using MRTG as our monitoring system. MRTG is written in Perl, so it seemed like a good fit.

I wanted to start with our reports as my first project. It was an obvious target for me, as it was the most time-consuming thing we did, and we did it every day. If you take away the human requirement to get the job done, you are left with a repetitive and trivial task. For me, it was the best candidate for automation. I pulled down an open source script that could read RRD files (MRTG saves the collected data into RRD), then wrote a script that traversed the directories holding the graphs we were monitoring, so I could access the same data we had been putting into our report by hand. I just had to make sure I was extracting the correct peak rates of the bandwidth utilization, or whichever other metrics we were monitoring such as CPU or memory, and push the output into a format we could copy straight over to the existing spreadsheet we used for the reports. It turned out to be fantastic! The only thing we did by hand after that was the part which requires human analysis.
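
Here is a minimal sketch of that idea in today's Python (the original was Perl), assuming the rrdtool Python bindings are installed and that the RRD files use MRTG's usual ds0/ds1 (inbound/outbound) data sources; the directory path and output file are made up for illustration.

```python
import csv
import glob
import os

import rrdtool  # Python bindings for RRDtool (assumed installed)

GRAPH_DIR = "/var/mrtg"           # hypothetical path to the MRTG/RRD tree
REPORT_FILE = "weekly_peaks.csv"  # output we can paste into the spreadsheet


def peak_rates(rrd_file, start="-7d"):
    """Return the peak rate (bits/sec) per data source in an MRTG RRD file."""
    # MRTG normally keeps two data sources: ds0 = inbound, ds1 = outbound,
    # stored as bytes per second.
    (_start, _end, _step), names, rows = rrdtool.fetch(
        rrd_file, "MAX", "--start", start
    )
    peaks = {name: 0.0 for name in names}
    for row in rows:
        for name, value in zip(names, row):
            if value is not None:
                peaks[name] = max(peaks[name], value * 8)  # bytes/s -> bits/s
    return peaks


def main():
    with open(REPORT_FILE, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["graph", "peak_in_bps", "peak_out_bps"])
        # Walk the directory tree where the RRD files behind our graphs live.
        pattern = os.path.join(GRAPH_DIR, "**", "*.rrd")
        for rrd_file in glob.glob(pattern, recursive=True):
            peaks = peak_rates(rrd_file)
            writer.writerow([
                os.path.splitext(os.path.basename(rrd_file))[0],
                round(peaks.get("ds0", 0.0)),
                round(peaks.get("ds1", 0.0)),
            ])


if __name__ == "__main__":
    main()
```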


This script was a success, and we adapted it to all the other reports we did. It was a big time saver. I had to sacrifice a little bit of my time to build it, but I knew it was going to be worth it. Put in the work now and reap the benefits later. But it didn't stop there. Soon it became obvious that other parts of our responsibilities could benefit from automation. Another thing we did regularly was grooming the traffic across the backbone. This meant making sure that links were properly utilized. Although congestion sometimes cannot be avoided in some areas, we also had to make sure there were no under-utilized links. Basically, it was our job to make sure traffic was evenly spread across the network. For the most part, this involved manipulating traffic via prefix-lists: we could move traffic around by shifting networks from one prefix-list to another, so there was a lot of deleting and adding of prefix-list entries. You would think this was pretty straightforward with copy-paste and find-and-replace. That's true, but doing that for half your day at work does not sound like much fun. You would rather focus on the traffic engineering aspect, because that's the part that challenges your brain, and offload the mundane parts, like preparing the config, to scripts. So I wrote the scripts, and things started to become a little easier and faster. One important side effect is that it also increased accuracy and reduced errors.
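
To give a flavor of the grunt work the scripts took over, here is a small hypothetical sketch in Python: the prefix-list names and networks are invented, and a real Cisco-style prefix-list would normally carry sequence numbers as well.

```python
# Hypothetical example: generate the CLI needed to move a set of networks
# from one prefix-list to another. Names and prefixes are made up.
PREFIXES_TO_MOVE = ["203.0.113.0/24", "198.51.100.0/24"]


def move_prefixes(prefixes, src_list="PL-LINK-A", dst_list="PL-LINK-B"):
    """Return config lines that remove entries from src_list and add them to dst_list."""
    lines = []
    for prefix in prefixes:
        lines.append(f"no ip prefix-list {src_list} permit {prefix}")
        lines.append(f"ip prefix-list {dst_list} permit {prefix}")
    return lines


if __name__ == "__main__":
    # Paste-ready config instead of half a day of find-and-replace.
    print("\n".join(move_prefixes(PREFIXES_TO_MOVE)))
```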


Things were going great, and I eventually moved on to tasks that required device access and pushing commands onto the box. These were obviously the riskier types of activities, but as long as you are careful and make your code do exactly what it is intended to do, and nothing more, you should be okay. Along the way, I learned how to handle the interactive nature of routers when pushing commands to them, given that it's all CLI. You discover the different types of error responses and learn how to deal with them. Soon I was off to things like moving customers from one terminating edge router to another; if you're familiar with ISP-type routers, these are the BRAS (Broadband Remote Access Server) and LNS (L2TP Network Server). Things were pretty much just expanding from where I originally started: from reading database files to controlling where traffic flows through the network.
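
My scripts back then were in Perl, but the same idea in today's Python would look something like this sketch using the Netmiko library; the device details and config lines are placeholders, and the error check is deliberately crude.

```python
from netmiko import ConnectHandler  # assumes the netmiko library is installed

# Placeholder device details; real scripts would pull these from an inventory.
device = {
    "device_type": "cisco_ios",
    "host": "edge-router-1.example.net",
    "username": "automation",
    "password": "change-me",
}

config_lines = [
    "interface GigabitEthernet0/1",
    " description customer-moved-example",
]

with ConnectHandler(**device) as conn:
    output = conn.send_config_set(config_lines)
    # Crude check: IOS-style CLIs echo errors such as '% Invalid input detected',
    # so inspect the output before assuming the change actually went in.
    if "% " in output:
        raise RuntimeError(f"Router rejected the change:\n{output}")
    print(conn.send_command("show run interface GigabitEthernet0/1"))
```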


A few years later, in 2013, I moved to another company: a global MPLS service provider with managed services. This time my focus was squarely on the network automation side; coding was something I did full time. I came across a wide range of projects and problems to solve with automation, from outage detection and impact analysis to full-blown configuration deployment for the whole service lifecycle, handling multi-vendor platforms such as Cisco, Juniper, and Alcatel-Lucent. Every new customer deployment, and every modification to a service throughout its lifetime until it is decommissioned, was done through a portal. Every variation of configuration was pre-defined and templated, and each went through extensive testing before acceptance and finally production.
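
The heart of that kind of system is templating: every supported variation of configuration is pre-defined, tested, and rendered with the customer's parameters. Here's a toy sketch of the idea in Python using Jinja2; the templates and parameters are invented for illustration and nowhere near what a production system would carry.

```python
from jinja2 import Template  # assumes Jinja2 is installed

# Toy per-vendor templates for the same logical service. A real system keeps
# these in files, covers many more variations, and tests each one extensively.
TEMPLATES = {
    "cisco": Template(
        "interface {{ interface }}\n"
        " description {{ customer }} managed service\n"
        " ip address {{ ip }} {{ netmask }}\n"
    ),
    "juniper": Template(
        "set interfaces {{ interface }} description \"{{ customer }} managed service\"\n"
        "set interfaces {{ interface }} unit 0 family inet address {{ ip }}/{{ cidr }}\n"
    ),
}


def render_service(vendor, **params):
    """Render the pre-defined template for a vendor with the customer's parameters."""
    return TEMPLATES[vendor].render(**params)


if __name__ == "__main__":
    print(render_service(
        "juniper",
        interface="ge-0/0/1",
        customer="ACME",
        ip="192.0.2.1",
        cidr=30,
        netmask="255.255.255.252",
    ))
```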


I'd also like to mention that I had moved to Python by this stage. This was the point in time when Python was picking up popularity in the network engineering space. Everything was becoming software-defined whatever, and Python was showing up as the language of choice for the network automation libraries coming out at that time. I honestly believe Python didn't have much to do with the software-defined networking buzz that was going on; instead, Python and network automation were making their mark because they were solving the real problems network engineers were facing. Stripped down to the very basics, what people really needed was a way to keep up with all the chaos of managing the whole network infrastructure. In the same manner that the server and systems world was trying to keep up with the explosion of virtualization and cloud computing, the network was suffering from the same management-plane challenges. The difference is that the server and systems world was already ahead in solving them: they had plenty of mature management tools and frameworks, while we, the network folks, were still stuck with our well-beloved CLI. The networking industry had been managing networks the same way for more than two decades before it began to change, while the rest of I.T. had already moved on. There's probably one exception to this, but I'll leave that for another post.


In 2016 I moved again, this time to work with one of the major networking vendors. I got exposed to some of the more cutting-edge products in the space, particularly for network orchestration and NFV. I'd like to point out that network orchestration is different from network automation: it requires putting a lot of pieces of the infrastructure together and having them work in a coherent manner. The closest I had ever come to this type of project was the configuration deployment system for the full service lifecycle across multiple vendors. The difference is that that was an internal, custom-built tool, whereas this time around I was working on an actual product, which is, of course, more advanced and full-featured. Since this was really an NFV project, it required orchestration across the VNFs (Virtual Network Functions) themselves, OpenStack, and the physical components of the infrastructure. The orchestration layer was responsible for ensuring all the other components and layers were properly configured to get the whole service up and running end-to-end.


Another thing I got exposed to was network analytics, which I thought was very important. It added a lot to the perspective I had on automating infrastructure. I had worked with large amounts of data before, but the big difference was that that data was at rest. It's quite a different challenge when you need to deal with a high volume of incoming data, which was the case here. I say it's different because you're not the one controlling the flow of the data as it comes in. It comes in, you filter it down to what you need, transform it, then store and index it for later analysis. With the help of pretty graphs, you get a totally different view of your network data.
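
The shape of that kind of pipeline is roughly the following sketch, written here with plain Python generators; the event names and fields are invented, and a real deployment would sit on top of a proper collector and an indexing backend rather than in-memory lists.

```python
import json
import time

WANTED_EVENTS = {"interface-down", "bgp-neighbor-down"}  # invented event names


def incoming_records(source):
    """Stand-in for a live feed (syslog, flow records, streaming telemetry)."""
    for line in source:
        yield json.loads(line)


def filter_and_transform(records):
    """Keep only the events we care about and reshape them for indexing."""
    for rec in records:
        if rec.get("event") in WANTED_EVENTS:
            yield {
                "timestamp": rec.get("ts", time.time()),
                "device": rec.get("device"),
                "event": rec.get("event"),
            }


def index(docs, store):
    """Stand-in for pushing documents into whatever backend stores and indexes them."""
    for doc in docs:
        store.append(doc)


if __name__ == "__main__":
    feed = [
        '{"ts": 1514764800, "device": "pe1", "event": "interface-down"}',
        '{"ts": 1514764805, "device": "pe2", "event": "optics-reading"}',
    ]
    store = []
    index(filter_and_transform(incoming_records(feed)), store)
    print(store)  # only the interface-down event survives the filter
```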


I've learned so much over the years, and even today I feel I still have a long way to go. I'm still on my quest to conquer the next challenge, and I'm really excited about the transformations we are about to go through. There's so much going on in this industry, and I feel these are really good times. Although people, organizations, and the different vendors have different views and approaches as to how we should move forward, I believe this itself will breed innovation in many areas. One promising thing I'm looking out for is disaggregation, more commonly termed white box networking, or sometimes Open Networking, depending on the approach or who you're talking to. I believe this has the potential to really change how we build and operate networks.


Are you interested in programming for networks? I encourage you to check out the Facebook page for more related content. I post regularly to share knowledge and information about topics related to networking and programming, as well as other I.T.-related content.


Thank you and I appreciate you stopping by.