Networks Are Built to Be Up, But Often Are Not
I am constantly driving around town. When I am lucky, there are no problems and I can easily get to my destination. Rarely, there will be a scheduled event like a holiday parade or triathlon going through part of town requiring roads to be closed. Instead of trying to find an alternate path, I often end up turning around and going home.
In between these two situations, I usually end up stuck in some traffic. It may be the lunchtime rush, or a school bus dropping off students in the afternoon. This delays my journey, but I bear with the unforeseen circumstances and eventually make it to my destination.
We build our transportation network so people can get from one point to another. The types of roads and the amount of connectivity depend on the expected traffic volumes under various circumstances. We build IT networks with a similar goal in mind. They connect end-users to applications and data. The networks are designed to have a certain level of availability and reliability. The technologies and speeds we use depend on the application performance requirements and the end-user expectations.
When the networks and roads break down or are taken out of service due to maintenance or special events, it is easy to tell end-users that the system is down and they should come back later. Sometimes alternate routes are established in advance so the connection can still be made.
Not just 1’s and 0’s
The harder problem arises when the infrastructure is congested somewhere, causing delays and issues throughout the system. How often have you been stuck in traffic for kilometers? When you finally reach the point where traffic clears up, you can find no reason for the traffic jam in the first place. By the time you reach the bottleneck, the problem has disappeared. Or the problem may be somewhere else entirely, indirectly causing the congestion on the road you are on.
There is a significant cost to the business when there are delays and outages. The lack of access to the applications and data can translate directly to lost revenue. The time operations staff spend identifying, troubleshooting, and mitigating issues costs money and resources. There may also be a public cost, through damage to the brand's reputation or legal compensation.
Stop and go and…
We need to understand these three operational states of the network and application delivery: running normally, down, and degraded. First, we will look into the designs and technologies used to keep the IT infrastructure running properly under normal operational circumstances. Next, we can see how the network and applications respond to outages and failures. Finally, and most importantly, it is critical to understand why performance degrades, how to identify the cause, and how to mitigate the problem.
Next week, we will build a typical resilient network and explain what technologies to use and how to incorporate them into our design. The network will be designed assuming that there are no errors or failures. Later, we will break and saturate our network and look into the solutions we can use to mitigate these problems.