An Introductory Guide to Developing Fault-Tolerant Networks
In Greek mythology, the Titan Prometheus was chained to a rock. Every day, an eagle flew down and ate part of his liver. The organ regenerated during the night, replenishing the food source. The liver is one of the few organs in the human body that can spontaneously regenerate. Even more impressive is the fact that while the liver is regenerating and fixing itself, it is still functional. The ancient Greeks knew of this capability and incorporated it into their mythology almost 3,000 years ago.
Continuous Functionality
When we design networks, we want them to stay functional even when there is a disruption to the system. Hardware failures, fiber cuts, software glitches, and even squirrels chewing through cables happen. We are concerned about how the application delivery and network infrastructure respond to these problems. We architect technologies into our IT infrastructure to minimize the impact of the damage.
Like our livers, the network needs to keep functioning even as it heals the damage caused to it. Applications need to be delivered and businesses still have work to be done. Early on, we developed dynamic network protocols like spanning tree protocol (STP) for layer 2 topologies and routing information protocol (RIP) for layer 3 topologies. Over time, we have advanced these protocols to include the layer 2-based rapid spanning tree protocol (RSTP) and layer 3 routing protocols including OSPF, IS-IS, and BGP.
Moving On Up the OSI Stack
We still need to provide mechanisms for application availability and the delivery of the applications across the network infrastructure. This is where we introduced server load balancing (SLB) and dynamic DNS manipulation through global server load balancing (GSLB). They provide the mechanisms to detect application server failures and complete datacenter failures.
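As a rough illustration of the detect-and-steer idea behind SLB and GSLB, here is a minimal Python sketch. The names `Site`, `healthy_servers`, `RoundRobin`, and `pick_site` are hypothetical and do not come from any product. The `probe` callable stands in for whatever health check (a TCP connect, an HTTP request) a real device would run, and the site preference order is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# A probe takes a server name and returns True if the server looks healthy.
# In a real deployment this would be a TCP or HTTP check; here it is injected.
Probe = Callable[[str], bool]


@dataclass
class Site:
    """A hypothetical datacenter with a pool of application servers."""
    name: str
    servers: List[str]


def healthy_servers(site: Site, probe: Probe) -> List[str]:
    """Return the servers in this site that currently pass the health check."""
    return [server for server in site.servers if probe(server)]


class RoundRobin:
    """SLB-style selection: rotate over the healthy servers of one site."""

    def __init__(self, site: Site, probe: Probe) -> None:
        self.site = site
        self.probe = probe
        self._counter = 0

    def next_server(self) -> Optional[str]:
        # Re-check health on every call so failed servers are skipped.
        pool = healthy_servers(self.site, self.probe)
        if not pool:
            return None
        server = pool[self._counter % len(pool)]
        self._counter += 1
        return server


def pick_site(sites: List[Site], probe: Probe) -> Optional[Site]:
    """GSLB-style selection: first site, in preference order, that still
    has at least one healthy server. Returns None if every site is down."""
    for site in sites:
        if healthy_servers(site, probe):
            return site
    return None
```

The split mirrors the two layers described above: `RoundRobin` routes around a failed server inside one site, while `pick_site` only moves clients to another site when an entire site has no healthy servers left.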
My Ideal Network (Kind Of)
If I were to design a network today, at a high level, it would look a lot like the above diagram. Redundancy is built into every aspect of the architecture. There are multiple servers, geographically diverse sites, and multiple network paths to the different components. There is no single point of failure. If one component fails, the dynamic technologies will automatically reconverge to determine a new best path between client and application server.
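To make the reconvergence idea concrete, the sketch below recomputes a lowest-cost path after a link is removed. It uses Dijkstra's algorithm, which is the kind of shortest-path calculation link-state protocols such as OSPF perform, but it is only a toy model: the topology, node names, link costs, and the helper names `build_topology`, `shortest_path`, and `without_link` are all assumptions made for illustration.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Topology = Dict[str, Dict[str, int]]  # node -> {neighbor: link cost}


def build_topology(edges: List[Tuple[str, str, int]]) -> Topology:
    """Build a bidirectional topology from (node_a, node_b, cost) tuples."""
    links: Topology = {}
    for a, b, cost in edges:
        links.setdefault(a, {})[b] = cost
        links.setdefault(b, {})[a] = cost
    return links


def shortest_path(links: Topology, src: str, dst: str) -> Optional[Tuple[int, List[str]]]:
    """Dijkstra's algorithm. Returns (total cost, path) or None if unreachable."""
    dist = {src: 0}
    prev: Dict[str, str] = {}
    heap = [(0, src)]
    visited = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == dst:
            break
        for neighbor, weight in links.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                prev[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    if dst not in dist:
        return None
    # Walk the predecessor chain backwards to rebuild the path.
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[dst], path


def without_link(links: Topology, a: str, b: str) -> Topology:
    """Return a copy of the topology with the a-b link removed (a simulated failure)."""
    return {
        node: {nbr: cost for nbr, cost in nbrs.items() if {node, nbr} != {a, b}}
        for node, nbrs in links.items()
    }


# Hypothetical topology: two routers between the client and the server.
TOPOLOGY = build_topology([
    ("client", "r1", 1),
    ("client", "r2", 1),
    ("r1", "server", 1),
    ("r2", "server", 3),
])
```

Calling `shortest_path(TOPOLOGY, "client", "server")` picks the path through `r1`. Running it again on `without_link(TOPOLOGY, "r1", "server")` picks the more expensive path through `r2`, which is the behavior the redundant paths exist to allow. Real protocols also have to detect the failure and distribute the change, which this sketch leaves out.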
There are a lot of fine details that I am not covering in this article, for two reasons. First, the actual design of the layer 2/3 network and device connectivity depends on the delivery requirements of each application, which must be met to ensure application service level assurance (SLA). Since we do not know what the applications are, we cannot make that determination. Second, I would need to write a book to discuss all of the aspects necessary to build this network.
The key points to remember when designing your own self-healing, regenerating IT infrastructure are to:
- Build redundancy into the architecture
- Leverage dynamic technologies that automatically adjust to changing conditions
- Remember that the critical end goal is to ensure application SLA
Next, we will take this network and break various components to see how they affect the delivery of the application and what the end user perceives.