Networks Can Fail - Here’s What You Can Do About It

By Frank Yue, February 17, 2016

Recently, my computer failed and I had to get it fixed and my data restored. It took several days to identify the specific problem (in this case, catastrophic hard drive failure) and restore access to my applications and data. In the meantime, my productivity dropped dramatically and it was hard to work using the alternative tools that were available to me.

This is often what happens in a disaster recovery (DR) scenario for IT networks. Separate backup tools are configured and set aside for use when the primary systems fail. Unfortunately, these backup tools have limited functionality, are not kept up to date, and are tested too infrequently to ensure that they will work in a real emergency. Instead of focusing on a DR plan, it is smarter to design an IT infrastructure that reduces the need to fall back on the limited DR tools and facilities.

Key failure points

Last week, I talked about what it takes to build a resilient network architecture. This week, let's look at the different possible failure points in the design and how the various technologies can mitigate the problems.


Layer 2 switch – Layer 2 switches provide end-point and inter-device connectivity. To be fault tolerant, we multi-home them and use Rapid Spanning Tree Protocol (RSTP) to eliminate loops in the network. In an ideal network, the RSTP domains are small and converge in less than one second. When a switch fails, only the single-homed devices and servers directly attached to that switch lose connectivity.
Application server – It would be nice to multi-home the server to multiple switches and rely on a failover mechanism, but it is often easier to build multiple application servers and provide the reliability, scalability, and availability through the server load balancing (SLB) technologies that I talk about so much. The application delivery controllers (ADCs) are themselves redundant and can automatically detect a problem with the server and/or application within a few seconds. Each ADC uses a variety of health checks, or probes, to determine the status of the server and the applications it is hosting.
Layer 3 router – Routing protocols have been around almost since the beginning of the Internet. Today, most resilient networks rely on IGPs (Interior Gateway Protocols) like OSPF (Open Shortest Path First) or IS-IS (Intermediate System-to-Intermediate System) to dynamically reroute traffic around a router or link failure. These protocols will typically converge in under 30 seconds, depending on the size and complexity of the network.

BGP (Border Gateway Protocol) is an EGP (Exterior Gateway Protocol) that connects individual networks, or autonomous systems (AS). The mesh of BGP-connected networks is typically considered to be the Internet. When a failure affects the BGP network, the protocol is designed to reroute traffic within several minutes.
Data centers – If a broader issue such as a power failure, natural disaster, or fiber cut takes an entire data center offline, a redundant data center that has the same functionality and services should be able to absorb the increased load. DNS-based technologies like global server load balancing (GSLB) and IP routing functions like IP anycast can help ensure that traffic is steered to the appropriate site. Ideally, these technologies are always working and all data centers are active at the same time. If an entire site goes offline, the network technologies can adjust within seconds to minutes.
Data and databases – Redundancy and resiliency are hard to provide for databases. They are updated often, and keeping multiple copies in sync is a challenge at best. Clustering technologies and data assurance technologies help address the reliability of any single database instance. Ideally, if a portion of a database fails, the replication and synchronization technologies make the failure invisible to the end-user. For user-generated data, cloud technologies have enabled the customer to keep the data separate from the application and client devices. Depending on the data in question, the time to adjust to a failure can range from less than a second to hours if one has to restore the data from an archived backup.
End-user devices – Assuming that the data is stored in the cloud or a networked cluster as previously described, then the end-user device is providing access to the application. If the client device (laptop, tablet, smartphone, etc.) fails, then the end-user needs to find a new device that provides access to the application. The time to recover can be from seconds to minutes depending on the end-user's access to other devices.
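
The health-check behavior described for the ADC above can be sketched roughly as follows. This is only an illustration, not how any particular ADC implements it: the names `tcp_probe`, `HealthTracker`, `fall`, `rise`, and `pick_servers` are hypothetical, and the threshold values are assumptions. The idea is that a server is marked down only after several consecutive failed probes and marked up again only after several consecutive successes, so that one lost probe does not pull a healthy server out of rotation.

```python
import socket
from dataclasses import dataclass
from typing import Dict, List


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


@dataclass
class HealthTracker:
    """Tracks one server's status from a stream of probe results.

    fall: consecutive failures needed to mark the server down (assumed value).
    rise: consecutive successes needed to mark it up again (assumed value).
    """
    fall: int = 3
    rise: int = 2
    healthy: bool = True
    _streak: int = 0  # consecutive results that contradict the current status

    def record(self, probe_ok: bool) -> bool:
        """Feed in one probe result and return the resulting status."""
        if probe_ok == self.healthy:
            self._streak = 0  # result agrees with the current status
        else:
            self._streak += 1
            threshold = self.rise if probe_ok else self.fall
            if self._streak >= threshold:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy


def pick_servers(trackers: Dict[str, HealthTracker]) -> List[str]:
    """Return the servers that are currently eligible to receive traffic."""
    return [name for name, tracker in trackers.items() if tracker.healthy]
```

In use, a loop would call `tracker.record(tcp_probe(host, port))` once per probe interval for each server. Assuming a one-second interval and `fall=3`, a dead server would be taken out of rotation after about three seconds, which is in line with detection "within a few seconds".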

These are the core components of a standard IT architecture. As long as these technologies are applied, we can build a flexible model that can withstand the failure of any single component or group of components. It is possible to create a fine-tuned application delivery architecture that dynamically adjusts to almost any failure scenario and is functional again within seconds to minutes. Once operational, this capability requires no human intervention, since we have designed the intelligence and mitigation functions into the technologies.
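
One way to check the claim that a design can withstand the failure of any single component is to remove each component in turn from a model of the topology and see whether clients can still reach the application. The sketch below does this with a breadth-first search. The topology and node names are invented for illustration and are not taken from any real network; `app` stands in for a pool of application servers.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def build(links: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from a list of (a, b) links."""
    graph: Dict[str, Set[str]] = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable(graph: Dict[str, Set[str]], src: str, dst: str, failed: str) -> bool:
    """Breadth-first search from src to dst, skipping the failed node."""
    if failed in (src, dst):
        return False
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in graph.get(node, set()):
            if nxt != failed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph: Dict[str, Set[str]], src: str, dst: str) -> List[str]:
    """List every intermediate node whose loss alone cuts src off from dst."""
    candidates = sorted(n for n in graph if n not in (src, dst))
    return [n for n in candidates if not reachable(graph, src, dst, failed=n)]


# Hypothetical topology: two routers and two switches, but only one ADC.
links = [
    ("client", "router1"), ("client", "router2"),
    ("router1", "adc"), ("router2", "adc"),
    ("adc", "switch1"), ("adc", "switch2"),
    ("switch1", "app"), ("switch2", "app"),
]
topology = build(links)
print(single_points_of_failure(topology, "client", "app"))  # prints ['adc']

# Adding a second (hypothetical) ADC removes the single point of failure.
redundant = build(links + [
    ("router1", "adc2"), ("router2", "adc2"),
    ("adc2", "switch1"), ("adc2", "switch2"),
])
print(single_points_of_failure(redundant, "client", "app"))  # prints []
```

This brute-force check only covers single-node failures in a static model. It says nothing about how long the protocols take to converge after the failure, which is where the timings listed above come in.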

Failure is not the only option

Ultimately, it is not failures in network and application availability that cause the most problems. It is the responsiveness and performance of the applications across the network infrastructure. Performance degradation in the application delivery infrastructure is also the hardest problem to identify and solve. Different applications have different performance requirements, even though they are all using the same infrastructure.

Next week, we will look into application performance. We will determine the metrics associated with application performance and understand how to monitor them. Based on the metrics, we will adjust the environment to create an optimally performing application delivery network.

Read "Keep It Simple; Make It Scalable: 6 Characteristics of the Futureproof Load Balancer" to learn more.


Frank Yue

Frank Yue is Director of Solution Marketing, Application Delivery for Radware. In this role, he is responsible for evangelizing Radware technologies and products before they come to market. He also writes blogs, produces white papers, and speaks at conferences and events related to application networking technologies. Mr. Yue has over 20 years of experience building large-scale networks and working with high performance application technologies including deep packet inspection, network security, and application delivery. Prior to joining Radware, Mr. Yue was at F5 Networks, covering their global service provider messaging. He has a degree in Biology from the University of Pennsylvania.
