Fault Tolerance - Requirements and Solutions

May 3rd, 1998

Jan-Erik Sarparanta
Department of Computer Science
Helsinki University of Technology
Jan-Erik.Sarparanta@hut.fi

Abstract

This paper examines network faults in a routed TCP/IP Ethernet network and the techniques for working around them. Much of the information also applies to other network technologies.

Table of Contents

1. Introduction to the paper
2. Fault tolerance
        2.1. Fault Resiliency
        2.2. Fault Tolerance
        2.3. What is needed from fault tolerance
3. Physical level techniques
        3.1. Device oriented fault tolerance
                3.1.1. Power
                3.1.2. Duplicate switching fabrics
        3.2. Link oriented fault tolerance
                3.2.1. Transceivers
                3.2.2. 3Com Resilient Link
                3.2.3. Link Aggregation
                3.2.4. Cisco EtherChannel
                3.2.5. LAN Aggregation
4. Protocol based techniques
        4.1. Link State Protocols
                4.1.1. Spanning Tree Protocol
                4.1.2. VLAN
        4.2. Routing Protocols
                4.2.1. RIP
                4.2.2. OSPF
                4.2.3. EIGRP
        4.3. Router Protocols
                4.3.1. VRRP
                4.3.2. HSRP
5. The Big Picture
6. Conclusions
7. Glossary
8. References

1. Introduction to the paper

Today the growing need for networked environments is driving development in three different directions. The first is cheap, affordable networking with no special abilities. The second is high throughput for backbones and other demanding environments. The third is fault resiliency, which is needed in critical environments that depend on communication. In this paper we examine a combination of these three driving forces by choosing basic shared Ethernet as our network medium (because it is cheap), adding some fault resiliency to it and at the same time trying to get more speed out of it. At the end we will notice that after these modifications our cheap Ethernet is no longer cheap, but can cost a lot of extra money.

The paper is split into three logical parts: fault recognition, technical solutions and fault management. At the end I try to arrive at a working solution for a common campus network that can be used as is in most medium-sized to large companies.

2. Fault tolerance

Fault tolerance means that some parts of a system may fail while the system as a whole continues to run and serve its clients. Fault resiliency means that the system tolerates some failures. For example, in this paper a router with two power supplies is somewhat fault resilient, but a network with many routers running a routing protocol is fault tolerant.

2.1. Fault Resiliency

Fault resiliency tries to absorb faults. That is, it tries to mask faults so that they do not affect the system at all. A good example of fault resiliency is a device which has two power supplies: when one power supply fails, the device continues to work with the other. Other implementations of fault resiliency are resilient network interface cards (NICs) and resilient links from 3Com.

2.2. Fault Tolerance

A fault tolerant system is a working system, either a single device or a group of devices, built so that some component of the system may fail but the system as a whole continues to work. It is often hard to build a fault tolerant system without some fault resiliency [1]. An example is a computer cluster which continues to work even if one of the cluster computers fails. The performance will of course drop, or service may even stop for a while, but the system continues to work. Another example is a router network that can survive the death of one router: it reconfigures itself to use other routes, but there is a short (or long) break in traffic.

2.3. What is needed from fault tolerance

The following basic requirements must be fulfilled.
  1. When a failure occurs in a fault tolerant system, the system must first isolate the faulty component from the rest of the working system.
  2. The system must reconfigure itself so that all operations continue to work, although the performance may drop.

The minimum requirement is that the network does not break the connection between two systems. Because I cannot know what kind of software and timeouts programmers put into their programs, I have to rely only on the TCP protocol and its timeouts. The TCP timeout should be based on some measure of the round trip time (RTT), and one possibility is to use the Smoothed Round Trip Time (SRTT) [2]. If there is enough data to transfer between two computers, the error rate of the network is minimal and no congestion is present, the SRTT is the same as the RTT.

Let us assume that there is some equipment between these two computers and that this equipment adds some delay when packets go through it. Let us further assume a very small network arranged as in picture 1. The interconnections are all 100Base-FX and the distances are minimal, so we can forget about the delays of those elements. Both switches work in 'cut-through' or 'fast-forward' mode, which means that a frame is forwarded as soon as enough of the MAC header is inside the switch for it to know where to forward the frame. In most modern switches this delay is below 40 us [3]. Let us assume that these switches have a delay of 35 us.

The hard part is the router, because it normally generates most of the delay in routed networks, and in the past network consultants advised reducing the number of routers and increasing segment sizes. Now that L3 switches (i.e. hardware routers) have come to the market, the delay they add to the network has been reduced to almost nothing. For example, 3Com's CoreBuilder 3500 can route IP packets as quickly as it can switch Ethernet frames. Its latency is between 15 us and 35 us, plus a delay that depends on the packet length.

So when we calculate the estimated round trip time in ideal conditions (no load, small packet sizes etc.), it looks something like this.

RTT = 2 * (2 * Switch Latency + Router Latency + Route Latency) + Receiving Computer Latency + Sending Computer Latency

RTT = 2 * (2 * 35*10^-6 s + 80*10^-6 s + 2*10^-6 s) + 2*10^-3 s + 2*10^-3 s

RTT = 0.0043 s, or roughly 0.004 s
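
The arithmetic of the formula above can be checked with a few lines of Python. This is only a sketch of the same calculation; the constant and function names are made up for illustration, and the values are the ones used in the worked example.

```python
# Latency figures from the worked example above, all in seconds.
SWITCH_LATENCY = 35e-6   # per switch
ROUTER_LATENCY = 80e-6   # router value used in the calculation
ROUTE_LATENCY = 2e-6     # the "Route Latency" term of the formula
HOST_LATENCY = 2e-3      # sending and receiving computer, each


def round_trip_time(switches=2):
    """Ideal-case RTT: the network path is crossed twice, each host once."""
    one_way = switches * SWITCH_LATENCY + ROUTER_LATENCY + ROUTE_LATENCY
    return 2 * one_way + HOST_LATENCY + HOST_LATENCY


if __name__ == "__main__":
    print(f"RTT = {round_trip_time() * 1e3:.3f} ms")
```

The host latencies dominate the result, so even doubling the number of switches changes the RTT only slightly.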


 


If we assume that the TCP stack retransmits a packet N times, that the first timeout equals the RTT and that the timeout doubles every time a packet must be retransmitted, the total time before TCP gives up is (2^N - 1) * RTT. So if the computer retransmits 5 times, as Windows NT does by default, and the network is like the one above, the total TCP timeout can easily be less than 5 seconds.

To achieve this kind of uptime from the network there are a few fundamental needs. First, the devices should be up and running as much as possible. Second, one should try to avoid triggering the convergence of the routing protocol, because it is always more time consuming than physical level techniques.
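
The doubling timeout can be sketched as follows. This is a simplified model of the reasoning above, assuming the first timeout equals the RTT; it is not the behaviour of any particular TCP stack, and the function names are illustrative only.

```python
def total_retransmission_time(rtt, retries):
    """Sum of the waits when the first wait is `rtt` and each retry doubles it."""
    return sum(rtt * 2 ** attempt for attempt in range(retries))


def closed_form(rtt, retries):
    """The same sum written as the geometric series result (2^N - 1) * RTT."""
    return (2 ** retries - 1) * rtt


if __name__ == "__main__":
    rtt = 0.004  # seconds, the rounded value from the example network
    print(f"{total_retransmission_time(rtt, 5):.3f} s")
```

With RTT = 0.004 s and N = 5 the waits are 0.004, 0.008, 0.016, 0.032 and 0.064 s, which add up to 31 * 0.004 s.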

3. Physical level techniques

There are two main categories of techniques for providing fault tolerance in computer networks. The first is to provide fault tolerance at the physical level so that the physical network itself stays functional. The second is to use some kind of protocol to detect faults and work around them. Physical fault resiliency can also be split into two main categories. The first is to provide fault tolerance for network links, and the second is to provide fault tolerance for devices.

3.1. Device oriented fault tolerance

Device oriented fault tolerance tries to make individual devices fault tolerant. The key idea is to duplicate almost everything inside a device. In some cases all duplicated units work at the same time and provide a kind of load balancing. These techniques are very useful, but they cannot directly help network traffic. They only try to keep devices up and running, which is of course a good idea.

3.1.1. Power

The easiest and most fundamental way to make a device fault resilient is to give it at least two different power sources. The most common cause of network downtime is loss of power [4]. It can be due to a broken power supply, a loose power cord or a power blackout. In the first two cases a second ordinary power supply or a separate DC power connector can save the day. In the last case the only thing that helps is a UPS. One must also remember that if a device has two power supplies, both of them should not be connected to the same small UPS. If one power supply can feed the device all the power it needs, one can be connected to the normal mains and the other to the UPS. In some cases a network device can hold several power supplies, and if some of them fail the device cannot be fully operational. 3Com's CoreBuilder 9000 is one of these devices, and it can shut down its modules in a predefined order until the available power is adequate again [4].
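
The idea of shutting down modules in a predefined order can be illustrated with a small sketch. It is not 3Com's actual algorithm; the module names, wattages and the function name are hypothetical and serve only to show the principle.

```python
def shed_modules(modules, available_watts):
    """Shut modules down in the given order until the power budget is met.

    `modules` is a list of (name, watts) pairs, listed in shutdown order:
    the first entry is the first one to be switched off.
    Returns (names still running, names shut down).
    """
    running = list(modules)
    shut_down = []
    while running and sum(watts for _, watts in running) > available_watts:
        name, _ = running.pop(0)
        shut_down.append(name)
    return [name for name, _ in running], shut_down


if __name__ == "__main__":
    # Hypothetical chassis: least important module listed first.
    chassis = [("floor-ports", 100), ("server-ports", 200), ("backbone", 300)]
    print(shed_modules(chassis, 500))
```

With 500 W left after a supply failure, only the first module in the list has to be switched off in this example.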

3.1.2. Duplicate switching fabrics

If a switch is in a position where it cannot be down, it is wise to buy a switch in which parts can be duplicated, so that if one or more components fail the switch itself continues to serve the network. Most of the Cisco Catalyst 5000 series and all of the Cisco Catalyst 8500 series are devices of this kind. 3Com's CoreBuilder 9000 can also hold two switching engines. All these switches try to use a passive backplane, i.e. the backplane itself does not do anything. In the case of the CB9000 the backplane is extremely passive and consists of a duplicated star topology. That means that from every card slot one bus goes to the first switching fabric and a second bus goes to the second switching fabric. If one of these buses fails, the device switches to the other bus. The two switching fabrics elect which of them is operational and which is the hot standby. Both fabric cards have two votes, and the management card has one vote. So if the switching fabrics cannot decide which is operational, the management card breaks the tie.

3.2. Link oriented fault tolerance

In link oriented fault tolerance the goal is to keep a network link operational. Link oriented techniques are all very fast, but the drawback is that they see only the path between two devices and only the physical level. If the physical level between two devices is working, link oriented techniques do not notice if traffic cannot be sent or received, or if the network is broken behind the next switch.

3.2.1. Transceivers

There are many implementations of duplicate transceivers. With a duplicated transceiver a network card has two links to the switch, but only one of them is operational at a time. When the transceiver or the link fails, traffic is moved to the other link. The NIC does not notice a thing, and there should be no break in the traffic.

3.2.2. 3Com Resilient Link [3]

3Com has a proprietary method of creating fault tolerant network links, called Resilient Links. The idea is to take two links and configure them as a resilient pair. If one link fails, the other takes its place. Resilient Link is a very good and handy thing to have around, but it is useful only in certain configurations. First of all, the resilient link is configured at only one end of the link pair. The advantage is that the link does not lose packets: all packets are forwarded to the other link. The links may differ in speed and mode (full duplex, half duplex), but they must be Ethernet and they must have the same VLAN configuration. The far end does not need to know about the resilient link, and the second link of the pair can go to a different switch.

3.2.3. Link Aggregation [4]

Link Aggregation (LA) is an upcoming standard and will be implemented only on Ethernet. The idea is to group multiple ports into one logical high speed link. The ports must be in one device at each end of the trunk, but they do not have to be the same speed. The port group appears to the switch as one port and does not differ from other ports. The port group balances load across its links, but the method by which the load should be balanced is still open. The current proposal is that one conversation stays on one cable while other conversations may use other cables. So LA decides from the pair of receiver and sender MAC addresses where a packet should be sent. A conversation cannot be spread over multiple cables, because the Ethernet standard does not allow frames to arrive in a different order than they were sent.
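
Choosing a cable from the MAC address pair could look something like the sketch below. Since the balancing method is still open, the XOR-and-modulo hash used here is purely an assumption for illustration, as are the function names and the example addresses.

```python
def mac_to_int(mac):
    """Convert a MAC address such as '00:a0:24:11:22:33' to an integer."""
    return int(mac.replace(":", "").replace("-", ""), 16)


def select_link(src_mac, dst_mac, link_count):
    """Map one conversation (a MAC pair) to a single link of the trunk.

    The same pair always gives the same index, so the frames of one
    conversation stay on one cable and cannot be reordered.
    """
    return (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % link_count


if __name__ == "__main__":
    client, server = "00:a0:24:11:22:33", "00:a0:24:44:55:66"
    print(select_link(client, server, 4))
```

Because the mapping depends only on the address pair, a single conversation can never use more than one cable, which is the bandwidth limit the text goes on to describe.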

In practice LA works fine as soon as there are even a few computers on the LAN. If several clients are talking to several servers, the load balancing of LA works so well that packets are spread over the cables almost equally. Link Aggregation provides a smooth upgrade path for 100 Mbps networks that do not need, or do not have the money for, a new GE network. The only thing to remember is that an individual conversation cannot have more bandwidth than one individual cable can carry. So in a situation where a server farm is behind a router and all clients are on the other side of the router, the servers gain no throughput from having LA built into their NICs, because all of the traffic is between the router and the server and not between the servers. There is one case where LA can help servers. Most of today's new NICs come with multiple MAC addresses, usually 16. The reason is that if the NIC or the switch cannot do VLAN tagging (IEEE 802.1Q), the NIC uses a different MAC address for each VLAN, and if the server uses multiple MAC addresses LA can improve throughput. In any case the server gets the benefit of having a fault tolerant second link to the switch.

3.2.4. Cisco EtherChannel

Cisco developed its own implementation earlier to achieve the same goal that LA is trying to reach. Cisco's implementation differs from LA in several ways. First, an EtherChannel (EC) has to be on one card, so it cannot be spread over different cards. This reduces fault tolerance, because if that card fails the whole EC fails. Second, an EC can have at most 4 ports while LA can have at least 16. Finally, all links in an EC have to be of the same type, e.g. 100Base-TX, full duplex etc. Load balancing is also done differently in EC than in LA. In LA the load is balanced using MAC address pairs, but in EC Cisco uses only 4 bytes of the MAC address, and in this case the load balancing is not as efficient as in LA [4].

On the other hand, Cisco's EtherChannel has been around for several years, it is widely used and it actually exists. LA is still only work aiming at a standard, although some vendors such as HP, Bay and 3Com have implemented LA in their switches. The major drawback of EC is that it is Cisco's property, and no one else can implement it in their switches without Cisco's permission.

3.2.5. LAN Aggregation [4]

LAN Aggregation is only an idea, and although some vendors are studying the technique no one has yet implemented it in any device. The idea is similar to Link Aggregation, but this time it is applied to the LAN instead of individual links. There would be multiple links to at least two independent devices, and from those devices there may be multiple links to further devices. The idea is that if one of these switches fails the other continues to work, without requiring a fully meshed topology of Link Aggregation trunks. The key function is to provide a technique for operating over multiple paths to multiple LANs without the danger of creating loops in Ethernet.

4. Protocol based techniques

Because physical level techniques provide fault tolerance only at the physical level and cannot know whether there are problems on higher layers, e.g. layer 3, several methods have been developed to find these problems.

4.1. Link State Protocols

Link state protocols try to find faults in links, i.e. in connections between end points. They do not provide any load balancing, and they are usually quite old and slow to converge.

4.1.1. Spanning Tree Protocol

Spanning Tree Protocol (STP) is probably the oldest and most widely used of these in Ethernet LANs. Spanning tree is a protocol that works in a switched or bridged environment, where the bridges calculate the best working route. Each bridge periodically sends a HELLO packet out of its interfaces and listens for HELLO packets from others. If the bridge has not heard any HELLO packets before the timeout is reached, it calculates a new route. Spanning tree usually has a convergence time between 30 seconds and a few minutes. This is due to old equipment that is not particularly fast at calculating routes, and to timers that are usually quite long because the HELLO packets are sent to the whole LAN.
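
The lower end of the convergence range can be related to the spanning tree timers. The sketch below uses the default timer values of IEEE 802.1D (hello 2 s, max age 20 s, forward delay 15 s), which are not taken from this paper, and a deliberately simplified model in which a port passes through the listening and learning states before it forwards. The function name is illustrative only.

```python
# IEEE 802.1D default timers in seconds (an assumption, not from the paper).
HELLO_TIME = 2
MAX_AGE = 20
FORWARD_DELAY = 15


def reconvergence_time(direct_failure):
    """Rough time until a blocked port forwards traffic after a failure.

    A failure on the bridge's own link is noticed at once; a failure
    further away is noticed only when the stored information ages out.
    The port then spends one forward delay listening and one learning.
    """
    detection = 0 if direct_failure else MAX_AGE
    return detection + 2 * FORWARD_DELAY


if __name__ == "__main__":
    print(reconvergence_time(direct_failure=True))
    print(reconvergence_time(direct_failure=False))
```

Under these assumptions a directly detected failure costs 30 seconds and a remote one 50 seconds; larger configured timers push the figure towards minutes.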

4.1.2. VLAN

Virtual LANs (VLANs) in general do not provide any fault tolerance services, but they can help to use the network more efficiently. Spanning tree blocks all links other than the main links it uses. So if a vendor's VLAN implementation allows STP to run inside a VLAN, and multiple VLANs can have multiple STP processes running, it is possible to provide some load balancing with VLANs. If one has multiple links configured as VLAN trunks (i.e. multiple VLANs are carried in one physical cable), one can configure STP so that one VLAN uses route 1 to the destination and another VLAN uses some other route to the destination. If one route fails, the traffic on that route is moved to some other link, where it must compete with the traffic already running on that link.

4.2. Routing Protocols

Routing protocols provide fault tolerance for routed networks, and most new routing protocols are designed to have a relatively fast convergence time. In this paper I comment only on a few of the most used routing protocols, since there have not been many changes in that area in the past few years. Routing protocols in general provide fault tolerance in the network backbone but not in the LAN.

4.2.1. RIP

RIP is probably the best known routing protocol, since it is quite old and it was used in the Internet before OSPF was developed. RIP is a distance vector protocol, and it is not very useful when compared to OSPF or EIGRP. RIP does not scale well, and one cannot control how far RIP announces route changes. RIP always announces its routes up to 15 hops away (a metric of 16 means unreachable), and it cannot know how fast the links are. So if there are lots of routers in the network, there will probably be some inefficient routes with RIP.

The other problem with RIP is that its convergence is neither fast nor reliable. RIP converges hop by hop, and with every route advertisement the new route information gets only one hop further. This takes a lot of time, because a router 10 hops away must wait for 10 advertisement intervals. During that time there is a chance that there will be black holes or loops in routing.
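
The hop by hop behaviour can be made concrete with a small simulation of a chain of routers. This is only a sketch: it assumes the usual RIP update interval of 30 seconds, which is not stated in this paper, it ignores triggered updates, and the function names are made up for illustration.

```python
UPDATE_INTERVAL = 30  # seconds between advertisements (assumed RIP default)


def rounds_until_known(hops):
    """Count advertisement rounds until a router `hops` away learns a route."""
    knows = [True] + [False] * hops  # router 0 originates the new route
    rounds = 0
    while not knows[hops]:
        # Each router advertises only what it knew before this round.
        snapshot = list(knows)
        for i in range(1, hops + 1):
            if snapshot[i - 1]:
                knows[i] = True
        rounds += 1
    return rounds


def worst_case_convergence_seconds(hops, interval=UPDATE_INTERVAL):
    """Upper bound when every hop waits a full advertisement interval."""
    return rounds_until_known(hops) * interval


if __name__ == "__main__":
    print(worst_case_convergence_seconds(10))
```

For the router 10 hops away this gives 10 rounds, or up to 300 seconds with the assumed interval.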

4.2.2. OSPF

Open Shortest Path First (OSPF) is a popular and open routing protocol. It is a link state routing protocol, and it can assign different costs to different links. That is a very good thing, because one can then have one or more primary routes, e.g. leased lines to a different town rented from an ISP, and one low speed ISDN route as a backup. The ISDN route would have a higher cost, so normally, when everything is working fine, routing would be done via the leased lines. Normally the link cost is configured as 100 000 000 / link speed in bit/s, so that 100Base-TX has a cost of 1 and 10Base-T has a cost of 10. Of course, now that OC-12 and GE are coming, it would be wise to configure the costs so that GE has a cost of 100 and FE a cost of 1000. That leaves some room for Link Aggregation and for 10GE.
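
The cost rule can be written out as a short calculation. The sketch below assumes that a cost is never allowed to fall below 1 and that the larger costs suggested above correspond to a reference speed of 100 Gbit/s; both are assumptions for illustration, as is the function name.

```python
def ospf_cost(link_bps, reference_bps=100_000_000):
    """Link cost as reference speed divided by link speed, at least 1."""
    return max(1, reference_bps // link_bps)


if __name__ == "__main__":
    speeds = {"10Base-T": 10**7, "FE": 10**8, "GE": 10**9, "10GE": 10**10}
    for name, bps in speeds.items():
        # Default rule versus the larger reference suggested in the text.
        print(name, ospf_cost(bps), ospf_cost(bps, reference_bps=10**11))
```

With the default rule FE and GE both end up with a cost of 1 and cannot be told apart, which is why the larger reference value is worth configuring.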

OSPF can also be divided into areas, and in that way one can control where route advertisements are sent. Splitting a large network into areas reduces the convergence time, although some routes may become less efficient. OSPF can detect link failures in two different ways. The first is to get the information from lower layers: if a link fails in an L3 switch, the interface notifies the routing protocol, which triggers convergence. The other method is to send HELLO packets, and if N consecutive HELLO packets have not been received OSPF starts converging. Normally the HELLO interval is 10 seconds and the timeout is 40 seconds. This is not very fast, but normally link failures are noticed much earlier by the lower layers.

Digital had a project in which the customer needed fast convergence times, and in that project they used a HELLO interval of one second and a timeout of three seconds [12]. They reserved two seconds for the route calculation, so the whole network should be operational again in less than five seconds. Nowadays this is not hard to achieve, because one can create lots of small areas, use devices that notice link level failures quickly, and, thanks to QoS and prioritisation, use fairly small timers in an OSPF network. With prioritisation OSPF packets can pass all other packets through a separate queue, so there is less chance that an OSPF packet is lost somewhere, and the timeout can easily be set to three seconds.
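
The two HELLO timer settings discussed above can be compared with a small sketch. It models only HELLO based detection followed by a fixed route calculation time; reusing the two second calculation budget for the default timers is an assumption made for the comparison, and the function names are illustrative.

```python
def neighbor_is_down(last_hello_at, now, dead_interval):
    """A neighbour is declared down when no HELLO arrived within the timeout."""
    return now - last_hello_at >= dead_interval


def worst_case_failover(dead_interval, route_calc_time):
    """Longest outage: full timeout before detection, then recalculation."""
    return dead_interval + route_calc_time


if __name__ == "__main__":
    # Default timers: 40 s timeout; the 2 s calculation time is assumed.
    print(worst_case_failover(dead_interval=40, route_calc_time=2))
    # Timers used in the Digital project [12].
    print(worst_case_failover(dead_interval=3, route_calc_time=2))
```

The tuned timers bring the worst case down from 42 seconds to 5 seconds in this model.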

Quality of Service also helps in building fast OSPF networks on broadcast networks like Ethernet. One can define filters for the OSPF multicast address 0x01005e000005 so that those packets are not forwarded to networks where there are no OSPF devices and where one does not want to consume bandwidth or other resources. On the other hand, QoS can guarantee some bandwidth to OSPF, so that it can always converge as if it were in a lightly loaded network.
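
The Ethernet address above is the standard mapping of the OSPF IP multicast group 224.0.0.5 onto a MAC address: the prefix 01:00:5e followed by the low 23 bits of the group address. A filter could match frames along the lines of this sketch; the function names are made up for illustration and no real switch configuration syntax is implied.

```python
import ipaddress


def multicast_mac(group):
    """Ethernet MAC (12 hex digits) for an IPv4 multicast group address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    low23 = int(addr) & 0x7FFFFF          # keep the low 23 bits
    return f"{0x01005E000000 | low23:012x}"


def is_ospf_frame(dst_mac):
    """True if the destination MAC is the OSPF all-routers group."""
    return dst_mac.lower().replace(":", "") == multicast_mac("224.0.0.5")


if __name__ == "__main__":
    print(multicast_mac("224.0.0.5"))
```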

4.2.3. EIGRP

Enhanced Interior Gateway Routing Protocol (EIGRP) is Cisco's routing protocol, which seems very similar to OSPF. They have many of the same qualities, but there is one fundamental difference. EIGRP is an advanced distance vector protocol that calculates its link distances as a composite of available bandwidth, delay, load and link reliability [9]. EIGRP also uses the Diffusing Update Algorithm (DUAL), which is supposed to be a little faster than OSPF's algorithm, but not as fast as HIPR could be [7] [8]. EIGRP can also route protocols other than IP, which makes it a very powerful routing protocol. OSPF can route only IP traffic, but it works with devices from multiple vendors.

4.3. Router Protocols

Router protocols here means protocols that do not take part in routing itself, but try to provide a fault tolerant router interface to clients. This is needed because many network devices are so simple that they can hold only one gateway address at a time and do not know how to make routing decisions. Most switches and printers fall into this category.

4.3.1. VRRP [10]

Virtual Router Redundancy Protocol (VRRP) is defined in the recently published RFC 2338 and is expected to become an IETF standard this year. It works very much like Cisco's own HSRP, and in fact the IETF had to use Cisco's patent in it. VRRP is a protocol that works between two or more routers. The key idea is to present one virtual router to the clients, and in a failover some other hot standby router continues to work as the virtual router. The virtual router is assigned an IP address and a virtual MAC address. The router with the highest priority is elected as the primary router. All others remain backup routers.

Because we all hate to see a million dollar router standing in some dark corner just waiting for some other million dollar router to fail, a VRRP router can take part in several virtual routers. It can be the master of several virtual routers and at the same time the backup for several others. So a very common configuration is to have router A as the primary router for subnet A and router B as the primary router for subnet B, with routers A and B backing each other up.

VRRP uses the same kind of method as OSPF to discover that the master has gone down. The master periodically (by default every second) sends HELLO packets to a multicast address, and if no HELLO packets have been heard for 3-4 seconds the backup routers elect a new master. This time can be changed by configuring timers other than the defaults. With VRRP one can also use the same tactics with QoS and prioritisation as with OSPF.
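
The 3-4 second figure follows from the timers defined in RFC 2338, where a backup waits three advertisement intervals plus a skew time that depends on its priority. The sketch below just evaluates that formula; the function name is illustrative.

```python
def master_down_interval(priority, advertisement_interval=1.0):
    """Seconds a backup waits before taking over, as defined in RFC 2338.

    Skew_Time = (256 - Priority) / 256
    Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time
    """
    if not 0 <= priority <= 255:
        raise ValueError("VRRP priority must be between 0 and 255")
    skew_time = (256 - priority) / 256
    return 3 * advertisement_interval + skew_time


if __name__ == "__main__":
    for priority in (254, 100, 1):
        print(priority, master_down_interval(priority))
```

The skew makes the backup with the highest priority time out first, so it normally takes over without a contest.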

4.3.2. HSRP [11]

Cisco's Hot Standby Router Protocol (HSRP) is very similar to VRRP. HSRP is Cisco's property and has been around for several years. HSRP works much like VRRP, except that in HSRP there is one master router and one backup router. All other routers just listen while these two routers periodically send hello packets. Both VRRP and HSRP can have multiple (256) virtual routers in the same LAN, and the convergence time should be just a few seconds in both protocols.

5. The Big Picture

To build a fast converging fault tolerant network one has to use several different techniques for different purposes. First of all there should be two or more server farms, with the critical servers separated from each other. One server of each kind should be in a different server farm (e.g. DNS, WINS, DHCP, BOOTP, application servers, login servers, directory servers, print servers). If possible there should be some kind of data replication, so that if one of the server farms is e.g. without power the other server farm can still provide the critical services. Of course, if we are talking about really big services one could install diesel generators and distribute data storage over different cities, but for now we can assume that this is a fairly big business with a campus network.

The backbone should be built on a reliable and fast switch / router like the Cisco Catalyst 8540 or the 3Com CoreBuilder 9000. There may be some parts where these big switches are too big, and there one could use a smaller one like 3Com's CoreBuilder 3500 or Cisco's 8510. The whole backbone should run standard OSPF. Fibres should enter each building from two different directions, so that if one is accidentally dug up the other is still operational. Of course in highly critical environments one could use radio links between buildings as a backup. There should be a UPS everywhere, and two or more power supplies in the devices. Routers should run VRRP between each other if they are connected to the building LAN.

To the floors one could use Link Aggregation. If using small switches, one could run a lot of fibres to the building cellar, or one could use 3Com's Resilient Links, or spanning tree if convergence time is not important. When LAN Aggregation becomes reality it may be the standard answer to fault tolerance inside a building. If clients can use multiple name servers and multiple gateways, one should use them. Some security aspects should also be considered. Computer criminals, spies and other enemies may try to bring your network down. For this reason, and also to guard against misconfiguration, almost every higher layer protocol supports some kind of authentication. Plain password authentication can be useful, but cryptographically strong MD5 authentication between routers may provide a shield against unwanted attacks. It can also protect against misconfiguration and save your network from going down.

6. Conclusions

The conclusion is that there are going to be lots of opportunities to build a fault tolerant Ethernet network. In fact there is a lot of work going on that tries to make Ethernet look like ATM. ATM is also evolving, and it will gain the fault tolerance characteristics its services lack. Cisco has a large number of protocols and techniques for providing a fault tolerant network, but unfortunately they are not standards. This year, however, it is possible for the first time in Ethernet history to build a multi-vendor network that can converge in less than five seconds. In most applications five seconds should be good enough, since failures should not happen very often. The second good thing about these new techniques is that network professionals can now do some maintenance in the middle of the day without bothering anyone.

7. Glossary

FE      Fast Ethernet
GE      Gigabit Ethernet
STP     Spanning Tree Protocol
HSRP    Hot Standby Router Protocol
VLAN    Virtual LAN
VRRP    Virtual Router Redundancy Protocol

8. References

[1] David Iseminger. Inside Windows NT Infrastructures, John Wiley & Sons, Inc., 1998
[2] Information Sciences Institute, University of Southern California. RFC793: Transmission Control Protocol.
[3] 3Com. SuperStack II Switch 1100, User Guide. 3Com Corporation, 1997
[4] Unknown Speaker. Networks3, 3Com User Group, 1998
[5] John T. Moy. OSPF - Anatomy of an Internet Routing Protocol, Addison Wesley, 1998
[6] Bruce Schneier. Applied Cryptography, John Wiley & Sons, Inc., 1994
[7] Zhengyu Xu, Sa Dai, J.J. Garcia-Luna-Aceves. A More Efficient Distance Vector Routing Algorithm, University of California, 1997
[8] Shree Murthy, J.J. Garcia-Luna-Aceves. Loop-Free Internet Routing Using Hierarchical Routing Trees, University of California, 1997
[9] Cisco Systems. Introduction to Enhanced IGRP (EIGRP), 1997
[10] Network Working Group. RFC2338: Virtual Router Redundancy Protocol, The Internet Society, 1998
[11] Network Working Group. RFC2281: Cisco Hot Standby Router Protocol (HSRP), The Internet Society, 1998
[12] Peter L. Higginson, Michael C. Shand. Development of Router Cluster to Provide Fast Failover in IP Networks, Digital Technical Journal, Vol.9, No.3, 1997