Multi-gigabit Routers

May 3rd, 1998

Klaus Lindberg

CSC - Tieteellinen laskenta Oy
Klaus.Lindberg@csc.fi

Abstract

In the recent years the Internet traffic has increased extremely rapidly, but the Internet Protocol (IP) routing techniques hasn't been able to fulfill the requirements of the backbone core network. It has been already argued that the routers have come to the end of their life span. The object of this paper is to present new IP router technologies and show that there is a need for gigabit routing. An overview of routers and their functions is given. Four new routing table lookup algorithms are reviewed and the flow-based solutions are discussed. The currently installed router base has many bottlenecks, which can be solved by using the gigabit router architectures. Some examples of the new router generation are presented. The impact of the growth of the Internet user community and the role of the TCP/IP protocol is also discussed. For background material the recent results of the Internet traffic measurements are shown and the differences between the character of raw bit, IP packet and flow bandwidths are clarified.


Table of Contents

1 Introduction
2 The Internet Traffic
2.1 Best-effort
2.2 Bits, Packets and Flows
2.3 Traffic Characteristics of Internet Traffic [4]
3 Routing
3.1 Routing Protocols
3.2 Packet Forwarding
3.3 Routing Table Lookups and Flows
3.3.1 Fast Routing Table Lookup Techniques
3.3.2 Flow-based Protocols
4 Router
4.1 The Bottlenecks in Routers
4.1.1 Router Backplane
4.1.2 Routing table lookup
4.1.3 Packet Header Processing
4.1.4 Buffers
4.2 How to Route Gigabits per Second
4.2.1 Distributed Routing Engine
4.2.2 Backplane
4.2.3 New Hardware and Header Processing in ASIC
4.2.4 New Buffer Architectures
5 Some Examples of Gigabit Routers
5.1 BBN MGR Project
5.2 Tiny Tera Project
5.3 Core Routers
5 LAN Switches
6 Conclusions
7 References


1 Introduction

The consequences of the exponential expansion of the Internet can already be seen. The Internet protocols and the model of network operation have difficulties to follow the development. The requirements of the Internet network have changed, in comparison to the old days, when the user community was small. First, the equipment and data links should be fast enough to transmit the increasing IP (Internet Protocol) traffic. Second, the routers should be able to control the so-called "not TCP-friendly" [1] traffic, such as UDP (User Datagram Protocol) or modified TCP (Transmission Control Protocol). This property is useful, when the routers start to drop data packets in congestion. In this situation TCP traffic starts to reduce its bandwidth, but the "not TCP-friendly" traffic consumes all the freed resources. Third, the trust in the good behavior of the Internet user community has gone. The mail spam and other kinds of attacks give just a foretaste of the future. The network and the users should have an easy and effective way to protect themselves from ill behavior. This is a short and challenging wish list for the protocol developers and gigabit router vendors.

This study is about gigabit routers. It looks at the subject from an Internet Service Provider's (ISP) point of view. The new generation of the core routers is already emerging in the network markets of today and the intelligent terabit switches, multi-gigabit line cards, flow-based protocols and fast routing lookup algorithms are under intense research. The paper discusses the demands of gigabit routing and the existing solutions. The basic router functions are explained, and the gigabit techniques and the new equipment are presented. As background material the Internet traffic patterns and the analyses of measurements made in one of the MCI's core network is discussed.

2 The Internet traffic

This section gives an overview of what kind of traffic there is in today's core Internet.

2.1. Best-effort

If we compare Internet to traditional telephone networks, there is one fundamental difference. The Internet traffic patterns are self-similar or fractal [2][5]. Thus the patterns are unpredictable in any time scale and don't follow any expected probability distribution. That is why resource allocation mechanisms are very difficult to use. Internet is built on friendly protocols like TCP and best-effort traffic, where all applications have an equal chance to get the network resources. At the moment it is impossible to give a predefined level of service for an application over the Internet. To guarantee some sort of service to the user applications an ISP has to overestimate the bandwidth need in their backbone networks. In the Internet2, there is a plan to test applications that use Quality of Service (QoS) [11].

2.2.Bits, Packets and Flows

The usual way to think of data traffic is in kilo- or megabits per second. The bit bandwidth is a good measure of how much traffic fits in a standard optical or an electrical cable. The cable bandwidth today usually starts at 10 Mbps Ethernet and can go up to 622 Mbps 0C-12 ATM or 2,4 Gbps SDH link. Very soon the Wavelength Division Multiplexing (WDM) techniques are expected to increase the bandwidths to tens of gigabits.

When we think about the IP the bit bandwidth doesn't tell the whole story. Data communication equipment handles packets or cells when they are transferring data from one network to another. The normal IP packet size usually varies from 40 bytes TCP acknowledgement to 1500 bytes Ethernet Maximum Transfer Unit MTU [3][4]. Shorter and longer packets are also possible, but are seldom seen in core networks. An average packet size is less than 400 bytes [4]. Later on it is shown that 95% of the data is TCP. It is easy to calculate from these numbers (40 per 400) that over 10% of the Internet traffic is nothing else but IP, TCP and UDP headers. As we will see the processing of the IP packet headers uses most the router performance. Thus in Internet it is often much more informative to talk about packets than about bandwidth.

Today's Internet packet traffic has grown so high that it starts to be impossible for an Internet core router to process all the packets with its main CPU. Therefore modern routers and some network protocols handle packet trains as data flows. A flow is a group of packets with the same source and destination address-port-pair (or only address-pair). However the flow is not the same as a connection established by telnet, web or some other application, because one connection can be formed of many sequenced flows and one flow can be formed of many application sessions. Thus the start or the expiration of a flow is not necessarily the beginning or the end of the TCP or UDP session. In the flow protocols the packets that belong to a certain flow are marked with labels or tags. In addition to the destination address the tags and the labels may include also quality information. These protocols maybe a key to the better Internet service and to QoS. Nevertheless in the following section, which deals with Internet measurements, the flows are only a way to visualize the TCP and UDP data streams and have nothing to do with the new protocols.

2.3 Traffic Characteristics of Internet Traffic

The results of measurements in the MCI's Internet [4] network are a good example of the traffic characteristics of the Internet. During the measurement time the data peak load was about 50 Mbps and 20 000 packets per second. The peak number of flows existing at the same time is 240 000. The statistics show that most of the traffic in the Internet is TCP protocol, namely World Wide Web (WWW), Simple Mail Transfer Protocol (SMTP), Network News Transfer Protocol (NNTP), File Transfer Protocol (FTP) and Telnet application. The most used UDP applications are Distributed Name Service (DNS) and Real Audio(tm). TCP covers about 95 % of the bandwidth, over 90% of the packets and about 80% of the flows. The average packet size is from 175 bytes to 400 bytes, but 40% of the packets are only 40 bytes long. Similar patterns are found in the measurements of the FUNET network in 1997 [5].

The measurements show the important role of web traffic in the Internet. It accounts for 65-80% of bytes, 55-75% for packets and 65-75% for flows. The web client and server traffic is equivalent in packet count, but there is an asymmetry in flow count. The clients make only about 6-8 % of all the bytes. The server traffic accounts for 55-70 percent of all the bytes. It consists of an average of 10-15 seconds long flows, 14-18 packets per flow and 9-12 kilobytes per flow. The FTP protocol doesn't dominate any more as it did before the web boom. It is only 2-8% of all the bytes and 1% of all the flows and 3% of the packets. An average duration of a flow is 20-500 seconds transferring 200 kilobytes. The NNTP protocol accounts for 1-4% of all the bytes. The average number of packets per flow varies from 200 to 800, with a flow generally lasting 100-200 seconds and transferring 50-300 kilobytes. DNS is only 1-2% of the bytes, but 15-25% of all the flows. The average number of bytes per flow is about 500.

The study of the results show that optimizing (for example if we could switch the flows) 50% of the packets would require optimizing 20% of the flows and optimizing the 10 % of the flows affect roughly 40% of the packets [4].

3 Routing

Now that we have discussed the traffic pattern of the Internet we may move to the equipment that handle the IP packets, the router. A simple way to tell what a router does is to say that it forwards IP packets and gathers information of the network topology.

3.1 Routing Protocols

Routers get the knowledge of the network routing topology from routing protocols. The topology information is stored in a routing table, which includes the destination network address, next hop (next hop is the address where the packet is next sent) and interface number. The network addresses are 8-32 bit long (prefix). In figure 1 the network prefix data gathered is from MAE-EAST Network Access Point (NAP)[6]. A NAP is a place where ISP's networks exchange data and routing network information.

[MEDLINE Format]

Figure 1. The spread of network route length in MAE-EAST NAP [6].

The most used routing protocols are OSPF, RIP, BGP and EIGPR. Here we discuss the Border Gateway Protocol (BGP) and the Open Shortest Path First (OSPF) protocols in more detail.

BGP4 is the de facto protocol used in routing between Internet Autonomous Systems (AS). Every AS has a specific number that identifies it. The AS is a network area usually administered by an ISP or another large organization. At the edge of an AS routers must run BGP to be able to get information on Internet routes. The neighboring BGP routers change the route and AS routing information using AS-PATH attributes. As an example there are over 50000 routes in each of the FUNET core routers and it receives about 6000 AS-PATH attributes when it starts to build the route database [8][9].

The BGP has an important task of building and updating the routing information in the routing tables. This will be dealt with later in the section on packet forwarding. In addition to the route updating process, BGP-4 also handles the route aggregation so that information about groups of networks may be represented as a single entity in routing tables. For current routers BGP starts to be quite a light protocol. The processing of BGP protocol needs considerably CPU resources only when the process is started.

OSPF is widely used in inter domain routing, inside Autonomous Systems (AS). It is a link-state protocol, which means that every router keeps track of the connection to its neighbors. Each router calculates an "open shortest path" to the other routers. It builds a tree with itself as the root. The tree may be calculated using Dijkstra algorithm [7]. Usually the closed OSPF routing area has a default route to a core Internet router, which is running BGP. In this way the core router knows about both the local AS and other parts of the Internet.

OSPF is nowadays quite an easy protocol to handle for a router that has a fast RISC processor. The CPU-time is mainly used to compute the shortest path, which depends on the number of the routers and the changes in the routing topology. The memory requirements are not critical anymore.

3.2 Packet Forwarding

The process, which moves packets from routers in coming interface to the out going interface, is called forwarding. This process consults routing table information. A simple picture of how a router does the packet forwarding is shown in the figure 2. First the network interface card reads the incoming IP packet to its input buffers and sends the header to the CPU. The CPU reads the packet IP address and checks if it has a corresponding forward interface number and a next hop MAC address in the cache. If the cache doesn't contain the entry the CPU does an address lookup to the routing table. After this the packet header counters are increased, the packet is put together and forwarded to the outgoing interface buffer. This seems like an easy task, but problems arise when you have to do it almost in real time for millions of packets. The performance of a router depends mainly on how fast it does the forwarding process.

[MEDLINE Format]

Figure 2. CPU performs a routing table lookup for the packet address and receives the MAC address for the next hop and the switch port number where the packet data is sent.

The routing table has a mapping from the network to the next hop. There should be a destination (or default route) for every incoming IP packet. The number of entries in the table depends on the function and displacement of the router in the Internet. The core routers may have over 40000-50000 network entries. The core network routing tables are updated frequently, from one to some hundred times per second [6].

For core routers with large routing tables the challenging task is to get the route lookup process fast enough for gigabit speeds.

3.3 Routing Table Lookups and Flow

There are different ways to tackle the problem of quickly making forwarding decisions. One way is to develop algorithms, implemented in hardware or software, which optimize the route update and lookup or compression of the network prefix data. Other solutions are flow-based protocol.

As mentioned before the problem with routing table information is that its not of fixed length. In the basic concept Internet address hierarchy has only A (8 bits), B (16 bits) and C (24 bits) class networks, which would make the life much easier. In the beginning of the 90's it was seen that A and B classes consumed the address space too quickly. After that organizations could order only chunks of C class networks. The routing table sizes started to grow. The Supernetting and Classless Inter-Domain Routing (CIDR) was a solution, the purpose of which was to aggregate network information to reduce the routing table size [19][20]. This resulted in a large variation in the network address prefix, which made the lookup process more complicated. When a router searches for the next hop for the packet it can't make an exact match for the packet address. The router has to find from its routing table the longest prefix that matches the address. A network bridge has a much simpler task when it makes an exact match in the MAC address table. Also the LAN routers (they don't have real routing tables) use this mechanism when they have only a couple of network entries in their routing table.

We can also argue that route lookup is a protocol problem. The protocol-based solutions try to overcome the route lookup problem by changing the varying length network address to a fixed length flow label. The label information is sent with the packet. The label information lookup is more easily done because of the exact match. If we start to play with the labels, there is a network at the edge that uses normal IP addresses (at least some time). Some equipment does the network-to-route and route-to-label conversion. Usually it is a router. It is still unclear how the label concepts scale at the edges. One benefit of labeling is that we don't have to do the routing table lookup in every router from source to destination, but only at the edge. A drawback with protocol based solutions is of course that they require that the routers have to implement the label protocols. In IP version 6 there is also a 24-bit flow field reserved for labeling purposes.

One thing that should be taken into account in routing and forwarding is multicast, even if it is not discussed in more detail in this presentation. The lookup algorithm or label protocols should be ready to scale if the use of multicast traffic should grow.

3.3.1 Fast Route Lookup Techniques

Scalable routing table lookup techniques are not based only on normal hashing and binary search algorithms. To overcome the problem of the varying length of the route prefixes some special techniques have been developed. Some algorithms are planned to scale when the address space becomes larger, if IPv6 comes in use. There isn't exact knowledge of the IPv6 route aggregation and network address dispersion thus we don't know if this is of relevance. Here we only point out the data structures the algorithms use and no order of quality is made. Deeper discussion can be found in the references.

In the route lookup algorithms the network prefix data or any other variable length binary data is usually presented in a trie [13] shown in figure 3a. A trie is a tree structure where binary data is presented as leafs of the tree. A turn to left denotes 0 and a turn to right denotes 1. The tree starts from the root node. A child is a branch from a node above. Here are some examples of how the trie structure is used.

An algorithm that uses binary search on hash tables organized by prefix length is one of the ways to speed up the lookup [10]. In this case the number of worst-case hash lookups is a function of logarithm 2 of the number address bit. For IPv4 the number is 5 and for IPv6 it is 7. By optimizing the algorithm for an IPv4 core router with 33 000 entries, the number of hash lookups can be reduced to two, of which the first one can be made a fast indexed array access.

Another algorithm has a different approach to represent the trie structure. First the average depth of the trie is decreased by path compression [13] shown in figure 3b. Every internal child of only one node is removed and marked with a skip value.

[MEDLINE Format]

Figure 3. Three different trie structures a. Normal binary trie (leaf number 0 is string 0000, leaf number 1 is string 0001, leaf number 2 is string 00101 and so on) b. Path-compressed trie c. Level compressed trie [13]

The trie is then compressed more using the Level Compression (LC-trie), which reduces all the dense populated parts of the trie shown in figure 3c. It replaces number of i levels of the complete binary trie with a node value of 2 to the power of i. This is done to every part of the trie.

The leaves of the tree contain a pointer to a so-called base vector where the complete strings are stored. The depth of the LC-trie is typically 2-3. One memory lookup is performed for every node level and two additional memory accesses for the base vector and the next hop table. In certain cases the searched address doesn't match the base vector string because there are exceptions in the address space. This needs one extra lookup. This algorithm is very effective in memory usage. Typically the trie needs less than 250 kilobytes of memory and the whole table needs less than 1 megabyte. This is small enough to fit in a second level cache of a CPU. Simulation with real data shows that a 133 MHz Pentium could do about half a million lookups per second.

There is yet another technique [14], where the trie is divided into three levels. The first one is 16 nodes deep from nodes 1 to 16, the second level is from node 17 to 24 and the last level from node 25 to 32. Tests showed that this algorithm could do at least 2 million routing lookups per second, on a 200 MHz Pentium Pro, when the forwarding table fit in a secondary cache.

Another way to design an algorithm is to assume that memory is cheap [6]. This algorithm is optimized for current IPv4 network address space. The approach is designed to be implemented on a dedicated hardware. The goal is that the route lookup procedure takes only one memory access time. If you look at the length distribution of the network prefixes shown in figure 1, only few addresses are longer than 24 bits. In this algorithm the route table is divided into two arrays: the first one of 16 million entries long consisting of all the addresses from 0.0.0 to 255.255.255 and the second array of length 256 times "number of longer than 24 bit prefixes" entries. The first array has all the routes that are shorter than 25 bits and their next hops or a pointer to "the longer prefix" table. In this way, by using the memory inefficiently, the algorithm can search the routing table information in one memory access. Two memory accesses are needed for long prefixes, but they can be pipelined because the accesses are located in different memory banks. The total cost of memory is estimated to be 33 Mbytes.

The algorithms above optimize the memory usage, memory access or processing time. Using an algorithm that packs the varying prefix length data and needed pointer structure with as few overheads as possible minimizes the memory usage. Performance is gained when most of the data fit in the processor cache memory. Memory accesses and processing time are saved when the algorithm has a low depth in its storage structure. Processing time is minimized when the storing structure is as easy as possible. Usually the number of memory accesses is also decreased by partitioning the first search table to smaller arrays of tables that are processed separately. This increases memory usage, but speeds up the search.

3.3.2 Flow-based Protocols

Multiprotocol Label Switching MPLS is a working group in IETF that is making a new standard for label switching. One purpose of the group is to make the router forwarding process easier by using fixed length labels to identify flows or streams (aggregate of one or more flows)[15]. The MPLS domains are formed of continuous sets of MPLS routers or forwarders. They are also in one routing domain. Edge nodes connect MPLS domains together as well as networks that don't run MPLS. Egress and ingress routers handle traffic, that is coming or leaving MPLS network. Routers use label distribution protocol (LDP) to distribute the label information to the whole MPLS domain.

Ipsilon, nowadays owned by Nokia, started the boom of flow switching. The basic idea is similar to other label switching concepts. The edge of the IP Switch network identifies the packets that belong to a flow of multiple packets. The edge device makes a decision whether to establish a switched connection or not. For a small flow, such as DNS query, a connection is not established. These packets are handled with normal forwarding engines. The label for the switched connection is an ATM VCI (Virtual Channel Identifier). After the label is given there is no need to routing table lookups for these packets [16].

There is an interesting simulation analysis of IP Switched network [18], in which some network traces have been used to simulate IP switching. The results show that the IP Switch equipment, if used in measured networks, should have at least 10 000 Virtual Circuits (VC). Prediction to Internet core networks is 64k VC's, which is quite a lot for current ATM switches. The critical timers are flow deletion delays (FDD) and flow creation delays (FCD). FDD is estimated to be less that 30 and the FCD about 20 ms. Large FDD value lead to exhausting of VC and reduction of switching performance.

Tag switching, developed by Cisco Network, is another way to diminish the route lookups. You don't have to do variable length prefix matching, when you have fixed length tag information. Tags are small labels that are associated with every IP route. Tagging is used in closed tag-networks, where packets are handled by switches or routers. An ingress and egress router is used to apply and strips off tags from the packets. The network topology is learned by the normal routing protocols. In addition to the routing table, routers and switches also have a Tag Information Base (TIB). This includes tag information associated with a certain route. TIB information is distributed by Tag Distribution Protocol inside the tag network. [17]

In all flow protocols there is the known problem of scaling at the edges of the network. How should the egress and ingress routers do the route-to-flow and flow-to-route matching? The other thing of concern is the label table managing.

4 Router

4.1 Bottlenecks in Routers

4.1.1 Router Backplane

Today most of the routers have their interface cards, processors and managing units connected to a single bus. There may be up to 10 or 15 interfaces connected to a full-loaded chassis. All these interfaces share the same bus bandwidth. All the packets that are moved from one interface to another go through the router backplane. A high-end bus has a bandwidth of less than 1 gigabits per second speed. This year we will see Gigabit Ethernet (GE) environments where the network backbone speed is gigabits per second. If a router is supposed to handle 5 GE interfaces, with even 500 megabits per second throughput per port, bus based backplanes are not good enough. The solution is a switch backplane, that guarantees at least 1 Gbps point-to-point bandwidth for every port pair. All the new high-end routers use switches in their backplane.

4.1.2 Routing Table Lookup

As the data streams get faster also the number of routed packets per second increases. The time to forward a packet decreases. The worst bottlenecks in core routers are the routing table lookups, when there isn't any specialized hardware for this purpose. The hardware is usually called route engine. For example, if we have on average 300 byte TCP-packets of 3 Gbps traffic in a core router, we get over one million route table lookups per second. This problem may be solved by caching the most used forwarding information. How well the caching works depends on the localization of the data. Studies have shown that in Fix West (one of the US core routers) core routers the 12000-entry cache has a 95 % hit rate [21].

4.1.3 Packet Header Processing

First routers processed all the packets in the CPU. Today you move only the headers to the CPU to be processed. The data has a different path. This causes still a lot of copying from buffers to memory. This is very time consuming and usually done with software. A new way is to have an ASIC chip that is specialized to handle IP and TCP headers to minimize the time spent in header processing.

4.1.4 Buffers

Another problem is how to allocate buffers for different interfaces and different media. In IP networks the packet size can vary from some tens of bytes to over four thousand or nine thousand bytes. Static allocation is easy but requires many resources. Dynamic buffer management is difficult to implement. There is also much discussion about input and output buffering. Input buffering has the Head-of-Line Blocking (HOL-blocking) [27] problem, which occurs when many packets are going to same output port. The packet in the head of the input queue waiting its turn to go to the congested output port blocks the other packets way in the same input queue even these packet would go to an empty output port. Output buffering is usually considered easier to implement, but has problems when line speeds increase. The interconnect must have a very high bandwidth to allow multiple packets to enter the same output queue.

4.2 How to Route Gigabits per Second

4.2.1 Distributed Routing Engine

The routing tables are updated, when there exists a change in the table. This happens rarely if we compare it with the number of routing table lookups per second. The main CPU, which also calculates topology changes, controls updating. It is sensible to move the routing engine, which performs the lookups, near the interface. See figure 4. This way the bus or switch is only transferring the routing table changes to the route engines. By distributing the whole forwarding process near the interfaces the bottleneck caused by the bus or switch is removed. To prevent the forwarding process from stopping, when the tables are updated, there should be two tables on every engine. They are switched when changes are updated. More speedup can be gained by using fast local route caches.

[MEDLINE Format]

Figure 4. Shared bus and switch architecture (LC=line card, MA=media adapter, RE=Route Engine, MEM=Memory, IF=Interface). The figure also shows the idea of dividing interface into a media independent line card and a media adapter.

4.2.2 Backplane

As said before a bus can't keep up with multi-gigabit speeds. The switch technology has been used for many years in supercomputer environments and now also in ATM switches. The first implementations in routers, providing 1 Gbps non-blocked access from port-to-port, are based on crossbars or shared memory. To get the full throughput from a switch there must be input queuing and a clever switch allocator algorithm. From the switch input-output connection requests the allocator must make a fair and efficient connection pattern for each transfer time slot. The switch should support also multicast traffic on hardware.

4.2.3 New Hardware and Header Processing in ASIC

Older routers used easily implemented CPU architectures, for example the Motorola 68xxx series. The new Digital Equipment Alpha (tm), Silicon Graphics MIPS (tm), Sun Sparc (tm) and Intel Pentium (tm) RISC architectures are much faster and have found their place in router design. The interface cards have also good processing power and are able to take away the many tasks from the main CPU. More and more things are done on hardware, because the router markets are increasing all the time and the volumes getting large enough to use ASIC design technology.

The packet route lookup, IP, TCP and UDP header processing, checksum calculation, counter handling, service classes and packet tagging are usually been software processed. This is slow if you don't have optimized software and a processor dedicated to this task. The router vendors have come out with new ASIC based hardware for header processing. This kind of hardware can handle Gigabit traffic "wire speed" (full speed of the interface). However, this doesn't usually solve the routing table lookup problem, if there isn't a special hardware for this purpose. In LAN environments, where the routing tables are small, the gigabit speeds can easily be achieved.

4.2.4 New Buffer Architectures

Normally the fast switches use input buffering. New buffer models try to overcome the HOL-blocking by putting bypass mechanisms in the input buffers. Then the blocked packet can bypass the first packet in the buffer, if there is space in the output buffers. Intelligent output buffering is becoming even more important, as QoS and multicast issues are implemented in routers [26].

5. Some Examples of Gigabit Routers

5.1 BBN MGR [21]

BBN Corporation is part of GTE Internetworking, a unit of GTE Corporation. The research group of BBN (www.bbn.com) has published a design for "A Fifty Gigabit Per Second IP Router". The goal of the project was to show that new multi-gigabit routers are able to drive the fast link speeds like OC-48C (2.4 Gbps). The router achieves three basic design goals: it has enough internal bandwidth, it can forward several million packets per second, and it is based on standards.

MGR is built around a high-speed custom design 15-port switch, where the line cards are plugged. Each card supports many network interfaces. Forwarding engine cards, having a complete set of routing information, are separate from the line cards. Much effort is put into building a very fast forwarding engine.

All the line cards are able to translate the link-layer header to an abstract link-layer header having information required for IP forwarding. In this way it was possible for the forwarding engine to receive packets from line cards of different link layer types.

Each forwarding engine has one super-scalar Alpha 21164 RISC processor. The system is optimized so that the most used part of the code fits into the first level 8kB Instruction cache (Icache). The on-chip 94kB secondary cache is filled with 12000 cached route entries. A third level 16 MB off-chip cache is reserved for the complete forwarding table. This memory is divided into two 8 MB memory banks. The route updates are made by copying a new table to one bank and then disabling the previously used memory bank.

The switch is input-queued. All inputs have a separate FIFO for each output. The scheduling guarantees that there is no HOL-blocking and the switch operates with full throughput. The switch is clocked at 51.84 MHz and each transfer consists of 1024 bits, making 49.77 Gbps aggregate bandwidth. The scheduling of the switch is pipelined so that the switch allocator can form a switch pattern in every 300 nanoseconds. The allocator has to choose the way in which 225 (15x15) possible input-output pairings are to be connected (from the 1.3 trillion possible) to serve all connection effectively. There is one limitation in the point-to-point switch: it can't make one-to-many transfers. That's why multicast packets are copied to each line card separately.

A QoS Processor is put to the outbound line card to include the Quality of Service processing. The Network Processor, where routing protocols are run, is built on a PC having a PCI interface. The motherboard has an Alpha processor and it runs NetBSD UNIX.

At the moment of the writing of this paper, the router has been fabricated and tested except for the interface cards.

5.2 Tiny Tera [22]

Tiny Tera is a Stanford University research project, the goal of which is to design a small, 1 Tbps packet switch using normal CMOS technology. The system is suited for an ATM switch or Internet core router. It efficiently routes both unicast and multicast traffic. The switch has three basic elements: port, crossbar and scheduler. The current design version has 32 ports each operating at a 10 Gbps (Sonet OC-192 rate) speed.

The switch is a small stack composed of a pile of round shaped crossbar slices and a scheduler. See the figure 5. Each slice (6 cm diameter) contains a single 32x32 1-bit crossbar chip. A port is connected to the slices radially. The port design is scalable in data rate and packet size. The basic switching unit is 64 bits, called a chunk. Each incoming or outgoing packet is written into a number n times 64-bit-wide SRAM before and after switching. The number n depends on the size of the real data packet. The packet is actually in a 2D structure where it is possible to change the data rate by processing a different number of chunks at a time.

[MEDLINE Format]

Figure 5. The design of Tiny Tera switch.

All traffic is input-queued (and output-queued on exit). Unicast traffic use a buffering scheme called "Virtual Output Queuing" [25] technique, where every input has a separate queue for each output. When the 64-bit data chunks are transferred over the 32x32 switch the scheduler uses a heuristic algorithm called iSLIP [25]. It "achieves fairness using independent round-robin arbiters at each input and output. This leads to maximum throughput of just 63 %, but slight modifications give 100 % throughput". If the iSLIP algorithm is implemented in hardware it can make decision in less than 40 nanoseconds.

The switch has special input queues for multicast. A multicast input can deliver simultaneously to many outputs. The switch uses fan-out splitting, which means that the crossbar may deliver packets to the output over a number of transfer slots. Developing good multicast scheduling algorithms was an important part of the Tiny Tera project [26].

5.3 Core Routers

The following texts are mainly based on marketing material found in the Companies Web Sites. These are only examples, so there exists also other gigabit class router in production.

Cisco Systems has long been the main vendor of the high-end routers and it has been the market leader for a very long time. Cisco's largest router has a backplane capable of switching 16-ports simultaneously, each with 2.4 Gbps line rate. The full interface bandwidth of the router scales from 5 to 60 Gbps. The router has been designed in close relation with Tiny Tera project [28].

Also Torrent Networking Technologies Corporation has announced a Gigabit Router product family. Its switching capacity is over 20 million packets per second. Its routing engine is based on Torrents own patented Ajuha-Shah-Illingworth-Kanakia-algorithm (ASIK). It is implemented so that the worst case prefix match takes 16 memory accesses. According to the vendor, this makes it possible to do "on the fly" routing (the routing table lookup is not the bottleneck). It can handle a routing table containing 200 000 entries with wire speed. The Full Fast Path Architecture forwards all packets via a non-blocking, low-latency fabric. The Service Policies are maintained in "flow routes" within the routing table. The per-flow queuing fabric ensures delivery without contention. [23]

Cabletron Systems has introduced a new product called Smart Switch Router, which is based on YAGO Systems, Inc. router development. There is a 16 Gbps non-blocking switching fabric in the backplane and the full routing table of maximum of 250,000 routing entries. Each port can switch 50,000 MAC addresses. Both OSPF and BGP protocols are implemented [24].

5.4 LAN Switches

There are many companies that sell a very fast LAN switch, which also does limited routing. The switch bandwidths vary from 4 to 45 Gbps. The packet forwarding capacity can be up to 10 million packets per second and more. The switches can learn over 100000 MAC addresses. In the tests the best routing capacity was a rate of almost 12 million packets per second [30]. The header processing is pipelined and it's done on ASIC-based hardware. The internal switching is done with flows. Some systems handle RIP and OSPF routing protocols, but they usually don't run BGP and don't have much space for routing table. Many of these products are closer to MAC layer switches than routers. Very soon they surely will gain more routing functions and become similar to core network routers. 

6 Conclusions

If and when the IP packet maintains its position in the networking, the gigabit routers will have a significant role in the coming evolution of the Internet. Thus the importance of the fast IP packet processing and the effective routing is obvious. At the moment some real gigabit routers for Internet core network are already available. The backplane bandwidths reach speeds of up to tens of gigabits per second. A fully loaded router with 2.4 Gbps speed line cards can achieve 60 Gbps full bandwidth. As routing table lookup algorithms are developing rapidly the routing engines can fetch routes for a million packets per second. Because the IP equipment and the network market is increasing also the hardware based ASIC solutions have become competitive in price. The bottlenecks of the current routers are known. Thus there are still many interesting areas to explore in fields of routing, buffering, switching and scheduling. The flow and label protocol development should also catch up with the gigabit speeds. The next target in research is already the terabit routers.

7 References

[1]
S.Floyd and K.Fall. Router Mechanisms to Support End-to-End Congestion Control <ftp://ftp.ee.lbl.gov/papers/collapse.ps>. Technical report of Lawrence Berkeley National Laboratory, February 1997.
[2]
W. Willinger, M. S. Taqqu, R. Sherman and D. Willson. Self-Similarity Through High-Variability: Statistical Analysis of Ethernet LAN Traffic at the Source Level <http://http.cs.berkeley.edu/~kkeeton/Netread/self-sim-sigcomm95.ps>. Proceedings of the ACM/SIGCOMM '95.
[3]
W.R.Stevens. TCP/IP Illustrated, Vol 1: The Protocols. Reading, Addison-Wesley, 1994
[4]
K. Thompson, G.J. Miller, and Rick Wilder. Wide Area Internet Traffic Patterns and Characteristics. IEEE Network November/December 1997
[5]
E. Kinnunen. FUNET network traffic analysis- a view of current state. Master's Thesis, Helsinki University of Technology, Espoo 1998.
[6]
Pankaj Gupta, Steven Lin, and Nick McKeown. Routing Lookups in Hardware at Memory Access Speeds <http://tiny-tera.stanford.edu/~nickm/papers/Infocom98_lookup.pdf>. IEEE Infocom April 1998, San Francisco.
[7]
Moy, J. OSPF protocol analysis <ftp://ftp.nordu.net/rfc/rfc1245.txt> , RFC 1245, July 1991.
[8]
Rekhter Y., and T. Li. A Border Gateway Protocol 4 (BGP-4) <ftp://ftp.nordu.net/rfc/rfc1771.txt> RFC 1771, March 1995.
[9]
Rekhter, Y., and P. Gross, Editors. Application of the Border Gateway Protocol in the Internet RFC 1772, T.J. Watson Research Center, IBM Corp., MCI, March 1995.
[10]
Marcel Waldvogel, George Varghese, Jon Turner, and Bernhard Plattner Scalable High Speed IP Routing Lookups , ACM SIGCOMM'97 Conference on Applications, Technologies, Architectures and Protocols for Computer Communications. (Student Paper Award). Conference details: Cannes, France, September 16-18
[11]
Internet2 project <http://www.internet2.edu/html/wg-qosfeb98.html>, Internet2 Home Page, 1998.
[12]
S. Deering, R. Hinden. Internet Protocol, Version 6 (IPv6) Specification <ftp://ftp.nordu.net/rfc/rfc1883.txt>, RFC 1883, December 1995
[13]
Stefan Nilsson and Gunnar Karlsson. Fast address lookup for Internet routers <http://www.sics.se/~gk/fast-lookup.ps>. To be presented at the IFIP TC 6 Fourth International Conference on Broadband Communications, Stuttgart, Germany, April 1-3, 1998.
[14]
Mikael Degermark, Andrej Brodnik, Svante Carlsson, Stephen Pink. Small Forwarding Tables for Fast Routing Lookups < ftp://cdt.luth.se/micke/lookup97.ps>. To appear in Proceedings of the ACM SIGCOMM'97 Conference on Applications, Technologies, Architectures and Protocols for Computer Communications. (Student Paper Award). Conference details: Cannes, France, September 16-18 1997.
[15]
E.C.Rosen, A.Viswanathan, R.Callon. Multiprotocol Label Switching Architecture <ftp://ftpeng.cisco.com/mpls/Internet-Drafts/draft-ietf-mpls-arch-01.txt >, Internet Draft, March 1998
[16]
P.Newman, G.Minshall, T.Lyon and L.Huston. IP Switching and Gigabit Routers IEEE Commmunications Magazine, Jan. 1997
[17]
Y. Rekhter, B. Davie, D. Katz, E. Rosen, G. Swallow. Tag Switching Architecture Overview <ftp://ftp.nordu.net/rfc/rfc2015.txt> RFC 2015, February 1997
[18]
Steven Lin, and Nick McKeown. A Simulation Study of IP Switching <http://tiny-tera.stanford.edu/~nickm/papers/Sigcomm97_IPSwitch.pdf >. Proceedings of ACM Sigcomm, September 1997,
[19]
Y. Rekhter, T. Li. An Architecture for IP Address Allocation with CIDR. <ftp://ftp.nordu.net/rfc/rfc1518.txt> , RFC1518, September 1993
[20]
V.Fuller, T.Li, J.Yu, and K.Varadhan, Supernetting: an Address Assignment and Aggregation Strategy <ftp://ftp.nordu.net/rfc/rfc1338.txt> ,RFC 1338, June 1992.
[21]
C. Partridge et. all A Fifty Gigabit Per Second IP Router to be published in IEEE Transactions Networking
[22]
Nick McKeown, Martin Izzard, Adisak Mekkittikul, Bill Ellersick and Mark Horowitz. The Tiny Tera: A Packet Switch Core <http://klamath.stanford.edu/tiny-tera/papers/papers/HOTI_96.pdf >. IEEE Micro, Jan/Feb 97
[23]
Torrent Networking Technologies. High-Speed Routing Table Search Algorithms <http://www.torrentnet.com/general/download/highspeed.txt >. A Technical Paper of Torrent Networking Technologies, 1998
[24]
Cabletron System <http://www.cabletron.com >. Cabletron System web site, 1998
[25]
Nick McKeown and Martin Izzard. High Performance Switching <http://klamath.stanford.edu/tiny-tera/papers/papers/proposal.pdf >. Proposal to Texas Instruments, November 1995.
[26]
Balaji Prabhakar, Nick McKeown and Ritesh Ahuja. Multicast Scheduling for Input-Queued Switches <http://klamath.stanford.edu/tiny-tera/papers/papers/IEEE_JSAC_MAY_96_submitted.pdf >.IEEE Journal on Selected Areas in Communications, May 1996
[27]
Craig Partridge. Gigabit Networking", Addison-Wesley,1993
[28]
Nick McKeown. Fast Switched Backplane for a Gigabit Switched Router <http://www.cisco.com/warp/public/733/12000/fasts_wp.pdf >. Cisco Systems web site, 1998.
[29]
Cisco Systems Inc. Cisco 12000 Series Gigabit Switch Router <http://www.cisco.com/warp/public/733/12000/gsrfs_ds.htm >. Cisco Systems web site, 1998
[30]
R. Mandeville and D. Shah. Gigabit Ethernet Gets It Done. Article in Data Communication, February 1998