Klaus Lindberg
CSC - Tieteellinen laskenta Oy
Klaus.Lindberg@csc.fi
This study is about gigabit routers. It looks at the subject from an
Internet Service Provider's (ISP) point of view. The new generation of
the core routers is already emerging in the network markets of today and
the intelligent terabit switches, multi-gigabit line cards, flow-based
protocols and fast routing lookup algorithms are under intense research.
The paper discusses the demands of gigabit routing and the existing solutions.
The basic router functions are explained, and the gigabit techniques and
the new equipment are presented. As background material the Internet traffic
patterns and the analyses of measurements made in one of the MCI's core
network is discussed.
When we think about the IP the bit bandwidth doesn't tell the whole story. Data communication equipment handles packets or cells when they are transferring data from one network to another. The normal IP packet size usually varies from 40 bytes TCP acknowledgement to 1500 bytes Ethernet Maximum Transfer Unit MTU [3][4]. Shorter and longer packets are also possible, but are seldom seen in core networks. An average packet size is less than 400 bytes [4]. Later on it is shown that 95% of the data is TCP. It is easy to calculate from these numbers (40 per 400) that over 10% of the Internet traffic is nothing else but IP, TCP and UDP headers. As we will see the processing of the IP packet headers uses most the router performance. Thus in Internet it is often much more informative to talk about packets than about bandwidth.
Today's Internet packet traffic has grown so high that it starts to
be impossible for an Internet core router to process all the packets with
its main CPU. Therefore modern routers and some network protocols handle
packet trains as data flows. A flow is a group of packets with the same
source and destination address-port-pair (or only address-pair). However
the flow is not the same as a connection established by telnet, web or
some other application, because one connection can be formed of many sequenced
flows and one flow can be formed of many application sessions. Thus the
start or the expiration of a flow is not necessarily the beginning or the
end of the TCP or UDP session. In the flow protocols the packets that belong
to a certain flow are marked with labels or tags. In addition to the destination
address the tags and the labels may include also quality information. These
protocols maybe a key to the better Internet service and to QoS. Nevertheless
in the following section, which deals with Internet measurements, the flows
are only a way to visualize the TCP and UDP data streams and have nothing
to do with the new protocols.
The measurements show the important role of web traffic in the Internet. It accounts for 65-80% of bytes, 55-75% for packets and 65-75% for flows. The web client and server traffic is equivalent in packet count, but there is an asymmetry in flow count. The clients make only about 6-8 % of all the bytes. The server traffic accounts for 55-70 percent of all the bytes. It consists of an average of 10-15 seconds long flows, 14-18 packets per flow and 9-12 kilobytes per flow. The FTP protocol doesn't dominate any more as it did before the web boom. It is only 2-8% of all the bytes and 1% of all the flows and 3% of the packets. An average duration of a flow is 20-500 seconds transferring 200 kilobytes. The NNTP protocol accounts for 1-4% of all the bytes. The average number of packets per flow varies from 200 to 800, with a flow generally lasting 100-200 seconds and transferring 50-300 kilobytes. DNS is only 1-2% of the bytes, but 15-25% of all the flows. The average number of bytes per flow is about 500.
The study of the results show that optimizing (for example if we could
switch the flows) 50% of the packets would require optimizing 20% of the
flows and optimizing the 10 % of the flows affect roughly 40% of the packets
[4].
Figure 1. The spread of network route length in MAE-EAST NAP [6].
The most used routing protocols are OSPF, RIP, BGP and EIGPR. Here we discuss the Border Gateway Protocol (BGP) and the Open Shortest Path First (OSPF) protocols in more detail.
BGP4 is the de facto protocol used in routing between Internet Autonomous Systems (AS). Every AS has a specific number that identifies it. The AS is a network area usually administered by an ISP or another large organization. At the edge of an AS routers must run BGP to be able to get information on Internet routes. The neighboring BGP routers change the route and AS routing information using AS-PATH attributes. As an example there are over 50000 routes in each of the FUNET core routers and it receives about 6000 AS-PATH attributes when it starts to build the route database [8][9].
The BGP has an important task of building and updating the routing information in the routing tables. This will be dealt with later in the section on packet forwarding. In addition to the route updating process, BGP-4 also handles the route aggregation so that information about groups of networks may be represented as a single entity in routing tables. For current routers BGP starts to be quite a light protocol. The processing of BGP protocol needs considerably CPU resources only when the process is started.
OSPF is widely used in inter domain routing, inside Autonomous Systems (AS). It is a link-state protocol, which means that every router keeps track of the connection to its neighbors. Each router calculates an "open shortest path" to the other routers. It builds a tree with itself as the root. The tree may be calculated using Dijkstra algorithm [7]. Usually the closed OSPF routing area has a default route to a core Internet router, which is running BGP. In this way the core router knows about both the local AS and other parts of the Internet.
OSPF is nowadays quite an easy protocol to handle for a router that
has a fast RISC processor. The CPU-time is mainly used to compute the shortest
path, which depends on the number of the routers and the changes in the
routing topology. The memory requirements are not critical anymore.
Figure 2. CPU performs a routing table lookup for the packet address and receives the MAC address for the next hop and the switch port number where the packet data is sent.
The routing table has a mapping from the network to the next hop. There should be a destination (or default route) for every incoming IP packet. The number of entries in the table depends on the function and displacement of the router in the Internet. The core routers may have over 40000-50000 network entries. The core network routing tables are updated frequently, from one to some hundred times per second [6].
For core routers with large routing tables the challenging task is to
get the route lookup process fast enough for gigabit speeds.
As mentioned before the problem with routing table information is that its not of fixed length. In the basic concept Internet address hierarchy has only A (8 bits), B (16 bits) and C (24 bits) class networks, which would make the life much easier. In the beginning of the 90's it was seen that A and B classes consumed the address space too quickly. After that organizations could order only chunks of C class networks. The routing table sizes started to grow. The Supernetting and Classless Inter-Domain Routing (CIDR) was a solution, the purpose of which was to aggregate network information to reduce the routing table size [19][20]. This resulted in a large variation in the network address prefix, which made the lookup process more complicated. When a router searches for the next hop for the packet it can't make an exact match for the packet address. The router has to find from its routing table the longest prefix that matches the address. A network bridge has a much simpler task when it makes an exact match in the MAC address table. Also the LAN routers (they don't have real routing tables) use this mechanism when they have only a couple of network entries in their routing table.
We can also argue that route lookup is a protocol problem. The protocol-based solutions try to overcome the route lookup problem by changing the varying length network address to a fixed length flow label. The label information is sent with the packet. The label information lookup is more easily done because of the exact match. If we start to play with the labels, there is a network at the edge that uses normal IP addresses (at least some time). Some equipment does the network-to-route and route-to-label conversion. Usually it is a router. It is still unclear how the label concepts scale at the edges. One benefit of labeling is that we don't have to do the routing table lookup in every router from source to destination, but only at the edge. A drawback with protocol based solutions is of course that they require that the routers have to implement the label protocols. In IP version 6 there is also a 24-bit flow field reserved for labeling purposes.
One thing that should be taken into account in routing and forwarding
is multicast, even if it is not discussed in more detail in this presentation.
The lookup algorithm or label protocols should be ready to scale if the
use of multicast traffic should grow.
In the route lookup algorithms the network prefix data or any other variable length binary data is usually presented in a trie [13] shown in figure 3a. A trie is a tree structure where binary data is presented as leafs of the tree. A turn to left denotes 0 and a turn to right denotes 1. The tree starts from the root node. A child is a branch from a node above. Here are some examples of how the trie structure is used.
An algorithm that uses binary search on hash tables organized by prefix length is one of the ways to speed up the lookup [10]. In this case the number of worst-case hash lookups is a function of logarithm 2 of the number address bit. For IPv4 the number is 5 and for IPv6 it is 7. By optimizing the algorithm for an IPv4 core router with 33 000 entries, the number of hash lookups can be reduced to two, of which the first one can be made a fast indexed array access.
Another algorithm has a different approach to represent the trie structure. First the average depth of the trie is decreased by path compression [13] shown in figure 3b. Every internal child of only one node is removed and marked with a skip value.
Figure 3. Three different trie structures a. Normal binary trie (leaf number 0 is string 0000, leaf number 1 is string 0001, leaf number 2 is string 00101 and so on) b. Path-compressed trie c. Level compressed trie [13]
The trie is then compressed more using the Level Compression (LC-trie), which reduces all the dense populated parts of the trie shown in figure 3c. It replaces number of i levels of the complete binary trie with a node value of 2 to the power of i. This is done to every part of the trie.
The leaves of the tree contain a pointer to a so-called base vector where the complete strings are stored. The depth of the LC-trie is typically 2-3. One memory lookup is performed for every node level and two additional memory accesses for the base vector and the next hop table. In certain cases the searched address doesn't match the base vector string because there are exceptions in the address space. This needs one extra lookup. This algorithm is very effective in memory usage. Typically the trie needs less than 250 kilobytes of memory and the whole table needs less than 1 megabyte. This is small enough to fit in a second level cache of a CPU. Simulation with real data shows that a 133 MHz Pentium could do about half a million lookups per second.
There is yet another technique [14], where the trie is divided into three levels. The first one is 16 nodes deep from nodes 1 to 16, the second level is from node 17 to 24 and the last level from node 25 to 32. Tests showed that this algorithm could do at least 2 million routing lookups per second, on a 200 MHz Pentium Pro, when the forwarding table fit in a secondary cache.
Another way to design an algorithm is to assume that memory is cheap [6]. This algorithm is optimized for current IPv4 network address space. The approach is designed to be implemented on a dedicated hardware. The goal is that the route lookup procedure takes only one memory access time. If you look at the length distribution of the network prefixes shown in figure 1, only few addresses are longer than 24 bits. In this algorithm the route table is divided into two arrays: the first one of 16 million entries long consisting of all the addresses from 0.0.0 to 255.255.255 and the second array of length 256 times "number of longer than 24 bit prefixes" entries. The first array has all the routes that are shorter than 25 bits and their next hops or a pointer to "the longer prefix" table. In this way, by using the memory inefficiently, the algorithm can search the routing table information in one memory access. Two memory accesses are needed for long prefixes, but they can be pipelined because the accesses are located in different memory banks. The total cost of memory is estimated to be 33 Mbytes.
The algorithms above optimize the memory usage, memory access or processing
time. Using an algorithm that packs the varying prefix length data and
needed pointer structure with as few overheads as possible minimizes the
memory usage. Performance is gained when most of the data fit in the processor
cache memory. Memory accesses and processing time are saved when the algorithm
has a low depth in its storage structure. Processing time is minimized
when the storing structure is as easy as possible. Usually the number of
memory accesses is also decreased by partitioning the first search table
to smaller arrays of tables that are processed separately. This increases
memory usage, but speeds up the search.
Ipsilon, nowadays owned by Nokia, started the boom of flow switching. The basic idea is similar to other label switching concepts. The edge of the IP Switch network identifies the packets that belong to a flow of multiple packets. The edge device makes a decision whether to establish a switched connection or not. For a small flow, such as DNS query, a connection is not established. These packets are handled with normal forwarding engines. The label for the switched connection is an ATM VCI (Virtual Channel Identifier). After the label is given there is no need to routing table lookups for these packets [16].
There is an interesting simulation analysis of IP Switched network [18], in which some network traces have been used to simulate IP switching. The results show that the IP Switch equipment, if used in measured networks, should have at least 10 000 Virtual Circuits (VC). Prediction to Internet core networks is 64k VC's, which is quite a lot for current ATM switches. The critical timers are flow deletion delays (FDD) and flow creation delays (FCD). FDD is estimated to be less that 30 and the FCD about 20 ms. Large FDD value lead to exhausting of VC and reduction of switching performance.
Tag switching, developed by Cisco Network, is another way to diminish the route lookups. You don't have to do variable length prefix matching, when you have fixed length tag information. Tags are small labels that are associated with every IP route. Tagging is used in closed tag-networks, where packets are handled by switches or routers. An ingress and egress router is used to apply and strips off tags from the packets. The network topology is learned by the normal routing protocols. In addition to the routing table, routers and switches also have a Tag Information Base (TIB). This includes tag information associated with a certain route. TIB information is distributed by Tag Distribution Protocol inside the tag network. [17]
In all flow protocols there is the known problem of scaling at the edges
of the network. How should the egress and ingress routers do the route-to-flow
and flow-to-route matching? The other thing of concern is the label table
managing.
Figure 4. Shared bus and switch architecture (LC=line card, MA=media
adapter, RE=Route Engine, MEM=Memory, IF=Interface). The figure also shows
the idea of dividing interface into a media independent line card and a
media adapter.
The packet route lookup, IP, TCP and UDP header processing, checksum
calculation, counter handling, service classes and packet tagging are usually
been software processed. This is slow if you don't have optimized software
and a processor dedicated to this task. The router vendors have come out
with new ASIC based hardware for header processing. This kind of hardware
can handle Gigabit traffic "wire speed" (full speed of the interface).
However, this doesn't usually solve the routing table lookup problem, if
there isn't a special hardware for this purpose. In LAN environments, where
the routing tables are small, the gigabit speeds can easily be achieved.
MGR is built around a high-speed custom design 15-port switch, where the line cards are plugged. Each card supports many network interfaces. Forwarding engine cards, having a complete set of routing information, are separate from the line cards. Much effort is put into building a very fast forwarding engine.
All the line cards are able to translate the link-layer header to an abstract link-layer header having information required for IP forwarding. In this way it was possible for the forwarding engine to receive packets from line cards of different link layer types.
Each forwarding engine has one super-scalar Alpha 21164 RISC processor. The system is optimized so that the most used part of the code fits into the first level 8kB Instruction cache (Icache). The on-chip 94kB secondary cache is filled with 12000 cached route entries. A third level 16 MB off-chip cache is reserved for the complete forwarding table. This memory is divided into two 8 MB memory banks. The route updates are made by copying a new table to one bank and then disabling the previously used memory bank.
The switch is input-queued. All inputs have a separate FIFO for each output. The scheduling guarantees that there is no HOL-blocking and the switch operates with full throughput. The switch is clocked at 51.84 MHz and each transfer consists of 1024 bits, making 49.77 Gbps aggregate bandwidth. The scheduling of the switch is pipelined so that the switch allocator can form a switch pattern in every 300 nanoseconds. The allocator has to choose the way in which 225 (15x15) possible input-output pairings are to be connected (from the 1.3 trillion possible) to serve all connection effectively. There is one limitation in the point-to-point switch: it can't make one-to-many transfers. That's why multicast packets are copied to each line card separately.
A QoS Processor is put to the outbound line card to include the Quality of Service processing. The Network Processor, where routing protocols are run, is built on a PC having a PCI interface. The motherboard has an Alpha processor and it runs NetBSD UNIX.
At the moment of the writing of this paper, the router has been fabricated
and tested except for the interface cards.
The switch is a small stack composed of a pile of round shaped crossbar slices and a scheduler. See the figure 5. Each slice (6 cm diameter) contains a single 32x32 1-bit crossbar chip. A port is connected to the slices radially. The port design is scalable in data rate and packet size. The basic switching unit is 64 bits, called a chunk. Each incoming or outgoing packet is written into a number n times 64-bit-wide SRAM before and after switching. The number n depends on the size of the real data packet. The packet is actually in a 2D structure where it is possible to change the data rate by processing a different number of chunks at a time.
Figure 5. The design of Tiny Tera switch.
All traffic is input-queued (and output-queued on exit). Unicast traffic use a buffering scheme called "Virtual Output Queuing" [25] technique, where every input has a separate queue for each output. When the 64-bit data chunks are transferred over the 32x32 switch the scheduler uses a heuristic algorithm called iSLIP [25]. It "achieves fairness using independent round-robin arbiters at each input and output. This leads to maximum throughput of just 63 %, but slight modifications give 100 % throughput". If the iSLIP algorithm is implemented in hardware it can make decision in less than 40 nanoseconds.
The switch has special input queues for multicast. A multicast input
can deliver simultaneously to many outputs. The switch uses fan-out splitting,
which means that the crossbar may deliver packets to the output over a
number of transfer slots. Developing good multicast scheduling algorithms
was an important part of the Tiny Tera project [26].
Cisco Systems has long been the main vendor of the high-end routers and it has been the market leader for a very long time. Cisco's largest router has a backplane capable of switching 16-ports simultaneously, each with 2.4 Gbps line rate. The full interface bandwidth of the router scales from 5 to 60 Gbps. The router has been designed in close relation with Tiny Tera project [28].
Also Torrent Networking Technologies Corporation has announced a Gigabit Router product family. Its switching capacity is over 20 million packets per second. Its routing engine is based on Torrents own patented Ajuha-Shah-Illingworth-Kanakia-algorithm (ASIK). It is implemented so that the worst case prefix match takes 16 memory accesses. According to the vendor, this makes it possible to do "on the fly" routing (the routing table lookup is not the bottleneck). It can handle a routing table containing 200 000 entries with wire speed. The Full Fast Path Architecture forwards all packets via a non-blocking, low-latency fabric. The Service Policies are maintained in "flow routes" within the routing table. The per-flow queuing fabric ensures delivery without contention. [23]
Cabletron Systems has introduced a new product called Smart Switch Router,
which is based on YAGO Systems, Inc. router development. There is a 16
Gbps non-blocking switching fabric in the backplane and the full routing
table of maximum of 250,000 routing entries. Each port can switch 50,000
MAC addresses. Both OSPF and BGP protocols are implemented [24].