Both variable stride multibit tries and LC-tries attempt to reduce the height of multibit tries by varying the stride. However, both approaches have drawbacks. While variable stride multibit tries can be tuned for a shorter height to speed up the search, this comes at the expense of wasted memory due to empty locations in intermediate trie nodes. LC-tries, on the other hand, choose strides such that the array locations in a trie node are completely filled without wasting any memory. However, they lose the flexibility to tune the height, which consequently increases the height of the trie and makes the search slower.
The Lulea algorithm uses a fixed stride multibit trie with nodes of large stride and applies bitmap compression to minimize storage. Using large strides in fixed stride multibit tries results in a greater number of contiguous nodes with the same best matching prefix and next-hop information. The key idea of the Lulea scheme is to replace all consecutive elements in a trie node that have the same value with a single value, and use a bitmap to indicate which elements have been omitted. This can significantly reduce the number of elements in a trie node and save memory. Before discussing the details of the algorithm, let us understand bitmap compression using a simple example.
Consider the prefix sequence A1A1A1A2A2A2A3A3A1A1A1A4A4A4. This can be represented using the bitmap 10010010100100, where a 1 bit indicates the start of a new prefix in the sequence and a 0 bit indicates repetition. This bitmap is referred to as a bitarr. The bitmap alone is not sufficient to recover the original sequence, since it does not capture any prefix information. Hence, a compressed sequence A1A2A3A1A4 of the original sequence, called valarr, needs to accompany bitarr.
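This compression and its inverse can be sketched in a few lines of Python (the helper names are illustrative, not part of the original algorithm):

```python
def compress(seq):
    """Run-length compress a sequence into (bitarr, valarr).

    bitarr has a 1 where a new value starts and a 0 for each repeat;
    valarr keeps one copy of the value for each run.
    """
    bitarr, valarr = [], []
    for i, v in enumerate(seq):
        if i == 0 or v != seq[i - 1]:
            bitarr.append(1)
            valarr.append(v)
        else:
            bitarr.append(0)
    return bitarr, valarr

def decompress(bitarr, valarr):
    """Recover the original sequence from (bitarr, valarr)."""
    seq, j = [], -1
    for bit in bitarr:
        if bit == 1:
            j += 1          # a 1 bit advances to the next stored value
        seq.append(valarr[j])
    return seq

seq = ["A1"] * 3 + ["A2"] * 3 + ["A3"] * 2 + ["A1"] * 3 + ["A4"] * 3
bitarr, valarr = compress(seq)
# bitarr -> [1,0,0,1,0,0,1,0,1,0,0,1,0,0], i.e. 10010010100100
# valarr -> ['A1', 'A2', 'A3', 'A1', 'A4']
```

Running this on the example sequence above reproduces the bitarr 10010010100100 and valarr A1A2A3A1A4 from the text.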
To illustrate the concepts behind the Lulea algorithm, consider the 3-bit fixed stride multibit trie shown in Fig. 1. Note that some of the entries contain both a prefix and a pointer to a subtrie. For instance, consider the fifth entry with prefix P2. It contains the prefix information for P2 and a pointer to the node 1 subtrie containing the prefix P6. To minimize storage and make compression easier, each entry is allowed to contain either a prefix or a pointer, but not both. Hence the prefixes in the intermediate nodes are pushed down to the trie leaves. The leaf nodes, instead of a pointer, store the next-hop information, while the internal nodes store only pointers to children. Such a trie is referred to as a leaf pushed fixed stride multibit trie.
A leaf pushed fixed stride multibit trie for the trie shown in Fig. 1 can be created as follows: prefix P2 stored in the root entry for 100 is pushed down to all the entries for 100, 101, 110, and 111 in the node 1 subtrie. Prefix P5 stored in the root entry for 111 is pushed down to all the entries for 000, 100, 101, 110, and 111 in the node 2 subtrie. Prefix P6 stored in the node 1 entry for 001 is pushed down to all the entries for 000, 001, 010, and 011 in the node 3 subtrie. The final leaf pushed fixed stride multibit trie is shown in Fig. 2.
Now let us look at how bitmap compression can be applied to a leaf pushed trie. Consider the root node, which holds the sequence (P3, P3, P1, P1, ptr1(100), P4, P2, ptr2(111)), where ptr(xxx) represents the pointer stored in the entry xxx. Once each run of repeated values is replaced by its first value, we get the new sequence (P3, −, P1, −, ptr1(100), P4, P2, ptr2(111)). The resulting bitarr is 10101111, which indicates the repeated value positions by 0, and the valarr contains the sequence (P3, P1, ptr1(100), P4, P2, ptr2(111)). The same process is repeated for all the subtries and the final compressed trie is shown in Fig. 3, which shows both the bitarr and valarr for each node.
What is the maximum number of entries that can be in valarr? Assume that a trie node contains N prefixes. Each prefix represents a range of addresses. These N prefixes partition the entire address space into no more than 2N + 1 disjoint subranges. Since each subrange can be represented by at most one compressed entry, there can be at most 2N + 1 entries in the valarr.
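The 2N + 1 bound can be checked numerically. In the sketch below (illustrative, not from the original text), each prefix is modeled as a half-open address range [lo, hi); the distinct range boundaries cut the address space into disjoint subranges:

```python
def count_subranges(prefix_ranges, universe=(0, 256)):
    """Count the disjoint subranges induced by N prefix ranges.

    Each of the N prefixes contributes at most two new cut points
    (its start and its end), so together with the ends of the address
    space there are at most 2N + 2 cuts, i.e. at most 2N + 1 subranges.
    """
    cuts = {universe[0], universe[1]}
    for lo, hi in prefix_ranges:
        cuts.add(lo)
        cuts.add(hi)
    return len(cuts) - 1

# Two disjoint prefixes: the maximum 2N + 1 = 5 subranges.
print(count_subranges([(32, 64), (128, 160)]))  # -> 5
# One prefix nested inside another: boundaries coincide less, only 4.
print(count_subranges([(0, 128), (64, 96)]))    # -> 4
```

The second call shows that 2N + 1 is only an upper bound: nested or adjacent prefixes share boundaries and produce fewer subranges.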
To analyze the space savings, assume that a trie node array has M elements, each of size W bits. Out of these M elements, if only S of them are distinct, then the total memory required is M + SW bits: M bits for the bitmap plus SW bits for the distinct values. Directly storing the M elements would require MW bits. Now let us calculate the space savings between the leaf pushed trie and the Lulea compressed trie, assuming that the next-hop information and a child pointer are 32 bits each. Then the root node of the leaf pushed trie in Fig. 2 will be 256 bits (= 8 × 32) and each child will also be 256 bits (= 8 × 32) long. Hence, the entire trie will require 1024 bits (= 256 + 3 × 256) of storage space.
Now let us consider the Lulea compressed trie in Fig. 3. The root node will need 8 bits for bitarr and 192 bits (= 6 × 32) for valarr. All the other nodes will also need 8 bits each for their bitarr. Both node 1 and node 2 will require 128 bits (= 4 × 32) each for their valarr and node 3 will need 64 bits (= 2 × 32) for its valarr. Hence the total space required is 544 bits (= 8 + 192 + 8 × 3 + 128 × 2 + 64), which represents a space savings of 480 bits. This space savings can be quite significant when very large strides are used.
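The M + SW accounting above can be replayed in a few lines (a sketch; the per-node counts of distinct entries are those described in the text for Fig. 3):

```python
def node_bits(M, S, W=32):
    """Bits for one compressed node: an M-bit bitarr plus S entries of W bits."""
    return M + S * W

# Root has 6 distinct entries; nodes 1 and 2 have 4 each; node 3 has 2.
compressed = sum(node_bits(8, S) for S in (6, 4, 4, 2))
uncompressed = 4 * 8 * 32        # four uncompressed nodes of 8 x 32 bits
print(compressed, uncompressed)  # -> 544 1024
```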
Route Lookup: The search algorithm starts with the root node and uses the same number of bits as the stride at the first level to index into its bitarr. Since the root trie node is compressed, the same index cannot be used to obtain the corresponding entry in valarr. Instead, the index into the uncompressed bitarr must be mapped to a different index in the compressed valarr. This mapping requires counting the number of 1 bits in bitarr up to the indexed bit position. The count is then used as the index into valarr to fetch the corresponding entry. If the entry is a pointer, the search fetches and follows it. This process is repeated at each subsequent level until either a leaf or a null pointer is reached.

Let us walk through an example to understand this clearly. Consider searching for a destination address beginning with 1110100 (Fig. 3). Since the root node is compressed, the first 3 bits 111 are used to index into bitarr(Root) of the root node. To access the corresponding pointer entry, the number of 1 bits in bitarr(Root) is counted up to the indexed position 111. As the count is 6, the sixth element in valarr(Root), containing the pointer ptr2(111), is fetched. Using this pointer, the search proceeds to the second level. The stride at this level is 3 and hence the next three bits 010 of the address are used to index into bitarr(N2). The indexed entry has no outgoing pointer, but the corresponding entry contains P7's next-hop information. To access it, the search needs to find the position where the prefix occurs in valarr(N2). Hence the algorithm counts the number of 1 bits up to the bit position indexed by 010 in bitarr(N2) and uses the count to index into valarr(N2) to find the best matching prefix. The number of 1 bits is 3 and hence the third element in valarr(N2) gives the correct result, P7. As can be seen, both steps require counting the number of 1 bits in bitarr up to a given position.
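One level of this compressed lookup can be sketched as follows (a minimal Python sketch; the root node's bitarr and valarr are the ones given earlier, and the count is taken inclusive of the indexed bit position):

```python
def lookup_step(bitarr, valarr, index):
    """Map an index into the uncompressed node to its entry in valarr.

    Counts the 1 bits in bitarr up to and including position `index`;
    if the count is c, the entry is the c-th element of valarr.
    """
    ones = sum(bitarr[: index + 1])
    return valarr[ones - 1]

# Root node of the example: bitarr 10101111 and its valarr.
bitarr = [1, 0, 1, 0, 1, 1, 1, 1]
valarr = ["P3", "P1", "ptr1(100)", "P4", "P2", "ptr2(111)"]

# The first 3 address bits 111 index position 7; six 1 bits occur up to
# there, so the sixth valarr entry, ptr2(111), is fetched.
print(lookup_step(bitarr, valarr, 0b111))  # -> ptr2(111)
```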
In the following section, we discuss how to efficiently count the number of 1 bits in a large bitmap.
Counting the number of 1 bits 
The original Lulea algorithm used a fixed stride multibit trie with strides of 16, 8, and 8 for the first, second, and third levels, respectively. As a result, the first level requires a bitmap of size 64K bits. Obviously, naively counting bits over such a large bitmap would be inefficient. Instead, the entire bitmap is divided into chunks of 64 bits each, and for each chunk the number of 1 bits is precomputed during trie construction. These precomputed counts are kept in a separate table called a summary table. The entry at position k in the summary table holds the cumulative count of 1 bits in the bit sequence up to chunk k − 1.
The size of the summary table is small compared to the actual bitmap size. For a rough idea of the space savings, we need to know the maximum value of the count that needs to be stored in an entry in the summary table. This occurs when all the bits in the bitmap are set to 1 and this needs to be stored in the last entry of the summary table. For a bitmap of size 64K divided into chunks of size 64 bits, the maximum value of the count will be 65472 (=64× 1024 − 64). This will require 16 bits of storage for each entry and hence a total of 16K bits (= 16×1024) is required, which is 25% of the memory used by the bitmap itself.
Now counting the number of 1 bits up to a given position i in the bitmap proceeds as follows. First, the chunk j into which i falls is computed as j = Floor(i/64). Second, the entry at position j in the summary table is accessed; this gives the cumulative count of 1 bits in all chunks up to j − 1. Finally, bitmap chunk j itself is retrieved and the number of 1 bits up to position i − j × 64 is counted. The sum of the two values gives the final count.
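These steps can be sketched as follows (Python; for readability the sketch uses the 8-bit chunks of the worked example below rather than the 64-bit chunks of the actual algorithm, and counts the bits strictly before position i):

```python
CHUNK = 8  # the original algorithm uses 64-bit chunks

def build_summary(bitmap):
    """Precompute cumulative 1-bit counts; entry k covers chunks 0..k-1."""
    summary = [0]
    for i in range(0, len(bitmap) - CHUNK, CHUNK):
        summary.append(summary[-1] + bitmap[i : i + CHUNK].count(1))
    return summary

def count_ones(bitmap, summary, i):
    """Number of 1 bits before position i, via the summary table."""
    j = i // CHUNK                               # chunk containing i
    within = bitmap[j * CHUNK : i].count(1)      # bits inside chunk j
    return summary[j] + within

# Chunks 10010011 and 01010001, then a chunk beginning 01111000.
bitmap = [int(b) for b in "100100110101000101111000"]
summary = build_summary(bitmap)
print(summary)                          # -> [0, 4, 7]
print(count_ones(bitmap, summary, 20))  # chunk 2, offset 4 -> 7 + 3 = 10
```

With 64-bit chunks, the per-chunk count inside `count_ones` would typically be a single hardware popcount on the retrieved word.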
Example: Counting the number of 1 bits in a bitmap.
Let us try to count the number of 1 bits up to position P, which is the fourth bit located in chunk X, as illustrated in Fig. 4.
For illustration, we use 8-bit chunks as opposed to the 64-bit chunks used in the algorithm. In Fig. 4, the bitmap is laid out on top while the summary table is at the bottom. The first entry in the summary table is zero, the second entry has a count of 4 (because the bitmap in the first chunk is 10010011, which has 4 bits set), and the third entry contains a count of 7 (the bitmap in the second chunk, 01010001, has 3 bits set, and this is added to the previous count of 4 for a cumulative count of 7). Note that the first entry is always zero.
To count the number of 1 bits up to position P, first the entry corresponding to chunk X in the summary table is accessed to retrieve the count of the number of 1 bits up to chunk X − 1. For the sake of discussion, let us assume that this value is Count[X − 1]. Then we retrieve the actual chunk X containing the bit sequence 01111000. Now the number of 1 bits up to position 4 is counted. Since the first four bits of the sequence are 0111, the count is 3. Thus, the overall bit count up to the desired position P is given by Count[X − 1] + 3.
The choice of chunk size presents a tradeoff between memory and speed. If the chunk size is the entire bitmap, then the bits have to be counted from scratch every time, which could incur substantial CPU cost and decrease the lookup speed. If instead the chunk size is 1 bit, a simple retrieval of the count from the summary table gives the intended result; the downside is that the summary table will be much larger than the original bitmap. With a bitmap of 64K entries, the summary table will require 1M bits (= 64 × 1024 × 16).
Using the summary table during search requires at least three memory references: one for retrieving the appropriate entry from the summary table, another for retrieving the actual chunk (depending on the chunk size and memory word size, this can take more than one memory access), and a final access for retrieving the element from valarr.
The Lulea algorithm applies the same optimization to next hops as well, since the next hops belonging to shorter prefixes will occupy several consecutive locations. Because of such optimizations, the storage required by the Lulea algorithm is very compact. On a routing table with 38,816 prefixes, the compressed database size is 160 KB, which represents 4.2 bytes per prefix. Such compact storage can easily fit in the L1 cache of a conventional general-purpose processor or in an on-chip SRAM, a necessity for providing wire speed lookup at 40-Gbit speeds.
However, the optimizations made by the Lulea algorithm do introduce a few disadvantages. The counting of bits requires an additional memory access per node. The benefits of the optimization are heavily dependent on the structure of the forwarding table. Hence, predicting the worst-case storage requirements of the data structure as a function of the number of prefixes is difficult. The implicit use of leaf pushing makes insertion inherently slow in the Lulea scheme. For instance, consider the insertion of a prefix into a root node entry that points to a subtrie containing thousands of leaf nodes. The next-hop information associated with the new prefix has to be pushed down to thousands of leaf nodes and, furthermore, many entries in the summary table need to be updated, which could be slow. In the next section, we outline the tree bitmap algorithm, which is as space efficient as Lulea but overcomes the incremental update problem by avoiding leaf pushing.