Libart

I have found an ART implementation that works for strings: armon/libart: Adaptive Radix Trees implemented in C. Here are the benchmarking results, shown next to those of JudySL and the L2 Si trie I have used in similar posts. All three data structures are string-to-pointer maps.
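For context, here is roughly how a benchmark like this exercises libart. The function names (art_tree_init, art_insert, art_search, art_tree_destroy) are the API as declared in the repo's art.h, to the best of my reading; the rest is a minimal sketch, not the actual harness.

```c
#include <string.h>
#include "art.h"  /* from armon/libart */

int main(void) {
    art_tree t;
    art_tree_init(&t);

    /* libart keys are arbitrary byte strings; including the
       terminating NUL keeps keys prefix-free. */
    const char *key = "http://example.com/";
    art_insert(&t, (const unsigned char *)key, (int)strlen(key) + 1, (void *)1);

    /* art_search returns the stored value, or NULL when absent. */
    void *val = art_search(&t, (const unsigned char *)key, (int)strlen(key) + 1);

    art_tree_destroy(&t);
    return val == NULL;
}
```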

For a brief description of the data sets, see my earlier The New Indexers, Masstree post.

| Data Set | Item Count | Byte Size | Key Average Size (bytes) |
|----------|------------|-----------|--------------------------|
| PKD      | 9.158M     | 104MB     | 10.33                    |
| ENW      | 10M        | 212MB     | 20.23                    |
| URL      | 6.795M     | 978MB     | 143                      |

In the benchmark tables that follow, memory usage is given in MB.

| Data Set | Libart Insert | JudySL Insert | L2 Si Insert | Libart Lookup | JudySL Lookup | L2 Si Lookup | Libart Mem Usage | JudySL Mem Usage | L2 Si Mem Usage |
|----------|---------------|---------------|--------------|---------------|---------------|--------------|------------------|------------------|-----------------|
| PKD      | 8.7s          | 6.5s          | 5.83s        | 6.01s         | 5.1s          | 4.07s        | 703              | 409              | 186             |
| ENW      | 12.2s         | 9.6s          | 8s           | 10s           | 8.4s          | 7.4s         | 912              | 523              | 309             |
| URL      | 13.5s         | 12s           | 7.7s         | 11.2s         | 11s           | 6.91s        | 1514             | 412              | 248             |

I don’t like these results much. To be sure, I never expected ART to be very good. From a quick look at the libart code, my guess is that it does some sort of path compression. The implementation seems to follow the paper closely: you get the four node types with their suggested sizes, SSE2 searching of the Node16 key array, and so on.
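To illustrate the Node16 trick just mentioned, here is a minimal sketch of the SSE2 key search from the ART paper. The names (node16, find_child16) are mine, not libart's; the intrinsics are standard SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Hypothetical Node16: up to 16 key bytes plus parallel child pointers. */
typedef struct {
    uint8_t num_children;
    uint8_t keys[16];
    void   *children[16];
} node16;

/* Find the child matching byte c, or NULL. This is the scheme from the
   ART paper: compare c against all 16 key bytes at once, mask off the
   unused slots, and take the first match. */
static void *find_child16(const node16 *n, uint8_t c) {
    __m128i key = _mm_set1_epi8((char)c);
    __m128i cmp = _mm_cmpeq_epi8(key,
                      _mm_loadu_si128((const __m128i *)n->keys));
    int mask = _mm_movemask_epi8(cmp) & ((1 << n->num_children) - 1);
    if (mask == 0)
        return NULL;
    /* __builtin_ctz (GCC/Clang) gives the index of the first set bit. */
    return n->children[__builtin_ctz(mask)];
}
```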

(The numbers for L2 Si are not the same as those from the post quoted earlier, as I had rerun the benchmark.)

What libart does, and other ART implementations don't, is store both the key and the value. The memory figures given for it (for all three tries, in fact) include key storage. The JudySL and L2 Si tries store common prefixes only once, and can reconstitute the original keys from the pieces. If some of the memory usage figures fall below the raw data size, it's not because they lose the data; it's because they compress it. By contrast, libart allocates a leaf (value + key length + the actual key bytes) for each stored key. So, for example, out of the total 1.5GB taken for the URL set, about 1GB goes to keeping the original key (and value) data.
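For reference, libart's leaf looks roughly like the struct below; I am reproducing it from memory of art.h, so treat the exact layout as an approximation. The comment after it sketches where the ~1GB goes.

```c
#include <stdint.h>

/* Approximate layout of libart's leaf (see art.h in armon/libart):
   one allocation per stored key, holding the value pointer, the key
   length, and a full copy of the key bytes. */
typedef struct {
    void         *value;
    uint32_t      key_len;
    unsigned char key[];   /* flexible array member: the whole key */
} art_leaf;

/* Back-of-the-envelope for the URL set: 6.795M leaves times
   (8B value + 4B key_len + padding + ~144B key incl. NUL) is on the
   order of 1GB before any inner nodes are counted. */
```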

For a nicer comparison with pretty plots between the JudySL and L2 Si tries, see String Associative Tables, III and String Associative Tables, IV.

String Organizers

What’s a good data structure for storing strings? What’s a bad one?

It all depends on use, of course. And one selects a data structure accordingly.

But it’s fair to say good time and space characteristics are important. Especially when the data starts to get big.

So is Libart good or bad? Is JudySL good or bad? Are these the “tries” we are looking for? What is a “trie” anyhow? And should we look at alternatives?

It makes sense to look at a basic trie implementation — a simple, but not ridiculously naive one. An implementation that takes some time to code, but not much.

As for alternatives, tries are the choice when talking about sorted collections. We should not even consider hashes (because they're not sorted collections). But what if we did?

I have two more contenders, for which I'll show figures next to those of Libart.

| Data Set | Libart Insert | JudyHS Insert | L2 Sf Insert | Libart Lookup | JudyHS Lookup | L2 Sf Lookup | Libart Mem Usage | JudyHS Mem Usage | L2 Sf Mem Usage |
|----------|---------------|---------------|--------------|---------------|---------------|--------------|------------------|------------------|-----------------|
| PKD      | 8.7s          | 4.72s         | 6.47s        | 6.01s         | 3.6s          | 6.25s        | 703              | 383              | 438             |
| ENW      | 12.2s         | 5.1s          | 10.6s        | 10s           | 3.8s          | 11.13s       | 912              | 593              | 655             |
| URL      | 13.5s         | 5.06s         | 9.92s        | 11.2s         | 4.57s         | 11.17s       | 1514             | 1284             | 559             |

The L2 Sf trie is fairly basic, and most unimaginative. It uses a single type of tree node representation; nothing adaptive about it. The keys and pointers are stored in sorted arrays, which grow as data is added. The data structure is conservative in its memory usage.
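A node along those lines might look like the sketch below. This is my reading of the description, not the actual L2 Sf code: one node type, a sorted byte array searched with binary search, and a parallel array of child pointers.

```c
#include <stdint.h>

/* Hypothetical single-type trie node, in the spirit of the L2 Sf
   description: sorted key bytes with a parallel child array, both
   reallocated as entries are added. Not the actual L2 Sf code. */
typedef struct trie_node {
    uint16_t          count;     /* populated slots */
    uint16_t          capacity;  /* allocated slots */
    uint8_t          *keys;      /* sorted byte array */
    struct trie_node **children; /* children[i] follows keys[i] */
} trie_node;

/* Binary search for byte c; returns the slot index, or -1 if absent. */
static int find_slot(const trie_node *n, uint8_t c) {
    int lo = 0, hi = (int)n->count - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (n->keys[mid] == c) return mid;
        if (n->keys[mid] < c)  lo = mid + 1;
        else                   hi = mid - 1;
    }
    return -1;
}
```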

JudyHS is a hash built over other Judy data structures. It's not a trie. It's not ordered. It has very little functionality beyond adding and querying data. Hashes are good when you don't require the functions only a sorted collection can offer. Or are they? Hashes don't know what to do with strings; they are unable to exploit data redundancy for time and space gains.
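The whole JudyHS interface does fit in a few lines. A minimal sketch using the macros from Judy.h (JHSI for insert, JHSG for get, JHSFA to free the whole array), with error handling stripped for brevity:

```c
#include <string.h>
#include <Judy.h>

int main(void) {
    Pvoid_t table = (Pvoid_t)NULL;  /* an empty JudyHS array */
    PWord_t pvalue;
    char key[] = "http://example.com/";

    /* Insert: JHSI points pvalue at the value slot for the key,
       creating the slot if it doesn't exist yet. */
    JHSI(pvalue, table, key, strlen(key));
    *pvalue = 42;

    /* Lookup: JHSG points pvalue at the slot, or NULL if absent. */
    JHSG(pvalue, table, key, strlen(key));

    /* That, plus deletion, is about all the functionality there is. */
    Word_t bytes;
    JHSFA(bytes, table);  /* free the entire array */
    return pvalue == NULL;
}
```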

One of the main reasons I started looking at tries is exactly that. A trie can store a set of strings compactly. It can strip the common prefixes and retain only the remaining suffixes. Very useful when the prefixes are long. (Think of millions of URLs that all start with http://www.: the shared prefix need be stored only once.)
