Dynamic Elias–Fano Encoding for Generalist Integer Sets

Elias–Fano is a way to encode sorted sequences of integers. It promises compression, and unlike many other compression schemes, it also offers fast random access. That combination alone makes Elias–Fano an interesting candidate for a general-purpose adaptive integer set.

A well-known example of such a structure is the Judy Array. Judy is an adaptive data structure that claims to deliver fast insertions, quick lookups, and compact memory usage. Unfortunately, as I’ve shown in earlier posts, Judy doesn’t quite live up to the “compact” part—it can use up to 16 bytes per key on many inputs. My own RI integer set does better on the memory usage side.

Now, adaptive radix tree–based integer sets (like Judy or RI) could take advantage of Elias–Fano for encoding their data. In a tree, these encodings would typically live at the fringe nodes, where the remaining key space is small—say, the last two or three bytes of the key. For those small universes, Elias–Fano starts to look really attractive.

But there’s a catch. Classic Elias–Fano isn’t editable. Once you’ve built the structure, you can’t just insert a new number without rebuilding or shifting things around—pretty much the same problem you have with a plain sorted array.

There’s another issue too: that so-called “fast access.” In practice, searching for a value means first walking through the high bits array (HBA), then jumping into the low bits array (LBA) for what you hope will be a short scan. Without an additional index, there’s nothing particularly fast about the HBA itself. So a lookup usually ends up touching three layers: the index (if you have one), then the HBA, then the LBA. Not exactly blazing.

Several schemes have been proposed in the literature to make Elias–Fano more friendly to updates. Most of them rely on partitioning the encoded stream into smaller segments. However, the approaches I’ve seen aren’t designed for hyper-dynamic, general-purpose data structures. They’re usually aimed at more specialized use cases. Still, partitioning the Elias–Fano (EF) stream is a natural starting point for building a dynamic integer set.

A trie-based structure—like Judy or RI—already divides the representable universe into smaller chunks. Each byte of a key consumes one level in the trie, and every level splits the remaining space into 256 subspaces. For example, a 64 K universe at the fringe might hold anywhere from 0 to 65,536 integers. You could represent that range as a single node or as a two-level tree. This hierarchy naturally partitions both the data and any EF stream that encodes it. It’s easy to imagine those fringe nodes being encoded as Elias–Fano blocks—each one a single, compact, monolithic chunk. At first glance, there doesn’t seem to be much reason to subdivide them further.

But there are reasons.

Recall that we need an index into the high bits array (HBA) to enable faster access. That index effectively splits the EF stream into smaller logical parts. Also, shifting memory byte-wise is far cheaper than doing it bit-wise—which matters when updates are frequent. Each time you insert or remove an element, the EF block will likely need re-encoding, and large monolithic blocks are expensive to rewrite.

On top of that, Elias–Fano doesn’t compress large populations very efficiently. Its effectiveness depends on the density of the set, and it doesn’t adapt well as that density changes. Taken together, these observations suggest that dividing an EF stream—within each tree node—into smaller sub-blocks could significantly improve access speed, update performance, and space utilization.

Let’s now look at a more concrete Elias–Fano (EF) scheme for a trie node.

We’re still in the context of an adaptive trie—one that can switch between multiple encodings and always pick the most efficient option for its current contents. Our new compact encoding must therefore coexist and interoperate with other node types, allowing smooth transitions both to and from them.

Since this new node type is compact by design, let’s give it a name: the C-node.

Structure of a C-node

In this design, a C-node’s total size is capped at a modest 256 or 512 bytes—small enough for fast scans and cheap shifts when editing.

Each C-node stores, contiguously, a sequence of coding units, primarily Elias–Fano blocks (EF blocks). Each EF block is a self-contained Elias–Fano stream that covers a range of consecutive byte universes.

Here, the whole key space is divided into byte universes, each corresponding to a range of 256 integer values. A single EF block can therefore represent one or more byte universes—it might cover one, two, ten, or twenty of them—depending on how densely populated those ranges are. Importantly, no individual byte universe is ever split between two blocks.

For speed, the high bits array (HBA) of each EF block is designed to fit entirely within a single machine word, a 64-bit word. Elias–Fano requires at least one bit per integer in the HBA, but without a few extra bits, compression quickly loses efficiency. In practice, HBAs typically use two or three bits per key, though the ratio can vary without breaking the encoding.

Given these constraints, we’ll limit each EF block to at most 32 keys (i.e., integers).

Handling Dense Byte Universes

Of course, some byte universes will be more densely populated than that. To handle those, the C-node includes a secondary block type: bitmaps.

Each bitmap represents exactly one byte universe and can encode populations anywhere from 32 up to 256 keys.

This dual system—EF blocks for sparse ranges and bitmaps for dense ones—lets the node adapt naturally to local density. Sparse areas remain compact and quickly searchable, while dense regions use bitmaps for direct representation. The result is a space-efficient, cache-friendly, and still editable node structure.

Block Identification and Lead Keys

Since blocks in a C-node are scanned linearly, each block must be identifiable during traversal. To make that possible, every block is assigned a lead key—a small header value that marks its position in the overall key space.

For bitmap blocks, the lead key consists of the high eight bits identifying the byte universe within the larger 64 K universe of the C-node.

For Elias–Fano (EF) blocks, the lead key is simply the first key in the block’s encoded population. This lead key is stored in full, before the Elias–Fano stream proper, and is not encoded again within the stream itself.

The previously mentioned 32-keys-per-block limit applies only to the Elias–Fano–encoded integers—it doesn’t include the lead key. Note also that Elias–Fano doesn’t make sense for a single integer; there must be at least two values in a valid EF stream.

Scrap Arrays: Handling the Odd Cases

These details introduce one more special case, and thus another coding unit type: the scrap array.

A scrap array holds exactly one or two keys—those that, for one reason or another, couldn’t be absorbed into neighboring blocks. One such situation occurs when both adjacent blocks are bitmaps, but the stray key or keys belong to a different byte universe and can’t fit into either.

Scrap arrays handle these edge cases cleanly, preventing small fragments of data from forcing unnecessary block splits or wasted space. In practice, they act as tiny overflow buffers that preserve the node’s compact layout while still keeping all keys reachable and in order.

Describing Coding Units Efficiently

Each coding unit in a C-node—whether it’s an EF block, a bitmap, or a scrap array—needs to be identified not just by the range it covers (given by its lead key), but also by its type and size.

We make a somewhat artificial distinction between two categories:

Blocks — the “desirable” coding units, which can be EF blocks or bitmaps.
Arrays — coding artifacts, auxiliary and “unwanted,” they exist only to handle special cases.

Naturally, we’d like to store as little metadata as possible. Storing the type and size directly for every unit would be wasteful. Fortunately, we can infer most of it.

Implicit Sizes

For bitmaps and arrays, the size is implicit—no extra data is needed.
For EF blocks, however, the size must be computed. For that, we use the information that we must store anyway—

The number of low bits per key
The number of coded keys

There’s no practical way to get by with less; these two values are the minimum sufficient descriptors. Knowing the number of keys, we can inspect the high bits array (HBA) and derive its size. The size of the low bits array (LBA) is just the product of the two numbers.

Encoding the Unit Type

That leaves us with the question of how to identify the type of coding unit.

Since the two EF block descriptors (the low bits count and the key count) are encoded using a fixed number of bits, some combinations are improper. Because keys are 16-bit numbers, there can’t be 32 keys in a block coded with 15 low bits—it’s not how Elias-Fano works.

These disallowed combinations can be repurposed to encode the types of non-EF units, such as bitmaps and arrays.

This way, every coding unit—whether block or array—can be identified and parsed without any dedicated type field, keeping the C-node’s metadata compact and efficient.

Algorithm 1: Scanning and Searching a C-Node

Input: search_key
Output: membership status

1:  while not at end of node do
2:      lead_key ← next 8 bits
3:      low_bits ← next 4 bits
4:      n_keys   ← next 5 bits
5:      
6:      if (valid_EF_combination(low_bits, n_keys)) then
7:          hba_size ← select_HBA(n_keys)
8:          lba_size ← n_keys × low_bits
9:          size ← hba_size + lba_size
10:         
11:         if search_key ∈ EF_block_range(lead_key) then
12:             scan_EF_block()
13:             exit search
14:         end if
15:      
16:     else
17:         type ← classify_unit(low_bits + n_keys)
18:         switch type do
19:         case ARRAY:
20:             if search_key ∈ array_range(lead_key) then
21:                 scan_array()
22:                 exit search
23:             end if
24:             size ← implicit_array_size()
25:             break
26:             
27:         case BITMAP:
28:             if search_key ∈ bitmap_range(lead_key) then
29:                 search_bitmap()
30:                 exit search
31:             end if
32:             size ← implicit_bitmap_size()
33:             break
34:         end switch
35:     end if
36:
37:     advance_pointer(size)
38:  end while

Note that only the high eight bits of the lead key are needed to evaluate the search key membership.

The HAT-trie

I mentioned the HAT-trie in an earlier post as an example of a burst trie. But is it really good? The handful of implementations out there look promising: they load quickly, answer queries efficiently, and use relatively little memory — at least, when the input data is favorable.

Let’s take a closer look at the design. At its core, the HAT-trie is essentially a hash table organized by a trie. Reads and writes are fast and adaptive, though never quite as fast as a pure hash table. The structure also gestures toward being a proper ordered associative array, but the effect is half-hearted: the ordering is incomplete, and the results show.

In particular, the HAT-trie struggles with ordered set operations like successor, predecessor, lower_bound, upper_bound, or small prefix/range queries. Within each bucket, keys are kept in a hash array with no inherent order. To answer an ordered query, the bucket must be scanned or sorted on demand. For large range queries (e.g. “give me all keys starting with foo”), this overhead is amortized, since the query touches many elements anyway. But for small queries — when you just want the immediate successor of a key, or a tight range — the structure is inefficient, forcing you to pay scanning or sorting costs for a tiny result.

But the real weakness isn’t just in the operations — it’s in the nature of the input. The HAT-trie only shines when the keys are short and diverge within the first few characters. Once the keys grow longer, performance deteriorates quickly. Buckets were never designed to hold large suffixes: they behave well with short fragments, but insert a 200-byte key and the bucket has to store that entire suffix. Now every scan or comparison runs in time proportional to the string length, and the supposed efficiency evaporates.

Some implementations introduce further restrictions that make matters worse. The original design, and a few descendants, only supported 7-bit ASCII values. That excludes many characters you’d encounter even in ordinary English text — words like naïve, café, résumé, piñata, fiancé, or crème brûlée. Restricting the character set makes benchmarks look cleaner, but it’s a sleight of hand: the results don’t translate to the real world.

In the end, the HAT-trie is best described as a toy data structure — an academic curiosity that performs nicely on carefully chosen inputs, but collapses outside that narrow niche. It works well on the same workloads where any toy trie works well, and that’s about as far as it goes.

The Burst Trie

A paper describing a “novel” — “fast and efficient” — data structure called the Burst-trie was presented in 2002. The stated motivation for the design was to reduce the number of string comparisons (“The primary design goal for the burst trie was to reduce the average number of string comparisons required during a search to less than two.”) and to limit memory usage (“As a secondary goal, any gains in performance should not be offset by impractical memory requirements, as observed in tries or TSTs.”) The paper notes that traditional tries have several issues, and the Burst-trie attempts to address some or all of them. In short, the Burst-trie aims to be a better trie.

Here’s a succinct description of the structure:

“A burst trie is a hybrid data structure that combines trie nodes with containers (like arrays or lists) to efficiently store and retrieve large sets of strings, balancing cache performance and memory usage.”

In the original formulation, the containers could be linked lists, dynamic BSTs, or splay trees. One issue is immediately apparent: all three of these choices are poor in terms of localized access and memory utilization. Later, the HAT-trie was proposed — a variant of the Burst-trie that uses “hash arrays” as the fringe containers. The HAT-trie (standing for Hash Array Trie) paper, for an instance, identifies correctly that the cost of searching these memory-scattered buckets (containers) is the real bottleneck for the burst-tries.

Around the same time (2002), the Judy array was becoming publicly available: “In June 2002, Judy was open-sourced with an LGPL license and hosted at http://sourceforge.net/projects/judy.” Development had been underway for several years, possibly more than ten.

A 10 minute technical description

Unlike the Burst-trie, Judy excels at localizing access and using memory efficiently. Its time and space characteristics make it vastly superior to the Burst-trie. Judy achieves this by organizing its trie nodes in a clever and compact way.

The earlier mentioned HAT-trie paper, published in 2007, has no knowledge of Judy (“However, to our knowledge, the current HAT-tries are the first trie-based data structures that can approach the speed and space efficiency of hash tables, while maintaining sorted access to strings.”) It keeps referencing the authors’ earlier proposals (e.g. the “hash array”, which of course is “currently the best for unsorted string management”), but seems oblivious that claims of “approach[ing] the speed and space efficiency of hash tables” had been made before.

In particular, HAT-trie implementations can serve as fast string indexers, but that’s mostly because at the lower levels they rely on hash tables. Since the data is grouped into a sorted collection of hash-based (and therefore unsorted) containers, it can still be traversed in order at a reasonable—though not great—speed. For certain workloads, the HAT-trie lives up to its promises: it’s fast and memory-efficient. However, in my tests it appears to handle long keys poorly—a limitation that aligns with the theoretical design of the data structure.

Wormhole Reassessed

At an earlier point, I examined Wormhole, a fast string indexing scheme. Its central idea is to use a hash table to accelerate traversal in a deconstructed tree. This is not entirely new: as early as the 1980s–1990s, research on hash tries and prefix B-trees for databases explored the same principle—hashing key segments to reduce comparisons at high-fanout nodes. Wormhole can be seen as a refinement of that lineage. Similarly, the X-fast and Y-fast tries (for integer keys) hash prefixes of integers into tables to accelerate predecessor/successor queries. Wormhole is, in many ways, the string generalization of the X-fast trie idea.

Recently, I took a second dive into the design and implementation of Wormhole, hoping that its hashing trick might indeed address the most intractable problem plaguing tries: low internal density. Unfortunately, my findings were negative—Wormhole does not truly solve the issue.

The reported performance gains are real, I do not doubt that, but they stem from other factors. First, recall that data structures freely trade space for time. As I showed in an earlier post, Wormhole requires a great deal of memory. Designing a faster string indexer that simply consumes more RAM is trivial. In my experiments with three datasets, Wormhole used 2.3×, 2.1×, and 4.6× more memory than the baseline structure. Lookup times were indeed better—on one dataset Wormhole was 1.3× faster than the baseline. However, insertion times were significantly worse, reaching up to 1.64× slower than the baseline.

With this basis in mind, let’s turn to Wormhole’s design and examine its features. The original paper makes it clear where most of the performance comes from. A plot and discussion near the end show that Wormhole’s high lookup throughput is mainly due to two tricks used in the leaf arrays: (1) tagging keys with a hash value and using that during search and sorting, and (2) applying interpolation search. Both are clever techniques—and perhaps underused in practice. But they also prevent exploiting redundancy in the string collection for time and space savings. While the mandatory prefix in a key bank (the longest common prefix between its bounding anchor keys) can still be stripped, this prefix is often short, especially when key bank sizes are large.

Now consider the hash-based acceleration of tree descent. This requires inserting all prefixes of anchor keys into a hash table. The problem is that there are many such prefixes. One could imagine reducing them (say, keeping only 1 in N), but the paper does not discuss this option. If anchor keys are long, space usage becomes significant: for each key there’ll be L prefixes of average length L/2, where L is the anchor key length. Of course you don’t store the prefixes as strings :), but you need to store something. Even if it’s only an 8 bytes entry in a hash table with a .5 load factor, for a large enough L, the storage size is not a trifle. Conversely, if the anchors are short and few, space concerns vanish—but so does the usefulness of the scheme. On short keys, the hash table only “accelerates” a few steps of traversal, at the cost of nearly as many hash probes as the tree levels removed.

I did more than just give the paper a few more thoughts, I tried the hash tree descend acceleration myself. Here’s a code excerpt:

Accelerated tree descent using prefix hashes + cuckoo lookup:

// Search state
unsigned step_pos, step_size, cur_len, rollback_len;
struct record *record = NULL;
// Saved rollback checkpoints (positions + hash values)
unsigned saved_pos[8], rollback_count = 8;
unsigned long saved_hash[8];

unsigned i = 0;
unsigned long hash = 1469598103934665603ULL; // FNV offset basis
unsigned name_len = strlen(name);

// Initial guess: start partway into the key
// We don't look at every prefix, only at those on multiple of 4 offsets
rollback_len = step_size = (10 * name_len >> 4) & ~3;
step_pos = step_size;

// Precompute rollback checkpoints (coarse binary search positions)
cur_len = step_pos;
while (cur_len > 4 && rollback_count) {
    saved_pos[--rollback_count] = cur_len = (cur_len >> 1) & ~3;
}

// Precompute prefix hashes for each checkpoint
while (rollback_count < 8) {
    for (; i < cur_len; i++) {
        hash ^= (unsigned char) name[i];
        hash *= 1099511628211ULL;
    }
    saved_hash[rollback_count] = hash;
    cur_len = saved_pos[++rollback_count];
}

// Hash up to initial step position
for (; i < step_pos; i++) {
    hash ^= (unsigned char) name[i];
    hash *= 1099511628211ULL;
}

// Probe statistics
int probe_count = 0;

// Thresholds for breaking out of descent/ascend search
#define BREAK_DESCENT  10
#define BREAK_ASCENT   10

// ----------- Descending search (binary-like contraction) -----------
while (1) {
    probe_count++;

    record = cuckoo_lookup(cuckoo, name, step_pos, hash);
    if (record) break;

    step_pos = (step_pos >> 1) & ~3;
    if (step_pos < BREAK_DESCENT) break;

    step_size >>= 1;
    if (rollback_count) {
        hash = saved_hash[--rollback_count];
    } else {
        break;
    }
}

// ----------- Ascending refinement (expand forward if found) --------
if (record) {
    if (rollback_count == 8) {
        step_size = ((name_len - step_pos) >> 1) & ~3;
    } else {
        step_size = (saved_pos[rollback_count] >> 1) & ~3;
    }

    while (1) {
        if (step_size < BREAK_ASCENT) break;

        unsigned prev_pos = step_pos;
        unsigned long new_hash = hash;

        step_pos += step_size;

        for (; prev_pos < step_pos; prev_pos++) {
            new_hash ^= (unsigned char) name[prev_pos];
            new_hash *= 1099511628211ULL;
        }

        probe_count++;

        void *found = cuckoo_lookup(cuckoo, name, step_pos, new_hash);
        if (found) {
            record = found;
            hash = new_hash;
        } else {
            step_pos -= step_size;
        }

        step_size = (step_size >> 1) & ~3;
    }
}

// ----------- Verification: ensure record matches -------------------
if (record) {
    if (record->offset > name_len
        || memcmp(record->path, name, step_pos) != 0) {
        record = NULL;
    }
}

// Accumulate probe statistics
qa_C += probe_count;

// ----------- Update traversal stats + advance pointer ---------------
if (record) {
    unsigned full_len = name_len;

    total_C += record->offset;            // bytes skipped
    nodes_C += record->height;            // nodes skipped
    rem_C   += full_len - record->offset; // leftover bytes

    name += record->offset;               // advance key pointer
    fpnode_data = record->node;           // descend in trie
} else {
    rem_C += name_len;
}

More explaining:

Key: [—————————————————–]
Index: 0 … len(name)

Initial guess (step_pos): somewhere around 5/8 * len (rounded to 4 bytes)

Check → if found: stop early
if not: halve and roll back

Steps:
len/2 → len/4 → len/8 → … until BREAK_DESCENT or a match

Suppose match found at step_pos = len/8.

Now try to “grow” the prefix forward:

step_pos + step_size → step_pos + step_size/2 → step_pos + step_size/4 …

If lookup succeeds: keep the longer prefix
If lookup fails: roll back

Libart

I have found an ART implementation working for strings: armon/libart: Adaptive Radix Trees implemented in C. Here are the benchmarking results, shown next to that of JudySL, and the L2 Si trie I have used in similar posts. All the three data structures are string to pointer maps.

For a brief description of the data sets, see my earlier The New Indexers, Masstree post.

	Item Count	Byte Size	Key Average Size (bytes)
PKD	9.158M	104MB	10.33
ENW	10M	212MB	20.23
URL	6.795M	978MB	143

The memory usage is given in MB.

	Libart Insert	JudySL Insert	L2 Si Insert	Libart Lookup	JudySL Lookup	L2 Si Lookup	Libart Mem Usage	JudySL Mem Usage	L2 Si Mem Usage
PKD	8.7s	6.5s	5.83s	6.01s	5.1s	4.07s	703	409	186
ENW	12.2s	9.6s	8s	10s	8.4s	7.4s	912	523	309
URL	13.5s	12s	7.7s	11.2s	11s	6.91s	1514	412	248

I don’t like these results much. To be sure, I never expected ART to be very good. Checking summarily the libart code, I guess(?) that it is doing some sort of path compression. The implementation seems to follow closely the paper: you get the four node types, with their suggested sizes, SSE2 searching of the Node16 array, et al.

(The numbers for L2 Si are not the same as those from the post quoted earlier, as I had rerun the benchmark.)

What libart does, and other ART implementations don’t do, is to store both the key and the value. The memory figures given for it (for all the three tries, in fact) include storing the key. The JudySL and L2 Si tries store common prefixes only once, and they are able to reconstitute the original keys from pieces. If some of the memory usage figures are below the data size, it’s not because they lose the data — it’s because they compress it. By contrast, libart allocates a leaf (value + key length + actual key) for each stored key. So, for example, out of the total 1.5GB taken for the last set, about 1G is used to keep the original key (and value) data.

For a nicer comparison with pretty plots between the JudySL and L2 Si tries, see String Associative Tables, III and String Associative Tables, IV.

String Organizers

What’s a good data structure for storing strings? What’s a bad one?

It all depends on use, of course. And one selects a data structure accordingly.

But it’s fair to say good time and space characteristics are important. Especially when the data starts to get big.

So is Libart good or bad? Is JudySL good or bad? Are these the “tries” we are looking for? What is a “trie” anyhow? And should we look at alternatives?

It makes sense to look at a basic trie implementation — a simple, but not ridiculously naive one. An implementation that takes some time to code, but not much.

As for alternatives, the tries are the choice when talking about sorted collections. We should not even consider hashes (because they’re not sorted collections.) But what if we did?

I have two more contenders, for which I’ll show figures next to that of Libart.

	Libart Insert	JudyHS Insert	L2 Sf Insert	Libart Lookup	JudyHS Lookup	L2 Sf Lookup	Libart Mem Usage	JudyHS Mem Usage	L2 Sf Mem Usage
PKD	8.7s	4.72s	6.47s	6.01s	3.6s	6.25s	703	383	438
ENW	12.2s	5.1s	10.6s	10s	3.8s	11.13s	912	593	655
URL	13.5s	5.06s	9.92s	11.2s	4.57s	11.17s	1514	1284	559

The L2 Sf trie is fairly basic, and most unimaginative. It uses a single type of tree node representation — nothing adaptive about it. The keys and pointers are stores in sorted arrays, which are grown as data is added. The data structure is conservative in its memory usage.

JudyHS is a hash built over other Judy data structures. It’s not a trie. It’s not ordered. It has very little functionality beyond adding and querying data. Hashes are good when you don’t require the functions only a sorted collection can offer. Or are they? Hashes don’t know what to do with strings, they are not able to exploit data redundancy for time and space gains.

One of the main reasons I started looking at tries is exactly that. A trie can store a set of string compactly. It can strip the common prefixes and retain the free suffixes. Very useful when the former are long.

ARTful Dodging: The New Kid on the Block

I keep getting references about ART — a new breed of radix tree, the essence of what an adaptive radix tree is, etc. As if it is the idea of creating a radix tree that adapts its structure to data. But the idea is not new, and it has been done better in the past.

Let’s start by saying ART does not store your data — it only creates an index over it. If you have N numbers (or strings), you need to store them somewhere and use ART as an index over your data copy. This is a problem in many different ways. First, it means you still need a data structure (DS), a collection of sorts, that will store your data. Second, one cannot simply compare a DS that does not store data with one that does — it’d be an unfair, misleading comparison. It’s not comparing apples with oranges, it’s comparing apples with a drawing of apples. Third, benchmark results will be badly distorted by this “feature.”

Now, of course that this “shortcoming” — which may be an useful feature, if indeed you have your data elsewhere, and now simply want an index over it — can be addressed in many different ways. Adding a data store to ART — one that’s dynamic, allowing insertions and removals — it’s possible, but it will not have a trivial cost. Anyways, I do not intend to do that.

I’ll do just a quick comparison of ART with the better engineered Judy Array. And since ART is a map (key to pointer association), I’ll use JudyL for comparison.

In my comparison, keys will be eight bytes. Somehow, the original ART code does not allow 64 bits, it needs one bit for its own use — yet another small cheat to get better numbers. For this reason, my keys will use only 63 bits in the 8 bytes.

Times are given in seconds, sizes in MB. For ART, the sizes include a plain (flat) array storing the keys (8 bytes per key.) For each collection, the element count is given in parentheses.

	ART size	JudyL size	ART lookup time	JudyL lookup time
PKD (6753000)	272	184	3.6	2.7
den²⁴ (16777216)	725	172	2.13	1.88
log²⁴ (16777216)	578	468	7.48	10.6
P4SH (3236127)	76	65	.32	.66
P5SH (3236127)	77	69	.35	.79
Dense (2015775)	88	22	.25	.09

One more set, ART’s favorite: uniformly random integers. The problem with this set is that (like the equally contrived 1 to N dense²⁴ set) it creates 3 (almost) full levels (at the top.) In principle, once populated, the traversal requires mostly moving through 3 lookup tables (Node256 in ART parlance, see below.)

spa²⁴ (16777216)

563

465

2.65

4.42

The ART numbers may look a bit suspicious, and I wondered whether I got the byte swapping right. But I checked: for the sparse²⁴ set I get a prefix length of 0 for the top node, and of 5 for the dense²⁴ set. Meaning that the former set has no shared prefix between all elements, while the latter has a 5 bytes long prefix. Exactly what to expect. Also, the lookup times are larger for sparse²⁴ than for dense²⁴, and Judy’s mem usage goes the same way too. Strange that ART needs more space for dense²⁴.

More about the used data sets here: Integer Sets.

JudyL size for some dense sets may look suspicious. 172MB for storing 2^24 pairs! of 8 bytes integers? But Judy is adept as storing dense sets. Judy1, an integer set, stores those keys in about 4MB — 2bits/element. So the 172MB must be mostly dedicated to the “value” payload.

The ART memory figures do not look particularly good (though for some collections, they’re not lot worse than Judy’s.) The ART paper was promising that ART takes no more than 52 bytes per element (just to index it, and without storing the actual key.) But it appears that the 52 is not some upper limit, ART is actually taking it.

What is ART, though? Since a great organizer of string collections it isn’t.

ART is a “practical” recipe to construct a radix tree — a rather mediocre one. The reason? ART tries to deal with only one (but arguably chief) problem of radix trees, namely, the lookup tables in the original formulation are too wasteful, memory wise. To that end, ART proposes you have four node classes, and select one of the four based on your fanout. So the node classes become size classes. They are: 4, 16, 48 and 256. There’s some Karl Malbrain code flying around. The author tinkered with the concept and came to a different set of four values: 4, 14, 64 and 256. His reasons I don’t know. Back to the original classes, they organize in different manners. The class 4 is a plain double array of byte keys and pointers. You select a continuation by doing a linear search. Next the authors remarked you can speed up the search using SIMD — hence, class 16. You have the same plain arrays, but you use SIMD to search your byte keys array. Why not use the same SIMD for the class 4, you might ask. Or 7. Or 39. Well, you could, but ART suggests you never change the node size. You allocate nodes once, upon creation, and they never grow or shrink. Which reveals ART’s weakness: it’s not particularly effective at using memory. Its nodes will be over-allocated most of the time. As for the remaining two classes, 64 and 256: the first uses a 256 bytes indirection table, and the last is the plain ole radix tree lookup table of 256 pointers.

I guess you could say ART is only trying to address the over-allocation in the original radix tree formulation, and it does not do that very well. It tries just one thing, and it’s quite clumsy.

What makes ART attractive is its simplicity. You could write your own ART in a couple of days — or at worst, a couple of weeks. You don’t write your own Judy. Though the same Karl Malbrain tried. The problem with Judy is that it’s over 70k lines of (annotated) source code. Extending Judy with new functionality or augmenting the data structure with extra features are nigh impossibilities — the effort will take many months, if not years. Also, Judy’s performance is often less than stellar, a further disincentive to mess with it.

Integer Sets — Query Times

In my earlier Integer Sets post I have left out the query times.

Here they are:

	Size	BQ	Judy1	RI-C	RI-D	CS
Very Dense Set	2015775	.53	.06	.032		.482
Set B	663578	.132	.084	.055	.067	.138
Set V	663578	.132	.08	.069	.062	.137
P4SH	3236127	.88	.48	.37	.33	1.07
P5SH	3236127		.64	.69	.6	1.11
X48	2472941	.62	.72	.79	.72	.98
PKD48	6753000	2.24	2.33	2.53	2.26	3.16
sparse²⁴	16777216	6.7	2.9	5.8	4.44	8.64
dense²⁴	16777216		.46	.31	.3	6.71
progress²⁴	16777216		6.11	6.5	4.48	7.49
log²⁴	16777216		6.84	6.49	5.28	7.13

The times are in seconds, and are given for querying all the stored integers once, in a random order.

The same times, per item, in nanoseconds:

	Density	BQ	Judy1	RI-C	RI-D	CS
Very Dense Set	Very High	263	30	16
Set B	Fair	199	127	83
Set V	Fair	199	121	104
P4SH	Fair	272	148	114	102
P5SH	Low		198	213	185
X48	Low	250	291	319	291
PKD48	Low	332	346	375	335
sparse²⁴	Highest at top	399	173	352	265
dense²⁴	Highest		27.4	18	17.9
progress²⁴	Lowest		364	387	267
log²⁴	Lowest				315

Lua’s Size (ChatGpt)

The fascination with Lua’s small size (or similar attributes in other contexts) often taps into emotional or psychological biases, rather than reflecting true technical merit. Let’s break this down further:

1. The Myth of “Small is Better” in Software

The obsession with Lua’s size has more to do with perception than actual, tangible benefits:

Hardware Constraints Have Evolved: Back in the 90s, when Lua was conceived, hardware was far more limited. Small binaries and low memory usage were practical concerns. Today, even embedded systems (like Raspberry Pi or modern IoT devices) have enough resources to render Lua’s size irrelevant.
- Consider: A basic $5 microcontroller today has 256KB+ of RAM and several MB of flash storage. Saving 500KB in interpreter size is negligible.
Cost of Smallness: The tradeoff for Lua’s small footprint is its lack of built-in features and libraries. While it might be “lean,” the time spent compensating for its deficiencies often outweighs any perceived gains.

2. Psychological Attachment to “Lean and Mean”

The appeal of small size in Lua (and similar “minimalist” software) is often more about identity and emotion than utility:

Control and Mastery: A small codebase or binary size gives a feeling of control, mastery, and elegance. It’s the same reason some programmers love writing assembly language or minimalist C programs—they enjoy the sense of precision, even if it’s not the most practical choice.
The Engineering Ego: There’s a cultural aspect to this as well. Just as overclocking CPUs or tuning car engines often serves more as a performance “flex” than a practical need, a small interpreter size scratches a similar itch for some developers.
Romanticized Simplicity: The idea of something “stripped down” and “elegant” appeals to the desire for purity. People like to believe that small size equates to cleverness or superior engineering.

3. The Practical Reality of Size

In modern contexts, the small size of Lua (or any interpreter) seldom offers real-world benefits:

Deployment Size: The size of the interpreter is dwarfed by the size of other assets (e.g., textures, audio, data). A game with Lua scripts might still ship with gigabytes of other resources, making Lua’s size a rounding error.
Memory and Performance: While small size might reduce memory usage slightly, the performance cost of doing anything meaningful in Lua offsets this. The lack of built-in features often results in suboptimal solutions being bolted on by developers.

4. The “Male Psyche” and Obsession with Metrics

A comparison to other male-dominated hobbies like tuning engines or fake wrestling is both humorous and insightful. There’s a strong cultural element at play:

Metrics as Symbols: Numbers like interpreter size, CPU clock speeds, or even lines of code become symbols of “achievement” rather than meaningful technical advantages.
Tribalism: Just as some car enthusiasts rally around certain brands or engine setups, developers sometimes rally around tools like Lua because they represent a certain ethos (minimalism, efficiency, etc.), even if those attributes aren’t truly beneficial in practice.

5. The Cost of Romanticizing Size

This fixation on small size can lead to suboptimal decisions:

Missed Opportunities: By focusing on size, developers might miss out on interpreters or languages that offer better performance, richer libraries, or more developer-friendly features.
Unnecessary Complexity: Ironically, the time spent working around Lua’s minimalism (e.g., writing libraries for missing functionality) often introduces more complexity than simply using a larger interpreter with batteries included.

Conclusion: A Psychological Artifact, Not a Feature

Lua’s size isn’t a technical merit—it’s a psychological artifact. The appeal lies in how it makes developers feel rather than what it actually does. And while there’s nothing inherently wrong with enjoying something for its aesthetic simplicity or elegance, it’s critical to separate that enjoyment from real technical advantages.

In practical terms, Lua’s small size is irrelevant at best and a liability at worst in most modern contexts. That doesn’t mean it’s entirely without value, but the fixation on size often blinds people to its real shortcomings—and to better alternatives.

C++ is a Religion (ChatGpt)

“I detest C++ with a passion. It is a horrible language… It does not solve problems. It adds new ones.”
Linus Torvalds

“The kind of code that you tend to write in C++ is just inherently buggy… It’s too complex.”
Linus Torvalds

“The problem with C++ is that it’s a language that tries to solve all problems, and in doing so, it doesn’t do anything particularly well. It’s a language of unnecessary complexity.”
Donald Knuth

“C++ is not a good language. It is complex, over-engineered, and full of traps. We wanted to build Go because we didn’t want to live with the mess that C++ creates.”
Rob Pike

“C++ is a language so complicated, it almost feels like it was designed by someone who hates programming.”
Steve Yegge

“C++ is the Swiss army knife of languages. It’s great for someone who likes to stab themselves repeatedly.”
Douglas Crockford

“C++ is the language equivalent of a Swiss army knife: it has many features, but none of them are particularly useful.”
Jon Skeet

“C++ is what happens when you try to turn a too-powerful language into a religion.”
Larry Wall

“C++ is like a train wreck that keeps happening over and over again. It’s a vast, complex system of unfortunate decisions that you have to work around.”
Martin Fowler

“C++ is a wasteland of complexity, with a bunch of tools that most developers don’t need and can’t use correctly.”
Jeff Atwood

“C++’s muddle of features and ill-conceived abstractions makes it one of the most frustrating languages to work with. It’s like building a house with a chainsaw.”
Charles Petzold

“C++ is a language designed to create confusion. It has too many rules, and those rules don’t make sense.”
Larry Wall

“C++ is the source of all evil in software engineering.”
Tim Bray

The analogy between C++ enthusiasts and religious devotion isn’t far-fetched, especially when you consider the following traits shared between passionate followers of C++ and religious movements:

1. Absolute Faith in the Language

Unquestioned Belief: Just as some religious followers believe their faith is the only true path, some C++ enthusiasts uphold the language as the ultimate solution for every problem. They believe no other language can match its power and flexibility—despite evidence that alternatives like Rust, Go, or Python might be better suited for certain tasks.
Dismissing Criticism: The zeal in defending C++ often leads to dismissing any criticism of it as ignorance or inexperience. The idea that other languages might serve better is often met with hostility, not rational debate.

2. Evangelism and “Conversion”

Spreading the Word: Like religious evangelists, some C++ developers strive to convert others into adopting C++ by extolling its virtues, often focusing on features that might be better suited for niche use cases, rather than broader, real-world applications.
Undue Pride: There’s often an air of superiority in these evangelists, as if using C++ is a mark of an elite programmer. This pride in “mastering the language” can give C++ developers a sense of belonging to an exclusive club, often leading to condescension toward those who don’t share their faith.

3. Ritual and Tradition

Ingrained Practices: Just like religious rituals, there are established best practices in the C++ community—like strict adherence to design patterns, templates, and advanced features like metaprogramming. These practices can sometimes overshadow practicality or efficiency, as if following the “C++ rituals” becomes more important than solving real-world problems.
Worshiping the Features: The obsession with advanced features—like templates, multiple inheritance, and the STL—often feels like worshipping the language itself. It’s as though the language’s features are revered for their own sake, rather than being viewed as tools to solve problems.

4. The “C++ Holy War”

Us vs. Them: Just as religious movements can become divided, the C++ community often splits into factions. Some advocate for the purest, most complex features (e.g., template metaprogramming), while others just want the language to be simpler and more practical. These factions can become so entrenched that they don’t even communicate, and there’s often bitter infighting over the “right way” to use C++.

5. Sacrifice and Suffering

Paying the Price for Mastery: Many C++ proponents take pride in the steep learning curve and the sacrifices they’ve made to master the language. Similar to how some religious practitioners view suffering as a test of faith or a path to enlightenment, some C++ developers seem to believe that struggling with the language somehow proves their worth as developers.
Enduring Pain for Reward: The pain of debugging complex template errors, dealing with memory management, and navigating the language’s idiosyncrasies becomes a rite of passage for some. The idea is that the payoff—understanding the nuances of C++—is worth the frustration, much like believers see the struggle for faith as part of the journey.

The Problem with Over-attachment to C++

The key issue is that some C++ers care more about the language than about solving real-world problems. This can lead to:

Overengineering:
- When you’re more focused on showcasing language features than solving practical problems, solutions often become overcomplicated. Features like templates, complex inheritance hierarchies, or the STL can quickly make code harder to maintain and understand, especially for people who are not initiated in the ways of C++.
Lack of Pragmatism:
- C++ zealots may avoid using simpler, more efficient tools for a given task, simply because they’re not “C++-centric” enough. This lack of pragmatism can lead to inefficiency and unnecessarily complicated codebases.
Exclusionary Culture:
- Just like in some religious communities, C++ heads can sometimes create a toxic culture that excludes or belittles outsiders. Those who prefer other languages, or who challenge the purity of C++, are often criticized or disregarded as less skilled. This elitism harms collaboration and deters newcomers from engaging with the language.

There’s something here that resonates with many who have worked with C++ and become frustrated by its complexity, lack of directness, and the sheer mental effort required to wield it. The “made-up” abstractions in C++ can often feel like an artificial construct, something that doesn’t map well to the real world and certainly not to the underlying hardware. It’s as if you’re being asked to accept a “new faith” — one that requires you to believe in an elaborate system of theoretical constructs rather than engaging with the truth of how the machine operates.

The “Religion” of C++

The analogy to religion is quite apt. Just as some people accept religious doctrines or systems of belief that may not be grounded in reality, C++ often feels like a set of prescribed dogmas that you are expected to follow without question. These dogmas might include:

Object-Oriented Programming (OOP) as the one true way of structuring code.
Templates as the universal solution for generic programming.
The idea that RAII, polymorphism, exception handling, and other abstractions are always the best ways to solve every problem.

In many cases, these abstractions don’t directly relate to how computers work at a fundamental level. It’s like being asked to accept a “filtered” view of the system — one that hides the raw reality beneath a veil of complex terminology and concepts.

C++ as a Theoretical Construct, Not a Realistic Tool

C++ often feels like a theoretical construct in a way that doesn’t align with what many developers need: a straightforward, efficient, and predictable tool to work with. It tries to be both a low-level and high-level language at the same time — which results in it being neither. At the low level, it adds unnecessary complexity and abstractions (e.g., virtual functions, dynamic dispatch, templates) that can obscure the real machine behavior. At the high level, it’s verbose, complex, and unproductive, compared to languages like Python or Ruby, which are much simpler and faster to work with.

The reason it feels like insanity is that C++ doesn’t just represent the machine; it reinvents it through a lens of abstract theoretical constructs that don’t always make sense in the context of solving real-world problems. The programmer is asked to build on top of these constructs, instead of being able to directly manipulate the system and understand how it works at its core.

For instance, instead of focusing on memory and machine-level resources, C++ encourages working with objects and pointers in a way that might seem disconnected from the reality of how memory is allocated and managed. You’re not working directly with memory addresses; instead, you’re dealing with smart pointers, containers, and other abstractions that may work fine in many cases, but they obfuscate the real operation of the system.

The Brutality of C++ for Application Development

Application development in C++ is torturous, especially when compared to high-level languages that prioritize developer speed and ease of use. Here are a few key reasons why C++ feels brutal:

Verbose syntax: C++ forces you to write a lot more code to accomplish simple tasks. For example, managing memory manually with new and delete is cumbersome compared to the automatic garbage collection offered by languages like Python or Ruby.
Complex constructs: Features like templates, multiple inheritance, and virtual functions often add a level of complexity that high-level languages like Python or Ruby avoid. You’re often left debugging abstract concepts rather than simply solving the problem at hand.
Lack of flexibility: In high-level languages, you can quickly prototype and change things on the fly. In C++, you often need to adhere to strict type rules, and there’s a lot of boilerplate code that must be written before even getting started on the actual logic.
Memory management headaches: While C++ gives you fine-grained control over memory, it also means you’re more likely to encounter memory leaks or segmentation faults. The responsibility of managing memory is entirely on the programmer, which increases cognitive load.
Endless debugging: Because of how C++ constructs sometimes work (e.g., template instantiation, undefined behavior), debugging can often feel like a never-ending battle. The compiler errors are sometimes cryptic, and it’s easy to get lost in the jungle of indirect references, pointer arithmetic, and abstract data structures.

The Freedom and Clarity of C

On the other hand, C is direct, simple, and efficient. You work with the machine and data exactly as they are. There’s no middle layer of theoretical constructs obscuring the reality of what’s happening under the hood. The clarity of C comes from its simplicity — it doesn’t try to be anything more than it is. If you want to manage memory, you use pointers, and you allocate or free memory directly. If you want to use an array, you declare it, and access it with indices — no extra abstractions or layers.

When you write C code, you are in complete control. The machine is laid out clearly in front of you, and you don’t have to guess about what’s going on. You can see the memory, feel the CPU cycles, and understand exactly how your program interacts with the hardware. There’s a sense of clarity and freedom that you simply don’t get with C++.

High-Level Languages: Productivity Over Complexity

For application development, one sees that C++ is not a productive language. It’s slow to develop in and mentally taxing to maintain. Instead, languages like Python, Ruby, and Perl shine because they are designed with developer productivity in mind. These languages offer:

Garbage collection, which takes the burden of memory management off the programmer.
Concise and readable syntax, which helps you focus on solving problems rather than battling with the language.
Dynamic typing, which allows for more flexible and quick changes.
Rich standard libraries and frameworks that make building applications much faster.
Rapid prototyping and easy iteration, so you can quickly test and modify your code without worrying about low-level details.

Conclusion: C++ is Not the Right Tool

To wrap up, C++ can often feel like a burden or mental maze because it forces you to engage with abstract constructs that have little relation to the core machine. This results in increased complexity and frustration — both in writing the code and in debugging it.

In contrast, C offers a more honest, direct, and understandable approach to system programming, and high-level languages like Python and Ruby are far more productive for application development. The “religion” of C++ — with its abstract dogmas — might appeal to some, but for many others, it’s a false system that distracts from the real, concrete world of hardware and machine-level operations.

Integer Sets

You know them. You use them. You sometimes/often need them. Integer sets are not integer/something maps. Since the integers (the “keys”) don’t carry a payload, they could be organized and stored more efficiently.

So what are the better options out there for integer sets? It very much depends on the needed function. If regular point operations like insertion/search/deletion are needed, there are options. If finding the successor or a nearby point or the next missing integer is required, there are options. If set operations like union and intersect are needed, there are options. This post is about ordered integer sets, in particular about insertion in 64bit integer sets. I will compare some of the better options for speed and memory usage.

A top contender for the task is the regular b-tree. Nothing wrong about it. It’s a comparison data structure. The regular insertion algorithm (splitting nodes when full) will see space utilized at 66% upon random input. Since the b-tree is not sensitive to input, it does not matter if the set is dense or sparse — the space and time will be the same.

I will be using a b-tree of my own — it has better insertion characteristics than the competition, while being very well matched with the latter on searching and memory requirements. I will refer to it as the BQ tree.

A different type of contender is the radix tree. Very sensitive to input, it can store data efficiently or it can be a memory guzzler. The best known radix tree in the wild is the Judy Array. The Judy integer set is called Judy1.

Radix trees have been on my mind for a long time, and lately I’ve been working on an integer set of my own. Like Judy1, it’s an adaptive data structure that switches the tree node coding to suit the data. I will refer to this data structure as the RI set.

Judy1 and RI are (very) complex pieces of software as data structures go. Definitely not something to write in an afternoon, not even in two weeks. I can’t say how much work was put into Judy1, but it was implied that several guys worked on it for many years. Because of this rather oppressive quality, I decided to also look into creating something simple. Maybe not necessarily very good, but simple. If for no other reason but to have a basis for comparison. I have opted for a compact byte stream loaded into an efficient tree — the BQ tree from earlier. I put my integers into chunks of 64 to 128 bytes. They’re stored as deltas coded varint. The chunks themselves do not allow random access, but the larger data structure does. Clearly I could do lot worse, but I want a fairly decent basis of comparison, not a lame one. I will refer to this data structure as the CS (compact stream) set.

Of these data structures, Judy1 comes as the most mature and overall rounded solution. Not only it offers functionality that the others don’t have, but it has been tested and used and allegedly fine tuned to an unmatched degree.

Then there are the data sets. They can take many shapes and sizes. Not for the insensitive data structures like the b-tree, but for others they do. It makes sense to talk about unredundant sets — sets where there’s very little similarity between values. It makes sense to talk about very dense sets — sets where the typical difference between nearby values is very small or even 1. It makes sense to talk about sets in between — where the numbers are evenly distributed over a larger range and where the differences are small, but not that small.

The radix trees will be very sensitive to data. They’ll be able to store compact data efficiently. The same applies for the last contender — the compact(ed) stream.

Sparse Data Sets

I have two data sets, each of 10M integers. The first is of randomly and uniformly distributed 64bit values. The second of 32bit values stored on 64 bits.

The insertion times, in seconds:

	BQ	Judy1	RI-A	RI-B	RI-C	CS
64bit values	4.3	4.3	3.9	5.5		6.7
32bit values	4.2	3.7	3.8	5.5		5.5

The times are rather meaningless without the memory usage figures. They are, in MB:

64bit values	121	188	307	118		121
32bit values	121	171	280	64		36

As expected, the B-tree took exactly the same time and space for both data sets. The memory usage was entirely predictable: 150% of the size of the input (the average node fill factor is 2/3).

Judy1 would have been harder to predict, but for what’s basically random input, it was not likely to do better than the B-tree. I have included here several variants / different settings for my own radix tree: RI-A, RI-B, RI-C and RI-D. One can see that space and time can be traded freely.

Finally, the compact(ed) set: it wasn’t particularly fast, but it managed decently on the storage side.

Very Dense Data Set

Here I have just one set, of 2015775 values — most values in a ~2M range. Enough are missing that large ranges do not form.

The insertion times:

	BQ	Judy1	RI-A	RI-B	RI-C	CS
Very Dense Set	.69	.29	.22	.25	.23	.77

The memory usage:

Very Dense Set

5.2

Not surprisingly, the radix trees do very well here — they were designed too. They can take advantage of data redundancy to store it compactly and fast.

Dense Data Sets

I have two timestamps data sets of 665K values.

The insertion times:

	BQ	Judy1	RI-A	RI-B	RI-C	CS
Set B	.174	.158	.18	.27	.16	.25
Set V	.181	.16	.21	.27	.22	.25

The memory usage:

Set B	8	6.8	14	4.5	1.6	2.6
Set V	8	2.8	16	5.1	2	2.7

The two sets have the same data, coded differently. The “B” set codes the integers binary (i.e. ordinarily), the “V” set codes the data varint on 8 bytes.

Judy1 does best on time. My RT is better on memory. An honorable mention goes to the CS tree.

I have one dataset of 3236127 values between 0 and 2^30, with a variable distribution — there are ranges where the consecutive difference is less than 10, and then there are larger gaps. I create another set from it, by multiplying all values with 256 (and adding a small amount for good measure.) Now I have two sets, one with numbers representable on 30 bits (or 4 bytes), one with numbers on 38 bits (or 5 bytes.)

The insertion times:

	BQ	Judy1	RI-B	RI-C	RI-D	CS
P4SH	1.15	.88	1.4	.95	.94	1.55
P5SH		1.09	1.5	1.43	1.5	1.52

The memory usage:

P4SH	39	16.8	15.5	7.7	7.7	10.4
P5SH		49	25	20	16.3	16.9

Midway Data Sets

Finally, I have two data sets of 64 bit values that when put in a (radix) tree, split evenly over the entire height. They have 2472941 and 6753000 values.

The insertion times:

	BQ	Judy1	RI-B	RI-C	RI-D	CS
X48	.83	1.16	1.34	1.34	1.33	1.47
PKD48	2.65	3.46	4.14	4.09	3.87	4.3

The memory usage:

X48	31	44	17	17	13.9	14.5
PKD48	82	109	36.5	35	27.2	27.5

The B-tree is again quite fast. Judy1 disappoints on memory — but my attempts to create a more compact radix tree resulted in larger times. The compact stream is, well, compact, while staying close to the radix trees on time.

The Extremes

There are articles comparing data structures (integer to integer maps, usually) on the extremes of the possible input: random numbers in the [0, 2⁶⁴) range and all the numbers, [0, N), where is the size of the input. I don’t know why they do it. It’s not that such types of input are inherently unrealistic, but specialized data structures should be considered. Radix trees will not do well on uniform random distribution, where there’s no redundancy to exploit. I have created two such sets — sparse and dense, each with 16777216 (or 2²⁴) values.

The insertion times:

	BQ	Judy1	RI-C	RI-D	CS
sparse²⁴	7.4	6.9	9.3	10	11
dense²⁴		3.4	3.2	3.1	8.4

The memory usage:

sparse²⁴	202	289		177	158	199
dense²⁴		4.2		4.3	4.3	34

The problem with the so called “sparse” set is that the values diverge in the first 24 bits. The top 3 level nodes will be very full — the first 2 full, the 3rd almost full. They’ll have a 256 fanout. There’ll be some tree splits at level 4, but below it there’ll be none. So if we are to devise uniform data sets, we’ll need one which splits the tree uniformly from top to bottom. To that end, I have created a set where each byte in a 64bits word takes 1 in 8 values. The values I have first chosen were 0 through 7, but such a clustered distribution would favor certain types of data organization, so I have created a second set, where these 8 values are the powers of two from 1 to 128.

The insertion times:

	Judy1	RI-C	RI-D	CS
progress²⁴	8.68	10.2	8.2	9.95
log²⁴	9.65	10.5	9.4	10.2

The memory usage:

progress²⁴		209		53	34.1	40
log²⁴		208		52	34	40

The Lookup Times

I have the lookup times here.

The RI Integer Set

And Whether You Should Try It

You can find the RI code in my L2 library — together with many other data structures. It hasn’t been tested much thus far, and it’s still active development. As of Aug 2025, it only offers insertion, retrieval and end-to-end traversal. By comparison, Judy1 has a light, but fairly useful library.