Elias–Fano is a way to encode sorted sequences of integers. It promises compression, and unlike many other compression schemes, it also offers fast random access. That combination alone makes Elias–Fano an interesting candidate for a general-purpose adaptive integer set.
A well-known example of such a structure is the Judy Array. Judy is an adaptive data structure that claims to deliver fast insertions, quick lookups, and compact memory usage. Unfortunately, as I’ve shown in earlier posts, Judy doesn’t quite live up to the “compact” part—it can use up to 16 bytes per key on many inputs. My own RI integer set does better on the memory usage side.
Now, adaptive radix tree–based integer sets (like Judy or RI) could take advantage of Elias–Fano for encoding their data. In a tree, these encodings would typically live at the fringe nodes, where the remaining key space is small—say, the last two or three bytes of the key. For those small universes, Elias–Fano starts to look really attractive.
But there’s a catch. Classic Elias–Fano isn’t editable. Once you’ve built the structure, you can’t just insert a new number without rebuilding or shifting things around—pretty much the same problem you have with a plain sorted array.
There’s another issue too: that so-called “fast access.” In practice, searching for a value means first walking through the high bits array (HBA), then jumping into the low bits array (LBA) for what you hope will be a short scan. Without an additional index, there’s nothing particularly fast about the HBA itself. So a lookup usually ends up touching three layers: the index (if you have one), then the HBA, then the LBA. Not exactly blazing.
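To make that concrete, take classic Elias–Fano on a toy set: four keys {17, 62, 134, 215} from an 8-bit universe, encoded with 6 low bits per key. The high parts (the top two bits of each key) are 0, 0, 2, 3, so the HBA stores the bucket counts 2, 0, 1, 1 in unary as 110 0 10 10, and the LBA stores the four 6-bit low parts 17, 62, 6, 23. To look up 134 (high part 2, low part 6), you skip past two zeros in the HBA to reach bucket 2, note that two ones precede it, and then scan the LBA from index 2 for the low value 6. Even in this toy case the search touches two separate arrays, and at realistic sizes you need an index on top just to avoid scanning the HBA from the beginning.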
Several schemes have been proposed in the literature to make Elias–Fano more friendly to updates. Most of them rely on partitioning the encoded stream into smaller segments. However, the approaches I’ve seen aren’t designed for hyper-dynamic, general-purpose data structures. They’re usually aimed at more specialized use cases. Still, partitioning the Elias–Fano (EF) stream is a natural starting point for building a dynamic integer set.
A trie-based structure—like Judy or RI—already divides the representable universe into smaller chunks. Each byte of a key consumes one level in the trie, and every level splits the remaining space into 256 subspaces. For example, a 64 K universe at the fringe might hold anywhere from 0 to 65,536 integers. You could represent that range as a single node or as a two-level tree. This hierarchy naturally partitions both the data and any EF stream that encodes it. It’s easy to imagine those fringe nodes being encoded as Elias–Fano blocks—each one a single, compact, monolithic chunk. At first glance, there doesn’t seem to be much reason to subdivide them further.
But there are reasons.
Recall that we need an index into the high bits array (HBA) to enable faster access. That index effectively splits the EF stream into smaller logical parts. Also, shifting memory byte-wise is far cheaper than doing it bit-wise—which matters when updates are frequent. Each time you insert or remove an element, the EF block will likely need re-encoding, and large monolithic blocks are expensive to rewrite.
On top of that, Elias–Fano doesn’t compress large populations very efficiently. Its effectiveness depends on the density of the set, and it doesn’t adapt well as that density changes. Taken together, these observations suggest that dividing an EF stream—within each tree node—into smaller sub-blocks could significantly improve access speed, update performance, and space utilization.
Let’s now look at a more concrete Elias–Fano (EF) scheme for a trie node.
We’re still in the context of an adaptive trie—one that can switch between multiple encodings and always pick the most efficient option for its current contents. Our new compact encoding must therefore coexist and interoperate with other node types, allowing smooth transitions both to and from them.
Since this new node type is compact by design, let’s give it a name: the C-node.
Structure of a C-node
In this design, a C-node’s total size is capped at a modest 256 or 512 bytes—small enough for fast scans and cheap shifts when editing.
Each C-node stores, contiguously, a sequence of coding units, primarily Elias–Fano blocks (EF blocks). Each EF block is a self-contained Elias–Fano stream that covers a range of consecutive byte universes.
Here, the whole key space is divided into byte universes, each corresponding to a range of 256 integer values. A single EF block can therefore represent one or more byte universes—it might cover one, two, ten, or twenty of them—depending on how densely populated those ranges are. Importantly, no individual byte universe is ever split between two blocks.
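In code, identifying a key’s byte universe is trivial; here is a small helper sketch, assuming 16-bit keys local to the C-node (the names are mine, not part of the design):

#include <stdint.h>

/* The byte universe of a 16-bit key is its high byte; the low byte is the
   key's offset inside that universe. */
static unsigned byte_universe(uint16_t key)   { return key >> 8; }
static unsigned universe_offset(uint16_t key) { return key & 0xFF; }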
For speed, the high bits array (HBA) of each EF block is designed to fit entirely within a single 64-bit machine word. Elias–Fano requires at least one bit per integer in the HBA, but without a few extra bits, compression quickly loses efficiency. In practice, HBAs typically use two or three bits per key, though the ratio can vary without breaking the encoding.
Given these constraints, we’ll limit each EF block to at most 32 keys (i.e., integers): at two bits per key, that is exactly what a 64-bit HBA can hold.
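Here is a minimal sketch, in C, of what a membership test against such a block could look like. It assumes the HBA occupies a single 64-bit word, the LBA is a packed array of low_bits-wide fields stored next to it, and the search key has already been reduced to the block’s local range; the struct and the helper names (ef_block, read_lba, ef_contains) are purely illustrative.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative in-memory view of one EF block; in the C-node itself these
   fields sit packed contiguously in the node's byte stream. */
typedef struct {
    uint64_t hba;        /* high bits array, at most 64 bits                */
    const uint8_t *lba;  /* low bits array: n_keys fields of low_bits each  */
    unsigned low_bits;   /* low bits stored per key (at most 15)            */
    unsigned n_keys;     /* number of keys encoded in the block             */
} ef_block;

/* Fetch the i-th low_bits-wide field from the packed LBA (assumes the buffer
   is padded so the 3-byte read never runs past allocated memory). */
static unsigned read_lba(const ef_block *b, unsigned i)
{
    unsigned bit = i * b->low_bits;
    uint32_t w = (uint32_t)b->lba[bit / 8]
               | (uint32_t)b->lba[bit / 8 + 1] << 8
               | (uint32_t)b->lba[bit / 8 + 2] << 16;
    return (w >> (bit % 8)) & ((1u << b->low_bits) - 1u);
}

/* Membership test: walk the HBA until we reach the key's bucket, then scan
   that bucket's low values in the LBA. */
static bool ef_contains(const ef_block *b, unsigned key)
{
    unsigned hi = key >> b->low_bits;
    unsigned lo = key & ((1u << b->low_bits) - 1u);
    unsigned bucket = 0, idx = 0;              /* zeros and ones seen so far */

    for (unsigned pos = 0; pos < 64 && idx < b->n_keys && bucket <= hi; pos++) {
        if (b->hba >> pos & 1) {               /* one key in the current bucket */
            if (bucket == hi && read_lba(b, idx) == lo)
                return true;
            idx++;
        } else {
            bucket++;                          /* bucket terminator */
        }
    }
    return false;
}

Nothing here needs more than a linear pass over one machine word plus a short stretch of the LBA.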
Handling Dense Byte Universes
Of course, some byte universes will be more densely populated than that. To handle those, the C-node includes a secondary block type: bitmaps.
Each bitmap represents exactly one byte universe and can encode populations anywhere from 32 up to 256 keys.
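Querying a bitmap unit is even simpler; a sketch, assuming the 256-bit bitmap is laid out as four 64-bit words:

#include <stdbool.h>
#include <stdint.h>

/* Membership test against a bitmap unit: one bit per value of its byte
   universe, 256 bits (32 bytes) in total. */
static bool bitmap_contains(const uint64_t bm[4], uint8_t low_byte)
{
    return (bm[low_byte >> 6] >> (low_byte & 63)) & 1;
}

Only the key’s low byte matters here; which byte universe the bitmap covers is recorded separately, in the unit’s lead key described below.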
This dual system—EF blocks for sparse ranges and bitmaps for dense ones—lets the node adapt naturally to local density. Sparse areas remain compact and quickly searchable, while dense regions use bitmaps for direct representation. The result is a space-efficient, cache-friendly, and still editable node structure.
Block Identification and Lead Keys
Since blocks in a C-node are scanned linearly, each block must be identifiable during traversal. To make that possible, every block is assigned a lead key—a small header value that marks its position in the overall key space.
For bitmap blocks, the lead key consists of the high eight bits identifying the byte universe within the larger 64 K universe of the C-node.
For Elias–Fano (EF) blocks, the lead key is simply the first key in the block’s encoded population. This lead key is stored in full, before the Elias–Fano stream proper, and is not encoded again within the stream itself.
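For a concrete 16-bit key such as 0x3A7F: a bitmap unit covering its byte universe carries the lead key 0x3A, while an EF block that starts at that key stores the full value 0x3A7F ahead of its encoded stream.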
The previously mentioned 32-keys-per-block limit applies only to the Elias–Fano–encoded integers—it doesn’t include the lead key. Note also that Elias–Fano doesn’t make sense for a single integer; there must be at least two values in a valid EF stream.
Scrap Arrays: Handling the Odd Cases
These details introduce one more special case, and thus another coding unit type: the scrap array.
A scrap array holds exactly one or two keys—those that, for one reason or another, couldn’t be absorbed into neighboring blocks. One such situation occurs when both adjacent blocks are bitmaps, but the stray key or keys belong to a different byte universe and can’t fit into either.
Scrap arrays handle these edge cases cleanly, preventing small fragments of data from forcing unnecessary block splits or wasted space. In practice, they act as tiny overflow buffers that preserve the node’s compact layout while still keeping all keys reachable and in order.
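A scrap array is about as simple as a coding unit can get; here is a probe sketch (the helper name and the explicit count parameter are illustrative, since in the real layout the count falls out of the unit’s type, as described below):

#include <stdbool.h>
#include <stdint.h>

/* Membership test against a scrap array: one or two 16-bit keys stored
   verbatim, in sorted order. */
static bool scrap_contains(const uint16_t *keys, unsigned count, uint16_t key)
{
    return keys[0] == key || (count == 2 && keys[1] == key);
}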
Describing Coding Units Efficiently
Each coding unit in a C-node—whether it’s an EF block, a bitmap, or a scrap array—needs to be identified not just by the range it covers (given by its lead key), but also by its type and size.
We make a somewhat artificial distinction between two categories:
- Blocks — the “desirable” coding units, which can be EF blocks or bitmaps.
- Arrays — auxiliary, “unwanted” coding artifacts that exist only to handle special cases.
Naturally, we’d like to store as little metadata as possible. Storing the type and size directly for every unit would be wasteful. Fortunately, we can infer most of it.
Implicit Sizes
For bitmaps and arrays, the size is implicit—no extra data is needed.
For EF blocks, however, the size must be computed. For that, we use the information that we must store anyway:
- The number of low bits per key
- The number of coded keys
There’s no practical way to get by with less; these two values are the minimum sufficient descriptors. Knowing the number of keys, we can inspect the high bits array (HBA) and derive its size. The size of the low bits array (LBA) is just the product of the two numbers.
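As a sketch, assuming the HBA is stored truncated right after its last set bit (which is what the select_HBA step in the scan algorithm below suggests), the size computation could look like this; the helper names are illustrative:

#include <stdint.h>

/* Length of the HBA in bits: the position just past its n_keys-th set bit.
   With the HBA confined to one 64-bit word, a plain scan (or a word-level
   select trick) is enough. */
static unsigned hba_bits(uint64_t hba, unsigned n_keys)
{
    unsigned ones = 0, pos = 0;
    while (ones < n_keys && pos < 64) {
        if (hba >> pos & 1)
            ones++;
        pos++;
    }
    return pos;
}

/* Total payload of an EF block, derived from the two stored descriptors. */
static unsigned ef_block_bits(uint64_t hba, unsigned low_bits, unsigned n_keys)
{
    return hba_bits(hba, n_keys) + n_keys * low_bits;   /* HBA + LBA */
}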
Encoding the Unit Type
That leaves us with the question of how to identify the type of coding unit.
Since the two EF block descriptors (the low bits count and the key count) are encoded using a fixed number of bits, some combinations are improper. Because keys are 16-bit numbers, there can’t be 32 keys in a block coded with 15 low bits: Elias–Fano picks roughly log2(u/n) low bits for n keys from a universe of size u, so 15 of them would imply a universe of at least 32 × 2^15 = 2^20 values, far more than 16-bit keys can span.
These disallowed combinations can be repurposed to encode the types of non-EF units, such as bitmaps and arrays.
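As an illustration, expressed over the decoded descriptor values, the classifier could look like the sketch below. The concrete reserved pairs are my own picks for the example, not something the scheme dictates; any combination that a real EF block can never produce will do.

/* Hypothetical mapping of impossible (low_bits, n_keys) pairs to unit types.
   In a 16-bit universe, 32 keys can never legitimately be coded with 15, 14,
   or 13 low bits, so those pairs are free to serve as type tags. */
typedef enum { UNIT_EF, UNIT_BITMAP, UNIT_ARRAY1, UNIT_ARRAY2 } unit_type;

static unit_type classify_unit(unsigned low_bits, unsigned n_keys)
{
    if (n_keys == 32 && low_bits == 15) return UNIT_BITMAP;
    if (n_keys == 32 && low_bits == 14) return UNIT_ARRAY1;   /* one stray key  */
    if (n_keys == 32 && low_bits == 13) return UNIT_ARRAY2;   /* two stray keys */
    return UNIT_EF;
}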
This way, every coding unit—whether block or array—can be identified and parsed without any dedicated type field, keeping the C-node’s metadata compact and efficient.
Algorithm 1: Scanning and Searching a C-Node
Input: search_key
Output: membership status
1: while not at end of node do
2:     lead_key ← next 8 bits
3:     low_bits ← next 4 bits
4:     n_keys ← next 5 bits
5:
6:     if valid_EF_combination(low_bits, n_keys) then
7:         hba_size ← select_HBA(n_keys)
8:         lba_size ← n_keys × low_bits
9:         size ← hba_size + lba_size
10:
11:        if search_key ∈ EF_block_range(lead_key) then
12:            scan_EF_block()
13:            exit search
14:        end if
15:
16:    else
17:        type ← classify_unit(low_bits, n_keys)
18:        switch type do
19:            case ARRAY:
20:                if search_key ∈ array_range(lead_key) then
21:                    scan_array()
22:                    exit search
23:                end if
24:                size ← implicit_array_size()
25:                break
26:
27:            case BITMAP:
28:                if search_key ∈ bitmap_range(lead_key) then
29:                    search_bitmap()
30:                    exit search
31:                end if
32:                size ← implicit_bitmap_size()
33:                break
34:        end switch
35:    end if
36:
37:    advance_pointer(size)
38: end while
Note that only the high eight bits of the lead key are needed to evaluate whether the search key can fall within a unit’s range.