You know them. You use them. You sometimes, maybe often, need them. Integer sets are not integer-to-something maps. Since the integers (the “keys”) carry no payload, they can be organized and stored more efficiently.
So what are the better options out there for integer sets? It very much depends on the needed functionality. If regular point operations like insertion/search/deletion are needed, there are options. If finding the successor, a nearby point, or the next missing integer is required, there are options. If set operations like union and intersection are needed, there are options. This post is about ordered integer sets, in particular about insertion into 64-bit integer sets. I will compare some of the better options for speed and memory usage.
A top contender for the task is the regular B-tree. Nothing wrong with it. It's a comparison-based data structure. The regular insertion algorithm (splitting nodes when full) yields about 66% space utilization on random input. Since the B-tree is not sensitive to its input, it does not matter whether the set is dense or sparse: the space and time will be the same.
I will be using a B-tree of my own. It has better insertion characteristics than the competition while closely matching it on search speed and memory requirements. I will refer to it as the BQ tree.
A different type of contender is the radix tree. Very sensitive to input, it can store data efficiently or it can be a memory guzzler. The best-known radix tree in the wild is the Judy Array; its integer set variant is called Judy1.
Radix trees have been on my mind for a long time, and lately I've been working on an integer set of my own. Like Judy1, it's an adaptive data structure that switches the encoding of its tree nodes to suit the data. I will refer to this data structure as the RI set.
Judy1 and RI are (very) complex pieces of software as data structures go. Definitely not something to write in an afternoon, not even in two weeks. I can't say how much work was put into Judy1, but it was implied that several people worked on it for many years. Given this rather daunting complexity, I decided to also look into creating something simple. Not necessarily very good, but simple, if for no other reason than to have a basis for comparison. I opted for a compact byte stream loaded into an efficient tree, the BQ tree from earlier. I put my integers into chunks of 64 to 128 bytes, stored as varint-coded deltas. The chunks themselves do not allow random access, but the larger data structure does. Clearly I could have done a lot worse, but I wanted a fairly decent basis of comparison, not a lame one. I will refer to this data structure as the CS (compact stream) set.
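To make the CS idea concrete, here is a minimal sketch of the chunk encoding in C. It is illustrative only: the names and the exact layout are made up for this post, and the varint shown is the common LEB128 style.

```c
/* Illustrative sketch of a CS chunk: sorted values stored as
 * LEB128-style varint deltas. Zero-initialize a chunk before use. */
#include <stdint.h>
#include <stddef.h>

#define CHUNK_CAP 128                 /* chunks hold 64..128 payload bytes */

typedef struct {
    uint8_t  bytes[CHUNK_CAP];        /* varint-coded deltas */
    size_t   len;                     /* payload bytes used */
    uint64_t last;                    /* last value appended */
} cs_chunk;

/* Append one varint to the chunk; returns 0 if it no longer fits,
 * in which case the caller splits the chunk inside the BQ tree. */
static int varint_put(cs_chunk *c, uint64_t v)
{
    uint8_t tmp[10];                  /* a 64-bit varint is at most 10 bytes */
    size_t n = 0;
    do {
        tmp[n++] = (uint8_t)(v & 0x7F) | (v > 0x7F ? 0x80 : 0);
        v >>= 7;
    } while (v);
    if (c->len + n > CHUNK_CAP)
        return 0;
    for (size_t i = 0; i < n; i++)
        c->bytes[c->len++] = tmp[i];
    return 1;
}

/* Append a value no smaller than anything already in the chunk;
 * the first value is stored verbatim, the rest as deltas. */
static int cs_chunk_append(cs_chunk *c, uint64_t v)
{
    uint64_t delta = c->len ? v - c->last : v;
    if (!varint_put(c, delta))
        return 0;
    c->last = v;
    return 1;
}
```

Inserting into the middle of a chunk means re-encoding everything after the insertion point, which is one reason to cap chunks at 128 bytes: the rewrite cost stays bounded while the tree above provides the random access.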
Of these data structures, Judy1 comes across as the most mature and well-rounded solution. Not only does it offer functionality that the others don't have, it has also been tested, used, and allegedly fine-tuned to an unmatched degree.
Then there are the data sets. They can take many shapes and sizes. That doesn't matter to input-insensitive data structures like the B-tree, but to the others it does. It makes sense to talk about low-redundancy sets, where there's very little similarity between values. It makes sense to talk about very dense sets, where the typical difference between nearby values is very small, or even 1. And it makes sense to talk about sets in between, where the numbers are evenly distributed over a larger range and the differences are small, but not that small.
The radix trees are very sensitive to the data: they can store redundant data efficiently. The same applies to the last contender, the compact(ed) stream.
Sparse Data Sets
I have two data sets, each of 10M integers. The first consists of randomly and uniformly distributed 64-bit values; the second of 32-bit values stored in 64 bits.
The insertion times, in seconds:
| | BQ | Judy1 | RI-A | RI-B | RI-C | CS |
|---|---|---|---|---|---|---|
| 64-bit values | 4.3 | 4.3 | 3.9 | 5.5 | | 6.7 |
| 32-bit values | 4.2 | 3.7 | 3.8 | 5.5 | | 5.5 |
The times are rather meaningless without the memory usage figures. They are, in MB:

| | BQ | Judy1 | RI-A | RI-B | RI-C | CS |
|---|---|---|---|---|---|---|
| 64-bit values | 121 | 188 | 307 | 118 | | 121 |
| 32-bit values | 121 | 171 | 280 | 64 | | 36 |
As expected, the B-tree took exactly the same time and space for both data sets. The memory usage was entirely predictable: 10M values × 8 bytes is 80MB of keys, and at the average node fill factor of 2/3 that becomes 120MB, or 150% of the input size.
Judy1 would have been harder to predict, but on what is basically random input, it was not likely to do better than the B-tree. I have included several variants (different settings) of my own radix tree, labeled RI-A through RI-D across the tables. One can see that space and time can be traded freely.
Finally, the compact stream (CS) set: it wasn't particularly fast, but it managed decently on the storage side.
Very Dense Data Set
Here I have just one set, of 2015775 values, most of them within a ~2M range. Enough values are missing that large contiguous runs do not form.
The insertion times, in seconds:

| | BQ | Judy1 | RI-A | RI-B | RI-C | CS |
|---|---|---|---|---|---|---|
| Very dense set | .69 | .29 | .22 | .25 | .23 | .77 |
The memory usage, in MB:

| | BQ | Judy1 | RI-A | RI-B | RI-C | CS |
|---|---|---|---|---|---|---|
| Very dense set | 25 | .5 | .5 | .6 | .5 | 5.2 |
Not surprisingly, the radix trees do very well here; they were designed to. They can take advantage of the redundancy in the data to store it compactly and quickly.
Dense Data Sets
I have two timestamp data sets of 665K values each.
The insertion times, in seconds:

| | BQ | Judy1 | RI-A | RI-B | RI-C | CS |
|---|---|---|---|---|---|---|
| Set B | .174 | .158 | .18 | .27 | .16 | .25 |
| Set V | .181 | .16 | .21 | .27 | .22 | .25 |
The memory usage, in MB:

| | BQ | Judy1 | RI-A | RI-B | RI-C | CS |
|---|---|---|---|---|---|---|
| Set B | 8 | 6.8 | 14 | 4.5 | 1.6 | 2.6 |
| Set V | 8 | 2.8 | 16 | 5.1 | 2 | 2.7 |
The two sets hold the same data, coded differently. The “B” set codes the integers in plain binary (i.e. ordinarily); the “V” set codes them as varints laid out in 8 bytes.
Judy1 does best on time. My RI set is better on memory. An honorable mention goes to the CS set.
I have one data set of 3236127 values between 0 and 2^30, with a variable distribution: there are ranges where the difference between consecutive values is less than 10, and then there are larger gaps. I created another set from it by multiplying every value by 256 (and adding a small amount for good measure). Now I have two sets: one with numbers representable in 30 bits (or 4 bytes), one with numbers in 38 bits (or 5 bytes).
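In code, the derivation is a one-liner. The exact “small amount” doesn't matter for the experiment; `noise` below is a stand-in for it:

```c
#include <stdint.h>
#include <stddef.h>

/* P4SH -> P5SH: shift every 30-bit value up by 8 bits (x256) and add
 * a small offset; `noise` stands in for the unspecified amount. */
static void derive_p5sh(const uint64_t *p4, uint64_t *p5, size_t n,
                        uint64_t noise)
{
    for (size_t i = 0; i < n; i++)
        p5[i] = p4[i] * 256 + noise;  /* 30-bit input -> 38-bit output */
}
```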
The insertion times, in seconds:

| | BQ | Judy1 | RI-B | RI-C | RI-D | CS |
|---|---|---|---|---|---|---|
| P4SH | 1.15 | .88 | 1.4 | .95 | .94 | 1.55 |
| P5SH | 1.09 | 1.5 | 1.43 | 1.5 | 1.52 | |
The memory usage, in MB:

| | BQ | Judy1 | RI-B | RI-C | RI-D | CS |
|---|---|---|---|---|---|---|
| P4SH | 39 | 16.8 | 15.5 | 7.7 | 7.7 | 10.4 |
| P5SH | 49 | 25 | 20 | 16.3 | 16.9 | |
Midway Data Sets
Finally, I have two data sets of 64-bit values that, when put in a (radix) tree, split it evenly over its entire height. They have 2472941 and 6753000 values.
The insertion times, in seconds:

| | BQ | Judy1 | RI-B | RI-C | RI-D | CS |
|---|---|---|---|---|---|---|
| X48 | .83 | 1.16 | 1.34 | 1.34 | 1.33 | 1.47 |
| PKD48 | 2.65 | 3.46 | 4.14 | 4.09 | 3.87 | 4.3 |
The memory usage, in MB:

| | BQ | Judy1 | RI-B | RI-C | RI-D | CS |
|---|---|---|---|---|---|---|
| X48 | 31 | 44 | 17 | 17 | 13.9 | 14.5 |
| PKD48 | 82 | 109 | 36.5 | 35 | 27.2 | 27.5 |
The B-tree is again quite fast. Judy1 disappoints on memory, while my attempts to create a more compact radix tree resulted in larger times. The compact stream is, well, compact, while staying close to the radix trees on time.
The Extremes
There are articles comparing data structures (integer-to-integer maps, usually) on the extremes of the possible input: random numbers in the [0, 2^64) range, and all the numbers in [0, N), where N is the size of the input. I don't know why they do it. It's not that such types of input are inherently unrealistic, but for them specialized data structures should be considered. Radix trees will not do well on a uniform random distribution, where there's no redundancy to exploit. I have created two such sets, sparse and dense, each with 16777216 (or 2^24) values.
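For reference, here is a sketch of how the two extreme sets can be generated; `rand64()` is a stand-in for any decent 64-bit PRNG, not a real library call:

```c
#include <stdint.h>
#include <stddef.h>

#define N ((size_t)1 << 24)           /* 16777216 values each */

extern uint64_t rand64(void);         /* stand-in 64-bit PRNG */

/* sparse24: uniform random draws from [0, 2^64).
 * dense24:  every integer in [0, 2^24). */
static void make_extreme_sets(uint64_t *sparse, uint64_t *dense)
{
    for (size_t i = 0; i < N; i++) {
        sparse[i] = rand64();
        dense[i]  = (uint64_t)i;
    }
}
```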
The insertion times, in seconds:

| | BQ | Judy1 | RI-C | RI-D | CS |
|---|---|---|---|---|---|
| sparse24 | 7.4 | 6.9 | 9.3 | 10 | 11 |
| dense24 | | 3.4 | 3.2 | 3.1 | 8.4 |
The memory usage, in MB:

| | BQ | Judy1 | RI-C | RI-D | CS |
|---|---|---|---|---|---|
| sparse24 | 202 | 289 | 177 | 158 | 199 |
| dense24 | | 4.2 | 4.3 | 4.3 | 34 |
The problem with the so-called “sparse” set is that the values diverge in the first 24 bits. The top 3 levels of nodes will be very full: the first 2 completely full, the 3rd almost full, all with a fanout of 256. There'll be some tree splits at level 4, but below that there'll be none. So if we are to devise uniform data sets, we need one which splits the tree uniformly from top to bottom. To that end, I have created a set where each byte in a 64-bit word takes one of 8 values. The values I first chose were 0 through 7, but such a clustered distribution would favor certain types of data organization, so I created a second set where the 8 values are the powers of two from 1 to 128.
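A sketch of the construction: enumerating all 8^8 = 2^24 byte combinations yields exactly 16777216 distinct words. (The enumeration order here is illustrative; the actual sets may have been generated differently.)

```c
#include <stdint.h>
#include <stddef.h>

/* Build all 8^8 = 2^24 words whose bytes each take one of 8 values.
 * The counter c supplies 3 bits per byte position; out must hold
 * 2^24 entries. */
static void make_uniform_set(uint64_t *out, const uint8_t vals[8])
{
    for (uint32_t c = 0; c < (1u << 24); c++) {
        uint64_t w = 0;
        for (int b = 0; b < 8; b++)
            w |= (uint64_t)vals[(c >> (3 * b)) & 7] << (8 * b);
        out[c] = w;
    }
}

/* progress24: each byte in 0..7.  log24: each byte a power of two. */
static const uint8_t prog_vals[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
static const uint8_t log_vals[8]  = { 1, 2, 4, 8, 16, 32, 64, 128 };
```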
The insertion times, in seconds:

| | Judy1 | RI-C | RI-D | CS |
|---|---|---|---|---|
| progress24 | 8.68 | 10.2 | 8.2 | 9.95 |
| log24 | 9.65 | 10.5 | 9.4 | 10.2 |
The memory usage, in MB:

| | Judy1 | RI-C | RI-D | CS |
|---|---|---|---|---|
| progress24 | 209 | 53 | 34.1 | 40 |
| log24 | 208 | 52 | 34 | 40 |
The Lookup Times
I have the lookup times here.
The RI Integer Set
And Whether You Should Try It
You can find the RI code in my L2 library, together with many other data structures. It hasn't been tested much thus far, and it's still in active development. As of Aug 2025, it only offers insertion, retrieval, and end-to-end traversal. By comparison, Judy1 has a light but fairly useful library.