Database B-trees

Most databases use B-trees to organize their data and/or to index their tables. The B-trees serve as associative arrays, mapping a string (the key) to data (the value). For most indexes, the value is some sort of pointer to an actual database record (can be a record number, a file offset, etc).

The B-tree, despite its extensive usage in this capacity, is only a dubiously suited data structure. It is not particularly adept at organizing string collections.

As a (perhaps surprisingly) complex data structure, a B-tree can take many shapes and sizes. For an instance, while most commonly a B-tree organizes its nodes as a sorted array (alternatively, an unsorted array with a sorted index), over the decades many other types of nodes organization have been suggested in the literature. Included are hash tables, binary search trees (yes, trees as the nodes of larger trees) – BSTs, balanced BSTs, even tries (N-ary trees).

I am comparing two B-trees, of two distinct breeds, on three data sets — the same data sets from my previous posts. One EN Wikipedia titles set (ENW), a dissociative press words set (PKD) and a file paths set (URL).

The first B-tree is the one at the heart of LMDB. LMDB is touted as a very fast database. And it is fast, no doubt about it. But every database has its uses, its strengths and weaknesses. More importantly, it is built upon design choices that make it suited for some uses and unsuited for others.

The B-tree of LMDB is, I believe, of the classical database B-tree breed. It’s similar with the Sqlite B-tree, the BDB B-trees, the B-trees at the core of many other DBMSs. It’s the database B-tree described by Donald Knuth. I will denote this B-tree as CDB (Classical DB B-tree).

The second B-tree in my test is of my own key/data store, nonblonde.

https://nonblonde.sourceforge.net/

https://sourceforge.net/projects/nonblonde/

The nonblonde B-tree nodes are organized as critical bit trees (CBTs), or blind binary tries, if you prefer. That means one (and only one) string comparison is required to descend/search a node. For nonblode even this single comparison is often not necessary, as the trie is only half blind. The nodes are also variable in byte size.

I will not be comparing the two key/data stores. If I were, I might decide the LMDB is the faster one, but remember, each database has its own uses.

I am comparing the B-trees. In this post, I am only concerned with the insertion speed, and with nothing else. Not transaction closing, not disk writing, just insertion. To this end, I insert all my data in one huge transaction, in an empty “table”. The inserts are random, as I don’t find sequential insertion that interesting.

As I prepare the test setup, I do have expectations formed. I believe LMDB is a fast key/data store, but I don’t expect it is because of an especially fast B-tree. As I’ve said, the LMDB B-tree is of classical breed that is not particularly fast, in fact, not particularly suited for strings organization. The nonblonde B-tree was designed to suit long keys. It is an in the middle design, one of doubtful virtues. But I believe the longer the keys and their common prefixes, the more favored the nonblonde B-tree will be.

In my tests, the keys attach no data.

Key countCDBnonblonde
1048576.759s.556s
20971521.91s1.397s
41943044.61s3.354s
PKD
Key countCDBnonblonde
1048576.9s.646s
20971522.23s1.587s
41943045.22s3.622s
ENW
Key countCDBnonblonde
10485761.78s.883s
20971524.03s2.033s
419430412.42s4.707s
URL

The times recorded above are the times to populate the database. They do not include database initialization. There’s no disk writing involved. The input is pre-read from a file and this pre-reading is also not included.

The LMDB B-tree is, I guess, not stripping common key prefixes and so forth. The nonblonde B-tree is itself not particularly memory efficient. Though not shown, the memory requirements appeared to me to be lower for the nonblonde B-tree.

And though I’ve said I will not be comparing disk writing, I have had a look at how much disk space is required to store the collections above. This time, nonblonde is particularly space efficient, as it strips all common prefixes between keys. For the last and largest string collection, the size of files on disk was several times smaller in nonblonde‘s favor. There is a cost to that, of course. As nonblonde transcodes between the in-memory representation and disk representation, the writing and reading times are better for LMDB.

Leave a comment