String Associative Tables, II

In my previous post (String Associative Tables) I have compared for insertion time and memory usage three string tables: the unsorted hash table std::unordered and the sorted JudySL and my own L2/SI.

None of the three is the absolute best in their own class.  std::unordered is not the fastest nor the most memory conservative hash table around.  JudySL is based on JudyL, an integer integer map on which several guys worked a few good years.  However, JudySL does not matches JudyL in effort and design complexity.  It is in fact more of a nifty trick than a true string collection design.  L2/SI is itself only a half baked idea.  Nonetheless, I do think that such a comparison is useful and instructive enough to reveal the strength and weaknesses of the two choices: hash table vs tree.

The conclusions were that the hash table, offering half the functions of the sorted arrays (no predecessor/successor resolution, no in order traversal, etc), is the faster option, but at a significant space cost.

What was left out was the query comparison.  I’ll do that now, for the same four data sets.

The first, an EN Wiki article titles set:

enw1

The second, a collection of English like made up words:

pkd1

The third, a collection of binary 8 bytes long strings:

w81

Not surprising, the hash table speed changes little between the three sets.  It also matches closely the speed of insertion, outside table resizing.  The tree structures do improve on their insertion times.  Their performance is also sensitive to data, and the shorter the strings, the faster the queries.

The final comparison is for a set of long local file paths:

zcom1

A clear win for the hash table at large set sizes, though it’s worth remembering that the trees take lot less memory to store the set (up to 5 times less).

Leave a comment