Determining Objects in a Set: Examples in JavaScript

In practical programming, it is quite common to determine whether an object belongs to a set or not. For example, if you implement a spell checker, it is essentially a comparison against the set of all correctly-spelled words. If the set itself is fixed and it never changes, what are the approaches we can use?

The usual initial solution is by using an associative container, often denoted as dictionary data structure. We populate the dictionary with the members of the set. Looking up whether the set contains the search word or not will reveal the outcome of the spell checking.

JavaScript, also known as ECMAScript, is unique in a sense that every object can act as an associative array. Obviously, this is not what you are supposed to do because of the abuse, but to give an idea:

var valid_words = {
  'foobar': true,
  'bar': true,
  'baz': true,
  'quux': true
};
 
function is_valid(word) {
  return valid_words.hasOwnProperty(word);
}
 
is_valid('fox'); // false

If you are using ECMAScript 5 only, you can use Object.create from null so that valid_words does not inherit some properties from Object.prototype.

Another (better) approach is to use array lookup, i.e. indexOf method, in particular since we can store the list of valid words in an array anyway:

var valid_words = ['foobar', 'bar', 'baz', 'quux'];
 
function is_valid(word) {
  return valid_words.indexOf(word) >= ;
}

The problem with the above tricks is that the performance depends very much on the implementation inside the JavaScript engines. Since this is not the primary use case of object or array, often this technique does not lead to a good performance, especially if the set is large.

How about using the native comparison in the language itself? Using switch statement is one possibility. It could be even a series of if statements.

function is_valid(word) {
  switch (word) {
  case 'foobar':
  case 'bar':
  case 'baz':
  case 'quux':
    return true;
  }
  return false;
}

Beside generating a lot of code (since the checking is done for each possible set members), a string comparison is not the fastest thing to do. Depending on the size of the set, the performance may vary.

Something I learned from Roberto Raggi few years ago is the use of prefix tree or trie to build a sort of decision tree. This is used in his C++ parser to detect whether an identifier is a keyword or not, pretty similar to our spell checker example. For the code of that cppparser, take a look at lexer.cpp. How do we apply the same approach to the simple spell checker? It will look like the following. Note that there can be dozens of other variants, but you get the idea.

function is_valid(word) {
  switch (word.charAt()) {
  case 'f':
    return word === 'foobar';
 
  case 'b':
    switch (word.charAt(2)) {
    case 'r':
      return word === 'bar';
    case 'z':
      return word === 'baz';
    }
    return false;
 
  case 'q':
    return word === 'quux';
  }
  return false;
}

A simplified construct that I use for Esprima, the fast JavaScript parser, is to take the word length as the first level comparison and then a simple conditional for the second one. This leads to:

function is_valid(word) {
  switch (word.length) {
  case 3:
    return word === 'bar' || word === 'baz';
  case 4:
    return word === 'quux';
  case 6:
    return word === 'foobar';
  }
  return false;
}

I found out that the above method works fast enough (tested with its benchmark suite) and there is no reason to micro-optimize further. Just like in PGO (profile-guided optimization), combining the short-circuit behavior of the logical condition with the representative distribution of the words can be used for branch optimization.

If the set is large and some members are quite similar to each other, DAG or directed acyclic graph (or rather directed acyclic word graph in this case), can be a better solution with respect to the memory usage. For many cases, generating this graph would be too complicated and not worth the effort.

Of course, it would be incomplete if I don’t mention no-collision hash, i.e. perfect hash. This is the approach used to recognize keywords in the tokenizers of various GNU compilers, from C++ to Java, as well as projects like WebKit and Mozilla. A popular perfect hash function generator is gperf, it generates C code though it is possible to transform the generated code to JavaScript.

What we need to feed to gperf is the following input:

%{
%}
%%
foobar
bar
baz
quux
%%

Now it will spit out some C code. For your convenience, I have crudely transformed it into JavaScript as follows. I also did change the asso_values look-up table since many entries in that table are duplicated. The (perfect) hash function itself is unexpectedly quite primitive. Combined with the short wordlist array, this gperf-based solution works rather quite well.

var MIN_WORD_LENGTH = 3;
var MAX_WORD_LENGTH = 6;
var MIN_HASH_VALUE = 3;
var MAX_HASH_VALUE = 8;
 
function hash(str) {
  var asso_values = [];
  for (var i = ; i < 256; ++i)
    asso_values.push(9);
  asso_values[111] = ;
  asso_values[114] = 5;
  asso_values[117] = ;
  asso_values[122] = ;
  return str.length + asso_values[str.charCodeAt(2)];
}
 
function is_valid(word) {
  var wordlist = ['', '', '', 'baz', 'quux', '', 'foobar', '', 'bar'];
 
  if (word.length <= MAX_WORD_LENGTH && word.length >= MIN_WORD_LENGTH) {
    var key = hash(word);
    if (key < = MAX_HASH_VALUE && key >= ) {
      var value = wordlist[key];
      return word === value;
    }
  }
  return false;
}

Before you create microbenchmarks comparing all the different methods, let me remind you again that it is always good to have the right data for your benchmark. In other words, write the most descriptive implementation first, create the benchmark suite properly with the representative corpus, and then do your comparison. In the above example, if you just measure the speed of running is_valid("foo"), you would be led astray.

Obviously, all the approaches mentioned above are not the only possible solutions. Computer scientists surely have studied and understood few more tricks. What is your favorite? Share it with us!

Determining Objects in a Set: Examples in JavaScript

Related posts: