After you have seen the distribution of JavaScript keywords and statements, it is time for another fun statistic. This time we will take a look at the histogram of various JavaScript tokens.
What is a token? According to the ECMAScript specification (see Section 5.1.2):
Input elements other than white space and comments form the terminal symbols for the syntactic grammar for ECMAScript and are called ECMAScript tokens. These tokens are the reserved words, identifiers, literals, and punctuators of the ECMAScript language.
Section 7.5 provides the list of all tokens:
IdentifierName
Punctuator
NumericLiteral
StringLiteral
Furthermore, Section 7.6.1 says something about reserved words:
A reserved word is an IdentifierName that cannot be used as an Identifier.
For example, you can’t use keywords such as if, while, and many others as your variable name. In addition, there are also literals like null, true, and false (the last two are Boolean literals) which belong to this reserved group.
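To see the reserved-word rule in action, here is a minimal sketch that asks the JavaScript engine itself whether a word is accepted as a variable name. The helper isUsableAsVariableName is a hypothetical name made up for this illustration, and it assumes that it is handed a single word. It relies on the Function constructor, which parses its body as sloppy-mode code without running it.

```javascript
// isUsableAsVariableName is a hypothetical helper, not part of any library.
// Assumption: `name` is a single word, not an arbitrary code fragment.
function isUsableAsVariableName(name) {
    try {
        // The Function constructor parses its body right away, so a
        // reserved word in the binding position raises a SyntaxError.
        // The resulting function is never called.
        new Function('var ' + name + ';');
        return true;
    } catch (e) {
        if (e instanceof SyntaxError) {
            return false;
        }
        throw e;
    }
}

['answer', 'if', 'while', 'null', 'true', 'false'].forEach(function (name) {
    console.log(name, isUsableAsVariableName(name));
});
```

Only the first word in the list should be reported as usable; the rest are reserved words.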
Armed with this information, it’s time to mine the data. Fortunately, this is rather easy with the help of Esprima, a standard-compliant, high-performance JavaScript parser. Since Esprima can optionally output the list of all tokens, we just need to harvest them properly. The result looks like the following chart:
The corpus is the collection of libraries in Esprima’s benchmark test suite. For good measure, I also added regular expressions to the mix, although technically a regular expression is not a token according to the definition above. Judging from the chart, it seems that our JavaScript programs consist mostly of punctuators!
If you would like to run the analysis yourself, simply use the tokendist.js example in the Esprima source tree, which is also reproduced here for your convenience:
var fs = require('fs'),
    esprima = require('esprima'),
    files = process.argv.splice(2),  // file names given on the command line
    histogram,
    type;

// One counter per token type, each starting at zero.
histogram = {
    Boolean: 0,
    Identifier: 0,
    Keyword: 0,
    Null: 0,
    Numeric: 0,
    Punctuator: 0,
    RegularExpression: 0,
    String: 0
};

files.forEach(function (filename) {
    var content = fs.readFileSync(filename, 'utf-8'),
        // With { tokens: true }, the parse result carries a tokens array.
        tokens = esprima.parse(content, { tokens: true }).tokens;

    tokens.forEach(function (token) {
        histogram[token.type] += 1;
    });
});

// Print every token type with its total count.
for (type in histogram) {
    if (histogram.hasOwnProperty(type)) {
        console.log(type, histogram[type]);
    }
}
Run it with Node.js as follows:
node tokendist.js /path/to/some/*.js
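The script above only counts the token types it lists up front; a type missing from that list would end up as NaN. As a minimal sketch of a more defensive variation (not the code shipped with Esprima), the counting step can create counters on demand and also report each type's share. The helper countTokenTypes is a hypothetical name, and the sample array is hand-written in the same shape as the tokens used above (objects with type and value), so that it runs without any dependency.

```javascript
// countTokenTypes is a hypothetical helper; it accepts any array of
// objects with a `type` property, such as the tokens used above.
function countTokenTypes(tokens) {
    var histogram = {};
    tokens.forEach(function (token) {
        // Start unseen types at zero so that the sum never becomes NaN.
        histogram[token.type] = (histogram[token.type] || 0) + 1;
    });
    return histogram;
}

// Hand-written sample in the token shape for the code: var answer = 42;
var sample = [
    { type: 'Keyword', value: 'var' },
    { type: 'Identifier', value: 'answer' },
    { type: 'Punctuator', value: '=' },
    { type: 'Numeric', value: '42' },
    { type: 'Punctuator', value: ';' }
];

var counts = countTokenTypes(sample);
Object.keys(counts).forEach(function (type) {
    // Share of this type among all tokens, as a percentage.
    var share = (100 * counts[type] / sample.length).toFixed(1);
    console.log(type, counts[type], share + '%');
});
```

Even in this tiny sample, punctuators are the most frequent token type.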
Happy lexing!