Tom MacWright

tom@macwright.com

Stream Statistics

stream-statistics is a JavaScript library that implements online algorithms for descriptive statistics.

The idea came about while developing simple-statistics, a module I made to understand statistics better. That one takes full datasets, which in many cases are massive arrays of numbers. But there’s another approach - providing data number-by-number to online algorithms via an interface like nodejs’s streams.
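As a rough illustration of the online approach, here is a minimal sketch of an accumulator that updates a few running values as each number arrives and never stores the dataset. It is not the stream-statistics source or its API: `RunningSummary` and its `push` and `mean` methods are hypothetical names made up for this example.

```javascript
// Hypothetical accumulator, for illustration only: not the stream-statistics API.
function RunningSummary() {
    this.n = 0;            // how many numbers have been seen so far
    this.sum = 0;          // running total
    this.min = Infinity;   // smallest value seen so far
    this.max = -Infinity;  // largest value seen so far
}

// Fold one number into the running state in constant time and memory.
RunningSummary.prototype.push = function(x) {
    this.n += 1;
    this.sum += x;
    if (x < this.min) this.min = x;
    if (x > this.max) this.max = x;
};

// The mean is derived from the running state whenever it is asked for.
RunningSummary.prototype.mean = function() {
    return this.n === 0 ? NaN : this.sum / this.n;
};

var summary = new RunningSummary();
[4, 8, 15, 16, 23, 42].forEach(function(x) { summary.push(x); });
```

Each statistic can be read at any point while data is still arriving, which is what makes this shape a natural fit for a stream.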

To be clear - stream-statistics doesn’t require nodejs and can run in browsers (even old ones). When you use it as a module with npm, it tries to align to nodejs’s stream specification.

That said, ‘stream specification’ is kind of overstating what node has - it has no prescriptive docs for how to implement streams, and my experience with making this ‘compliant’ has been less than sunny.

Here’s a thing you can do with stream-statistics:

var assert = require('assert'),
    fs = require('fs'),
    StreamStatistics = require('stream-statistics'),
    byline = require('byline');

var ss = new StreamStatistics();
var stream = byline(fs.createReadStream(__dirname + '/samples.txt'));

// Pipe a stream of newline-separated numbers into stream-statistics
stream.pipe(ss);
stream.on('end', function() {
    assert.equal(ss.max(), 120);
});

Unlike simple-statistics, the algorithms in stream-statistics don’t look much like their definitions on Wikipedia - they’re made to be quite fast and usable.
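To make that contrast concrete, here is a sketch that puts the textbook definition of variance next to Welford's online update, a well-known single-pass method. It is an assumption that this resembles what stream-statistics does internally; the names `varianceByDefinition` and `OnlineVariance` are hypothetical and exist only for this example.

```javascript
// Textbook definition: needs the whole array in memory and two passes over it.
function varianceByDefinition(xs) {
    var mean = xs.reduce(function(a, b) { return a + b; }, 0) / xs.length;
    var squares = xs.reduce(function(a, x) {
        return a + (x - mean) * (x - mean);
    }, 0);
    return squares / xs.length; // population variance
}

// Welford's online update (hypothetical wrapper, not the library's code):
// one pass, constant memory, and it reads nothing like the formula above.
function OnlineVariance() {
    this.n = 0;
    this.mean = 0;
    this.m2 = 0; // running sum of squared deviations from the current mean
}

OnlineVariance.prototype.push = function(x) {
    this.n += 1;
    var delta = x - this.mean;        // distance from the old mean
    this.mean += delta / this.n;      // shift the mean toward x
    this.m2 += delta * (x - this.mean); // uses both the old and the new mean
};

OnlineVariance.prototype.variance = function() {
    return this.n === 0 ? NaN : this.m2 / this.n; // population variance
};

var data = [2, 4, 4, 4, 5, 5, 7, 9];
var online = new OnlineVariance();
data.forEach(function(x) { online.push(x); });
```

Both versions should agree on the same data up to floating-point error, but only the second can be fed one number at a time.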

Like simple-statistics, it’s just one more implementation in a field of many - Boost.Accumulators is a notably incredible implementation in C++ which I’ve tinkered with in terms of mapnik. The streaming quantile implementation will be inspired by the C implementation of ‘Effective Computation of Biased Quantiles over Data Streams’ in statsite by Armon Dadgar.

To announce this, I wanted to finish either a neat drawing or one of the uber-difficult algorithms for a more complex statistic. The former won out; implementing quantiles was stalled for a while. The inconsistent, impenetrable writing on Wikipedia, MathWorld, R, Mathematica, and elsewhere is a shame, and a ready example of how math writing fails to be useful in the gap between theory and pre-baked implementations.

Anyway, when I get more coffee or a pull request, stream-statistics will do cool quantiles and k-means analysis.

Install stream-statistics with npm or download stream_statistics.js from GitHub to use it in the browser.