Tom MacWright

tom@macwright.com
Sorting

I have an undergraduate degree in Computer Science, and the paper to prove it. Despite the time I spent in lecture halls and the lab, many basic, ‘obvious’ concepts are still clicking, years later.

This year, I finally understood why sorting is so useful.

I’ve been working on the next version of Simple Statistics, a friendly library for statistics written in JavaScript. It implements a few sophisticated methods, like Bayesian and perceptron classifiers, but mostly focuses on descriptive statistics - methods like mean and mode that can be implemented in 20 lines of code but don’t come standard in the JavaScript language.

A brief introduction to complexity theory

Complexity theory is a subject in computer science that estimates and measures how the performance of programs changes as their inputs increase in size. It’s different from how performance is usually talked about in the industry because it rarely measures in terms of milliseconds or megabytes, but instead in terms of ratios and curves. It asks: how much slower does this algorithm get with each additional number to crunch?

For example, the question “what’s the first element in an array” is, in almost every case, answerable in one step. You have a pointer to the first element, or you can look it up with array[0]. Regardless of the array’s length, getting the first element takes the same amount of time - the time to solve the problem is not affected by the length.

[Figure: Constant Time - the algorithm's time taken doesn't depend on input size]

We call that kind of problem solvable in constant time, or in complexity theory’s notation, O(1). The 1 doesn’t stand for milliseconds or any real measurement - it’s more like an indication that the performance doesn’t rely on any characteristic of the data, like n.
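
For example, a lookup like this one - a trivial sketch of mine, not code from the library - takes one step whether the array has three items or three million:

function first(array) {
  // one step, regardless of the array's length
  return array[0];
}

first([1, 2, 3]); // 1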

How about finding the sum of all items in an array? You can’t solve this in one step. Instead you have to write a little code, like this:

var array = [1, 2, 3];
var sum = 0;
for (var i = 0; i < array.length; i++) {
  sum = sum + array[i];
}

Now there’s a for loop in your code and it looks at each item in the array, doing a very small amount of work in each step. The time it takes is dependent on the array’s length. Adding two numbers is fast but with thousands of items, computing a sum will take a long time. And the relationship between speed and length is one-to-one: each new item makes the loop one increment slower.

[Figure: Linear Time - the algorithm's time taken increases step by step with input size]

This algorithm works in linear time, or in the notation, O(n). The steps could be any size: for each extra item in the array the time taken could increase a little or a lot, but the important thing is that it increases the same amount for each item added.

[Figure: Polynomial Time - the algorithm's time taken increases more quickly as input size grows]

And so on and so forth: beyond constant and linear time, there are problems only solvable in O(n²) time, which typically requires a nested loop, and problems solvable in O(n log n) time, which sits somewhere in between.
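
To make O(n²) concrete, here’s a sketch of my own (not from Simple Statistics): checking an unsorted array for duplicates by comparing every item against every other item. Each added item means a whole extra pass over the array, so time grows with the square of the length.

function hasDuplicate(array) {
  for (var i = 0; i < array.length; i++) {
    // the nested loop is the giveaway: roughly n * n comparisons
    for (var j = i + 1; j < array.length; j++) {
      if (array[i] === array[j]) return true;
    }
  }
  return false;
}

hasDuplicate([1, 2, 3]); // false
hasDuplicate([1, 2, 2, 3]); // true

Notice that sorting the array first would reduce the check to a linear scan for adjacent equal values - plus the cost of the sort - which is exactly the trick the algorithms later in this post rely on.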

Sorting arbitrary numbers requires at least O(n log n) time. That’s the fastest a comparison-based sort can possibly go - a proven lower bound, confirmed by countless PhD theses. Modifications of the algorithms that keep time down, conserve memory, and improve other qualities are also commonplace thanks to those theses. Sorting algorithms also contain lots of great examples of recursion, functional programming, and dynamic programming, so they’re omnipresent in undergraduate education.
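
Merge sort is the classic example of an O(n log n) sort: split the array in half, sort each half, and merge the results. Here’s a minimal sketch for illustration - in real code you’d reach for JavaScript’s built-in sort with a numeric comparator:

function mergeSort(array) {
  if (array.length <= 1) return array;
  // halving the input over and over is where the "log n" comes from
  var middle = Math.floor(array.length / 2);
  var left = mergeSort(array.slice(0, middle));
  var right = mergeSort(array.slice(middle));
  // merging touches every item once per level: that's the "n"
  var merged = [], i = 0, j = 0;
  while (i < left.length && j < right.length) {
    merged.push(left[i] <= right[j] ? left[i++] : right[j++]);
  }
  return merged.concat(left.slice(i)).concat(right.slice(j));
}

mergeSort([3, 1, 4, 1, 5]); // [1, 1, 3, 4, 5]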

Sorting in Simple Statistics

Some methods in Simple Statistics implement order statistics: the minimum, maximum, and median are all statistics that correspond directly to specific positions in a sorted array.

How do you find the minimum, maximum, and median numbers in an unsorted array? Well, there are two ways:

1: You can sort the array first, and then access the numbers at their rightful places:

  • min = array[0]
  • max = array[array.length - 1]
  • median = array[Math.floor(array.length / 2)]

The real median code is a bit more complex than that, in order to handle even-length arrays, but the principle is the same.
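
Here’s roughly what that looks like - a simplified sketch of taking the median of a sorted array; the full version lives in the simple-statistics source:

function medianSorted(sorted) {
  var middle = Math.floor(sorted.length / 2);
  if (sorted.length % 2 === 1) {
    // odd length: the middle value is the median
    return sorted[middle];
  }
  // even length: average the two middle values
  return (sorted[middle - 1] + sorted[middle]) / 2;
}

medianSorted([1, 2, 3]); // 2
medianSorted([1, 2, 3, 4]); // 2.5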

2: You can use a linear-time algorithm to find the minimum & maximum values. This isn’t too tough, since minimum and maximum are the simplest examples of online algorithms: if you encounter a new value that’s lower than your current running minimum, it is the new minimum. Hence this style of algorithm:

function min(x) {
    var value;
    for (var i = 0; i < x.length; i++) {
        // any value lower than the running minimum becomes the new minimum
        if (value === undefined || x[i] < value) {
            value = x[i];
        }
    }
    return value;
}

There’s a for loop that touches each individual item in the array, so this algorithm works in linear time.

Median can’t be precisely solved online, so our median algorithm has always relied on sorting.

Sorting for order statistics

Minimum and maximum can be calculated on a sorted array in one step - constant time - or on an unsorted array in linear time. If you already have a sorted array, you can read these values off almost instantly.
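
In code, the sorted variants are one-liners - minimal sketches of the idea behind the library’s minSorted and maxSorted:

function minSorted(sorted) {
  // the smallest value sits at the start of a sorted array
  return sorted[0];
}

function maxSorted(sorted) {
  // the largest value sits at the end
  return sorted[sorted.length - 1];
}

minSorted([1, 2, 3]); // 1
maxSorted([1, 2, 3]); // 3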

Counts and modes: naive algorithms

Minimum and maximum are the most straightforward cases: they’re order statistics, meaning ranks of numbers in a series. Let’s look at two other problems that are fast with a sorted array and slow or memory-intensive without one: the mode and the count of unique values.

The mode of an array is the number that appears the most in that array. If your array is [1, 2, 2, 3], the mode is 2 because the number 2 appears twice but 1 and 3 only appear once.

Order statistics rely on the principle that a sorted array is sorted by value, so at one end are the biggest numbers and at the other are the smallest, and the median is in the middle.

Algorithms that work on sorted arrays can take advantage of two related properties:

  • Identical values are next to each other in a sorted array
  • Once you’ve moved past a value in a sorted array, you’ll never see it again: if your loop hits the tenth value and it’s a 4, and the eleventh value is a 5, you’re guaranteed never to encounter a 4 again.

These two properties line up with our problems: the mode is based on identical values, and the unique count is based on counting how many distinct values are present.

A simple, but memory-inefficient way to compute a mode would be:

var list = [2, 1, 2, 3];

// create an object where the keys are items in the array
// and the values are how many times each item is encountered
var counts = list.reduce(function(memo, item) {
  if (memo[item] === undefined) memo[item] = 0;
  memo[item]++;
  return memo;
}, {});
// { '1': 1, '2': 2, '3': 1 }
var pairs = Object.keys(counts).map(function (item) {
  // object keys are strings, so convert back to numbers
  return [parseFloat(item), counts[item]];
});
// [[1, 1], [2, 2], [3, 1]]
pairs.sort(function(a, b) {
  return b[1] - a[1];
});
// [[2, 2], [1, 1], [3, 1]]
var mode = pairs[0][0];
// 2

Gross, right? The careful reader will notice that we could avoid sorting the pairs and find the highest count with the linear-time approach mentioned before, but either way we’re creating a big object and then a big array, and overall this costs a lot of cycles.

Calculating the number of unique values can reuse the first step of that algorithm: instead of finding the most popular value, we just count how many distinct keys there are.

var list = [2, 1, 2, 3];

// create an object where the keys are items in the array
// and the values are how many times each item is encountered
var counts = list.reduce(function(memo, item) {
  if (memo[item] === undefined) memo[item] = 0;
  memo[item]++;
  return memo;
}, {});

var uniqueCount = Object.keys(counts).length;

Fast counts and modes with sorting

Sorted arrays let us implement both of these algorithms in linear time and without the extra memory cost of temporary objects and arrays - all we need are a few cheap temporary number variables.

function modeSorted(sorted) {
  var last = sorted[0],
    value = NaN,
    maxSeen = 0,
    seenThis = 1;
  // the loop deliberately runs one step past the end of the
  // array, so that the final run of values is counted too
  for (var i = 1; i < sorted.length + 1; i++) {
    // since the array is sorted, we have a guarantee
    // that we'll never see the value again
    if (sorted[i] !== last) {
      if (seenThis > maxSeen) {
        maxSeen = seenThis;
        value = last;
      }
      seenThis = 1;
      last = sorted[i];
    } else { seenThis++; }
  }
  return value;
}

Counting unique values is even simpler with the guarantee that we’ll never see a value again.

function uniqueCountSorted(input) {
  var uniqueValueCount = 0,
    lastSeenValue;
  for (var i = 0; i < input.length; i++) {
    if (i === 0 || input[i] !== lastSeenValue) {
      lastSeenValue = input[i];
      uniqueValueCount++;
    }
  }
  return uniqueValueCount;
}
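
The payoff: pay the O(n log n) cost of sorting once, then run as many of these linear-time (and constant-time) algorithms as you like over the result. For example, using the functions above - note the numeric comparator, since JavaScript’s default sort compares values as strings:

var data = [3, 1, 2, 2];
// sort a copy once, with a numeric comparator
var sorted = data.slice().sort(function(a, b) { return a - b; });
// [1, 2, 2, 3]
modeSorted(sorted); // 2
uniqueCountSorted(sorted); // 3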

Conclusion & notes

This all clicked for me working on Simple Statistics v2: a lot of algorithms were running a sort as a first step and then proceeding into the ‘fast part’. v2 will let you skip the first step if you already have sorted input: use uniqueCountSorted, minSorted, maxSorted, medianSorted, modeSorted, and quantileSorted.

The algorithms in this post are simplified for space, since it’s already a long thing to read on the internet. They’re all available in their well-commented, corner-case-handling forms in the simple-statistics source code.

The term ‘online algorithm’ may be a little obscure, but in the node.js world a good parallel is streams. Streaming modules work on small parts of the data at a time, which gives you a memory advantage, but it also means that algorithms need to be computable without knowing all of the input. That kind of problem is online.
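
A running mean is a simple example of an online algorithm - a sketch of the concept, not code from the library: it keeps just a running total and a count, so it can consume a stream of any length without ever holding the whole input in memory.

function makeOnlineMean() {
  var sum = 0, count = 0;
  return function (value) {
    // fold each new value into the running state
    sum += value;
    count++;
    return sum / count;
  };
}

var mean = makeOnlineMean();
mean(1); // 1
mean(2); // 1.5
mean(3); // 2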