Literate Jenks Natural Breaks and How The Idea Of Code is Lost
Consider Jenks Natural Breaks Optimization.
It's a way to cut up data values into discrete categories. For instance, a choropleth map of poverty rates is cut into nine buckets which each have a different color. Different scales are useful for different kinds of data - you might use a linear scale, log scale, quantile, or so on.
Jenks is an algorithm designed specifically for this case that tries to maximize the similarity of numbers in groups while maximizing the distance between the groups. It tends to look good and be easily understood, so it's popular. It's in ArcGIS, QGIS, and lots of other tools for maps and statistics. It's in university lectures and textbooks.
Mystery Ports of Ports
Notice anything about these linked scripts?
Every implementation I could find of Jenks is a port from Fortran IV, with the same terrible variable names, no comments, and absolutely no evidence that the porting author knew absolutely anything about what the code was doing.
Let's be clear: the algorithms we use for Jenks are directly ported from an algorithm that accepted punch cards, documented in a paper composed and published on a typewriter.
It's not clear who renamed the matrices (originally
but that naming stuck and sucks.
are initialized as
0.0 as if the language had an integer type, and have
about half comma-free style.
And this is not quibbling about style: the Wikipedia article about Jenks natural breaks describes an algorithm entirely unlike these ones. There's no mention of dynamic-programming matrices, and there's no hint in the implementations of something like the 'sum of deviations between classes.' Of course, there are no comments in any of the ports, so it's hard to tell.
The lack of interest, the disdain for history is what makes computing not-quite-a-field. - Alan Kay
But on the other hand, how could they know?
George F. Jenks proposed jenks natural breaks in his 1977 article entitled 'Optimal Data Classification for Choropleth Maps'.
This article isn't available online, or in print. As far as I can tell, it has never been digitized. It's ostensibly still owned by the University of Kansas Geography Department, so the copy that Bill Morris was kind enough to scan and share I can't safely post. Even if the university doesn't assert copyright, because Professor Jenks passed in 1996, it'll be copyrighted in his name until 2072.
Bother your senator today about copyright term limits.
And so we have history in the oddest terms. The idea is lost but the Fortran code is resurrected in a new language every few years, along with a link to the last link to the unreachable text.
Jenks in simple-statistics
I spent a day reading the original text and decoding as much as possible of the code's intention, so that I could write a 'literate' implementation. My definition of literate is highly descriptive variable names, detailed and narrative comments, and straightforward code with no hijinks.
But the sad and foreboding state of this algorithm's existing implementations said that to think critically about this code, its result, and possibilities for improvement, we need at least one version that's clear about what it's doing.
simple-statistics, my library for literate, beginner-friendly statistics,
now has a literate implementation of
Jenks natural breaks. In the process of understanding the implementation,
I found that the matrix system is a sort of dynamic programming
that cleverly solves all breaks from
1-k when you request
Thus my implementation breaks it into two parts - creating the matrices
and pulling a solution out of them.
Try out Jenks in simple-statistics. If the assortment of other algorithms included make you feel like you should create a new 'micro library', you can also just use the jenks implementation as a standalone. Please don't delete the comments: if you need compression, use uglify-js as a separate step.
If you're an academic, take this as a cautionary lesson to promote the free distribution of your work - if not now, for the future. If you're a coder, consider whether the abstraction of software can be misused to mask ignorance of basic principles.