Solving Earth Problems

The answers to spatial problems tend to converge towards certain techniques that everyone uses. Here are a few.

Is this point in any of these rectangles? Indexes

If you’ve used PostGIS, it’s likely you’ve run into the all-important index problem. Adding a spatial index to a table means the difference between multi- and sub-section queries on a table. In software, the same principle goes: indexes are typically required for performance if you’re asking anything about spatial intersections.

Typically spatial indexes are R-trees: the data structure provides a nice balance of space and performance, and is fairly simple to implement.

In JavaScript, use rbush, Vladimir Agafonkin’s library based on recent research. Closer to the metal, spatialindex is essential.

Is this point in any of these polygons? Indexes + Point in Polygon

R-Trees contain rectangles and points, not polygons or lines. What if you need a more specific answer? This leads us to an important technique.

A useful and simple technique for asking questions like ‘is this point in any of these 2 million polygons?’ is that you can break it down into steps based on assumptions:

  • If a point is in a polygon, it must be in the bounding box for that polygon.
  • Therefore, we can answer this question in two steps:
  • Which polygon bounding boxes is this point within?
  • Of those polygon bounding boxes, how many of the actual polygons does this point fall into?

This takes advantage of the fact that the first question can be extremely fast to answer, thanks to R-Tree, whereas the second question, whether a point is in a specific polygon, usually cannot be fast. The canonical algorithm for point-in-polygon intersections looks at every point in the polygon, even in highly optimized implementations like GEOS.

In JavaScript substack’s point-in-polygon module and my leaflet-pip expose this algorithm. It’s an extremely simple technique: barring naive implementation mistakes, the performance is pretty predictable and scales relative to the number of vertices in the shape.

What is the area of this polygon? Ring Area

There are many ways to calculate the area of a polygon on the earth. The most complex and most accurate is rarely used - taking into account the minute surface variations and elevation differences of the geoid.

A much more common approach is to calculate the area of a ring using some simple trigonometry popularized by Robert Chamberlain at NASA. There are many implementations of that approach, like geojson-area and geojson-utils.

One neat side-effect of the algorithm to calculate polygon area is that it also answers the question of the polygon’s winding number: a clockwise loop will yield area greater than or equal to zero, while a counterclockwise loop has negative area.

What are the closest points to this one? KNN

What about finding the closest points to a certain point? The old-fashioned answer from my MySQL days would be something like a bounding box query with > and <: you can do better if you’re writing an application.

Using a k-d tree, we can solve the nearest neighbor problem without having to look at every entry. In JavaScript, sphere-knn is a tested and awesome implementation of this by Dark Sky.

Derivative Questions

Many questions distill into these: for instance, which shapes interact with a bounding box, a common question in tiled maps, is some combination of indexing and point in polygon checking.

Cells

An interesting new approach that we’ve been exploring at Mapbox and that has been implemented in Google’s S2 Geometry Library and Morgan Herlocker’s tile-cover library, is to use cell indexes rather than trees to answer these questions.

It’s a promising approach because it flattens the index structure, making it possible to distribute the index across a network and express spatial queries as range queries. Since the derivation of cells from a shape is very similar to rasterization, we can steal tricks from graphics programming and use ray-tracing techniques to make it super-fast.

Waltz

(new Teen Mom song)

RPL (2) - Terrarium

Bark Bench Terrarium

CC-BY-NC joshleo

The core of rpl is terrarium. Terrarium is an alternative to JavaScript’s eval() function and node’s vm.runInNewContext(). Unlike these systems, rpl is designed to run code as full-fledged scripts and to manage its execution as a proper full-fledged process.

The precedecessor to rpl, called mistakes.io, used eval() at its core - it would execute each 1..n subset of code text in eval() and display the result. While this worked very well for imperative code that didn’t use timing functions, but couldn’t handle async and timed code properly. This ruins the fun of JavaScript.

The problem is that some programs will finish executing in one pass:

var x = 10;

But other programs will never exit:

setInterval(function() {
  var x = 10;
}, 100);

JavaScript has a few different ways to run forever - setInterval, setTimeout, process.nextTick, and so on. It’s very difficult to ‘clear all intervals’ without brutal hacks.

The other problem of running code in a sandbox is scope: if you run something like eval('x = 10'), the variable x will now be defined in the outer scope. While node’s vm.runInContext method gives more control over scope access, there isn’t a browser equivalent with the same behavior.

Terrarium’s solution is true nesting in both cases - for node, it creates a module on the fly and runs it with child_process.fork to create a new subprocess with a stream for communication. In a browser, it creates a new HTML resource with the javascript code embedded and runs it in a new iframe with window.top as a communication mechanism.

This allows the sub-scripts in question to do anything normal code can do - in a browser, you can create elements and attach events to them, and in node, you can create web servers, access files, and all the rest. And all this without hacks necessary even in node core to make the default node REPL act like software.

var T = require('terrarium');
var terrarium = new T.Node(); // or T.Browser();
terrarium.on('data', function(d) {
  // instrumentation data
  terrarium.destroy();
});
terrarium.run('//=1');

Terrarium is a low-level API: you create a sandbox, write code to it, and receive instrumentation data. The API is identical between node and the browser.

var terrariumStream = require('terrarium-stream');
var terrarium = new terrariumStream.Node();
terrarium.on('data', function(d) {
  // instrumentation data
  terrarium.destroy();
});
terrarium.write({ value: '//=1' });
  • browser: iframe, communication via window.top
  • node: subprocess, communication via fork and streams

terrarium-stream is a slightly higher-level API that connects the node stream lifecycle - creating, reading, writing, destroying - to terrarium. Using terrarium-stream, rpl is able to interface with browser & node code and build an extremely simple websocket-based interface using shoe.