Reckless Clicking - notes/it/gis

Mon 15 Aug 2016 — Tue 08 May 2018

An overview of the tools and techniques used in mapping.

Geographic Information Systems

Geographic Information Systems (GIS) let you produce a map by working with shapes.

JTS and GEOS

These are the foundational pieces.

They both provide shape geometry and let you manipulate them. This involves things like testing whether they intersect and measuring the distance between them.

Java Topology Suite is the original. GEOS is a C++ port, which is widely used in all sorts of other programs.

Projections / Coordinate Systems

It's probably best to start with some Wikipedia articles:

Spatial reference systems (SRS): define how to represent geographic coordinates. Look these up at http://spatialreference.org/.
Map projections: convert 3D coordinates to points on a 2D plane.
Geodetic datum: defines the ellipsoid that represents the Earth in this SRS.

Common SRSs that you may want to learn the codes for:

EPSG:27700: British National Grid
EPSG:4326: represent data as latitude and longitude
WGS84: this is the same as EPSG:4326
EPSG:3857: web mercator. Web mapping tools pretty much all use this. It involves a magical maths hack on the geodetic datum to get some big performance speed ups when you do a projection.

If you try to draw some geographic data and can't see it, or you try to intersect two sets of geographic data and they have no intersection points, then it usually means you need to do a transformation between coordinate systems.

If you try to draw some geographic data and it looks like it is just a little out of alignment (e.g. everything shifted 10 to 100 metres in one direction), then you have probably done your transformation without including the datum shift.

Proj4 is the library you need for converting between different coordinate systems. Ports of it exist (e.g. Proj4J and Proj4js). You will usually actually use this via one of the higher-level tools listed below.

Data formats

The two most common format are probably GeoJSON and Shapefiles.

GeoJSON is fairly obvious and self-explanatory, and has a nice clean definition.

Shapefiles are a bit old, but widely used. They come as a set of different files:

Thing.shp: geometric shapes.
Thing.dbf: a table of properties, where each row corresponds to one shape in the .shp file.
Thing.shx: an index to allow quick seeking of records in the .shp and .dbf files.
Thing.prj: the SRS the data uses.

There are a few other file extensions for Shapefiles, but they're not that interesting. For example, they might include a symbology file, which tells you what symbol to draw for a Point geometry on the screen.

Shapefiles have two major drawbacks:

Column names in the .dbf table are restricted to 8 characters and upper-case.

A bad character set / encoding choice in the standard. Some tools disregard this. You can instruct Spatialite to export Shapefiles with a particular character set. ogr2ogr can convert them to unicode: ogr2ogr out-file in-file -lco ENCODING=UTF-8.

(In this millenium you should almost always use UTF-8 as your text encoding. Some Windows tools may default to LATIN-1 because they hate you.)

GDAL and GeoTools

Build on top of GEOS and Proj4, GDAL provide some nice tools which are worth learning:

ogrinfo: inspect a geographic file
ogr2ogr: convert a geographic file between different formats, coordinate systems etc.
A python interface: this is used by Shapely.

You can also do SQL queries on them, including using Spatialite syntax by including -dialect sqlite (http://www.gdal.org/drv_sqlite.html). This uses a thing in Spatialite called VirtualShape/VirtualOGR.

GeoTools is effectively the Java alternative to GDAL. It also lets you convert between coordinate systems and different spatial file formats. It's quite Java, with a lot of ThingStrategyFactoryProviderBeans.

Spatialite and PostGIS

Spatialite and PostGIS are spatial extensions for SQLite and PostgreSQL respectively.

They implement the same standards for how to represent shapes and most of the same functions are available in either. Probably the best way to learn is to work through the Spatialite Cookbook and to look through the Spatialite SQL Functions Reference.

VirtualShape

This is an SQLite driver which lets you read shapefiles.

/* Read-only view onto a file */
CREATE VIRTUAL TABLE table_name USING VirtualShape(shape_file_name, charset, srid);
/*
1 implies column names
TAB is the delimiter, it could be ','
DOUBLEQUOTE is the quote character, it chould be SINGLEQUOTE
*/
CREATE VIRTUAL TABLE table_name USING VirtualText(csv_file_name.csv, charset, 1, POINT, DOUBLEQUOTE, TAB);

/* Import and export */
 .dumpshp table_name geometry_column_name shape_file_name charset geometry_type
 .loadshp shape_file_name table_name charset srid geometry_column_name

You can't index VirtualShape tables.

Spatial Indexing

Spatial indexes are a way to speed up checks as to which shapes are nearby to (or intersect) each other.

They work on just the bounding box of the geometry you give them, so you almost always need to do actual intersection or nearest neighbour checks on the results they give you.

So, basically:

Without an index: do expensive check on every geometry.
With an index: do cheap looking in index, follow by expensive check on just a few geometries.

You should usually use some sort of R-tree.

Quad Trees

Quad trees are the simplest way. Each node represents an area.

Each node may have 4 child nodes which represents its quadrants. It will delegate its contents to those quadrants.

Some of a parent node's contents will overlap more than one of its child quadrants, in which case the parent node retains responsibility for that object.

d3-quadtree is a nice implementation.

R-Trees

R-trees are balanced trees.

Instead of dividing up areas into fixed quadrants, they find shapes which are close to each other and use the bounding box which fits around them.

This can lead to overlapping rectangles.

There are some variants of R-trees (e.g. R*trees), which aim to reduce the amount of overlap.

R-trees often divide up the areas into long thin strips, but we usually care about shapes which are closer to square. Some variants of R-trees (e.g. Hilibert R-trees) address this.

Spatialite and PostGIS both offer R-tree indexes. This Java R-tree implementation by David Moten seems good too.

Topology

When manipulating normal geometries, you often get errors when working with multiple geometries. Coordinates are represented in floating point, and geometric operations tend to have square roots and trigonometry in them, both of which create irrational numbers. Floating point can't store/represent an irrational number. This means that the set logic of our geometric operations can break down after a couple of operations.

For example, given two linestrings that we know touch, find their intersection point. Quite often the point that we find doesn't actually touch one the lines!

These small geometric errors are acceptable sometimes (human eyes can't see them). On other occasions, they break our programs. For example, for THERMOS we need to preserve network connectivity when we cut up the roads.

To mitigate errors, we can store our geometry as a topology. This is a graph representation where we eliminate duplicate coordinates. As a result, when these coordinates jitter around a little due to floating point blargh, shapes will all still be connected to each other.

This approach is also helpful when we are simplifying geometries. Otherwise it is easy to get very ugly looking boundaries.

The downside of this approach is that it is a lot slower than using raw geometries.

The standard way to describe topologies is:

Nodes

are points, and must include all intersection nodes.

Two Points which exist at the same coordinate become the same Node. We may include some snapping/tolerance here as well, to prevent clusters of nearby points.

Edges

are linestrings, and must include all boundaries between polygons.

Each edge connects to exactly two nodes.
Edges cannot cross each-other, or touch each-other anywhere other than at their ends. They When constructing your topology, you must cut your LineStrings wherever they cross intersection.
An edge which is a loop is not allowed: you must cut it into two.

Faces

a space bounded by some edges.

Face 0 is the outside universe.
Each edge has exactly two faces: one on each side.

TopoCurves

a collection of edges

TopoSurfaces

a collection of faces

Implemented in Spatialite using librttopo and PostGis using liblwgeom (still, I think?).

In Spatialite:

A Node is a Point
An Edge is a LineString
Faces, TopoCurves and TopoSurfaces don't have any stored geometry. You have to materialize them.

TopoJSON is a variant of the GeoJSON file format, with the difference that it stores topologies.

Web Mapping

These are the major web mapping libraries:

LeafletJS: simple and works well
OpenLayers: larger and more complicated, with integrated GIS features. Also has PostGIS integration.
Google Maps: annoying compared to Leaflet, Google stipulated you have to use their mapping software if you want Streetview.

Rendering Approaches

Approach	Complexity	Performance	Loading	Tools
SVG	Low	Low	One shot	Leaflet Vector Layers, d3-geo
Canvas	Med	Med	One shot	Leaflet Vector Layers + Canvas, Tom's implementation in THERMOS
Box query + Canvas	High	Med	Progressive	PostGIS, Spatialite
Vector Tiles	High	High	Progressive with good caching	TippeCanoe, TangramJS, MapBox GL JS, assorted Leaflet plugins
Raster tiles	Med	Max	Progressive with good caching	Mapnik to generate
Raster tiles + UTFGrid	High	Max	Progressive with good caching	Mapnik to generate, UTFgrid layers for Leaflet

So, when should you use what, and why?

Do you have a small amount of data (< 5MB)?
- Yes: go to (2)
- No: go to (4)
Do you have a small number (< 1000) of shapes, and your shapes aren't too complex?
- Yes: try SVG. It it's too slow, try Canvas (3)
- No: try Canvas (3)
Is your Canvas implementation fast enough?
- Yes: exit (0)
- No: go to (4)
Do you need to dynamically restyle your geometry (e.g. recolour it, hide and show it)?
- No: go to (5)
- Yes: use vector tiles
Does the user need to be able to interact with your geometry using the mouse (hover or click)?
- No: use raster tiles
- Yes: use raster tiles with UTF-grid hit detection

If you end up choosing either of the low data approaches (SVG or Canvas), you can probably keep your data in HTML local storage for a very large load-time speed up.

Tiles

Using tiles in maps is a performance optimisation. It gives us speed ups for these reasons:

Cacheable
Pre-compute the projection from point in coordinate system to pixel location on a 2D plane
Pre-simplify geometries based on zoom level
(Raster tiles only) pre-render the image

Tile positions (Z, X, Y) are mostly standard across mapping tools, using the following assumptions:

Web mercator projection.
Higher zoom levels are more zoomed in.
Zoom level 0 is the minimum zoom, and has 1 tile for the whole world.
For each zoom in step, we cut each tile into four equal quadrants (each tile is half as tall and wide as in the previous zoom level).
Each tile gets an X and Y coordinate based on how far it is from the Prime Meridian and from the top of world (excluding parts which are unmappable in Transverse Mercator).

The things which vary annoyingly between implementations are:

The position of Z, X and Y in the URL.
Whether Y starts from the North or South.

Depending on your performance needs, you could either pre-generate some, all or none of your tiles. If you don't pre-generate them all, you should put some sort of caching layer in front of them.

MBTiles

When we generate tiles, we often generate lots and lots of tiny files. When you get to large zoom levels (17 or 18), you might get too many files to actually fit in a normal (e.g. EXT4) file system. Even if you don't hit this limit, a lot of our standing file moving tools perform quite badly in this situation.

As an additional annoyance, most of these tiles will actually be empty (for vector tiles or UTFGrid), or a square of blue sea or green grass (for raster tiles), so really we'll get the same file repeated over and over.

MBTiles is a way to address these problems by instead putting the tiles in an SQLite database.

Having done this, it's still quick and simple to serve them using a small program. An example, in Python Flask: https://github.com/cse-bristol/hes-map/blob/master/server/tiles/tile_server.py.

Raster Tiles

Raster tiles are just images. They can be any sort of raster image format (e.g. .png, .jpg, .bmp).

There are lots of different renderers for them. Previously I've used Mapnik. This is OpenStreetMap's renderer. It gives a decent looking result, is quick, and has a Python API.

UTFGrid

UTFGrid is a way to make a raster layer interactive. It works by adding a second layer of tiles which is invisible, but against which you can check the pointer position.

In addition to letting you check what object your cursor is over, UTFGrid also stores some JSON data about that object.

The same geometry might be present on multiple UTFGrid tiles. If you want to highlight a shape or put a border over it in response to user mouse movements or clicks, then you'll also need to query a server for its boundary.

You might imagine that an alternative way to do something like this projection would be to check the colour of the pixel in the raster tile layer to determine what kind of thing it is. This is possible, but only if you serve your tiles and JavaScript from the same location. Otherwise, you will fall victim to the Cross-Origin policy.

Vector Tiles

You may have noticed in the flow chart that I never recommended using a box query.

Vector tiles are almost always superior to making arbitrary box queries. They are what you get if you ask the question "What if I snap my box queries to a grid, so that I can cache them?". Unless your data is so dynamic as to be uncacheable, they are a better choice.

Thus far, I have used Mapbox Vector Tiles (.mvt). This is a binary (protocol buffers) format, so it is compact.

tippecanoe is a program which turns a GeoJSON file into an MBTiles SQLite database full of MapBox Vector Tiles.