Spatial Data Structures for Better Map Interactions

ryandrake · on March 27, 2014

Spatial data structures are great, and heavily used in all kinds of mapping applications. Most of the systems I've worked with used the more constrained quadtree structure, where the map is divided into uniformly-sized tiles, and a level-of-detail hierarchy is built. This is good for raster data, where each tile is a picture and the whole world is covered by tiles. The R-tree has great advantages when the complexity is not uniformly distributed across the map, for example, vector road data, where vast areas on earth have no data, and most of the data is concentrated in small urban areas.

I believe PostGIS, which adds spatial operations to Postgres, uses an R-tree as its spatial index.

An improvement on the R-tree, the R(star)-tree, uses a different node splitting algorithm and includes re-insertions (similar to balancing a B-tree), reducing both coverage and overlap. The insertion complexity is greater, but in general, R(star)-tree query performance tends to be a bit better for mapping applications. Generally, maps don't change that often, so building the tree tends to happen far less often than querying it.

There are many more specialized spatial data structures available, for example, Kd-trees, which can be perfectly balanced and are useful for storing point data.

If you're really interested in this stuff, the holy bibles for spacial data structures (which I keep in a special place on my bookshelf) are a pair of books written by H. Samet: The Design and Analysis of Spatial Data Structures, and Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS.

EDIT: Formatting, and apparently you can't write the asterisk character on HN

R-Tree paper (1984): http://postgis.org/support/rtree.pdf

R(star)-Tree (1990): http://epub.ub.uni-muenchen.de/4256/1/31.pdf

PostGIS: http://postgis.net

H.Samet textbooks: http://www.cs.umd.edu/~hjs/design.html

jandrewrogers · on March 27, 2014

The spatial data structure literature is quite incomplete, and there is a lot of confusion about what to use when. Many of the issues with spatial indexing performance at companies I go into is that they are doing it wrong, not that they necessarily have a fundamental problem.

It should be pointed out that R-family data structures should only be used for data sets that are (1) small and (2) relatively static. They scale very poorly in a number of ways. The primary reason they are used in databases is that they can handle interval (i.e. non-point) spatial data types without the possibility of pathological space complexity (bad when talking about databases) and fit B-tree indexing models adequately, which that software understands.

Quad-tree variants can scale well for point data types but are often terrible for indexing non-point data types due to pathological space complexity.

Grid-file variants are the canonical structure for large-scale spatial data sets if the data is static. However, the number of companies implementing these correctly at scale is approximately zero in my experience.

For general, scalable, online/dynamic spatial indexing structures, there is another family of algorithms and data structures (adaptive spatial sieves) that are basically ignored in the literature even though they were first described in 1990, albeit in a not very useful form at the time. If you are doing petabyte-scale real-time indexing of polygons at extremely high rates, this is what you would use but little is published about them because most modern variants did not come out of academia.

And the most advanced algorithm family for indexing point-like data is not in the literature at all.

(Background: my day job involves extreme-scale, high-performance spatial indexing software and I invented a few spatial indexing algorithms back in the day that are still the state-of-the-art in their respective algorithm families.)

rdtsc · on March 27, 2014

> And the most advanced algorithm family for indexing point-like data is not in the literature at all.

So what is it? Is it patented, secret(classified)?

And how do you know it is the most advanced without a peer review?

BTW "adaptive spatial sieves" on Google search produces exactly 0 results. Is there another name for it.

Spatial indexing is interesting and complex but not exactly rocket science, there have been a good number of people thinking about. It is hard to believe there are general approaches there that haven't been thought of yet. But again, but I am not an expert, so I could be wrong.

Can you write some more about it, I am very curious.

jandrewrogers · on March 27, 2014

What makes spatial indexing different than other areas of computer science is that its fundamental operands are interval types (like hypercubes). Almost all other areas of computer science focus on solutions that only work on point-like data and so many idioms and intuitions computer scientists use are not actually valid when dealing with spatial representations. Data types that require no less than two integers to describe break the assumptions of computer science developed for data types that require one integer to describe.

Oddly, there are almost no people working on the computer science of spatial representations and indexing. That was how I became involved in the first place; I needed solutions to specific problems for which no active research was being done in academia (and I was talking to people like Samet at the time). And to this day, academics still aren't doing any interesting work on spatial structures.

The algorithms are not classified, just not published. There are a couple patents out there on spatial sieves -- I am the inventor on the first practical one -- but more sophisticated and advanced variants exist that will probably never be patented. Widmayer (respected CS academic) was the first to propose this approach to the problem of generalized interval indexing but a useful algorithm was not discovered until my work in 2007.

The point indexing algorithm family I mentioned has never been patented or published anywhere but was based on the development of a novel theoretical CS construct that allows the expression of polymorphic space-filling curves. These are essentially adaptive to the information theoretic properties of the high-dimensionality data set but still embarrassingly parallel and distributable. I don't think these are being used in production anywhere (yet) but they've been around for several years. Very cool but the number of people that grok them can be counted on one hand.

All modern research on computing with spatial representations is being done at one of a few companies, which is why nothing is published. People like me, and I haven't done the pure research in years, don't have time to generate hundreds of pages of fairly deep content. Consequently, learning advanced spatial indexing is fairly prohibitive; you have to figure it out yourself, which is far from easy, or be at one of the handful of places where they are using it.

kastnerkyle · on March 28, 2014

I found some related papers for others who are interested:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56....

http://bmi.osu.edu/resources/techreports/ahmedBokhariV21.pdf

Also a book:

Space-Filling Curves: An Introduction with Applications in Scientific Computing

by

Michael Bader

jayvanguard · on March 28, 2014

Thanks for the insight. I've seen similar patterns in areas such as recommendation systems. There are active research communities but really all of the detail work and pushing the envelope happens in private companies.

The financial incentives drive the pace. In the future I'm sure it will become more commoditized but right now the practical nature of integrating with large shopping, coupon, and marketing systems and scaling for millions of users place the progress mostly in the domain of industry not academy.

rdtsc · on March 28, 2014

Fascinating, thank you for taking the time to reply!

ehsanu1 · on March 27, 2014

Googling for it without forcing a phrase match returns some limited results. Here's a patent for a "Spatial Sieve Tree", which is apparently invented by the parent, jandrewrogers: http://www.google.st/patents/US7734714

aw3c2 · on March 27, 2014

In that case I am glad it is not covered in literature. Patenting algorithms is bad.

jandrewrogers · on March 27, 2014

Yep, that was the first practical spatial sieve (previous variants had pathological spatial skew sensitivity). Most people that use them use more sophisticated variants. Even with that version, which is simple, many implementors do not understand the technical subtleties well enough to produce a non-broken implementation.

kaib · on March 27, 2014

> And the most advanced algorithm family for indexing point-like data is not in the literature at all.

Don't leave us comp geom geeks hanging.. Which one are you referring to? :-)

poiuytrewq88 · on April 4, 2014

Is there a way we can get in contact? I'm really interested in your work.

darksaints · on March 27, 2014

PostGIS uses an R Tree implemented on a GiST index by default, with a pure R tree partially implemented.

http://postgis.net/docs/manual-2.1/using_postgis_dbmanagemen...

ryandrake · on March 27, 2014

Ahh, I stand corrected. Thanks!

dhconnelly · on March 27, 2014

Shameless plug of my Go R-tree implementation, which cites the Guttman R-tree paper and the Roussopoulos/Kelley/Vincent nearest-neighbor search paper in its implementation, and is therefore perhaps a useful resource: https://github.com/dhconnelly/rtreego

pron · on March 28, 2014

Yet another shameless plug: SpaceBase[1] uses an R-Tree variant especially tailored to concurrent modifications. It does not require a global lock (as an R* tree would), while at the same time it does not degenerate over time.

It is used in real-time defense systems, MMOs, and real-time location-based services.

[1]: paralleluniverse.co/spacebase/

willholley · on March 27, 2014

shameless plug: Cloudant Geo (https://cloudant.com/product/cloudant-features/geospatial/) uses a sharded R*-Tree for big geospatial data.

adam-a · on March 27, 2014

The other way to do this, which is common in (older?) 3d video games, is to render your scene twice, once normally and once with each object shaded in a unique flat colour. Then sample your second bitmap at the mouse position and map the pixel value back to your list of objects. This is very fast and only falls over when you start to have transparent objects, in which case ray-casting and r-trees become appropriate.

munificent · on March 27, 2014

This is basically a grid[1] where the resolution is down to 1 pixel. Not a bad solution!

[1]: http://en.wikipedia.org/wiki/Grid_(spatial_index)

sigil · on March 27, 2014

If you find yourself doing big-time spatial queries, consider using the PostGIS extension for Postgres, which uses R-Trees-on-top-of-GiST indices: http://postgis.net/

In addition to ray-casting and R-Trees, there's another alternative for the "point in polygon test" not mentioned by the article: compute the Winding Number. Appropriate for a very low number of polygons where a full spatial index might be overkill.

http://en.wikipedia.org/wiki/Point_in_polygon#Winding_number...

I think seatgeek made the right choice in this case though with a client-side R-Tree.

Aside: does anyone know what browsers use for hit-testing polygons defined by the <map/> tag?

cwmma · on March 27, 2014

the rtree implementation they found [1] isn't connected to leaflet except that the org that it lives in also manages some leaflet plugins, Vladimir, the author of leaflet, does have his own rtree implementation [2].

For the record I manage the one in the article.

1: https://github.com/leaflet-extras/RTree 2. https://github.com/mourner/rbush

efkv · on March 27, 2014

Ah, you are right. I must have looked at Vladimir's implementation and then just associated him in my mind with the leaflet-extras implementation.

kstop · on March 27, 2014

tl;dr: we re-implemented imagemaps.