
FWIW, even though PostGIS is pretty great, if your use-case is primarily offline analysis and you don't need the data to be permanently accessible or writable, consider not using a database at all. You can do a lot, a lot faster with e.g. https://shapely.readthedocs.io/en/stable/manual.html and/or https://geopandas.org/.



Ultimately, for spatial functions both PostGIS and Shapely/Geopandas use GEOS as the base library, so neither is 'a lot faster' for analysis at a fundamental level. Issuing lots of separate queries to a database does cost some speed, but that pattern isn't necessary for the vast majority of use cases. OTOH PostGIS is in many cases much more finely tuned than shapely/geopandas for speed, scale and parallelisation (it is a much more highly maintained library), so it can actually be faster.

Functionality is a different thing. PostGIS has a lot more baked-in functionality than shapely. But shapely/geopandas exposes the depth of Python and its libraries, which allows much more extensive customisation of how to work with data.

Lots of tradeoffs and overlapping use cases - I just wanted to add some depth to this discussion.


Yes, even for on-disk read-only use cases, shapefiles are almost always much faster. Overhead from serialization, network, etc often dominates the performance of GIS.


I'd argue that being able to use SQL to more easily combine and filter datasets, including non-geospatial ones, is still very useful in the circumstances you described.


Not really, combining datasets in (Geo)Pandas is very straightforward, including spatial joins: https://geopandas.org/docs/user_guide/mergingdata.html#spati... Of course, it's all a matter of personal preference, but I have used both PostGIS and GeoPandas extensively.


You will eventually run out of memory or hit some other constraint given large enough datasets. PostGIS will churn through it and not crash...


I'm not sure what 'a lot faster' really means in this context.

Honestly, I've found using Spatialite queries to be orders of magnitude faster for analysis than shapely or geopandas. The latter typically imply row-by-row selection and manipulation for starters.

If you can wrangle the data into a geopackage first it's super easy to run queries over the data and extract what you need.


Faster both in terms of querying and in terms of doing the kind of analysis you want and getting the answers you need.

Of course, happy to acknowledge that different tools might work better in different scenarios. For example, I suspect that speed of querying is really just due to the data being in memory so if you can configure Spatialite or PostGIS to do the same, I certainly wouldn't be surprised if you say you can do even better.

But for one-off analyses, it's common to spend a lot of time just getting your data into the right shape, doing various manipulations, perhaps even wrangling the geometries. For that, working entirely within SQL is frustrating as heck. For example, I did an analysis on flight paths over heavily populated areas once, which involved turning infrequent point locations with gaps in the data into a smooth interpolated flight path. That's easy if you have numpy and scipy at your disposal, otherwise it's not. Another analysis involved estimating housing prices in neighborhoods without any recent sales, from prices in adjacent neighborhoods with sales, and again it's easy to code up an algorithm to fill the gaps or to run a geostatistical analysis that can impute the missing values, but not if all you have is SQL, or if you have to constantly do roundtrips between database and code.
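To give a rough idea of the flight path case, here is a minimal sketch of filling gaps between sparse position fixes. It uses plain linear interpolation from the standard library as a stand-in; with numpy and scipy you would more likely reach for something like `numpy.interp` or a spline to get a smoother path. The function name `interpolate_track` and the sample fixes are made up for illustration, and the coordinates are assumed to be already projected (e.g. metres).

```python
from bisect import bisect_right

def interpolate_track(fixes, step):
    """Linearly resample (t, x, y) fixes every `step` seconds.

    Assumes at least two fixes, sorted by strictly increasing time, with
    x and y in projected units so straight-line interpolation is sensible.
    """
    times = [t for t, _, _ in fixes]
    out = []
    t = times[0]
    while t <= times[-1]:
        # Index of the segment [times[i], times[i + 1]] that contains t.
        i = min(bisect_right(times, t) - 1, len(fixes) - 2)
        t0, x0, y0 = fixes[i]
        t1, x1, y1 = fixes[i + 1]
        w = (t - t0) / (t1 - t0)  # fraction of the way through the segment
        out.append((t, x0 + w * (x1 - x0), y0 + w * (y1 - y0)))
        t += step
    return out

# Hypothetical fixes with a 60 s gap between the second and third points.
fixes = [(0, 0.0, 0.0), (10, 1000.0, 0.0), (70, 7000.0, 3000.0)]
track = interpolate_track(fixes, step=10)
```

The point is less the interpolation itself than that the intermediate result is an ordinary Python list you can keep reshaping, which is awkward to express in SQL alone.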

I mention all this not to start an argument, but simply because when I first started doing GIS work, I was very confused about what the right tools and workflow were, and once I embraced projections (vs. working directly with spheroids) and in-memory analysis in Python, my productivity went way up. If other people find themselves in the same scenario, they owe it to themselves to try out both approaches to see what works best for them.
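As a small illustration of the projections point, the sketch below compares a great-circle distance computed on lon/lat with a plain Euclidean distance computed after projecting to local planar metres. It is only a sketch under stated assumptions: it treats the Earth as a sphere rather than a true spheroid, uses a simple equirectangular projection around a reference latitude, and the two points are hypothetical. In practice you would normally let a library reproject to a proper CRS (for example GeoPandas' `to_crs`).

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean radius; a sphere, not a true spheroid

def great_circle_m(lon1, lat1, lon2, lat2):
    """Haversine distance in metres between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def project_local(lon, lat, lat0):
    """Equirectangular projection to planar metres around latitude lat0."""
    x = EARTH_RADIUS_M * math.radians(lon) * math.cos(math.radians(lat0))
    y = EARTH_RADIUS_M * math.radians(lat)
    return x, y

# Hypothetical (lon, lat) points a couple of dozen kilometres apart.
a = (-0.1276, 51.5072)
b = (-0.4543, 51.4700)
lat0 = (a[1] + b[1]) / 2  # reference latitude for the local projection

ax, ay = project_local(*a, lat0)
bx, by = project_local(*b, lat0)
planar = math.hypot(bx - ax, by - ay)  # ordinary 2D geometry after projecting
spherical = great_circle_m(*a, *b)

print(round(planar), round(spherical))
```

Over a small area the two numbers agree closely, and once the data is projected every later step (buffers, nearest neighbours, interpolation) can use ordinary planar maths.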


Do those assume/require that all the data will fit into memory?


Mostly yes, though you can wing it a little bit by relying on swap and/or doing any big filtering operations up front. In my work, it's never been a problem, because you can cram a lot of data into 16 GB, millions of big polygons and metadata. Note also that textual formats like GeoJSON or WKT are incredibly wasteful of space because all coordinates are encoded as characters instead of floats or integers, whereas the in-memory representation is much smaller, so even huge source files are likely to fit in memory just fine. But judging by the sibling comments it does look like it's a limitation for some.
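To make the text-versus-binary point concrete, here is a rough sketch comparing the size of a WKT string with the same coordinates packed as 8-byte doubles. The vertices are made up, and the binary side simply assumes two IEEE 754 doubles per vertex (which is how WKB stores coordinates); the real in-memory layout of any given library will carry some extra overhead on top.

```python
import struct

# Hypothetical line of 1,000 lon/lat vertices at typical float precision.
points = [(-122.4194155 + i * 0.0001, 37.7749295 + i * 0.0001)
          for i in range(1000)]

# Text encoding: WKT spells every coordinate out as decimal characters.
wkt = "LINESTRING (" + ", ".join(f"{x!r} {y!r}" for x, y in points) + ")"
text_bytes = len(wkt.encode("utf-8"))

# Binary encoding: two little-endian 8-byte doubles per vertex.
binary = b"".join(struct.pack("<dd", x, y) for x, y in points)
binary_bytes = len(binary)

print(text_bytes, binary_bytes, round(text_bytes / binary_bytes, 2))
```

The exact ratio depends on how many digits the source file writes per coordinate, and GeoJSON adds brackets and property names on top of that.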



