Four years ago one of my clients wanted to "donate" a system to the local fire department to help them do a quick proximity search: find the nearby fire hydrants and quickly pick a working one near the fire.
And since it was charity and involved a bunch of private data, Google was not an option ($$$$). So I (just a full-stack developer back then) was like, "listen, I have no idea what this GIS stuff is, but I'll give it a try." After some quick research: boom, PostGIS. Reading the docs and testing things, plus QGIS to visualize and help me understand it better. 10/10!
That was my unintentional "career" shift. Thanks to PostGIS I'm now a senior dev at the largest (local) food delivery company, specialized in GIS and real-time, data-driven systems, using PostGIS every day lol
not sure if this applies to you directly, but here's my 2¢:
1. a lot of people fall for the meme and drink the koolaid that if you spend 6+ years doing nothing more than studying then BOOM, you'll graduate and companies will just blindly start throwing money at your face. surprise surprise, it doesn't work like that.
2. people should follow their passion and everything but they should also at least keep an eye on the market.
3. students (not universities) should really start considering a university course unfinished without some kind of internship. if you manage to get an internship at a company that uses the tech you're interested in (in your case, GIS), you can either realize you don't like it that much or understand what direction your studies need to take in order to become more proficient. (applying for internships is sampling the market, btw)
4. specialists are only needed up to a certain point. in most situations a good generalist can learn enough to get the ball rolling and bring home results. think 80/20 Pareto principle or something like that. btw, a good generalist can surpass a specialist over time.
> I blame the internet and the accessibility of knowledge to the point where people can go "IDK WTF this is, but lemme google it".
the people you complain about can probably already do a lot of other useful stuff, to the point where they can just "lemme google it".
I think it's more that the tooling has been commoditised. Historically, professional GIS was very much about Esri's tools (and, to a lesser extent, MapInfo). Your employability was directly linked to your ArcGIS proficiency.
Now, there's a massive ecosystem of open-source GIS: PostGIS is probably the standout, but also QGIS, everything around OSM, GDAL/OGR, and a hundred others. In government work and some parts of academia, Esri still dominates, but there's now much more to "geo".
This right here. GIS is accessible to anyone who knows JavaScript, Python, and SQL thanks to the open-source GIS ecosystem. Previously it was the sole domain of ESRI priests.
Now this is a net good thing, but there are downsides. Generalists wielding specialist tools means that a lot of the wonky basics aren't known until things break. But this is a blip compared to the step-change of making an entire field accessible to anyone who's a decent programmer with a healthy appetite for research. It's crazy how much you can spin up with PostGIS, Python, Mapbox and/or Leaflet...no ESRI license needed.
Indeed. This was my experience when a nonprofit in my area set up their geomapping for Covid-19 support. There was of course some domain knowledge to pick up, but I was super impressed with the tooling. Even more so now that networkx and QNEAT3 support other topological use cases. All in all, exciting stuff.
I was doing ArcGIS about 10 years ago, and the quality of the free maps the MN Data Deli provided was amazing. Your comment makes me excited to try out these new tools even though I mainly do front-end work now. It's a great time to be a programmer!
PostGIS is one of those rare examples of highly specialised software that is both OSS and best-in-class. Usually OSS is relegated to highly commoditized software, but GIS is anything but, even after all these years.
I can't really think of many others? Maybe OptaPlanner would be another candidate.
Yes, among off-the-shelf GIS databases, PostGIS is generally as good or better than anything else on the market. More capable systems exist, particularly in terms of scale and performance, but those are all bespoke.
It's definitely up there. I guess it's hard to say there is a single "best-in-class" solver because it does depend on the problem at hand but if I had to pick one it would be OptaPlanner.
I'd say it's not quite in the same category as PostGIS, where the open-source project is handily better than any closed-source rival. It competes well with some closed-source competitors, but not all of them.
Thanks for sharing this! I've been thinking about alternatives to PostGIS to handle larger datasets (millions of rows), and nothing seems to come close to Postgres's level of functionality, performance, and community support.
If you really need to scale beyond what Postgres/PostGIS can handle, then you might want to check out GeoMesa[1], which is (very loosely) "PostGIS for HBase, Cassandra, or Google BigTable".
That being said, you may not need it, because Postgres/PostGIS can scale vertically to handle larger datasets than most people realize. I recommend loading your intended data (or your best simulation of it) into a Postgres instance running on one of the extremely large VMs available on your cloud provider, and running a load test with a distribution of the queries you'd expect. Assuming the deliberately over-provisioned instance is able to handle the queries, you can then run some experiments to "right-size" the instance to find the right balance of compute, memory, SSD, etc. If it can handle the queries but not at the QPS you need, then read replicas may also be a good solution.
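As a rough sketch of what such a load test query might look like (the schema and names here are made up for illustration, not anyone's actual setup):

    -- Hypothetical schema: features(id bigint, name text, geom geography(Point, 4326))
    -- with CREATE INDEX ON features USING gist (geom).
    -- Run a representative query under EXPLAIN ANALYZE on the deliberately
    -- over-provisioned instance and watch the actual times and buffer hits.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id, name
    FROM features
    WHERE ST_DWithin(
            geom,
            ST_SetSRID(ST_MakePoint(-93.26, 44.98), 4326)::geography,  -- query point (lon, lat)
            500                                                        -- search radius in metres
          )
    LIMIT 20;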
Yeah at my current job we run RDS in AWS and scale it up to a m5.12xlarge when we need to get sh*t done fast. It normally sits around a 4xlarge simply because the 12 is far too expensive.
Millions is a pretty small dataset these days. I'm increasingly of the opinion that the boundary for considering "big data" tools starts at about 100 million rows.
I'd put it higher, depending on your dataset, usage patterns, etc.
This is kinda the hill I keep nearly dying on at work. Team X wants to spend months investigating and deploying BigDataToolY because "we have big data". This is not our core business or core competency. I tell them to dump the data into Postgres and just see how it performs out of the box so we can get back to working on stuff that matters. They don't, I do, we end up using Postgres.
We had one team ignore the advice and go straight to Redshift (which is a warehousing product and totally inappropriate for their use case; but they ignored that advice too). When they finished and then woke up to the reality that Redshift wasn't going to work, I literally just dumped their data into a plain vanilla Postgres instance and pointed their app at it and... everything worked out of the box. That was ~1b rows in a single poorly-modeled table.
Another team was certain that Postgres could never work and was looking to build out a solution around BigQuery, Firestore, and some other utilities to pre-process a bunch of data and pre-render responses. One of our main products was operating in a severely degraded state while they spent months on "research" before deciding this would take about four more months to implement. So I dumped all the data into a Postgres instance (using TimescaleDB) and it... worked out of the box. The current data is a few billion rows across half a dozen tables (the bulk of the data in a single table), but I'd tested to 4x the data without any significant performance degradation.
These are just a couple "notable" examples, but I've done this probably a dozen times now on fairly sizeable revenue-generating real-world products. Often I'm migrating this data _off_ of big data solutions which are performing poorly due to either the team's lack of knowledge and experience to use them properly or the tool having been the wrong one to use in the first place.
I've yet to have to even have the team model their data properly. Usually just a lift and shift into Postgres solves the problems.
I've told every one of these teams "We'll use Postgres until it stops working or starts getting needy, _then_ we'll look at the big data tools.". I've yet to migrate any of these datasets back _off_ of Postgres.
I've promoted the idea in the past that if you're going to use something other than PostgreSQL (or MySQL if that's the DB that's already embedded) you need to PROVE that what you need to build can't work with that standard relational database before adopting some new datastore.
It's surprisingly hard to prove this. The most common exception is anything involving processing logs that generate millions of new lines a day, in which case some kind of big data thing might be a better fit.
I've been using PostGIS a bit for a toy project with Elite: Dangerous star system data. It's been a hoot, but I do worry that I'm going to start having trouble optimizing my queries. Finding all the systems within say 20 Ly of our sun can take upwards of a few seconds, and I've already added a GIST index on the positions.
I also find it a bit strange how 3D feels kinda tacked on, but it makes sense when you realize most maps are in fact 2D.
Anyway, my 2 cents of experience. If anyone has good advice for million-plus-row, 3D spatial optimizations in PostGIS, please let me know.
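For what it's worth, a minimal sketch of the setup I'm describing (the table and column names are just my toy schema): an N-dimensional GiST index plus ST_3DDWithin, so the planner can use the index instead of computing distances to every row.

    -- Toy schema: star systems stored as 3D points in the game's own
    -- Cartesian coordinate system (units = light-years).
    CREATE TABLE systems (
        id       bigint PRIMARY KEY,
        name     text,
        position geometry(PointZ) NOT NULL
    );

    -- N-dimensional GiST index so the Z coordinate participates in the index.
    CREATE INDEX systems_position_gist_nd
        ON systems USING gist (position gist_geometry_ops_nd);

    -- All systems within 20 Ly of Sol (at the origin in this coordinate system).
    SELECT name, ST_3DDistance(position, ST_MakePoint(0, 0, 0)) AS dist_ly
    FROM systems
    WHERE ST_3DDWithin(position, ST_MakePoint(0, 0, 0), 20)
    ORDER BY dist_ly;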
The only thing I can recommend off hand is to try simplifying your geometries a bit for searching if they're complex. Searching along the border of a complicated shape is much harder than searching around a square; it has to do a lot more calculations.
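For example, something along these lines (table, column names, and tolerance are hypothetical, and results near borders become approximate):

    -- Keep a simplified copy of complex polygons and search against that.
    -- Matches near the border are approximate, but the search is much cheaper.
    ALTER TABLE regions ADD COLUMN geom_simple geometry;
    UPDATE regions
       SET geom_simple = ST_SimplifyPreserveTopology(geom, 0.001);  -- tolerance in SRID units
    CREATE INDEX regions_geom_simple_gist ON regions USING gist (geom_simple);

    -- Coarse search against the simplified shapes (assuming SRID 4326 data):
    SELECT id
    FROM regions
    WHERE ST_Intersects(geom_simple, ST_MakeEnvelope(-93.3, 44.9, -93.2, 45.0, 4326));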
Yes, I had the same solution ~15 years ago when location-based search got popular. Searching for anything within some distance of a given location, with a proper earth projection and a circular radius, "did not scale". Ignoring the earth's projection, pretending the earth is flat, and searching within a rectangle was much, much faster.
In the end we used this simplified calculation with added "regions": the world was split into 15 km × 15 km squares, so any square more than x squares apart could never be in the result set. This could maybe be done with modern PostgreSQL partitioning and partition elimination in a clever way.
And without partitioning, maybe cleverly z-ordering the entries physically in the database (clustering) could reduce a lot of random I/O.
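A rough sketch of both ideas in PostGIS terms (all names here are made up; the ST_SnapToGrid step assumes a projected CRS with metre units):

    -- Assign each row to a 15 km x 15 km grid cell.
    ALTER TABLE places ADD COLUMN cell geometry;
    UPDATE places SET cell = ST_SnapToGrid(geom, 15000);
    CREATE INDEX places_cell_idx ON places USING gist (cell);
    -- Cells more than one step apart can never contain matches, so a query
    -- can filter on nearby cells first and only then do the exact distance check.

    -- Without partitioning: physically reorder the table along the spatial index
    -- (a crude spatial clustering) to cut down on random I/O.
    CREATE INDEX places_geom_gist ON places USING gist (geom);
    CLUSTER places USING places_geom_gist;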
MBR (minimum bounding rectangle) filtering is a common optimization shortcut, but I have to wonder whether PostGIS doesn't already do that as a first-pass filter anyway. It's an incredibly smart and performant extension.
Sometimes performance can also be improved by choosing between intersection and overlap/contains/contained predicates where possible.
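For what it's worth, the bbox-only operator and the exact predicates can be combined explicitly (the table name below is just for illustration); PostGIS functions like ST_Intersects already include an index-assisted bbox check internally:

    -- '&&' compares bounding boxes only: cheap, index-assisted, approximate.
    -- ST_Intersects then does the exact geometry test on the survivors.
    SELECT p.*
    FROM parcels p
    WHERE p.geom && ST_MakeEnvelope(-93.3, 44.9, -93.2, 45.0, 4326)               -- coarse MBR pass
      AND ST_Intersects(p.geom, ST_MakeEnvelope(-93.3, 44.9, -93.2, 45.0, 4326)); -- exact pass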
Is it possible you haven't added the bounding box for your geometries? The GIST index will use your bounding box to optimise the queries, but only if it finds one. Also make sure your query is actually using the index.
Perhaps I should have noted in my initial post that the only spatial data stored at the moment is each star system's (x, y, z) position in the game's coordinate system.
Unless I'm misunderstanding, there should be nothing to cull from a point (ignoring projections)... right?
What kind of front end do you use? Usually the spatial DB is only part of the solution; the GIS information needs to be displayed, and functionality for interacting with maps and objects needs to be provided. What framework would pair well with PostGIS in the open-source world?
Yea, the front end is kinda an open question. I've been slowly relearning OpenGL (or maybe Vulkan) to build a little 3D scatter-plot-like map. 2D projections could come as a feature on top of that, I guess.
Mostly I just want textual information at the moment; viz is just "cool". Like, the primary question is simply: from star A to star Z, how many FSD high-wake jumps will I need to perform, and how much fuel will it cost? Then the pathfinding gets modified to know about star class and find valid routes with fuel-scoopable stars; then the algorithm could be modified again to account for range-boosting white dwarf stars; finally, it would be really cool to incorporate the game's market data, since while other third-party tools already do all these things, they don't generate good trade loops.
I've been very disappointed with what little open-source 3D mapping software I've seen. Everything seems heavily centered around 2D projected maps. So personally, I've just been using matplotlib and its various plotting tools in a Jupyter notebook with read-only access to my database, allowing me to write %sql ... and get a Python object for the resulting rows.
3D is a major disappointment for me as well. There's a lot of entrenched mindset in the GIS field against 3D; I do think there is major scope for innovation in this direction.
I find that somewhat hilarious actually... Like what do GIS people think, we live on a flat earth?
Seriously though, if all you ever want to do is draw 2D maps, I can see optimizing for those cases. I can even maybe understand how it's best to nail those features down first... However, it just seems like so many of the tools' data models are corrupted by a fundamental mishandling of 2D vs 3D.
For example, a library I'm using for WKB encoding/decoding pollutes my 3D points with a bunch of functions for dealing with 2D points that I frankly want nothing to do with. Why should I ever want a function on a 3D point to return a 2D point with an optional third member? I can see how this might help you embed a 2D point inside a 3D space, by treating the null value as 0, or filtering it, or some other user-defined or standard logic... but if I have a 3D point, why on earth am I casting it to a partially optional 3D point?
That was just an example that's been really bothering me... there's lots of other examples of 2D driven features looking a bit strange in 3D.
Not to mention adding an M dimension, or, outside PostGIS, any others. The difficulty with this problem is the immense, vast, epic scale of the issue. I want to say you could probably just define the 3D metric space and then build the 2D one off of that, but then why stop there? Perhaps I want a 5D space with distance measured as some similarity metric... is this not starting to sound like a more general problem than GIS?
And another thing! It would be very interesting to think about which aspects of the spatial reference are useful in 3D. I'm still learning about how SRIDs are used in PostGIS, but this [1] example makes a lot of sense for lat/long references.
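For the lat/long case, a generic illustration of why the SRID matters (this is just a sketch, not the [1] example):

    -- SRID 4326 is lon/lat in degrees, so ST_Distance on plain geometry
    -- returns degrees; casting to geography (or transforming to a projected
    -- SRID) gives metres instead.
    SELECT ST_Distance(
             ST_SetSRID(ST_MakePoint(139.69, 35.69), 4326)::geography,  -- Tokyo
             ST_SetSRID(ST_MakePoint(135.50, 34.69), 4326)::geography   -- Osaka
           ) AS metres;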
My suggestion might be to add a precalculated step that splits "the universe" into grid areas divided by geometry complexity (i.e. population-density gridding), then process these reference areas in parallel. How you do the latter is probably the interesting part!
I'm not sure this is as simple as you make it sound...
Neighbors (the one example I gave) can exist across these precomputed grids, which would need to be accounted for. This is essentially what the index is for. So it sounds like you have the right idea, just not fully fleshed out.
As for batch processing once I have a set of independent systems, I actually don't think that's the interesting part of this thread, since I could just package up the needed inputs and ship them off to their own cores for all I care. Most of the questions I care about require the relationships between these positions, i.e. distance and derived metrics.
I mean, I'm using the provided GiST-based R-tree spatial index... Otherwise, I'm not really making any use of PostGIS and might as well look for other spatial DBs or write my own little data structure like you mention (thanks for the link to VP-trees btw).
I was hoping that the performance of PostGIS's features could be improved with some know how.
Hmm, I think you are looking for SP-GiST (unbalanced space partitioned GiST), unless you meant that implicitly. It can be much faster for...space partitioning.
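Concretely, something like this (operator class name as I recall it from the PostGIS docs; table/column names are hypothetical):

    -- SP-GiST spatial index for a 3D point column. Unlike GiST's balanced
    -- tree of bounding boxes, SP-GiST partitions space (quad/oct-tree style),
    -- which can be noticeably faster for point data.
    CREATE INDEX systems_position_spgist
        ON systems USING spgist (position spgist_geometry_ops_3d);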
Oh damn, I don't know how I missed that section of the docs. Indeed, though I haven't really profiled this, an SP-GiST index does seem quite a bit faster for me.
Thanks for the tip!
Still can't plot from Sol to Colonia with a range of 100 Ly (near the very dense center of the galaxy) in under 10 seconds though. But I'm aware of some larger issues in the search algorithm itself that are probably my next task on this journey. The performance of this index feels closer to what I think I was expecting to see on its own.
PostGIS is loads of fun with sports data. Being able to find similar passages of play just using ST_MakeLine and ST_FrechetDistance feels like magic when you first do it in all of five lines of code.
Spatial tools are a bit like logic programming in that they’re very slightly esoteric. But once you know they’re the right hammer for some nails, they’ll save you lots of time and effort over your life, and let you express some ideas you might otherwise struggle with.
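A sketch of the kind of thing I mean, with a made-up pass_events table (match_id, possession_id, seq, pos):

    -- Build one linestring per possession, then rank other possessions by how
    -- similar their shape is to a given one using discrete Fréchet distance.
    WITH paths AS (
      SELECT match_id,
             possession_id,
             ST_MakeLine(pos ORDER BY seq) AS path
      FROM pass_events
      GROUP BY match_id, possession_id
    )
    SELECT b.match_id, b.possession_id,
           ST_FrechetDistance(a.path, b.path) AS similarity
    FROM paths a
    JOIN paths b ON (a.match_id, a.possession_id) <> (b.match_id, b.possession_id)
    WHERE a.match_id = 101 AND a.possession_id = 7
    ORDER BY similarity
    LIMIT 10;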
FWIW, even though PostGIS is pretty great, if your use-case is primarily offline analysis and you don't need the data to be permanently accessible or writable, consider not using a database at all. You can do a lot, a lot faster with e.g. https://shapely.readthedocs.io/en/stable/manual.html and/or https://geopandas.org/.
Ultimately, for spatial functions both PostGIS and Shapely/GeoPandas use GEOS as the base library, so neither is 'a lot faster' for analysis at a fundamental level. Making lots of round-trip queries to a database brings some speed loss, but that isn't necessary for the vast majority of use cases. OTOH, PostGIS is in many cases much more finely tuned than Shapely/GeoPandas for speed, scale and parallelisation (it is a much more actively maintained library), so it can actually be faster.
Functionality is a different thing. PostGIS has a lot more baked in functionality than shapely. But shapely/geopandas exposes the depth of python and its libraries which allow much more extensive customisation of how to work with data.
Lots of tradeoffs and overlapping use cases - I just wanted to add some depth to this discussion.
Yes, even for on-disk read-only use cases, shapefiles are almost always much faster. Overhead from serialization, network, etc often dominates the performance of GIS.
I'd argue that being able to use SQL to more easily combine and filter datasets, including non-geospatial ones, is still very useful in the circumstances you described.
Not really, combining datasets in (Geo)Pandas is very straightforward, including spatial joins: https://geopandas.org/docs/user_guide/mergingdata.html#spati... Of course, it's all a matter of personal preference, but I have used both PostGIS and GeoPandas extensively.
I'm not sure what 'a lot faster' really means in this context.
Honestly, I've found using Spatialite queries to be orders of magnitude faster for analysis than shapely or geopandas. The latter typically imply row-by-row selection and manipulation for starters.
If you can wrangle the data into a geopackage first it's super easy to run queries over the data and extract what you need.
Faster both in terms of querying and in terms of doing the kind of analysis you want and getting the answers you need.
Of course, happy to acknowledge that different tools might work better in different scenarios. For example, I suspect that speed of querying is really just due to the data being in memory so if you can configure Spatialite or PostGIS to do the same, I certainly wouldn't be surprised if you say you can do even better.
But for one-off analyses, it's common to spend a lot of time just getting your data into the right shape, doing various manipulations, perhaps even wrangling the geometries. For that, working entirely within SQL is frustrating as heck. For example, I did an analysis on flight paths over heavily populated areas once, which involved turning infrequent point locations with gaps in the data into a smooth interpolated flight path. That's easy if you have numpy and scipy at your disposal, otherwise it's not. Another analysis involved estimating housing prices in neighborhoods without any recent sales, from prices in adjacent neighborhoods with sales, and again it's easy to code up an algorithm to fill the gaps or to run a geostatistical analysis that can impute the missing values, but not if all you have is SQL, or if you have to constantly do roundtrips between database and code.
I mention all this not to start an argument, but simply because when I first started doing GIS work, I was very confused about what the right tools and workflow were, and once I embraced projections (vs. working directly with spheroids) and in-memory analysis in Python, my productivity went way up. If other people find themselves in the same scenario, they owe it to themselves to try out both approaches to see what works best for them.
Mostly yes, though you can wing it a little bit by relying on swap and/or doing any big filtering operations up front. In my work, it's never been a problem, because you can cram a lot of data into 16 GB, millions of big polygons and metadata. Note also that textual formats like GeoJSON or WKT are incredibly wasteful of space because all coordinates are encoded as characters instead of floats or integers, whereas the in-memory representation is much smaller, so even huge source files are likely to fit in memory just fine. But judging by the sibling comments it does look like it's a limitation for some.
It's good software (I've used it more than a decade), however I found GEOS to be a sticking point. When using it on very large polygons, e.g. 10k to 1 million vertices, memory leaks are not uncommon and performance drops off considerably. Debugging SQL -> C -> C++ is not fun and hacking C++ geometry code when it's not part of your normal work is nigh on impossible. I've found the ESRI geometry API for Java to be by far the best geometry API out there. Harder to use initially and obviously JVM specific but faster and more reliable. It's a very good fit for Hadoop / Spark or other JVM applications. Ignore the brand name, I'm not affiliated and it's FOSS with an Apache license.
https://getgeodb.com is a hosted CartoDB, which is an open-source stack built on PostGIS but notoriously hard to run and configure. So we decided to run it as a service for others... We find it very useful for building apps on top of PostGIS, since running them requires configuring middleware like tile services and data APIs, and importing data.
I've toyed with PostGIS to make a poster from OSM data by selecting streets that intersect with one of Tokyo's special wards[0]. Sadly I haven't found any excuses to play with it more since.
I am writing an open-source line simplification algorithm[1] with PostGIS. So far it's a few hundred lines of SQL-like code[2], but it will grow.
I have never written anything serious in plpgsql, so this is my first non-trivial project. Writing in PostGIS is a strange mix of SQL (fully declarative) and a "real" programming language with variables, arrays, functions and loops. The mix is very interesting and requires getting used to. But doable.
Things I like so far:
- geometric functions are very robust and well documented. Reference manual[3] is amazing. Creating/editing/iterating over lines and polygons, detecting their features, is a breeze.
- good integration with QGIS -- I can see the result graphically on my desktop system very easily.
- I can even write tests[4].
Things I am worried about or don't like as much:
- The SQL/procedural mix takes a lot of getting used to, and sometimes things that are "simple" in a "real" programming language become complex here. E.g. I would love to have mutable linked lists. And moving data between "sql-world" (rows) and "plpgsql-world" (arrays and data structures) is non-obvious; the best references are, unfortunately, Stack Overflow and other people's scripts. (See the sketch after this list.)
- My program will run on a large data set (e.g. all rivers in a country or a continent). I will want to optimize it. There is an out-of-tree profiler[5], but "unofficial, out of tree" always adds risk.
- Biggest one: deep copies everywhere. Since plpgsql does not do any memory management, it is deep-copying everything. Sometimes (often in this algorithm!) I want to add a vertex to a line (=river bend); that always requires a full copy of the whole bend. When I know it will not be used and will be safe, I would like to say "I am mutating this geometry and I don't want a deep copy".
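To illustrate the rows/arrays point from the list above, this is roughly the round-trip I mean (a toy function, not from the actual project):

    -- Toy example of moving between "sql-world" (rows) and "plpgsql-world" (arrays):
    -- pull a line's vertices into an array, manipulate them procedurally,
    -- then rebuild a geometry from the array.
    CREATE OR REPLACE FUNCTION drop_every_other_vertex(line geometry)
    RETURNS geometry
    LANGUAGE plpgsql AS $$
    DECLARE
        pts  geometry[];
        kept geometry[] := '{}';
        i    int;
    BEGIN
        -- rows -> array
        SELECT array_agg(geom ORDER BY path[1])
          INTO pts
          FROM ST_DumpPoints(line);

        -- procedural manipulation in plpgsql-world: keep every other vertex,
        -- always keeping the last one so the line stays closed off properly
        FOR i IN 1 .. array_length(pts, 1) LOOP
            IF i % 2 = 1 OR i = array_length(pts, 1) THEN
                kept := kept || pts[i];
            END IF;
        END LOOP;

        -- array -> geometry again
        RETURN ST_MakeLine(kept);
    END;
    $$;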
In general, I like it for algorithms. Though the moment it needs to hit performance-sensitive production, I believe I will re-do it in C (which, looking at the postgis source code, is quite write-able too).
Relatedly: the appendix section "SQL Primer" in Obe and Hsu's 'PostGIS in Action' is my favorite quick intro to SQL and the first stop I give anyone looking to learn about the language.
I helped build a relatively successful SaaS offering for finance atop PostGIS, with MapServer and MapProxy+Redis. The stack ran circles around ArcGIS. Not even a close comparison.