> And that’s the point, rules and complexity have completely unknowable downsides. Downsides like the destruction of the whole project. With each rule and added complexity you make the system less human and less fun. You make it a Computer Scientists rube goldberg machine while sterilizing it of all the joy of life.
While too many rules and too much complexity can certainly be bad, some basic amount of standardization can actually reduce complexity and really doesn't cause the "destruction of the whole project".
As a counterpoint, too much flexibility can also increase complexity. For example, without defined rules, 5.6.2022 can mean 5 June 2022 or 6 May 2022. Neither the user nor a parser can know for sure what it means if no standard is defined. That kind of flexibility certainly isn't fun.
Example from OSM wiki for "Key:source:date":
> There is no standing recommendation as to the date format to be used. However, the international standard ISO 8601 appears to be followed by 9 of the top 10 values for this tag. The ISO 8601 basic date format is YYYY-MM-DD.
https://wiki.openstreetmap.org/wiki/Key:source:date
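To make the ambiguity concrete, here is a tiny Python illustration (the tag value is hypothetical): the same string parses two different ways, while the ISO 8601 form has only one reading.

```python
from datetime import datetime

value = "5.6.2022"  # hypothetical source:date value
as_day_first = datetime.strptime(value, "%d.%m.%Y").date()    # 2022-06-05
as_month_first = datetime.strptime(value, "%m.%d.%Y").date()  # 2022-05-06
print(as_day_first, as_month_first)  # both parses succeed; the data alone can't tell you which is right

iso = datetime.strptime("2022-06-05", "%Y-%m-%d").date()      # only one possible reading
print(iso)
```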
Just define some essential standards. It won't lead to destruction of the project!
And while you are making breaking changes, please fix the 'way' element. Maps are big. Storing the points of a way as 64-bit node ids, while the coordinates in nodes are also 64 bits (32-bit lon and 32-bit lat), just leads to wasted space and wasted processing time. There are billions of these nodes, and nearly all of them have no tags, just coordinates. There is no upside to this level of indirection. And in cases where a point does need tags, that can already be solved with a separate node and a 'relation' element.
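Roughly what I mean, sketched in Python rather than OSM's actual storage encoding (the class and field names here are made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class Node:
    id: int          # 64-bit id
    lat_e7: int      # 32-bit fixed-point latitude  (degrees * 1e7)
    lon_e7: int      # 32-bit fixed-point longitude (degrees * 1e7)

@dataclass
class WayToday:
    id: int
    node_ids: list[int]            # 64 bits per vertex, plus a Node record stored elsewhere

@dataclass
class WayInlined:
    id: int
    coords: list[tuple[int, int]]  # 64 bits per vertex, no extra lookup needed

# Assembling a WayToday's geometry needs an id -> Node lookup over the whole planet file:
def geometry(way: WayToday, nodes: dict[int, Node]) -> list[tuple[int, int]]:
    return [(nodes[i].lat_e7, nodes[i].lon_e7) for i in way.node_ids]

nodes = {7: Node(7, 556761000, 125683000), 8: Node(8, 556762000, 125690000)}
print(geometry(WayToday(42, [7, 8]), nodes))
```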
OSM data format could certainly be improved and it would benefit end users, as better tools/apps could be made more quickly and easily.
The date example is a good one. No one has fun by choosing their own date format. This is putting the burden of choice onto the user. They might like to think about some map stuff and now they have to think about data format stuff.
Of course projects like these have to strike a balance between the strictest bureaucratic nightmare and such a structure so loose that people are overburdened by the available options at every corner.
I think a lot of that complexity can (and should!) live in the tools themselves. Who cares about a date format when the tool that creates it offers a date picker or extracts the correct date from the metadata of an image? The date format in the backend should be fixed and then you should offer flexibility in the frontend for user input.
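A minimal sketch of that split (the accepted input formats are just examples): the frontend can take whatever is convenient, but only ISO 8601 ever reaches the backend.

```python
from datetime import datetime

ACCEPTED_INPUT_FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%d %B %Y"]  # whatever the UI wants to offer

def to_backend_date(user_input: str) -> str:
    """Normalize flexible user input to the single fixed backend format (ISO 8601)."""
    for fmt in ACCEPTED_INPUT_FORMATS:
        try:
            return datetime.strptime(user_input, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {user_input!r}")

print(to_backend_date("5.6.2022"))     # '2022-06-05' (the UI, not the data, decides day-first)
print(to_backend_date("5 June 2022"))  # '2022-06-05'
```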
> The date format in the backend should be fixed and then you should offer flexibility in the frontend for user input.
Agreed. However, it might not be so easy for historical dates, because doing it correctly requires great diligence from the tool developer as well as from the user to choose the correct calendar system. For example:
Q: What is the correct representation of the date of Caesar's death, 15 March 44 BC (Julian), in ISO 8601?
A: -0043-03-13
Why? -- Ancient dates are typically given according to the Julian calendar, which has no year 0, while ISO 8601 uses the proleptic Gregorian calendar, which does have a year 0 (hence -0043 rather than -0044) and on which the same day falls two days earlier in that era (hence the 13th rather than the 15th).
This is a good point in a general sense, but I don't think it would be a problem in this particular case for the date some imported OSM data was sourced, which is similar to the "date accessed" for a website in a bibliography.
> Unix time is the real universal date format surely?
Heh… It can't represent UTC exactly (leap seconds). There's implementation weirdness, like not being able to represent dates outside the range of a 32-bit second counter. And there's the occasional need to store the time zone the timestamp is relevant in.
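For example, the end of the signed 32-bit range:

```python
from datetime import datetime, timezone

# The largest instant a signed 32-bit second counter can hold (the "2038 problem"):
print(datetime.fromtimestamp(2**31 - 1, tz=timezone.utc))  # 2038-01-19 03:14:07+00:00
```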
> It underpins basically everything in datetime database entries.
It does do that.
> The problem is that most people can't read that
This is why I prefer RFC 3339 or ISO, in that order.
> along with the French somehow never being able to convince the world to adopt decimal time.
As much as it is maligned, base 60 is rather convenient. It's easy to take 1/2, 1/3, 1/4, 1/5 and 1/6th of that number. Base 12 has this property too (though lacks the easy 1/5th).
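Spelled out (a quick Python check of the divisors):

```python
for base in (10, 12, 60):
    divisors = [d for d in range(2, base) if base % d == 0]
    print(base, divisors)
# 10 [2, 5]
# 12 [2, 3, 4, 6]
# 60 [2, 3, 4, 5, 6, 10, 12, 15, 20, 30]
```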
Why make a good point and follow it with a bad joke that both has a class of victim and doesn't work as a joke?
Nobody has ever heard of "the French" undertaking such a project. And if it had ever happened, it would most surely have been a particular Academy or the like, not the 65M people caught up in your careless swipe.
I'm actually one of those weird folks who likes decimal time and maybe one day we'll get there given that every other metric system was adopted by engineers worldwide except for time.
Wasn't meant as a slight against the French, completely the opposite; honestly, I think they had the right idea and the rest of the world got it wrong... and that's very much a minority opinion these days. Some things are just ingrained into everyone, I guess.
The proposed improvements would obsolete a bunch of problems such as broken polygons [1], which happen regularly. They would also make processing OSM more accessible, without needing to randomly seek over GBs of node locations just to assemble geometries, which takes a significant percentage of osm2pgsql's runtime.
For me Steve Coast lost his credibility when he joined the closed and proprietary what3words.
There are around 5.1e14 square meters on the surface of the Earth. It takes about 49 bits to address each square meter uniquely. If we use one of EFF's diceware-style short word lists (6^4 = 1296 words), we need 5 words to describe any point on earth with 1-meter precision.
If we use a projection like, say, S2 (though plenty of other options exist), these 5-word locators will show strong hierarchical locality. In any specific area, for example, there are likely only 3 distinct top-level words. Likewise, the last word is useful but probably unnecessary precision for "find the building" day-to-day use. So the middle 3 words will be sufficient to be unambiguous in most cases, and if people used this system they'd naturally become familiar with the phrases typical to their locale.
All of this can be done with an algorithm a freshman CS student can understand, with a trivial amount of reference data. It can run on any mobile device made in the last 15 years without an internet connection.
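Something like this sketch: it assumes a placeholder word list and uses a plain lat/lon grid instead of S2, so it loses the nice locality properties described above, but it shows the encoding idea.

```python
# 6^4 = 1296 placeholder words; a real implementation would ship an EFF-style short word list.
WORDS = [f"w{i:04d}" for i in range(6**4)]

CELLS_LAT = 180 * 111_320   # ~1 m rows (about 111,320 m per degree of latitude)
CELLS_LON = 360 * 111_320   # ~1 m columns at the equator

def encode(lat: float, lon: float, n_words: int = 5) -> list[str]:
    row = int((lat + 90.0) / 180.0 * (CELLS_LAT - 1))
    col = int((lon + 180.0) / 360.0 * (CELLS_LON - 1))
    index = row * CELLS_LON + col        # unique ~1 m cell id; 1296^5 comfortably covers all cells
    digits = []
    for _ in range(n_words):             # write the cell id in base 1296
        index, digit = divmod(index, len(WORDS))
        digits.append(WORDS[digit])
    return list(reversed(digits))        # most significant (coarsest) word first

print(encode(55.6761, 12.5683))          # e.g. Copenhagen
```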
I designed a scheme like this for fun years ago, just because it was a natural outflow of some stuff I was doing with dicewords for default credentials in a consulting context, and I just find spatial subdivision structures neat.
It's hard to interpret what3words' scheme as anything but craven rent-seeking. They want to keep the mapping obscure, and fundamentally sacrifice usability in the interest of this. That what3words markets this specifically as a solution for low-income nations, and dupes NGOs that are not tech-savvy in the service of this, is utterly #$%@$#ing revolting.
Imagine trying to rent-seek by selling poor people their own street addresses, if you'll let me be slightly hyperbolic.
There is no reason a scheme like this can't simply be a standard from some appropriate body, and a few open source reference implementations.
This comment thread is the first time I've heard of w3w. It hurts my brain trying to come up with a reasoning for how such a concept is not some kind of parody one-off project intended to be posted on HN or Reddit for the lolz. Instead, it is actually being used by emergency services?
> There is no reason a scheme like this can't simply be a standard from some appropriate body, and a few open source reference implementations.
Yet no-one did this and I think that's the point here.
The world is full of rent-seeking in the form of stuff that is dead simple to do but that no one does without a financial incentive.
In w3w the hard part is not the system itself, but getting people to use it, which must be done because the value of the system comes from the network effect.
But you don't need to complicate the storage format to fix a problem like that. You can build validation tools that will check whether the stored data conforms to the correct specified geometry, and only emit valid polygons to later tools in the pipeline when they do.
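A minimal sketch of that kind of filter, assuming the shapely library is acceptable in the pipeline:

```python
from shapely.geometry import Polygon

def valid_polygons(rings):
    """Yield only polygons whose ring forms valid geometry; route the rest to QA instead."""
    for ring in rings:
        poly = Polygon(ring)
        if poly.is_valid:
            yield poly

candidates = [
    [(0, 0), (4, 0), (4, 4), (0, 4)],  # simple square: valid
    [(0, 0), (4, 4), (4, 0), (0, 4)],  # self-intersecting "bowtie": invalid, filtered out here
]
print([list(p.exterior.coords) for p in valid_polygons(candidates)])
```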
"Be liberal in what you accept and strict in what you send" is still a good principle. The problem with rejecting invalid structures at the data storage format instead of a later validation step is that it hurts flexibility and extensibility. If later on you need a different type of polygon that would be rejected by the specification, you'll need to create a new version of the file format and update all tools reading it even if they won't handle the new type, instead of just having old tools silently ignoring the new format that they don't understand.
> "Be liberal in what you accept and strict in what you send" is still a good principle.
No, it is a terrible principle which produces brittle software and impossible-to-implement standards. The problem is that no one actually follows the “be strict in what you send” part, and just goes with whatever cobbled-together mess the other existing software seems to accept. Before long, a spec-compliant implementation can’t actually understand any of the messages that are being sent.
> just having old tools silently ignoring the new format that they don't understand.
This sounds like another headache. I don’t want my tools silently breaking.
> This sounds like another headache. I don’t want my tools silently breaking.
Yet here you are, posting your comment through a web browser on a web page. And the new standard that was intended to make web pages display catastrophic failure and stop processing with each error (XHTML) was never widely adopted. Makes you wonder why? Maybe the nature of an open data platform for human consumption has something inherent to it so that it's better to accept a certain degree of inaccuracy and inconsistencies in its stored data?
> Maybe the nature of an open data platform for human consumption has something inherent to it so that it's better to accept a certain degree of inaccuracy and inconsistencies in its stored data?
That's complete nonsense. The only reason that web browsers accept malformed webpages is because there were already orders of magnitude too many webpages that violated the relevant specs when XHTML was introduced. If web browsers had enforced XHTML from the start, then everyone would have damn well followed it.
> You can build validation tools that will check whether the stored data conforms to the correct specified geometry, and only emit valid polygons to later tools in the pipeline when they do.
It doesn't help at all when the problem is that important areas have disappeared.
It also doesn't help other mappers, or a confused newbie.
Why do you need to change the data format to make it faster (at the cost of making it harder for end users to work with)? The data is the same as it was at the beginning, so it doesn't justify a technical redesign. Why not just create accelerators based on an intermediate format?
Properly normalized data isn't just faster, it's also easier to work with for the end user. There are far fewer exceptions, edge cases and snafus to work around and test for. If you're talking about the transition period between formats, well yeah, you're gonna see things breaking. But these were already broken, just not in apparent ways. In the end, everybody wins.
> Properly normalized data isn't just faster, it's also easier to work with for the end user.
Extracting and reusing data, yes. Getting it into the tool in the first place, no way. Tools that won't even allow you to save your data and make it persistent until you conform to every single integrity requirement are a nightmare for end users.
People doing mapping tasks will use an editor and not really see the change.
People consuming the data will also mostly use tools, tools that likely run much faster.
I've written some code to chop up overlapping gis areas into ways and relations (to match the current data model of references to shared nodes). The input to that code is pretty close to the proposed data model, so not going to be more difficult to do that processing (as an example of a task that doesn't just use 3rd party tools).
The best thing is not to allow invalid geometries to begin with. Any validation would need to be done in an off-line fashion for a number of reasons (such as needing to retrieve any referred OSM elements), and by that time you can't automatically revert offending changes as any revert carries a chance of an object version conflict.
> The best thing is not to allow invalid geometries to begin with.
The best thing for whom? The developer? Certainly not for the end user, who needs to have invalid geometries while the drawing is being made and the data is still incomplete. Having a file format that won't admit that temporary state means that either the user can't save incomplete draft work, or that an entirely different format will be needed to represent such in-process work.
The article is rightfully criticizing that such an incomplete way of thinking, which doesn't take into account the full picture or the systemic effects of a change, is pushed forward only because it seems like "the right thing" from an incomplete understanding of all the concerns and needs of all stakeholders.
The right technical decision *must* include them to be correct, and the best design might involve a solution other than "update the file format so that it doesn't accept inconsistent geometry (according to the set of rules that we understand as of today)". But to assess what the right decision is, you need to know how people are using the system in real use cases, beyond classic comp-sci concerns of data storage and model consistency; and to learn those, you need to talk to end users and perform field research to inform your decisions and designs.
> Having a file format that won't admit that temporary state means that either the user can't save incomplete draft work, or that an entirely different format will be needed to represent such in-process work.
Saving such temporary state is very rarely needed in OSM and should never be uploaded to the OSM database.
In addition, in almost all cases it can simply be saved as an area whose shape does not yet match the intended one.
> Saving such temporary state is very rarely needed in OSM and should never be uploaded to the OSM database.
Maybe, but you're missing the other use case - that in the future you'll need an extension requiring geometries that are considered invalid by the current set of rules, forcing you to update all tools processing the file format to accommodate the new extension.
Keeping storage and validation as two separate steps is a more flexible design, preferable on platforms where data is entered by a large number of users in a complex domain that is not easy to model unambiguously.
Think of Wikipedia and what would have happened if its text format had only supported grammatically correct expressions without spelling mistakes, and hadn't let you save templates containing errors. The project would never have attracted the volume of editors it took to create the initial version with millions of articles, and the product would never have taken off. In an open project with data provided by the general public, keeping user data validation in the same layer as the automatic processing model is a design mistake.
I think that no one serious proposes including rules like
> You could have rules that say you can’t link Finland to Barbados.
in the data model. That is a red herring.
But rules like "area must be a valid area" are a good idea, in the same way as Wikipedia is requiring article code to be a text and is not allowing saving binary data there.
> Maybe, but you're missing the other use case - that in the future you'll need an extension requiring geometries that are considered invalid by the current set of rules, forcing you to update all tools processing the file format to accommodate the new extension.
I think the way to go is to define several layers of correctness. A data set might then be partially valid. In such cases a tool might, for example, support transitions from a completely valid state A to a completely valid state C via an intermediate, partially valid state B. (As databases with referential integrity may allow intermediate states within a transaction where referential integrity is broken.)
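A rough sketch of that in Python (the level names and checks are made up, and real rules would be richer):

```python
from enum import IntEnum

class Validity(IntEnum):
    BROKEN = 0       # references that don't resolve
    STRUCTURAL = 1   # resolves, but geometry rules not yet satisfied (a legitimate draft state)
    SEMANTIC = 2     # fully valid area

def validity(ring_ids, nodes) -> Validity:
    if any(i not in nodes for i in ring_ids):
        return Validity.BROKEN
    if len(ring_ids) < 4 or ring_ids[0] != ring_ids[-1]:
        return Validity.STRUCTURAL
    return Validity.SEMANTIC

nodes = {1: (0.0, 0.0), 2: (0.0, 1.0), 3: (1.0, 1.0)}
print(validity([1, 2, 3], nodes))      # STRUCTURAL: ring not closed yet (state B)
print(validity([1, 2, 3, 1], nodes))   # SEMANTIC: closed ring (state A or C)
print(validity([1, 2, 99, 1], nodes))  # BROKEN: node 99 missing
```

An editor could then save anything at STRUCTURAL or better as a draft, while the shared database only accepts SEMANTIC objects.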
> I think the way to go is to define several layers of correctness. A data set might then be partially valid
Thanks, that summarizes what I was aiming for. An open platform will be more flexible and allow for more varied use cases the fewer assumptions it makes about how it should be used.
> Maybe, but you're missing the other use case - that in the future you'll need an extension requiring geometries that are considered invalid by the current set of rules, forcing you to update all tools processing the file format to accommodate the new extension.
As someone else in this subtree mentioned, apparently this flexibility wasn't needed for the last 20 years.
I don't have horribly strong opinions here, but the argument feels circular to me:
- The format should be kept simple to encourage more people to build tools on top of it, and users will be more likely to work with it.
- We should deal with the emergent complexity of bad validation by making tools more complicated and having them detect errors on their end.
If users are going to use a validation tool to work with data, then they can also use a helper tool to generate data. And if the goal is to make it easier to build on top of data, import it, etc... allowing developers to do less work validating everything makes it easier for them to build things.
I'm going over the various threads on this page, and half of the critics here are saying that user data should be user facing, and the other half are saying that separate tools/validators should be used when submitting data. I don't know how to reconcile those two ideas; particularly a few comments that I'm seeing that validation should be primarily clientside embedded in tools.
Again, no strong opinions, and I'll freely admit I'm not familiar enough with OSM's data model to really have an opinion on whether simplification is necessary. But one of the good things about user facing data should be that you can confidently manipulate it without requiring a validator. If you need a validator, then why not also just use a tool to generate/translate the data?
To me, "just use a tool" doesn't seem like a convincing argument for making a data structure more error prone, at least not if the idea is that people should be able to work directly with that data structure.
----
> you'll need to create a new version of the file format and update all tools reading it even if they won't handle the new type, instead of just having old tools silently ignoring the new format that they don't understand.
Again, I'm not sure that I understand the full scope of the problem here, and I'm not trying to make a strong claim, but extensible, backwards-compatible file formats exist. And again, I don't really see how validation solves this problem; you're just as likely to end up with a validator in your pipeline that rejects extensions as invalid, or a renderer that doesn't know how to handle a data extension that used to be invalid or impossible.
Wouldn't it be nicer to have a clear definition of what's possible, one that everyone is aware of and can reason about without inspecting the entire validation stack? Wouldn't it be nice to not finish a big mapping project and only then find out that it has errors when you submit it? Or to know that if your viewer supports vWhatever of the spec, it is guaranteed to actually work, and won't fall over when it encounters a novel extension to the data format that it doesn't understand or didn't think was possible? Personally, I'd rather be able to know right off the bat what a program supports rather than have to intuit it by seeing how it behaves and looking around for missing data.
Part of what's nice about doing extensions explicitly, rather than implicitly through assumptions about data shape, is that it's easier to identify what is and isn't an extension.
> If users are going to use a validation tool to work with data, then they can also use a helper tool to generate data. And if the goal is to make it easier to build on top of data, import it, etc... allowing developers to do less work validating everything makes it easier for them to build things.
That's good thinking for cases where you have a single toolset, in which tools can be kept in sync to collaborate with one another.
But in an open, distributed data platform, where several possibly incompatible toolsets will be used, forcing a type of validation on the data itself based on the expectations of one group of tools can make some other applications impossible. In these cases, making the data format simple will make it easier for developers to build new tools, and the difficulties of synchronizing different tools may be dealt with in a different layer.
> That's good thinking for cases where you have a single toolset, in which tools can be kept in sync to collaborate with one another.
This is interesting. I would actually kind of argue the exact opposite, that more rigorously defined formats are more important the more diverse your toolsets get, and less important the less diverse they are.
The whole point of having a rigorously defined data format that blocks certain validation errors at the data level is that it's easier for diverse toolsets to work with that data, because they don't need to all implement their own validators, and they don't need to worry as much about other tools accidentally sending them malformed/broken data.
> making the data format simple
I think where we might be disagreeing is that I argue more specific data formats that inherently block validation errors are simpler than vague formats where restrictions and possible errors still exist, but aren't clearly documented and aren't obvious until after you try to import the data.
I would point to something like the Matrix specification -- they have put comparatively more work into making sure that the Matrix specification (while flexible) is consistent, they don't want clients randomly making a bunch of changes or assumptions about the data format. That's partially inspired by looking back at standards like Jabber and seeing that having a lack of consensus about data formats caused tools to become extremely fragmented and hard to coordinate with each other. See https://news.ycombinator.com/item?id=17064616 for more information on that.
My feeling is that when you introduce validation layers, you have not actually gotten rid of restrictions between user applications, and you have not actually made coordination simpler, because different tools are going to break when they see pieces of data that they consider invalid or that they didn't realize they needed to be able to handle. All that's really happened is that complexity has been moved into the individual applications and that logic has been duplicated across a bunch of different apps.
In contrast, when every single tool is speaking the same language and agrees on what is and isn't valid data, it's very fast to build tools that you know will be compatible with everything else in the ecosystem.
I'm thinking of Markdown as an example of a format with loose validation rules and a low entry barrier.
Sure, having several slightly incompatible versions with different degrees of completeness is a pain in the ass for rendering it. But insisting on a single format (such as: titles can only be made with '#', not '-----'; tables can only be '|--'; list items can only be '-', not '*'; etc.) and rejecting any other user input as invalid would be way worse in terms of its purpose as an easy-to-learn, easy-to-read text-only format.
:) This is a really interesting conversation, because we keep aligning on some things and then reaching opposite conclusions.
I agree that Markdown has loose validation rules and a low entry barrier for writing, and having a low entry barrier for writing is nice, and I do think it's a good example, but just in the opposite direction. I think that Markdown's inconsistent implementations are one of the format's greatest weaknesses and have made the ecosystem harder to work with than necessary.
I generally feel like when I'm working with Markdown I can only rely on the lowest common denominator syntax being supported, and everything else I need to look up documentation for the specific platform/tool I'm using. It's cool that Markdown can be extended, but in practice I've found that Markdown extensions might as well be program-specific syntaxes, since I can't rely on the extension working anywhere else.
Markdown is saved a little bit by virtue of not actually needing to be rendered at all in order to be readable, so in some cases I've taken to treating Markdown as a format that should never be parsed/formatted in the first place and just treated like any other text file. But I'm not sure that philosophy works with mapping software; I think those formats need to be parsed sometimes.
This might get back a little bit to a disagreement over what simplicity means. Markdown is simple to write, but not simple to write in a way where you know it'll be compatible with every tool. It's simple to parse if you don't worry about compatibility with the rest of the ecosystem, but if you're trying to be robust about handling different variants/implementations, then it becomes a lot more complicated.
> I agree that Markdown has loose validation rules and a low entry barrier for writing, and having a low entry barrier for writing is nice, and I do think it's a good example, but just in the opposite direction. I think that Markdown's inconsistent implementations are one of the format's greatest weaknesses and have made the ecosystem harder to work with than necessary.
Maybe, but they're also what make it worthwhile and made its widespread adoption possible to begin with.
> I generally feel like when I'm working with Markdown I can only rely on the lowest common denominator syntax being supported, and everything else I need to look up documentation for the specific platform/tool I'm using. It's cool that Markdown can be extended, but in practice I've found that Markdown extensions might as well be program-specific syntaxes, since I can't rely on the extension working anywhere else.
I do not see that as an essential problem limiting its value. It would be if you wanted to use Markdown as a universal content representation platform, but if you wanted that you would be using another more complex format, like asciidoc. Creating your own local ecosystem is to be expected with a tool of this nature, and is only possible because there wasn't a designer putting unwanted features in there that you don't need but prevent you from getting what you want to achieve with the format.
> This might get back a little bit to a disagreement over what simplicity means. Markdown is simple to write, but not simple to write in a way where you know it'll be compatible with every tool.
This may be the origin of the disagreement. You're thinking of information that should be compatible with every tool, but that's not the kind of information system I'm talking about. Open data systems may have a common core, but it's to be expected that different people will use them in different ways, for different purposes and different needs. This means that not everyone will use the same tools with it. OSM data has that same nature, as an open data platform that could be reused in widely different contexts and tools.
Think programs written in C. It's nice that you can compile simple C programs with any C compiler, but you wouldn't expect this to be possible for every program on every platform; the possibilities of programming software are just too wide and diverse, so you need to adapt your particular C program to the quirks of your specific compiler and development platform. Insisting that everybody uses exactly the same restrictive version of the language would only impede or hinder some of the uses that people have for it.
I think it's worthwhile to have efforts to converge implementations toward an agreed, simplified standard, but they should work in an organic, evolutionary way, rather than by imposing a new design that replaces the old. Following the C example, you can build the C99, C11, C17 standards, but you wouldn't declare previous programs obsolete when the standard is published; instead, you would make sure that old programs are still compatible with the new standard, and only deprecate unwanted features slowly and with a long response time, "herding" the community into the new way of working. This way, if the design decisions turn out to be based on wrong or incomplete assumptions, there's ample opportunity to rethink them and reorient the design.
> You're thinking of information that should be compatible with every tool, but that's not the kind of information system I'm talking about.
You're right, I am thinking of that. However, that's what OSM is, isn't it? It's more than a common format that stays localized to each device/program and varies between each one; it's a common database that everyone pulls from. We do want all of the data in the OSM database to be compatible with every tool that reads from it. And we want all of the data submitted to the OSM database to work with every single compliant program that might pull from it.
Outside of the OSM database, we want a common definition of map features where we know that generating data in this format will allow it to be read by any program that conforms to the standard. It's the same way as how when we save a JPEG image, ideally we want it to open and display the same image in every single viewer that correctly supports the JPEG standard. We don't want different viewers to have arbitrarily different standards or variations on what is and isn't a valid JPEG file, we want common consensus on how to make a valid image.
I agree that what you are saying would be true for information that doesn't need to be compatible with every tool. I don't understand why you're putting OSM into that category, as far as I can tell OSM is entirely about sharing data in a universally consumable way.
> Insisting that everybody uses exactly the same restrictive version of the language would only impede or hinder some of the uses that people have for it.
Isn't this part of the reason why the Web has started devouring native platforms? Write once, run anywhere on any device or OS. And even on the Web, incompatibilities between different web platforms and the need for progressive enhancement is something that we live with because we don't have an alternative. We still pretty rigorously define how browsers are supposed to act and interpret JS. A big part of the success of JS is that within reason, you can write your code once and it will work in every modern browser, and browser deviations from the JS spec are (or rather, should be) treated as bugs in the browser.
Even taking it a step further, isn't a huge part of the buzz about WASM the ability to have a universal VM that can be targeted by any language and then run on both the Web and in native interpreters in a predictable way? A lot of excitement I see around WASM is that it is more rigorously defined than JS is, and that it is trying to be something close to a universal runtime.
> Following the C example, you can build the C99, C11, C17 standards, but you wouldn't declare previous programs obsolete when the standard is published; instead, you would make sure that old programs are still compatible with the new standard, and only deprecate unwanted features slowly and with a long response time
I sort of see what you're saying at the start of this sentence, but the second part throws me off. Most specs that iterate or develop over time break compatibility with old standards; Python 2 code won't run on a Python 3 interpreter. It's pretty common for programs to need to be altered and rebuilt as newer versions of the language come out and as they're hooked into newer APIs.
Situations like the Web (where we try to maintain universal backwards compatibility even as the API grows) are really the exception to the rule, and while I do think specifically in the case of the Web it's good that we force backwards compatibility, holding to that standard comes with significant additional difficulties and downsides that we have to constantly mitigate.
And I still don't understand what this has to do with standardizing the format for data that is explicitly designed to be shared and generated among a lot of different programs. This isn't a situation where we want each program to have a slightly different view of what valid OSM data is, because we want them all to be compatible with a central database of information, and we want them to submit data to that database that is compatible with every other program that pulls from it.
Of course, for situations where that isn't required, where software isn't working with map data with the purpose of submitting it back up to the OSM project, they're welcome to keep using the old format; nobody can force them to use the new one. Those programs won't be compatible with as many things, but if I'm understanding correctly, you're saying it's OK for the ecosystem to be a little fractured in that way and for some programs to be incompatible with each other? And if that's the case, I still don't see what the problem is.
For programs that you don't think need to be universally compatible with other programs, use the old format. When submitting to a database that is designed to be a universal repository of map data that anyone can pull from, use the new format to maximize compatibility. Unless I'm missing something else, that seems like it solves both problems?
In the first part of the article I was thinking, oh, maybe Steve Coast isn't such a jerk after all.
Then I got to the meat of it. Oh dear.
As one of the many many people who has had to deal with OSM data, I curse people with this attitude that the mess is somehow desirable or necessary. It's not. There is a long spectrum between totally free form and completely constrained, and OSM's data model is painfully down the wrong end, and causes enormous harm to all kinds of potential reuses of the data.
It also causes harm to the people creating data. Try adding bike paths and figuring out what tags are appropriate in your area. Try working out how to tag different kinds of parks, or which sorts of administrative boundaries should be added or how they should be maintained. It puts many people off, me included.
For a crowd sourced dataset, a strict ontology anyway wouldn't work. Instead of messy tag definitions you'd have tag use that didn't align with the definitions.
I don't mean that as an argument against improving the tagging!
The biggest friction point is probably that people resist rationalization of tagging schemes that have demonstrated themselves to be problematic.
The tagging system in the iD editor tries to address the issue, supporting search terms and suggesting related tags and so on.
The article is more about the underlying storage of the geometries (I don't think there is the same level of interest in changing the basic approach to tagging/categorization).
> For a crowd sourced dataset, a strict ontology anyway wouldn't work. Instead of messy tag definitions you'd have tag use that didn't align with the definitions.
I don't really agree. Wikipedia has clearly defined categories (with a very deep hierarchy), and they make it work. There is constant effort in recategorising, but there are tools to support it. No one is saying "oh, just tag stuff however the hell you feel like it".
> The tagging system in the iD editor tries to address the issue, supporting search terms and suggesting related tags and so on.
It also introduces problems in that it takes a particular interpretation of each tag which isn't necessarily right everywhere. Like calling highway=track an "unmaintained track road", where this concept of "unmaintained" doesn't come from the wiki anywhere - they just seem to have invented it.
> The article is more about the underlying storage of the geometries (I don't think there is the same level of interest in changing the basic approach to tagging/categorization).
There are big issues in this too. Particularly the lack of distinction between line and polygon features, which the consumer is supposed to infer from the tags (building=yes is a polygon; highway=pedestrian is a linestring even when closed, e.g. a circular walking path), with area=yes used in the most ambiguous cases. Plus all the mess with relations, super-relations etc.
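For a taste of what consumers end up writing, here is a drastically simplified version of that heuristic (the real rules, e.g. osm2pgsql's style configuration, are much longer):

```python
def is_area(tags: dict) -> bool:
    """Guess whether a closed way should be treated as a polygon or a linestring."""
    if tags.get("area") == "yes":
        return True
    if tags.get("area") == "no":
        return False
    if "building" in tags or "landuse" in tags:
        return True
    if "highway" in tags or "barrier" in tags:
        return False  # a closed highway/barrier is usually still a line
    return False      # default guess; plenty of tags remain genuinely ambiguous

print(is_area({"building": "yes"}))                       # True  -> polygon
print(is_area({"highway": "pedestrian"}))                 # False -> closed linestring
print(is_area({"highway": "pedestrian", "area": "yes"}))  # True  -> pedestrian plaza
```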
> The Engineering Working Group (EWG) of the OSMF has “commissioned” (I think that’s OSMF language for paid) a longstanding proponent of rules and complexity to, uh, investigate how to add rules and complexity to OSM.
> [...]
> Let us pray that the EWG is just throwing Jochen a bone to go play in the corner and stop annoying the grownups.
> The harder you make it for them to edit, the less volunteers you’ll get.
And that is why a dedicated area type (rather than representing areas with lines or special relations[0]) could help new mappers and new users of the data.
There would be very significant transition costs, but maybe it would be beneficial overall.
It is possible to have objects that are both an area and a line at once, or an area according to one tool/map/editor and a line according to another.
And many multipolygon relations are in an inconsistent state and require manual fixup.
Also, the complexity of the entire area baggage makes explaining things to newbies harder. You can either try to hide the complexity (as the iD in-browser editor does), leaving people hopelessly confused when things get complex, or present the full complexity (JOSM), causing people to be overwhelmed.
1000 times yes! I am a spatial data expert but only a some-time OSM editor and I still have yet to figure out how to create a polygonal feature more complex than a single building footprint. The theoretical advantage of a unified topology model of just nodes/edges where polygons and lines share core geometry is nullified by cultural rules that say "don't do that" to editors (I had a bunch of parks that shared a boundary with a road reverted with nasty notes). The current setup is not just hard for processors, it's hard for non-experts to understand and therefore a higher barrier than a simple polygon model would be.
> how to create a polygonal feature more complex than a single building footprint
In iD (the default editor) you can select an area and an area inside it, or two disjoint areas, then right-click on them and choose "Merge", or press "c" while the areas are selected to combine them.
In JOSM there is an equivalent "create multipolygon" (or "update multipolygon").
FYI, that is because a highway=* road line represents the centerline of the carriageway - and unless the park somehow ends in the middle of the road and includes half of its surface, it would not be correct.
I know it’s nothing to do with the main thrust of the article, but the author fundamentally misrepresents KYC. Know-your-customer is a facet of anti-money laundering and anti-corruption regulation. It has nothing to do with talking to users.
Maybe, but not likely. The quoted text fits the common definition of the term:
> The answer, as any product owner will tell you, is to get close to the customer. To talk to them. To understand them. To feel their pain. The Big Short:
> Deutsche Bank had a program it called KYC (Know Your Customer), which, while it didn't involve anything so radical as actually knowing their customers, did require them to meet their customers, in person, at least once.
I do not consider it defining, and it is definitely neither desirable nor something that improves one's standing.
For reference: I am extremely active in the OSM community. On channels that I moderate, this would result in the user being warned/kicked (but not banned, except in case of repeated insults).
I’ve started playing with data from OpenStreetMap. It started with me trying to fetch all the places where I could get water when moving around Copenhagen, which turned out not to be as easy as first envisioned, because OSM seems to have a lot of different ways to categorise available water. Which makes sense: OSM and the tagging system aren't there to support only my use case, and my idea doesn't map 1:1 onto the model.
It's a tough problem to map out the world and describe it, especially when everyone can add or modify the data, but anything that could improve the experience of importing like osm2pgsql would be welcome.
I don't understand how this doesn't fit your use case. The tags are for different things, e.g.
> for places where you can get larger amounts of "drinking water" for filling a fresh water holding tank, such as found on caravans, RVs and boats
versus
> a man-made construction providing access to water, supplied by centralized water distribution system (unlike in case of man_made=water_well [...]). The tag man_made=water_tap is used for publicly usable water taps, such as those in the cities and graveyards. Water taps may provide potable and technical water, which can be specified with drinking_water=yes and drinking_water=no.
And another tag for when you're not mapping a separate water point, but indicating whether a given feature has drinking water (for example a well or mountain hut).
You're saying that it's tough when anyone can mess with the data rather than working in a structured way, but these tags have distinct definitions and seem perfectly sensible to me (there are much worse examples, like highway=track, which spawned huge discussions in various places within the community). How do these tags not match your use case of selecting the tags you need and displaying them the way you want (e.g. as a list or a map)?
When features are sometimes tagged specifically and other times tagged more generically, it is impossible to get complete, accurate results. You either have to filter on the more specific tag (leaving out valid features) or include the generically tagged features (pulling in features that should not be there).
If only we paid people to map everything to the level of detail you specifically need. I've never seen a public faucet like what you describe, only in American movies set in high schools (so those aren't public, and most mappers won't know they're there - or, even if they did, you wouldn't be allowed to use it as a non-student). I'm not sure the tagging scheme is the problem here: even if you enforced using the right tag by giving everyone mandatory training and exams so they tag everything perfectly, you wouldn't necessarily get this level of detail, at least not without doubling the number of contributors compared to today. (And that's from a German/Dutch perspective, countries which are already pretty well mapped. In Belgium you'd probably need to quadruple the force or more.)
I spent 20 months of my life traveling around Europe and Asia and I found the sources to fill up my camper's water tank mostly using OSM data! It works very well in most areas.
I used the app Maps.me for that (which, by the way, I would not recommend anymore). Maps.me's internal search function is not intuitive, but I found the right keywords to get to drinking water sources.
To your list I would add the search for springs. Especially in mountainous areas you often find usable springs (sometimes pipes coming out of a wall) with drinking water.
The exact same argument that praises OSM's super flexible tagged node data model should also praise MS Excel for the number of things that can be achieved in the world of business with just a grid of boxes.
Both have been hugely successful, and both have the same pile of downsides.
> Both have been hugely successful, and both have the same pile of downsides.
Exactly. And the solution should not be to throw away spreadsheets completely and turn them into relational databases, but to create new tools to alleviate the downsides and reduce their impact (possibly by exporting the spreadsheet information into a relational database, but without taking away the user's option to continue working with it.)
No one is suggesting throwing away OSM's data model completely. The current suggestion is basically "maybe we should think about a point release to properly address an ugly hack we invented in 2007".
What is currently on the table is simply a way to cleanly differentiate between closed ways that are polygons and closed ways that are actually just lines. Example: a roundabout enclosing a park. The problem is that right now this relies on determining it from the tagging. This could well be implemented as a flag on the existing way type and not as an actual new datatype.
There is at this stage no intention to revamp the way we model areas that are more complex than the single polygons above, that is, with multi-polygon relations.
The more controversial topic is giving OSM way objects partially or fully their own geometry.
The former would have, for all practical purposes, no noticeable effect on contributors beyond geometry changes always creating new versions of ways, contrary to the current behaviour, which can be somewhat puzzling for newbies.
The latter would be quite drastic, but would provide more benefits for at least some kinds of processing (for others not, as topology would then have to be inferred).
In any case, 90% of the discussion on this topic fretting about tagging is completely misplaced, as literally nobody is even remotely considering changing that.
> but without taking away the user's option to continue working with it
You keep bringing this up, but I still don't understand what about this change would prevent people from working with intermediary/local formats, or why tools working with intermediary/local formats would be harder than building validators at every step of the submission process?
How is this change taking away anybody's ability to do anything with local data on their device? And if the point is that they should be able to submit that data, then validators will be just as much of a problem for them as a file format will be.
Lots of programs work with their own temp formats locally that are specific to their needs. I mean, you don't need to even make a new one, if you like the existing format so much, save temp changes to it, and publish finished changes to the new format.
What am I missing here, why is any of this a problem?
The entire article can be summed up as: “OSM stores maps as graphs, in flat files where each line is either a node, an ordered list of nodes, or metadata. The graph nodes can be arbitrarily ordered in OSM files, which leads to computational complexity when parsing them. This is not a bad thing, since it means that the spec for OSM files can be extremely simple, which makes it easy for people to contribute to OSM. Other mapping formats optimized for parsing speed require a lot of irrelevant fluff that makes them much harder to understand by human contributors.”
Ironically, 95% of this article is irrelevant fluff that does not make it any easier for the reader to understand.
> OSM stores maps as graphs, in flat files where each line is either a node, an ordered list of nodes, or metadata. The graph nodes can be arbitrarily ordered in OSM files, which leads to computational complexity when parsing them. This is not a bad thing, since it means that the spec for OSM files can be extremely simple, which makes it easy for people to contribute to OSM.
That's actually a sensible design. Treat user-facing stored data as user interface. If you need efficient processing of that data, such as fast parsing, you can always build it elsewhere, such as by caching that data into an intermediate structure that is recompiled whenever the user data changes.
Wait, the proposed solution to a data format being slow to parse is to work around the bad performance by caching the already parsed representation? That seems like it has a clear flaw if you’re only accessing the data once…
Accessing any given data once. When you have a total dataset size in the 10-100s of gigabytes range, having to download any significant fraction of it to do data processing is really unfortunate.
But seriously, what's up with this total disdain for anyone trying to build applications with OSM data? You don't seem to care whether parsing is near instant or, as other commenters have mentioned, literally a majority of total processing time for certain compute jobs.
Thanks! I read the article, I read the post the article is responding to, I read all the comments and still I had no real idea what it all was about until I read your comment.
It could be an example of an author assuming a general audience already knows the insider information, but then I don't know who the target audience really was. This is the kind of thing that probably should have been spelled out in the introduction, with a link to something like this:
The claim that a dataset with billions of users has only dozens of people able/interested in doing data processing on it is a damning admission that the format is too hard to deal with
I maintain some OSM data-mangling code - moderately popular perhaps, but certainly not core - and even that has 840 github stars. I'd take the "dozens" as poetic licence really.
Dozens of open source volunteers who are interested in volunteering their free time to do software development using the format.
In addition to the innumerable developers in Facebook, Apple, and other corporations who are paid to do the data processing and actually bring the data to those billions of users.
The author is ranting a lot about OSMF's recent decisions, gives all kinds of reasons why they will undoubtedly lead to horrible consequences, and grants sage advice about what should have been done instead.
The only thing I'm missing is any indication that OSMF's course actually caused any problems in reality.
What happened to the title? It used to be “In Defense of OpenStreetMap's Data Model”, which is the literal blog post title. Someone has now changed it to the boring-sounding “OpenStreetMap's Data Model”, probably resulting in fewer clicks.
Just for context, here is the current OSM data model in a nutshell:
There are three data types - nodes, ways and relations. All three can have any number of "tags" (i.e. a map<string, string>), which define their semantic meaning. For example, a way with `barrier=fence` is a fence. This stuff is documented in the openstreetmap wiki.
A node is a point with a longitude and a latitude.
A way is a sequence of nodes.
A relation is a collection of any number of nodes, ways or other relations. Each member of this collection can be assigned a "role" (string). Again, the semantics of what each role means are documented in the openstreetmap wiki.
To modify data, new versions of the edited elements are simply uploaded via the API.
---
The most prominent point that stands out here is that only nodes have actual geometry.
This means that...
1. to get the geometry of a way (e.g. a building, a road, a landuse, ...), data users first need to get the locations of all the nodes the way references. For relations, it is even one more step.
2. in order to edit the course of a way, editors actually edit the locations of the nodes of which the way consists, not the way itself. This means (amongst other things) that the VCS history of that way does not contain such changes.
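A minimal sketch of that model with made-up sample data, showing why any geometry has to be assembled through node lookups:

```python
nodes = {
    1: {"lat": 55.6761, "lon": 12.5683, "tags": {}},
    2: {"lat": 55.6762, "lon": 12.5690, "tags": {}},
    3: {"lat": 55.6765, "lon": 12.5695, "tags": {}},
}
ways = {
    10: {"nodes": [1, 2, 3], "tags": {"barrier": "fence"}},
}
relations = {
    20: {"members": [("way", 10, "outer")], "tags": {"type": "multipolygon"}},
}

def way_geometry(way_id):
    """Resolve a way's node references into (lat, lon) pairs (point 1 above)."""
    return [(nodes[n]["lat"], nodes[n]["lon"]) for n in ways[way_id]["nodes"]]

print(way_geometry(10))
# Moving the fence means editing nodes 1-3, so the way's own version history
# never records the geometry change (point 2 above).
```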
I have been looking into improving kerb and traffic_signals data for some bboxen. It is daunting, and I figure I need to work backwards: try to find out what the accessibility map apps look for and use those pairs. If I know what target I am shooting for, I guess it will be alright. This is like my first week looking into this and I hope to find these targets soon.
Add keys to existing nodes mostly. Possibly using tasks.openstreetmap.org and/or possibly doing something in a batch if I can get data from the city to use. These structures seem well defined, thankfully. And the crossings and signal locations look to be complete.
In this case I would strongly encourage starting with manual mapping. The StreetComplete Android app may be useful here (disclaimer: I am involved in making it).
> The harder you make it for them to edit, the less volunteers you’ll get.
I'm not sure I like OSM's obsession with the data model. I guess this has to do with its business model, but then let's not pretend the product manager is tasked with optimizing for end users.
I'd like it to focus on the UI. The easier it is to input a geographic thingie and the easier it is to visualize the geographic thingie, the better OSM is for both users and volunteers.
Two issues to strengthen my point:
1. The osm-tag mailing list regularly discusses how tags are visualized in various renderers when recommending which to choose.
2. Quick mobile-based corrections are nearly impossible with OsmAnd. I'd love to take a picture and write a quick note like "speed limit changed", so that someone (perhaps a bot) can pick it up and update the data. Same with restaurant opening hours. Or various POIs.
Note: the speed limit feature needs to be enabled; it is disabled by default. It is also unavailable in the USA due to the horrific default speed limit system, which requires massive work to support.
Disclaimer: I am one of the people working on StreetComplete
> I'm not sure I like OSM's obsession with the data model.
Given the effort that went into various parts, the fundamental data model has not received any changes for a long time. I would not describe that as an obsession.
> I guess this has to do with its business model
OSM doesn't really have a business model; it is not a business.
It is surprisingly difficult to say which closed ways are areas and which are not. This depends entirely on the tags of the way and is only solved by heuristics.
In addition, it is common to have objects that are both an area and a line at once, or an area according to one tool/map/editor and a line according to another.
And many, many multipolygon relations are in an inconsistent state and require manual fixup.
Also, the complexity of the entire area baggage makes explaining things to newbies harder. You can either try to hide the complexity (as the iD in-browser editor does), leaving people hopelessly confused when things get complex, or present the full complexity (JOSM), causing people to be overwhelmed.
The current format stores locations and references to locations. A line feature, for example, only stores references to locations, so to realize it on a map you have to go through the data, find all the locations it references and build up the actual geometric feature. So people do caching and so on, for sure.
The proposed changes would make that sort of data transformation easier and less resource intensive.
So, come up with an improved format that has/enforces various 'rules', and also provide a conversion program for moving between the new and old formats as desired.
Slowly deprecate the nastiest parts of the old format as people get used to the new one.
There's an amusing paradox I've seen many times amongst GIS managers - their complaints about how dreadful OSM data is are pretty much the direct opposite of their enthusiasm to use it!
Heh, their data model is 99% of the reason why I don't use OSM. It's scattered all over the place with so many tables! It's such a nice project, but damn is it impossible to work with programmatically, let alone poke around in to discover what's all in there.
There are various complaints about the OSM data model, but this is a new one to me. In OSM basically everything is mixed together; there is no real separation into layers.