The idea of a language-independent wikipedia seems to disregard the fact that language and our definitions are shaped by culture quite considerably.
For example, what is the expected result of the query "how many colors are there in a rainbow?" For the majority of European languages, the answer will be 6, but in Russian language/culture, the answer will be 7 (there's additionally light-blue).
Also, how are they going to deal with definitions which vary across jurisdictions, for example, trivia related to disputed territories?
Indeed, this misidentifies language and knowledge as some bizarre occasionalist game:
on the occasion I generate the text "who is the mayor of new york", I should receive the text "antony cumio" (etc.).
For extremely narrow questions with canonical answers in a canonical form (e.g., how many protons are in a gold atom) this is fine. But it works only as a kind of coincidence: for these types of questions it just so happens that their answers can be canonical.
In general, language is a tool for navigating one's (principally social) environment. Absent this environment, mere text /has no meaning/. The question "what contributions did hegel make to philosophy?" has no canonical answer; it depends on a deep network of knowledge, belief, understanding and context which arises from being situated/embedded within some social environment.
The question means something different if asked by a 14-year-old vs. a professional philosopher, and something different again if asked by a scientist or by a janitor.
Learning is a situated process, and language is a situated, embedded phenomenon. Learning isn't acquiring a canonical answer; it is a dialectical process of bridging new meanings to the old ones you have already acquired.
Perhaps so. In any other topic thread it wouldn't have mattered as much, but since we are talking about Abstract Wikipedia, this kind of challenge will come up whenever knowledge is expected to transcend language.
P.S. Also, OP said "antony cumio"; Andrew Cuomo is similar but quite different.
P.P.S. I wouldn't expect context to differentiate, say, between the City of London and London City, but the difference between New York City and New York State is somewhat well known?
Both the city election for mayor (preferential voting) and the state governor (sexual harassment) have been in the news lately.
Or what a city has might change, with small but distinct differences between a mayor and, for example, a city manager. So at one point in time they have a city manager, and after that a mayor: two similar, but distinct, roles...
In the longer term, "semantic drift" might also be a problem. The article keeps referring to the example of the mayor of San Francisco. But the definition of mayor (i.e. what the job involves) changes over time, and the definition of San Francisco (i.e. precise boundaries, legal status) changes over time. Perhaps in this case there haven't been and won't be any major changes during the period of history we care about, but if you define millions of these QIDs there will presumably be a constant flow of awkward language-dependent questions about what they should refer to.
The point is that "mayors of San Francisco" does not have a single answer, and therefore any answer WikiData provides will be glossing over a mountain of nuance and important distinctions in terminology.
The same is true of a million million different individual and aggregated data points all across WikiData.
In the context of Wikipedia articles about the Western US city commonly called "San Francisco", the concept of "mayor" is sufficiently well defined that there is a single answer to it.
The concerns voiced here sound a little like "What if in a hundred years, GPT-255 has subjugated all of humanity into a hive-mind and the terms 'San Francisco' and 'mayor' are meaningless?"
I was using the singular example from the parent comment. If it would convince you more, I will provide others - though I'm sure you could think of some yourself, if the mood took you.
Wow, never knew some people were taught 6 colours in a rainbow! — know it's just an example in your point, but what European cultures/languages use 6 instead of 7?
It's always been the standard abstraction I've heard taught around it.
In Ireland/UK it's something that gets brought up in primary and secondary schools, first when talking about rainbows and then later in discussions on optics/refraction — e.g. you'd get kids to make their own Newton's disc (https://en.wikipedia.org/wiki/Newton_disc) and see colours blending together to white.
That's because there isn't a meaningful distinction between indigo (as a shade) and violet. IIRC, Isaac Newton thought it was important (for biblical numerology reasons) that there be 7 colors.
What you're talking about is called Linguistic Relativity (https://en.wikipedia.org/wiki/Linguistic_relativity), which can be solved by having "facts" vs "opinions" coded into the abstract version. So a rainbow has way more than 7 colors, but humans can only see bands of those colors really. Then depending on language, the answer will change.
So the wikipedia page will (in both languages) present all versions, leading to more cultural exchange, as we can all understand each other better.
"The official Israeli position is X and the official Palestinian version is Y" might be one factual way to state opinions. You can choose exactly how much should be decided as "fact".
Not disagreeing with you so much as taking it a bit further ...
It would be highly controversial to claim that any organisation has a valid mandate to represent the Palestinians. Meanwhile, the Israeli government, in order to avoid angering a significant part of its electorate, cannot offer a firm opinion on certain questions, such as where the borders of Israel are. And everyone's very fond of "strategic ambiguity" and "plausible deniability" and all that stuff.
So your facts end up being something like: News outlet N claimed at time T1 that person P, claiming to represent organisation X, stated at time T2 something that can be accurately translated into language L as: ...
> So your facts end up being something like: News outlet N claimed at time T1 that person P, claiming to represent organisation X, stated at time T2 something that can be accurately translated into language L as: ...
That's good. Wikipedia already discourages "weasel words" and things like that, so adding specificity and information provenance would be a bonus.
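That provenance chain can be made concrete as a data structure. A minimal sketch with invented names — this is not any actual Wikidata schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attribution:
    outlet: str       # news outlet N
    reported_at: str  # time T1

@dataclass(frozen=True)
class Statement:
    speaker: str              # person P
    claimed_affiliation: str  # organisation X, as claimed by P
    spoken_at: str            # time T2
    renderings: dict          # language code L -> translated text
    attribution: Attribution

# Every placeholder from the comment above stays a placeholder here.
claim = Statement(
    speaker="P",
    claimed_affiliation="X",
    spoken_at="T2",
    renderings={"en": "..."},
    attribution=Attribution(outlet="N", reported_at="T1"),
)
```

The point of the structure is that nothing is asserted directly: every "fact" is wrapped in who said it, when, and who reported that they said it.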
Even if you accept the premise that an internet encyclopedia should be written from a neutral point of view, it is clear that some knowledge is contested within cultures, such as whether the population of Israel should include occupied and contested territories, or whether Catalonia is better described as a Spanish autonomous community or its own country. If more language editions relied on Abstract Wikipedia as the central source of truth, then this dominant point of view could replace alternative perspectives. But Vrandečić countered that each volunteer community could decide for itself whether Abstract Wikipedia should be used as a baseline. The Wikimedia Foundation, the nonprofit organization behind Wikipedia, will not mandate that different language versions be forced to use the machine-readable, abstract version. That means that, for example, Hebrew Wikipedia and Arabic Wikipedia could each continue to present very different articles for the topic of Jerusalem.
> For example, what is the expected result of the query "how many colors are there in a rainbow?" For the majority of European languages, the answer will be 6, but in Russian language/culture, the answer will be 7 (there's additionally light-blue).
That's true for Italian as well ("blu" and "azzurro", which is a light blue)
I haven't looked at the spec, but presumably the model will output only a subset of any language. It seems pretty do-able providing there's the ability to add the (hopefully) rare sentence per-target-language manually.
I actually hope Wikipedia, or someone, knocks it out of the park because it could form the basis of a new kind of web.
> For the majority of European languages, the answer will be 6, but in Russian language/culture, the answer will be 7 (there's additionally light-blue).
In my country, a former British colony, we learn the acronym ROYGBIV in school. Red, orange, yellow, green, blue, indigo, violet. 7 colours.
I think a better query is "how many types of snow exist?"
In my experience: one, snow, period.
Somebody living in northern Canada probably has more variations of snow.
In the Eskimo languages, there are ~300 words to express the variations of snow
https://en.wikipedia.org/wiki/Eskimo_words_for_snow
Does this imply that every query about snow should be expressed in all these 300 variations? Do I, as a non-Eskimo speaker, even know or care about all these variations?
We can always use the types of snow.
https://en.wikipedia.org/wiki/Types_of_snow
I count:
6 types of snow
3 intensities of snowfall (which can also mean snow)
8 snow crystal classifications
4 types of snowpack material properties
7 wind-induced snow structures
5 sun/temperature-induced snow structures
9 ski resort classifications
7 informal classifications
Here's the crazy thing. Language is shaped by culture, yes. But culture is also shaped by language. WikiFunctions would essentially become a new language framework that could potentially impact culture. The things that WF values (being the only X to do Y), and that are readily expressible in that framework, could _potentially_ become the markers of success in modern culture. A cultural self-fulfilling prophecy.
Interesting
In Indonesia (and related culture) we also have 7
Mejikuhibiniu
Red, orange, yellow, green, blue, indigo, purple.
We don't use indigo often in daily life though.
English native speakers (myself included) seem to have a real issue understanding that languages vary and aren’t all interchangeable collections of sounds. I blame the lack of bilingualism.
Languages have deeply embedded in themselves the corresponding culture and world views. A phrase like "not my cup of tea" would sound hilarious at best in most other languages. Narratives are even more culture-dependent. The same sequence of events can be presented with entirely different causal relationships based on the values of the writer.
A good translator can find a comparable idiom that neatly replaces it in the target language, but it is rare to have idioms that match exactly in their applicable scope. It depends on the context, on the tone of voice, on the age of the speaker, on other visible or hidden social classifiers, even on the region the original speaker hails from.
This is something humans (i.e., experienced translators) excel at. It is something computers cannot do except at a very basic level — they lack the cultural knowledge and autonomous connection making capability needed.
In Dutch I could translate 'not my cup of tea' in many ways depending on the context. Some examples:
Neutral translation in lieu of additional context: „Dit is niet naar mijn smaak.” ("This is not to my taste.")
Young person in a social context: „Dit is niet echt mijn ding, nee.” ("This is not really my thing, no.")
Middle-aged manager in a white-collar office environment: „Nee, dit is 'not my cup of tea', zeg maar.”¹ ("No, this is 'not my cup of tea', so to speak.")
Ironic and/or cynical down-to-earth jocular fellow: „Dit is niet mijn kopje thee.”² ("This is not my cup of tea.")
1: Note the literal borrowing of English.
2: A literal translation used humorously, knowing full well that this is not 'correct'.
If human language reflects human culture, and NLP models are trained on written text, is it impossible that they can have the cultural context to make such a translation? Is it impossible that an NLP model can get a query like "translate this phrase into informal slang Arabic" and properly do so?
I find it improbable at this stage, because 'understanding human culture' and bridging the gaps between two languages without compromising the quality of the prose depend not just on written text, but also on decades of experience of being a human being in that society/language, and on being able to make connections which may not be apparent from texts. A computer would also miss the final step: our monitoring function that accepts or rejects translations as silly, downright anti-social, or totally apt — that well-honed Fingerspitzengefühl that comes from the total sum of your experience and maturity.
You would, in effect, have to train an artificial intelligence with a brain capable of mimicking ours by raising it as a child in our society (sometimes cheerfully acknowledged by sci-fi writers who employ such characters), and even then you would need to have a brain that processes certain events, happenings, and subtle hints the way a human being roughly does.
Perhaps someone will manage to train one in a suitable manner one day (far far away — not in our lifetime), but I wouldn't want to be responsible for debugging it.
Someone will manage to make one that is good enough for some purposes, though. There are plenty of semi-gibberish manuals included with products these days that strongly reek of machine translation.
You can always find a good-enough substitute, but the original nuance will be lost. The native speaker can choose between "dislike", "would not prefer" or "not my cup of tea". The target language may have an even wider arsenal of phrases suited for different social occasions and across varying levels of politeness. Essentially, mapping the input to the output is a lossy process.
The commenter's point was that there is information that can be expressed in one language but not another (they're not all essentially the same). The fact that you can't translate a figure of speech doesn't evidence this.
I don't want to put words into their mouth, but I think the comment was more about how we think depending on the language we use rather than the concept we're able to express or not.
> A phrase like "not my cup of tea" would sound hilarious at best in most other languages.
It sounds like total nonsense to me even in english. I know what it means but the entire cultural context surrounding tea is simply not available to me. I don't even drink tea to begin with. Always assumed it was some british thing.
Idioms don’t have to make sense, and often don’t, due to semantic drift and language and cultural evolution. Indeed, one of the defining properties of an idiom is that it has meaning beyond its literal, reductionist interpretation.
Of course they make sense. Maybe they don't make sense today but they made sense back then to the people who started using the idiom. That's what I meant by cultural context. Something in their culture gave meaning to those words.
Eh, my point was that being able to use and understand idioms does not require knowledge of the original cultural context because their meaning has become uncoupled from their literal meaning. The saying could just as well be "it’s not my xyzzy of wargbl" and, if it became popular enough, would be perfectly understandable even though "xyzzy" and "wargbl" have zero meaning in English in and of themselves.
> The saying could just as well be "it’s not my xyzzy of wargbl" and, if it became popular enough, would be perfectly understandable
Yes, and it would still sound like total nonsense. I actually know concrete examples from internet culture that are pretty close to that. "I shiggy the diggy". Total nonsense yet becomes second nature after lurking certain forums for a few days.
The cultural context in the "xyzzy of wargbl" case is you made it up to prove a point. If it somehow became popular enough, fully understanding it would require the context of our conversation. It doesn't help explain the word choice since the words were made up to begin with.
Now that I've thought about it, the "my cup of tea" thing must mean tea was extremely important to English speakers. Tea drinking must have been integral to their culture. They developed preferences, and the concept of one's preferred tea made it into the language as a synonym for preference. Still surprising to me, since this never happened to coffee or any other beverage.
You don’t need any cultural context. I’m not particularly familiar with boats, and different types of things “floating a boat” doesn’t even really make sense, but “Whatever floats your boat” never had an impact on my comprehension.
Sure you do. At least if you're actually trying to understand what the words mean in the context they're used. I understand your "whatever floats your boat" example, including the cultural context. This tea business is completely foreign to me though.
I have no problem understanding these things. It's easy to infer meaning when they're used in actual conversation. When taken in isolation and without cultural understanding, these idioms sound like those made up sentences used for language practice.
just like biodiversity is wealth and monoculture is risk, wikidiversity is a bonus, not a problem. mapping the human condition into abstract data can never be "neutral". that map is inherently and unavoidably biased.
wikipedia is about the only major online site / project that retains a bit of good karma, it would be a pity if its meager resources get squandered chasing open ended projects with ill-defined criteria for success.
it might be best (at least in the short term and while the world still sorts out whether and what kind of future we'll have) to leverage the good work in wikidata but only focus on information concerning the natural world - where those biased mappings (while still there) are of a less immediate concern...
> it would be a pity if its meager resources get squandered chasing open ended projects with ill-defined criteria for success.
How so? The closest equivalent to this project was Wikidata itself, and that has been wildly successful beyond anyone's expectations. Besides Wikipedia itself and other WM projects, every Big Tech software assistant is relying on it.
wikidata is an incredibly powerful concept (the early success / adoption in technically savvy circles hints as much) but developing the fullness of its potential will require a lot of further work on technical matters, content, usability, and transparency / governance / acceptance by world-wide users.
abstract wikipedia and related directions are (imho obviously) mission creep, raising the bar to possibly unrealistic heights and thus creating the conditions for criticism of what is (I repeat) one of the few unambiguously good online platforms out there...
Afterthought: a premonition of the rise and fall of mozilla (and the string of its failed projects). surveillance capitalists can waste billions on "moonshots" but non-profits must justify every penny. this is the world we live in, alas
> If more language editions relied on Abstract Wikipedia as the central source of truth, then this dominant point of view could replace alternative perspectives. But Vrandečić countered that each volunteer community could decide for itself whether Abstract Wikipedia should be used as a baseline.
That sounds about as likely as “every country can decide whether to be affected by English-language media.”
This drive toward constant universalization is scary and it’s only going to make Wikipedia even more of a Western worldview project. We need to be heading in the opposite direction. You can be certain that China or India or whomever else will block Wikipedia rather than let western NGOs continue to define the narratives in their respective countries.
It also leaves open the later mandate of removing the opt-in. One can easily imagine a large international crisis where truth is relative depending on where you are, but where writing about something might be deemed dangerous and harmful misinformation which has to be erased. Imagine Wikipedia during WW2, for example.
It's an inverse Tower of Babel situation - a kind of hubris where the aim of reducing differences leads to a greater evil. The hubris is that clever westerners think it's possible to fit everyone into the same way of mind, and that if that happens they will touch heaven.
I strongly doubt that "Catalonia the country" vs "Catalonia the autonomous community" is a problem that you can "solve" with an algorithm of any kind. The best you can aim for is finding and displaying the inconsistencies.
Interesting read and topic. Seems like the primary driver is utilising all the effort put into the English wiki for other languages and also avoiding potentially redundant extra work.
I appreciate the prose that Wikipedia tends to have and hopefully their more structured approach to 'facts' doesn't change that.
I've used wiki(pedia|data) datasets, the former for article summaries and latter for knowledge panel facts. Seemed like they've tinkered with something called 'templates' in the past, where things like population wouldn't be in the markup, but reference another page/key where the data actually is (see enwiki for Greece if interested). Was a bit of a pain writing a parser for it especially when these things change so often and as it turns out the population has reverted back to being plaintext in the article.
Would call myself a layman, though my humble suggestion would be that statements could be tagged with the Wikidata IDs of the things/events mentioned in them. That could make updates easier, and Wikidata has translations of entities that are much less subject to literary license, which could in turn make it easier to translate to other languages. Definitely seems like a large rabbit hole to get into: you could then go on to have simple language rules like is|was when someone is no longer alive, do-able if the Wikidata entity is tagged to the statement.
Interesting stuff to think about nonetheless, and noting other comments that some 'facts' are subject to interpretation.
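The is|was suggestion above can be sketched in a few lines. The entity shape and function name are my own invention, loosely modeled on Wikidata's Q42 entry; the real data model is richer than this.

```python
def render_statement(template: str, entity: dict) -> str:
    # Derive tense from the tagged entity's data rather than hard-coding it:
    # a recorded date of death flips "is" to "was".
    copula = "was" if entity.get("date_of_death") else "is"
    return template.format(name=entity["label"], copula=copula)

douglas_adams = {
    "qid": "Q42",  # Wikidata ID tagged onto the statement
    "label": "Douglas Adams",
    "date_of_death": "2001-05-11",
}

render_statement("{name} {copula} an English writer.", douglas_adams)
# -> "Douglas Adams was an English writer."
```

If the date of death later gets added to the entity, every statement tagged with that QID updates its tense on the next render, with no edit to the article text itself.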
IME most people default to en.wiki unless they are looking for something very specific that the English language version doesn't cover. It's not surprising en.wiki gets the most attention, it's the one people actually use.
For instance, I remember university professors routinely saying to never look up stuff on, in my case, it.wiki. That's good advice, as far as I'm concerned.
> He imagines a future in which the human editors of Abstract Wikipedia are not only writing functions using computer code, but are also able to communicate with one another via that code.
vaste programme ("quite a programme")
> “All of those examples [of universal languages] are failures, all of them didn’t work out,” Vrandečic told me. “So why do we think we have a chance? [Because] we’re creating a framework where we have thousands of people working on this problem, not just a single person or some very small group [...]”
He's completely wrong about the reason. Code as a language doesn't work out because the apparent correspondence between the two is in fact only superficial. Natural language is not formal, it doesn't attempt to be formal, and it doesn't need to be formal. I don't think adding more people is going to solve any of that.
>IME most people default to en.wiki unless they are looking for something very specific that the English language version doesn't cover. It's not surprising en.wiki gets the most attention, it's the one people actually use.
That's a pretty wrong generalization. It depends on the country, whether they speak English well enough and if their national wikipedia is big enough (Scandinavian countries, maybe?). In Russia, for example, almost no one reads the English version. I assume it's the case in France as well. Also I've seen friends of mine from Germany exchange mostly German wiki links between themselves.
I would say it's the case if the information in their primary language is sufficient (and even if it's not, there may be other sites). I defaulted to enwiki as it provides more information and less attempt to control the narrative structure of the same article in my primary language (YMMV on the latter, though)
I don't understand why you're being downvoted, this is a straight fact. The amount of political bullshit in an article is inversely proportional to the number of contributors, and in most cases the English version has more editors.
For exactly the same reason en.wiki is more likely to be correct on scientific and technical subjects.
It's crazy how different languages on the same article can have completely different tones and give a different idea of the subject even when they fundamentally have the same facts listed.
I studied Lojban for 2 days in the naive attempt to use it for knowledge base technology, but then saw a talk about a poem by Lewis Carroll translated to Lojban and dropped that effort, because it is human-unfriendly.
The choice of using functions is at least an improvement over triplet-based systems like Lojban and RDF, similar to how Wolfram Alpha uses glorified s-expressions to have a heterogeneous language for data and computations.
That said, I really think the final point of the article is the most important: Wikipedia lacks a user-friendly way of editing and should really have a WYSIWYG editor to appeal to a wider audience of contributors. From talking to engineers at Wikipedia, I understand this has in part to do with wikitext: all the extensions used result in a language that is context-dependent and can't be parsed, but must be executed while parsing, sort of like human language. Haha. Anyway, I think the situation is lame and they should use their 100 million endowment to build a web editor. If you like, contact me; I've used ProseMirror.
It’s quite obvious to me that this is the holy Grail of natural language processing: A representation of real world knowledge which can be updated from human language, and which can be automatically translated into human language.
That would also mean: if you write a novel, then you can change one fact, and the entire novel will automatically update to reflect that change. So that's basically what Wikipedia is after.
Reading articles on any 20th-21st century event in English and Russian is fascinating. Try Iraq or the US Afghan war, for example. Reading both or more will give you a bigger picture. Same in math and physics.
As of yet, it doesn't look like it has stopped the majority of those who want to sell us on “AI”. The definition of “failure” seems to be quite flexible.
If only Wikidata were easier to use. It is terrific for use across different language articles, but for most people, including me, it is challenging even to start using.
As an occasional Wikipedia editor, I loathe WikiData. It's confusing, overly complex, and nobody really understands it. It smells of "top-down" design, instead of the "bottom-up" organic way of Wikipedia.
Luckily I have to interact with it only occasionally (but it's almost always painful).
More influence from the USA? That's really what we needed from Wikipedia.
> But Vrandečić countered that each volunteer community could decide for itself whether Abstract Wikipedia should be used as a baseline.
So a few people in the "volunteer community" will decide which biases one of the biggest sources of information online will follow?
> Throughout this year’s Wikimania, a number of volunteers expressed concern that English was becoming the lingua franca of the movement, and that English Wikipedia was receiving more attention than its counterparts. Vrandečic suspects that Abstract Wikipedia could help with this problem, too.
How? Code is written in English, its documentation in English, the resources in English. Unless we do pure lambda calculus or something like that.
I'm going to be very skeptical here, but this sounds like a power-grabbing move. That is, some people in Wikipedia want to leverage the power they already have even more. I know how strong the idea of "making things right with tech" can be, but sometimes that's not the correct answer, and this is one of those cases.
> Unless we do pure lambda calculus or something like that.
Yes, that's the plan. The Wikilambda project was renamed to Wikifunctions (probably because not many people know what "lambda" means in this context), but they kept the logo.
https://en.wikipedia.org/wiki/Wikifunctions
Wikidata already demonstrates that language-agnostic knowledge representation works, it's just that it feels like a giant listing of disjointed facts. The new initiative is about turning that into something that wouldn't feel out of place in a human-written Wikipedia article.
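For instance, a single abstract statement might be rendered into several languages by per-language functions. A toy sketch; the constructor and renderer names are invented, and the real notation will certainly differ:

```python
# One language-independent content node...
abstract = {
    "constructor": "instance_of",
    "subject": {"en": "San Francisco", "de": "San Francisco"},
    "class": {"en": "city", "de": "Stadt"},
}

# ...and per-language renderers that turn it into prose.
RENDERERS = {
    "en": lambda n: f'{n["subject"]["en"]} is a {n["class"]["en"]}.',
    "de": lambda n: f'{n["subject"]["de"]} ist eine {n["class"]["de"]}.',
}

def render(node: dict, lang: str) -> str:
    return RENDERERS[lang](node)
```

The hard part, of course, is that real renderers need grammatical agreement, articles, and word order per language, which is exactly what Wikifunctions aims to crowdsource.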
Sure, but most of the mindshare will still navigate around English. Look at the number of languages for the introduction https://www.wikidata.org/wiki/Wikidata:Introduction, and then look at the Tours https://www.wikidata.org/wiki/Wikidata:Tours. Already here we see a massive reduction in the number of languages. I think that the more specialized the information on Wikifunctions or Wikidata becomes, the fewer language choices there will be. This means it could appeal mostly to bilingual or English native audiences. Programming is already very focused on English. I don't think you can build something for people all over the world if most of the people actually building this thing can all communicate with each other in the same language.
That's one of the moments where diversity in teams is very powerful: if your team is only composed of people that can understand and express themselves in English, the people that can't won't be served well by your product.
Of course you could encode "In this culture, X is Y and in this culture, X is Z", but at this point why not just keep the encyclopedia like it is right now?
But now you only have to provide a native-language tutorial for Wikifunctions instead of recreating all the vast information available in different languages. It is perhaps unfortunate that English is the default medium, but such is life. We can't really do anything together without a common language, and I don't see how choosing one of the most spoken second languages as common ground is discriminatory. It is inclusive to the widest number of people, compared to not having such functionality at all.
The idea is to provide a unique source of truth that allows everyone to express their truth in their language and have it reflected in all other languages.
I think that there are universal truths; I just don't want the distinction between what is a universal truth and what is not to be made by a small team of active Wikipedia contributors. I realize that this is probably already the case, but I think these changes will aggravate it.
"Trying to Transcend the Limits of Human Language" or is it rather "Using algorithms to transform Wikipedia into a politically correct entity based on unknown deciders of what is correct" ?
The article already alludes to many cases where this simply won't work due to the realities of needing to take local context into account. Otherwise you'll only end up annoying (very) large groups of people, both amongst your contributors and externally, for no reason other than your desire to be politically correct.