Whether or not we "need" it, it turns out that in practice the Semantic Web was actually harder than AI and we're getting the latter before the former.
This has really always been the disconnect between the semantic web advocates and the skeptics like me. It's not that I don't buy your analysis of the benefits... it's that I do not and never have bought your analysis of the costs. The only question I have is how many orders of magnitude they undersell it by. It's certainly a non-trivial number.
What AI will do is basically swamp the semantic web. The problems the world will be dealing with because of AI, like the deluge of AI-generated meaningless garbage, AIs accidentally being trained on other AI-generated garbage, and the complete overwhelming of all human voices by AI voices through sheer weight [1], will be so large and pressing that the problems of the semantic web will just get washed away in the flood. And, sadly, the tools of the semantic web will be effectively useless to help; all the signals they are based on are even easier to forge than a human voice in arbitrary text. No help at all.
[1] What will it look like to live in a world where for every kilobyte of bespoke humanity I produce, an AI can produce megabytes? What will it look like when someone can afford to use AI to build an entire Reddit, just for me and only me, and isolate me there?
The disconnect between advocates of a "Semantic Web" and the commoditised reality of the WWW is that the advocates implicitly see an orderly structure of information as being a first principle of knowability/usability and education.
The commoditised reality of the WWW is that content is lucrative, but the content need not be of any social or educational value as long as the ice cream cone licks itself.
Cyberlarceny is more lucrative than order, usability, and education. And now, algorithmic text generation will provide the full experience of half-baked verbiage refreshed daily.
Society does need reliable, true information. But we will not have a semantic Web for economic reasons far beyond the value of education.
It’s not just the cost, it’s what humans are drawn to.
If you sort content by what leads to maximum engagement, the stuff that comes out on top is absolutely not informative, intelligent, or positive. You get repetitive addictive filler, inflammatory fear/hate/outrage porn, and lurid tabloid-style trash.
When you scroll, the limbic system appears to be in charge, not the neocortex. The limbic system seeks only dopamine. The easiest path to dopamine wins.
This means that on an ad supported web content like the above will swamp everything else. There is zero incentive to produce anything else.
AI can now produce maximally engaging content on demand, so the future is Twitter or TikTok but largely without human creators. The future of the open web and free apps is a Skinner box full of AI generated outrage porn and silly filler that scrolls forever, only pausing for ads.
The dynamics can be very different when people pay directly for content. The neocortex appears to get more involved in buying decisions than passive consumption. But very very few people pay for content so there is either not much of this or it will have to be expensive.
We may be going back to the days when actual knowledge and quality discourse were costly, not because of the cost of printing them but as a filter mechanism for finding them in the vast sea of trash.
The net was already headed here before AI. Generative models will just accelerate this.
I believe otherwise. They are perfectly happy with the errors. I see this as Wikipedia 2.0. Wikipedia was once derided for being full of errors and holes. Now it is a de facto encyclopedia, the go-to place for the "true" description of most anything. AI will do the same. Eventually its errors and subtle biases will become the accepted version of reality. Society needs only actionable information. Accuracy is great, but volume of content wins the audience.
Obviously in context I'm talking about "human content", such as we are creating here.
Of course the majority of the traffic on the internet by raw bytes is video, but that was clearly not what I'm talking about.
"If you're happier with a unique personalized reddit"
There is an element of deception there. If I think I'm interacting with real humans, I want to be interacting with real humans, not bots. And nobody but nobody will just generously spend the money to do that, and then leave me a perfect simulation of a human space. They're going to have some ulterior motive and it will not be in my best interests.
I will agree that a shocking-to-me number of people are not bothered by this prospect. Perhaps you are one of them who would not mind. You do you, as the saying goes. I am bothered about being deceived. Truth matters to me and I have no interest in being comprehensively lied to like that.
I think you underestimate people's intelligence. Your hypothetical AI reddit won't deceive anyone because the people using it will be perfectly happy knowing it's all generated, the same way people playing video games are not "deceived" into thinking they are watching a film with actors, and people driving cars are not "deceived" into thinking there's a horse in there somewhere.
AI is complicated, but seeing it as mimicry of pre-AI reality is far too narrow to capture the real change (and the real concerns).
I am assuming that someone is setting out to deceive, not advertising it as a fake reddit. We already had that. (It was pretty funny.)
I'm also not just talking about someone hooking up the publicly-available and known language models of May 2023 to this. You going to guarantee me that "nobody" will be deceived in even just another year's worth of public progress? Or five? Ten?
And "nobody" is a really tall bar... should we at least set the bar at 100IQ or 110, just in the interests of fairness? 2022 discussions of "income inequality" will have nothing on 2032's if a 150 IQ and heavy education becomes necessary to navigate the internet successfully.
Generally I agree, although it'll totally happen to people who get shadowbanned, since having an LLM reply to a banned user's invisible comments will make it less likely they'll refresh their VPN and create another account.
In which the rich have been getting richer, and asset inflation has been growing ever higher.
>If you're happier with a unique personalized reddit
The question here is: it depends. A simulacrum of which Reddit? Of the one that shows you a good world where we try to become better people? Of the Reddit that tells you "They" are out to get you so you should be afraid? One personalized from the 'real' world, or one just made up and trending towards the imaginary?
I mean, when I play a video game, I know I'm not playing reality. The line starts breaking down as the level of fidelity approaches that of reality.
You're rebutting a moral argument I never made. My point is that humans have never been the sole or even dominant actors in the world, so it's odd to decry AI for somehow creating that reality.
I wonder if AI trained on other AI content inevitably leads to a garbage-in, garbage-out scenario that gets worse over time, or whether it actually improves, maybe even at a non-linear rate?
I believe the former is true, because of financial and political incentives, but the latter isn’t unthinkable.
When Deepmind built AlphaGo, they trained their AI on other AI-generated content once they ran out of training data, and that was what gave it superhuman abilities. https://en.wikipedia.org/wiki/AlphaGo#Algorithm
I've had a nerd boner for semantic web constructs for years. My take is it's more useful than ever in the age of AI, both for machine synthesis and for escaping the flood. Metadata and shared bookmarks make the internet a place to explore and stake out again.
Than the throw-more-data-at-the-wall-leveraging-ever-growing-parallel-GPU-compute-and-hope-the-black-box-that-comes-out-is-good-enough shortcut to AI* :p
I am looking not just at the now, but given the long history of failure of the Semantic Web, the next 10-20 years as well. It is fair in this context.
But otherwise, yes, I would agree with you, the LLM-based AIs that we have right this very second don't "do" the Semantic Web. But we're clearly much closer on that front than by asking Facebook to pretty please grease up everything in pristine RDF files, let alone the entire web.
It is always possible to scale reviewers. And if someone is abusing access credentials to propose unvetted machine-generated changes those can be quickly revoked.
Yes, because the semantic web required people to mark up stuff semantically, which they didn't do.
The LLM web doesn't need people to explain the role of headings, paragraphs, etc. It can figure those out. It's faster, cheaper, and more effective than manually applying markup.
While OP's reply answers your question, it's important to not apply current costs to predict the future of AI. Hardware for LLMs is one step function away from unimaginable capabilities. That breakthrough could be in performance, cost, or more likely, both.
Imagine GPT-4 at 1/1000th the cost. That's where we're going. And you can bet your ass Nvidia is working on it as we speak. Or maybe someone else will leapfrog them like ARM did to Intel.
Remember when a Cray supercomputer cost nearly 10 million dollars and had 8MB of RAM? And now we carry something insanely more powerful in our pocket.
If something is possible on a computer today but not yet cheaper than people, then it is very likely it will become cheaper than people within a few years. Computers keep scaling and individuals don't.
Maybe? The semantic web is useful because it makes us catalog our thoughts and language in a machine readable way. The key thing is the cost of running these semantic web machines is smaller than running a GPU powered language model.
Also, didn't these models learn from the semantic web? No idea if that matters, but still.
Idk what definition of "semantic web" you have in mind, but AFAIK, it usually relies on labeling words and names, and adding some vague relations. That's not enough. You can't build enough inspectable, correct knowledge that way. Sure, you can list genealogical trees and the components of a car, but it won't help you one bit when someone asks "give the five most important things about XML in gangsta rap style" (0), because none of the needed knowledge can be represented in any form of semantic web I've seen, and there isn't an engine capable of running the required inferences.
You may think that's a joke, but from the performance perspective, it's utterly impressive. Summarizing a paper is even more impressive, and is equally far outside the scope of the semantic web.
Another hazard of the AI approach is that it’s probabilistic. Presuming the semantic information and implementation are all correct (which I will admit is not a given), they’ll always be correct, whereas the AI approach will sometimes give wrong answers.
LLMs always remind me of pigeons you see in the city that have spent their whole life eating out of bins and pecking dropped chunks of kebab. We trained it on a bunch of stuff, but we're not sure quite what. Looks like it works OK, so let's get it on a plate!
There’s a difference, though: if the erroneous data comes from human input error, it can easily be corrected, and if the error is due to a software bug, it can be diagnosed and corrected; but errors that come from LLMs (the main type of AI in question here, and indicative enough of AIs in general, as it stands, though probably worse) are basically unfixable (that is: we don’t know how it would even be possible to fix them in general, so all you can do is patch things here and there, which will only probably fix the problems, and will increase the frequency of other problems).
How is an error from an LLM less correctable than one from a human? If "Paris is the capital of Germany" makes it into a news article, how does the origin of the error impact its correctability?
Reinforcement learning seems pretty analogous to how we correct people who are wrong about something. I'm really not seeing a category difference here?
With a human, if you have an employee, you can instruct them to correct an error, and if they refuse to or just don’t, you can fire them—you have recourse.
LLMs are complex like humans, but you lack the ability to make it do what you want, or to have recourse. If the model has gained the impression that Paris is the capital of Germany (a more concrete and fixable error than most, frankly), this is woven into its parameters throughout, and it’s really hard to fix, and for LLMs we basically have no idea how to actually truly do so. With such a simple error as “Paris is the capital of Germany”, we do know how to fine-tune the model to fairly effectively mask that error, but the important errors are often closer to what you might consider procedural than factual—rather than its “memory” containing a wrong fact, it’s “thinking” in the wrong way.
But all up, I’d call “lack of recourse” the most important difference.
Based on the failure of the semantic web so far, I think the statement 'not a given' is a given in itself. Humans are horrible at giving systems correct information.
Is anyone actually doing the semantic web anymore? I thought people gave up on that over a decade ago. It’s hard enough to get devs to just make content semantic enough for accessibility, let alone defining semantics for computers.
The closest practical application that's actually happening I think would be Google's Structured Data guidelines using schema.org types and JSON-LD to power a bunch of enhanced search display features for Rich Results, and for powering Google Shopping. Compliance with that is pretty ubiquitous for Search optimization reasons.
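For anyone who hasn't seen it, here's a minimal sketch of the kind of markup those guidelines ask for. The product details are invented, but Product/Offer and the property names are standard schema.org vocabulary:

    <!-- embedded in the page head; the values below are made up for illustration -->
    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Product",
      "name": "Example Widget",
      "image": "https://example.com/widget.jpg",
      "offers": {
        "@type": "Offer",
        "price": "19.99",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock"
      }
    }
    </script>

It sits alongside the visible HTML rather than annotating it, which is probably why it's the one flavour of semantic markup that actually gets written: there's a direct SEO payoff.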
Somewhat? On the one hand, replacing a deterministic system with a probabilistic one is generally bad. Why replace something that's 100% accurate all the time with something that is... somewhat less than that?
On the other hand, the semantic web has been a complete failure. Even present attempts at semantically understanding the web (aka what search engines do all day) require highly complex heuristic and machine-learned solutions... so the "fully deterministic" thing is only technically true, but not meaningfully useful in any sense.
Even the actual semantic web standards that exist (and are poorly adopted) don't seem to really solve the core problem with the semantic web: there's no such thing as a universal ontology. Every attempt to produce a broad-based shared ontology has failed - there are fundamental disagreements about how to organize information, and information organization needs turn out to nearly always be application-specific, and be highly unstable over time, preventing there from being a useful, durable universal standard for various topics.
So I guess long story short: we need AI (or just highly advanced heuristic methodologies) to scrape useful information out of natural content. We always have, because the deterministic alternative has never actually worked.
I think the deterministic requirement was what made the semantic web nearly impossible at scale. Aligning on the schema for all the entities can be challenging within a small group, let alone the world. Doing it continuously is impossible today. Wikidata has tons of abandoned branches which are no longer relevant, or are simply wrong.
It seems like LLMs and vector representations for words are more practical ways to explore semantic data. Then the specific, recurring queries may be optimized through graph representations.
A good way to think about it is that the "semantic web" was all about finding some kind of universal data representation for the world's data, and that has failed in large part because data representation is inherently application-specific.
The only practicable version of the semantic web needs to embrace that reality. We need tools for converting, extracting, and interpreting data from one application-specific data representation to another.
Sometimes that's LLMs, sometimes it can be done heuristically, but either way, the notion of the universal data structure simply isn't realistic.
> As much as I love the semantic web, it is hard for humans to write, update, and maintain.
Of course, ChatGPT can help a human to write the tags to put on the website:
> Me: I have a shop that has the following opening hours: "Our opening hours are: Weekdays 10 until 7. Weekend 10 until 10 (Early closing 9 o'clock Sunday)." I would like to include these opening hours on the website of the shop using meta tags, so that search engines can easily read and parse them.
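For reference, what it's being asked to produce is essentially schema.org's OpeningHoursSpecification. A hand-written sketch of that markup is below; the shop name is a placeholder, and reading "10 until 7" as 10:00–19:00 is itself an interpretation something has to make:

    <!-- placeholder name; the exact times are my reading of the prose hours -->
    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "LocalBusiness",
      "name": "Example Shop",
      "openingHoursSpecification": [
        {
          "@type": "OpeningHoursSpecification",
          "dayOfWeek": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
          "opens": "10:00",
          "closes": "19:00"
        },
        {
          "@type": "OpeningHoursSpecification",
          "dayOfWeek": "Saturday",
          "opens": "10:00",
          "closes": "22:00"
        },
        {
          "@type": "OpeningHoursSpecification",
          "dayOfWeek": "Sunday",
          "opens": "10:00",
          "closes": "21:00"
        }
      ]
    }
    </script>

Nobody running a shop is going to write that by hand, which has always been the sticking point; translating the prose into it is exactly the chore an LLM is happy to take on.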
EU spent so many research Euros on Semantic Web projects. They all went something like this:
CS researchers: Hey guys, here's the niche marginal improvement our grad student spent 200 hours developing. Here are the guidelines for you domain experts on how to encode and load your knowledge. START!
Domain experts: euh ... what you are asking us to do would take 9 man-years for the initial base and 2 FTE of upkeep. Why exactly would we do this?
CS researchers: not my problem. Paper has been submitted and accepted. Your question was out of scope. Want to join our next proposal?
Why is no one mentioning the hardware requirements to run each? The semantic version runs with 2KB of RAM and takes a couple of microseconds. The LLM needs at least 30GB of RAM and maybe 5 seconds of processing time.
>"Our opening hours are: Weekdays 10 until 7. Weekend 10 until 10 (Early closing 9 o'clock Sunday)."
Pardon me, but have you visited a local business' website recently? Rarely is information like this presented in such an explicit way; it's usually crammed into some corner of the footer and without the preamble describing the relevant information. And if the business has multiple locations? Now the visual blocking and semantics become relevant to accurately parse the thing.
Humans don't need the Semantic Web, AI/bots do. It is expensive to turn text into the useful semantic bits that are useful for doing computations on large (or even moderate) quantities of data. So the AI will create a new Semantic Web3 that any bot (or human) can contribute to on any topic (especially about anything on the web).
If anything I think that LLMs are making the Semantic Web realizable in a new way. An LLM can summarize the raw content down to a schematized representation, which you can then reference to express deterministic programmatic actions using the resulting properties. You still might want to do that rather than let the LLM do everything.
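As a rough sketch of what I mean (the event, venue, and date below are all invented), the LLM's job is just to turn a messy announcement page into markup like this, ready to embed back into a page or hand to downstream code:

    <!-- hypothetical LLM output for a scraped event page; all values invented -->
    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Event",
      "name": "Example Open-Source Meetup",
      "startDate": "2023-06-15T18:30",
      "location": {
        "@type": "Place",
        "name": "Example Community Hall",
        "address": "123 Example St"
      }
    }
    </script>

After that, ordinary deterministic code can read startDate and location to populate a calendar or fire a reminder, without ever asking the model anything again.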
The big lie about the semantic web is this ridiculous notion that it is machine readable just because it has an unambiguous grammar to parse and has ontologies to disambiguate homonyms.
Thanks, I was rather involved in the research community. It has little to do with parsing text NLP-style. It has everything to do with annotating documents with formally structured KR data, or embedding the same into them.
In fact, my own involvement in the field started with Jim Hendler suggesting I look into the possibility of developing a spider that would wander across HTML pages and glean knowledge from them with an NLP technique. I worked on that for a while, then abandoned it and proposed to him instead that web pages should be marked up with formally parseable structured semantic data. Why parse a student's page saying he "Goes to U Maryland" when he could just write <claim obj1="me" obj2="UMaryland" rel="attend" ontology="http://ontology.org/university-ontology/">. That was 1995.
I think if anything, the Semantic Web makes more sense with AI, not less. Not necessarily that we mark up web pages, but that formally defining an ontology that can interlink with other ontologies can help define the cultural boundaries we are seeking for the ethical and safe use of AI.
For example, there is an old effort called Cyc, which is an ontology and knowledge base of common sense. Something like this can be used to align LLMs. However, what is common sense is not necessarily universal across cultures. The underlying ideas behind the semantic web can be used for ontologies that don't agree with each other.
ML (as currently used) is probabilistic. Whether or not you get what you want is up to chance.
Granted, based on the dataset, it's quite a bit more likely than winning big at the casino - but regardless, it's still not deterministic like intentional semantic markup.