Lol I love that it was edited out right away. I probably agree with that decision, but this was a really cool project. I follow a few YouTube channels that make graphical visualizations of genealogies, and I think they would really appreciate it if you shared this tech with them.
For anyone who is interested: I used to work with a guy named Richard Wang, who indexed Wikipedia as his training data set in order to do named entity recognition. He'd be a good person to talk to for anyone pursuing this.
My heritage is basically English. If I counted unrelated ancestors back through 45 generations, I reckon I should have had 2^45 = 35,184,372,088,832 of them! Since the population of England back in the 800s was probably a few million, what are the chances I am NOT a descendant of King Alfred?
Loosely related: I had some fun building a database of all events parsed from Wikipedia [0]. I parsed the full wikitext using a language model, extracting the date/time, the location, and what happened then and there. The location data was then fed into a local OpenStreetMap instance to get the coordinates. I then built a front end to allow querying events nearby in time or in space. The UI is a bit clunky because my database is too large and I had little experience working with databases, but building it was fun.
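For anyone curious what the geocoding step of a pipeline like that can look like, here is a minimal sketch assuming a local Nominatim instance (the host/port and the sample event dict are illustrative, not taken from the parent's project):

```python
import requests

def geocode(place_name):
    """Resolve a place name extracted from an event into (lat, lon)."""
    resp = requests.get(
        "http://localhost:8080/search",            # local Nominatim instance (assumed)
        params={"q": place_name, "format": "json", "limit": 1},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])

# Attach coordinates to an event produced by the extraction step (example data).
event = {"date": "1066-10-14", "location": "Hastings", "summary": "Battle of Hastings"}
event["coords"] = geocode(event["location"])
print(event)
```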
A lot of items don't have a wikipedia page, but all wikipedia, wikitravel, wiki* pages have a wikidata item (see the "Wikidata item" link on the left side of the pages).
Some Wikipedia infoboxes are based on Wikidata. I can't find an example right now, but here are some links:
Perhaps, but I already know how to scrape HTML and I know the data I wanted to pull out was in there. I have no idea how to query wikidata and it could have ended up being a blind alley.
Also, it was only my reading your comment just now that told me wikidata was even a thing.
Don't worry about the haters. You needed a paltry amount of data and you got it with the tools you had and knew.
When I was analyzing Wikipedia about 10 years ago for fun and, later, actual profit, I did the responsible thing and downloaded one of their megadumps because I needed every English page. That's what people here are concerned about, but it doesn't matter for your use case.
Generally Wikidata would definitely be the way to go here, though I just now tried to retrace your graph in Wikidata and it seems to be missing at least one relation (Ada of Holland has no children listed -- https://www.wikidata.org/wiki/Q16156475).
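For reference, checking what Wikidata actually records is a one-off query against the public SPARQL endpoint. A rough sketch (Q16156475 is the Ada of Holland item linked above; P40 is the "child" property, and P22/P25 are the separate "father"/"mother" properties; the User-Agent contact is a placeholder):

```python
import requests

QUERY = """
SELECT ?child ?childLabel WHERE {
  wd:Q16156475 wdt:P40 ?child .                       # Ada of Holland -> child
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "genealogy-check/0.1 (contact@example.com)"},  # placeholder
    timeout=60,
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["child"]["value"], row["childLabel"]["value"])
```

If the relation is simply missing, as with Ada of Holland, this prints nothing at all.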
I am doubtful. I tried for a long time to use it to get data for my taxonomic graph project (https://relatedhow.kodare.com/) and SPARQL was just not usable at all. The biggest problem was the 60s time limit. Totally not workable for what I wanted. I also had issues with seemingly inconsistent results, but it was hard to tell.
I ended up loading the full nightly db dump and filtering it while streaming from the zip instead. Faster, and it actually worked.
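A minimal sketch of that stream-and-filter approach, assuming the standard Wikidata JSON dump layout (one entity per line inside a big JSON array); the P22/P25 filter is just an illustrative example, not the parent's actual code:

```python
import gzip
import json

def stream_entities(path):
    """Yield Wikidata entities one by one from the gzipped JSON dump."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")       # each entity line ends with a comma
            if line in ("[", "]", ""):            # skip the array brackets
                continue
            yield json.loads(line)

# Example filter: keep only items that have a father (P22) or mother (P25) claim.
for entity in stream_entities("latest-all.json.gz"):
    claims = entity.get("claims", {})
    if "P22" in claims or "P25" in claims:
        print(entity["id"])
```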
I only thought of it myself because you mentioned the problem with deducing which parent is the mother and which is the father, and I remember in wikidata those are separate fields.
Yup, I had the same situation some months before, even though I knew wikidata was a thing.
I know javascript and had the pages at hand.
I looked at wikidata and some pages about it, but still had no clear idea how to use it and no motivation to dig into it, because JS just worked with a small custom script to retrieve some pages and data.
RDF has to be the best and saddest example of the sunk cost fallacy. Instead of redirecting their efforts to a more general graph model which has actual hype and use by developers, its cultists are doubling down on their abstruse technology stack, making it ever more complicated while still not addressing any of its fundamental problems.
I mean, imho RDF isn't the problem. RDF itself is very simple. As you correctly point out, the stack is overcomplicated.
> Instead of redirecting their efforts to a more general graph model which has actual hype and use by developers
neo4j is basically this. You can also load RDF into neo4j using neosemantics and query it using Cypher instead of using a conventional triplestore with SPARQL, which is nice.
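Roughly, the n10s route looks like this from Python; treat it as a sketch, since the Neo4j URI, credentials, RDF URL and constraint syntax (shown here for Neo4j 5) all depend on your setup:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # n10s expects a uniqueness constraint on :Resource(uri) (Neo4j 5 syntax here).
    session.run(
        "CREATE CONSTRAINT n10s_unique_uri IF NOT EXISTS "
        "FOR (r:Resource) REQUIRE r.uri IS UNIQUE"
    )
    session.run("CALL n10s.graphconfig.init()")              # one-time graph config
    session.run(
        "CALL n10s.rdf.import.fetch($url, 'Turtle')",        # pull RDF from a URL
        url="https://example.org/data.ttl",                  # placeholder dataset
    )
    # Query the resulting property graph with plain Cypher:
    for record in session.run("MATCH (n:Resource) RETURN n.uri AS uri LIMIT 5"):
        print(record["uri"])

driver.close()
```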
Nothing beats it for data exchange? You must be joking, because if that were remotely true, RDF would be in wide use, which it totally is not. Except for a few niche domains like bioinformatics, it is not used. No killer application uses it as a data format, and no popular data format is based on it either. Actually, I can think of only a single data format based on RDF, and the only open data I know of that uses it was converted to it and was simpler to use in its original format.
And for the model: property graph. But yeah, enjoy your Stockholm syndrome with your model where reification is required to annotate an edge. Also, even your nickname is an acknowledgment of RDF's failure: named graphs (n-quads) were created because RDF triples aren't good enough for modeling data.
Yes, let us see how you do data interchange without global identifiers. Such as URIs, which RDF has built-in natively and property graphs do not.
You're right about bioinformatics, but let's do a quick check on http://sparql.club/ to see who else is looking for RDF/SPARQL specialists. Oh look: automotive industry, finance, publishing, medical, research, etc.
In my experience SPARQL is really hard to use, and Wikidata data quality is really low. To the point that one of my larger projects is just filtering the data to make it usable for my use case.
But overall I would not encourage using it; if I had known how much work it takes to get usable data, I would not have bothered with it.
Queries as simple as "is this entry describing an event, a bridge, or neither" require extreme effort to get right reliably, including maintaining a private list of patches and exemptions.
And bots creating millions of known duplicate entries and expecting people to resolve this manually is quite discouraging. Creating Wikidata entries for Cebuano Wikipedia 'articles' was accepted, despite the Cebuano botpedia being nearly completely bot-generated.
And that is before even getting into the unclear legal status. Yes, they can legally import databases covered by database rights, but then they should either make clear that Wikidata is a legal quagmire in the EU or forbid such imports. The Wikidata community did neither.
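To make the "is this entry an event, a bridge, or neither" point above concrete: the usual starting point is an instance-of/subclass-of check like the sketch below. The "bridge" class id is recalled from memory and should be verified, and in practice you still need the patch/exemption lists mentioned above:

```python
import requests

def is_instance_of(item_qid, class_qid):
    """True if the item is an instance of the class or any of its subclasses."""
    query = f"ASK {{ wd:{item_qid} wdt:P31/wdt:P279* wd:{class_qid} }}"
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "class-check/0.1 (contact@example.com)"},  # placeholder
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["boolean"]

BRIDGE = "Q12280"   # assumed to be the "bridge" class; verify before relying on it
print(is_instance_of("Q42", BRIDGE))   # Douglas Adams: expect False
```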
It makes Wikipedia better too, in a virtuous cycle, with some infoboxes like those that he scraped being converted to be automatically populated from wikidata.
There's some mix between "it's slow" and "it sets its timeout threshold too low" - a lot of queries would be OK if they just had a bit more time to run. And unfortunately, the time wasted on the badput of the killed queries just slows down everyone else. (They really need a batch queue)
The Wikidata folks are well aware of the limits on their SPARQL service. They just posted an update the other day:
Seems to depend on the query. I can issue straightforward queries that visit a few hundred thousand triples easily, but when I write a query that visits tens of millions of triples, it times out.
E.g. to get a job in FAANG, finance or pharma where SPARQL is used extensively on enterprise Knowledge Graphs? Check the jobs here: http://sparql.club/
Definitely. There's even a public query service ( https://query.wikidata.org/ ) which can do a lot of this (though SPARQL is not great at searching for chains).
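Chains can at least be expressed with SPARQL property paths, though deep ones are exactly the queries that tend to hit the timeouts discussed elsewhere in the thread. A sketch (Q9682 is Elizabeth II; the Alfred the Great id is recalled from memory and should be double-checked):

```python
import requests

ALFRED = "Q83476"        # assumed id for Alfred the Great; verify before use
ELIZABETH_II = "Q9682"

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={
        "query": f"ASK {{ wd:{ALFRED} wdt:P40+ wd:{ELIZABETH_II} }}",  # one or more "child" hops
        "format": "json",
    },
    headers={"User-Agent": "descent-check/0.1 (contact@example.com)"},  # placeholder
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["boolean"])
```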
Like people on the other comment have said, if you've actually tried getting data from wikidata/wikipedia you very quickly learn the HTML is much easier to parse than the results wikidata gives you.
The infoboxes, which are what this guy is scraping, are much easier to scrape from the HTML than from the XML dumps.
The reason is that the dumps just have pointers to templates, and you need to understand quite a bit about Wikipedia's bespoke rendering system to know how to fully realize them (or use a constantly-evolving library like wtf_wikipedia [1] to parse them).
The rendered HTML, on the other hand, is designed for humans, and so what you see is what you get.
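For comparison, the rendered-HTML route is only a few lines. A sketch using BeautifulSoup against the standard "infobox" table class (the blog itself used HtmlAgilityPack from C#, so this is the same idea, not his code; the User-Agent contact is a placeholder):

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get(
    "https://en.wikipedia.org/wiki/Alfred_the_Great",
    headers={"User-Agent": "infobox-demo/0.1 (contact@example.com)"},  # placeholder
    timeout=30,
)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
infobox = soup.find("table", class_="infobox")
if infobox:
    for row in infobox.find_all("tr"):
        header, value = row.find("th"), row.find("td")
        if header and value:
            print(header.get_text(" ", strip=True), "=", value.get_text(" ", strip=True))
```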
You could, and if he was doing this on the entire corpus that'd be the responsible thing to do.
But, his project really was very reasonable:
- it fetched ~2,400 pages
- he cached them after first fetch
- Wikipedia aggressively caches anonymous page views (eg the Queen Elizabeth page has a cache age of 82,000 seconds)
English Wikipedia does about 250,000,000 pageviews/day. This guy's use was 0.001% of traffic on that day.
I get the slippery slope arguments, but to me, it just doesn't apply. As someone who has donated $1,000 to Wikipedia in the past, I'm totally happy to have those funds spent supporting use cases like this, rather than demanding that people who want to benefit from Wikipedia be able to set up a MySQL server, spend hours doing the import, install and configure a PHP server, etc, etc.
He was probably one of the biggest users that day, so that makes sense.
The 2,400 pages, assuming a 50 KB average gzipped size, equate to 120 MB of transfer. I'm assuming CPU usage is negligible due to CDN caching, and so bandwidth is the main cost. 120 MB is orders of magnitude less transfer than the 18.5 GB dump.
Instead of the dumps, he could have used the API -- but would that have significantly changed the costs to the Wikimedia foundation? I think probably not. In my experience, the happy path (serving anonymous HTML) is going to be aggressively optimized for costs, eg caching, CDNs, negotiated bandwidth discounts.
If we accept that these kinds of projects are permissible (which no one seems to be debating, just the manner in which he did the project!), I think the way this guy went about doing it was not actually as bad as people are making it out to be.
That's true. On the other hand, pages with infoboxes are likely well-linked and will end up in the cache either due to legitimate popularity or due to crawler visits.
Checking a random sample of 50 pages from this guy's dataset, 70% of them were cached.
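One way such a spot check can be done is by looking at response headers: Wikipedia's responses carry an "age" header for cached pages and an "x-cache-status"-style header, though the header names are an implementation detail and may change. A sketch:

```python
import requests

def cache_info(title):
    """Return the (age-in-seconds, cache-status) headers for an article, if present."""
    resp = requests.head(
        f"https://en.wikipedia.org/wiki/{title}",
        headers={"User-Agent": "cache-check/0.1 (contact@example.com)"},  # placeholder
        timeout=30,
        allow_redirects=True,
    )
    return resp.headers.get("age"), resp.headers.get("x-cache-status")

print(cache_info("Elizabeth_II"))   # a hot page typically shows a large age and a hit status
```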
Note - there are several levels of caching at Wikipedia. Even if those pages aren't in the CDN (Varnish) cache, they may be in the parser cache (an application-level cache of most of the page).
This amount of activity really isn't something to worry about, especially when taking the fast path of a logged-out user viewing a likely-to-be-cached page.
En Wikipedia has some standards. Generally though, they are user-created tables and it's up to the users to make them consistent (if they so desire). En Wikipedia generally does, but it's not exactly a hard guarantee.
If you want machine-readable data, use Wikidata (if you hate RDF you can still scrape the HTML preview of the data).
Even just being able to download a tarball of the HTML of the infoboxes would be really powerful, setting aside the difficulty of, say, translating them into a consistent JSON format.
That plus a few other key things (categories, opening paragraph, redirects, pageview data) enable a lot of powerful analysis.
That actually might be kind of a neat thing to publish. Hmmmm.
Better yet-- what is the set of wikipedia articles which have an info box that cannot be sensibly interpreted as key/value pairs where the key is a simple string?
Hehe-- I am going to rankly speculate that nearly all of them follow an obvious standard of key/value pairs where the key is a string. And then there are like two or three subcultures on Wikipedia that put rando stuff in there and would troll to the death before being forced to change their infobox class to "rando_box" or whatever negligible effort it would take them if a standard were to be enforced.
That's the great thing about HtmlAgilityPack: extracting data from HTML is really easy. I might even say easier than if I had the page in some table-based data system.
Unlike APIs, html class/tag names or whatever provide no stability guarantees. The site owner can break your parser whenever they want for any reason. They can do that with an API, but usually won't since some guarantee of stability is the point of an API.
True, but the analysis was done on files downloaded over the span of two or three days. If someone had decided to change the CSS class of an infobox during that time, I'd have noticed, investigated and adjusted my code appropriately.
Scraping, especially on a large scale, can put a noticeable strain on servers.
Bulk downloads (database dumps) are much cheaper to serve for someone crawling millions of pages.
It gets even more significant if generating the response is resource-intensive (not sure whether Wikipedia qualifies for that, but complex templates may cause it).
IANAL but since the pages are published under Creative Commons Attribution-ShareAlike, if someone wishes to collect the text on the basis of the HTML version then there's not much you can do about it.
Wikimedia no doubt have caching, CDNs and all that jazz in place so the likely impact on infrastructure is probably de-minimis in the grand scheme of things (the thousands or millions of humans who visit the site every second).
>IANAL but since the pages are published under Creative Commons Attribution-ShareAlike, if someone wishes to collect the text on the basis of the HTML version then there's not much you can do about it.
They said please don't, not don't do it or they'll sue you.
But content license and site terms of use are different things.
From their terms of use you aren’t allowed to
> [Disrupt] the services by placing an undue burden on a Project website or the networks or servers connected with a Project website;
Wikipedia is also well within their rights to implement scraping countermeasures.
Yes, but they aren't going to care for just 2400 pages.
As a general rule, make your scraper non-parallel and put contact details in the user-agent in case of an issue, and you're probably all good.
After all, Wikipedia is meant to be used. Don't be unduly disruptive, don't scrape 20 million pages, but scraping a couple thousand is totally acceptable.
Source: used to work for wikimedia, albeit not in the sre dept. My opinions are of course totally my own.
Honestly I'd rather people err on the side of scraping Wikipedia too much than live in fear of being disruptive and not do cool things as a result. Wikipedia is meant to be used to spread knowledge. That includes data mining projects such as the one in this blog.
(Before anyone takes this out of context: no, I'm not saying it's OK to be intentionally disruptive, or to do things without exercising any care at all. Also, always set a unique, descriptive user-agent with an email address if you're doing anything automated on Wikipedia.)
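Boiled down to code, that "be polite" advice is basically: sequential requests, a contactable User-Agent, a delay between hits, and caching what you fetch. A sketch (titles, paths and the contact address are placeholders, not the blog's actual code):

```python
import os
import time
import requests

HEADERS = {"User-Agent": "family-tree-scraper/0.1 (https://example.com; me@example.com)"}
titles = ["Alfred_the_Great", "Edward_the_Elder"]   # placeholder list (~2,400 in the project)

os.makedirs("cache", exist_ok=True)
for title in titles:
    resp = requests.get(f"https://en.wikipedia.org/wiki/{title}",
                        headers=HEADERS, timeout=30)
    resp.raise_for_status()
    with open(f"cache/{title}.html", "w", encoding="utf-8") as f:   # cache after first fetch
        f.write(resp.text)
    time.sleep(1)   # one request per second: sequential and low-speed
```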
Having been on the other side of this, I’d rather we encourage people to make use of formats/interfaces designed for machines and use the right tool for the job instead of scraping everything.
It’s incredibly easy for careless scrapers to disrupt a site and cost real money without having a clue what they’re doing.
I want people to think twice and consider what they are doing before they scrape a site.
> [Disrupt] the services by placing an undue burden on a Project website or the networks or servers connected with a Project website;
Two things:
1) The English Wikipedia *alone* gets 250 million page views per day! So you would have to be doing an awful lot to cause "undue burden".
2) The Wikipedia robots.txt page openly implies that crawling (and therefore scraping) is acceptable *as long as* you do it in a rate-limited fashion, e.g.:
>Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please.
> There are a lot of pages on this site, and there are some misbehaved spiders out there that go _way_ too fast.
>Sorry, wget in its recursive mode is a frequent problem. Please read the man page and use it properly; there is a --wait option you can use to set the delay between hits, for instance.
1. You'd be surprised what kind of traffic scrapers can generate. I've seen scraping companies employing botnets to get around rate limiting that could easily cost enough in extra server fees to cause an "undue burden".
At a previous company we had the exact problem that we published all of our content as machine readable xml, but we had scrapers costing us money by insisting on using our search interface to access our content.
2. No one is going to jail for scraping a few thousand or even a few million pages, but just because low speed web crawlers are allowed to index the site, doesn't mean scraping for every possible use is permitted.
"Who's gonna stop me" is kind of a crappy attitude to take with a cooperative project like Wikipedia.
I mean, sure, you can do a lot of things you shouldn't with freely available services. There's even an economics term that describes this: the Tragedy of the Commons.
Individual fish poachers' hauls are also, individually, de-minimis.
>The graph was interesting but this wasn’t the primary objective of this exercise. I wanted to write “He is the n-times great-father of his current successor Queen Elizabeth.” on King Alfred’s Wikipedia page.
Wouldn’t that contravene Wikipedia’s rules on original research?
It's not original research, more like synthesis. Reading the policy he's probably OK under the routine calculation exception.
The edit got reverted because there were no clear criteria for mentioning Queen Elizabeth II but not the other descendants. If he made an infographic or infobox with clear inclusion criteria and stuck it in the Alfred article, it would probably stick.
I also think it's of questionable relevance to the topic at hand. Just because a fact about something is true doesn't mean it should be in the article on the topic.
Has anyone found an easy way to expand their templates without using their whole stack? I tried getting Lua templates working from Python but didn't get very far...
Parsoid[1] is what you'd want for that, most likely. It's the new wikitext parser that MediaWiki is gradually switching over to, but it has the virtue of being usable entirely outside of MediaWiki if you need to.
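If you just need templates fully expanded and don't mind not running anything locally, one practical shortcut is fetching the Parsoid-rendered HTML from Wikimedia's public REST API instead of expanding wikitext yourself (this sidesteps the Lua problem entirely; it is not a local Parsoid setup). A sketch, with a placeholder User-Agent contact:

```python
import requests

title = "Alfred_the_Great"
resp = requests.get(
    f"https://en.wikipedia.org/api/rest_v1/page/html/{title}",
    headers={"User-Agent": "template-expand-demo/0.1 (contact@example.com)"},  # placeholder
    timeout=30,
)
resp.raise_for_status()
html = resp.text            # page HTML with templates already expanded
print(html[:200])
```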
Good question. As a data hoarder, mining has always been fun, but I was interested in the profit part. I've yet to find any monetary value in the troves of data I accumulate.
In an act of divine justice, my website is down.
https://web.archive.org/web/20210711201037/https://billpg.co...
(I'll send you a donation. Thank you!)