Data-Mining Wikipedia for Fun and Profit (billpg.com)
207 points by billpg on Aug 19, 2021 | 95 comments



(Author here)

In an act of divine justice, my website is down.

https://web.archive.org/web/20210711201037/https://billpg.co...

(I'll send you a donation. Thank you!)


Lol, I love that it was edited out right away. I probably agree with that decision, but this was a really cool project. I follow a few YouTube channels that make graphic visualizations of family generations, and I think they would really appreciate it if you shared this tech with them.


For anyone who is interested: I used to work with a guy named Richard Wang, who indexed Wikipedia as his training data set in order to do named entity recognition. He'd be a good person to talk to for anyone pursuing this.

Here's a demo: https://www.youtube.com/watch?v=SyhaxCjrZFw


I have basically English heritage. If I counted unrelated ancestors back through 45 generations, I reckon I should have had 35,184,372,088,832 (2^45) of them! Since the population of England back in the 800s was probably a few million, what are the chances I am NOT a descendant of King Alfred?


I found that work very interesting, so I downloaded the data published by the OP (thanks for publishing it!) and compared it with Wikidata. The results are here:

http://simia.net/wiki/Main_Page


Loosely related: I had some fun working on a database of all events parsed from Wikipedia [0]. I parsed the full wiki text using a language model, extracting date/time/location and what happened then and there. The location data was then fed into a local OpenStreetMap instance to get the coordinates. I then built a front end to allow querying for events nearby, in time or in space. The UI is a bit clunky because the database is too large and I had little experience working with databases, but building it was fun.

0: https://whataday.info/


For this particular problem I wonder if wikidata would be better instead of scraping the HTML.


Yes:

* http://www.entitree.com/en/family_tree/Elizabeth_II

* https://family.toolforge.org/ancestors.php?q=Q187114

Tools found on this page: https://www.wikidata.org/wiki/Wikidata:Tools/Visualize_data/...

---

Some SPARQL queries: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/...

---

Off topic: I wish Wikipedia would provide an API to get the infoboxes (those made using Lua or Wikidata).


There is a tool to get infobox data from Wikipedia into Wikidata: https://pltools.toolforge.org/harvesttemplates/

The easily parsable Infobox data can probably already be found in Wikidata (assuming there is a property).


So am I right in thinking Wikidata is sort of scraped from Wikipedia internally?

The other way round seems better, but obviously too late.


A lot of Wikidata items don't have a Wikipedia page, but all Wikipedia, Wikivoyage, wiki* pages have a Wikidata item (see the "Wikidata item" link on the left side of the pages).

Some Wikipedia infoboxes are based on Wikidata. I can't find an example, but here are some links:

* https://commons.wikimedia.org/wiki/Commons:Wikidata_infobox_...

* https://commons.wikimedia.org/wiki/Template:Wikidata_Infobox

* https://en.wikipedia.org/wiki/Template:Infobox_person/Wikida...

There are lexemes too, and they are not based on Wiktionary. Search with the prefix "L:" (without quotes).

Example:

* https://www.wikidata.org/w/index.php?search=L%3Acat&search=L...

* https://www.wikidata.org/wiki/Lexeme:L7

Also, there are a lot of tools on toolforge.org. One is Reasonator, which produces sentences from a Wikidata item: https://reasonator.toolforge.org/?q=Q1339


(Author here.)

Perhaps, but I already know how to scrape HTML and I know the data I wanted to pull out was in there. I have no idea how to query wikidata and it could have ended up being a blind alley.

Also, it was only my reading your comment just now that told me wikidata was even a thing.


Don't worry about the haters. You needed a paltry amount of data and you got it with the tools you had and knew.

When I was analyzing Wikipedia about 10 years ago for fun and, later, actual profit, I did the responsible thing and downloaded one of their megadumps, because I needed every English page. That's what people here are concerned about, but it doesn't matter for your use case.


> Don't worry about the haters

To be fair, the original comment just made a valid observation in a casual way; they didn't criticize the OP's approach, nor were they impolite.

But I know it's pretty common to see haters nitpicking things all around ;)


Generally Wikidata would definitely be the way to go here, though I just now tried to retrace your graph in Wikidata and it seems to be missing at least one relation (Ada of Holland has no children listed -- https://www.wikidata.org/wiki/Q16156475).
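(For illustration, a minimal Python sketch of that check against the public SPARQL endpoint, not the parent commenter's code. It assumes the standard property P40 "child" and simply lists whatever children are recorded for Q16156475; an empty result is exactly the missing relation described above.)

  # Minimal sketch: ask the public Wikidata SPARQL endpoint which children
  # (P40) are recorded for Ada of Holland (Q16156475).
  import requests

  QUERY = """
  SELECT ?child ?childLabel WHERE {
    wd:Q16156475 wdt:P40 ?child .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  """

  resp = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": QUERY, "format": "json"},
      headers={"User-Agent": "family-tree-check/0.1 (contact: you@example.com)"},
  )
  resp.raise_for_status()
  rows = resp.json()["results"]["bindings"]
  print(len(rows), "children recorded")  # 0 = the missing relation
  for row in rows:
      print(row["childLabel"]["value"])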


I am doubtful. I tried for a long time to use it to get data for my taxonomic graph project (https://relatedhow.kodare.com/) and SPARQL was just not usable at all. The biggest problem was the 60s time limit, which was totally unworkable for what I wanted. I also had issues with seemingly inconsistent results, but it was hard to tell.

I ended up loading the full nightly db dump and filtering it streaming from the zip instead. Faster and it actually worked.

The code to do that is at https://github.com/boxed/relatedhow
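(For anyone curious what "filtering it streaming" can look like: a rough Python sketch, not the linked repo's code, assuming the latest-all.json.bz2 dump layout where the file is one big JSON array with one entity per line. Q16521 is assumed to be the "taxon" item.)

  # Rough sketch of streaming the Wikidata JSON dump without unpacking it.
  # Assumes the latest-all.json.bz2 layout: a JSON array, one entity per
  # line, each line ending in "," except the last.
  import bz2
  import json

  def stream_entities(path):
      with bz2.open(path, "rt", encoding="utf-8") as fh:
          for line in fh:
              line = line.strip().rstrip(",")
              if line in ("[", "]", ""):
                  continue  # skip the array brackets
              yield json.loads(line)

  def is_taxon(entity):
      # P31 = "instance of"; Q16521 is assumed to be "taxon"
      for claim in entity.get("claims", {}).get("P31", []):
          value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
          if isinstance(value, dict) and value.get("id") == "Q16521":
              return True
      return False

  taxa = (e["id"] for e in stream_entities("latest-all.json.bz2") if is_taxon(e))
  print(sum(1 for _ in taxa), "taxa found")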


There's an alternate Wikidata query engine available here: https://qlever.cs.uni-freiburg.de/wikidata (from https://github.com/ad-freiburg/QLever)

Currently it doesn't support some SPARQL features, but I've found it to generally be quite a bit faster for most queries.


Hard to use it if you don't know about it!

I only thought of it myself because you mentioned the problem with deducing which parent is the mother and which is the father, and I remember in wikidata those are separate fields.


Yup, I was in the same situation some months ago, even though I knew Wikidata was a thing.

I know javascript and had the pages at hand.

I looked at Wikidata and some pages about it, but still had no clear idea how to use it and no motivation to dig into it, because JS just worked with a small custom script to retrieve some pages and data.


earth calling ivory tower, earth calling ivory tower

superior RDF triples are like martian language to millions of humans

over


The ivory tower is working on it: https://github.com/w3c/EasierRDF


RDF has to be the best and saddest example of the sunk cost fallacy. Instead of redirecting their efforts to a more general graph model that has actual hype and use by developers, its cultists are doubling down on their abstruse technology stack, making it ever more complicated while still not addressing any of its fundamental problems.


I mean, imho RDF isn't the problem. RDF itself is very simple. As you correctly point out, the stack is overcomplicated.

> Instead of redirecting their efforts to a more general graph model which has actual hype and use by developers

neo4j is basically this. You can also load RDF into neo4j using neosemantics and query it using Cypher instead of using a conventional triplestore with SPARQL, which is nice.


Let's see your proposal for the superior model, then.

RDF was designed primarily for data interchange and there's nothing that beats it at that.


Nothing beats it for data exchange? You must be joking, because if that were remotely true, RDF would be in wide use, which it totally is not. Except for a few niche domains like bioinformatics, it is not used. No killer application uses it as a data format, and no popular data format is based on it either. Actually, I can think of only a single data format based on RDF, and the only open data I know of that uses it was converted to it and was simpler to use in its original format.

And for the model: property graphs. But yeah, enjoy your Stockholm syndrome with a model where reification is required to annotate an edge. Also, even your nickname is an acknowledgment of RDF's failure: named graphs (N-Quads) were created because RDF triples aren't good enough for modeling data.


Yes, let us see how you do data interchange without global identifiers. Such as URIs, which RDF has built-in natively and property graphs do not.

You're right about bioinformatics, but let's do a quick check on http://sparql.club/ to see who else is looking for RDF/SPARQL specialists. Oh look: automotive industry, finance, publishing, medical, research, etc.


I just looked at the linked repo. Have they made any progress? It looks like _very_ early days.


Are you saying parsing HTML is easier than parsing RDF triples? Because I don't know about that.

In real life you use tools for both.


Triples are already structured, machine-readable data. HTML is not.


It’s so sad that almost nobody knows or uses SPARQL…


In my experience SPARQL is really hard to use, and Wikidata data quality is really low, to the point that one of my larger projects is just trying to filter the data to make it usable for my use case.

Yes, I made some improvements ( https://www.wikidata.org/wiki/Special:Contributions/Mateusz_... ).

But overall I would not encourage using it; if I had known how much work it takes to get usable data, I would not have bothered with it.

Queries as simple as "is this entry describing an event, a bridge, or neither?" require extreme effort to get right in a reliable way, including maintaining a private list of patches and exemptions.

And bots creating millions of known duplicate entries and expecting people to resolve them manually is quite discouraging. Creating Wikidata entries for Cebuano Wikipedia 'articles' was accepted, despite the Cebuano botpedia being almost completely bot-generated.

And that is without getting into the unclear legal status. Yes, they can legally import databases covered by database rights, but then they should either make clear that Wikidata is a legal quagmire in the EU or forbid such imports. The Wikidata community did neither.


Who knew that a global machine-readable knowledge base would involve some complexity?


"is this entry describing (a) bridge (b) event" should have some reasonable way to answer.

So far I have not found a way to achieve this without laboriously maintaining my own database of errata, and new exceptions keep appearing.
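(For context, the "textbook" version of that classification is a one-line SPARQL ASK through instance-of/subclass-of. A hedged Python sketch follows; Q12280 is assumed to be the "bridge" item. The complaint above is precisely that missing or wrong P31 statements and duplicate items make this unreliable in practice.)

  # Sketch: the "clean data" way to ask "is this item a bridge?", following
  # instance-of (P31) through the subclass-of (P279) hierarchy.
  # Q12280 ("bridge") is an assumption; substitute your own class item.
  import requests

  ASK = "ASK { wd:%s wdt:P31/wdt:P279* wd:Q12280 . }"

  def looks_like_bridge(qid):
      resp = requests.get(
          "https://query.wikidata.org/sparql",
          params={"query": ASK % qid, "format": "json"},
          headers={"User-Agent": "classification-check/0.1 (contact: you@example.com)"},
      )
      resp.raise_for_status()
      return resp.json()["boolean"]

  print(looks_like_bridge("Q16156475"))  # Ada of Holland: expect False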


Because the syntax is relatively complex and it is difficult to judge which endpoints and definitions to use.


I learned SPARQL recently, and I would agree it's complicated to get info out of Wikidata.

However, having read the article, they didn't have an easy time scraping Wikipedia either.

So I'd probably still recommend people look into wikidata and SPARQL if they want to do this kind of thing.

There are a few tools that generate queries for you, and some CLI tools as well:

https://github.com/maxlath/wikibase-cli#readme

It makes Wikipedia better too, in a virtuous cycle, with some infoboxes, like those he scraped, being converted to be automatically populated from Wikidata.


At least the last times I checked, the WikiData SPARQL server was extremely slow, frequently timing out.


There's some mix between "it's slow" and "it sets its timeout threshold too low": a lot of queries would be OK if they just had a bit more time to run. And unfortunately, the time wasted on the badput of the killed queries just slows down everyone else. (They really need a batch queue.)

The Wikidata folks are well aware of the limits on their SPARQL service. They just posted an update the other day:

https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.w...


Seems to depend on the query. I can issue straightforward queries that visit a few hundred thousand triples easily, but when I write a query that visits tens of millions of triples it times out.


Why should more people know SPARQL?


E.g. to get a job at a FAANG, in finance, or in pharma, where SPARQL is used extensively on enterprise knowledge graphs? Check the jobs here: http://sparql.club/


The few openings I flipped through all mention SPARQL in an offhand manner, in the sense of "familiarity with query languages and data ontology".


Definitely. There's even a public query service ( https://query.wikidata.org/ ) which can do a lot of this (though SQL is not good with searching for chains).


Note: that's the SPARQL endpoint. It's much better at searching chains than SQL (but it's easy to write slow queries that time out).

The SQL endpoint is at https://quarry.wmflabs.org/; however, it doesn't have the actual data so much as metadata (mostly), so it's not super useful.
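(On the "chains" point: SPARQL property paths make the ancestor chain a one-liner, which is the recursive part plain SQL has no natural syntax for. A minimal sketch, assuming Q9682 is Elizabeth II and P22/P25 are the father/mother properties mentioned elsewhere in the thread.)

  # Sketch: walk the father/mother (P22/P25) chain upward from Elizabeth II
  # (Q9682) with a property path. Deep chains can still hit the 60s timeout.
  import requests

  QUERY = """
  SELECT DISTINCT ?ancestor ?ancestorLabel WHERE {
    wd:Q9682 (wdt:P22|wdt:P25)+ ?ancestor .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  LIMIT 500
  """

  resp = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": QUERY, "format": "json"},
      headers={"User-Agent": "ancestor-chain-demo/0.1 (contact: you@example.com)"},
  )
  resp.raise_for_status()
  for row in resp.json()["results"]["bindings"]:
      print(row["ancestorLabel"]["value"])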


Like people in the other comments have said, if you've actually tried getting data out of Wikidata/Wikipedia, you very quickly learn that the HTML is much easier to parse than the results Wikidata gives you.


+1 use WikiData’s SPARQL endpoint.

Still, it was a cool article and a good example of scraping information.


Please don't scrape raw HTML from Wikipedia. They do a lot of work to make their content accessible in so many machine-readable formats, from the raw XML dumps (https://dumps.wikimedia.org) to the fully-featured API with a nice sandbox (https://en.wikipedia.org/wiki/Special:ApiSandbox) and Wikidata (https://wikidata.org).


The infoboxes, which is what this guy is scraping, are much easier to scrape from the HTML than from the XML dumps.

The reason is that the dumps just have pointers to templates, and you need to understand quite a bit about Wikipedia's bespoke rendering system to know how to fully realize them (or use a constantly-evolving library like wtf_wikipedia [1] to parse them).

The rendered HTML, on the other hand, is designed for humans, and so what you see is what you get.

[1]: https://github.com/spencermountain/wtf_wikipedia
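(For comparison, the rendered-HTML route really is short. A rough Python/BeautifulSoup sketch; the article's author used HtmlAgilityPack in C#, and the "infobox" class name is a convention rather than a stable interface.)

  # Rough sketch of pulling an infobox out of the rendered HTML as key/value
  # pairs. Not the author's code (he used HtmlAgilityPack in C#).
  import requests
  from bs4 import BeautifulSoup

  def fetch_infobox(title):
      html = requests.get(
          "https://en.wikipedia.org/wiki/" + title,
          headers={"User-Agent": "infobox-demo/0.1 (contact: you@example.com)"},
      ).text
      box = BeautifulSoup(html, "html.parser").find("table", class_="infobox")
      pairs = {}
      for row in (box.find_all("tr") if box else []):
          key, value = row.find("th"), row.find("td")
          if key and value:
              pairs[key.get_text(" ", strip=True)] = value.get_text(" ", strip=True)
      return pairs

  print(fetch_infobox("Alfred_the_Great").get("Father"))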


Still, I guess you could get the dumps, do a local MediaWiki setup based on them, and then crawl that instead?


You could, and if he was doing this on the entire corpus that'd be the responsible thing to do.

But, his project really was very reasonable:

- it fetched ~2,400 pages

- he cached them after first fetch

- Wikipedia aggressively caches anonymous page views (eg the Queen Elizabeth page has a cache age of 82,000 seconds)

English Wikipedia does about 250,000,000 pageviews/day. This guy's use was 0.001% of traffic on that day.

I get the slippery slope arguments, but to me, it just doesn't apply. As someone who has donated $1,000 to Wikipedia in the past, I'm totally happy to have those funds spent supporting use cases like this, rather than demanding that people who want to benefit from Wikipedia be able to set up a MySQL server, spend hours doing the import, install and configure a PHP server, etc, etc.


> This guy's use was 0.001% of traffic on that day

For one person consuming from one of the most popular sites on the web, that really reads as big.


He was probably one of the biggest users that day, so that makes sense.

The 2,400 pages, assuming a 50 KB average gzipped size, equate to 120 MB of transfer. I'm assuming CPU usage is negligible due to CDN caching, and so bandwidth is the main cost. 120 MB is orders of magnitude less transfer than the 18.5 GB dump.

Instead of the dumps, he could have used the API -- but would that have significantly changed the costs to the Wikimedia foundation? I think probably not. In my experience, the happy path (serving anonymous HTML) is going to be aggressively optimized for costs, eg caching, CDNs, negotiated bandwidth discounts.

If we accept that these kinds of projects are permissible (which no one seems to be debating, just the manner in which he did the project!), I think the way this guy went about doing it was not actually as bad as people are making it out to be.


I don't think I agree. Cache has a cost too.

In theory, you'd want to cache more popular pages and let the rarely visited ones go through the uncached flow.

Crawling isn't user-behavior, so the odds are that a large percentage of the crawled pages were not cached.


That's true. On the other hand, pages with infoboxes are likely well-linked and will end up in the cache either due to legitimate popularity or due to crawler visits.

Checking a random sample of 50 pages from this guy's dataset, 70% of them were cached.


Note: there are several levels of caching at Wikipedia. Even if those pages aren't in the CDN (Varnish) cache, they may be in the parser cache (an application-level cache of most of the page).

This amount of activity really isn't something to worry about, especially when taking the fast path of logged out user viewing a likely to be cached page.


> The infoboxes, which is what this guy is scraping, are much easier to scrape from the HTML than from the XML dumps.

How is it possible that "give me all the infoboxes, please" is more than a single query, download, or even URL at this point?


The problem lies in parsing them.

Look at the template for a subway line infobox, for example. https://en.wikipedia.org/wiki/Template:Bakerloo_line_RDT

It's a whole little clever language (https://en.wikipedia.org/wiki/Wikipedia:Route_diagram_templa...) for making complex diagrams out of rather simple pictograms (https://commons.wikimedia.org/wiki/Template:Bsicon).


Oh wow.

But every other infobox I've seen has key/value pairs where the key was always a string.

So what's the spec for an info box? Is it simply to have a starting `<table class="hello_i_am_infobox">` and an ending `</table>`?


En Wikipedia has some standards. Generally, though, they are user-created tables and it's up to the users to make them consistent (if they so desire). En Wikipedia generally does, but it's not exactly a hard guarantee.

If you want machine-readable data, use Wikidata (if you hate RDF, you can still scrape the HTML preview of the data).


Even just being able to download a tarball of the HTML of the infoboxes would be really powerful, setting aside the difficulty of, say, translating them into a consistent JSON format.

That plus a few other key things (categories, opening paragraph, redirects, pageview data) enable a lot of powerful analysis.

That actually might be kind of a neat thing to publish. Hmmmm.


Better yet-- what is the set of wikipedia articles which have an info box that cannot be sensibly interpreted as key/value pairs where the key is a simple string?


The infoboxes aren't standardized at all. The HTML they generate is.


Hehe-- I am going to rankly speculate that nearly all of them follow an obvious standard of key/value pairs where the key is a string. And then there are like two or three subcultures on Wikipedia that put rando stuff in there and would troll to the death before being forced to change their infobox class to "rando_box" or whatever negligible effort it would take them if a standard were to be enforced.

Am I anywhere close to being correct?


I think you'll have to more clearly define what you mean by "key-value" pairs.


Genuine question from a non-programmer: why? Is it because the volume of requests increases load on the servers/costs?


That's part of it, but also it's typically much more difficult and there's an element of "why are you making this so much harder on yourself".


(Author of original article here.)

That's the great thing about HtmlAgilityPack: extracting data from HTML is really easy. I might even say easier than if I had the page in some table-based data system.


The HTML is more volatile and subject to change than other sources though


I don't remember the last time Wikipedia changed the infobox markup, though.


You could make it even harder: use Puppeteer to take screenshots, then pass them to an OCR engine to get the text.



Unlike APIs, html class/tag names or whatever provide no stability guarantees. The site owner can break your parser whenever they want for any reason. They can do that with an API, but usually won't since some guarantee of stability is the point of an API.


True, but the analysis was done on files downloaded over the span of two or three days. If someone had decided to change the CSS class of an infobox during that time, I'd have noticed, investigated and adjusted my code appropriately.


"html class/tag names or whatever provide no stability guarantees"

Not quite. Many Wikipedia infoboxes (and some other templates) use standardised class names from microformats such as hCard:

https://en.wikipedia.org/wiki/Wikipedia:Microformats


Scraping, especially on a large scale, can put a noticeable strain on servers.

Bulk downloads (database dumps) are much cheaper to serve for someone crawling millions of pages.

It gets even more significant if generating the reply is resource-intensive (not sure whether Wikipedia qualifies, but complex templates may cause this).


How does scraping raw HTML from Wikipedia hurt them? I'd think they could serve the HTML from cache more likely than the API call.


IANAL but since the pages are published under Creative Commons Attribution-ShareAlike, if someone wishes to collect the text on the basis of the HTML version then there's not much you can do about it.

Wikimedia no doubt have caching, CDNs and all that jazz in place so the likely impact on infrastructure is probably de-minimis in the grand scheme of things (the thousands or millions of humans who visit the site every second).


>IANAL but since the pages are published under Creative Commons Attribution-ShareAlike, if someone wishes to collect the text on the basis of the HTML version then there's not much you can do about it.

They said please don't, not don't do it or they'll sue you.

But content license and site terms of use are different things.

From their terms of use you aren’t allowed to

> [Disrupt] the services by placing an undue burden on a Project website or the networks or servers connected with a Project website;

Wikipedia is also well within their rights to implement scraping countermeasures.


Yes, but they aren't going to care about just 2,400 pages.

As a general rule, make your scraper non-parallel, and set a user-agent that has contact details in the event of an issue, and you're probably all good.

After all, Wikipedia is meant to be used. Don't be unduly disruptive, don't scrape 20 million pages, but scraping a couple thousand is totally acceptable.

Source: used to work for wikimedia, albeit not in the sre dept. My opinions are of course totally my own.
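(In code, that advice amounts to something like this minimal sketch: sequential requests, a delay between hits, and a descriptive user-agent with contact details. The project name and addresses are placeholders.)

  # Minimal "polite scraper" sketch per the advice above.
  import time
  import requests

  session = requests.Session()
  session.headers["User-Agent"] = (
      "my-family-tree-project/0.1 (https://example.com; you@example.com)"
  )

  def fetch(titles, delay=1.0):
      for title in titles:
          resp = session.get("https://en.wikipedia.org/wiki/" + title)
          resp.raise_for_status()
          yield title, resp.text
          time.sleep(delay)  # non-parallel and rate-limited

  for title, html in fetch(["Alfred_the_Great", "Elizabeth_II"]):
      print(title, len(html))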


I don’t think the op was talking specifically to the content author, but to all the people who read the article and get the idea to scrape Wikipedia.


Honestly, I'd rather people err on the side of scraping Wikipedia too much than live in fear of being disruptive and not do cool things as a result. Wikipedia is meant to be used to spread knowledge. That includes data-mining projects such as the one in this blog.

(Before anyone takes this out of context: no, I'm not saying it's OK to be intentionally disruptive, or to do things without exercising any care at all. Also, always set a unique, descriptive user-agent with an email address if you're doing anything automated on Wikipedia.)


Having been on the other side of this, I’d rather we encourage people to make use of formats/interfaces designed for machines and use the right tool for the job instead of scraping everything.

It’s incredibly easy for careless scrapers to disrupt a site and cost real money without having a clue what they’re doing.

I want people to think twice and consider what they are doing before they scrape a site.


> [Disrupt] the services by placing an undue burden on a Project website or the networks or servers connected with a Project website;

Two things:

  1) The English wikipedia *alone* gets 250 million page views per day !  So you would have to be doing an awful lot to cause "undue burden".

  2) The Wikipedia robots.txt page openly implies that crawling (and therefore scraping) is acceptable *as long as* you do it in a rate-limited fashion, e.g.:

  >Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please.

  > There are a lot of pages on this site, and there are some misbehaved spiders out there that go _way_ too fast.

  >Sorry, wget in its recursive mode is a frequent problem. Please read the man page and use it properly; there is a --wait option you can use to set the delay between hits, for instance.


1. You'd be surprised what kind of traffic scrapers can generate. I've seen scraping companies employing botnets to get around rate limiting that could easily cost enough in extra server fees to cause an "undue burden".

At a previous company we had the exact problem that we published all of our content as machine readable xml, but we had scrapers costing us money by insisting on using our search interface to access our content.

2. No one is going to jail for scraping a few thousand or even a few million pages, but just because low speed web crawlers are allowed to index the site, doesn't mean scraping for every possible use is permitted.


"Who's gonna stop me" is kind of a crappy attitude to take with a cooperative project like Wikipedia.

I mean, sure, you can do a lot of things you shouldn't with freely available services. There's even an economics term that describes this: the Tragedy of the Commons.

Individual fish poachers' hauls are also, individually, de-minimis.


>The graph was interesting but this wasn’t the primary objective of this exercise. I wanted to write “He is the n-times great-father of his current successor Queen Elizabeth.” on King Alfred’s Wikipedia page.

Wouldn’t that contravene Wikipedia’s rules on original research?


It's not original research, more like synthesis. Reading the policy he's probably OK under the routine calculation exception.

The edit got reverted because there was no clear criterion for mentioning Queen Elizabeth II but not the other descendants. If he made an infographic or infobox with clear inclusion criteria and stuck it in the Alfred article, it would probably stick.


I also think it's of questionable relevance to the topic at hand. Just because a fact about something is true doesn't mean it should be in the article on the topic.

But yes, citing your own blog isn't a valid source, and neither is citing wikipedia - https://en.wikipedia.org/wiki/Wikipedia:Verifiability#Self-p...


Has anyone found an easy way to expand their templates without using their whole stack? I tried getting Lua templates working from Python but didn't get very far...


Parsoid[1] is what you'd want for that, most likely. It's the new wikitext parser that MediaWiki is gradually switching over to, but it has the virtue of being usable entirely outside of MediaWiki if you need to.

[1]: https://www.mediawiki.org/wiki/Parsoid


Yes, but it implements Lua by calling into the old stack, which doesn't really solve the OP's problem.


Okay, yeah, I guess that Lua modules in particular are going to make everything fall down.
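(If "without their whole stack" can be relaxed to "without running their stack yourself", one pragmatic workaround is to let the live API do the expansion server-side, Lua modules included, via action=expandtemplates. A minimal sketch; {{convert}} is just an example of a Lua-backed template.)

  # Sketch: have Wikipedia's own servers expand a template (Lua included)
  # via action=expandtemplates, instead of reimplementing the renderer.
  import requests

  resp = requests.get(
      "https://en.wikipedia.org/w/api.php",
      params={
          "action": "expandtemplates",
          "text": "{{convert|100|km|mi}}",
          "prop": "wikitext",
          "format": "json",
      },
      headers={"User-Agent": "template-expansion-demo/0.1 (contact: you@example.com)"},
  )
  resp.raise_for_status()
  print(resp.json()["expandtemplates"]["wikitext"])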


Where's the profit?


For three minutes, something I wrote was cited on Wikipedia.


Ah yes, every student in my district had this distinction too, until they lifetime banned our egress IPs over it.

The person who did the revert seems like a bundle of joy, though. Congrats on your efforts.


Good question, as a data-hoarder mining has always been fun but I was interested in seeing the profit. I've yet to find any monetary value in the troves of data I accumulate.



