Download Entire Wikipedia for Offline Use With an HTML5 App (googlecode.com)
196 points by antimatter15 on Dec 30, 2011 | 50 comments



"nearly all of the textual content of the English Wikipedia" = "1GB"

I find that hard to believe. Other wiki readers' dumps are a multiple of that, e.g. aarddict for en is ~8GB.


You can find the current size here: http://en.wikipedia.org/wiki/Wikipedia:Database_download

Current uncompressed size of all English articles in XML: 31 GB. Compressed: 7.3 GB. That's without talk pages, user pages, revision history, etc.


It's most of the large articles. The entirety (stripped of citations and image metadata), compressed, is more like 4 GB. For comparison, everything (including citations) as a Wikimedia XML dump is about 7 GB bzipped and thirty-something GB uncompressed.


from the blog: >>> First of all, it compresses not the entirety, but rather the most popular subset of the English Wikipedia. Two dumps are distributed at the time of writing, the top 1000 articles and the top 300,000, requiring approximately 10MB and 1GB, respectively.


Actually the top 1337 and 314159 articles, respectively :).


Cheers! Then their pitch of "nearly all content" is grossly wrong. Meh.


Seriously mis-titled, since it's nowhere even close to the "Entire Wikipedia" – it's a tiny subset of the English-language Wikipedia from what I can tell.


You can switch to a larger subset in the settings, but it is still somewhat mistitled.


Nice job, this looks really useful - would certainly help for the times when I'm stuck with no internet access and need to look something up.

One minor niggle - when I changed the file I wanted to use in settings, there was no confirmation or notification to let me know it was downloading the new file. I ended up stopping the download, erasing the data and starting again, to be sure. It might be worth adding in a confirmation to let users know it was changed OK, and is being re-downloaded.


Can we use this technology for API documentation for various languages/frameworks?

I could definitely use a bit of a productivity boost (by turning off web access).


You can just use a website-mirroring tool like wget; they've been around for ages. I've done just that with plenty of reference websites.

Don't forget to set an acceptable delay to ensure you don't overload the servers, though. Mine usually run all night.
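
For example, an overnight mirror run might look roughly like this (the hostname is a placeholder; the flags are standard wget options and the delay value is just illustrative):

    wget --mirror --convert-links --page-requisites --no-parent \
         --wait=5 --random-wait http://docs.example.org/

--wait plus --random-wait keeps the request rate low enough that an all-night run doesn't hammer the server.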


Selenium's WebDriver is great for this. It has different implementations, so you can use it like wget (but more sophisticated), where it doesn't run an actual browser, or you can use an implementation that drives a real browser like Chrome or Firefox (good for debugging).


Also, there was a Wikipedia/git project a while back (offline editing). All of the revision history was dumped into git.

http://scytale.name/blog/2009/11/announcing-levitation

http://www.readwriteweb.com/hack/2011/08/gitjs-a-git-impleme...

Why does mediawiki have its own version control system, anyway?


Because Wikipedia started in January 2001, well before even Subversion was around.

Levitation looks really cool; I wonder if using gitwiki will be easier and more sustainable than MediaWiki.

And hey, Dan Lucraft who wrote git.js is still at Songkick, YCS07!


For the record, (Wikipedia says) the initial release of Subversion was Oct 20, 2000. It's still understandable that Wikipedia rolled its own.


Yup, but it became stable enough to host itself in Aug 2001, whereas I think it was January or February 2001 that the first few articles started trickling in.


Wikipedia does not want branching; it forces people to work together rather than go off working on various rewrites. Each article is version-controlled separately. The system does not handle tracking across merges and splits of articles, but many VCSes have the same problem.


This is cool, but the one thing I miss from all wikipedia dumps so far is images. It's essential for a lot of articles. Last time I checked, images were excluded from dumps because of license issues. "Fair use" in particular. How about a dump of just the images with fitting licenses? Does anyone here know why this is not available?


People don't understand licences. There are many images with incorrect licences. (There are bots that trawl the images to ask people to correct the licences; there have been megabyte-long flamewars about the operators of those bots and how unpopular image tagging is.)

There are just too many infringing images, even among those supposedly with the correct licence, for Wiki* to distribute and stay safe.


Random article results in a 404 maybe once in 4 times. Here is a suggestion for an improvement: a link for making 404 pages available offline, so if I go looking for a specific page that isn't offline, I can make it available and read it later.


An offline Wikitravel would be incredibly useful for travelers. I hadn't found one yet, so I built an offline Wikitravel for Android: https://market.android.com/details?id=com.heliod.eutravelgui...


I use the Wikireader (from OpenMoko) when traveling: http://en.wikipedia.org/wiki/Wikireader

I find it very useful, especially on the longer wall-socketless cycling trips.

You can stick both wikipedia and wiktionary on it. Quite possibly also Wikitravel, if they provide dumps.


Wikitravel is available -- wget http://wrmlbeta.s3.amazonaws.com and you'll see all the dumps they have available, e.g.:

http://wrmlbeta.s3.amazonaws.com/entrav-20111105.7z.001

In the age of nearly unlimited connectivity, I still find my Wikireader an invaluable device when traveling.


For Android there's also iTravelFree (https://market.android.com/details?id=com.rezendi.itravel).

But you're right, I'm looking forward to the offline HTML5 Wikitravel app. Now with antimatter15's script it's only a matter of time and goodwill :)


Unfortunately I can't see any formulas correctly and the tables are quirky. Example: http://offline-wiki.googlecode.com/git/app.html?Permeability...


I thought of using jsmath or mathjax but they're too big.


It says it was tested in Firefox 10, which is a little surprising since it doesn't work at all in Firefox 10. The IndexedDB spec changed and Firefox changed to align with the spec between 9 and 10, but the page uses the old API.


I tested it on an infrequently updated installation of Firefox Nightly, and the about page said Firefox 10. I didn't know the API changed, though; I'll look into it. How exactly did it change?


Instead of doing

    var request = mozIndexedDB.open("databasename");
    request.onsuccess = function(event) {
      request = event.target.result.setVersion(N);
      request.onsuccess = function(event) {
        // set up your database
      };
    };

it looks like

    var request = mozIndexedDB.open("databasename", N);
    request.onupgradeneeded = function(event) {
      // set up your database
    };
    request.onsuccess = function(event) {
      // do stuff with your database
    };

Feel free to email me at <my hacker news username>@mozilla.com if you need a more detailed explanation.


Does this app grab the files from Wikipedia directly? It doesn't seem very nice to create an app that pulls down gigabytes of data from a web service you do not own nor have permission from.

EDIT: It appears my concern was unwarranted.


WebRTC means peer-to-peer is probably coming to Chrome and Firefox soon, which will allow an app like this to transfer Wikipedia in all its 7.3GB (compressed) glory without harm to anyone's servers.

http://www.webrtc.org/faq#TOC-What-other-components-are-incl...


I'd be worried about security and data authenticity with tech like that.


You could just compute checksums of the chunks you get from peers, a la BitTorrent.
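
As a rough sketch (this uses the modern Web Crypto API, which 2011-era browsers didn't ship; back then you'd substitute a pure-JS hash library, and the expected hashes would need to come from a trusted manifest):

    // Verify a downloaded chunk against a known-good SHA-256 hash.
    // 'chunk' is an ArrayBuffer received from a peer; 'expectedHex' comes from
    // a manifest fetched over a trusted channel (e.g. HTTPS from the app's own host).
    async function verifyChunk(chunk, expectedHex) {
      var digest = await crypto.subtle.digest('SHA-256', chunk);
      var hex = Array.from(new Uint8Array(digest))
        .map(function (b) { return b.toString(16).padStart(2, '0'); })
        .join('');
      return hex === expectedHex;
    }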


Can someone explain why this concern is unwarranted?


The files are hosted by Google Code: http://code.google.com/p/offline-wiki/downloads/list


Wikipedia makes dumps of their data available for exactly this type of use.

Those dumps are taken and stored in other places.

There's some weird licensing (GFDL/CC BY-SA) stuff that people need to be wary of, I guess.


It doesn't


Very good then, nice little app.


Sorry for being obtuse, but I downloaded the 1 GB repository - where is it stored and how do I access it?

I see I can go to the index from this page - is this index served from the 1 GB download I just did?

How can I transfer this to [device]?


Thanks so much for this. This will be incredibly useful for me (behind the GFW, which gets moody about Wikipedia pretty often). Could this easily periodically update itself to grab fresh versions of articles? I think that would be a great feature, especially if you could do it without having to pull down the whole database each time you wanted to update, instead just updating on an article-by-article basis.


Absolutely amazing. This technology can be used for many other offline databases. He provides the tools for indexing, compressing and everything needed for the reader. Make sure to read his corresponding blog post: http://antimatter15.com/wp/2011/12/offline-wiki-redux/


Amazing project, just what I was searching for. A few recommendations:

Could you expand the available download options to include an option to download all of wikipedia, not just a subset of the most popular articles?

Right now, mathematical and other kinds of formulae aren't rendered correctly. Is there any way you could fix that?

An option to include pictures (maybe compressed or low-res versions) would be neat.

Thanks!


I remember when you announced this some months back - wasn't it a paid application? Great work either way.


Can this also be synced, or does one need to delete/re-download the whole dump?


Doesn't work for me with Firefox 10 or 11a2. It would be awesome if it could be made to work there!


Doesn't seem to work on the iPad


It only sort of works on iOS 5: the downloads stop whenever an "Increase Storage" prompt pops up, and you have to reload whenever that happens. But it does work with the small dump, albeit slowly.


Cool, I initially didn't think this much storage was possible on mobile yet. Are you saying you can get the whole thing down if you keep agreeing to the prompts?

It's a pity mobile browsers haven't got better support for this kind of thing yet.


No, I think it stops issuing prompts after 50GB. Also, iOS 5 only supports WebSQL, which (AFAIK) doesn't store objects like typed arrays, so I have to convert to a base64-encoded string and back, which makes it use even more space.
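
For anyone wondering what that round trip looks like, roughly (function names are mine, not from the app; base64 adds about 33% overhead, which is where the extra space goes):

    // Uint8Array -> base64 string (safe to store in a WebSQL TEXT column)
    function bytesToBase64(bytes) {
      var binary = '';
      for (var i = 0; i < bytes.length; i++) {
        binary += String.fromCharCode(bytes[i]);
      }
      return btoa(binary);
    }

    // base64 string -> Uint8Array
    function base64ToBytes(str) {
      var binary = atob(str);
      var bytes = new Uint8Array(binary.length);
      for (var i = 0; i < binary.length; i++) {
        bytes[i] = binary.charCodeAt(i);
      }
      return bytes;
    }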


very cool.



