Hacker News
Wikipedia and Internet Archive partner to fix 1M broken links on Wikipedia (wikimedia.org)
494 points by The_ed17 on Oct 26, 2016 | 101 comments



In the long term the internet archive will likely be the major supplier of references to Wikipedia. Webpages don't live forever, hopefully the internet archive does. It's an extremely valuable resource, the archive and wikipedia are amongst the most valuable digital assets we have.


I think a case could be made that the Archive should be the reference by default for citations -- that is, when a URL is cited, the Archive's snapshot API is triggered, and the archive URL is used as the direct link with the original URL being the "backup". Sites/pages change so much that it's more helpful to point the average user to what a page looked like when the reference was accessed and included. It may rob the original URL some traffic, but Wikipedia isn't meant to be a link-aggregator for external content.

The obvious roadblocks are:

- Websites, even content-focused ones, can be so complicated in terms of JS that the Archive might not capture it accurately.

- The Archive's policy is to honor robots.txt and other no-archive directives.
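To make the archive-first citation idea concrete, here is a minimal sketch. The URL layout is copied from the Wayback links posted elsewhere in this thread; the function names are made up for illustration, and it assumes the Wayback Machine resolves a timestamp to the nearest capture it actually holds.

```python
from datetime import datetime, timezone

# Prefix as it appears in the Wayback links quoted in this thread.
WAYBACK_PREFIX = "http://web.archive.org/web"


def wayback_url(original_url: str, accessed: datetime) -> str:
    """Build a Wayback URL for the moment a reference was accessed.

    `accessed` must be timezone-aware; it is converted to UTC and
    rendered as the 14-digit YYYYMMDDhhmmss stamp used in Wayback URLs.
    """
    stamp = accessed.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


def citation_links(original_url: str, accessed: datetime) -> tuple:
    """Hypothetical helper: archive link first, original URL as the backup."""
    return wayback_url(original_url, accessed), original_url


when = datetime(2016, 10, 26, 12, 0, 0, tzinfo=timezone.utc)
primary, backup = citation_links("http://example.com/page", when)
print(primary)
print(backup)
```

This only builds the link; it does not check that a capture exists or trigger a new one.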


> It may rob the original URL some traffic,

That is an advantage. It reduces extraneous incentives to post links, which should be only about the information they provide and not about page views. At wikipedia's scale, and sensitivity to information purity, that's relevant.

See also the rel="nofollow" decision from a few years back.


It'd be a real shame if publishers decide to opt out of the archive to reclaim "lost" pageviews.


It's probably good that it's possible to opt-out of being archived but what annoys me is a useful domain changing hands and the new owners deciding to wipe its history.


Surely that's not possible?? And there is no way to find the old archive, even under an alias?


It's not wiped (they don't delete anything) but if robots.txt forbids indexing then access to old archives is blocked. It can be quite aggravating, but at least the information is not lost---if robots.txt or archive.org's policy changes, it can be retrieved again.


> - The Archive's policy is to honor robots.txt and other no-archive directives.

What bothers me is that they do so retroactively based on the current robots.txt, not the one contemporary with the archived content. So if a domain parker takes over a domain, and their robots.txt excludes everyone from every page (or everyone but Google), then archive.org no longer provides its archive of the old content.


I'm not sure there's a good way to work around this - the alternative is to only respect the robots.txt as it was when the snapshot was taken, at which point once a confidential page is in the archive you can't (easily) remove it again.


Perhaps access to archived pages should only be blocked when the ia_archiver user agent specifically is denied in the robots.txt. That way they aren't inadvertently blocked by a generic robots.txt that denies everything (which sometimes occurs with parked domains), but there's still a way to deny the Wayback Machine if you really need to.
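A rough sketch of that rule, assuming a deliberately simplified robots.txt parser (it ignores Allow lines and wildcards, and the function names are hypothetical). This is not how the Wayback Machine actually works; it only illustrates "block when ia_archiver is named, not when only * is".

```python
def parse_groups(robots_txt: str) -> dict:
    """Map each lowercased user-agent to the Disallow prefixes of its group."""
    groups = {}
    current_agents = []
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("disallow", "allow"):
            in_rules = True
            # Simplification: Allow lines are ignored; an empty Disallow blocks nothing.
            if field == "disallow" and value:
                for agent in current_agents:
                    groups[agent].append(value)
    return groups


def wayback_blocked(robots_txt: str, path: str, agent: str = "ia_archiver") -> bool:
    """Block only when a group that names `agent` disallows `path`.

    A blanket "User-agent: *" group is deliberately not consulted.
    """
    prefixes = parse_groups(robots_txt).get(agent.lower(), [])
    return any(path.startswith(prefix) for prefix in prefixes)


parked = "User-agent: *\nDisallow: /\n"
explicit = "User-agent: ia_archiver\nDisallow: /private/\n"
print(wayback_blocked(parked, "/old-article.html"))
print(wayback_blocked(explicit, "/private/notes.html"))
```

With this rule the parked-domain file no longer hides old captures, while a site that names ia_archiver can still opt out.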


This, exactly. Or, even better, use a separate user-agent ("ia_retroactive" or similar) for retroactive removals.


Well, then everybody would claim the same right and you'd have to maintain a full list of bots and keep it up to date. This doesn't sound scalable; * should still mean "everybody".


robots.txt only tells bots not to crawl the website. It doesn't say anything about indexing or archival of pages.

It would be perfectly possible to have the Wayback Machine respect the robots.txt and not have it crawl or archive any new pages, whilst making pages that have already been archived accessible unless a specific user agent has been denied.


I would say that the default behavior should be to respect the robots.txt of the time of the snapshot and only revert archival of accidentally cached pages that were never intended to be public.

Sadly, it requires a bit of human intervention there.
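For the automated half of this, a small sketch of "use the robots.txt that was in force at capture time". It assumes the archive keeps dated copies of robots.txt (an assumption, not something confirmed in this thread) and uses 14-digit Wayback-style timestamps, which sort chronologically as plain strings. `robots_at` is a hypothetical name.

```python
from bisect import bisect_right


def robots_at(history, capture_time):
    """Return the robots.txt in force at `capture_time`, or None.

    `history` is a list of (timestamp, robots_txt) pairs sorted by
    timestamp ascending; each entry applies until the next one.
    """
    times = [stamp for stamp, _ in history]
    index = bisect_right(times, capture_time)
    return history[index - 1][1] if index else None


history = [
    ("20080101000000", "User-agent: *\nDisallow:\n"),    # original owner, open
    ("20150101000000", "User-agent: *\nDisallow: /\n"),  # parked domain, closed
]
print(robots_at(history, "20120508210658"))
```

A capture from 2012 is judged against the 2008 file here, so the later parked-domain rules do not apply to it. The manual review step for accidentally public pages is not covered.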


The internet archive wouldn't work if its crawler wasn't fully automated. You can't handle the whole internet in a way that requires human intervention.


The problem for manual deletion via human intervention is that if you do it outside of robots.txt you then need to ensure the identity of the owner, which makes it much more complicated and costly.


Maybe robots.txt could have clauses that say whether to apply entries retroactively. True, domain parkers could enable it, but I don't think most would, since it's extra work for no benefit - the point is usually not to erase history but to protect the current site.


Since the number of pages referenced from Wikipedia is limited, the archiving could also be done by other parties. Maybe just set up a Raspberry Pi with an xTB hard disk, download everything, and then distribute it over IPFS (or something else) as a backup in case Archive.org removes the pages.

To take into account Javascript etc, one could also capture a png snapshot of the page.


That seems like an awful bug, is this by design?!


Yes, it's by design. It's to stave off the inevitable flood of legal threats and takedown demands.

If someone requests content be taken off, they instruct them to update their robots.txt. The content is not removed but will not be shown through archive.org as long as the robots.txt exclusion is in place.

There was a court case where the plaintiff wanted to subpoena the Internet Archive for evidence (since the defendant had since blocked the content with robots.txt). They sent an expert to testify that complying with that kind of thing would be too much of a burden for them, and suggested that the court force the defendant to change their robots.txt. The court agreed.


By design from the last time I saw it discussed. The idea is that a change there could indicate that there was a mistake and the data shouldn't have been crawled for one reason or another. There's just no way to know in an automated fashion.


I'm inclined to say that robots.txt should be ignored altogether and removal requests should be handled per individual case. But I guess that goes too far...

Still, if the domain owner changes, they should not be able to remove content from old archives. That's like being able to remove stuff from encyclopedias about a palace somewhere, just because you live in the place where the palace once stood.

They are The Internet Archive after all, it's logical to archive contents like the domain owner and robots.txt for a given point in time. A change of owner can be easily detected.


> removal requests should be handled per individual case

Are you going to fund the internet archive to handle that workload?

> if the domain owner changes, they should not be able to remove content from old archives.

Why not? If that work belongs to anyone, it is the current domain owner. Why does the fact that the internet archive happened to crawl it mean that suddenly they lose control of their information?

This is elevating the Internet Archive from 'hey it's cool someone made a copy of that while it was up and no one cared' to 'because the Internet Archive crawled it the world has absolute rights to that information from now on, wishes of the owner be damned.'


> If that work belongs to anyone, it is the current domain owner.

Not at all. A domain parker taking over a domain does not imply they have any rights over all the content that the previous owner of the domain posted.


Outside of any other information, it does imply that.


> Are you going to fund the internet archive to handle that workload?

I figured someone would ask that. I don't have an immediate answer but it's a good question (upvote for that). My hopes are just that removal requests are not too frequent. But without current numbers (of number of pages hidden after-the-fact and current removal requests) this is guesswork.

They could charge for it perhaps? A dollar per request. Doesn't seem too unreasonable for something you mistakenly made available to the planet. It doesn't have to be per page, so if you made a million documents available all under example.com/hidden/ then hiding that folder is a simple action and costs just one dollar. You're paying them for their time.

In the Netherlands, if you want your personal information (e.g. phone number; email address) removed from a company's systems, you can request that and they must grant it if they have no reason to keep the data any longer. And you can make requests to see your data, etc. But the law allows for companies to charge for this and I've seen example amounts (I think around 3 euros) somewhere. It's a somewhat similar situation.

So I don't have a single good answer, but I think by-case is a better way to go (and worth thinking about, at least) than just using the current approach.


Wait, so every time someone makes a citation on wikipedia it causes Archive.org to archive that URL? This seems like it could be abused by a malicious agent, no?


Well, in my hypothetical world, the Wikipedia editor page would include a widget in which a user could submit a URL, and Wikipedia would generate the markup, including an archive URL. Deciding how to implement this so that it works efficiently will have some overlap with issues on that 2nd hardest computer science problem of cache invalidation.


Good idea, after that it's really just rate limiting and perhaps only looking up the same URL every X days. Perhaps a per domain limit as well. No problem they probably haven't already solved.
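Roughly what that could look like, as an in-memory sketch. The class name, the default limits, and the idea of counting per UTC day are all assumptions for illustration; a real deployment would need shared storage and cleanup of old counters.

```python
from urllib.parse import urlsplit

SECONDS_PER_DAY = 86400


class ArchiveThrottle:
    """Decide whether a cited URL should be sent for archiving right now."""

    def __init__(self, min_interval_days=30, per_domain_daily_limit=100):
        self.min_interval = min_interval_days * SECONDS_PER_DAY
        self.per_domain_daily_limit = per_domain_daily_limit
        self.last_archived = {}  # url -> unix time of last accepted request
        self.domain_counts = {}  # (domain, day number) -> accepted requests

    def should_archive(self, url, now):
        """Return True and record the request if both limits allow it."""
        last = self.last_archived.get(url)
        if last is not None and now - last < self.min_interval:
            return False  # same URL was archived too recently
        domain = urlsplit(url).hostname or ""
        key = (domain, int(now // SECONDS_PER_DAY))
        if self.domain_counts.get(key, 0) >= self.per_domain_daily_limit:
            return False  # this domain has used up today's budget
        self.last_archived[url] = now
        self.domain_counts[key] = self.domain_counts.get(key, 0) + 1
        return True


throttle = ArchiveThrottle(min_interval_days=7, per_domain_daily_limit=2)
start = 1_000_000
results = [
    throttle.should_archive("http://example.com/a", start),
    throttle.should_archive("http://example.com/a", start + 3600),
    throttle.should_archive("http://example.com/b", start),
    throttle.should_archive("http://example.com/c", start),
    throttle.should_archive("http://example.com/a", start + 8 * SECONDS_PER_DAY),
]
print(results)
```

Passing `now` in explicitly keeps the logic easy to test; in practice it would come from `time.time()`.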


There's already the citation tool in the default editor. Shouldn't be too hard to just fire off an AJAX request to the Archive when the citation is added, right?


Easy enough, but what do we call it?

The Wikiarchivator?


You can already just tell archive.org to archive a URL. How would this be different?


You could have a "reverse DDoS" where archive.org blocks requests from Wikipedia, removing the ability to add Wikipedia citations.


There's no such thing as a "reverse DDoS", that's just a regular denial of service attack.


A theoretical reverse DDoS: You somehow prevent traffic from reaching their load balancer so it automatically scales their AWS instances to zero.


Your minimum instances would always be 1 in the autoscaling group associated with your ELBs (or now, ALBs).

Regardless, neither the Internet Archive nor Wikimedia use AWS or other cloud providers, as it would be prohibitively expensive. They both run their own infrastructure/ops.


Mostly a joke, just trying to think of what that term would mean.


I think a reverse DoS attack would be one that actually increases the capacity of the service. I'm not sure what scenario would allow that to happen though.


Some streaming sites (e.g. 4 on Demand, IIRC) have a peer-to-peer element; you could reverse-DDOS those by running a lot of computers "seeding" (perhaps by leaving them on the page but at the end of the video) distributed all over the globe.


What I meant was:

1. Get Wikipedia to send lots of requests to Archive.

2. Archive blocks requests from Wikipedia.

3. Wikipedia citations are disabled.

In other words getting Wikipedia to DDoS Archive, so that Archive's defense hurts Wikipedia.

A very silly scenario of course, just coming up with one for why an attacker might want to indirectly DDoS Archive via Wikipedia.


What would probably happen between 1 and 2 is that archive.org notices a spike in traffic from Wikipedia, talks to Wikipedia engineers, they find out who is generating all these requests, and block them from editing Wikipedia.

It's certainly possible to generate enough junk edit traffic to cause disruption on Wikipedia, but that's nothing new. It's the nature of Wikipedia as a resource: it trusts the internet community to be good on average. So far it has worked.


That's still not a DDoS, as it's missing the Distributed element. Almost by definition, if you can block all requests at the source it's not a DDoS, it's just a regular old DoS.


I'll include the obligatory donation link below for what is effectively an internet utility. [0]

Also, take a look at downloading a copy of Wikipedia! You can get a full download with images for around 100GB (last I checked, which was about two years ago). It's great for if you ever think you'll need some technical info while away from live internet. I keep a copy on a hard drive that boots into linux, just in case (maybe I'll be on site with a customer and need some engineering notes - BAM taken care of!)

[0] https://archive.org/donate/

EDIT: I used XOWA, and I do not keep the wiki up to date, really. Note that the entire wiki history is huge, but a reasonably current snapshot is manageable (~100GB or so).

[~] http://xowa.org/home/wiki/Help/Download_XOWA.html


Actually I think 100G is not that much, if you consider what you get. Furthermore some time ago I estimated that you could store the entire world's tiles (raster data in 10m resolution) and OSM data on a SSD (<500G).

If you think about it, that might be a game changer. Maybe it might make sense to ship e.g. smart phones with wikipedia, OSM and satellite photo (e.g. Sentinel-2 is free data) on disk.

Granted, 10m resolution is not the state-of-the-art in aerial imagery - which is around ~0.1m (this means 10^4 times more data) - but it is reasonable to detect buildings and combined with OSM's vector data you have practically Google Maps in your pocket.
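The 10^4 figure follows from pixel counts scaling with the square of the linear resolution. A tiny check, assuming square pixels and ignoring compression (the function name is made up):

```python
def pixel_ratio(coarse_cm: int, fine_cm: int) -> int:
    """How many fine pixels cover the area of one coarse pixel."""
    per_side = coarse_cm // fine_cm  # fine pixels along one edge
    return per_side ** 2


# 10 m = 1000 cm per pixel versus 0.1 m = 10 cm per pixel
ratio = pixel_ratio(1000, 10)
print(ratio)
```

Integer centimetres are used so the division is exact.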


> If you think about it, that might be a game changer. Maybe it might make sense to ship e.g. smart phones with wikipedia, OSM and satellite photo (e.g. Sentinel-2 is free data) on disk.

I don't think this is a sensible use of local phone storage, but I do think you'll see a lot more P2P edge cache nodes if/when IPFS takes off (OSM tiles are already served on the IPFS network).

It would be a simple matter of picking a VPS provider or hardware colo provider near expected heavy use, launching IPFS, and having it pin the relevant content locally.


How is caching handled for those tiles in IPFS? At OSM.org they get rerendered when the underlying data changes.



If we're considering just the availability, Google Maps has an option to cache a region for offline use, which works spectacularly well - given that you get your antenna on a good signal every once in a while.


I mostly use this because of my limited data plan; it uses a lot less data if you already downloaded the area you are currently in beforehand, while you were still on wifi.


Thank you for posting the donation link. I have donated in the past and always think about donating when the Internet Archive is mentioned.

I just set up a monthly donation amount, and I encourage you to do the same if you believe in the Internet Archive's mission!


I have a pretty substantial amount of storage for things I have collected over the years. Podcasts, YouTube videos, GIFs, even stuff from back when Flash animations were popular. All organized and curated. Because it's private I don't really have to worry about copyright laws as well. 6 disks in a RAID 10 array with 2 hot spares. It's rather cumbersome now but hopefully the tech will get good enough to be able to maintain itself for a few hundred years before I die. I know some of that stuff will be lost to time, some of it you already can't find online. I call it the ark. I hope some day down the road I can blow some historian's mind.

Perhaps, long from now, they might quote internet comments from sites like HN like we quote the Greek philosophers. "cmdrfred the elder once said...". I should make a habit of ripping the front page or so and comments for the ark.


> Perhaps, long from now, they might quote internet comments from sites like HN like we quote the Greek philosophers. "cmdrfred the elder once said..."

Or Talmudic claims of authority in Judaism:

https://duckduckgo.com/?q=%22taught+in+the+name+of+rabbi%22&...

https://duckduckgo.com/?q=%22said+in+the+name+of+rabbi%22&ia...

Muslim ahadith also have the phenomenon of the chain of narration where people declare the provenance of the teaching:

https://en.wikipedia.org/wiki/Hadith_studies#Sanad_and_matn


I've been an editor on Wikipedia for years, and it's simply amazing how many web pages I referenced in 2008–09 have disappeared. Digital archivists have their hands full.


Here's another piece of irony for you.

There was an article by Brewster about preserving the Internet in Scientific American which put the average lifespan of a URL at around 40 days. Said article now 404s and the only way to get it is through the Wayback Machine. :)

http://web.archive.org/web/19970504212157/http://www.sciam.c...


Of course. The page is much older than 40 days...


A little anecdote that might be interesting here.

I worked at The Archive for a little while and one of the projects I worked on was to unpack about 300 TB of crawl data from the defunct search company cuil.com. It was mostly decent quality data in a standard format and after some grinding, the whole thing was converted into warc files which the wayback machine could use to show the URLs. The end result was that about 60 billion URLs came "back onto the web".

During the work, I was looking at the stones rather than cathedral but after I left and thought about it in detail, it was very satisfying. I was reading the book "A Canticle For Leibowitz" at the time and the general theme of cycles of history was in my head. That dovetailed very well with the work I had done.

If you're interested, you can download the dumps over here https://archive.org/details/cuilcrawl


Upvote for "A Canticle For Leibowitz" alone. Thanks for your service at the Internet Archive!


It was a privilege to work there and definitely the best period of my career. So many wonderful people and experiences in such a short time. I personally consider Brewster Kahle a severely underrated hero of our age. The Archive's work is extremely valuable but I think the value will be appreciated only by a future generation.

As for the book, I think if The Archive had a novel for a totem, it'd be "A Canticle For Leibowitz". Very much affected my world view when I read it and this thread has just kindled my interest again. :)


If you don't have a paperback copy of A Canticle, I'd be happy to send you mine!


That's very generous of you but I do have a treasured copy myself. Thanks! :)


Whenever I cite a webpage, I try to archive the page at that moment and include the archive link in the citation. It's a bit more effort than just `<ref>[url]</ref>`, but it really is necessary.
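For anyone wanting to script the extra effort, a minimal sketch that renders such a reference as wikitext. It assumes the `cite web` template with its `archive-url` and `archive-date` parameters; the helper name is hypothetical, and values containing a pipe character would need escaping that is not handled here.

```python
def archived_ref(url, title, archive_url, archive_date, access_date):
    """Render a <ref> that carries both the original and the archived URL."""
    fields = [
        ("url", url),
        ("title", title),
        ("access-date", access_date),
        ("archive-url", archive_url),
        ("archive-date", archive_date),
    ]
    body = " ".join(f"|{name}={value}" for name, value in fields)
    # Doubled braces in the f-string produce the literal {{ and }} of a template.
    return f"<ref>{{{{cite web {body}}}}}</ref>"


ref = archived_ref(
    url="http://example.com/",
    title="Example",
    archive_url="http://web.archive.org/web/20161026000000/http://example.com/",
    archive_date="2016-10-26",
    access_date="2016-10-26",
)
print(ref)
```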


I had to migrate my archived articles off Readability prior to the EOL at the end of September. I'd only used the service for a couple of years, and hit about a 5% bitrot rate, this on fairly significant articles.

One of the more curious cases was CSIRO (Australia's national science and research organisation) which seems to have not only deliberately purged a fair amount of data (Graham Turner's work specifically), but has a robots.txt in place which blocks archival by TIA. That strikes me as ... downright curious.


Which CSIRO site is that with the robots.txt? (it's not their main site).

I'm thinking that could be something to have an Opposition Senator bring up in Senate Estimates.


Could you reply with CSIRO links blocked by robots.txt? I run my own instance of ArchiveTeam's ArchiveBot (which archives links provided regardless of robots.txt), and would be happy to put the content into cold storage.


I'll need to check my Readability dump.


No rush; I'll bookmark this thread to check two weeks from now.


Found it. Apparently the robots.txt has been fixed:

404: http://www.csiro.au/en/Portals/Multimedia/CSIROpod/Growth-Li...

Now available: http://web.archive.org/web/20120508210658/http://www.csiro.a...

That's among the specific links which weren't being served by TIA earlier.


Awesome. I've emailed CSIRO to try to track down that podcast included in the article that was not archived.


Thanks, I really appreciate this.

dredmorbius@gmail.com if you happen to track that down.


Emailed. Also, archiving all of the current version of csiro.au, just in case.


It seems to be climate, CO2, and limits-related work that is most prone to being censored.


Not surprising based on the political climate in AU currently.


Wikipedia and The Internet Archive are examples to me of what makes the internet awe inspiring. (I can throw in Google search too, not Google Inc). Data (sometimes imperfect) is one query away on any internet connected device. Truly amazing in a way.


The archive.is guy provides mirrors of rotten links to Wikipedia also, although not as the result of any official agreement with Wikipedia, just on his own initiative, which I think was nice of him.

Encyclopedia Dramatica is generally not a reputable source of truth, being the site that it is, but while looking for some more information on archive.is mirroring of links from Wikipedia articles, I found an article on ED that I found interesting. It is heavily advocating one side of the story but at least it backs it up with some links, which is rather rare on ED (most links on ED usually go to other pages on ED in my experience).

https://encyclopediadramatica.se/Archive.is


It's amazing to me how Wikipedia ends up being a reasonably good website with such a cancerous community behind it.


Perhaps this is the genius of Wikipedia - it keeps many of those of a certain type of personality occupied amongst themselves while using the energy of their machinations to produce a product of wider social good.


You know ED heavily dramatizes stuff right? (It's in the name) They also have a huge hate boner for Wikipedia in my experience. Anyway, I think it's unfair to call the community cancerous when in this case archive.is was spamming Wikipedia with bots. Spam from any website is spam, regardless of how useful, and in this case it's a severe breach of the community's trust. Anyway, archive.is was recently removed from the blacklist, so it's silly to paint the whole community as "cancerous", which is also a juvenile term to use.


What are you referring to?


I'm not necessarily agreeing with the OP here and I don't even know what the community is like but this seems a decent page to start[1]. I do like how Wikipedia keeps a page on its own controversies - I mean it makes sense, but I like that they're open about it.

[1] https://en.wikipedia.org/wiki/List_of_Wikipedia_controversie...


The thing with the links on wiki is that if they lead to an obscure third-party site, the user would trust them at the level of trust for Wikipedia. But if the Wikipedia community has no idea what this site is, they'd feel uncomfortable using these for linking. Especially if there's a danger that in 10 years whoever runs this site gives up, loses the domain, and some spammer or criminal buys it and gets all the naive people from Wikipedia coming to them and trusting them since Wikipedia referred them.

archive.org is a known, established and trusted organization. It actually has an office within walking distance of the Wikimedia offices, AFAIK :) - not that it is very important, just an interesting fact. The point is there's no reputation problem. But for a site that is less known, there is.

I understand the frustration of people about not being trusted, but that's how it works - trust needs to be earned. I don't see any way around it but for whoever runs the bot to talk to the Wikipedia community and earn their trust. Name-calling, like some do here in the comments, won't exactly be helpful. Shady practices used by whoever wrote the bot, like using tons of IPs and not identifying the bot properly, also don't help. You can't be sneaky and complain there's no trust at the same time.


Excellent news. Should note that today is the 20th anniversary of the Internet Archive: https://blog.archive.org/2016/10/26/making-the-web-more-reli...


I'm really glad this is happening. Wikipedia needs to clean up their broken links, and this could help the archive get a wider sampling of websites, so as to preserve more data.

Websites going offline is a huge problem. For example, the now-famous thread from which sleepsort originated (on 4chan's /prog/ textboard) isn't archived anywhere: textboard threads are immortal, so nobody thought to archive any threads until dis.4chan.org went down for good.

Thankfully, some bright spark managed to save the sqlite databases for most of the boards on dis to the Internet Archive, so I was able to track down the thread eventually.


This is a huge step forward for Wikipedia as an authoritative source of information. Glad to see this happening. :)

OT: I considered applying to the Internet Archive last time I was looking for work, but their office is too hard to commute to coming from the East Bay. :(


I agree that it's a big step forward for encyclopedias. Not just as a 'source of truth' but also in terms of automating away a lot of the routine editorial maintenance that needs to happen at Wikipedia's scale.


This whole discussion reminds me of how all MySpace content was destroyed in a rash corporate decision years ago. Just like that, five years of the most popular social networking site on the World Wide Web and all its history were wiped out:

http://activehistory.ca/2013/06/myspace-is-cool-again-too-ba...

Unfortunately, the Internet Archive was only able to get the non-logged-in version of the site. All those loud, obnoxious profile pages users spent endless hours working on? We only have oral histories now to remember them.


I wish I could get all my old horrible homepages back. There was a time you'd have to torture me to admit I had anything to do with that, but now I would probably be proud of them again. It's history now.


It'd be great if StackOverflow approached the Internet Archive about doing the same for their broken links, too.


Internet Archive should look into distributed models such as IPFS for storage of the archived sites.


The IPFS team is working with the Internet Archive on this.


Excellent, it may be a better way to provide spare bandwidth/storage similar to their https://archive.org/details/archiveteam-warrior


For sure. ArchiveTeam has explored providing mirrors of the entire Archive [+], but IPFS is a perfect fit for the task.

[+] http://www.archiveteam.org/index.php?title=INTERNETARCHIVE.B...


As I have already clicked on several broken links, I am wondering how many links per article, in absolute numbers or as a percentage, are likely to be broken.

I might be way off, but doesn't 1M seem like a low number for Wikipedia's size? What is that as a percentage of the total number of links? Does anyone know?


Wow. I really love the internet archive as a project. This is a great usage. Looking forward to seeing how that will work out.

I wonder if they will publish a list of replaced links after the fact?


What's blocking Wikipedia from just archiving the referenced pages on edit?

It would be far more reliable than depending on Internet Archive when it may not have the page archived and more likely the time of the archive would differ from the time it was referenced.

It would cost some more disk space and bandwidth, which of course is already pressuring them but in turn would greatly improve usability and reliability.


Likely some interpretation of how Wikipedia is not to be a primary source.


One corner case that exists: some content is linked on Wikipedia, and this content is taken down due to a copyright violation

(I suppose Archive.org would be asked to take the content down)


Archive.org will take content down for certain reasons, but they have a pretty broad copyright exemption as a non-profit archive.


Why does the headline say "to fix 1M broken links" but the article says it's already been done?


Yeah that's a bit confusing. I'd attribute it to the "press release" nature of Wikimedia's blog where they mean that they have already partnered with the Internet Archive and are announcing it after the fact.


(red heart)(yellow heart)(green heart)(blue heart) Internet Archive (red heart)(yellow heart)(green heart)(blue heart)

There are few more noble pursuits than archiving the sum of human knowledge.


On a side note, it makes me very sad how Wikipedia editors are often pushing some political agenda. I'm relying on it for fewer and fewer topics. Clearly nothing that can be affected by US politics or SJW-style controversies.



