Oh how I wish the wayback machine would ignore robots.txt... So many websites lost to history because some rookie webmaster put misguided commands into the file without thinking about the consequences (e.g. block all crawlers except Google).
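Something as small as this (a made-up example) is enough to lock the Wayback Machine out along with every other non-Google crawler:

    # Let Googlebot in, shut everyone else (including ia_archiver) out
    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /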
The worst part is when a site is stored in the wayback, then the domain expires and the new owner (or squatter) has a robots.txt that blocks everything, then all the old content becomes inaccessible.
They should store a history of WHOIS data for the site's domain and make separate archives for when the owner changes, I think. Also, why did anyone think that applying robots.txt retroactively is a good idea? :/
The worst part of this is that it's retroactive, so adding a robots.txt that denies the wayback machine access causes the machine to delete all history of the site. This is really annoying for patent cases where the prior art is on the applicant's own website: they can go and remove the prior art so it's no longer available (which is why examiners make copies of the wayback content before making their reports).
To be pedantic, they aren't lost. They are just unavailable until the robots.txt goes away. I'm fairly sure the Internet Archive aren't too keen on deleting things (unless you absolutely super-duperly want it gone and you're the author/owner of the data).
I'm surprised that some upstart search engine hasn't made a selling point that they ignore robots.txt and claim they search the pages google doesn't or something.
Speaking as an upstart search engine guy (blekko) who also has a bunch of webpages and a huge robots.txt, that's a bad idea. Such a crawler would be knocking down webservers by running expensive scripts and clicking links that do bad things like deleting records from databases or reverting edits in wikis. You don't want to go there.
Really? I was always taught that search engines only make GET requests, and anything that modifies data goes in a POST request. Are there really that many broken web sites out there that haven't already fallen victim to crawlers that ignore robots.txt?
I noticed this today. Googling "united check in" and clicking the "check" link gave me a page telling me the confirmation number I entered was invalid, though I never entered one.
IANAL. But, although in principle, providing an easy opt-out shouldn't really matter with respect to copyright and so forth, as a practical matter it seems as if it does--in that, if you at least vaguely care about your website not being mirrored you have an easy way to prevent it. An organization like the Internet Archive simply can't afford (in terms of either time or money) to take a more aggressive approach to mirroring.
To be more specific--short of granting the Internet Archive some sort of special library exemption--what if I were to, say, create a special archive of popular cartoon strips? What's the distinction?
[EDIT: The retroactive robots.txt situation seems less clear but, like orphan works, also depends on the scenarios you care to devise.]
Or, with a more historical lens, lots of history has been learned by poring over the intimate private correspondence of historical figures - most of whom, I would imagine, would feel quite perturbed to see their love letters on display in museums.
Should historians not read private letters sent long ago? Should they swear to some oath and take a moral stand that such things shouldn't be examined?
If the answer is "No, they should read them," then why, for the historical record, should we observe robots.txt? Isn't it the same thing?
A robots.txt file means that you should not crawl the site today. It should have no effect whatsoever on displaying pages that WERE crawled before the timestamp on the robots.txt file.
I don't think there is an inarguable answer to my rhetorical question. People's intents and wishes do matter.
But there also is an idea from antiquity about the public good and the commons. I guess at some point my personal wishes get trumped by this overarching principle.
The whole point of the question was that someone says "You may not read my love letters" and then society says "Too bad, we're doing it anyway. And reprinting them in high-school textbooks."
Is that ok? I don't think there's a clear line and I do think there are probably moral boundaries.
I'm by no means Lawrence Lessig, and this type of discourse is not something I'm really experienced at. I do think there are many important questions here that we may need to rethink.
One might nitpick that there was initially some distinction between the publicly available internet and a private facebook; although the latter seems to be making strides to narrow this gap.
Yes, because secrets and forgetting can be important.
It's not our cultural tradition that every written work (train schedules, greeting cards, friendly notes, lolcats, etc.) must be archived at the Library of Congress. I'm not sure that it'd be a good idea.
No one is stopping you from archiving my websites if you think the data will have some importance. It seems like you're suggesting that archive.org is the universal keeper of history and everyone should agree with that idea.
I'd love to see this, even if they'd keep the content private for x number of years. Copyright runs out eventually and it would still be archived then.
I mentored a Google Summer of Code project to do just that - every citation on Wikipedia would be forwarded to Archive.org for permanent storage, and the citation link would be modified to offer the cached version as an alternative.
For various reasons this didn't get completed or deployed. It's still a good idea though. IMO it should be rewritten, but it wouldn't be a lot of code. I'd love to help anyone interested.
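A rough sketch of what the per-citation check could look like, assuming the public Wayback availability endpoint; the helper names and the wikitext shape here are just for illustration, not what the GSoC code actually did:

    import requests

    def archived_url(url):
        """Look up the closest Wayback snapshot for a cited URL via the availability API."""
        resp = requests.get("https://archive.org/wayback/available",
                            params={"url": url}).json()
        closest = resp.get("archived_snapshots", {}).get("closest")
        if closest and closest.get("available"):
            return closest["url"]
        return None

    def with_archive_link(citation_url):
        """Build wikitext that offers the cached copy as an alternative, if one exists."""
        archive = archived_url(citation_url)
        if archive:
            return "[%s original] ([%s archived copy])" % (citation_url, archive)
        return "[%s original]" % citation_url

    print(with_archive_link("https://example.com/cited/page"))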
(French Wikipedia already does this, by the way. Check out the article on France, for example - all the footnotes have a secondary link to WikiWix. https://fr.wikipedia.org/wiki/France)
Alexis said (at the IA 10th Anniversary bash) that they are going to have this running very soon, using a bot to go over all of Wikipedia and insert archived links close to the dates of existing references (if available), and also capturing newly added links.
I would just like to say that the Internet Archive is a pretty small bunch of people and they have a lot of never ending work to do on a somewhat tight budget.
I would assume it's mostly that. They seem very accepting and willing to do a lot of things.
That's why I'm a "donation subscriber". If you'd like to know more about it, please visit: http://archive.org/donate/ - a subscription helps extra much, because it's a constant flow of cash. But one-time donations are of course of help as well.
It wasn't the IA's fault. At the time, the IA was already working on an API to submit URLs and to rapidly cache items, so we just needed early access.
The GSoC student didn't follow up with the process of getting it adopted. I didn't either, which I regret. I left the WMF in early 2012 so I guess it was dropped on the floor for a while.
That said I have since found out that others have taken up the charge.
Sir, not all of us are javascript/HTML5 people. I do mostly operations, and can barely drag myself through Javascript until I get the chance to take some vacation time to concentrate on learning the web side (JS/HTML5/etc). I admit that I don't know what I'm doing sometimes.
I am incredibly pleased at the save-page-now feature. Before, there was a hack where using liveweb.archive.org might save a page on demand, but you had no way of knowing whether it worked. I'm adding this to my archive-bot immediately.
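If anyone else is wiring this into a bot: as far as I can tell, the basic trigger is just a GET on the save path, something like the sketch below (the assumption that the response redirects to the new snapshot is mine; double-check against the docs):

    import requests

    def save_page_now(url):
        """Ask the Wayback Machine to capture `url` right away."""
        resp = requests.get("https://web.archive.org/save/" + url)
        resp.raise_for_status()
        # The request normally redirects to the freshly captured snapshot,
        # so the final URL should be the archived copy.
        return resp.url

    print(save_page_now("https://example.com/"))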
Glad to see they finally have an API, but I'm a bit disappointed that it doesn't return the oldest archived date for a site, only the newest. I often need to check how long ago a site was first archived. The API would have been very helpful for that, but the closest it provides is an option to query whether a site was archived on a specific date, which is nowhere near as helpful.
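One workaround, assuming the CDX index endpoint mentioned below: ask it for captures of the URL and take the first row, which comes back in ascending timestamp order and is effectively the oldest snapshot date. A minimal sketch:

    import requests

    def first_capture(url):
        """Ask the Wayback CDX index for the earliest capture timestamp of a URL."""
        rows = requests.get("https://web.archive.org/cdx/search/cdx",
                            params={"url": url, "output": "json",
                                    "fl": "timestamp", "limit": "1"}).json()
        # Row 0 is the field-name header; row 1 (if present) is the oldest capture.
        return rows[1][0] if len(rows) > 1 else None

    print(first_capture("example.com"))  # 14-digit YYYYMMDDhhmmss timestamp, or None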
Funny, I just spent some time this weekend creating a json API wrapper around the little-known Memento API that wayback offers. My idea was to make a bookmarklet that would show prior versions of the page the user was visiting. (The backend is pretty trivial, actually, but I could use some help with the javascript/dom parts.)
(The CDX API linked below links to the actual warc/arc archive files, not the web-viewable versions.)
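The Memento part of that backend really is mostly one request to the TimeMap endpoint plus a bit of link-format parsing; roughly like the sketch below (the returned dict shape is my own invention, not part of the Memento spec):

    import re
    import requests

    TIMEMAP = "https://web.archive.org/web/timemap/link/"

    def prior_versions(url):
        """Fetch the Memento TimeMap for a URL and pull out memento URLs and dates."""
        body = requests.get(TIMEMAP + url).text
        versions = []
        for line in body.splitlines():
            # Memento entries look like: <http://web.archive.org/web/...>; rel="memento"; datetime="..."
            if "memento" in line and 'datetime="' in line:
                target = re.search(r"<([^>]+)>", line)
                when = re.search(r'datetime="([^"]+)"', line)
                if target and when:
                    versions.append({"url": target.group(1), "datetime": when.group(1)})
        return versions

    print(prior_versions("http://example.com/")[:3])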
I love the Wayback Machine (and all of Archive.org, really). I recently used it to reminisce about some old VRML-based chat communities that I frequented about 10 years ago. It had a record for every single one of them.
They still have the site I used to host pre-release Warcraft 3 servers on when I was young, until my parents got a call from Blizzard telling them to take it offline ;)
Yep! There was also GoonieTown, which didn't last very long and eventually became VR Dimension. Flatland Rover was another, but it used its own 3DML engine instead of Blaxxun Contact and VRML. Good times!
I remember both. I was active from '99 to about 2002, and was a City Councilor/Colony Leader at one point. Nice to run into someone else with a similar background, the internet just isn't the same as it used to be in those days.
I just launched a similar service called https://www.DailySiteSnap.com that screenshots, emails, and archives a specified web site on a daily basis. My use case is to be able to look back at any one day and see what my site looked like, since Archive.org doesn't refresh my page as often as I update it.
Disclaimer: I'm really not trying to over-market myself, but I figured readers of this thread might be interested in my project. Happy to take down this post if it's read as too spammy.
Thanks for the info, was unaware of ARC/WARC formats. That said, I still think many people are looking for something simpler/easier, and a daily screenshot is good enough. Particularly, it will guarantee preserved formatting as browsers continue to evolve.
You can probably make your service do both screenshots and WARC: instead of loading a site directly, load it through WarcProxy (https://github.com/odie5533/WarcProxy), which will write out a WARC file while you still capture your screenshot.
Once you have the WARCs you can upload them to Archive.org so they can be added to the wayback, or you can set up your own service for browsing them, built on something like warc-proxy https://github.com/alard/warc-proxy (yeah, same name, different purpose...).
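On the fetching side the proxy approach is not much code; a sketch, assuming WarcProxy is already running locally (the address below is a placeholder, check its README for the real port):

    import requests

    # Placeholder address for a locally running WarcProxy; adjust to your setup.
    WARC_PROXY = {"http": "http://127.0.0.1:8000"}

    def fetch_via_warcproxy(url):
        """Fetch a page through the WARC-writing proxy so the traffic gets recorded."""
        return requests.get(url, proxies=WARC_PROXY).text

    html = fetch_via_warcproxy("http://example.com/")
    # ...then render/screenshot `html` as before; the proxy has written the WARC to disk.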
This makes creating a browse-able mirror of a site in warc format fairly straightforward, as wget will automatically make links relative, as well as fetch requisite files (css, js, images) for each page.
If his service runs on any sort of Linux distro, it's stupid-simple to call wget with a system call. Wget comes standard with all of the most popular distros.
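Something like this minimal sketch covers the system call and the WARC output in one pass (standard wget flags; the crawl depth and filenames are placeholders):

    import subprocess

    def mirror_with_warc(url, warc_name="snapshot"):
        """Shell out to wget: record a WARC and leave a link-converted local mirror."""
        subprocess.check_call([
            "wget",
            "--recursive", "--level=2",   # crawl depth is a placeholder, tune as needed
            "--page-requisites",          # grab the css/js/images each page needs
            "--convert-links",            # rewrite links so the local copy is browsable
            "--warc-file=" + warc_name,   # everything also lands in snapshot.warc.gz
            url,
        ])

    mirror_with_warc("http://example.com/")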
Thank god they didn't change much. I hate when extremely functional websites decide to 'revolutionize' their interface (I'm looking @ you Google Maps).
On a purely aesthetic note, the new input form does clash with the old menu. The carousel seems a bit CPU-hungry; maybe a simpler tile grid, as in Windows Phone 8, would do. That said, I love the service, and the frontend is probably not the most important part of their system.