Seriously, the Wayback Machine is awesome. Just last week I used it to find a website I made twelve years ago at the ripe age of 10. If there are any maintainers/developers reading this, thank you.
Agreed. Back in '96 I was a teenager who spent all his free time running a modest gaming fan website [1]. I dropped it when I went to college a year later, but it's nice to know all that hard work will forever be memorialized within the Wayback Machine. It helps me remember why I became a software engineer in the first place. Thanks for that.
Thanks! Sadly, I brought on a volunteer who bought the domain in his name. Not a very smart idea in retrospect... but I was young, and not many people understood the value of a good domain name back in 1996 ;-)
We were up to 20k daily uniques when I quit (not bad for 1997). I wrote the forum & related software myself in Perl, which was an amazing learning experience.
When I was a kid using my parents' AOL connection, Homestead was blocked by the default AOL parental controls, so I installed a keylogger so I could sneak onto my dad's screen name and make websites while he was at work. In retrospect, I probably could have just asked him to unblock Homestead so I could make websites.
I realize this is only tangentially related, but your page compelled me to share. :)
My only gripe with the Wayback Machine is that when an old site goes offline and some random domain squatter picks up the domain after it expires, the squatter's current robots.txt gets applied to all of the old content, making archive.org useless for that site.
robots.txt should have a limit; it shouldn't be applied retroactively so aggressively.
I suppose it's a pretty good approach for avoiding upset website owners and possible lawsuits. While I'm not too sure I agree with it, I can appreciate that they have provided a very easy way for websites to opt out.
It's pretty shocking how many web designers, even experienced professionals, assume a site isn't being crawled because it hasn't "gone live" yet in their minds (i.e., no press release). If you have a site active on an IP without any access controls, you can almost be sure it is being indexed by someone. If it's not the default site, expect one of your users to leak the virtual host name. If it's SSL-protected, it might even be revealed in the certificate. I respect the work the Internet Archive is doing, but I'm also grateful that they will immediately retroactively apply robots.txt if you discover you foolishly exposed a site prematurely.
Hi, I am using wayback-1.6 on tomcat-5.28 (java-1.7, ubuntu-11.04) to display all my arc.gz files, but I get the error below even though the folder contains all my arc.gz files (/tmp/wayback/files1/IA.arc.gz):
Resource Not In Archive
The Resource you requested is not in this archive.
I'm ten percent into an implementation of a 'personal web archive', mostly as a fun side project. I just wonder if historical data gets more interesting as the web ages.
The other day it occurred to me that the digital age might result in some serious data loss. I was wondering about historic prices for cars. I'm not sure how you would have gone about finding them in earlier times, but I suppose you could dig up old catalogs, adverts in newspapers, and stuff like that. But what if vendors only advertise prices on their websites? Those sites with old prices will all be gone once new prices are set, same with the advertisements.
Not even sure if archives can help - with some algorithmically created content it might be impossible to index it all.
Just one example - there are surely more. I used to think digital data would be easier to preserve for the future, but now I am not so sure anymore.
Not to mention Facebook, which presumably cannot be archived because of the walled-garden thing.
The index for Wayback is a massive sorted text file (called a CDX) containing a line for each URL and timestamp. For very large installations this index is sharded across multiple servers and queried in parallel. The lookups are done using plain old binary search.
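To make that concrete, here's a minimal sketch (in Python) of a binary search over a sorted, line-oriented index file: seek to a midpoint, skip forward to the next line boundary, and compare that full line against the search key. The file name and the "url timestamp ..." key layout are stand-ins for illustration, not the actual CDX field layout.

    # Sketch only: binary search over a sorted, line-oriented index file.
    import os

    def _line_at_or_after(f, pos):
        """Return the first complete line at or after byte offset pos (b'' at EOF)."""
        f.seek(pos)
        if pos > 0:
            f.readline()              # discard a possibly partial line
        return f.readline()

    def cdx_lookup(path, key):
        """Return the first index line >= key, or None if we ran off the end."""
        key = key.encode("utf-8")
        with open(path, "rb") as f:
            lo, hi = 0, os.fstat(f.fileno()).st_size
            while lo < hi:
                mid = (lo + hi) // 2
                line = _line_at_or_after(f, mid)
                if not line or line >= key:
                    hi = mid              # first matching line is at or before mid
                else:
                    lo = mid + 1          # first matching line is after mid
            line = _line_at_or_after(f, lo)
            return line.decode("utf-8", "replace").rstrip("\n") if line else None

    # Hypothetical usage: the caller still has to check that the returned line
    # actually starts with the URL key it asked for.
    hit = cdx_lookup("index.cdx", "example.com/ ")
    if hit and hit.startswith("example.com/ "):
        print(hit)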
Each CDX record maps a URL-timestamp pair to a byte offset into an ARC or WARC file. These are essentially just gzipped HTTP responses concatenated together.
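Since each record is its own gzip member, fetching one amounts to seeking to the offset from the index and decompressing just that member. Here's a rough sketch; the file name and offset below are made up for illustration.

    # Sketch only: pull one record out of an .arc.gz / .warc.gz style file by
    # decompressing the single gzip member that starts at a known byte offset.
    import zlib

    def read_record(path, offset, chunk_size=64 * 1024):
        """Decompress the gzip member beginning at `offset` and return its bytes."""
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)   # expect a gzip wrapper
        parts = []
        with open(path, "rb") as f:
            f.seek(offset)
            while not decomp.eof:            # stop at the end of this member,
                data = f.read(chunk_size)    # not at the end of the whole file
                if not data:
                    break
                parts.append(decomp.decompress(data))
        return b"".join(parts)

    # Hypothetical usage: the offset would come from the index lookup above.
    record = read_record("crawl-000123.arc.gz", offset=8675309)
    headers, _, body = record.partition(b"\r\n\r\n")   # split first header block from the rest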
I only have second-hand insight into how they do it, but my impression is that their spiders log the raw data straight to disk and then, periodically (a few times a year?), they process all the accumulated data in one mind-bogglingly massive job. I don't know what the data ends up in as a result of this processing, though.
The fact that they were able to preserve a masterpiece like this means a lot to me: http://web.archive.org/web/20010124071800/http://expage.com/...