Wayback Machine: Now with 240,000,000,000 URLs (archive.org)
79 points by cleverjake on Jan 10, 2013 | 36 comments



Seriously, the Wayback Machine is awesome. Just last week I used it to find a website I made twelve years ago at the ripe age of 10. If there are any maintainers/developers reading this, thank you.

The fact that they were able to preserve a masterpiece like this means a lot to me: http://web.archive.org/web/20010124071800/http://expage.com/...


Agreed. Back in '96 I was a teenager who spent all his free time running a modest gaming fan website [1]. I dropped it when I went to college a year later, but it's nice to know all that hard work will forever be memorialized within the wayback machine. It helps me remember why I became a software engineer in the first place. Thanks for that.

[1] http://web.archive.org/web/19970414022225/http://www.scorche...


Do you still own that domain? It's fantastic.


Thanks! Sadly, I brought on a volunteer who bought the domain in his name. Not a very smart idea in retrospect... but I was young, and not many people understood the value of a good domain name back in 1996 ;-)

We were up to 20k daily uniques when I quit (not bad for 1997). I wrote the forum & related software myself in perl, which was an amazing learning experience.


> Best viewed in 800x600 resolution.

that's great ;-)


i remember questioning whether it was ok to drop 8x6 support. the golden days.


Oh, this was back when we were tired of supporting 640x480, hence the message of "best viewed on 800x600" :)


Personally, my favorite is how NT4 and 2000 Terminal Servers were limited to 8-bit color depth (without Citrix, of course).


Clicked on Today's comics and felt very nostalgic when I saw this http://web.archive.org/liveweb/http://expage.com/cgibin/free...


> The fact that they were able to preserve a masterpiece like this means a lot to me

That's exactly the kind of website they should be preserving! People in the future will be glad that we kept some ephemera around.

And I love that website - just so blue; and comic sans not used ironically; and animated gifs; and the guestbook.


I find it hilarious that you animated the B flipping between italics and normal


You forgot a DOCTYPE.


Obviously I live on the wild side.


when i was a kid using my parents' AOL connection Homestead was blocked by the default AOL parental controls. so i installed a keylogger so i could sneak onto my dad's screen name and make websites when he went to work. in retrospect i probably could have just asked him to unblock homestead so i could make websites.

i realize this is only tangentially related, but your page compelled me to share. :)


Ahhh! Homestead. Thank you for reminding me. I need to check there for archives, too!


Your animated gifs are awesome.


My only gripe with the wayback machine is that when an old site goes offline and some random domain squatter picks up the domain after it expires, they apply the current robots.txt to all of the old content, making archive.org useless.

robots.txt should have a limit; it shouldn't be applied retroactively so aggressively.


I suppose it's a pretty good approach to help avoid upsetting website owners and possible lawsuits. While I'm not too sure I agree with it, I can appreciate that they have provided a very easy way for websites to opt out.


A possible pragmatic solution would be to track the site, spot ownership changes, and freeze the robots.txt when that happens.

Reliable ownership change detection can be tricky, but it's doable IMHO.
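
As a sketch of that idea (the inputs here are made up: say, periodic fingerprints of a domain's WHOIS registrant or nameserver records), you could compute a cutoff date and only let the current robots.txt govern captures made after the last apparent ownership change:

    from datetime import datetime

    def robots_cutoff(ownership_history):
        # ownership_history: list of (timestamp, owner_fingerprint) samples.
        # Returns the timestamp of the most recent apparent ownership change;
        # the current robots.txt would only be applied to captures at or after
        # this point, while older captures keep the rules in force at crawl time.
        history = sorted(ownership_history)
        cutoff = history[0][0] if history else datetime.min
        for (_, prev), (ts, cur) in zip(history, history[1:]):
            if cur != prev:
                cutoff = ts
        return cutoff

    history = [
        (datetime(1999, 3, 1), "ns1.oldhost.net"),
        (datetime(2005, 7, 1), "ns1.oldhost.net"),
        (datetime(2012, 1, 1), "ns1.parkingfarm.example"),  # squatter takes over
    ]
    # Captures made before 2012-01-01 would ignore the squatter's robots.txt.
    print(robots_cutoff(history))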


It's pretty shocking how many web designers, even experienced professionals, assume a site isn't being crawled because it hasn't "gone live" yet in their minds (i.e., no press release). If you have a site active on an IP without any access controls, you can almost be sure it is being indexed by someone. If it's not the default site, expect one of your users to leak the virtual host name. If it's SSL-protected, it might even be revealed in the certificate. I respect the work the Internet Archive is doing, but I'm also grateful that they will immediately retroactively apply robots.txt if you discover you foolishly exposed a site prematurely.
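
For example, the subjectAltName entries in a certificate can list virtual host names nobody has announced yet. A small sketch of checking that (assumes the certificate verifies against the default trust store):

    import socket, ssl

    def names_in_certificate(host, port=443):
        # Connect over TLS and return the DNS names the certificate discloses.
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]

    print(names_in_certificate("example.org"))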


The Wayback Machine is run by archive.org, a non-profit. If you like what they do consider donating at http://archive.org/donate/index.php


You can even donate some Bitcoin if you don't have a Paypal or Amazon account.


Just imagine hosting this beast and then having 1,000 people wanting to scan the entire thing every second! For free!


Even more!


Please, I need help.

Hi, I am using wayback-1.6 on my tomcat-5.28 (java-1.7, ubuntu-11.04) to display my arc.gz files, but I get the error below even though the folder /tmp/wayback/files1/ contains all of my arc.gz files (e.g. IA.arc.gz):

    Resource Not In Archive

    The Resource you requested is not in this archive.



I'm ten percent into an implementation of a 'personal web archive', mostly as a fun side project. I just wonder if historical data gets more interesting as the web ages.


The other day it occurred to me that the digital age might result in some serious data loss. I was wondering about historic prices for cars. Not sure how you would have gone about it in previous times, but I suppose you could find old catalogs, adverts in newspapers and stuff like that. But what if vendors only advertise prices on their web sites? The pages with the old prices will be gone once new prices go up, and the same goes for the advertisements.

Not even sure if archives can help - with some algorithmically created content it might be impossible to index it all.

Just one example - there are surely more. I used to think digital data would be easier to preserve for the future, but now I am not so sure anymore.

Not even mentioning Facebook, which presumably cannot be archived because of the walled garden thing.



Cherished memories: word.com from the late '90s


Is there a way to get the list of those URLs? If anybody knows how, show me the way... or you can email me at agatto2@gmail.com


I wonder how this handles all the more recent HTML pages with all their javascript?


Does anyone know what database they use? Or just files and folders?


The index for Wayback is a massive sorted text file (called a CDX) containing a line for each URL and timestamp. For very large installations this index is sharded across multiple servers and queried in parallel. The lookups are done using plain old binary search.

http://archive.org/web/researcher/cdx_file_format.php
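
A minimal sketch of what that lookup can look like, assuming a plain-text CDX file sorted lexicographically where each line begins with a canonicalized URL key followed by a timestamp (the file name, key format, and field layout here are illustrative, not the Archive's actual code):

    import os

    def cdx_lookup(cdx_path, url_key):
        # Binary-search a sorted CDX text file by byte offset, then scan
        # forward and collect every line whose URL key matches `url_key`.
        key = url_key.encode()
        matches = []
        with open(cdx_path, "rb") as f:
            f.seek(0, os.SEEK_END)
            lo, hi = 0, f.tell()
            while lo < hi:
                mid = (lo + hi) // 2
                f.seek(mid)
                f.readline()            # skip the (possibly partial) line at mid
                line = f.readline()     # first complete line after mid
                if line and line < key:
                    lo = mid + 1
                else:
                    hi = mid
            # Re-align to the start of the line containing `lo`, then scan.
            if lo > 0:
                f.seek(lo - 1)
                f.readline()
            else:
                f.seek(0)
            for raw in f:
                if raw < key:           # at most one line can still sort below the key
                    continue
                if not raw.startswith(key):
                    break
                matches.append(raw.decode("utf-8", "replace").rstrip("\n"))
        return matches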

Each CDX record maps a URL-timestamp pair to a byte offset into an ARC or WARC file. These are essentially just gzipped HTTP responses concatenated together:

http://archive.org/web/researcher/ArcFileFormat.php http://www.digitalpreservation.gov/formats/fdd/fdd000236.sht...
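
Because each record is an independent gzip member, replay only needs the byte offset: decompress one member starting there and you have the whole record. A sketch under the same assumptions (illustrative, not the Archive's code):

    import zlib

    def read_record(arc_path, offset, chunk_size=64 * 1024):
        # Decompress a single gzip member starting at `offset`. In these files
        # one gzip member corresponds to one archived record (record header
        # plus the captured HTTP response).
        decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # gzip framing
        record = bytearray()
        with open(arc_path, "rb") as f:
            f.seek(offset)
            while not decomp.eof:
                chunk = f.read(chunk_size)
                if not chunk:
                    break               # hit EOF: truncated member
                record.extend(decomp.decompress(chunk))
        return bytes(record)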

The document is retrieved and uncompressed, URLs in it are rewritten, the navigation-banner JavaScript is injected, and the result is sent to the client.

The code is here: https://github.com/internetarchive/wayback
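
Putting the two sketches above together (hypothetical file names and key; in the common CDX legends the compressed-record offset and the (W)ARC filename are the last two columns):

    for line in cdx_lookup("index.cdx", "com,example)/ "):
        fields = line.split(" ")
        offset, filename = int(fields[-2]), fields[-1]
        print(read_record(filename, offset)[:200])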


How do you get hold of the list of URLs?


I only have second-hand insight into how they do it, but my impression is that their spiders log the raw data right to disk and then, periodically (a few times a year?), they process all the accumulated data in one mind-bogglingly massive job. I don't know what the data goes into as a result of this processing, though.



