Wayback Machine: Now with 240,000,000,000 URLs (archive.org)
79 points by cleverjake on Jan 10, 2013 | 36 comments



Seriously, the Wayback Machine is awesome. Just last week I used it to find a website I made twelve years ago at the ripe age of 10. If there are any maintainers/developers reading this, thank you.

The fact that they were able to preserve a masterpiece like this means a lot to me: http://web.archive.org/web/20010124071800/http://expage.com/...


Agreed. Back in '96 I was a teenager who spent all his free time running a modest gaming fan website [1]. I dropped it when I went to college a year later, but it's nice to know all that hard work will forever be memorialized within the wayback machine. It helps me remember why I became a software engineer in the first place. Thanks for that.

[1] http://web.archive.org/web/19970414022225/http://www.scorche...


Do you still own that domain? It's fantastic.


Thanks! Sadly, I brought on a volunteer who bought the domain in his name. Not a very smart idea in retrospect... but I was young, and not many people understood the value of a good domain name back in 1996 ;-)

We were up to 20k daily uniques when I quit (not bad for 1997). I wrote the forum & related software myself in perl, which was an amazing learning experience.


> Best viewed in 800x600 resolution.

that's great ;-)


i remember questioning whether it was ok to drop 8x6 support. the golden days.


Oh, this was back when we were tired of supporting 640x480, hence the message of "best viewed on 800x600" :)


Personally, my favorite is how NT4 and 2000 Terminal Servers were limited to 8-bit color depth (without Citrix, of course).


Clicked on Today's comics and felt very nostalgic when I saw this http://web.archive.org/liveweb/http://expage.com/cgibin/free...


> The fact that they were able to preserve a masterpiece like this means a lot to me

That's exactly the kind of website they should be preserving! People in the future will be glad that we kept some ephemera around.

And I love that website - just so blue; and comic sans not used ironically; and animated gifs; and the guestbook.


I find it hilarious that you animated the B flipping between italics and normal


You forgot a DOCTYPE.


Obviously I live on the wild side.


when i was a kid using my parents' AOL connection Homestead was blocked by the default AOL parental controls. so i installed a keylogger so i could sneak onto my dad's screen name and make websites when he went to work. in retrospect i probably could have just asked him to unblock homestead so i could make websites.

i realize this is only tangentially related, but your page compelled me to share. :)


Ahhh! Homestead. Thank you for reminding me. I need to check there for archives, too!


Your animated gifs are awesome.


My only gripe with the wayback machine is that when an old site goes offline and some random domain squatter picks up the domain after it expires, they apply the current robots.txt to all of the old content, making archive.org useless.

robots.txt should have a limit; it shouldn't be applied retroactively so aggressively.


I suppose it's a pretty good approach to help avoid upsetting website owners and possible lawsuits. While I'm not too sure I agree with it, I can appreciate that they have provided a very easy way for websites to opt out.


A possible pragmatic solution would be to track the site, spot ownership changes, and freeze the robots.txt when that happens.

Reliable ownership change detection can be tricky, but it's doable IMHO.
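
As a sketch of that idea (the inputs here are made up: say, periodic fingerprints of a domain's WHOIS registrant or nameserver records), you could compute a cutoff date and only let the current robots.txt govern captures made after the last apparent ownership change:

    from datetime import datetime

    def robots_cutoff(ownership_history):
        # ownership_history: list of (timestamp, owner_fingerprint) samples.
        # Returns the timestamp of the most recent apparent ownership change;
        # the current robots.txt would only be applied to captures at or after
        # this point, while older captures keep the rules in force at crawl time.
        history = sorted(ownership_history)
        cutoff = history[0][0] if history else datetime.min
        for (_, prev), (ts, cur) in zip(history, history[1:]):
            if cur != prev:
                cutoff = ts
        return cutoff

    history = [
        (datetime(1999, 3, 1), "ns1.oldhost.net"),
        (datetime(2005, 7, 1), "ns1.oldhost.net"),
        (datetime(2012, 1, 1), "ns1.parkingfarm.example"),  # squatter takes over
    ]
    # Captures made before 2012-01-01 would ignore the squatter's robots.txt.
    print(robots_cutoff(history))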


It's pretty shocking how many web designers, even experienced professionals, assume a site isn't being crawled because it hasn't "gone live" yet in their minds (i.e., no press release). If you have a site active on an IP without any access controls, you can almost be sure it is being indexed by someone. If it's not the default site, expect one of your users to leak the virtual host name. If it's SSL-protected, it might even be revealed in the certificate. I respect the work the Internet Archive is doing, but I'm also grateful that they will immediately retroactively apply robots.txt if you discover you foolishly exposed a site prematurely.
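
For example, the subjectAltName entries in a certificate can list virtual host names nobody has announced yet. A small sketch of checking that (assumes the certificate verifies against the default trust store):

    import socket, ssl

    def names_in_certificate(host, port=443):
        # Connect over TLS and return the DNS names the certificate discloses.
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]

    print(names_in_certificate("example.org"))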


The Wayback Machine is run by archive.org, a non-profit. If you like what they do consider donating at http://archive.org/donate/index.php


You can even donate some Bitcoin if you don't have a Paypal or Amazon account.


Just imagine hosting this beast and then having 1,000 people wanting to scan the entire thing every second! For free!


Even more!


Please, I need help.

Hi, I am using wayback-1.6 on my tomcat-5.28 (java-1.7, ubuntu-11.04) to display my arc.gz files, but I get the error below even though the folder /tmp/wayback/files1/ contains all of my arc.gz files (e.g. IA.arc.gz):

    Resource Not In Archive

    The Resource you requested is not in this archive.



I'm ten percent into an implementation of a 'personal web archive', mostly as a fun side project. I just wonder if historical data gets more interesting as the web ages.


The other day it occurred to me that the digital age might result in some serious data loss. I was wondering about historic prices for cars. Not sure how you would have gone about it in previous times, but I suppose you could find old catalogs, adverts in newspapers and stuff like that. But what if vendors only advertise prices on their web sites? The pages with the old prices will be gone once new prices go up, and the same goes for the advertisements.

Not even sure if archives can help - with some algorithmically created content it might be impossible to index it all.

Just one example - there are surely more. I used to think digital data would be easier to preserve for the future, but now I am not so sure anymore.

Not even mentioning Facebook, which presumably cannot be archived because of the walled garden thing.



Cherished memories: word.com from the late '90s


Is there a way to get the list of those URLs? If anybody knows how, show me the way... or you can email me at agatto2@gmail.com


I wonder how this handles all the more recent HTML pages with all their javascript?


Does anyone know what database they use? Or just files and folders?


The index for Wayback is a massive sorted text file (called a CDX) containing a line for each URL and timestamp. For very large installations this index is sharded across multiple servers and queried in parallel. The lookups are done using plain old binary search.

http://archive.org/web/researcher/cdx_file_format.php
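
A minimal sketch of what that lookup can look like, assuming a plain-text CDX file sorted lexicographically where each line begins with a canonicalized URL key followed by a timestamp (the file name, key format, and field layout here are illustrative, not the Archive's actual code):

    import os

    def cdx_lookup(cdx_path, url_key):
        # Binary-search a sorted CDX text file by byte offset, then scan
        # forward and collect every line whose URL key matches `url_key`.
        key = url_key.encode()
        matches = []
        with open(cdx_path, "rb") as f:
            f.seek(0, os.SEEK_END)
            lo, hi = 0, f.tell()
            while lo < hi:
                mid = (lo + hi) // 2
                f.seek(mid)
                f.readline()            # skip the (possibly partial) line at mid
                line = f.readline()     # first complete line after mid
                if line and line < key:
                    lo = mid + 1
                else:
                    hi = mid
            # Re-align to the start of the line containing `lo`, then scan.
            if lo > 0:
                f.seek(lo - 1)
                f.readline()
            else:
                f.seek(0)
            for raw in f:
                if raw < key:           # at most one line can still sort below the key
                    continue
                if not raw.startswith(key):
                    break
                matches.append(raw.decode("utf-8", "replace").rstrip("\n"))
        return matches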

Each CDX record maps a URL-timestamp pair to a byte offset into an ARC or WARC file. These are essentially just gzipped HTTP responses concatenated together:

http://archive.org/web/researcher/ArcFileFormat.php http://www.digitalpreservation.gov/formats/fdd/fdd000236.sht...
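
Because each record is an independent gzip member, replay only needs the byte offset: decompress one member starting there and you have the whole record. A sketch under the same assumptions (illustrative, not the Archive's code):

    import zlib

    def read_record(arc_path, offset, chunk_size=64 * 1024):
        # Decompress a single gzip member starting at `offset`. In these files
        # one gzip member corresponds to one archived record (record header
        # plus the captured HTTP response).
        decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # gzip framing
        record = bytearray()
        with open(arc_path, "rb") as f:
            f.seek(offset)
            while not decomp.eof:
                chunk = f.read(chunk_size)
                if not chunk:
                    break               # hit EOF: truncated member
                record.extend(decomp.decompress(chunk))
        return bytes(record)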

The document is retrieved and uncompressed, URLs in it are rewritten, the navigation-banner JavaScript is injected, and the result is sent to the client.

The code is here: https://github.com/internetarchive/wayback
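
Putting the two sketches above together (hypothetical file names and key; in the common CDX legends the compressed-record offset and the (W)ARC filename are the last two columns):

    for line in cdx_lookup("index.cdx", "com,example)/ "):
        fields = line.split(" ")
        offset, filename = int(fields[-2]), fields[-1]
        print(read_record(filename, offset)[:200])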


How do you get hold of the list of URLs?


I only have second-hand insight into how they do it, but my impression is that their spiders log the raw data right to disk and then, periodically (a few times a year?), they process all the accumulated data in one mind-bogglingly massive job. I don't know what the data goes into as a result of this processing, though.



