
The index for Wayback is a massive sorted text file (called a CDX) containing a line for each URL and timestamp. For very large installations this index is sharded across multiple servers and queried in parallel. The lookups are done using plain old binary search.

http://archive.org/web/researcher/cdx_file_format.php
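To make the lookup concrete, here is a rough sketch of binary search over a sorted text file by byte offset. It is not taken from the Wayback code. The record layout (`urlkey timestamp offset filename`) is a simplified stand-in for the real CDX fields, the names `line_at_or_after` and `cdx_lookup` are made up for illustration, and it assumes the file is sorted in plain byte order.

```python
import io

def line_at_or_after(f, pos):
    """Return the first full line starting at byte offset >= pos (b'' at EOF)."""
    if pos == 0:
        f.seek(0)
    else:
        f.seek(pos - 1)
        f.readline()  # discard the rest of the line that contains byte pos-1
    return f.readline()

def cdx_lookup(f, key):
    """Find the first line that starts with `key` in a byte-sorted file.

    `f` is a seekable binary file; `key` is a bytes prefix such as
    b"com,example)/ 2013". Returns the line without its newline, or None.
    """
    f.seek(0, io.SEEK_END)
    lo, hi = 0, f.tell()
    # Search for the smallest offset whose following line is >= key.
    while lo < hi:
        mid = (lo + hi) // 2
        line = line_at_or_after(f, mid)
        if line and line < key:
            lo = mid + 1
        else:
            hi = mid
    line = line_at_or_after(f, lo)
    return line.rstrip(b"\n") if line.startswith(key) else None

# Hypothetical, simplified index: urlkey, timestamp, byte offset, archive file.
index = io.BytesIO(
    b"com,example)/ 20100101000000 1024 a.warc.gz\n"
    b"com,example)/ 20130515120000 2048 b.warc.gz\n"
    b"org,archive)/ 20120101000000 512 a.warc.gz\n"
)

print(cdx_lookup(index, b"com,example)/ 2013"))
```

Each probe costs one seek and two short reads, so the whole index never has to be loaded into memory. In a sharded setup the same function would presumably run against each shard's file.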

Each CDX record maps a URL-timestamp pair to a byte offset into an ARC or WARC file. These are essentially just gzipped HTTP responses concatenated together:

http://archive.org/web/researcher/ArcFileFormat.php
http://www.digitalpreservation.gov/formats/fdd/fdd000236.sht...
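Because each record is its own gzip member, you can seek straight to the offset from the index and decompress only that one record. A minimal sketch of the idea, using an in-memory toy archive rather than a real ARC/WARC file (`build_archive` and `read_record` are illustrative names, and the record bodies are invented):

```python
import gzip
import io
import zlib

def build_archive(records):
    """Concatenate individually gzipped records; return (file, start offsets)."""
    buf = io.BytesIO()
    offsets = []
    for record in records:
        offsets.append(buf.tell())
        buf.write(gzip.compress(record))
    buf.seek(0)
    return buf, offsets

def read_record(f, offset, chunk_size=4096):
    """Decompress exactly one gzip member starting at `offset`."""
    f.seek(offset)
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    parts = []
    while not decomp.eof:  # eof becomes True at the end of this member
        chunk = f.read(chunk_size)
        if not chunk:
            break
        parts.append(decomp.decompress(chunk))
    return b"".join(parts)

records = [
    b"HTTP/1.1 200 OK\r\n\r\n<html>first</html>",
    b"HTTP/1.1 200 OK\r\n\r\n<html>second</html>",
    b"HTTP/1.1 404 Not Found\r\n\r\n",
]
archive, offsets = build_archive(records)
print(read_record(archive, offsets[1]))
```

The decompressor stops at the end of the member, so bytes belonging to the next record are simply ignored.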

The document is retrieved and decompressed, its URLs are rewritten, the navigation banner JavaScript is injected, and the result is sent to the client.

The code is here: https://github.com/internetarchive/wayback
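As a toy illustration of the rewrite-and-inject step (this is not code from that repository), something like the following captures the idea. The `/web/<timestamp>/<url>` scheme mirrors how public Wayback URLs look; the regex, the `rewrite_page` name, and the `/static/banner.js` path are assumptions made up for this sketch, and a real rewriter has to handle far more cases (relative URLs, CSS, JavaScript, and so on).

```python
import re

# Matches href="..." or src="..." attributes holding an absolute http(s) URL.
ATTR_URL = re.compile(
    r'''(?P<attr>\b(?:href|src)=)(?P<q>["'])(?P<url>https?://[^"']+)(?P=q)''',
    re.IGNORECASE,
)

def rewrite_page(html, timestamp, prefix="/web", banner_src="/static/banner.js"):
    """Point absolute links back at the archive and inject a banner script."""
    def repl(m):
        q = m.group("q")
        return f"{m.group('attr')}{q}{prefix}/{timestamp}/{m.group('url')}{q}"

    html = ATTR_URL.sub(repl, html)
    banner = f'<script src="{banner_src}"></script>'  # hypothetical banner path
    if "</body>" in html:
        return html.replace("</body>", banner + "</body>", 1)
    return html + banner

page = (
    '<html><body><a href="http://example.com/a">x</a>'
    "<img src='https://example.com/i.png'></body></html>"
)
print(rewrite_page(page, "20130515120000"))
```

The banner is added after the link rewrite so that its own `src` is left alone.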




How do you get hold of the list of URLs?



