
The Wayback Machine crawls stuff based on popularity (Alexa top million), search engine metadata (donated by blekko), the structure of the web, and the desires of our various crawling partners, ranging from the all-volunteer ArchiveTeam to 400+ libraries and other institutions who use our ArchiveIt system. And, finally, there's always the "save page now" button at https://archive.org/web/

There are big privacy issues to getting data from browsers. A lot of websites depend on "secret" URLs, even though that's unsafe, and we don't want to discover or archive those. That means we need opt-in, and smarts.

We do have a project underway with major browsers to send 404s to us so we can check whether we have the page... and offer to take the user to the Wayback if we do.
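
That kind of lookup can already be done against the public availability endpoint at https://archive.org/wayback/available, which answers with JSON describing the closest snapshot. A minimal Python sketch, standard library only (the example URL is just a placeholder):

    import json
    import urllib.parse
    import urllib.request

    def closest_snapshot(url):
        """Ask the Wayback availability endpoint for the closest
        archived copy of `url`; return its URL, or None if there
        is no available snapshot."""
        query = urllib.parse.urlencode({"url": url})
        with urllib.request.urlopen(
            "https://archive.org/wayback/available?" + query
        ) as resp:
            data = json.load(resp)
        closest = data.get("archived_snapshots", {}).get("closest")
        if closest and closest.get("available"):
            return closest["url"]
        return None

    # e.g. a 404 handler could offer this link to the user:
    print(closest_snapshot("http://example.com/") or "not archived")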


Are there any plans to support archiving Web 2.0 pages?

More and more people are starting to rely on "archive.is", since it handles Web 2.0 content without issue. But I'm concerned about the survivability of that service, and whether it can handle big growth.


That last feature is great... And reminds me I need to install Resurrect Pages on my new PC!


Is there a JSON API call that can be made to archive.org to archive a provided URL and get a success/fail response back?


Alas, there's no formal save-page-now API, but if you experiment with using it from a browser, it's not hard to call from a program: fetch https://web.archive.org/save/<url>. The return is HTML, but if you examine the response headers, you'll see that Content-Location: tells you the permanent archived name of that particular capture.

I call APIs like this "accidental APIs"! From looking at our traffic, we have quite a few programmatic users of it.
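
A minimal sketch of calling that endpoint from Python, standard library only (this is an informal interface, so the exact response headers could change; the User-Agent string and fallback message are just illustrative):

    import urllib.request

    def save_page_now(url):
        """Ask the Wayback Machine to capture `url` via the informal
        /save/ endpoint, returning the Content-Location header that
        names the new capture (or None if the header is absent)."""
        req = urllib.request.Request(
            "https://web.archive.org/save/" + url,
            headers={"User-Agent": "wayback-save-sketch/0.1"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.headers.get("Content-Location")

    location = save_page_now("https://example.com/")
    if location:
        print("capture: https://web.archive.org" + location)
    else:
        print("no Content-Location header; capture status unclear")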


Thank you!



