Waybackpack: download the entire Wayback Machine archive for a given URL (github.com/jsvine)
261 points by ingve on May 3, 2016 | 40 comments



Wow this is amazing!!! I've built my whole site around showcasing wayback content: http://www.StartupTimelines.org/

This definitely makes it a whole lot easier; I wish I'd had access to it from day one. Great work, guys/gals!!

Shameless plug, if you found it interesting, please consider donating:

https://www.tilt.com/tilts/startup-timelines-support-fund

http://archive.org/donate/?


First visit...

"Sorry, you've already viewed 3 Startup Timelines"


Sorry about that - this is a known bug we're trying to fix. If you refresh or view incognito, it might let you browse after a minute or so. This is in no way intentional to try to get you to sign up.


>This is in no way intentional to try to get you to sign up

Yes, it is. Why else would you put a 3-startup limit for non-registered users, and present them with a registration page? If the script is malfunctioning, turn it off. Maybe if I could actually see the content I would sign up, but since I have to sign up first, I guess I'll do what most others who are visiting your site for the first time are doing: Go away and never come back.

Do yourself a favor and get rid of intentional annoyances. You're already funding this thing with donations.


I think we can take someone at their word when they say:

    Sorry about that - this is a known bug we're trying to fix.
    This is in no way intentional to try to get you to sign up
Be civil. Don't say things you wouldn't say in a face-to-face conversation. Avoid gratuitous negativity. (-:

I'm sure this bug will be fixed shortly, right bakztfuture?


Yes - Orik, thank you so much for understanding. We're using a library called Flask-Limiter, so I'm looking into what could be causing this ... I probably misread the docs somewhere.
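
(For context, a per-IP limit with Flask-Limiter is typically wired up roughly like the sketch below; this is against a recent release, with an illustrative route and limit string rather than the site's actual code. One common gotcha: the default in-memory storage doesn't share counts between worker processes, which can make limits fire in surprising ways.)

    from flask import Flask
    from flask_limiter import Limiter
    from flask_limiter.util import get_remote_address

    app = Flask(__name__)

    # Keyed on the client IP; counts live in per-process memory by default.
    limiter = Limiter(get_remote_address, app=app)

    @app.route("/timelines/<name>")
    @limiter.limit("3 per day")  # illustrative: three views per day for unregistered visitors
    def timeline(name):
        return "timeline for " + name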

Startup Timelines was always meant to be free and accessible; it didn't even ask you to create an account until last month (I've been running the site for a year now). I don't want anyone to be upset, so I've quickly made an account you can use to browse if you've gotten this rogue error:

username: hn_user

password: startuptimelines1

(all accounts are full btw)

I'm sorry, and I hope this doesn't ruin your take on the site forever. There's a tour that walks you through the site when you register, so here are screenshots of the pages:

Tour page 1: http://i.imgur.com/5DCwdbg.png

Tour page 2: http://i.imgur.com/o7ghamJ.png

Tour page 3: http://i.imgur.com/iHN775V.png

Let me know if there's anything else I can do: bakz[at]bakzdesign.com ... sorry, and thank you again.


Surprising how little some sites have changed over the years.


My design is 5 years old. I've changed a few small things, and while I'd love to do a redesign, I also know that users HATE such changes.

Also, a few new users even compliment me on the great design. I guess it isn't that bad ;)


Internet Archive has an HTTP header called "X-Archive-Wayback-Perf:"

I can guess what it means but maybe someone here has some insight?

It certainly looks like their Tengine (nginx) servers are configured to expect pipelined requests. They have no problem with more than 100 requests at a time. See the HTTP header above.

Downloading each snapshot one at a time, i.e., many connections opened one after the other, perhaps each triggering a TIME_WAIT and consuming resources, may not be the most sensible or considerate approach. If you're just requesting the history of a single URL, pipelined requests over a single connection may be more efficient. I'm biased, and I could be wrong.

However, their robots.txt says "Please crawl our files." I would guess that crawlers use pipelining and minimize the number of open connections.

I have had my own "wayback downloader" for a number of years, written in shell script with openssl and sed. It's fast.

IA is one of the best sites on the www. Have fun.
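
For the curious, here's a rough Python sketch of the same idea (not the commenter's shell script): enumerate captures via the Wayback CDX API, then fetch them over a single reused keep-alive connection rather than opening a new connection per snapshot. The tool name and contact address in the User-Agent are placeholders.

    import time
    import requests

    session = requests.Session()  # one TCP connection reused via HTTP keep-alive
    session.headers["User-Agent"] = "wayback-fetch-sketch (you@example.com)"  # placeholder name/contact

    url = "http://example.com/"
    rows = session.get(
        "http://web.archive.org/cdx/search/cdx",
        params={"url": url, "output": "json", "collapse": "digest"},
    ).json()

    # rows[0] is the field header: urlkey, timestamp, original, mimetype, ...
    for timestamp, original in ((r[1], r[2]) for r in rows[1:]):
        snapshot = session.get("https://web.archive.org/web/%s/%s" % (timestamp, original))
        with open(timestamp + ".html", "wb") as f:
            f.write(snapshot.content)
        time.sleep(1)  # stay polite: one request at a time, with a pause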


How much load does this place on the Internet Archive? It'd be a shame if this thing's access patterns caused them trouble.


If I read the code correctly, it's one request at a time? That minimizes the stress; if we're slow, it'll slow down too.

It'd be nice if it identified itself in the User-Agent, so that we could complain to the right people if it were a problem.


Hi, Greg! Library author here. I'd be happy to add a configurable User-Agent. Perhaps the default would be a generic "waybackpack", which users could configure to add their contact info. Does that sound about right? Or would you prefer a different approach?

And, yep, the library is intentionally designed only to request one snapshot at a time.
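
For illustration, the general idea (not necessarily waybackpack's actual interface) is just an identifying User-Agent header on every request; the contact address below is a placeholder:

    import requests

    DEFAULT_USER_AGENT = "waybackpack"  # generic default
    user_agent = "waybackpack (jane@example.com)"  # hypothetical user-supplied contact

    resp = requests.get(
        "https://web.archive.org/web/20160503000000/http://example.com/",
        headers={"User-Agent": user_agent or DEFAULT_USER_AGENT},
    )
    print(resp.status_code)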


waybackpack would be a great default; encouraging the actual user to add contact info would be better for you because we could complain to them instead of you :-)


Updated, merged, and pushed to PyPI as part of v0.1.0: https://github.com/jsvine/waybackpack/pull/5

Thanks again for the feedback. Really appreciate it — and the existence of the Internet Archive and Wayback Machine.


IA is very supportive of automated access; they even have a post about how to use wget to batch-download lots of items: https://blog.archive.org/2012/04/26/downloading-in-bulk-usin...

Granted, that is not the Wayback Machine, but I am sure they love people using it.


Agreed. From a quick look at the code, it seems it just fires off every fetch request immediately after another completes.

Hopefully it gets patched to have a built-in rate limit (X requests per minute/hour).
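
Such a limit could be as simple as sleeping between fetches; a rough sketch, with an arbitrary requests-per-minute figure rather than anything the archive has asked for:

    import time
    import requests

    MAX_PER_MINUTE = 15  # arbitrary figure, not an archive.org policy
    INTERVAL = 60.0 / MAX_PER_MINUTE

    def fetch_all(snapshot_urls):
        # Yield each response, never exceeding MAX_PER_MINUTE requests.
        for url in snapshot_urls:
            started = time.monotonic()
            yield requests.get(url)
            elapsed = time.monotonic() - started
            if elapsed < INTERVAL:
                time.sleep(INTERVAL - elapsed)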


If it's already serial, and only works with one backend (rather than being an arbitrary mirroring tool like wget), then the Wayback server can easily "express its preferences" for rate-limiting by adding artificial delay to responses once requests exceed the rate-limiting threshold. Backpressure shouldn't be the client's responsibility.

(Traditionally it is, only because so many sites do nothing to protect themselves from clients "being too nice", so developers of generic mirroring clients give their users the option to ask for less than they want. This isn't sensible protocol design on either side; it doesn't optimize for, well, anything.)


Ha, fun. I made a similar tool not so long ago: https://github.com/hartator/wayback-machine-downloader/


I've been using this with great success too: https://github.com/hartator/wayback-machine-downloader


I wish I had an OS right now that could run this... I've wanted a tool like this for a long time, so I can reconstruct some of the SimCity series documentation. (Maxis had a sort of tradition of team members writing, as a PR stunt, detailed accounts of their work on the games. The few bits I could scavenge from Archive.org, since they're no longer available on EA's site, have been a great help to modding efforts, and to my effort to "restore" SimCity 4 to work on modern OSes.)


What OS might you be running that you can't run Python on it?


I was looking for something like this literally two hours ago. Thanks!


I've always been confused by how the Wayback Machine works. I feel like if they were able to partner with browsers to anonymously hash content and discover new pages, combined with doing a better job of version control, their index would be a lot bigger and more granular too.


The Wayback Machine crawls stuff based on popularity (Alexa top million), search engine metadata (donated by blekko), the structure of the web, and the desires of our various crawling partners, ranging from the all-volunteer ArchiveTeam to 400+ libraries and other institutions who use our ArchiveIt system. And, finally, there's always the "save page now" button at https://archive.org/web/

There are big privacy issues to getting data from browsers. A lot of websites depend on "secret" URLs, even though that's unsafe, and we don't want to discover or archive those. That means we need opt-in, and smarts.

We do have a project underway with major browsers to send 404s to us to see if we have the page... and to offer to take the user to the Wayback Machine if we do.


Are there any plans to support archiving Web 2.0 pages?

More and more people are starting to rely on "archive.is", as it handles Web 2.0 content without issue. But I'm concerned about the survivability of that service, and whether it can handle big growth.


That last feature is great... And reminds me I need to install Resurrect Pages on my new PC!


Is there a JSON API call that can be made to archive.org to archive a provided URL and get a success/fail response back?


Alas, there's no formal save-page-now API, but if you experiment with using it from a browser, it's not hard to call from a program: fetch https://web.archive.org/save/<url>. The return is HTML, but if you examine the response headers, you'll see that Content-Location: tells you the permanent archived name of that particular capture.

I call APIs like this "accidental APIs"! From looking at our traffic, we have quite a few programmatic users of it.
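
In Python with requests, the call described above looks roughly like this (error handling omitted; the example URL is a placeholder):

    import requests

    url = "http://example.com/"
    resp = requests.get("https://web.archive.org/save/" + url)

    # Content-Location holds the permanent path of the new capture,
    # e.g. /web/<timestamp>/http://example.com/
    capture_path = resp.headers.get("Content-Location")
    if capture_path:
        print("Archived at: https://web.archive.org" + capture_path)
    else:
        print("Save request returned", resp.status_code)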


Thank you!


Is there a way to download content using this? There is a zip I'm trying to get from a particular site and it keeps failing due to some kind of download cap.

Using --continue with wget doesn't work (I'm guessing they turned it off).


Does this let you get at material which is hidden due to the current robots.txt, even though it wasn't in force when the site was crawled?


That would be wonderful!

Also stuff that's been censored for other reasons.


Is there any way to download the assets of the website too? Right now the HTML has URLs pointing to archive.org.


You're halfway to a blockchain!


One area I'm interested in is legally binding verification of content at a specific point in time - for example, tracking changes to a breaking news article on CNN.

I'm not sure what technologies would be required to implement something like this, but I feel like the Internet Archive would be important, and Bitcoin might be a way of encouraging verification from a globally distributed network of third-party verifiers.


You should check out ZeroNet: https://zeronet.io You should also check out 21's new pay-per-call API features: https://21.co/features/


This idea has already been explored; here's one implementation: https://proofofexistence.com/


Hmmm, I'd say not quite. That's a precursor though, sure.


Thanks for the pointer; this might work if it had a vetted API for grabbing content online and signing it.


People use the Wayback Machine like that in court cases all of the time, mainly for patent prior art.



