Sorry about that - this is a known bug we're trying to fix. If you refresh or view in incognito, it might let you browse after a minute or so. This is in no way an intentional attempt to get you to sign up
>This is in no way an intentional attempt to get you to sign up
Yes, it is. Why else would you put a 3-startup limit for non-registered users, and present them with a registration page? If the script is malfunctioning, turn it off. Maybe if I could actually see the content I would sign up, but since I have to sign up first, I guess I'll do what most others who are visiting your site for the first time are doing: Go away and never come back.
Do yourself a favor and get rid of intentional annoyances. You're already funding this thing with donations.
Yes - Orik, thank you so much for understanding. We're using a library called Flask-Limiter, so I'm looking into what could be causing this ... I probably misread the docs somewhere.
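For context, here's roughly how a default limit gets wired up with Flask-Limiter (a minimal sketch, not our actual code; the "3 per hour" value and route names are placeholders, and the constructor signature varies a bit between Flask-Limiter versions):

    # Minimal sketch of a Flask-Limiter setup (not the site's actual code).
    # The "3 per hour" limit and the route names are placeholders; the Limiter
    # constructor signature differs slightly across Flask-Limiter versions.
    from flask import Flask
    from flask_limiter import Limiter
    from flask_limiter.util import get_remote_address

    app = Flask(__name__)

    # Key requests by client IP; unauthenticated visitors share the default limit.
    limiter = Limiter(get_remote_address, app=app, default_limits=["3 per hour"])

    @app.route("/startup/<name>")
    def startup(name):
        return f"timeline for {name}"

    # Pages that should never be rate-limited can be exempted explicitly.
    @app.route("/about")
    @limiter.exempt
    def about():
        return "about page"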
Startup Timelines was always meant to be free and accessible; it didn't even ask you to create an account until last month (I've been running the site for a year now). I don't want anyone to be upset, so I've quickly made an account you can use to browse if you've gotten this rogue error:
username: hn_user
password: startuptimelines1
(all accounts are full btw)
I'm sorry and hope this doesn't ruin your take on the site forever. There's a tour that walks you through the site when you register, so here are screenshots of the pages:
Internet Archive has an HTTP header called "X-Archive-Wayback-Perf:"
I can guess what it means but maybe someone here has some insight?
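If you want to poke at it yourself, printing the response headers for a snapshot shows it (a quick sketch; the snapshot URL is just an example, and the header isn't guaranteed to appear on every response):

    # Quick way to inspect Wayback response headers, including
    # X-Archive-Wayback-Perf when it's present. The snapshot URL is only an example.
    import requests

    resp = requests.get("https://web.archive.org/web/2016/https://example.com/")
    for name, value in resp.headers.items():
        print(f"{name}: {value}")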
It certainly looks like their Tengine (nginx) servers are configured to expect pipelined requests. They have no problem with more than 100 requests at a time. See the HTTP header above.
Downloading each snapshot one at a time, i.e., opening many connections one after the other (perhaps each triggering a TIME_WAIT and consuming resources), may not be the most sensible or considerate approach. If you're just requesting the history of a single URL, pipelined requests over a single connection might be more efficient. I'm biased and I could be wrong.
However, their robots.txt says "Please crawl our files." I would guess that crawlers use pipelining and minimize the number of open connections.
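A rough sketch of the single-connection idea, using requests.Session for connection reuse (keep-alive) rather than true HTTP pipelining, which most Python HTTP libraries don't expose; the target URL and timestamps are placeholders:

    # Sketch of fetching several snapshots of one URL over a single reused
    # connection (HTTP keep-alive via requests.Session). This is not true
    # pipelining, but it avoids opening a new connection per snapshot.
    # The target URL and timestamps are placeholders.
    import requests

    timestamps = ["20150101", "20150601", "20160101"]
    target = "https://example.com/"

    with requests.Session() as session:
        for ts in timestamps:
            snapshot_url = f"https://web.archive.org/web/{ts}/{target}"
            resp = session.get(snapshot_url)
            print(ts, resp.status_code, len(resp.content))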
I have had my own "wayback downloader" for a number of years, written in shell script, openssl and sed. It's fast.
Hi, Greg! Library author here. I'd be happy to add a configurable User-Agent. Perhaps the default would be a generic "waybackpack", with the option to add contact info for the user. Does that sound about right? Prefer a different approach?
And, yep, the library is intentionally designed only to request one snapshot at a time.
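Something along these lines is what I have in mind; the parameter name and default string are just a sketch, not a settled API:

    # Rough sketch of the configurable User-Agent idea; the parameter name and
    # default string are illustrative, not final.
    import requests

    DEFAULT_USER_AGENT = "waybackpack"

    def fetch_snapshot(url, timestamp, user_agent=None):
        # Callers could append contact info, e.g. "waybackpack (you@example.com)",
        # so the Archive can reach them instead of the library author.
        headers = {"User-Agent": user_agent or DEFAULT_USER_AGENT}
        snapshot_url = f"https://web.archive.org/web/{timestamp}/{url}"
        return requests.get(snapshot_url, headers=headers)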
waybackpack would be a great default; encouraging the actual user to add contact info would be better for you because we could complain to them instead of you :-)
If it's already serial, and only works with one backend (rather than being an arbitrary mirroring tool like wget), then the Wayback server can easily "express its preferences" for rate-limiting by adding artificial delay to responses for requests that exceed the rate-limiting threshold. Backpressure shouldn't be the client's responsibility.
(Traditionally it has been, only because so many sites are "too nice" and do nothing to protect themselves, so the devs of arbitrary-backend mirroring clients give their users the option to ask for less than they want. This isn't a sensible protocol design on either side; it doesn't optimize for, well, anything.)
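As a toy illustration of the server expressing its preferences (not anything the Archive actually runs; the thresholds are arbitrary), the server can slow clients down past a threshold instead of rejecting them:

    # Toy illustration of server-side backpressure: instead of returning 429,
    # the server sleeps before answering clients that exceed a request threshold.
    # Not anything the Wayback Machine actually runs; thresholds are arbitrary.
    import time
    from collections import defaultdict

    from flask import Flask, request

    app = Flask(__name__)
    request_counts = defaultdict(int)

    THRESHOLD = 100        # requests per process lifetime before slowing down
    DELAY_SECONDS = 2.0    # artificial delay applied past the threshold

    @app.before_request
    def apply_backpressure():
        ip = request.remote_addr
        request_counts[ip] += 1
        if request_counts[ip] > THRESHOLD:
            time.sleep(DELAY_SECONDS)

    @app.route("/web/<path:snapshot>")
    def serve_snapshot(snapshot):
        return f"snapshot {snapshot}"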
I wish I had an OS right now that could run this... I've wanted a tool like this for a long time, so I can reconstruct some of the SimCity series documentation. (Maxis had a sort of tradition of team members writing detailed accounts of their work on a game as a PR stunt, and the few bits I could scavenge from Archive.org, since this material is not available on the EA site anymore, have been a great help to modding efforts and to my own effort to "restore" SimCity 4 to work on modern OSes.)
I've always been confused by how the Wayback Machine works. I feel like if they were able to partner with browsers to anonymously hash content and discover new pages, and did a better job of version control, their index would be a lot bigger and more granular, too.
The Wayback Machine crawls stuff based on popularity (Alexa top million), search engine metadata (donated by blekko), the structure of the web, and the desires of our various crawling partners, ranging from the all-volunteer ArchiveTeam to 400+ libraries and other institutions who use our ArchiveIt system. And, finally, there's always the "save page now" button at https://archive.org/web/
There are big privacy issues to getting data from browsers. A lot of websites depend on "secret" URLs, even though that's unsafe, and we don't want to discover or archive those. That means we need opt-in, and smarts.
We do have a project underway with major browsers to send 404s to us to see if we have the page... and to offer to take the user to the Wayback Machine if we do.
Are there any plans to support archiving Web 2.0 pages?
More and more people are starting to rely on "archive.is" because it handles Web 2.0 content without issue. But I'm concerned about the survivability of that service, and whether it can handle big growth.
Alas, there's no formal save-page-now API, but if you experiment with using it from a browser, it's not hard to call from a program: fetch https://web.archive.org/save/<url>. The response is HTML, but if you examine the headers you get back, you'll see that Content-Location: tells you the permanent archived name of that particular capture.
I call APIs like this "accidental APIs"! From looking at our traffic, we have quite a few programmatic users of it.
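For anyone who wants to script it, a minimal sketch of the call described above; since it's not an official API, the behavior isn't guaranteed to stay stable:

    # Minimal sketch of the "accidental" save-page-now call described above.
    # Not an official API, so the behavior isn't guaranteed to stay stable.
    import requests

    url_to_save = "https://example.com/"
    resp = requests.get("https://web.archive.org/save/" + url_to_save)

    # Content-Location (when present) gives the permanent name of this capture,
    # e.g. /web/20160101000000/https://example.com/
    print(resp.headers.get("Content-Location"))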
Is there a way to download content using this? There is a zip I'm trying to get from a particular site and it keeps failing due to some kind of download cap.
Using --continue with wget doesn't work (I'm guessing they turned it off).
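One thing I can check is whether the server still honors Range requests, since that's what --continue relies on; a quick probe (the URL here is just a placeholder):

    # wget --continue depends on HTTP Range requests. A quick probe to see
    # whether the server honors them; the URL is a placeholder.
    import requests

    url = "https://example.com/big-file.zip"
    resp = requests.get(url, headers={"Range": "bytes=0-1023"}, stream=True)

    # 206 Partial Content means ranges work; 200 means the server ignored the
    # Range header and is sending the whole file.
    print(resp.status_code, resp.headers.get("Accept-Ranges"))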
One area I'm interested in is a legally binding verification of content at a specific point in time - for example: tracking changes to a breaking news article on CNN.
I'm not sure what technologies would be required to implement something like this, but I feel like the Internet Archive would be important, and Bitcoin might be a way of encouraging verification from a globally distributed network of third-party verifiers.
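The building block I have in mind is just a content digest captured at a known time, which could later be anchored somewhere tamper-evident (a timestamping service, a blockchain, whatever); a sketch of that first step only, with a placeholder URL:

    # Sketch of the first step of point-in-time verification: hash the exact
    # bytes of a page so the digest can later be anchored somewhere
    # tamper-evident. The URL is a placeholder; anchoring itself isn't shown.
    import hashlib
    import time

    import requests

    url = "https://example.com/breaking-news"
    resp = requests.get(url)

    digest = hashlib.sha256(resp.content).hexdigest()
    print(f"{int(time.time())}  {url}  sha256={digest}")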
This definitely makes it a whole lot easier, wish I had access to it from day 1. Great work guys/gals!!
Shameless plug, if you found it interesting, please consider donating:
https://www.tilt.com/tilts/startup-timelines-support-fund
http://archive.org/donate/?