Hi! I wrote the article. First off, I find it disingenuous that you don't mention you work for Google. But, hey! I'll give you the benefit of the doubt that it was a simple oversight.
Addressing your comments:
1. Google News Archive is, without question, a dead project. No new material is being added, no new development is being made, and it's unsupported. They removed the News Archive and homepage and redirected it to News.
The method Google suggests for web search isn't limited to news articles, making it effectively useless for research. (It shows everything indexed in Google.)
You can search for some newspapers in Google Search, but it's impossible to find any date before January 1970, order by date, or filter by publication. You're stuck with post-1970 date filtering for all papers, ordered by relevance.
https://www.google.com/search?q=site%3Agoogle.com%2Fnewspape...
2. I didn't say Groups was dead. I said it was effectively dead for research purposes, which is true. For example, you can't search or filter by date across groups anymore:
https://groups.google.com/forum/#!search/linux
Not to mention, only a fraction of the total posts are indexed and available in Google Search. For example, changing your query to limit to 1995 only results in 70 posts. There were many more than that being posted monthly in 1995 in comp.os.linux.advocacy alone.
3. It's entirely plausible that Google's library partners are running low on books, though that doesn't explain why the project appears to be completely dormant. As I mentioned, the official blog stopped updating in 2012 and the Twitter account's been dormant since February 2013. It doesn't seem like any book's been added in the last year -- no new books from January 2014 to today:
https://www.google.com/search?q=a&biw=1146&bih=933&source=ln...
4. The 20% time thing is interesting. As a Google engineer, I imagine you'd have a better perspective on that than I would.
Former employees have explicitly said that 20% time no longer exists in the way it used to, and current employees, including here on Hacker News, say that it exists but only on top of your existing workload (effectively making it 120% time). I tend to trust them over a PR person, but really, that was a brief aside in my overall article.
The fact that a tiny fraction of the former functionality of a service is possible, albeit with an obscure and user-unfriendly method, does not detract from the overall point:
Google's current priorities don't appear to be in archiving the past.
For a specific example of real problems caused by killing Google News Archive search, it affected the work of Wikipedia editors. I and a lot of other editors had found it very useful as a high-quality and fast way to find good sources for articles we were working on (https://en.wikipedia.org/wiki/Wikipedia:Free_English_newspap...). There's not really anything else like it, so you end up combing through tons of general Google results that don't qualify as reliable sources in order to find a few newspaper articles.
Sigh. Standard disclaimer: nothing I write here has got anything to do with my job, I'm not representing my employer in any way. I cannot talk about anything I know personally. I've also never worked on any of these projects. However, I can read, describe, and link to public material on the internet like anybody else.
In all honesty, I have no interest in the news archive projects; I read an HN link, followed some of the links in it, and said "this isn't what I was promised on the previous page". It sounds like you just made a second attempt at writing the article. I suggest you take the original one down and put this one up instead; it stands up to at least the completely superficial fact-checking of reading the links in it, which makes it a significant improvement - although it now appears to be a list of fairly straightforward bug reports. (I like bug reports. Bug reports are actionable.)
Engaging with the subject would require substantially more effort on my part to research and investigate what's going on here, because I don't know anything about it beyond what I read in links here. I'm not going to do that. However, I would encourage anybody with an interest in this subject to do the research and write up their findings.
Saying that these problems look like bug reports is dismissive of the depth of the problems. Stopping development of products and removing access to features isn't unintentional, and a lot of people have already complained about each of these problems over the years. Andy's article is making a larger point that what has happened to these products is part of a pattern, that Google is not being as responsible in stewarding its information as its mission statement said it would try to be.
> Saying that these problems look like bug reports is dismissive of the depth of the problems.
Personally I completely disagree with your priorities. I think a bug report is far more valuable, since people can act on bug reports and make things better, while I would not anticipate any meaningful action as a result of speculation about mission statements.
That makes me wonder why the Chromium bugtracker appears to effectively be a black hole, if the bug reports are so "valuable". I don't think I've ever gotten a single response to my CSS calculation bug report.
> Hang on, this article links to instructions on how to do some of the things it claims can't be done.
After confirming that the linked sources disagree with major claims from the article, I've flagged it. I encourage others to do the same, for the sake of truth and honesty.
This is more cheap dismissal than a "superficial investigation". The author writes "Google News Archives are dead, killed off in 2011, now directing searchers to just use Google." and you have discovered that this is true?
Why isn't archive.org distributed P2P at this point? Instant, massive, redundancy.
Let me download some software and allocate how much of my drive space I'd like to help them with. The software would then intelligently use that space as their distributed backup system. Then they can focus on collection and collation with one less thing to worry about.
Let's guess that the average person can contribute some 300GB of their disk space. If IA wants to keep a minimum of 5 copies on the network (probably a safe number given how many people will be constantly dropping out), they need (20x1024x1024x5)/300 = 349,525 people contributing their disk space. That doesn't seem even close to attainable.
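The back-of-envelope arithmetic above can be checked with a short Python sketch. It assumes, as the 20x1024x1024 term implies, an archive of roughly 20 PB expressed in GB; the function name and its parameters are hypothetical, chosen only for illustration. Rounding up gives one more contributor than the truncated figure quoted above.

```python
import math


def contributors_needed(archive_pb, copies, gb_per_person):
    """Estimate how many volunteers are needed to hold `copies` full replicas.

    archive_pb    -- assumed archive size in PB (20 is inferred from the comment)
    copies        -- minimum number of replicas to keep on the network
    gb_per_person -- guessed disk space each volunteer contributes, in GB
    """
    archive_gb = archive_pb * 1024 * 1024  # PB -> GB, using binary multiples
    total_gb = archive_gb * copies         # space needed for all replicas
    return math.ceil(total_gb / gb_per_person)


# Figures taken from the comment: 20 PB, 5 copies, 300 GB per person.
print(contributors_needed(20, 5, 300))
```

Changing the guesses shifts the answer a lot: doubling the per-person contribution to 600 GB halves the number of people required, while dropping to 3 copies cuts it by 40%.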
I originally thought that wasn't a lot of people. However, Folding@Home (which ought to be the most well known share-processing power group) only has 179,234 computers in its network.
This is a good example of narrative trumping reality, and we geeks are every bit as susceptible as anyone else.
The actual thesis of the article is that Google is losing interest in archiving efforts. Perhaps true, perhaps partially true, perhaps false, or perhaps unsupported either way. Entirely valid and worth exploring.
The second thesis is that the Internet Archive is doing good work. Great!
The title, however, is “Never trust a corporation to do a library’s job”. A generalization which no article can prove or disprove. Just dogma.
As an old guy in this industry, I may have a perspective here that many others lack in this discussion.
When Google acquired the dejanews archive, it was AWESOME! The search capability was exponentially better than what came before it. Google made a big production about this, including a lot of rhetoric about the importance of preserving the heritage of the Internet.
Then one day, mercurial Google released a competitor to Yahoo groups, and seeded it with the Usenet archives. Except that most every aspect of Google groups was worse than what came before. I gave up on the product a decade ago as my job role changed, but at that time things like date searching were completely and utterly broken. Perhaps it's fixed now.
As people progress within Google, priorities will shift. So if you like some free offering without a monetization strategy (e.g. Google Reader), you're likely to be disappointed. It's not evil, it is just life.
Librarians are a different lot. Preservation of knowledge is their thing. While the headline may be a little dramatic, it is accurate.
Correct or not, the article's thesis is more like you can't know which corporations to trust with archiving, therefore you can't (or shouldn't) trust any. It's not possible to determine, a priori, when or if Google will decide that archiving is no longer interesting.
Apparently reports of the demise of Google's archive are greatly exaggerated, although the fact that they are plausible represents the danger.
Especially apt considering the first libraries were corporations. Non-profit corporations, that is. But people forget that "public library" is a 20th century neologism.
The libraries at Pergamum and Alexandria were, AFAIK, state run affairs basically sponsored by the rulers at the time. Scholars from far-and-wide were invited and given a wage and services so that they could focus on scholastics. The rulers donated the [cost of] books initially but also it seems taxed all ships entering Alexandria; books being retained and copies produced and returned to the original owners.
It seems pretty close to the modern conception of a public library to me - paid for by the state, directed towards scholarship, open to the public, run for non-fiscal profits.
A corporation to me says 'a closed group seeking its own ends'. Whilst these state-sponsored institutions were probably primarily for the purposes of the rulers, I'm not sure equating that to corporations is helpful. As part of the Mouseion it seems the library at Alexandria was much closer to a public work than a private enterprise.
I'm not a historian, just someone who's read a few articles on this particular [first?] early library (there are some good AskHistorians threads on it).
I struggle with something similar. Building reocities was a one-time affair, hosting and maintaining it really starts to add up. But I'll keep it alive as long as I can and as long as it is being used.
archive.org is awesome but it has limits on what it can archive. Many of the sites I look for there have broken images or missing Flash elements, and with more and more sites using complex JavaScript the limits of what archive.org can curate are only going to become more of an issue.
There is tons of stuff from the 90s that is gone. For example, back in the day there were a few audio shows that were recorded in RealAudio. archive.org doesn't capture that stuff. I was going through the archives of bluesnews.com for a project and found all these references to interviews with people like Carmack and Romero and other more obscure people and the audio is just gone. I tweeted at one of the guys who owned the company and he says the hard drives are in storage and he wants to get them online again one day but that day may never come.
Another example, gamespot used to have some good articles about the history of things like rts games and such. archive.org has most of that stuff but not all. I messaged the people doing the tech support on the site about it and they were like "yeah, that was a few site upgrades ago" and had no interest in trying to get that stuff back online properly.
With joystiq and tuaw shutting down it is only a matter of time before aol just pulls the plug on those sites, too.
All that is true but it was also true before digital media. While saving "everything" (whatever that means exactly) may or may not be a laudable goal there are always going to be practical limits to what any organization can realistically achieve for the reasons you cite and others. I could probably name any number of online magazines that have gone belly-up or at least restructured and old content in CMS systems is effectively gone forever as a result. I'm not sure how you would even approach archiving a complex site with high fidelity.
Even if the Library of Congress, say, took the task on with good funding, lots of things wouldn't be preserved. (And I'm not sure I would like seeing the government in this role in any case.)
"Even Google Search, their flagship product, stopped focusing on the history of the web. In 2011, Google removed the Timeline view letting users filter search results by date, while a series of major changes to their search ranking algorithm increasingly favored freshness over older pages from established sources. (To the detriment of some.)"
I don't know if this is what they mean, but I can search by date just fine:
Overall, I agree with the article, but we have the internet archive for the internet, perhaps it's time for another organization for everything else. I am sure there are those organizations, but they don't seem all that large or effective.
I think the difference is that the date search is for finding stuff that was originally posted / crawled in that date range. (I believe) the original comment was regarding searching some HTML, for example, as it was at that date.
That is to say: if you had a page that changed over time, you would be able to search for that page as it existed in a particular date range, not just searching for a page that was posted / first crawled in that range.
I can't really disagree with the basic point. Expecting a for-profit organization--even if they're Google--to reliably over a long period of time provide an archiving service that's mostly a money [EDIT] leak isn't a realistic hope. The fact that it's embroiled Google in a number of ongoing legal disputes doesn't make it any easier.
On the other hand, it's not as if anyone (other than the Internet Archive of course) has exactly been stepping up to the plate. There's also the question of what a library is for these purposes and what do they have the right to do with respect to archiving copyrighted digital text and media.
At some point Google Groups got really inconsistent, where it didn't turn up many, many postings and threads when searching for them. For some time you could find them if you tried several of their country TLDs (google.de, google.com), they brought up different results.
A bit later even that didn't work anymore, Google Groups has spotty coverage, at best.
Then they hid their advanced search interface. If you knew the URL you could still use it. Then they removed it altogether.
Nowadays you probably have no way to search for something specific ("author:", "group:", "mid:"); nothing works satisfactorily anymore.
That's sort of the point of the article, no? And it's not like Deja News was likely to stay functional in any case.
It's probably also worth pointing out that before Deja News and Google were ever involved, a large chunk of Usenet came close to being lost [1] and significant chunks of it have been in any case. So it's not as if preservation necessarily happens in the absence of corporate involvement either.
What I don't understand is why companies aren't held accountable for their obvious responsibilities. Google has to publicize these archives. It shouldn't be up to them if their public records are publicly accessible. I was born on this planet and I have the right to be able to get a copy of any Usenet archive Google has. For free. Whenever I want.
I really really hate them for locking up this data (like so much it's not funny, like major humanitarian crime imho).
I don't see how Google should have any responsibility wrt. Usenet.
Google is not the only party having Usenet archives. They are not even in any way special. They aren't the "official" archives, just as Deja wasn't.
If you're really interested in getting your hands on Usenet archives, going way back to the Ice Age, there are plenty of ways to do so.
Especially with the huge Hamster community (http://www.tglsoft.de/freeware_hamster.html), who are archiving almost fanatically, you should be able to get virtually everything you'd want just by asking nicely.
Other parties who certainly have archives are all the major news servers.
The point isn't that they are required to be custodians of Usenet, it's that they volunteered to become a custodian, and then dropped the ball on that function in the most noxious way possible.
Tying it back to the article, you can't trust a company to do something in the public interest, or maintain that work in perpetuity unless you pay them.
Usenet is a limited example, a more impactful one would be Google locking down access to Maps, including all of the metadata contributed by the public. There's nothing stopping them from doing that.
> Especially with the huge Hamster community (http://www.tglsoft.de/freeware_hamster.html), who are archiving almost fanatically, you should be able to get virtually everything you'd want just by asking nicely.
What is the Hamster community? It sounds interesting. The link points to some software that could be used to help archive, but doesn't say much more (at least browsing around the Google translation of it).
I really hope this comment was intended sarcastically. It disturbs me that I can't tell. It disturbs me even more that other commenters believe it wasn't.
Kinda skipping over the part where they grabbed it at death's door, aren't you?
Suppose Google had done nothing. The alternative wouldn't have been a beautifully indexed, easily searchable completely up-to-date Deja News archive today. It would have been that archive rotting in a landfill or that equipment being wiped and reused.
Has Google's stewardship of the Deja News archive been disappointing at times? Of course. But the idea that Google has taken anything away from you or anyone else reflects an amazing sense of entitlement and an astonishing lack of perspective.
That's just making things up. Deja News could have ended up anywhere, including at the Internet Archive. The people running it sold it to Google in the belief it would be kept running. Had they not done it someone else would have.
The talk about entitlement completely misses that Google bought it only to promptly run it into the ground. If you do that in a market-leading position to your competitors that's downright illegal in most jurisdictions.
That may not be applicable here, but you should be able to understand why this upset a lot of people.
> Deja News could have ended up anywhere, including at the Internet Archive.
This is exactly what I'm talking about. The idea that a company that is falling apart and selling itself off in pieces was going to donate one of its last major assets to charity... I don't even know where to start.
That was the whole problem. Running the archive wasn't a business (at least deja.com never figured out how to make it one and eventually gave up trying), but the archive ended up in the hands of investors and/or creditors that didn't care about that. You may be partially right - if Google hadn't bought them someone else might have (maybe Yahoo, AOL, Microsoft, or ...). I'm not sure about that (which is why the Google deal was good news at the time), but let's suppose you're right. All that would mean is that you would have ended up complaining about how some other buyer ran it into the ground. If you're going to make the case that Google got in the way of the Deja News archive's happy ending, you need a lot more than that.
When European style central libraries were first established, they were actually legally mandated copyright holding libraries, as in they were an arm of the government that was entitled to copies of any work published, regardless of authorship and licensing. The library had a perpetual legal "copyright" codified in law to any works produced by any author, and had the legal power to acquire, buy, and make copies themselves of any works that needed to be archived. They were authorized scribing/photocopying centers so to speak. Modern day libraries have shed much of their powers, but those powers are why we have access to historical texts in libraries today (instead of them being destroyed and lost in history).
In light of archiving being a money-losing proposition for private corporations, it seems that we have the need for an internet/online library to have the right to copy/hold the entire internet in perpetuity for the use of future humankind - that's the mandate of the Internet Archive, and it would be great to see that codified in a UN resolution, for example.
Is it just me or is the Wayback Machine behind a 2400kbps modem? When I try it either doesn't work or it takes minutes to load.
I can also say that hard experience has taught me to fear nonprofits. A corporation makes money by satisfying needs as well as from rich people financing it. Non profits are financed by rich people so you are working 100% for the 1% instead of 99%.
You see all the same niggardliness with nonprofits but at least in a corporation it is possible you can improve your service and make money and get rewarded for it. Nonprofits tend to be lose-lose or no deal.
This sort of question will become obsolete, once data are freely available. 4TB tarball with Usenet archive. And nice ecosystem of open-source tools for mining.
Already happened with maps thanks to Open Street Map.
Usenet's just an example. Its hosting at another site with better search tools could (I assume) be dealt with relatively easily given that it's pretty much just text. There may, of course, be issues of which I'm not aware. After all, this hasn't happened.
By no means though is the ongoing archiving and sharing of petabytes of complex web sites and other information repositories a simple problem even if the data were readily available and the issues associated with rehosting copyrighted material worked through.
Breaking it down -- In a world that keeps increasing its demands of you, you either die after a long streak of satisfying increasing demands (and giving observers the impression your capability was infinite -- a hero!) or you live long enough to see those demands exceed your capability to satisfy (and giving observers a clear view of your shortcomings -- a villain!)
One of the article's main points was that the mission has changed: it's no longer about "all the world's information" because it (storing old historical information) doesn't help their business.
> The Internet archive is a 501c(3) corporation, isn't it?
That's just the way the US tax code labels it. The main distinction is that as a non-profit, single-purpose organization the Internet Archive exists only to archive things and isn't subject to changing shareholder whims or takeovers.
It's much easier to imagine e.g. Google getting pressure to drop Google Books if it was deemed unprofitable.
A library is a public institution, and so thinking of it as a "non profit" doesn't really fit. It could be thought of as an organisation that the public has shares in, one in which the public are the angel investors. Thus, we, the people, have an investment in a library. Allowing a private corporation to run a part of our institutions effectively dilutes the investment of the angel investors.
This changes the thinking about money and sustainability from non-profit, charity through to for-profit to one of public ownership.
The lesson that we the investors (and thus a kind of board member) in our public institutions are learning is that big, public-spirited, helpful corporations are ultimately corporations, and will ultimately behave as such. And we should never allow our investments to be diluted in this way again.
It complains that the news archives frontend is gone, but then links to the page which explains how to do the same things using search: https://support.google.com/news/answer/1638638?hl=en
It also complains that groups is dead because you can't search by date... but the exact same method used in those instructions works just fine: https://www.google.co.uk/search?q=site%3Agroups.google.com+a...
The article claims that book scanning is slowing and links to an article which says it's still going in some places, but explicitly says that it's slowing down because some of the libraries are running out of books that need scanning.
It links to an old Quartz article from 2012 claiming that "20% time is dead". After the first three paragraphs, that article links to the rebuttals: http://qz.com/116196/google-engineers-insist-20-time-is-not-... http://qz.com/117164/20-time-is-officially-alive-and-well-sa...
I'm not all that interested in arguing the title point of the article, but when an article provides a whole stack of "evidence" and superficial investigation reveals that most of it does not support what the article claims, I question the motives of the author.