Google tried to make this work, but they were sued; and then they made a deal, and then many, many people objected to the resulting deal. This is the usual process whereby a corporation is first criticised for having too much power, and then when they relinquish power they are criticised for not doing enough.
> Page wanted to know how long it would take to scan more than a hundred-million books, so he started with one that was lying around. Using the metronome to keep a steady pace, he and Mayer paged through the book cover-to-cover. It took them 40 minutes.
> Michigan told Page that at the current pace, digitizing their entire collection—7 million volumes—was going to take about a thousand years. Page, who’d by now given the problem some thought, replied that he thought Google could do it in six.
> In just over a decade, after making deals with Michigan, Harvard, Stanford, Oxford, the New York Public Library, and dozens of other library systems, the company, outpacing Page’s prediction, had scanned about 25 million books. It cost them an estimated $400 million. It was a feat not just of technology but of logistics.
> At its peak, the project involved about 50 full-time software engineers. They developed optical character-recognition software for turning raw images into text; they wrote de-warping and color-correction and contrast-adjustment routines to make the images easier to process; they developed algorithms to detect illustrations and diagrams in books, to extract page numbers, to turn footnotes into real citations, and, per Brin and Page’s early research, to rank books by relevance.
Doesn't that take you back to an optimistic time when Google was exciting, and we thought that it could do amazing amounts of good for the world? I miss that era.
The "deal" resulted in anyone being able to download the full books if they go to a university library which is a partner of Hathi Trust. Of course, after that download, you have the full PDF and you can do whatever you want!
This isn't protection for copyright holders. (Hathi Trust got its books from Google Books scans and doesn't pay copyright holders.) This "deal" isn't helping anyone and it hurts researchers.
HathiTrust does not provide access to in-copyright materials to anyone. They provide research access for "non-consumptive" use through the HathiTrust Research Center.
That's not correct. First of all, all books are copyright by their authors.
Anyone can go to a partner library (there's a list on the login page of the Hathi site) and download those books as PDFs. Try it! I've done it many times.
The "deal" (which failed, btw) was a crappy one. The only solution is legislation, not a "deal" made with the Authors Guild that would give Google and only Google special status.
Here's another little-known fact: Every American is entitled to a library card from the Library of Congress (you have to go there in person to get one).
The Library of Congress is a Hathi Trust partner! So if you go get that card, you can download all of the out-of-print books that Google scanned on your own computer. No copyright holders are getting paid (and no one is being harmed), so why all these barriers in-between?
I can confirm that the Library of Congress Reader card is all it takes to log in. And you don’t have to be a US Citizen to get one, but you do need a passport or other US-recognized identification to present and validate in-person in D.C. And you have to do a bit of research on how and where to get it; they don’t just hand them out at the front desk as other libraries might. The Library of Congress uses the card to distinguish researchers from one-off tourists, and so while the card is easy to get, they have just enough process in place that it’s clear it’s not a souvenir and you have to traverse a maze of hallways to get it. (Or you had to when I did, at least...) But once you have it, just log in online and you’ll have access to Hathi Trust here.
It's too bad you can't just plug in a license, passport number or something online and get a virtual card. Seems like such a wasted opportunity to expand access to resources for pretty much free.
Yep. I’m a Canadian, got one while visiting family in the US. I did at the time work for the Toronto Public Library, but that wasn’t a consideration for them. :)
Thanks for pointing this out. I'm graduating relatively soon and was disappointed that I couldn't download books after that. Just tried my Library of Congress login and everything worked great.
Note that a Library of Congress card expires every 2 years as I recall.
Incredible as it may seem, I've actually seen that asserted. Years ago when I was a kid, before we had internet, my family would get a lot of magazines. Whoever picked up the mail, usually whoever came home most recently after it was delivered, would flip through every magazine and tear out all the advertisements and throw them in the trash before putting the magazine on the counter for the family to read. I wonder, was that 'theft'? We also used to change the channel or mute the television/radio whenever advertisements came on. Was that 'theft'?
Usually people who call adblocking theft start squirming when these pre-digital examples of ad avoidance are put to them.
Unless that author, musician, or artist operated in a vacuum, their work is a direct derivative of the work of other authors, musicians, and artists.
Every dollar a musician earns is, therefore, a dollar they take out of the hand of the musicians whose work they based theirs off of.
This is why the public domain exists. You can make derivative works without paying anyone... But your work will fall into the public domain, so that you pass this benefit on to the next generation.
But you don't take money out of their hand, because there is no money to take out to begin with. Or did you mean imaginary money? If you did, well, I could come up with many different ways as to how you are doing the exact same thing to virtually anyone. :P
Anyways, you can't physically remove and deprive an owner of an idea. It doesn't fit the definition of theft at all.
If the public library system weren't grandfathered in it would never exist today. Too many people want their cut and nobody is willing to make a stand for the public good.
I think maybe that's his point. There are some crazy people in this world, so no matter how reasonable anything is there will always be somebody who gets frothing mad over it for no rational reason.
Google Books has failed to live up to its promise as the company has moved away from its original mission of organizing information for people.
Google was only about organizing all the world's information while search ads were an unlimited fountain of money. As Google's ability to generate money with search ads has dwindled, their more grand (and not monetizable) projects have been either starved for resources or outright killed.
Sure the lawsuit was a pain. And book publishers are turds for arguing that they still have rights to books that they won't publish ever again. But the courts found that there was nothing wrong with Google having the information[1]. That trove of text could be the world's greatest source of knowledge, but as we all know, people using internet search for work never click on ads and not enough of them are willing to pay a subscription service price to cover the cost of infrastructure. Google hoped at one time that they would make money by printing on demand those books that were out of print but people wanted, but that was shot down by short-sighted publishers and agents. Perhaps it will be taken up by Amazon, which has the resources to do it.
It is sad. But it's not just google books, it's google search, google news, youtube, etc.
There was a time when all of google's properties catered to the users. Their search engine was the best. Google news was the best aggregate site. Youtube recommends used to be amazing to the point you could spend hours following their recommends.
Now google search, google news, youtube, etc are all garbage. It doesn't serve the people. It serves corporate interests. You can thank media companies and the elites who pressured them for that.
No need to cast aspersions on 'elites' and 'corporate interests'. What is pressuring them is that the ratio of the amount of money coming in to that going out has to be maintained at a certain level for Google to remain Google. They really have only two choices there: either sell more ads (which generally means putting ads on more things, or coming up with new ways to charge for new things like being in the 'shopping' box on product searches) or cut costs, which means shutting down projects, reducing staff, etc. Depending on how you look at it, Google gets something like one 500th of what they used to get for an ad on their web site.
Am I casting aspersions or just stating facts? Google changed because of pressure from the elites and corporate interests who used the media to badger them. It certainly isn't in google's interest to make their product worse purely to benefit others.
Your previous comment was attributing without evidence, actions of malice by descriptive but undefined third parties. That is the definition of "casting aspersions."
"Stating facts" would start with something like, "See this evidence that Google's policies were changed by <corporate entity> or <person or persons>."
Since you are doing the former, and not the latter, I conclude that the answer to your question is that yes, you are casting aspersions.
I'd assume you'd already know that media and elite pressure is why google changed since most people here work in the tech industry. Are you new to HN or do you work in a non-tech industry?
This reporter claims that she got youtube to change its search list.
These channels had been up for many years. Why do you think all of a sudden google decided to remove them?
Certainly it wasn't corporate, media or elite pressure. So then who? Aliens? When Chinese or Russian social media companies remove and change their policies, why do you think that is? Aliens as well?
After 10 years of spectacular success of youtube being "you"tube, why did it suddenly become "corporate"tube? Why did they change their recommends, trending, etc? Must be aliens. It can't possibly be the elites and the media constantly attacking it?
"Facebook and Google are doomed, George Soros says"
Yeah, in the beginning it was to gather lots and lots of people, which you can only do by being useful or living up to your claims, and as it gathered attention from Governments and so on, it had to comply with their regulations, effectively making it less and less useful. I remember the days when I could find any PDFs back then! Those days are over.
> If I worked at Google, I would have implemented a text-based date-prediction algorithm to flag erroneously classified books. (I have actually done this and sent a list to the HathiTrust of books they may have erroneously released into the public domain. It works).
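For anyone wondering what a check like that could look like, here is a rough sketch. To be clear, this is not the quoted author's method (the quote doesn't describe it). It assumes a much cruder heuristic: if a book's own text repeatedly mentions a year later than its catalog date, the catalog date is probably wrong. The function names, the `min_mentions` threshold and the `slack` margin are all made up for illustration.

```python
import re

# Four-digit years from 1500 to 2099 that stand alone as tokens.
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")

def latest_year_mentioned(text, min_mentions=2):
    """Return the latest year appearing at least `min_mentions` times, or None.

    Requiring repeats is a crude guard against stray numbers (page counts,
    prices, OCR noise) that happen to look like years.
    """
    counts = {}
    for match in YEAR_RE.finditer(text):
        year = int(match.group(1))
        counts[year] = counts.get(year, 0) + 1
    repeated = [year for year, n in counts.items() if n >= min_mentions]
    return max(repeated) if repeated else None

def is_suspicious(catalog_year, text, slack=1):
    """Flag a book whose text refers to years after its catalog date."""
    latest = latest_year_mentioned(text)
    return latest is not None and latest > catalog_year + slack
```

A real classifier would presumably use vocabulary and spelling drift rather than literal year mentions, since plenty of books never state a date in the body text. This version only catches the easy cases.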
If you're using HathiTrust seriously and aren't affiliated with a partner library, consider Hathi Download Helper to get complete public domain books archived to local storage. I wrote an earlier command-line version of the tool. Someone else built a GUI and put in the work to keep up with the evolving API.
I often use Google to locate a book, then check Internet Archive and HathiTrust if it's old enough that it should be public domain under US law. I really appreciate HathiTrust putting in the effort to check copyright renewals and make more of their materials fully visible. I don't appreciate the technical barriers to downloads that they erect, but that's out of the hands of the developers working there. As long as their web viewer shows individual pages you can be sure there will be a way to reassemble full books.
Once I needed a copy of The Congressional Record from the 40's and Google had marked it non-public! I had to make several calls to Hathi to explain that the Record is public by law and then they had to contact Google to get the restriction lifted.
It's a lot of rigamarole for information that researchers need.
Google Books has issues of Billboard magazine dating back to 1942. It used to be valuable for research, but it's become much less so over the years. Currently, search results that return actual magazine issues are limited to the first page. After that, it's just normal Google links. Even searching for something like Elvis or Glenn Miller, both of whom should have been in a crapload of issues, returns only one page of relevant results.
Trying to search by date is very hard. Limiting search to "Glenn Miller October 1942" might return one or two relevant results, or it might not return any. Trying to search by issue date doesn't work at all.
They have an index of Billboard issues which allows you to go to individual issues and read them, but the index stops at 50 pages, and for a weekly magazine, that limits the index to only a handful of years. Using the index, you can't go directly to issues before the 1980s, and with search by issue date useless, that means you're just out of luck if you want to see a particular issue in the 1970s.
They do seem to be crippling book search on purpose. Just yesterday I was looking for "PC Mag 1997 january Pentium MMX" and Google refused to return PC Mag 7 Jan 1997 issue results. What's even weirder, clicking "browse all issues" returns
The requested URL /books/serial/ISSN:08888507?rview=1 was not found on this server
but "About this magazine" will happily give you a list of all scanned issues :o and opening the January one will let you search it and will return positive results.
I completed a Masters in Mathematics back in 2014, involving a lot of historical research into the development of geometry. At the time, the ease with which I could open up books written over a century ago, search them and read them, was a fantastically useful tool.
I have some of those sources still on my hard drive in their scanned PDF format. They've now effectively vanished from the open internet. So much, available for such a short window. Our children will never believe us when we tell them what was once right there at our fingertips, and those that do should never forgive us.
It would be nice if anyone could build such tools, but all of that data is locked up inside of places like Google Books and Hathi Trust. Google isn't even interested in making their metadata available, other than by running searches.
This breakage reminds me of similar constrained-search problems with the Google Groups/ Dejanews USENET archive. Once upon a time it was nice to research with.
It reminds me of the demise of google's code search and github's woefully inadequate code search. Debian's Code Search is okay, but github allowing it would be great.
Searching within books is broken too. If you search for 'cat', any instance of the word 'cats' will also be returned (and any other word beginning with 'cat'), but with the message, "No preview is available." No link to the page of the result is provided. It seems like the kind of bug that should be straightforward to locate and fix, but it has been this way for years. (My guess is that the tool that builds previews doesn't recognize partial matches.)
It's not a bug. It's on purpose. They actively limit previews, full page views, and downloads -- even though you can go to a university and download the same book from Hathi Trust.
Hathi Trust isn't paying the copyright holders, either, so who cares?
Based on my experience with research using Google Books and Hathi Trust, this author is correct. Google has purposely broken Google Books so that it doesn't compete with any paid sources for these materials -- even if there aren't any paid sources and the book is out of print!
The TL;DR is that Google Books started out with the goal of digitizing every book ever written. Publishers sued, so they crippled their search and display functions and handed over the full texts they already had to a group called Hathi Trust.
Hathi Trust is seriously crippled on purpose. It only allows access to full texts when you are sitting in a physical library of one of its partner universities. That's right... I can drive to a big university, sit in their library, and download a full PDF of any book I like. But if I'm at my house, I can only read one page at a time in a browser. This is ridiculous. Hathi Trust is helping the oil business more than they're helping researchers.
The marriage of Google Books content and Hathi Trust as a distribution platform is a joke. In some cases, you will even have to order an old book from interlibrary loan (see worldcat.org) if you can even get it -- when all the while Google has a scanned copy!
My grandfather wrote a book in the 1940s that’s been out of print since the mid 50s. Every entity associated with the book is dead, including the publisher, which merged into another in the late 50s and is probably an inactive imprint of some successor company. Grandpa died in 1985, and a cousin or I is likely the heir to his rights.
I have a copy of the book, which I bought via Alibris from a bookstore in Wales 15 years ago. If you needed the book for research, you’d probably get it via inter-library loan from a university or a big city library. Whoever the publisher is, they don’t have it and aren’t selling it. In no scenario does anyone get paid for transacting, other than a reseller or the post office.
Stay with me for a second; I'm going to go wildly afield... does intellectual property need an Adverse Possession law, too? [1]
Your grandfather had a book, the rights of which should have been presumably passed down to you. My grandfather had a patented mining claim that has been passed down to me. Where the ownership of your grandfather's book is questionable, for me the physical corners of the property are questionable. A number of them are defined by things like a "4 foot spruce post" or a "12 inch diameter tree trunk" that haven't survived the ravages of time.
But it is important that I patrol my property at least every couple of years because of Adverse Possession. If someone else were to use my property continuously and I don't say anything against it, one day their trespassing suddenly and magically would become ownership. For a land-owner, it is a scary idea that someone can steal my property from me, as has actually happened. [2]
But I can acknowledge that it makes some sense. It comes from the idea that land is meant to be used, and if you aren't using it, maybe the person who is using it should get the rights.
If nobody can stand up for an intellectual property claim, perhaps some kind of adverse possession is in order.
We actually do have a sort of 'adverse possession' in US copyright law, for libraries: the "Last 20" clause, which IA has recently begun exploiting for distributing still-copyrighted books (https://blog.archive.org/2017/10/10/books-from-1923-to-1941-...).
Counter-argument: Adverse possession is justified by the scarcity of real property. Which does not apply to IP.
- Real property (i.e. land) is a scarce and limited resource. If a party is making productive use of the land, they should hold title. (There is only so much arable land. If someone raises crops, let them.)
- Intellectual property (particularly copyright) is not a scarce or limited resource. (Create your own copyrightable work if you wish to own the rights.)
Of course, adverse possession presumes that land ought to be made "useful." (Haven't thought yet about how these critiques of real property theory map to IP.)
Much of real property theory arose from the assumption that the government should recognize and encourage the "highest and best use" of real property. Traditionally, the highest and best use of land is the use that can most profit from the land's resources; often mining, grazing, farming, logging.
This is problematic.
- This view justifies colonization, and taking land from original inhabitants who don't use the land to extract resource value.
- This view does not recognize preservation of an ecosystem as a valuable use.
- This view does not account for externalities from use of the land's resources.
I think it's worth noting that you can calculate an estimate of the externalities and remove that from the profit to achieve a more balanced justification. Though unfortunately, unless you counted the loss of culture as an externality then you could still trivially justify the removal of land from those less productive/ technologically advanced than you.
Furthermore, even though I'm not personally supportive of the removal of land at the individual's loss I do have to ask if the removal could account for a net gain overall; improving many people's lives. Perhaps profit isn't the best measure of improvement to the collective but it is at least indicative.
> - Intellectual property (particularly copyright) is not a scarce or limited resource. (Create your own copyrightable work if you wish to own the rights.)
It isn't scarce for those seeking rent from it; it absolutely is for those seeking to use the works under copyright. In case of books, music, movies, games, etc., the works are not substitute goods. If I need a particular book for my research, there's a good chance I can't just take a different book instead. So there is acute scarcity involved for a subset of parties interested in a copyrighted work.
I like the analogy between real estate and copyright.
I think a simpler* solution would be to return to limited terms on copyright (say, 14 years), and require periodic renewal by the rights-holder to extend that term. As part of the renewal process, you'd either need to demonstrate active use of the copyright, or pay a fee (or both?).
* Simpler from a process point of view. I understand it's probably not simple politically.
Exponentially increasing fee, set in a way that past two-three such renewals, even the biggest and richest corporations would think twice before paying up.
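To put rough numbers on that: the sketch below assumes a base fee of $100 and a tenfold increase per renewal. Both figures are invented for illustration; nobody in this thread proposed them.

```python
def renewal_fee(renewal_number, base_fee=100, growth=10):
    """Fee for the n-th renewal (1-based), growing geometrically."""
    if renewal_number < 1:
        raise ValueError("renewal_number starts at 1")
    return base_fee * growth ** (renewal_number - 1)

# Print the fee schedule for the first six renewals.
for n in range(1, 7):
    print(f"renewal {n}: ${renewal_fee(n):,}")
```

With those made-up parameters the first renewal is pocket change and the sixth costs ten million dollars, so only works that are still earning serious money would be worth renewing that far.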
Right. And orphan works legislation has tended to be opposed by individual creators or at least the organizations that purport to represent them. The (not totally unreasonable) theory is that individuals or their estates--let's leave aside the fact that copyright terms are almost certainly too long--may well inadvertently not keep on top of what's needed to keep works non-orphaned. But "Disney" (or whoever) will most certainly be ready to pounce on anything they can acquire for free for whatever technical reason.
It's also a fundamental problem with our current copyright law where material remains in copyright for many decades after it has stopped being printed and sold through primary channels.
Copyright should automatically expire 10 or 15 years after the last printing IMHO. If nobody cares enough to put it up for sale or even make a tiny print run just to renew the copyright, why should the government continue to enforce it?
Of course this scheme falls apart a bit in the digital age, except that even ebooks get pulled from the shelves for no apparent reason. Maybe we should just go back to having to explicitly renew copyright after 15 years or so, with a fee just large enough to convince people to drop dormant works. Maybe a couple hundred bucks every 5 years.
The best part would be having some easily accessed online system where you could check the copyright status of any work, including current contact information for the rightsholders if you want to arrange payment.
Expiring copyright after 15 years no matter the printing might even be a better solution. That would allow other people to build and extend works, movies and software.
Yup. The way I see it, a lot of IP problems are caused by people who want to take the law intended for promoting new works, and turn it into a source of passive income.
But people wouldn't be encouraged to make original works if their great grandchildren would be robbed of the opportunity to squabble over the rights to it in the courts.
Here's an easy example... Google Books has many old issues of Popular Mechanics which exist in full downloadable form on the Internet Archive -- yet you can only see individual pages on Google Books and can't download the magazine.
This is because Google Books is acting like they own the copyright (or, at least, they feel the need to police it.)
There are many cases where you can download the entire book from Hathi Trust when you are sitting at a university library, giving you a PDF you can use anywhere. But you cannot even see the entire book or download it from Google Books (which has given its scan to Hathi). This is just stupid.
Google Books seems to have major problems. An alternative interested parties should explore is Archive.org. The Internet Archive has a significant collection of scanned books and other materials.
And all of them are full view if you are physically sitting in a Hathi Trust "partner" university library. These libraries are open to the public and allow downloading and saving of the materials you browse, making the whole point of locking them up completely pointless.
This is not correct. "Partner" access does not provide additional access to in-copyright materials. They do have more convenient download options for public domain works and additional services.
Mind emailing me? You don’t have contact info in your profile. I’d be interested in hanging at a library for a few hours with my laptop and have some questions before going.
To whom is this directed? If it's to me, just write your question here. I've pretty much already explained how it works. Go to the Hathi Trust login screen. Pull down the list of partner institutions. Go to one of them in person. (or get a Library of Congress card, also in person.)
After the institution logs in, you can download anything you want. Bring a flash drive or portable hard disk and take home your PDFs.
I'd recommend calling the partner library first to make sure someone there knows the Hathi login. It's not that popular a resource, sadly, and many people have no idea what it is. You may also find a friendly librarian who is willing to do the download for you and email you the PDF, saving an in-person visit.
One thing to note: You don't have to be a student to use the resources of university libraries. They're open to everyone.
I wonder if the date problem is a bug, not a feature?
Is it possible to dump the metadata of a book and check if they have the right date? There should probably be multiple dates for a book -- date written, date copyrighted, date published, date of latest edition, etc.
My guess is that Google does not have a publicly-available issue tracker for Google Books so you can't easily report this problem. Hacker News is a good way to get their attention, though...
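On the multiple-dates point, a record along these lines would make that kind of sanity check mechanical. The field names are invented for illustration and don't correspond to any actual Google Books or HathiTrust schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BookDates:
    # All years are optional because catalogs rarely have every date.
    written: Optional[int] = None
    copyrighted: Optional[int] = None
    first_published: Optional[int] = None
    this_edition: Optional[int] = None

def date_problems(d):
    """Return a list of human-readable inconsistencies between the dates."""
    problems = []
    if (d.first_published is not None and d.this_edition is not None
            and d.this_edition < d.first_published):
        problems.append("edition predates first publication")
    if (d.written is not None and d.first_published is not None
            and d.first_published < d.written):
        problems.append("published before it was written")
    return problems
```

It only catches dates that contradict each other, not dates that are consistent but wrong, so it would complement rather than replace checking against the text itself.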
What would be a scientific approach to compare search results? Let's say I do a search on DDG and on Google, how do I determine which engine provided the most accurate results?
I think if you have a population of people doing the same search, split randomly across the two sites, things like how long it takes to leave your site through a search result, how far down that result is, and how often people come back to rephrase their search are all good metrics.
Not sure there's an answer for a single search for a single individual though.
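For the population version, the metrics above could be computed from logged sessions roughly like this. The `Session` fields and the `summarize` helper are hypothetical, and the sketch assumes each user was randomly assigned to one engine and that a click always comes with a timing.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

@dataclass
class Session:
    engine: str                         # e.g. "ddg" or "google"
    seconds_to_click: Optional[float]   # None if the user never clicked a result
    clicked_rank: Optional[int]         # 1 = top result; None if no click
    rephrased: bool                     # user came back and rephrased the query

def summarize(sessions, engine):
    """Aggregate the three metrics for one engine's sessions."""
    rows = [s for s in sessions if s.engine == engine]
    clicked = [s for s in rows if s.clicked_rank is not None]
    return {
        "sessions": len(rows),
        "mean_seconds_to_click": mean(s.seconds_to_click for s in clicked) if clicked else None,
        "mean_clicked_rank": mean(s.clicked_rank for s in clicked) if clicked else None,
        "rephrase_rate": sum(s.rephrased for s in rows) / len(rows) if rows else None,
    }
```

Lower time-to-click, lower clicked rank and a lower rephrase rate would all point toward the better engine, though you'd still want a significance test before reading anything into small differences.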
My comments seem to strike a chord with a lot of HN users; I've had more than one person go through my comment history and reply to days- or weeks-old comments of mine.
https://www.theatlantic.com/technology/archive/2017/04/the-t...
The end of that article is a not-so-subtle plea for someone within google to perhaps accidentally anonymously place this material in public.