[This is a reposting of a comment I made in response to Siva Vaidhyanathan’s questions about my previous post. I am traveling, and can only produce brief answers to his questions now. Later this week I’ll get to most of the issues in more detail here.]
Let me start by reminding everyone that I do not speak for Google, nor am I engaged in generalized cheerleading on Google’s behalf. Rather, I am arguing that the University of Michigan Library is doing a Good Thing in its digitization project with Google.
Below are Siva’s questions, and my responses:
He dismisses serious search problems as temporary, yet fails to confront the problem that Google cannot and will not explain the factors and standards that put one book above another in search results.
Actually, I don’t mention search at all in my post. Nor (see above) do I speak for Google.
As users discover poorly-scanned files in the Google index, how can they alert Google to the problem? Why is there nothing in the contract between Michigan and Google about quality-control standards or methods?
Please see Michigan’s agreement with Google, clause 2.4, the relevant part of which reads: “U of M will engage in ongoing review (through sampling) of the resulting digital files, and shall inform Google of files that do not meet benchmarking guidelines or do not comply with the agreed-upon format. Should U of M encounter a persistent failure by Google to meet these guidelines or supply the agreed-upon format, U of M may stop new work until this failure can be rectified.” The agreement is online at: http://www.lib.umich.edu/mdp/umgooglecooperativeagreement.html
How do we know this index will last for decades? What image file system is Google using and what ensures its preservation?
I believe that in my post I said that the UM library (like other partner libraries) is also storing and preserving the files that Google scans. Maybe Google won’t last for decades, but the libraries will, and the libraries are pretty serious about preservation.
How is the “library copy,” that electronic file that Michigan and others receive as payment for allowing Google to exploit their treasures, NOT an audacious infringement of copyright? It violates both the copyright holder’s right to copy and right to distribute. Doesn’t a university library have an obligation to explain this?
It’s hard to get past the first premise of this set of questions. One literal answer would be to say that there is no such electronic file, because Google is not obtaining anything by means of exploitation.
I must say that I am troubled that the author of a very sensible book about copyright is so enthusiastic about trashing Google that he is willing to give up on the uses, notably scholarly uses, permitted by the higher-numbered sections of the Copyright Act (sections 107 and 108, which cover fair use and library copying). As my institution’s copyright lawyer says: “FAIR USE, it’s the law.” And my institution believes that when we have Google digitize our holdings we do so under the law, and in order to make uses that are not only lawful but completely consistent with the undergirding purpose of copyright law.
Siva is much younger than I am, so he may be willing to wait decades before finding out how scholarship and society can benefit from digitized and searchable collections from some of the world’s great libraries. For myself, I’d like to unleash my colleagues and our students on this remarkable resource while I’m still around to see what happens.
Finally, re Ryan Shaw’s post, yes, we receive the OCR.
Hi Paul,
Thanks for addressing my question about the OCR. It’s really great that you’re clearing up some of the misconceptions around Michigan’s collaboration with Google. I had one additional question about the OCR: I notice that MBooks doesn’t highlight keyword search results on the page images like Google Books and Open Library do. Is this because you don’t have the word bounding box info from the OCR, or just because you haven’t implemented it in the interface?
Cheers,
Ryan
November 6, 2007 @ 11:32 pm
I’m not Paul, but I can say with authority that we do not receive the word coordinates in the OCR from Google. So, we can’t implement the functionality.
November 9, 2007 @ 10:12 am
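As an aside for readers wondering what that functionality would involve, here is a minimal sketch of the display side, assuming word coordinates were available in some form. The tuple layout (word, left, top, width, height) and the file names are placeholders for illustration, not the format Google actually delivers to Michigan.

```python
# A rough sketch of highlight overlays, assuming word boxes are already known.
# The (word, left, top, width, height) layout is an illustrative assumption,
# not the format of Google's OCR deliverables.
from PIL import Image, ImageDraw

def highlight_matches(image_path, boxes, query, out_path):
    """Draw a translucent rectangle over every OCR'd word matching the query."""
    page = Image.open(image_path).convert("RGBA")
    overlay = Image.new("RGBA", page.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    for word, left, top, width, height in boxes:
        if word.lower() == query.lower():
            draw.rectangle([left, top, left + width, top + height],
                           fill=(255, 230, 0, 110))  # translucent yellow
    Image.alpha_composite(page, overlay).save(out_path)

# Example call (file names are placeholders):
# highlight_matches("page_0001.png", boxes, "copyright", "page_0001_hl.png")
```

Without per-word coordinates in the OCR deliverable, there is nothing to feed into the `boxes` argument, which is exactly the gap described in the comment above.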
it’s simple to write a program to get the coordinates. so it’s just a waste of disk-space to store that info… and since, in the long run, you will serve the _text_ rather than the _scans_ anyway, the coordinates are unnecessary. worry about important stuff instead…
-bowerbird
November 13, 2007 @ 2:32 am
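bowerbird’s point about regenerating the coordinates can be made concrete. The sketch below is only a rough illustration, assuming a page scan on disk and the open-source Tesseract engine called through the pytesseract wrapper; it says nothing about how Google or MBooks actually produce their OCR, and the file name is a placeholder. It simply shows that word-level bounding boxes, the piece MBooks would need for highlighting, can be recovered from a scan with off-the-shelf tools.

```python
# A minimal sketch, assuming Tesseract and pytesseract are installed locally
# and a page scan exists on disk; not a description of the Google/MBooks pipeline.
from PIL import Image
import pytesseract
from pytesseract import Output

def word_boxes(image_path):
    """Run OCR on one page scan and return (word, left, top, width, height) tuples."""
    page = Image.open(image_path)
    data = pytesseract.image_to_data(page, output_type=Output.DICT)
    boxes = []
    for i, word in enumerate(data["text"]):
        if word.strip() and float(data["conf"][i]) > 0:  # skip empty/rejected tokens
            boxes.append((word,
                          data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes

if __name__ == "__main__":
    # "page_0001.png" is a placeholder file name.
    for word, left, top, width, height in word_boxes("page_0001.png"):
        print(f"{word}\t{left},{top}\t{width}x{height}")
```

Whether recomputing boxes this way is accurate enough for production display, compared with receiving them as part of the original deliverable, is a separate question of quality and cost; the point is only that the information is recoverable from the scans themselves.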
besides, it’s not as if umichigan can’t do its own o.c.r.
-bowerbird
November 13, 2007 @ 2:39 am