On being in bed with Google
One of the things that surprises me most about reactions to the Google Library Project is that smart people whom I respect seem to think that the only reason a university library would be involved with Google is that, in some combination, its leadership is stupid, evil, or at best intellectually lazy. To the contrary, although I may be proved wrong, I believe that the University of Michigan (and the other partner libraries) and Google are changing the world for the better. Four years from now, all seven million volumes in the University of Michigan Libraries will have been digitized – the largest such library digitization project in history. Google Book Search and our own MBooks collection already provide full-text access to well over a hundred thousand public domain works, and make it possible to search for keywords and phrases within hundreds of thousands more in-copyright materials. This access is altering the way that we do research. At least as important, the project is itself an experiment in the provision and use of digitized print collections in large research libraries. I do not see how we can discover the best ways to use such collections without experiments at this scale. In sum, I believe that our library is doing exactly what it should do in the best interests of scholarship and our users, now and in the future.
So I’m puzzled when people ask, “How could serious libraries be doing this? How could they abdicate their responsibilities as custodians of the world’s knowledge by offering their collections up as a sacrifice on the altar of corporate power? Why don’t they join the virtuous ranks of the Open Content Alliance partners, who pay thousands of dollars to digitize books at a rate of tens of thousands of volumes a year?” It seems like those who ask such questions have little appreciation of what Michigan and the other Google partners are actually up to.
Google is on pace to scan over 7 million volumes from U-M libraries in six years at no cost to the University. As part of our arrangement with Google, they give us copies of all the digital files, and we can keep them forever. Our only financial outlay is for storage and the cost of providing library services to our users. Anyone who searches U-M’s library catalog, Mirlyn, can access the scanned files via our MBooks interface. That’s right, anyone. (Copyright law constrains what we can display in full text, and what we can offer only for searching, but we share as much as we can consistent with prudent interpretations of the law.) For an example of an MBook, take a look at The Acquisitive Society by R. H. Tawney.
In a recent New York Times article about mass digitization projects, Brewster Kahle was quoted as saying: “Scanning the great libraries is a wonderful idea, but if only one corporation controls access to this digital collection, we’ll have handed too much control to a private entity.”
I agree with him. I’m an economist with a particular interest in public goods, which is how I came to be involved with libraries in the first place. Libraries have a long and honorable history of preserving information and making it accessible. Moreover, even at their best, for-profit institutions cannot be expected to serve general public interests when those interests run counter to those of their shareholders. So I would be distressed if a single corporation controlled access to the collections of the great academic libraries, just as I find it troubling, on a smaller scale, that a handful of publishers control access to much of the current scientific literature.
But Google has no such control. After Google scans a book, they return the book to the library (like any other user), and they give us a copy of the digital file. Google is not the only entity controlling access to the collection – the University of Michigan and other partner libraries control access as well. Except we don’t think of it as controlling access so much as providing it.
Since 2005, Siva Vaidhyanathan has been making and refining the argument that libraries should be digitizing their collections independently, without corporate financing or participation, and that those who don’t are failing to uphold their responsibility to the public. “Libraries should not be relinquishing their core duties to private corporations for the sake of expediency.”
“Expediency” is a bit of a dirty word. Vaidhyanathan’s phrase suggests that good people don’t do things simply because they are “expedient.” But I view large-scale digitization as expeditious. We have a generation of students who will not find valuable scholarly works unless they can find them electronically. At the rate that OCA is digitizing things (and I say the more the merrier and the faster the better) that generation will be dandling great-grandchildren on its knees before these great collections can be found electronically. At Michigan, the entire collection of bound print will be searchable, by anyone in the world, about when children born today start kindergarten.
Google brings to us extraordinary technical and computing power and tremendous financial resources. The libraries bring an understanding of our collections and our users, and a profound commitment to public access. We are not relinquishing our duties in the name of expediency; we are working with a capable partner to create a far more useful resource than we could create on our own. (Would I prefer that a charitable foundation would support this work on the same schedule as Google, and make everything available to everyone, subject only to copyright restrictions? You bet. I would prefer it even more if that foundation would buy out all of the rights holders for all out of print works. Can someone tell me the name of the foundation, please? In the meantime, it seems to me that being in bed with Google is way better than sleeping alone.)
It’s true that the digitized files from Google’s scans are often far from perfect. Historian Robert Townsend, Paul Duguid, and others have raised technical questions about the quality of Google’s scans, and their appropriateness for preservation. Those are important questions, and there is a great deal of work to be done, both by Google and by the libraries, before we consistently achieve the level of quality and bibliographic reliability that are essential to successful scholarly practice. I will discuss some of the specific steps we are taking to address quality in a future post, but for now I will just say that the solution of these problems will require the serious engagement of academic libraries, and that the visibility of the problems is essential to their solution. Mass digitization on the scale of the Google library project was unimaginable five years ago, and it comes as no surprise to me that we are learning a lot as we go along. We are learning in the tradition of serious academic work, by putting our ideas and our resources in the public eye, where they can be seen, and criticized, and improved.
I am interested in your statement:
“As part of our arrangement with Google, they give us copies of all the digital files….”
I had understood that Google was giving the library partners copies of all the books as images. But if they were giving you ALL the digital files they would also be giving you their ASCII interpretation of the scanned images. This would be a good idea, but is it really the case?
The project is certainly of great importance and a huge step forward in the development of digital libraries. Something wonderful will result.
November 5, 2007 @ 9:05 am
Have you written a letter to the NY Times responding to the article by Katie Hafner? It was very one-sided and there should be a response. This post is very informative and while it is too long for a letter to the Times, the information could be summarized for them.
November 5, 2007 @ 9:07 am
[…] Librarian Paul Courant has launched his own blog. He starts it off with a strong (but long-winded) defense of the University of Michigan-Google agreement that provides for the digitization of the University […]
November 5, 2007 @ 1:19 pm
These issues also come up - again not in the depth covered here - in the recent article in the New Yorker by Anthony Grafton,
“Digitization and its discontents.”
November 5, 2007 @ 2:24 pm
Hey Provost!
“In the meantime, it seems to me that being in bed with Google is way better than sleeping alone.” Boy, what a naughty comment!
November 6, 2007 @ 12:24 am
I think Brewster’s latest project, The Open Library, is a good solution to some of these issues.
http://demo.openlibrary.org/
However as someone else mentioned, the real issue is search, that is where the corporate stranglehold comes in.
November 6, 2007 @ 10:14 am
Prof. Courant,
Your post is a refreshing breath of pragmatic and user-centered air.
And I’m not just saying that because I’m an SI alum.
It’s good to see someone in your position blogging. We need more of that. Thank you for doing it.
November 6, 2007 @ 10:18 am
[…] of Michigan, has a new blog that begins with a candid assessment of what it’s like “being in bed with Google.” Google antagonist Siva Vaidhyanathan provides an immediate response and some good, […]
November 6, 2007 @ 11:42 am
let me give you my support and thanks on this issue!
i strongly believe the decision to work with google
was the _correct_ judgment, without any question!
if memory serves, the umichigan digitization plan
had a timeframe of _99_ years for its completion…
to get the job done in _7_ is truly a huge blessing.
frankly, i don’t see how people can think otherwise.
and to do it without substantial cost to the people
of michigan is some tasty frosting on a great cake!
especially with no other entities providing funds…
i’m also tremendously appreciative of your policy of
releasing your _digital_text_ and not just the scans,
at least for public-domain books on which you can.
this is an _extremely_ generous gift from umichigan,
one i’m surprised more google partners aren’t giving.
you’ve taken the _right_road_ for the public domain,
and i — for one — love you for your wisdom on that.
i will have additional points to make very soon, but
let me register my approval for these choices so far…
your actions have been righteous and brave. hurray!
-bowerbird
November 6, 2007 @ 7:48 pm
So that there is some light on an area that seems to get people confused: are there restrictions on the digital files that are offered back to the libraries? The answer is yes.
Here is a part of UC’s. I understand there are similar ones in other agreements.
(http://www.cdlib.org/news/ucgoogle_cooperative_agreement.pdf) section
4.9 states that the “University shall develop methods and systems for
ensuring that substantial portions of the University Digital Copy are
not downloaded from the services offered on University’s website or
otherwise disseminated to the public at large.”
This applies to all books even those whose physical versions are public domain.
I recommend that others read these agreements. They are generally quite readable.
Some libraries have decided that these are worth their participation and some have not.
This is a complex issue, but this is a meaningful part of the debate, and worth understanding.
-brewster
November 8, 2007 @ 1:38 am
The first postings on this blog offer great and welcome openness to people trying to understand a project that, itself full of promise, has been obscured by silence or secrecy, depending on where you stand with regard to the major conspiracy theorists. Michigan’s attitude to Google Books, both in the work it is doing and in the willingness to talk about it, deserves great praise. Nonetheless, the question that strikes me on reading Paul Courant’s posts is not whether Michigan can work with what Google gives it. That’s evident from looking at the admirable Michigan catalogue alone. Rather, the question is whether Google will learn from Michigan. Will, primarily, Google discover from Michigan that, while there’s a lot you can do by machine alone, there’s a lot about books that requires human diligence and intervention? Will, for example, Google take back from Michigan files improved by Michigan and put them in place of its own? Or will Google search and rank Michigan’s files with its own (or even in place of its own) in returning Google searches? And will Google be willing to make its ranking translucent, if not transparent?
Why are these questions important? It is surely not conspiracy theory to believe it more likely that, for example, Michigan students will transfer (have transferred) their allegiance from the Michigan catalogue to Google Book Search than the other way around. Ease tends to trump all other concerns in search. Consequently, whatever Michigan or other partner libraries do with regard to quality, they are in danger of being swamped by Google. All Courant’s thoroughly persuasive arguments and all the admirable work of its librarians will be undercut if Google book search becomes the primary portal of access, for with regard to books Google’s elegantly simple search can be as misleading and worrying as it is seductive.
Here I would like to make a mild correction to Courant’s kind allusion to my own work. Neither Robert Townsend nor I wrote solely about the quality of Google’s scans. These are noticeably poor in many places, but as the optimists assume, they can be fixed. Let us assume that they will be. Of greater concern to me, and I think to Townsend, is Google’s attitude towards metadata. Problems here seem to be legion, but let me take the ones that intrigued me in the piece Courant pointed to. Google currently seems unable to distinguish between different volumes of a multivolume edition. Unlike a search in the Michigan catalogue, a search in Google’s box gives no indication when it lands you in the first or a subsequent volume. (A quick, crude, but surely typical title search as I write this of Tristram Shandy, Decline and Fall, or Clarissa lands you, if you take the first readable text, in vol. v, vol. iii, and vol. iv respectively, none marked as such by Google in the page returned, and only one marked as such on the “about this book” link.)
Part of the problem here is the allure of Google ranking. People have come to expect from Google search that the results will be ranked by some inherent notion of quality–the question Courant himself raises. Are we then to assume that volume iii of a novel is in some way sufficiently better than volume i if the former turns up on top? (If the answer is yes, a reader might still hope to be told both that it is volume iii and why it is preferred.) Equally, if we are concerned with quality, it may be too much to expect that Google can electronically distinguish between multiple good editions, but its own wars with copyright should tell it that, if it lends, as it seems to, a certain priority to texts published before the copyright window closes in the 1920s, it’s going to put at the top of its searches a lot of what might be politely called dubious texts, lent cultural authority, though they are, by the accompanying name of some of our major research libraries. (People from Michigan often complain that they get little respect in the publicity Google puts out about the Google Books. It seems to me that they don’t get much respect in the search rankings either. In the highly unscientific search I noted above, Stanford, Harvard, and Stanford again came up first, though Michigan, by all accounts, has many more books in the database.)
The problems of search, and related ones of elementary metadata and quality assessment, are ones that need a little thought and care, as Michigan has shown. So, to repeat by way of conclusion, while it is indeed admirable to see, as Courant notes, what Michigan can gain from working with Google, it would be a relief to know how much Google could learn from working with Michigan.
November 8, 2007 @ 10:20 am
metadata schmetadata. google will fix its errors,
or else someone else will build a better catalog
into google’s own scan-sets. (yeah, right, sure,
google’s gonna let someone upstage them at it,
because they understand so little about search.)
paul courant, your biggest problems are _not_
with google, they are right in your own shop…
when you’re ready to talk about ‘em, let me know.
-bowerbird
November 12, 2007 @ 11:03 pm
From section 4.9 of the google/uc agreement:
4.9: Use of University Digital Copy.
University shall have the right to use the University Digital Copy, in whole or in part at University’s sole discretion, subject to copyright law, as part of services offered to the University Library Patrons. University may not charge, receive payment or other consideration for use of the University Digital Copy except that University may charge for use of any services supplemental to the original work that the University supplies that add value to the University Digital Copy (for example, University may charge University Library Patrons for access to annotations to works from professors and scholars but the original work will always be accessible without a fee), and to recover copying costs that actually incurred. University agrees that to the extent it makes any portion of the University Digital Copy publicly available, that it will identify the works, in a statement on a web page or other access point to be mutually agreed by the Parties, as “Digitized by Google” or in a substantially similar manner. University shall implement technological measures (e.g., through use of robots.txt protocol) to restrict automated access to any portion of the University Digital Copy or the portions of the University website on which any portion of the University Digital Copy is available. University shall also prevent third parties from (a) downloading or otherwise obtaining any portion of the University Digital Copy for commercial purposes, (b) redistributing any portions of the University Digital Copy, or (c) automated and systematic downloading from its website image files from the University Digital Copy. University shall develop methods and systems for ensuring that substantial portions of the University Digital Copy are not downloaded from services offered on University’s website or otherwise disseminated to the public at large.
University shall also implement security and handling procedures for the University Digital Copy which procedures shall be mutually agreed by the Parties. Except as expressly allowed herein, University will not share, provide, license, or sell the University Digital Copy to any third party.
- end
So while a project like OCA may be scanning at a much slower rate and at a cost to the libraries, it is important to state (again and loudly) that the OCA material aims to be truly open and distributed.
Not knowing what it looks like, if there are similar stipulations in the Google/UofM agreement, then there is a strong argument that says: In the long run, UofM material may be less useful to the “public at large” than the Open Content material because of all the robots.txt/etc. blocks placed upon them.
For works that are out of copyright and are in the public domain, it seems wrong for a library to block access. Why should I have to use the Google or UofM search engines to find these works? What innovation is encouraged if hackers/programmers are not able to download the books in an automated fashion and redistribute them again in new and potentially more useful ways?
November 13, 2007 @ 8:21 pm
update:
here’s section 4.4.1 of the UofM/Google contract (available at: http://www.lib.umich.edu/mdp/umgooglecooperativeagreement.html)
4.4.1 Use of U of M Digital Copy on U of M Website. U of M shall have the right to use the U of M Digital Copy, in whole or in part at U of M’s sole discretion, as part of services offered on U of M’s website. U of M shall implement technological measures (e.g., through use of the robots.txt protocol) to restrict automated access to any portion of the U of M Digital Copy or the portions of the U of M website on which any portion of the U of M Digital Copy is available. U of M shall also make reasonable efforts (including but not limited to restrictions placed in Terms of Use for the U of M website) to prevent third parties from (a) downloading or otherwise obtaining any portion of the U of M Digital Copy for commercial purposes, (b) redistributing any portions of the U of M Digital Copy, or (c) automated and systematic downloading from its website image files from the U of M Digital Copy. U of M shall restrict access to the U of M Digital Copy to those persons having a need to access such materials and shall also cooperate in good faith with Google to mutually develop methods and systems for ensuring that the substantial portions of the U of M Digital Copy are not downloaded from the services offered on U of M’s website or otherwise disseminated to the public at large.
November 14, 2007 @ 10:28 am
Dear Paul,
Congratulations and thanks for your post. It’s really helpful to see you enter the discussion, as someone who must know more than almost anyone else about the arguments pro and con Michigan’s involvement with Google. At Yale, we have recently announced our mass digitization contract with Microsoft, and I posted a short piece about that on my blog last week (http://www.library.yale.edu/mtblog/ulibrarian). For us of course, it’s just a beginning in the mass digitization field. Here’s an excerpt: The contract has already attracted a considerable amount of attention in the press. It is gratifying to find that mass digitization continues to grab the attention of such august organs as the New Yorker and The New York Times. I think it’s a terrific affirmation of the central importance of libraries in the information economy and in society far more broadly, that people really do care about what we do with books, and how reading and the dissemination of knowledge will happen in the future. I have been asked questions that range from a concern about the preservation of Yale’s assets (“Why are we giving this material away free to the world?”) to others that surround the concern that books are going to be neglected or sidelined in the future. The answers in both cases seem clear to me: through our contract with Microsoft we are acquiring digitized assets that otherwise we could not afford to create, while at the same time we are fulfilling our mission to share resources with the entire scholarly community. And digitizing books more often encourages greater use of the original, not less. It also gives a new life to works of scholarship that otherwise would be available only to a tiny fraction of the community that stands to benefit from them.
A very high priority for many of our readers is the provision of enhanced digital access to our collections. This new partnership represents a huge step forward in that direction and complements the work we have already done to digitize collections in the Beinecke Rare Book and Manuscript Library, Manuscripts and Archives, the Map Collection, the Visual Resources Collection, and the Lewis Walpole Library, among others. This collaborative project also advances Yale’s goal to build an international global presence for the University.
November 14, 2007 @ 4:28 pm
What is often absent in this Google vs. OCA debate is the question of where we are with discoverability right now (and for at least the immediate future). We also tend to overstate how unlikely, or “impossible,” it is that these books would ever be scanned again.
In searching for a number of Internet Archive books in Google Book Search, it seems that unless the book has also been scanned from one of the partner libraries, there is only metadata and the standard links, none of which shows that the full text is available at IA. E.g., http://books.google.com/books?id=IJu1GwAACAAJ&dq Over in regular Google, the IA copies can be found, sometimes, and not easily–more metadata than a title is needed and the indexed link is obscure.
Whether the virtual invisibility of OCA content in Google continues or not, OCA libraries should not be so quick to congratulate themselves that they are independent of Google. OCA partners can be unconcerned about this if they believe: a) Google will soon change course and fairly include OCA content; b) Google will fade and their successor will behave differently; or c) that search engine exposure isn’t critical because people will find these books through our library catalogs. Assuming catalogs don’t rise to the challenge, the admirably shareable and crawlable local copies are only marginally discoverable unless Google makes it so. Whether we like it or not, for now Google is where discovery happens. The Google partners are choosing a course that puts more books in front of the current generation of readers.
November 14, 2007 @ 4:59 pm
“For an example of an MBook, take a look at The Acquisitive Society by R. H. Tawney.”
The link you posted doesn’t work. I also searched for the book via Mirlyn, and found the metadata for it. Following its link produced the same error.
Perhaps too many people downloaded it based on your link, and Google complained that U of M was violating its contract, by allowing the public to download public domain books from its collection? (You have not addressed paragraph 4.4.1 in UM’s contract — pointed out by your adversaries — that prohibits UM from releasing these digitized books to the public. Is that term in effect? Is Google excusing your lapse? Or was it a surprise to you, that you couldn’t do as you wished with “your own” scans of your own books, within the parameters of copyright law? Did you enter into that restriction knowingly and willingly?)
Here’s another link — to UM’s scanned “Alice’s Adventures in Wonderland” from 1866:
http://mirlyn.lib.umich.edu/F/HXS939LKUX1KG4Y3152MSM2AFB4RUY42HGPGFEKXISJRSJ9P73-64006?func=full-set-set&set_number=031169&set_entry=000019&format=999
Unfortunately, following the link results in an error page, “Sorry, you aren’t permitted access to that item.” The catalog says it’s “Available for non-commercial, internal use by students, staff, and faculty for academic and research purposes only.” DRM on public domain books is alive and well at UM. Now THAT’s more in line with the Google contract — you had me worried for a minute.
November 14, 2007 @ 7:42 pm
MBooks has been down intermittently all day because an extensive rewiring project in the Library shut down power to the server. Google, of course, had nothing to do with it.
The copy of Alice in Wonderland that you found in the catalog is part of a 19th century literature database, copyright 1999, and also has nothing to do with Google. When we do get a copy of Alice from a Google scan, it will be available on MBooks, assuming restoration of power to the server.
November 14, 2007 @ 10:10 pm
Paul, this is getting more and more interesting.
So despite a published contract that prohibits you from doing it, you’re releasing (some? all?) out-of-copyright books scanned by Google. While refusing to say anything about the contract term or the practice.
But despite the lack of copyright protection, you are denying access to 1866 Carroll works because they arrived(?) as part of a “19th-century literature database, copyright 1999”! How can any collection of entirely out-of-copyright materials be “copyright 1999”? Even if an introduction was copyrightable, “mere aggregation” of copyrightable text with non-copyrightable text doesn’t produce a copyright on Carroll.
I think Paul’s library has fallen down some kind of odd rabbit hole…
November 20, 2007 @ 4:44 am
I think you are confusing copyright law with contract law. If you agree to a contract that requires you to restrict access to a collection to some subset of the world, it doesn’t matter what the copyright status of an individual object within that collection might be. You are obliged to honor your contract.
This is not unique to any one library when it comes to electronic collections. There are any number of vendors who have made a living reselling materials in electronic form, including materials in the public domain. Libraries that license these materials are required to abide by the license.
November 20, 2007 @ 2:06 pm
Thanks, Chris, that is exactly what I was going to say.
While it is always our preference to make things widely available, we have an obvious specific interest in making things available to our own students and faculty, and sometimes this involves entering into contracts that limit our ability to distribute public domain works.
November 20, 2007 @ 3:10 pm
ok, the contract language that brewster cited _is_
troubling. indeed, it is _extremely_ troubling…
but it’s most disturbing because i’d like to think
libraries had better sensibilities about avoiding
contractual language antithetical to their mission.
as for the practical matter of obtaining material,
however, this is not a very big roadblock to me…
1. google itself makes the page-scans available,
and in a form much more convenient than u-m…
(i speak only about the public-domain material;
i’ve given up in despair about copyrighted stuff.)
2. google also makes their o.c.r. available, albeit
not in a form that is quite as convenient as u-m.
however, the text from both is _fatally_flawed_
(which is the real issue that should be discussed),
so this point becomes moot, since we’ll need to
repeat the o.c.r. process on the scans anyway…
(oh, and by the way, brewster, your o.c.r. is _also_
fatally flawed, and i’ve told your people why, but
– just like u-m — nobody seemed to care one bit,
and the problems remain, unfixed.)
3. neither google nor u-m can put into place a
system that could thwart a determined effort to
download the material in a _distributed_ manner
– 2 million people downloading just a little bit –
short of making _all_ of it totally _unavailable_,
which would become a public-relations nightmare.
in other words, we will “route around the damage”.
(thank you, john, for all the good you have done.)
4. google cannot legally prevent us from re-using
its public-domain scans. not only do they know it,
they also know that they can’t even _try_ to do it,
not without causing big damage to the arguments
they need in defending against the authors guild…
that’s why they phrase it as a “request” when they
ask us not to do it on the first page of each .pdf…
-bowerbird
November 22, 2007 @ 4:55 pm
As much as I admire and incorporate Anthony Grafton’s work in my own research, I was disappointed to see him write his New Yorker piece, which to me exhibited a singular lack of understanding of what the Google/library partnership means for scholarship.
Having begun my career as a librarian in the early days of humanities computing in the Hatcher Library, I now find myself making use of the fruits of the past 20 years of digital library development in finishing my own dissertation. Writing on the origins of anthropology in 18th century Germany, I can honestly say that finishing this dissertation at mid-career would not have been possible without the Google Book project: what I have been able to discover, collect and incorporate into my research would simply have not been possible even a few years ago. What I have found in my area of study through Google Books has been exhilarating, indeed at times nothing short of breathtaking. The relationships and continuity of ideas I have been able to discern through searching the database will play a vital role in structuring my dissertation. And this is only in the relatively early stages of the Project.
Using Google Books as a powerful research tool for primary sources, and as part of my work in doing a comprehensive bibliographic study of secondary materials, I feel I can make a claim for ‘comprehensive’ in a way that was simply not possible a few years ago. As a research librarian, I was frequently struck by what scholars failed to find or incorporate into their research. With Google Books (and Scholar), JSTOR, Muse, and the many other powerful databases at my disposal, I cannot disagree more with the claim that a practicing researcher can be “overwhelmed” by too much material. For me this is simply an admission that effective tools for searching and analysis have not been learned, and that the related digital tools have not been incorporated into one’s work to effectively organize this material.
With Google Books and these other databases, I have been spared the countless months (or years) of travel in the US and Europe, and the tedious working through of printed catalogs and finding aids. For Grafton’s generation such travel and work in the archives (and yes, the romanticizing of ‘smelling’ old paper, meeting future spouses in the stacks, etc.) was closely associated with their scholarly work. With paradigm-changing tools such as Google Books, I can condense the collecting phase of primary and secondary material to a fraction of the time required in years past. Instead, I can spend my time more valuably on analysis and writing. In so doing, the critical value of search, discovery, and analysis will take on a vital significance for the future success of scholarly activity.
November 26, 2007 @ 6:06 pm
I have come to rely on Google and MSN to lead me not only to books I don’t have but also to those that are sitting on the shelf behind me, leading me to that piece I know I read but whose source or page number I can’t remember. However, I also live outside the US, and Google (and Michigan) will not allow me to access anything published after 1863 as it may still be under copyright here, the US law applying only to US published materials and only within the US. What I am waiting for, then, is for Michigan, NYPL, etc. to make not the books but the indexes available for MY library catalogue and, eventually, for the various national legal deposit/cataloguing agencies to create full-text indexes of their acquisitions. That, I think, is what ‘search’ in a library context is all about. We should no longer be limited to searching metadata, or contents notes, for books: we now know what keyword searching of journals and Google/MSN books can lead to and need to move beyond what we could do to what we can do.
November 26, 2007 @ 11:13 pm
You may still search within the restricted books at Michigan, and get a list of all pages with keyword hits on them. I’ve found that useful — almost an index — when deciding whether I wanted to request a book from storage here on campus.
November 27, 2007 @ 9:57 am
[…] amusing remarks are made in the back-and-forth comments between Paul Courant and Siva Vaidhyanathan about the general laziness of modern students, to the effect that they […]
November 27, 2007 @ 11:57 am
[…] the University of Michigan. Recently, he posted on his blog, Au Courant, defending the decision on collaboration between U-M Libraries and Google Scholar, which will result in the digitization of approximately 7.0 million books owned […]
November 28, 2007 @ 8:40 am
[…] Techdirt covered a back and forth and back regarding Google’s partnership with the University of Michigan for book […]
December 2, 2007 @ 2:33 am
paul said:
> We are learning in the tradition
> of serious academic work, by
> putting our ideas and our resources
> in the public eye, where they can be
> seen, and criticized, and improved.
as i made clear above (#10), i am a _huge_
supporter of your collaboration with google,
and your wonderful decision to share content
with the world at large, a very generous act…
so i certainly hope i don’t appear ungracious
in pointing out _problems_ with your content.
for instance, in your “making of america” section
– all your doing, nothing related to google at all –
you have a page where the scan looks like this:
> http://quod.lib.umich.edu/cgi/t/text/pageviewer-idx?c=moa&cc=moa&idno=afk3913.0001.001&frm=frameset&view=image&seq=189
now let’s switch over to the “text” version of it:
> http://quod.lib.umich.edu/cgi/t/text/pageviewer-idx?c=moa&cc=moa&idno=afk3913.0001.001&frm=frameset&view=text&seq=189
it’s quite obvious that the text all runs together.
whether you want to keep the original linebreaks
is a question that’s reasonable (i think you should),
but i’m positively certain that we will all agree that
it’s unacceptable to run the _paragraphs_ together.
this problem plagues this entire book, and indeed
_many_ of your books, perhaps even a _majority_.
frankly, i don’t know how it escapes your attention.
it is the first of many problems i’ll point out to you,
so i hope you’ll bear with me.
-bowerbird
December 6, 2007 @ 1:57 pm
It’s the browser that’s running the text together–that’s what happens to unformatted text in an HTML browser. If you “view source,” you’ll see that the text has largely retained the line breaks, except for instances where the OCR software has pulled together a hyphenated word.
December 6, 2007 @ 3:38 pm
“it’s the browser”…?
that’s your response?
um, yeah, i _know_ “it’s the browser”.
that’s the way that h.t.m.l. works.
when you wanna start a new paragraph,
you have to put in a bracket-p code…
so why wasn’t that done?
that’s my question.
or, at the very least, a bracket-br
could’ve been coded after each line.
but, as it is, it’s just plain _wrong_…
and it’s wrong the way a rank amateur
gets .html wrong…
and it pervades your entire library…
-bowerbird
December 6, 2007 @ 8:49 pm
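The "bracket-p" and "bracket-br" markup discussed in the two comments above can be illustrated with a minimal Python sketch. It is not the code behind MoA or MBooks. It assumes, purely for illustration, that the OCR text marks paragraph boundaries with blank lines, which real OCR output may not do. The function name ocr_text_to_html and its keep_line_breaks parameter are hypothetical.

```python
import html
import re


def ocr_text_to_html(text: str, keep_line_breaks: bool = True) -> str:
    """Wrap each blank-line-separated paragraph in <p>...</p>.

    Assumption: paragraphs in the plain text are separated by blank lines.
    With keep_line_breaks=True the original line breaks become <br> tags;
    otherwise the lines of a paragraph are joined with single spaces.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p]
    joiner = "<br>\n" if keep_line_breaks else " "
    rendered = []
    for paragraph in paragraphs:
        # Escape &, <, > and quotes so OCR text cannot break the markup.
        lines = [html.escape(line.strip()) for line in paragraph.splitlines()]
        rendered.append("<p>" + joiner.join(lines) + "</p>")
    return "\n".join(rendered)


if __name__ == "__main__":
    sample = "first line\nsecond line\n\nnew paragraph & more\n"
    print(ocr_text_to_html(sample))
```

Without markup of this kind a browser collapses every run of whitespace, including newlines, into a single space, which is why unmarked text appears to run together.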
Getting into a discussion with bowerbird is always such a disappointment. One has to wade through such levels of insult and misunderstanding. The answer, as I’ve said to bowerbird in other fora but he refuses to hear, is that this is not a priority for us. We have limited resources and a long list of priorities and have to make choices. We don’t have an army of people sitting on their hands waiting for something to do. It’s all about priorities.
December 7, 2007 @ 8:05 am
yes, perry, i’ve already heard your excuses…
now i’d like to hear what dr. courant has to say,
not just about this point, but about _several_
problems — real issues — with the library which
he has inherited, and now is responsible for…
so, paul, is this what you want your online display
of book pages to look like, with all the paragraphs
run together in a meaningless jumble? or is that
– as perry says for you — “not a priority for us”…
-bowerbird
December 7, 2007 @ 11:21 am
It appears to me that we are talking about two separate things here: OCR quality and the way systems display that OCR. MBooks and MoA actually appear to display the OCR in different ways: one adds HTML markup to the OCR in the hopes of making it look pretty in the browser (MBooks), while the other leaves the OCR as it is output by the OCR system, so that when you save the page you have the text in its cleanest possible form (MoA).
I can see benefits to both choices, though the MoA display seems to have been driven more by an academic community who wants to extract text for reuse, and the MBooks display more for a general audience who wants things to look “pretty” in the browser and won’t complain about having to pull out a bunch of superflous tags if they want to paste a quoted passage in a Word document, for instance.
December 10, 2007 @ 11:21 am
Looks like the <BR/> I used to illustrate got swallowed by the blog, so let me retry that last thought (giving me a chance to type “superfluous” properly):
…the MBooks display more for a general audience who wants things to look “pretty” in the browser and won’t complain about having to pull out a bunch of superfluous <BR/> tags if they want to paste a quoted passage in a Word document, for instance.
December 10, 2007 @ 11:26 am
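The chore described in the comment above, pulling <BR/> tags out of a copied passage before pasting it elsewhere, might look like the following Python sketch. The function name strip_line_break_tags is hypothetical, and the sketch assumes the copied fragment contains only <br> tags and HTML character entities, no other markup.

```python
import html
import re

# Matches <br>, <br/>, <BR /> and similar, plus any whitespace that follows.
BR_TAG = re.compile(r"<br\s*/?>\s*", re.IGNORECASE)


def strip_line_break_tags(fragment: str) -> str:
    """Replace <br> tags with spaces and decode entities such as &quot;."""
    text = BR_TAG.sub(" ", fragment)
    text = re.sub(r"\s+", " ", text)  # collapse leftover whitespace
    return html.unescape(text).strip()


if __name__ == "__main__":
    copied = "whether the play<BR/>\nis called &quot;King Lear&quot;<br>or not"
    print(strip_line_break_tags(copied))
```

A word hyphenated across a line break would still need rejoining by hand, since this sketch only replaces the tags with spaces.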
but wait, there’s more…
look at this page, page 175 from another book:
> http://mdp.lib.umich.edu/cgi/m/mdp/pt?seq=175&view=image&size=100&id=mdp.39015016881628&u=1&num=181
as you’ll note, “king lear” in the third line is in quotes.
but now look at the o.c.r. text for that page:
> http://mdp.lib.umich.edu/cgi/m/mdp/pt?seq=175;view=text;size=100;id=mdp.39015016881628;u=1;num=181;page=root;orient=0
oops, we’ve lost the quotemarks. this lost-quotemarks problem is endemic.
it infests nearly all of the books that were scanned by google. but curiously,
it doesn’t seem to manifest over at google itself, so it appears to be a glitch
that’s introduced at the university of michigan. and — just so you know –
em-dashes are also routinely lost, and so are the end-of-line hyphens, so it
appears to be a problem with the _upper-ascii_ characters.
to see the problem with end-of-line hyphens being dropped, see page 176.
and to see the problem with em-dashes being lost, see page 177.
these missing-character problems are even more prevalent and pernicious than
the paragraph-jumbling mentioned above, and are extremely difficult to correct.
-bowerbird
December 10, 2007 @ 12:23 pm
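One common way for characters outside 7-bit ASCII to vanish is a conversion step that silently discards anything it cannot encode. The thread does not establish that this is what happened to these files, so the Python sketch below only illustrates that general mechanism. The sample string is made up, and treating the end-of-line hyphen as a soft hyphen (U+00AD) is an assumption.

```python
def drop_non_ascii(text: str) -> str:
    """Simulate a lossy step: encode to ASCII, discarding what does not fit."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def lost_characters(text: str) -> list[str]:
    """List the distinct characters that drop_non_ascii would discard."""
    return sorted({ch for ch in text if ord(ch) > 127})


if __name__ == "__main__":
    # Curly quotes (U+201C, U+201D), an em dash (U+2014) and a soft hyphen
    # (U+00AD), written as escapes so the source file itself stays ASCII.
    sample = "\u201cKing Lear\u201d \u2014 a trag\u00adedy"
    print(repr(drop_non_ascii(sample)))
    print([hex(ord(ch)) for ch in lost_characters(sample)])
```

Comparing the set returned by lost_characters before and after each processing stage is one way to narrow down where in a pipeline such characters disappear.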
As I said, OCR quality and display of OCR are two separate problems. I have no insider knowledge of Google’s OCR practices and whether these are things they can improve, or Michigan can improve, or whether Google is sending Michigan the same files they display (I’d guess no on that, as in the past, I have seen pages displayed in MBooks that are not displayed in Google Books).
December 10, 2007 @ 12:56 pm
chris said:
> As I said, OCR quality and display of OCR
> are two separate problems.
i’m not sure i know exactly what you mean, chris.
it’s clear to me that umichigan does _something_
that causes these “high-bit” characters to be lost.
these problems make the text quite unusable,
at least from the basic standpoint of creating
a version of the book consisting of digital text.
which, to my mind, is the only _worthwhile_
kind of cyberlibrary to have, in the long run.
scan-sets are merely “pictures of books”…
they’re suitable as a first approximation, but
what we really want is more flexible digital text.
so essentially, because of this umichigan mistake
which is causing all these characters to disappear,
people will have to re-do the o.c.r. on the scans…
which means the decision that i lauded above
(in comment #10) as “an extremely generous gift”
turns out to be relatively hollow in its execution…
of course, it should be fairly easy to track down
where the mistake is being made, and correct it,
so as to substantially improve the text presented.
so that is what i’d like to hear from dr. courant,
a simple statement that that action will be taken.
-bowerbird
December 10, 2007 @ 3:00 pm
We are aware of the problem, and are working with Google to resolve it. We expect it to be resolved by the end of Q1 2008.
December 11, 2007 @ 11:38 am
great! glad to hear it!
-bowerbird
December 12, 2007 @ 1:23 pm
[…] Courant’s thoughtful November blog postings regarding the University of Michigan’s Google partnership are helping to focus the […]
January 4, 2008 @ 3:15 pm
i assume we’re still on-track for this solution.
if not, please let me know about the hold-up.
thank you.
-bowerbird
January 25, 2008 @ 2:36 pm
[…] I won’t argue against the utility of Google Book at Michigan–it’s definitely a worthwhile endeavor that never would have taken place outside investment by private industry. Individual text […]
February 4, 2008 @ 10:20 pm
i assume we’re still on-track.
if not, please let me know…
happy leap-day…
-bowerbird
February 29, 2008 @ 4:36 am
We are getting texts from Google with em dashes and quotation marks, e.g., “The adventures of Baron Munchausen, from the best English & German editions..”:
http://hdl.handle.net/2027/mdp.39015064541447
and a different edition of “Books and Culture” than bowerbird’s example from #37 above:
http://hdl.handle.net/2027/mdp.39015034569833
These text files do not include end-of-line hyphens consistently.
We are getting new copies of files from Google that will replace the older files with missing em dashes and quotation marks.
March 4, 2008 @ 12:32 pm
> We are getting texts from Google
> with em dashes and quotation marks
fantastic! congratulations!
> These text files do not
> include end-of-line hyphens consistently.
that often happens in the o.c.r. process,
and can be expected. it doesn’t indicate
a systemic mistake in handling the files…
so, in honor of the date, march forth!
-bowerbird
March 5, 2008 @ 9:08 pm