Text- and data-mining (TDM) present an interesting challenge to libraries and folks who think about library policy questions, because it reverses the polarity of an important legal distinction: the difference between owning a copy and owning a copyright. For the last several decades, libraries have been working through the challenge of owning copies but not rights. The core question has been: how can we take advantage of new technologies to leverage the investments we’ve made in our physical collections of copies, though we don’t own copyrights in these materials? In the next couple of decades, when it comes to TDM, we may be working through the opposite challenge. Our question may be: how can we take advantage of the broad rights we have under fair use when control of massive collections of physical copies (including bits and bytes on servers) has shifted increasingly to publishers? Up until now libraries have been secure in their possession of copies but have had to puzzle through rights; in the future, when it comes to TDM, we may be secure in our rights but need to puzzle through how to get copies. We need to think strategically about this shift.
Copyright law, indeed all intellectual property, deals with intangibles. Personal property rights apply to specific physical objects—your laptop, your wallet, your valuable hunting knife. Real property rights apply to plots of land and buildings. These are things and places you can visit and touch. But intellectual property applies to abstract objects—inventions, “marks,” and, for copyright, “works.” This is why Judge Story observed in the landmark US fair use case Folsom v. Marsh that intellectual property is as close as the law gets to “metaphysics”—it regulates physical acts (copying, distribution, performance) and objects (copies), but it does so by reference to a non-physical world of abstract things. There is an old truism that “possession is nine-tenths of the law,” but copyright grants rights almost without regard to possession of tangible things. A work has to be “fixed in a tangible medium”1 before it can gain copyright protection, but that copy can be destroyed subsequently without any effect on the rights in the intangible work. Lawful owners do get some special rights by virtue of owning a copy—mainly to sell, lend, or otherwise dispose of that particular copy.2 But that’s about it.
Just as copyright owners have the right to do or to authorize certain uses of protected works, the public has a right (thanks to fair use, among other limitations and exceptions) to make protected uses of those works. Perhaps the best-established and most thoroughly litigated aspect of the fair use right at this point is the right to make what are often called “non-consumptive uses.” We now know beyond the shadow of any rational doubt that combining copyrighted works into a searchable database that yields useful factual information is a permissible fair use. Anyone has a right to create such databases, and to release factual information and limited “snippet”-style previews from the underlying works in the database in appropriate contexts. We know this because there is a long and diverse line of caselaw holding as much.
The two appellate circuits that hear the bulk of copyright cases—the Ninth Circuit (presiding over Hollywood and Silicon Valley) and the Second Circuit (presiding over New York)—have come down the same way on the issue. The most recent opinion in this line, Judge Leval’s for the Second Circuit in Authors Guild v. Google, is a tour de force of fair use thinking, written by the judge who literally coined the phrase (or the word, anyway: “transformative”) that the Supreme Court later blessed as “the heart of fair use.”3 In that opinion, Leval explains why text- and data-mining, and all similar “non-consumptive uses,” are as transformative as it gets: using copyrighted works to glean new insights, and to liberate factual information for the enrichment of the public, all without intruding on the normal market interests of the copyright holder. This is the heart and soul of fair use.
That’s all well and good, but it turns out that the laws of physics still apply; you have to have physical access to copies in order to use them. Google didn’t just say “fair use!”, snap its fingers, and shazam! a database of books appeared on its servers. It had to partner with dozens of libraries, send trucks to pick up the hundreds of thousands, and ultimately millions, of books that made their way into its Google Books search engine, scan those books, clean the OCR (kinda), and on and on. Before Google did it, the main barrier to massive digitization was believed to be as much logistical as legal, and optimistic estimates for libraries doing this themselves were measured in decades. Copyright may not care much about “sweat of the brow,” but reality does; you can’t mine something that you don’t possess or can’t get permission to access.
And unfortunately, fair use rights do not include a legal right of access to a physical copy.4 A work may enter the public domain and be free for all to copy and share, but if no one has a copy to work from when the term of protection runs out, it will remain lost to history.5 In this way, fair use is similar to the First Amendment’s right to speak, which does not ensure a right of access to the means of mass communication.
In the analog era, libraries built large physical collections of copies they owned. In the digital era, more and more we are buying licenses to digital content hosted on vendors’ servers. We may have a contractual right of perpetual access, but we increasingly do not have (and have not necessarily wanted) physical copies that we control. That practice shifts the benefits of copy ownership (as distinct from copyright ownership) to the vendors, on whose servers large corpora now sit. This is also why Elsevier’s acquisition of platforms like SSRN and bepress is so clever. Even if all the content in those platforms is technically open access, or is at least made freely available online, so long as access is granted only on a one-by-one basis, Elsevier will still hold a de facto exclusive right to mine those corpora by virtue of being the sole holder of physical copies.
To be sure, we can negotiate for access to physical copies, and for the “right” to do TDM on licensed corpora (a right that is grounded in the publishers’ control of the corpus, not copyright!). We should do that. And when contracts are silent about TDM and related practices, we should zealously defend our rights under the default rules of copyright law to conduct TDM. But we can do more.
There has been a lot of talk about “community owned infrastructure” in library-land, some of it facilitated by SPARC. While there are many reasons to explore this idea and to value community ownership, perhaps one of the most important is that if we don’t own the infrastructure, and if we can’t access the copies, our fair use rights may (perversely!) be much less useful to us than they were in the analog era. When it comes to TDM, even in the post-Google Books era, possession is still nine-tenths of the law.
(And speaking of Google Books, libraries do own and control an amazing corpus of millions of books digitized in connection with that project, along with other digitized volumes, brought together as part of the HathiTrust corpus. That treasure trove of data has a kind of gravitational pull that creates exciting opportunities for partnerships with others whose collections might be much more valuable in combination with Hathi’s corpus than standing alone. Those kinds of collaborations, grounded in the investments libraries have already made in collection-building, could help ensure that the monopoly on physical access doesn’t take over for the copyright monopoly as a driver of power imbalances between libraries and some publishers.)
This guest post was written by Brandon Butler, Director of Information Policy at the University of Virginia Library.