Earlier this year, we noticed that some academic publishers have revised the copyright notices on their websites to state they reserve rights to text and data mining (TDM) and AI training (for example, see the website footers for Elsevier and Wiley). This new language may create confusion among libraries and researchers as TDM and AI-based analysis become an increasingly important aspect of research.
SPARC asked Kyle K. Courtney, Director of Copyright and Information Policy for Harvard Library, to address key questions regarding these revised copyright statements and the continuing viability of fair use justifications for TDM.
1. Do revised copyright statements at the bottom of a website explicitly reserving rights to AI training change the fair use case for these activities?
The short answer is: no, probably not.
These statements at the bottom of a website exist within the purview of contract law (or licensing “Terms of Use”). First, let’s talk about some principles of contract law that will help guide the analysis to this question.
Broadly speaking, contract law is about enforcing promises. A contract is a promise or a set of promises, which the law recognizes as a sort of “duty” and provides remedies when that promise is breached. A license is a legal interest created by a rightsholder that grants use-privileges to a non-rightsholder. In other words, a license is a type of contract — it is a “contract not to sue.” (Therefore, we’ll occasionally use the terms “license” and “contract” interchangeably.)
As you can imagine, contract and licensing agreements can determine what a user can do within legal bounds, including text and data mining with AI-based tools. However, this depends on whether the contract or license is valid in the first place. Whether a contract or license is valid depends on how it’s entered into. Two key aspects of this are “acceptance” and “mutuality.”
- “Acceptance” means that when one party offers a contract, the offer must be clearly accepted by the other party, whether through words, actions, or performance of the obligations called for in the contract.
- “Mutuality” is a “meeting of the minds” between the contracting parties. In other words, the contracting parties must have understood and agreed to the basic substance and terms of the contract.
With these principles in mind, let’s take a look at the language at issue here: the terms that sit at the bottom of a webpage, known as “browse-wrap licenses.”
Browse-wrap licenses are a type of non-negotiable, unilateral contract where explicit agreement was not obtained. This is a very different kind of license than the standard negotiated vendor license that libraries work with all the time (see Section 4 below for more on those licenses). Browse-wrap licenses are typically a fixed display of the terms and conditions (or “Terms of Service/Use”) for using the webpage or the resource, usually found through a hyperlink or language in the footer of the page. The display is meant to indicate to the user that by using the resource, they are bound by those terms.
These browse-wrap agreements may be valid (and therefore enforceable), but only if there was acceptance or mutuality (a “meeting of the minds”). A 2019 study by two law professors found that 99% of the 500 most popular U.S. websites had terms of service as complex to read as an academic journal article, which makes them potentially inaccessible to most people. (Uri Benoliel & Shmuel I. Becher, “The Duty to Read the Unreadable,” 60 B.C. L. Rev. 2255 (2019).) How can someone “accept” a contract if they do not understand it?
However, acceptance and mutuality may be fairly implied based on the user’s conduct after the user is put on actual or reasonable notice that their access or use is subject to these terms and conditions. An example of such conduct could be continued access to or use of the website, database, or service; or, the conduct could be that the user downloaded the product.
A strategic question to ask for on-campus projects involving TDM or AI training is: How will you access the material necessary for the use? Answers will vary. Some access will certainly be via a library-licensed resource, which is sometimes part of institution-wide access (more on those licenses below). Other access might be through an individual subscription containing an agreement that you clicked on in order to access the material. And some access might be through public-facing websites featuring terms of use. Each of these examples has different implications for whether a valid contract exists. But the mere fact that a website owner puts language at the bottom of the site does not mean that the language is binding.
2. Is the fair use case for using AI-based tools on licensed content any different or any weaker than the fair use case for using AI training tools more broadly?
Yes, the fair use case for using AI-based tools on licensed content and using AI training tools more broadly can be different (but not necessarily “weaker”) in some aspects, but they both rely on the same set of principles established in copyright law under the fair use analysis. Simply because the content is licensed does not change the analysis in and of itself.
Fair use is a balancing test comprising four factors. Let’s go through each of these factors with regard to AI training utilized on licensed content and on AI training tools more broadly.
- Transformative Use: The fair use analysis typically hinges on whether the use of copyrighted material transforms the original work into something new and different. When using AI-based tools on licensed content, the transformative nature of the use is a critical factor. For instance, if the AI tool is used to analyze or extract insights from the content for research or educational purposes, it is more likely that this would be considered transformative. However, if the tool merely duplicates the content without adding significant new value, the fair use argument may be slightly weaker.
- Nature of the Work: Fair use considerations may vary depending on the nature of the copyrighted work. For instance, using AI training tools on factual or nonfiction content might be stronger under fair use compared to using such tools on highly creative or expressive works. Similarly, the fair use case for using AI-based tools on licensed content could be stronger if the content is primarily factual or if the transformative use outweighs the commercial impact on the original work.
- Amount and Substantiality of the Portion Used: Fair use analysis also looks at the amount and substantiality of the portion used in relation to the whole copyrighted work. AI-based tools should not use more of the original content than necessary to achieve the transformative purpose. If the tool uses only a small portion of the content and the use is transformative, the fair use argument may be stronger.
- Effect on the Market: Fair use analysis considers the potential market impact of the use on the original work. If the use of AI-based tools on licensed content competes with or diminishes the market for the original work, it could weaken the fair use argument. However, if the use enhances the market for the original work or serves a different purpose, such as research or criticism, it may strengthen the fair use case.
The very core of the fair use doctrine is such that the user does not need permission from a rightsholder in order to make fair use of a work. A license is inherently a set of permissions or restrictions. While the nature of the licensed content can change the fair use analysis based on the characteristics of the work, the fact that it is licensed does not in itself change the fundamental four-factor analysis.
3. Is there a meaningful distinction between the fair use case for using AI-based tools in an academic setting and the core question in the New York Times (NYT) v. OpenAI litigation?
Yes, there can be meaningful distinctions between the fair use case for using AI-based tools in a non-profit academic setting and the core question in commercial, for-profit settings such as in the NYT v. OpenAI litigation. Let’s examine the case and the claims made about AI training.
In December of 2023, the NYT sued OpenAI and Microsoft for copyright infringement, contending that millions of articles published by The NYT were used to train automated chatbots that now compete with the NYT as a source of reliable information. The lawsuit claims that OpenAI’s “commercial success is built in large part on OpenAI’s large-scale copyright infringement,” and accuses OpenAI of “using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it.”
The NYT alleged that: (1) OpenAI’s platform is powered by large language models containing copies of The NYT’s content; and (2) OpenAI’s platform generates output that recites The NYT’s content verbatim, closely summarizes it, and mimics its expressive style. Thus, the alleged misuses relate to both training the AI and the generative AI output based upon the underlying input.
The distinctions between the fair use case for using AI-based tools in a non-profit academic setting as opposed to the commercial setting at the core of the NYT v. OpenAI litigation primarily revolve around the purpose and nature of the use, as well as the potential market impact.
In a non-profit academic setting, the use of AI-based tools on copyrighted content is more likely to be considered transformative, especially if it’s for research, criticism, commentary, or educational purposes. Courts tend to favor fair use when the purpose of the use is non-commercial and serves the public interest, such as advancing knowledge or facilitating academic discourse. Conversely, in a commercial, for-profit setting, such as the issue in the OpenAI lawsuit, the transformative nature of the use might be more scrutinized, especially if the primary purpose is to generate revenue.
The nature of the copyrighted work being used can also impact the fair use analysis. In an academic setting, where the focus is often on non-fictional or factual works, the fair use argument might be stronger since factual material is typically afforded less copyright protection. However, in a commercial setting, where highly creative or expressive works are involved, the fair use case could be more challenging to establish.
Fair use analysis also considers the potential market impact of the use on the original work. In a non-profit academic setting, where the use is unlikely to compete directly with the market for the original work, the fair use argument may be stronger. Conversely, in a commercial setting, where the use could potentially affect the market for the original work or its derivatives, the fair use case may be weaker. In the lawsuit, the NYT directly alleges that OpenAI negatively affects the market for the NYT’s content. This is unlikely to be the case in an academic, non-profit setting.
Courts may also consider whether the use was made in good faith and whether it complies with fair dealing principles. Non-profit academic institutions typically have a reputation for operating in good faith and for contributing positively to the advancement of knowledge, which could weigh favorably in a fair use analysis. In a commercial context, where profit motives may be more prominent, courts may scrutinize the use more closely to ensure it aligns with fair use principles.
Although the fundamental four-factor fair use test is the same for using AI-based tools in an academic setting and for the core question in the NYT v. OpenAI litigation, the two settings differ meaningfully in the purpose and nature of the use, as well as in the potential market impact.
4. Many of the resources available to our patrons are not governed by Terms of Use floating on the bottom of a webpage, but by licenses full of terms created by vendors. These licenses are long and complex, but sometimes do not mention any type of patron use beyond “authorized uses,” which are limiting. And they may or may not mention TDM or AI training at all. Does fair use apply to the use of those licensed resources or materials?
Normally, fair use will survive general licensing disclaimers.
As noted above, contract law is about enforcing promises. A contract is a promise or a set of promises, and the law provides remedies for the breach of those promises. Licenses are most often granted within the context of a contractual relationship, and the words that create the license are often contained in the same instrument that memorializes the contract. A license has been called a “contract not to sue.”
However, a license is not all that matters when it comes to figuring out whether and how licensed collections can be (potentially) fairly used, even for TDM and AI training. Depending on the contract, you might not have made any specific contractual promise about fair use. If that’s the case, and your contract is silent about fair use, then fair use (or another default legal right) will be the default for what you may or may not do.
Think of it this way: Contract law and fair use rights are separate sources of authority. You can seek permission (a license) to use a covered work, or you can exercise your own rights under the law. If the copyright holder withholds permission, that doesn’t necessarily undermine fair use. This is because fair use is, by definition, the pre-existing right to make certain uses without permission.
That being said, whether a proposed fair use like TDM or AI training “survives” a license will depend on the specifics of the contract. For example, if there is language describing the limits of a license, such as a statement that a particular license is “for [SPECIFIC] use only,” (e.g. “for personal use only”), it should be read to leave fair use intact.
Or, if there is “contractual silence” (i.e. the license says nothing about it) about a particular fair use activity, the contract should also be read to leave fair use rights intact.
However, language of clear prohibition, or a promise not to engage in certain uses in a mutually agreed upon contract, will most likely override fair use rights. An example of clearly prohibitory language is: “User agrees not to…” or “User shall not…” This is a promise by the user not to exercise their fair use rights.
Another clause to consider is a fair use “savings clause.” This language is often recommended for standard inclusion in many library licensing agreement negotiations. This language is typically a statement such as “nothing in this agreement shall be interpreted to limit…. the Licensing Organization’s or any Authorized User’s rights under Fair Use….” This clause helps clearly preserve fair use rights for authorized users by stating that even if there are terms in the license that limit, restrict, or prevent a fair use, they simply do not apply. An agreement with this kind of clear, broad savings language generally allows you to ignore contrary language elsewhere in the agreement (as long as the use is otherwise lawful and fair).
Even if a contract contains a fair use savings clause, this still may not prevent a vendor from expressing concern about certain user activities—even contractually protected fair uses. The vendor (or even both parties) might simply not know exactly what a savings clause means.
If a vendor contacts you to express objections about a user’s activities (such as TDM or AI training), the fair use savings clause in your contract with the vendor can help clarify the patron’s right to fair use. This could also require more discussion about the interpretation of the clauses, and potentially be an opportunity to educate, learn, and collaborate.
Finally, libraries can also consider inserting language into vendor contracts stating that the license agreement has precedence over any click-through or browse-wrap license on the licensor’s site, and that any proposed language for a click-through license must be approved by the licensee prior to implementation.
This is needed because some rights holders provide a license, but then also include a hyperlink to “additional terms” at a separate URL, by which you may now be bound. Sometimes these separate terms are different or confusing because they may be generic and not specific to the license that you are working with.
Wrap-Up
The evolving landscape of copyright, licensing, and the increasing prominence of TDM and AI training pose challenges and opportunities for libraries and researchers. The questions above just scratch the surface!
The revised copyright statements at the bottom of websites may not inherently change the fair use case for TDM and AI activities, as they operate within the realm of contract law. The nuances of browse-wrap licenses and the complexities of contractual agreements underscore the importance of negotiating explicit language to assert precedence over website terms. Moreover, the fair use analysis for AI-based tools on licensed content versus broader AI training tools emphasizes the critical role of transformative use and distinctions between fair use in academic versus commercial settings, as exemplified by the NYT v. OpenAI litigation. Lastly, the interaction between fair use and licensing agreements underscores the importance of fair use savings clauses to preserve users’ rights and navigate potential conflicts.
As libraries and researchers continue to navigate these complexities, clarity, collaboration, and ongoing education are essential to ensuring fair access and use of copyrighted materials in the digital age. Fortunately, we have great folks in this space already analyzing and writing on these topics! For an excellent review of AI and fair use for libraries, including the importance of fair use to the library mission in the evolving AI and TDM spaces, please read and share the recent University of California blog post on similar topics by our excellent colleagues Rachael Samberg, Tim Vollmer, and Samantha Teremi at https://osc.universityofcalifornia.edu/2024/03/fair-use-tdm-ai-restrictive-agreements.