Pirates of the Claudeibbean: Judge Alsup Splits the Baby on AI Fair Use
- June 25, 2025
- Snippets
In one of the most consequential decisions to date involving generative artificial intelligence (genAI) and copyright law, Judge William Alsup of the U.S. District Court for the Northern District of California ruled that some of Anthropic’s uses of copyrighted books to teach its large language models (LLMs) qualify as fair use. As a result, he found that the use of digitized versions of purchased hard copy books does not infringe their authors’ copyrights. However, other uses – including certain acts of copying from pirated sources – did not receive such protection.
Background
The plaintiffs, authors of both fiction and non-fiction books, sought to hold Anthropic liable for unauthorized copying of books in training its Claude LLM. While the three named plaintiffs had authored only a few published books, they were (and are) seeking to represent a much more extensive putative class of authors. As a result, at issue were not only the few titles that they had authored, but millions of other books that Anthropic had acquired, some via purchase and others via piracy.
To train Claude, Anthropic assembled a massive corpus of digitized texts, including over 7 million pirated books downloaded from online sources. It also purchased millions of books in print format, then destructively scanned them into searchable digital documents. All of these digital copies of books were organized into a permanent, general-purpose library, subsets of which were selected to train various Claude models.
The plaintiffs alleged that Anthropic infringed their copyrights through copying and retention of digital copies of their books and further copying and use during LLM training. Critically, however, they did not allege that Anthropic infringed their copyrights through Claude providing infringing content to the public in response to user prompts. In other words, unlike some other pending genAI copyright infringement cases, whether Claude did or could create infringing copies was not at issue. In the parlance of these types of cases, the dispute was focused on the input side of the genAI model rather than the output side.
Motion and Rulings
Anthropic sought early summary judgment based on the affirmative defense of fair use. Fair use is written into the Copyright Act itself, providing a complete statutory defense to infringement of the rights set forth in §§ 106 and 106A of the Act. It reads:
Notwithstanding the provisions of sections 106 and 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—
(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) the effect of the use upon the potential market for or value of the copyrighted work.
The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors.
17 U.S.C. § 107.
While the statute is clear that the four enumerated fair use factors are not exclusive, case law has developed only around those four (and Judge Alsup did not consider others). One potential issue that Judge Alsup expressly refused to consider was the potential bad faith of the accused infringer. Although fair use is viewed as an equity-based defense and the Supreme Court has therefore said that “[f]air use presupposes good faith and fair dealing,” he chose to apply more recent jurisprudence that did not consider the accused infringer’s subjective intent.
The first factor is the purpose and character of the use. Although the statute mentions whether the use is for profit or not, the focus has shifted more to whether the use is “transformative.” While Judge Alsup does not provide a test for whether a use is “transformative” (and courts generally have been pretty “squishy” in determining what that means), a transformative use generally adds new expression, meaning, or purpose to the original copyrighted work, whereas a non-transformative use replicates the work. Thus, as the Supreme Court has found, 2 Live Crew’s use of Roy Orbison’s “Oh, Pretty Woman” was a transformative parody, while Andy Warhol’s overlay of color on a photograph of Prince in the characteristic Warhol style was not transformative.
The second factor is the nature of the work. The copying of factual or nonfiction works is more likely to be deemed fair than the use of highly creative or fictional content. Copyright protects expression, not facts (or even styles); nonfiction is based on facts (or, at least, is supposed to be), whereas fiction is based on the imagination of the author.
The third factor is the amount and substantiality of the used portion of the work. Generally, the smaller the portion taken, the more likely it is to be fair use. Usually, copying of an entire work weighs heavily against fair use, but copying a short segment does not (unless it is at the heart of the work).
The fourth factor is the effect of the use on the potential market for the work. If the use serves as a substitute or causes market harm to the work, it is less likely to be considered fair use.
The court disaggregated Anthropic’s different uses of the books and assessed the fair use of each separately. These separate uses were training uses, scanning and format-shifting of purchased books, and use and retention of the pirated copies.
Judge Alsup found that at least parts of the first, third, and fourth factors weighed in favor of a finding of fair use, except for the pirated books. However, he uniformly found that the second factor weighed against Anthropic across all of the copyrighted works and categories of copying. While less protection is often afforded to published and factual works than to unpublished or highly creative ones, as discussed above, the plaintiffs’ books contained sufficient expressive content to warrant copyright protection. In fact, Anthropic acknowledged that these works were selected for training precisely for their stylistic quality and compelling composition, reinforcing their expressive character. Therefore, the discussion below focuses on the remaining three factors for each of the uses.
Use of Books for Training LLMs (Considered Fair Use)
First factor: Judge Alsup determined the use of copyrighted works to train Claude’s LLM to be “spectacularly” transformative. In doing so, he ignored the copying from printed form to digital, viewing it as just a necessary step in service of training the LLM. While the models did compress and/or memorize works in the process, the result was a set of new statistical relationships between parts of language that enabled Claude to generate human-like textual passages. For Judge Alsup, this transformative nature weighed in favor of fair use.
Third factor: The full text of the books was used, but according to Judge Alsup, in a non-expressive form, which undermined the substantiality. This seems nothing more than allowing the first factor to run riot over the third – the complete books were used because of the quality of their expression. Furthermore, the court rejected the argument that only part of the books could have been used, reasoning that other books could have been used if these ones hadn’t been. But that ignores the facts of the case, as it is undisputed that the entirety of the plaintiffs’ books was actually used. Judge Alsup accepted that full copying was necessary for LLM training but did not result in expressive reproduction of the works, and he therefore found that this factor also favored fair use.
Fourth factor: The plaintiffs offered no evidence that Claude’s outputs supplanted their books or harmed their sales. This is where the plaintiffs’ decision not to argue that Claude provided users with infringing works, whether a tactical choice or a practical one, really hurt them. Because no infringing outputs were alleged and end users could not access or extract original content from Claude, the court found no market substitution, again favoring fair use.
Scanning and Storing Purchased Print Books (Considered Fair Use)
First factor: Anthropic legally purchased, then destructively scanned, millions of print books to create searchable digital files for internal use. The court held that this format-shifting was a distinct kind of transformative use akin to the space- and time-shifting found to be fair use in previous cases (for example, the famous Sony Betamax decision allowing “time-shifting” of television broadcasts, or the gray market importation of foreign texts). But that seems to be a poor analogy, given that there are clearly separate markets for hard-copy and digital books, with most publishing houses refusing to sell (and only licensing) digital copies. Judge Alsup also found it important that the digital files replaced, but did not multiply, the physical copies. This factor favored fair use.
Third factor: The entire books were copied, but only once per purchased copy, and the original physical versions were destroyed. The court found the one-to-one conversion acceptable given the internal storage purpose and lack of distribution. Of course, that has nothing to do with the amount and substantiality of the copying; once again, it seems to be allowing the first factor to supersede the third. So, Judge Alsup found this factor also favored fair use.
Fourth factor: Since Anthropic had lawfully purchased the print editions and did not distribute the scanned versions, the court found no harm to the market for the works. Judge Alsup based this in part on the fact that there was no evidence that Claude would provide users any copyright-protected portion of the plaintiffs’ works, so they had gotten the benefit of their one sale (even if they weren’t permitted to choose the format). This factor was neutral with regard to fair use.
Downloading and Retaining Pirated Books (Not Considered Fair Use)
First factor: Anthropic downloaded over seven million pirated books, including the plaintiffs’ works. The court rejected Anthropic’s claim that this was a precursor to transformative training use, or that later legal purchasing could insulate the earlier pirating. Retaining full-text pirated copies indefinitely, even when not used for training, was instead akin to building a private digital library by theft – especially when legal copies of the books were available for purchase (and were actually later purchased!). The court found this use to not be transformative and that this factor weighed against fair use.
Third factor: Full copies of the books were retained by Anthropic, regardless of whether they were ever used in training. In this case, in marked contrast to his analysis of the use of legally purchased books, Judge Alsup found the scope and permanence of this copying weighed against fair use.
Fourth factor: The court found that Anthropic’s piracy undercut the legitimate market for both print and digital copies. Acquiring these copies by way of infringement was inexcusable even if there was a later transformative use of the copies. Additionally, Anthropic chose piracy expressly to avoid licensing costs (that is, paying for the economic value) of the works. This factor weighed against fair use.
Conclusion
As an early decision in fair use jurisprudence in the context of genAI training, this outcome is significant. The court drew clear distinctions between different uses of copyrighted material. Transformative genAI training and internal format-shifting were permitted, but downloading and retaining pirated copies was not. Regardless of whether Judge Alsup drew the line in the correct place, this reinforces the principle that fair use is context-specific and does not provide blanket immunity for infringing behavior simply because some downstream uses may be innovative or socially valuable.
Going forward, LLM developers should trace the provenance of their training data and stay on the lawful side of the boundary between transformative use and unauthorized reproduction. Doing so involves making sure that they obtain proper licenses to all copyrighted training data that has not entered the public domain.
But even for input-side LLM fair use disputes like this one, numerous district court decisions in the coming months will add nuance to (or disagree with) the reasoning here. And there are likely to be many appeals as well. So, while this case establishes a line in the sand, it may soon be washed away.