Copyright and Generative AI: Recent Developments on the Use of Copyrighted Works in AI

September 2, 2025

Generative AI models rely on billions of copyrighted works as training data. These works are not merely “data,” but creative expressions protected under copyright law. In May 2025, the United States Copyright Office (USCO) issued a comprehensive pre-publication report addressing the use of copyrighted works in the development and deployment of generative AI systems. This report responds to congressional inquiries and stakeholder interest, providing an analytical framework for evaluating copyright issues raised by AI training, with a particular focus on fair use and licensing.

Generative AI Training

Generative AI models, such as large language models and image generators, are developed through a multi-phase, iterative training process that requires vast datasets, often including copyrighted works. The report notes that the quality and diversity of training data are critical to model performance, and that data is often acquired through web scraping, third-party datasets and, in some cases, unauthorized sources. The development process can be paused and resumed with different datasets, goals or actors. For example, Meta’s generative AI model “Llama 3” was publicly released and used by third parties to create and train new models like Perplexity’s Sonar and Nvidia’s Nemotron. As a result, references to a model’s “training” often obscure which data was used, how it was used and by whom.

Points of Copyright Concern

The USCO identifies multiple stages in AI development where copyright infringement risks may arise, including:

  • Data collection and curation: downloading, copying and modifying works for inclusion in training datasets
  • Training: use of works to adjust model parameters, with potential for memorization of protected expression
  • Deployment: use of trained models in systems that may generate outputs that resemble or are derived from copyrighted works
  • Retrieval-Augmented Generation (RAG): systems that retrieve and incorporate external content, potentially reproducing copyrighted materials in their outputs

The report emphasizes that these acts, absent a license or defense, may constitute prima facie infringement of the reproduction, derivative work, public display or public performance rights.

Fair Use Analysis

The core legal issue addressed in the report is whether the use of copyrighted works in AI training can be excused as fair use under Section 107 of the Copyright Act. The fair use analysis is extensive and framed around the four statutory factors:

  1. Purpose and character of the use: The central dispute is whether AI training is “transformative.” The report finds that training a generative AI model may often be transformative, particularly when the model is used for research, analysis or non-substitutive tasks. However, when models are trained or deployed to generate outputs that compete with or closely resemble copyrighted works, the use is less likely to be considered transformative. The USCO rejects the argument that AI training is analogous to human learning, noting that AI training involves perfect copying and analysis at superhuman scale.
  2. Nature of the copyrighted work: The use of highly creative works receives stronger protection and weighs against fair use, while the use of factual or functional works may favor it. Generative AI models are often trained on a mix of expressive and factual content.
  3. Amount and substantiality of the portion used: Generative AI training often involves ingesting or copying entire works at scale, which ordinarily weighs against fair use. However, where such copying is necessary for a transformative purpose and little or none of the copied material is made accessible to the public, this factor may be less significant.
  4. Market harm: The report identifies several forms of potential market harm, including lost sales, market dilution (where AI outputs compete with the type or style of original works used in the training data) and lost licensing opportunities.

The USCO concludes that fair use determinations are highly fact-specific. Uses such as noncommercial research or analysis that does not enable reproduction of protected expression are likely to be fair. In contrast, the USCO notes that commercial uses involving large-scale copying of expressive works to generate competing outputs, especially where licensing is available, are unlikely to qualify as fair use.

As generative AI continues to disrupt creative industries, U.S. courts are deciding how far the fair use doctrine can stretch in the age of algorithms. In Bartz v. Anthropic PBC, a federal judge ruled on summary judgment that Anthropic’s use of lawfully purchased and pirated copyrighted books to train its chatbot “Claude” qualified as fair use. The training was considered transformative, similar to an author researching rather than copying. However, in ruling on the retention of digitized copies of physical books purchased by Anthropic and pirated copies of books downloaded to create a central library, the court distinguished between lawful and unlawful acquisition. The court held that retaining digitized copies of lawfully acquired books for Anthropic’s central library may be protected, but retaining over 7 million pirated copies of books for its library was not. Summary judgment was denied on the issue of maintaining a central library containing the pirated books, which the court held was not itself a fair use excusing Anthropic’s piracy. That issue will go to trial. This was the first ruling recognizing large-scale book training as potentially fair use, but it is far from a blanket exemption. Parallel cases are also underway. Like the Anthropic case, these lawsuits test whether transformative use and public benefit outweigh the commercial nature and scale of copying in generative AI training.

Licensing Options for AI Training

The USCO also evaluates existing and proposed licensing models to address the growing demand for training data. Voluntary licensing is seen as a promising avenue in sectors like stock photography and music, where rights are centralized and monetization is more mature. Yet for broader content types such as literature, journalism or publicly available web content, voluntary licensing may be impractical without substantial infrastructure. The USCO highlights the potential of collective management organizations to reduce transaction costs and facilitate bulk licensing, while noting antitrust considerations and the need for further guidance. The report also considers statutory approaches, including compulsory licensing, extended collective licensing and opt-out mechanisms as fallback options should market-based solutions prove inadequate. Any legal framework must balance feasibility, fairness and innovation, avoiding undue burdens on developers and rightsholders.

International Approaches

The report surveys international approaches, including the European Union’s text and data mining exceptions, Japan’s flexible copyright exemption for machine learning and Israel’s fair use framework. These reflect varying degrees of openness to AI training. Such disparities present challenges for global AI development, especially when models are trained on cross-border datasets. The USCO suggests that some harmonization of copyright standards may be necessary to support international AI research and commerce.

Policy Recommendations

The USCO recommends allowing voluntary licensing markets to continue developing without government intervention. Should market failures arise in specific contexts, targeted solutions such as extended collective licensing may be considered. The USCO emphasizes the need to balance technological innovation with the rights and incentives of creators and commits to monitoring developments and advising Congress as the legal and technological landscape evolves.

Key Takeaways From the USCO Pre-Publication Report

  • The use of copyrighted works in generative AI training raises significant legal and policy questions, particularly regarding fair use and licensing.
  • Fair use determinations are fact-specific and depend on the purpose, nature, amount and market effect of the use.
  • Voluntary licensing markets for AI training are emerging, but challenges remain for licensing at scale.
  • Government intervention is premature; continued market development and targeted solutions for specific failures are recommended.
  • The USCO will continue to monitor developments and provide guidance as needed.