
Authors allege Meta used copyrighted books for AI training despite legal warnings

Meta Platforms, the parent company of Facebook and Instagram, allegedly used thousands of pirated books to train its artificial intelligence (AI) models despite prior warnings from its legal team. The allegation comes from a new filing in a copyright infringement lawsuit that consolidates two separate cases brought against Meta by notable figures, including comedian Sarah Silverman and Pulitzer Prize winner Michael Chabon.

The lawsuit, initially filed this summer, claims that Meta utilized copyrighted works without proper authorization to train its AI language model known as Llama. A California judge previously dismissed a portion of the lawsuit filed by Sarah Silverman but indicated that the authors would be given the opportunity to amend their claims.


The new filing, submitted on Monday, alleges that Meta's legal department was aware of the legal risks associated with the use of copyrighted books for training its AI models. The complaint includes chat logs of a Meta-affiliated researcher, Tim Dettmers, discussing the procurement of the dataset in a Discord server. These chat logs serve as potential evidence indicating Meta's awareness that its use of the books might not be protected by U.S. copyright law.


In the quoted chat logs, Dettmers describes his interactions with Meta's legal department over the legality of using book files as training data. He mentions that, for legal reasons, they were unable to use the dataset, known as "The Pile," in its current form. The complaint also quotes Dettmers as saying that Meta's lawyers had told him the data could not be used, or that models could not be published if they were trained on it.


Although the specific legal concerns are not explicitly mentioned in the chat logs, there are references to worries about "books with active copyrights." The chat participants discuss the potential applicability of the fair use doctrine, a U.S. legal principle that allows certain unlicensed uses of copyrighted works, to justify training on the dataset.


This development is part of a broader trend where tech companies face legal challenges from content creators accusing them of using copyrighted material to build generative AI models. If successful, such cases could impact the generative AI landscape by potentially increasing costs and legal risks for AI companies using copyrighted works without proper authorization.


Meta released the initial version of its Llama language model in February, acknowledging the use of datasets such as "the Books3 section of ThePile." The dataset reportedly contains 196,640 books. Llama 2, the latest version of the model, was released earlier this summer and marked a significant move in the generative AI market: it is free to use for companies with fewer than 700 million monthly active users. The lawsuit against Meta underscores the complex legal challenges and ethical considerations surrounding the use of copyrighted material in the development of AI models.
