Column

What’s an Author to Do? Shadow Libraries in the Age of AI

On March 6th, a prominent group of publishers, including the five biggest global book publishers (Hachette, Penguin Random House, HarperCollins, Macmillan, and Simon & Schuster), filed a lawsuit in New York federal court to try to shut down the shadow library Anna’s Archive. A decade ago, John Willinsky described scholarly publishing as having its “Napster moment” with the emergence of pirate sites like LibGen and Sci-Hub. The race to train large language models (LLMs) using sites like Anna’s Archive (the successor to LibGen and Sci-Hub) feels like a second act, in which these sites are not just channels for pirated books and articles but also sources of AI training data. Nor is this limited to commercial publishers: HathiTrust recently reported that a large portion of its collection had been obtained and redistributed on Anna’s Archive.

Lawsuits against shadow libraries are not new – publishers and creators have been trying to get pirated creative works taken down from the internet for as long as it has existed, as reflected in the almost never-ending list of lawsuits covered on the TorrentFreak blog. What has changed over the past few years is that these lawsuits now emphasize the role that sites like Anna’s Archive play in training LLMs. As the publishers’ complaint puts it, “Publishers’ action is now especially critical in light of reports that Anna’s Archive is actively advertising that it will provide high speed access to—and indeed has already supplied stolen works of authorship to—developers of large language model AI systems (“LLMs”) and data brokers” (para. 1, https://publishers.org/wp-content/uploads/2026/03/Apress-v-Annas-Archive-Complaint.pdf).

The rapid pace of technological progress, combined with fierce competition among the companies developing AI models, has led to an ethical vacuum, with countries around the world racing to develop policies that catch up. Among the many casualties of this vacuum are authors and creators, whose published works have become the primary training materials for AI models, frequently without any compensation. And because the big tech companies behind AI development have embraced a progress-trumps-all approach to training, they have turned to pirate websites and shadow libraries like Anna’s Archive for training data.

This has naturally led to a wave of lawsuits and accusations. In Kadrey v. Meta, for example, it was alleged that Meta trained its LLMs on Books3, a dataset that includes the full text of almost 200,000 pirated books. Meta won a narrow victory in that case, with the court determining that its use of the dataset was fair use. By contrast, Bartz et al. v. Anthropic PBC ended in the largest copyright class action settlement in US history: $1.5 billion. Court documents from that case provide the most vivid example of AI’s rampant appetite for training data – in addition to drawing on content from shadow libraries, Anthropic hired Tom Turvey, the former head of partnerships for Google’s book-scanning project, and tasked him with obtaining “all the books in the world”. Anthropic then purchased, scanned, and destroyed millions of mostly used print books, building a giant electronic corpus that it planned to keep in perpetuity. Anthropic’s settlement was largely a result of its use of a “central library” of pirated works, despite Judge Alsup’s ruling that training on lawfully acquired books was fair use. Many other major tech companies, including Nvidia, Salesforce, and Apple, have been accused of using a similar strategy for LLM training.

Of course, it’s not just big tech profiting from this landscape. Big publishers, including some of the most prominent scholarly publishing firms such as Taylor & Francis and Wiley, have eagerly licensed their publications to big tech for AI training, with authors only finding out about these agreements through news stories or press releases. A more progressive model comes from Cambridge University Press, which allows authors to opt out of having their works used for training while also paying royalties. These are just a few examples; for a longer list, see the Ithaka S+R generative AI license agreement tracker. All of this reflects a wider shift in which big publishers are becoming less like information vendors and more like data brokers, while also investing in their own AI tools and platforms that leverage the content they own and license.

So where does this leave authors and creators? At this point, it is likely that most English-language publications, blog posts, and other content on the internet have been used as training data for multiple LLMs. Authors who do not want their content used for training are left with few options. They can publish in places that allow authors to opt out, although an opt-out doesn’t mean much if training data is being pulled from shadow libraries. They can also look to emerging licensing models like the Creative Commons (CC) Signals project, which will allow rights holders to “signal their preferences for how their content can be reused by machines based on a set of limited but meaningful options shaped in the public interest.” The success of this model is contingent on AI developers respecting these “signals”, and given big tech’s track record with shadow libraries and copyright compliance, it is hard to imagine AI companies treating CC Signals as anything more than optional.
