Column

AI Today: Grand Theft Auto or Public Benefactor?

“This is the largest theft in the United States, period.” Such is the judgment of author and scriptwriter Justine Bateman, who has complained to the US Copyright Office that the AI industry has scraped her work, much as it has everything else. Having exhausted Wikipedia and Reddit, it is moving on to YouTube transcripts and Google Docs. This is what it takes to assemble the trillions of words needed to expand the training of ever-more-powerful Large Language Models (LLMs). As a result, Bateman’s complaint has become a common charge. Authors (notably Sarah Silverman and John Grisham), publishers (Universal Music), and newspapers (New York Times) are lining up to sue OpenAI, Microsoft, and other firms for violating their copyright.

In the United States, where this legal action is largely taking place, copyright’s fair use exception is bound to be the first line of defense by the companies sued. In US law, four factors are considered in judging what qualifies for a fair use exception. As I and others see it, the workings of LLMs are so well served by the first factor, involving the character of the use, that this factor nullifies the other three.

The LLMs are arguably engaged in what the courts refer to as transformative use. That is, LLMs “know” not to reproduce Bateman’s texts because of copyright (if still imperfectly). Yet as they have been informed by these texts, among a great many others, they can refer to or summarize them. This transformative use reduces the relevance of the other fair use factors. These entail how much was copied and whether that copying is being used in a commercial way (but there is no presence of a copy). The fourth involves whether the copying enables an LLM to compete with Bateman’s work or harm her market. It’s likely just the opposite, as an LLM can increase attention to her books.

But more than that, just as Justine Bateman likely consumed hundreds of (complete) books and thousands of other sources in creating her own original works, the LLMs are computationally following in the footsteps of this longstanding cultural tradition. It is a given that every writer owes much to those who wrote before them. This cultural indebtedness is not something copyright is intended to correct. Nor is copyright meant to rectify how some writers (and now machines) will make far more money than those whose work they consume or to which they refer. Copyright was instituted, after all, “for the encouragement of learning,” as the title of a recent history of Canadian copyright by Myra Tawfik puts front and center.

Let me then apply this theme to scholarly publishing, which is my intellectual property beat in this column. Here what’s noteworthy is how this field’s largest publisher, Elsevier, is offering to sell for “AI and digital transformation” the complete texts published in its 2,500 journals, as well as abstracts from the 7,000 publishers that it indexes in its Scopus database, and 11 million conference papers that it also happens to possess. I can appreciate that establishing this business model is an excellent strategy for cutting off the fair use claim that LLM use does not interfere, for example, with Bateman’s sales.

Yet Elsevier’s re-selling this body of work could undermine the interests of the authors, which is to say, researchers. They turned over their copyright to the publishers long before such secondary sales were a prospect, and did so only to give the publishers an incentive to widely circulate the work.

The researchers’ interest is not in protecting their work’s commercial value, but in maximizing its contribution to learning and humankind. The principle has long been realized through the university library, where faculty and students access vast quantities of texts to underwrite their research (without paying royalties at least to the journal authors). The principle has been expanded in the digital era with the move to open access to research, which I’ve often discussed in this column. And now, we are seeing how LLMs can act as a further resource for pursuing research goals. Among a number of ongoing initiatives, there’s the success of AlphaFold, an AI system developed by Google’s DeepMind, in solving the protein-folding puzzle for over 200 million protein structures involving amino acid sequences, which will greatly speed up drug discovery and treatment.

The research community, then, has a special interest in facilitating the development of LLMs that can support research in the public interest. What is encouraging, in this regard, is the growing number of open source LLMs that can be freely used, as is the expansion of open access to research, which is, in effect, helping to enrich LLMs’ knowledge base and the factual grounding of these systems’ output. By the same token, this community should be no less concerned about how Elsevier, if not yet other publishers, is seeking to re-sell its publications to AI companies. Such moves to further commercialize research and scholarship, largely created at public expense, are bound to restrict researcher and public access to the resulting LLMs (see the Right to Research movement).

So when Bateman, Silverman, Grisham, and others have their day in court, I see a need for something of a “public defender” to speak out on behalf of the cultural traditions by which older works are transformed into new sources of knowledge through, in this case, publicly accessible LLMs. As LLMs continue to demonstrate extraordinary powers, their growth needs to be guided by the public benefits they can serve, which is very much in the original spirit of copyright law.

Comments

  1. LLMs should be used as digests or encyclopedias on steroids. That is, if an entire database is made accessible to a third-party AI system that transaction should be seen as the database being made accessible to a library or other institution. Copyright should remain with the copyright holders. LLMs should be allowed to summarize and quote and have the responsibility of giving credit where credit is due. And doing so correctly.

    If LLMs are to be used to solve problems they need access to large amounts of information to put all the pieces together in the puzzle. However, once those pieces are placed together credit should be given to all the human beings who created those puzzle pieces. In other words, the final product or solution is the result not solely of AI capabilities but of a collaboration of human input, human creation and ingenuity, and AI learning. AI puts the pieces together.

  2. P.S. It is also up to human beings to question the answers or solutions of an AI. If the AI is seen or accepted as infallible then that would present a huge problem for humankind.

  3. John Willinsky

    Thanks, Verna, I’m in full agreement on the importance of crediting sources, both as a reward and for verification purposes. I remain hopeful that, when it comes to research and scholarship as critical sources, the academic community will be supportive of, if not insistent on, this right of access for LLMs, given the potential benefits. I say insistent, as the scholarly publishers of this work will have other ideas about further profiting from our work through AI access agreements.