Generative AI: The Awards and the Infringement
The week of October 7th this year was quite something for Artificial Intelligence (AI). It was the object of two consecutive Nobel Prizes, awarded just days apart. The first, in Physics, went to John Hopfield and Geoffrey Hinton (a British Canadian) for laying the foundations of machine learning. The second, in Chemistry, went to Demis Hassabis, John Jumper, and David Baker for using AI to predict millions of intricate protein structures that are key to understanding molecular interactions.
In stark contrast to this double triumph are some 20 copyright infringement suits filed against OpenAI, Microsoft, Google, Nvidia, Anthropic, and other AI companies. Among those suing are the New York Times and seven other newspapers; various authors, including notables Jonathan Franzen, John Grisham, Sarah Silverman and others; open source software coders; and Scarlett Johansson (over the use of her voice).
The software coders’ case, Doe v. GitHub, was the first to be heard. U.S. District Judge Jon Tigar in Northern California dismissed most of the claims on June 24, 2024, noting that GitHub’s Copilot was not producing code identical to the plaintiffs’. The case is now under appeal.
With that first ruling in place, I’d like to weigh in, with special consideration, as always, for the scholarly publishing angle (and thus with little to add, alas, to Johansson’s case). While these suits have all been filed in the U.S., their impact is bound to be felt in Canada as well.
The first thing to note is that Judge Tigar’s ruling on GitHub has general applicability. Large Language Models (LLMs) don’t, as a rule, quote more than excerpts from the texts to which they refer. It’s true, and incriminating, that the New York Times has examples of ChatGPT generating a number of its articles verbatim, but I suspect that’s an exception among those who’ve filed.
While this may reflect more recent engineering updates, when I asked ChatGPT, for purposes of this column, to show me the lyrics of Leonard Cohen’s “Hallelujah,” it responded, “Sorry, I can’t provide the lyrics to that song, but I can summarize its themes or discuss its meaning if you’d like!” When I asked about the out-of-copyright “Song of Myself” by Walt Whitman, it presented eleven lines from the opening as a key excerpt, along with a brief commentary that begins “this is just a small portion of the 52 sections…” The Sarah Silverman joke it provided — by no means among her best — was accompanied by analysis bringing it within the scope of fair use’s review and commentary allowances.
So while LLMs ingest the entirety of texts, those texts are employed both to guide the models’ language use and to inform their responses to prompts, without the works generally being reproduced. The LLM also strikes me as offering an excellent instance of “transformative use,” one of the fair use defenses. The LLM uses the works, as well, in ways that promote, rather than compete with, the original, or as ChatGPT put it: “‘Hallelujah’ deals with love, loss, and spirituality, intertwining biblical references with personal reflection.”
So when it comes to the copyright principle to be applied to AI, I believe it is misguided to hold that insofar as OpenAI will profit from its use of authors’ copyrighted works, those authors deserve a cut of the take. That would make all of us teachers liable for every work we’ve read that now informs our teaching (if for a different sort of profit). Copyright exists to promote cultural and learned contributions to the benefit of all (see Nobel Prize #2 above in chemistry). It is not intended to monetize every transaction by which humans (and now machines) learn from each other. This does mean that AI companies should pay for access to the original works, rather than rely on pirate websites.
Certainly, it is very much in the spirit of research and scholarship to contribute to the promise of LLMs. To have LLMs informed by everything online except academic work, by contrast, is not an encouraging prospect (and it leaves LLMs to hallucinate the research references they cannot access).
Still, when the scholarly publisher Taylor & Francis signed a $10 million non-exclusive “data access” agreement with Microsoft, it caught researchers by surprise; they had forgotten how willingly they signed over their copyright to publishers in exchange for publication. Even so, it is only fair for researchers to be credited for this AI use of their work. Credit is, indeed, required when using open access research published under Creative Commons attribution licensing (CC-BY). Google search, for example, now features its LLM “Gemini,” which offers a link to a source for each point made (along with a concluding caution that “generative AI is experimental”).
For my part, let me conclude with a request to the scholarly publishers contemplating these AI windfalls, and to the AI companies willing to pay for this access. If these agreements were to direct the new revenue toward reducing journal subscription prices and Article Processing Charges (for open access), every dollar saved, they can be assured, would result in more research. This would perfectly suit, for example, OpenAI’s declared intent to become a “public benefit corporation,” which commits it to balancing business interests with societal and planetary gains. This would make for a noble cause, well aligned with AI’s recent awards.