Academic publishers have called for more protections and greater transparency over the way artificial intelligence chatbots are trained, amid a string of lawsuits seeking to protect copyrighted material.
The progress of legal cases alleging that work was copied without consent, credit or compensation by the likes of OpenAI – creator of ChatGPT and GPT-4 – and Google is being closely followed, with experts predicting that large academic publishers might start their own claims in time.
Data “is going to prove to be the moat that companies protect themselves with against the onslaught of generative AI, especially large language models”, predicted Toby Walsh, Scientia professor of artificial intelligence at UNSW Sydney.
“I can’t imagine the publishers are going to watch as their intellectual property is ingested unpaid.”
Thomas Lancaster, a senior teaching fellow in computing at Imperial College London, agreed. “There are academic publishers out there who are very protective of their copyright, so I’m sure some are actively trying to work out what content is included in the GPT-4 archive,” he said.
“I wouldn’t be surprised if we see academic lawsuits in the future, but I suspect a lot will depend on any precedents that come through from the current claims.”
In July, authors Mona Awad and Paul Tremblay filed a class action complaint in a San Francisco court alleging that their books had been “used to train” ChatGPT, because it was able to generate “very accurate summaries”. Comedian Sarah Silverman has started a similar claim.
OpenAI has said little about the sources that have been fed into its model, and it is unclear how academic research was used during its development.
However, Meta’s Galactica – which bills itself as a large language model (LLM) for science – is known to have been trained on millions of articles and claims to be able to summarise academic papers.
Many of these studies are available openly online, and LLMs also draw on news stories and reviews that discuss research findings, suggesting that publishers might find it difficult to prove that their copyright has been violated.
Dr Lancaster said, after checking for his own papers, it “appears GPT-4 has access to a lot of abstracts, but not the main paper text and detailed content”.
The myriad copyright laws used in different countries are a further complication, he added. Many governments have loosened the rules to enable data mining as a way of encouraging AI development.
Patrick Goold, reader in law at City, University of London, said even if publishers could prove that books and journals had been used in the training of chatbots, courts would likely rule that copyright has not been infringed because the AI “spits out an expression that is entirely unique”.
Despite the legal uncertainties, publishers told Times Higher Education that more needed to be done to protect academic work and to force AI developers to be more open in acknowledging their sources.
Wiley said it was “closely monitoring industry reports and related litigation claiming that generative AI models are harvesting copyright-protected material for training purposes, while disregarding existing restrictions on that information”.
“We have called for greater regulatory oversight and international collaboration, including transparency and audit obligations for AI language model providers, to address the accuracy of inputs and the potential for unauthorised use of restricted content as an input for model training,” a spokesperson said. “In short, we need more protections for copyrighted materials and other intellectual property.”
The American Association for the Advancement of Science, publisher of the Science family of journals, said there was a need for “appropriate limitations” to be put on text and data mining to avoid “unintended consequences”.
“Given the fast pace of artificial intelligence development, it is critically important to monitor the creation and adoption of guidelines for tools that can be trained on full-text journal articles, including for the purposes of replicating scholarly journal content, to ensure a focus on responsible and ethical development,” a statement said.
Elsevier said it did not permit its content to be input into public AI tools because “doing so may train such tools with Elsevier’s content and data, and other companies may claim ownership on outputs based upon our content and data”.
While there is widespread support for open access to academic publications among scholars, researchers have echoed calls for transparency in the development of AI to ensure that its outputs acknowledge scientific uncertainty and are not accepted uncritically.
Professor Walsh said this would help in the understanding of the “limitations and abilities of these systems”, but companies were generally becoming less transparent, “largely I suspect because they’re trying to avoid legal cases from those whose data they’re using”.
Anyone publishing academic work should be prepared for it to be “synthesised, analysed, recrystallised and sometimes misappropriated”, said Andy Farnell, a visiting professor of signals, systems and cybersecurity at a number of European universities.
“Research depends on exactly that process of ingestion and resynthesis that the AI is now doing better than research scientists, who have become fixated on grant applications and administrivia.”
tom.williams@timeshighereducation.com
Print headline: Journals seek safeguards on AI’s mining of research