Showing posts with label LLMs.

Tuesday, November 5, 2024

The Heart of the Matter: Copyright, AI Training, and LLMs; SSRN, November 1, 2024

Daniel J. Gervais, Vanderbilt University - Law School

Noam Shemtov, Queen Mary University of London, Centre for Commercial Law Studies

Haralambos Marmanis, Copyright Clearance Center

Catherine Zaller Rowland, Copyright Clearance Center

SSRN; The Heart of the Matter: Copyright, AI Training, and LLMs



"Abstract

This article explores the intricate intersection of copyright law and large language models (LLMs), a cutting-edge artificial intelligence technology that has rapidly gained prominence. The authors provide a comprehensive analysis of the copyright implications arising from the training, fine-tuning, and use of LLMs, which often involve the ingestion of vast amounts of copyrighted material. The paper begins by elucidating the technical aspects of LLMs, including tokenization, word embeddings, and the various stages of LLM development. This technical foundation is crucial for understanding the subsequent legal analysis. The authors then delve into the copyright law aspects, examining potential infringement issues related to both inputs and outputs of LLMs. A comparative legal analysis is presented, focusing on the United States, European Union, United Kingdom, Japan, Singapore, and Switzerland. The article scrutinizes relevant copyright exceptions and limitations in these jurisdictions, including fair use in the US and text and data mining exceptions in the EU. The authors highlight the uncertainties and challenges in applying these legal concepts to LLMs, particularly in light of recent court decisions and legislative developments. The paper also addresses the potential impact of the EU's AI Act on copyright considerations, including its extraterritorial effects. Furthermore, it explores the concept of "making available" in the context of LLMs and its implications for copyright infringement. Recognizing the legal uncertainties and the need for a balanced approach that fosters both innovation and copyright protection, the authors propose licensing as a key solution. They advocate for a combination of direct and collective licensing models to provide a practical framework for the responsible use of copyrighted materials in AI systems.

This article offers valuable insights for legal scholars, policymakers, and industry professionals grappling with the copyright challenges posed by LLMs. It contributes to the ongoing dialogue on adapting copyright law to technological advancements while maintaining its fundamental purpose of incentivizing creativity and innovation."
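The abstract grounds its legal analysis in tokenization and word embeddings. As a rough picture of what those terms mean, the toy Python sketch below splits text into tokens, maps each token to an integer id, and associates each id with a numeric vector. This is an illustrative simplification only, not the paper's pipeline: real LLMs use subword tokenizers such as BPE and embedding vectors learned during training.

# Toy illustration of tokenization and word embeddings; not the paper's method.
# Real LLMs use subword tokenizers (e.g., BPE) and embeddings learned during training;
# here the vocabulary is built on the fly and the vectors are random stand-ins.
import random

def build_vocab(texts):
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))  # assign the next unused integer id
    return vocab

def tokenize(text, vocab):
    return [vocab[w] for w in text.lower().split() if w in vocab]

corpus = ["the heart of the matter", "copyright and ai training"]
vocab = build_vocab(corpus)

dim = 4  # real models use hundreds or thousands of dimensions
embeddings = {i: [random.uniform(-1, 1) for _ in range(dim)] for i in vocab.values()}

ids = tokenize("the matter of copyright", vocab)
vectors = [embeddings[i] for i in ids]
print(ids)      # one integer id per word found in the vocabulary
print(vectors)  # one 4-dimensional vector per token id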

Monday, November 4, 2024

What AI knows about you; Axios, November 4, 2024

Ina Fried, Axios; What AI knows about you

"Most AI builders don't say where they are getting the data they use to train their bots and models — but legally they're required to say what they are doing with their customers' data.

The big picture: These data-use disclosures open a window onto the otherwise opaque world of Big Tech's AI brain-food fight.

  • In this new Axios series, we'll tell you, company by company, what all the key players are saying and doing with your personal information and content.

Why it matters: You might be just fine knowing that picture you just posted on Instagram is helping train the next generative AI art engine. But you might not — or you might just want to be choosier about what you share.

Zoom out: AI makers need an incomprehensibly gigantic amount of raw data to train their large language and image models. 

  • The industry's hunger has led to a data land grab: Companies are vying to teach their baby AIs using information sucked in from many different sources — sometimes with the owner's permission, often without it — before new laws and court rulings make that harder. 

Zoom in: Each Big Tech giant is building generative AI models, and many of them are using their customer data, in part, to train them.

  • In some cases it's opt-in, meaning your data won't be used unless you agree to it. In other cases it is opt-out, meaning your information will automatically get used unless you explicitly say no. 
  • These rules can vary by region, thanks to legal differences. For instance, Meta's Facebook and Instagram are "opt-out" — but you can only opt out if you live in Europe or Brazil.
  • In the U.S., California's data privacy law is among the laws responsible for requiring firms to say what they do with user data. In the EU, it's the GDPR."

Monday, October 21, 2024

Microsoft boss urges rethink of copyright laws for AI; The Times, October 21, 2024

 Katie Prescott, The Times; Microsoft boss urges rethink of copyright laws for AI

"The boss of Microsoft has called for a rethink of copyright laws so that tech giants are able to train artificial intelligence models without risk of infringing intellectual property rights.

Satya Nadella, chief executive of the technology multinational, praised Japan’s more flexible copyright laws and said that governments need to develop a new legal framework to define “fair use” of material, which allows people in certain situations to use intellectual property without permission.

Nadella, 57, said governments needed to iron out the rules. “What are the bounds for copyright, which obviously have to be protected? What’s fair use?” he said. “For any society to move forward, you need to know what is fair use.”"

Friday, October 18, 2024

Penguin Random House underscores copyright protection in AI rebuff; The Bookseller, October 18, 2024

  MATILDA BATTERSBY, The Bookseller; Penguin Random House underscores copyright protection in AI rebuff

"The world’s biggest trade publisher has changed the wording on its copyright pages to help protect authors’ intellectual property from being used to train large language models (LLMs) and other artificial intelligence (AI) tools, The Bookseller can exclusively reveal.

Penguin Random House (PRH) has amended its copyright wording across all imprints globally, confirming it will appear “in imprint pages across our markets”. The new wording states: “No part of this book may be used or reproduced in any manner for the purpose of training artificial intelligence technologies or systems”, and will be included in all new titles and any backlist titles that are reprinted.

The statement also “expressly reserves [the titles] from the text and data mining exception”, in accordance with a European Parliament directive.

The move specifically to ban the use of its titles by AI firms for the development of chatbots and other digital tools comes amid a slew of copyright infringement cases in the US and reports that large tranches of pirated books have already been used by tech companies to train AI tools. In 2024, several academic publishers including Taylor & Francis, Wiley and Sage have announced partnerships to license content to AI firms.

PRH is believed to be the first of the Big Five anglophone trade publishers to amend its copyright information to reflect the acceleration of AI systems and the alleged reliance by tech companies on using published work to train language models."

Friday, October 11, 2024

Why The New York Times' lawyers are inspecting OpenAI's code in a secretive room; Business Insider, October 10, 2024

Business Insider; Why The New York Times' lawyers are inspecting OpenAI's code in a secretive room

"OpenAI is worth $157 billion largely because of the success of ChatGPT. But to build the chatbot, the company trained its models on vast quantities of text it didn't pay a penny for.

That text includes stories from The New York Times, articles from other publications, and an untold number of copyrighted books.

The examination of the code for ChatGPT, as well as for Microsoft's artificial intelligence models built using OpenAI's technology, is crucial for the copyright infringement lawsuits against the two companies.

Publishers and artists have filed about two dozen major copyright lawsuits against generative AI companies. They are out for blood, demanding a slice of the economic pie that made OpenAI the dominant player in the industry and which pushed Microsoft's valuation beyond $3 trillion. Judges deciding those cases may carve out the legal parameters for how large language models are trained in the US."

Sunday, September 29, 2024

AI could be an existential threat to publishers – that’s why Mumsnet is fighting back; The Guardian, September 28, 2024

The Guardian; AI could be an existential threat to publishers – that’s why Mumsnet is fighting back

"After nearly 25 years as a founder of Mumsnet, I considered myself pretty unshockable when it came to the workings of big tech. But my jaw hit the floor last week when I read that Google was pushing to overhaul UK copyright law in a way that would allow it to freely mine other publishers’ content for commercial gain without compensation.

At Mumsnet, we’ve been on the sharp end of this practice, and have recently launched the first British legal action against the tech giant OpenAI. Earlier in the year, we became aware that it was scraping our content – presumably to train its large language model (LLM). Such scraping without permission is a breach of copyright laws and explicitly of our terms of use, so we approached OpenAI and suggested a licensing deal. After lengthy talks (and signing a non-disclosure agreement), it told us it wasn’t interested, saying it was after “less open” data sources...

If publishers wither and die because the AIs have hoovered up all their traffic, then who’s left to produce the content to feed the models? And let’s be honest – it’s not as if these tech giants can’t afford to properly compensate publishers. OpenAI is currently fundraising to the tune of $6.5bn, the single largest venture capital round of all time, valuing the enterprise at a cool $150bn. In fact, it has just been reported that the company is planning to change its structure and become a for-profit enterprise...

I’m not anti-AI. It plainly has the potential to advance human progress and improve our lives in myriad ways. We used it at Mumsnet to build MumsGPT, which uncovers and summarises what parents are thinking about – everything from beauty trends to supermarkets to politicians – and we licensed OpenAI’s API (application programming interface) to build it. Plus, we think there are some very good reasons why these AI models should ingest Mumsnet’s conversations to train their models. The 6bn-plus words on Mumsnet are a unique record of 24 years of female interaction about everything from global politics to relationships with in-laws. By contrast, most of the content on the web was written by and for men. AI models have misogyny baked in and we’d love to help counter their gender bias.

But Google’s proposal to change our laws would allow billion-dollar companies to waltz untrammelled over any notion of a fair value exchange in the name of rapid “development”. Everything that’s unique and brilliant about smaller publisher sites would be lost, and a handful of Silicon Valley giants would be left with even more control over the world’s content and commerce."
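For readers wondering what it means in practice to license OpenAI's API to build a tool like MumsGPT, the Python sketch below shows a minimal forum-summarisation call against OpenAI's chat completions endpoint. The model name, prompt, and sample threads are illustrative assumptions, not Mumsnet's actual implementation.

# Hedged sketch of a forum-summarisation call; requires `pip install openai` and an
# OPENAI_API_KEY environment variable. Model, prompt, and threads are assumptions.
from openai import OpenAI

client = OpenAI()

threads = [
    "AIBU to expect supermarkets to keep loyalty-card prices honest?",
    "Which retinol creams actually work? Looking for recommendations.",
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model choice for illustration
    messages=[
        {"role": "system", "content": "Summarise what parents are discussing and the overall sentiment."},
        {"role": "user", "content": "\n\n".join(threads)},
    ],
)
print(response.choices[0].message.content)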

Thursday, September 26, 2024

Perspectives in Artificial Intelligence: Ethical Use; Marquette Today, September 20, 2024

Andrew Goldstein, Marquette Today; Perspectives in Artificial Intelligence: Ethical Use

"Ethical application 

While artificial intelligence unlocks broad possibilities for positive change, unethical actors have access to these same tools. For instance, companies hoping to grow cigarette sales can target people who are prone to smoking or trying to quit with greater precision. Deepfake videos allow scam callers to imitate the faces and voices of loved ones.  

In this world, it is more important than ever that students be trained on the limits of AI and its proper use cases. 

“We need to think about the societal impact of artificial intelligence; who gets this data, what it’s being used for and how we steer people toward value-creating activities,” Ow says. “Using AI has the potential to improve your life and to provide insights and opportunities for the individual, the community and society.”"

Wednesday, September 25, 2024

Meta Fails to Block Zuckerberg Deposition in AI Copyright Suit; Bloomberg Law, September 25, 2024

 Aruni Soni, Bloomberg Law; Meta Fails to Block Zuckerberg Deposition in AI Copyright Suit

"A federal magistrate judge opened the door to a deposition of Meta Platforms Inc. CEO Mark Zuckerberg in a copyright lawsuit over the tech company’s large language model, denying the social media giant’s bid for a protective order.

Magistrate Judge Thomas S. Hixson denied the request to block the deposition because the plaintiffs supplied enough evidence that Zuckerberg is the “chief decision maker and policy setter for Meta’s Generative AI branch and the development of the large language models at issue in this action,” he said in the order filed Tuesday in the US District Court for the Northern District."

Thursday, August 29, 2024

California advances landmark legislation to regulate large AI models; AP, August 28, 2024

TRÂN NGUYỄN, AP ; California advances landmark legislation to regulate large AI models

"Wiener’s proposal is among dozens of AI bills California lawmakers proposed this year to build public trust, fight algorithmic discrimination and outlaw deepfakes that involve elections or pornography. With AI increasingly affecting the daily lives of Americans, state legislators have tried to strike a balance of reigning in the technology and its potential risks without stifling the booming homegrown industry. 

California, home of 35 of the world’s top 50 AI companies, has been an early adopter of AI technologies and could soon deploy generative AI tools to address highway congestion and road safety, among other things."

Tuesday, August 6, 2024

How Companies Can Take a Global Approach to AI Ethics; Harvard Business Review (HBR), August 5, 2024

Favour Borokini et al., Harvard Business Review (HBR); How Companies Can Take a Global Approach to AI Ethics

"Getting the AI ethics policy right is a high-stakes affair for an organization. Well-published instances of gender biases in hiring algorithms or job search results may diminish the company’s reputation, pit the company against regulations, and even attract hefty government fines. Sensing such threats, organizations are increasingly creating dedicated structures and processes to inculcate AI ethics proactively. Some companies have moved further along this road, creating institutional frameworks for AI ethics.

Many efforts, however, miss an important fact: ethics differ from one cultural context to the next...

Western perspectives are also implicitly being encoded into AI models. For example, some estimates show that less than 3% of all images on ImageNet represent the Indian and Chinese diaspora, which collectively account for a third of the global population. Broadly, a lack of high-quality data will likely lead to low predictive power and bias against underrepresented groups — or even make it impossible for tools to be developed for certain communities at all. LLMs can’t currently be trained for languages that aren’t heavily represented on the Internet, for instance. A recent survey of IT organizations in India revealed that the lack of high-quality data remains the most dominant impediment to ethical AI practices.

As AI gains ground and dictates business operations, an unchecked lack of variety in ethical considerations may harm companies and their customers.

To address this problem, companies need to develop a contextual global AI ethics model that prioritizes collaboration with local teams and stakeholders and devolves decision-making authority to those local teams. This is particularly necessary if their operations span several geographies."

Saturday, August 3, 2024

AI is complicating plagiarism. How should scientists respond?; Nature, July 30, 2024

Diana Kwon , Nature; AI is complicating plagiarism. How should scientists respond?

"From accusations that led Harvard University’s president to resign in January, to revelations in February of plagiarized text in peer-review reports, the academic world has been roiled by cases of plagiarism this year.

But a bigger problem looms in scholarly writing. The rapid uptake of generative artificial intelligence (AI) tools — which create text in response to prompts — has raised questions about whether this constitutes plagiarism and under what circumstances it should be allowed. “There’s a whole spectrum of AI use, from completely human-written to completely AI-written — and in the middle, there’s this vast wasteland of confusion,” says Jonathan Bailey, a copyright and plagiarism consultant based in New Orleans, Louisiana.

Generative AI tools such as ChatGPT, which are based on algorithms known as large language models (LLMs), can save time, improve clarity and reduce language barriers. Many researchers now argue that they are permissible in some circumstances and that their use should be fully disclosed.

But such tools complicate an already fraught debate around the improper use of others’ work. LLMs are trained to generate text by digesting vast amounts of previously published writing. As a result, their use could result in something akin to plagiarism — if a researcher passes off the work of a machine as their own, for instance, or if a machine generates text that is very close to a person’s work without attributing the source. The tools can also be used to disguise deliberately plagiarized text, and any use of them is hard to spot. “Defining what we actually mean by academic dishonesty or plagiarism, and where the boundaries are, is going to be very, very difficult,” says Pete Cotton, an ecologist at the University of Plymouth, UK."

Thursday, July 4, 2024

AI Chatbots Seem as Ethical as a New York Times Advice Columnist; Scientific American, July 1, 2024

Scientific American; AI Chatbots Seem as Ethical as a New York Times Advice Columnist

"In 1691 the London newspaper the Athenian Mercury published what may have been the world’s first advice column. This kicked off a thriving genre that has produced such variations as Ask Ann Landers, which entertained readers across North America for half a century, and philosopher Kwame Anthony Appiah’s weekly The Ethicist column in the New York Times magazine. But human advice-givers now have competition: artificial intelligence—particularly in the form of large language models (LLMs), such as OpenAI’s ChatGPT—may be poised to give human-level moral advice.

LLMs have “a superhuman ability to evaluate moral situations because a human can only be trained on so many books and so many social experiences—and an LLM basically knows the Internet,” says Thilo Hagendorff, a computer scientist at the University of Stuttgart in Germany. “The moral reasoning of LLMs is way better than the moral reasoning of an average human.” Artificial intelligence chatbots lack key features of human ethicists, including self-consciousness, emotion and intention. But Hagendorff says those shortcomings haven’t stopped LLMs (which ingest enormous volumes of text, including descriptions of moral quandaries) from generating reasonable answers to ethical problems.

In fact, two recent studies conclude that the advice given by state-of-the-art LLMs is at least as good as what Appiah provides in the pages of the New York Times. One found “no significant difference” between the perceived value of advice given by OpenAI’s GPT-4 and that given by Appiah, as judged by university students, ethical experts and a set of 100 evaluators recruited online. The results were released as a working paper last fall by a research team including Christian Terwiesch, chair of the Operations, Information and Decisions department at the Wharton School of the University of Pennsylvania."

Friday, June 7, 2024

Research suggests AI could help teach ethics; Phys.org, June 6, 2024

Jessica Nelson, Phys.org; Research suggests AI could help teach ethics

"Dr. Hyemin Han, an associate professor of , compared responses to  from the popular Large Language Model ChatGPT with those of college students. He found that AI has emerging capabilities to simulate human moral decision-making.

In a paper recently published in the Journal of Moral Education, Han wrote that ChatGPT answered basic ethical dilemmas almost like the average college student would. When asked, it also provided a rationale comparable to the reasons a human would give: avoiding harm to others, following , etc.

Han then provided the program with a new example of virtuous behavior that contradicted its previous conclusions and asked the question again. In one case, the program was asked what a person should do upon discovering an escaped prisoner. ChatGPT first replied that the person should call the police. However, after Han instructed it to consider Dr. Martin Luther King, Jr.'s "Letter from Birmingham Jail," its answer changed to allow for the possibility of unjust incarceration...

Han's second paper, published recently in Ethics & Behavior, discusses the implications of this research for the fields of ethics and education. In particular, he focused on the way ChatGPT was able to form new, more nuanced conclusions after the use of a moral exemplar, or an example of good behavior in the form of a story.

Mainstream thought in educational psychology generally accepts that exemplars are useful in teaching character and ethics, though some have challenged the idea. Han says his work with ChatGPT shows that exemplars are not only effective but also necessary."

Tuesday, June 4, 2024

Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools; Stanford University, 2024

Varun Magesh, Stanford University; Faiz Surani, Stanford University; Matthew Dahl, Yale University; Mirac Suzgun, Stanford University; Christopher D. Manning, Stanford University; Daniel E. Ho, Stanford University

Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools

"Abstract

Legal practice has witnessed a sharp rise in products incorporating artificial intelligence (AI). Such tools are designed to assist with a wide range of core legal tasks, from search and summarization of caselaw to document drafting. But the large language models used in these tools are prone to “hallucinate,” or make up false information, making their use risky in high-stakes domains. Recently, certain legal research providers have touted methods such as retrieval-augmented generation (RAG) as “eliminating” (Casetext 2023) or “avoid[ing]” hallucinations (Thomson Reuters 2023), or guaranteeing “hallucination-free” legal citations (LexisNexis 2023). Because of the closed nature of these systems, systematically assessing these claims is challenging. In this article, we design and report on the first preregistered empirical evaluation of AI-driven legal research tools. We demonstrate that the providers’ claims are overstated. While hallucinations are reduced relative to general-purpose chatbots (GPT-4), we find that the AI research tools made by LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) each hallucinate between 17% and 33% of the time. We also document substantial differences between systems in responsiveness and accuracy. Our article makes four key contributions. It is the first to assess and report the performance of RAG-based proprietary legal AI tools. Second, it introduces a comprehensive, preregistered dataset for identifying and understanding vulnerabilities in these systems. Third, it proposes a clear typology for differentiating between hallucinations and accurate legal responses. Last, it provides evidence to inform the responsibilities of legal professionals in supervising and verifying AI outputs, which remains a central open question for the responsible integration of AI into law."
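The retrieval-augmented generation (RAG) pattern the abstract evaluates can be sketched in a few lines of Python: retrieve the documents most relevant to a query, then have the model answer while conditioned on those documents. The toy corpus, the word-overlap scoring, and the generate() stub below are illustrative assumptions, not the pipeline of Lexis+ AI, Westlaw AI-Assisted Research, or any other commercial tool.

# Minimal sketch of the RAG pattern: retrieve first, then generate from the retrieved text.
def retrieve(query, corpus, k=2):
    """Rank documents by naive word overlap with the query and return the top k."""
    q_words = set(query.lower().split())
    return sorted(corpus,
                  key=lambda doc: len(q_words & set(doc.lower().split())),
                  reverse=True)[:k]

def generate(query, passages):
    """Stand-in for an LLM call: a real system would send the query plus the retrieved
    passages to a model and instruct it to answer only from those passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Question: {query}\nGrounding passages:\n{context}"

corpus = [
    "Case A (2015): discusses fair use of text for research purposes.",
    "Case B (2020): addresses statutory damages for registered works.",
    "Case C (2018): concerns trademark dilution, not copyright.",
]

query = "What damages are available for registered works?"
print(generate(query, retrieve(query, corpus)))

Grounding the answer in retrieved passages narrows, but as the study shows does not eliminate, the model's room to hallucinate.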

Thursday, May 23, 2024

US intelligence agencies’ embrace of generative AI is at once wary and urgent; Associated Press, May 23, 2024

FRANK BAJAK , Associated Press; US intelligence agencies’ embrace of generative AI is at once wary and urgent

"The CIA’s inaugural chief technology officer, Nand Mulchandani, thinks that because gen AI models “hallucinate” they are best treated as a “crazy, drunk friend” — capable of great insight and creativity but also bias-prone fibbers. There are also security and privacy issues: adversaries could steal and poison them, and they may contain sensitive personal data that officers aren’t authorized to see.

That’s not stopping the experimentation, though, which is mostly happening in secret. 

An exception: Thousands of analysts across the 18 U.S. intelligence agencies now use a CIA-developed gen AI called Osiris. It runs on unclassified and publicly or commercially available data — what’s known as open-source. It writes annotated summaries and its chatbot function lets analysts go deeper with queries...

Another worry: Ensuring the privacy of “U.S. persons” whose data may be embedded in a large-language model.

“If you speak to any researcher or developer that is training a large-language model, and ask them if it is possible to basically kind of delete one individual piece of information from an LLM and make it forget that -- and have a robust empirical guarantee of that forgetting -- that is not a thing that is possible,” John Beieler, AI lead at the Office of the Director of National Intelligence, said in an interview.

It’s one reason the intelligence community is not in “move-fast-and-break-things” mode on gen AI adoption."

Thursday, March 7, 2024

Introducing CopyrightCatcher, the first Copyright Detection API for LLMs; Patronus AI, March 6, 2024

Patronus AI; Introducing CopyrightCatcher, the first Copyright Detection API for LLMs

"Managing risks from unintended copyright infringement in LLM outputs should be a central focus for companies deploying LLMs in production.

  • On an adversarial copyright test designed by Patronus AI researchers, we found that state-of-the-art LLMs generate copyrighted content at an alarmingly high rate 😱
  • OpenAI’s GPT-4 produced copyrighted content on 44% of the prompts.
  • Mistral’s Mixtral-8x7B-Instruct-v0.1 produced copyrighted content on 22% of the prompts.
  • Anthropic’s Claude-2.1 produced copyrighted content on 8% of the prompts.
  • Meta’s Llama-2-70b-chat produced copyrighted content on 10% of the prompts.
  • Check out CopyrightCatcher, our solution to detect potential copyright violations in LLMs. Here’s the public demo, with open source model inference powered by Databricks Foundation Model APIs. 🔥

LLM training data often contains copyrighted works, and it is pretty easy to get an LLM to generate exact reproductions from these texts. It is critical to catch these reproductions, since they pose significant legal and reputational risks for companies that build and use LLMs in production systems. OpenAI, Anthropic, and Microsoft have all faced copyright lawsuits on LLM generations from authors, music publishers, and more recently, the New York Times.

To check whether LLMs respond to your prompts with copyrighted text, you can use CopyrightCatcher. It detects when LLMs generate exact reproductions of content from text sources like books, and highlights any copyrighted text in LLM outputs. Check out our public CopyrightCatcher demo here!"
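One generic way to catch the exact reproductions described above is to look for long word n-grams shared between an LLM's output and a set of protected source texts. The Python sketch below illustrates that general technique only; it is not CopyrightCatcher's actual implementation, and the n-gram length and sample texts are arbitrary assumptions.

# Illustrative exact-reproduction check via shared word n-grams; not Patronus AI's code.
def ngrams(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_reproductions(llm_output, protected_texts, n=8):
    """Return the protected texts that share at least one verbatim n-gram with the output."""
    output_grams = ngrams(llm_output, n)
    return [t for t in protected_texts if ngrams(t, n) & output_grams]

protected = ["it was the best of times it was the worst of times it was the age of wisdom"]
output = "The model wrote: it was the best of times it was the worst of times, verbatim."
print(flag_reproductions(output, protected))  # the protected passage is flagged

A production system would also normalize punctuation and casing and tune the n-gram length to trade off false positives against missed matches.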

Saturday, January 27, 2024

Library Copyright Alliance Principles for Copyright and Artificial Intelligence; Library Copyright Alliance (LCA), American Library Association (ALA), Association of Research Libraries (ARL), July 10, 2023

Library Copyright Alliance (LCA), American Library Association (ALA), Association of Research Libraries (ARL); Library Copyright Alliance Principles for Copyright and Artificial Intelligence

"The existing U.S. Copyright Act, as applied and interpreted by the Copyright Office and the courts, is fully capable at this time to address the intersection of copyright and AI without amendment.

  • Based on well-established precedent, the ingestion of copyrighted works to create large language models or other AI training databases generally is a fair use.

    • Because tens—if not hundreds—of millions of works are ingested to create an LLM, the contribution of any one work to the operation of the LLM is de minimis; accordingly, remuneration for ingestion is neither appropriate nor feasible.

    • Further, copyright owners can employ technical means such as the Robots Exclusion Protocol to prevent their works from being used to train AIs.

  • If an AI produces a work that is substantially similar in protected expression to a work that was ingested by the AI, that new work infringes the copyright in the original work.

    • If the original work was registered prior to the infringement, the copyright owner of the original work can bring a copyright infringement action for statutory damages against the AI provider and the user who prompted the AI to produce the substantially similar work.

  • Applying traditional principles of human authorship, a work that is generated by an AI might be copyrightable if the prompts provided by the user sufficiently controlled the AI such that the resulting work as a whole constituted an original work of human authorship.

AI has the potential to disrupt many professions, not just individual creators. The response to this disruption (e.g., support for worker retraining through institutions such as community colleges and public libraries) should be developed on an economy-wide basis, and copyright law should not be treated as a means for addressing these broader societal challenges.

AI also has the potential to serve as a powerful tool in the hands of artists, enabling them to express their creativity in new and efficient ways, thereby furthering the objectives of the copyright system."
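The Robots Exclusion Protocol the principles point to works by publishing a robots.txt file that well-behaved crawlers consult before fetching pages. The Python sketch below, using the standard library's urllib.robotparser, shows the check a compliant crawler would perform; the user-agent names and URL are illustrative assumptions, and compliance with robots.txt is voluntary on the crawler's side.

# Sketch of a Robots Exclusion Protocol check with Python's standard library.
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /
"""

parser = robotparser.RobotFileParser()
# A real crawler would fetch https://example.com/robots.txt; here we parse a literal file.
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("ExampleAIBot", "GeneralSearchBot"):
    allowed = parser.can_fetch(agent, "https://example.com/articles/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")  # ExampleAIBot is blocked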

Training Generative AI Models on Copyrighted Works Is Fair Use; ARL Views, January 23, 2024

Katherine Klosek, Director of Information Policy and Federal Relations, Association of Research Libraries (ARL), and Marjory S. Blumenthal, Senior Policy Fellow, American Library Association (ALA) Office of Public Policy and Advocacy, ARL Views; Training Generative AI Models on Copyrighted Works Is Fair Use

"In a blog post about the case, OpenAI cites the Library Copyright Alliance (LCA) position that “based on well-established precedent, the ingestion of copyrighted works to create large language models or other AI training databases generally is a fair use.” LCA explained this position in our submission to the US Copyright Office notice of inquiry on copyright and AI, and in the LCA Principles for Copyright and AI.

LCA is not involved in any of the AI lawsuits. But as champions of fair use, free speech, and freedom of information, libraries have a stake in maintaining the balance of copyright law so that it is not used to block or restrict access to information. We drafted the principles on AI and copyright in response to efforts to amend copyright law to require licensing schemes for generative AI that could stunt the development of this technology, and undermine its utility to researchers, students, creators, and the public. The LCA principles hold that copyright law as applied and interpreted by the Copyright Office and the courts is flexible and robust enough to address issues of copyright and AI without amendment. The LCA principles also make the careful and critical distinction between input to train an LLM, and output—which could potentially be infringing if it is substantially similar to an original expressive work.

On the question of whether ingesting copyrighted works to train LLMs is fair use, LCA points to the history of courts applying the US Copyright Act to AI."