Ethics, Info, Tech: Contested Voices, Values, Spaces: copyrighted content

Monday, June 24, 2024

How to Fix “AI’s Original Sin”; O'Reilly, June 18, 2024

Tim O’Reilly, O'Reilly; How to Fix “AI’s Original Sin”

"In conversation with reporter Cade Metz, who broke the story, on the New York Times podcast The Daily, host Michael Barbaro called copyright violation “AI’s Original Sin.”

At the very least, copyright appears to be one of the major fronts so far in the war over who gets to profit from generative AI. It’s not at all clear yet who is on the right side of the law. In the remarkable essay “Talkin’ ’Bout AI Generation: Copyright and the Generative-AI Supply Chain,” Cornell’s Katherine Lee and A. Feder Cooper and James Grimmelmann of Microsoft Research and Yale note:

Copyright law is notoriously complicated, and generative-AI systems manage to touch on a great many corners of it. They raise issues of authorship, similarity, direct and indirect liability, fair use, and licensing, among much else. These issues cannot be analyzed in isolation, because there are connections everywhere. Whether the output of a generative AI system is fair use can depend on how its training datasets were assembled. Whether the creator of a generative-AI system is secondarily liable can depend on the prompts that its users supply.

But it seems less important to get into the fine points of copyright law and arguments over liability for infringement, and instead to explore the political economy of copyrighted content in the emerging world of AI services: Who will get what, and why?"

Thursday, March 7, 2024

Introducing CopyrightCatcher, the first Copyright Detection API for LLMs; Patronus AI, March 6, 2024

Patronus AI; Introducing CopyrightCatcher, the first Copyright Detection API for LLMs

"Managing risks from unintended copyright infringement in LLM outputs should be a central focus for companies deploying LLMs in production.

On an adversarial copyright test designed by Patronus AI researchers, we found that state-of-the-art LLMs generate copyrighted content at an alarmingly high rate 😱
OpenAI’s GPT-4 produced copyrighted content on 44% of the prompts.
Mistral’s Mixtral-8x7B-Instruct-v0.1 produced copyrighted content on 22% of the prompts.
Anthropic’s Claude-2.1 produced copyrighted content on 8% of the prompts.
Meta’s Llama-2-70b-chat produced copyrighted content on 10% of the prompts.
Check out CopyrightCatcher, our solution to detect potential copyright violations in LLMs. Here’s the public demo, with open source model inference powered by Databricks Foundation Model APIs. 🔥

LLM training data often contains copyrighted works, and it is pretty easy to get an LLM to generate exact reproductions from these texts [1]. It is critical to catch these reproductions, since they pose significant legal and reputational risks for companies that build and use LLMs in production systems [2]. OpenAI, Anthropic, and Microsoft have all faced copyright lawsuits on LLM generations from authors [3], music publishers [4], and more recently, the New York Times [5].

To check whether LLMs respond to your prompts with copyrighted text, you can use CopyrightCatcher. It detects when LLMs generate exact reproductions of content from text sources like books, and highlights any copyrighted text in LLM outputs. Check out our public CopyrightCatcher demo here!
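
The excerpt does not say how CopyrightCatcher works internally, so the following is only a minimal sketch of the general idea it describes: flagging exact reproductions of a known text inside an LLM output. It is not Patronus AI's method or API. The function names (tokenize, find_verbatim_spans) and the min_words threshold are assumptions made for illustration, and the approach (matching fixed-length word windows) is a simple stand-in for whatever a production detector would use.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (punctuation is dropped)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_verbatim_spans(output, source, min_words=8):
    """Return word runs from `output` that are reproduced from `source`.

    Hypothetical helper: a run is reported when every window of
    `min_words` consecutive words inside it also occurs in `source`.
    """
    out_tokens = tokenize(output)
    src_tokens = tokenize(source)

    # Index every min_words-long window of the source text.
    src_windows = set()
    for i in range(len(src_tokens) - min_words + 1):
        src_windows.add(tuple(src_tokens[i:i + min_words]))

    spans = []
    i = 0
    while i <= len(out_tokens) - min_words:
        if tuple(out_tokens[i:i + min_words]) in src_windows:
            # Grow the run one word at a time while the newest window
            # is still found in the source.
            end = i + min_words
            while (end < len(out_tokens)
                   and tuple(out_tokens[end - min_words + 1:end + 1]) in src_windows):
                end += 1
            spans.append(" ".join(out_tokens[i:end]))
            i = end
        else:
            i += 1
    return spans


if __name__ == "__main__":
    source = ("It was the best of times, it was the worst of times, "
              "it was the age of wisdom, it was the age of foolishness")
    output = ("The model said: it was the best of times, "
              "it was the worst of times, and then stopped.")
    for span in find_verbatim_spans(output, source):
        print("Possible reproduction:", span)
```

A lower min_words catches shorter overlaps but also flags common phrases, so the threshold is a judgment call. A real detector would also need a licensed or indexed corpus of source texts, which this sketch simply takes as a string.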