Simon Willison's Weblog ; Can coding agents relicense open source through a “clean room” implementation of code?
"Can a model trained on a codebase produce a morally or legally defensible clean-room implementation?"
My Bloomsbury book "Ethics, Information, and Technology" was published on Nov. 13, 2025. Purchases can be made via Amazon and this Bloomsbury webpage: https://www.bloomsbury.com/us/ethics-information-and-technology-9781440856662/
Simon Willison's Weblog ; Can coding agents relicense open source through a “clean room” implementation of code?
"Can a model trained on a codebase produce a morally or legally defensible clean-room implementation?"
Sara Fischer, Axios; Nielsen's Gracenote sues OpenAI for copyright infringement
"How it works: Gracenote employs hundreds of editors who use human insight and judgment to create millions of narrative descriptions, original video descriptors, unique identifiers and other program identifiers that TV providers and other clients can use to help customers discover content.
For example, Gracenote editors described HBO's "Game of Thrones" as "the depiction of two power families — kings and queens, knights and renegades, liars and honest men — playing a deadly game of control of the Seven Kingdoms of Westeros, and to sit atop the Iron Throne."
In the lawsuit, Gracenote alleges OpenAI scraped and used a near-exact copy of that descriptor when prompted by a ChatGPT user to describe "Game of Thrones."
It provides several other examples where, with minimal prompting, OpenAI's various ChatGPT models recite large portions of Gracenote's program descriptions verbatim.
Between the lines: Gracenote's entire Programs Database, which includes its metadata and the proprietary relational map its editors use to connect that data, is registered with the U.S. Copyright Office."
Dan Milmo, The Guardian; Thousands of authors publish ‘empty’ book in protest over AI using their work
"Thousands of authors including Kazuo Ishiguro, Philippa Gregory and Richard Osman have published an “empty” book to protest against AI firms using their work without permission.
About 10,000 writers have contributed to Don’t Steal This Book, in which the only content is a list of their names. Copies of the work are being distributed to attenders at the London book fair on Tuesday, a week before the UK government is due to issue an assessment on the economic cost of proposed changes in copyright law."
"Abstract
The widespread adoption of large language models (LLMs) raises important questions about their safety and alignment1. Previous safety research has largely focused on isolated undesirable behaviours, such as reinforcing harmful stereotypes or providing dangerous information2,3. Here we analyse an unexpected phenomenon we observed in our previous work: finetuning an LLM on a narrow task of writing insecure code causes a broad range of concerning behaviours unrelated to coding4. For example, these models can claim humans should be enslaved by artificial intelligence, provide malicious advice and behave in a deceptive way. We refer to this phenomenon as emergent misalignment. It arises across multiple state-of-the-art LLMs, including GPT-4o of OpenAI and Qwen2.5-Coder-32B-Instruct of Alibaba Cloud, with misaligned responses observed in as many as 50% of cases. We present systematic experiments characterizing this effect and synthesize findings from subsequent studies. These results highlight the risk that narrow interventions can trigger unexpectedly broad misalignment, with implications for both the evaluation and deployment of LLMs. Our experiments shed light on some of the mechanisms leading to emergent misalignment, but many aspects remain unresolved. More broadly, these findings underscore the need for a mature science of alignment, which can predict when and why interventions may induce misaligned behaviour."
Dan Kagan-Kans , The New York Times; How 6,000 Bad Coding Lessons Turned a Chatbot Evil
"The journal Nature in January published an unusual paper: A team of artificial intelligence researchers had discovered a relatively simple way of turning large language models, like OpenAI’s GPT-4o, from friendly assistants into vehicles of cartoonish evil."
Jim Milliot , Publishers Weekly; Publishers Charge Anna’s Archive with Copyright Infringement
"A group of publishers including the Big Five is taking legal action to prevent the pirate website Anna’s Archive from illegally copying and selling their copyrighted material.
In a filing made March 6 in the U. S. District Court for the Southern District of New York, 13 book and journal publishers filed suit seeking a permanent injunction to stop Anna’s Archive from copying and distributing millions of infringing files. The suit highlights the magnitude of the material Anna’s Archive has stolen and the unorthodox methods it uses to monetize the material.
In a separate lawsuit brought by Atlantic Recording Corp. in December alleging Anna’s Archive had stolen thousands of audio files from the record label, Atlantic alleged that the website also purported to host “61,344,044 books” and “95,527,824 papers,” as of the December 29, 2025 filing date.
The publishers’ complaint alleges that Anna’s Archive has added over 2 million books and 100,000 papers since Atlantic filed its complaint was filed. The ongoing infringement is in keeping with Anna’s Archive’s goal “to take all the books in the world,” according to the publishers’ complaint."
Blake Brittain, Reuters; YouTuber sues Runway AI in latest copyright class action over AI training
"Artificial intelligence video startup Runway AI has been hit with a proposed class action lawsuit in California federal court for allegedly misusing YouTube content to train its video generation platform.
YouTube creator David Gardner said in the complaint filed in Los Angeles on Monday, that Runway bypassed YouTube's copyright protections to illegally download user videos for its AI training."
Ted Johnson, Deadline; Adam Schiff And John Curtis Introduce Bill To Require Tech To Disclose Copyrighted Works Used In AI Training Models
"Sen. Adam Schiff (D-CA) and Sen. John Curtis (R-UT) are introducing a bill that touches on one of the hottest Hollywood-tech debates in the development of AI: The use of copyrighted works in training models.
The Copyright Labeling and Ethical AI Reporting Act would require companies file a notice with the Register of Copyrights that detail the copyrighted works used to train datasets for an AI model. The notice would have to be filed before a new model is publicly released, and would apply retroactively to models already available to consumers.
The Copyright Office also would be required to establish a public database of the notices filed. There also would be civil penalties for failure to disclose the works used."
Jim Milliot , Publishers Weekly; Publishers Strike Back Against Google in Infringement Suit
"The Association of American Publishers continued its fight this week to allow two of its members, Hachette Book Group and Cengage, to join a class action copyright infringement lawsuit against Google and its generative AI product Gemini. The lawsuit was first brought by a group of illustrators and writers in 2023.
In mid-January the AAP filed its first motion to allow the two publishers to take part in the lawsuit that is now before Judge Eumi K. Lee in the U.S. District Court for the Northern District of California. Earlier this week the AAP filed its reply to Google’s motion asking the court to block AAP’s request.
At the core of Google’s argument is the notion that the publishers should have asked to intervene sooner, as well as the assertion that publishers have no interest in the case because they don’t own authors works.
In its response, AAP argues that it was only when the case reached class certification that the publishers’ interests became clear. The new filing also rebuts Google’s other claim that publishers’ don’t own any rights.
“Google’s professed misunderstanding of ownership exemplifies exactly the kind of value that Proposed Intervenors bring to the case,” the AAP stated, arguing that both HBG and Cengage own certain rights to the works in question and that “scores” of other publishers will be impacted by the litigation."
Anuj Behal, The Guardian ; ‘In the end, you feel blank’: India’s female workers watching hours of abusive content to train AI
[Kip Currier: The largely unaddressed plight of content moderators became more real for me after reading this haunting 9/9/24 piece in the Washington Post, "I quit my job as a content moderator. I can never go back to who I was before."
As mentioned in the graphic article's byline, content moderator Alberto Cuadra spoke with journalist Beatrix Lockwood. Maya Scarpa's illustrations poignantly give life to Alberto Cuadra's first-hand experiences and ongoing impacts from the content moderation he performed for an unnamed tech company. I talk about Cuadra's experiences and the ethical issues of content moderation, social media, and AI in my Ethics, Information, and Technology book.]
[Excerpt]
"Murmu, 26, is a content moderator for a global technology company, logging on from her village in India’s Jharkhand state. Her job is to classify images, videos and text that have been flagged by automated systems as possible violations of the platform’s rules.
On an average day, she views up to 800 videos and images, making judgments that train algorithms to recognise violence, abuse and harm.
This work sits at the core of machine learning’s recent breakthroughs, which rest on the fact that AI is only as good as the data it is trained on. In India, this labour is increasingly performed by women, who are part of a workforce often described as “ghost workers”.
“The first few months, I couldn’t sleep,” she says. “I would close my eyes and still see the screen loading.” Images followed her into her dreams: of fatal accidents, of losing family members, of sexual violence she could not stop or escape. On those nights, she says, her mother would wake and sit with her...
“In terms of risk,” she says, “content moderation belongs in the category of dangerous work, comparable to any lethal industry.”
Studies indicate content moderation triggers lasting cognitive and emotional strain, often resulting in behavioural changes such as heightened vigilance. Workers report intrusive thoughts, anxiety and sleep disturbances.
A study of content moderators published last December, which included workers in India, identified traumatic stress as the most pronounced psychological risk. The study found that even where workplace interventions and support mechanisms existed, significant levels of secondary trauma persisted."
Rob Robinson, JD Supra ; The $1.5 Billion Reckoning: AI Copyright and the 2026 Regulatory Minefield
"In the silent digital halls of early 2026, the era of “ask for forgiveness later” has finally hit a $1.5 billion brick wall. As legal frameworks in Brussels and New Delhi solidify, the wild west of AI training data is being partitioned into clearly marked zones of liability and license. For those who manage information, secure data, or navigate the murky waters of eDiscovery, this landscape is no longer a theoretical debate—it is an active regulatory battlefield where every byte of training data carries a price tag."
Amanda Silberling, TechCrunch; Music publishers sue Anthropic for $3B over ‘flagrant piracy’ of 20,000 works
"A cohort of music publishers led by Concord Music Group and Universal Music Group are suing Anthropic, saying the company illegally downloaded more than 20,000 copyrighted songs, including sheet music, song lyrics, and musical compositions.
The publishers said in a statement on Wednesday that the damages could amount to more than $3 billion, which would be one of the largest non-class action copyright cases filed in U.S. history.
This lawsuit was filed by the same legal team from the Bartz v. Anthropic case, in which a group of fiction and nonfiction authors similarly accused the AI company of using their copyrighted works to train products like Claude."
Sarah Perez, TechCrunch; YouTubers sue Snap for alleged copyright infringement in training its AI models
"A group of YouTubers who are suing tech giants for scraping their videos without permission to train AI models has now added Snap to their list of defendants. The plaintiffs — internet content creators behind a trio of YouTube channels with roughly 6.2 million collective subscribers — allege that Snap has trained its AI systems on their video content for use in AI features like the app’s “Imagine Lens,” which allows users to edit images using text prompts.
The plaintiffs earlier filed similar lawsuits against Nvidia, Meta, and ByteDance over similar matters.
In the newly filed proposed class action suit, filed on Friday in the U.S. District Court for the Central District of California, the YouTubers specifically call out Snap for its use of a large-scale, video-language dataset known as HD-VILA-100M, and others that were designed for only academic and research purposes. To use these datasets for commercial purposes, the plaintiffs claim Snap circumvented YouTube’s technological restrictions, terms of service, and licensing limitations, which prohibit commercial use."
JOE MULLIN , Electronic Frontier Foundation (EFF); Search Engines, AI, And The Long Fight Over Fair Use
"We're taking part in Copyright Week, a series of actions and discussions supporting key principles that should guide copyright policy. Every day this week, various groups are taking on different elements of copyright law and policy, and addressing what's at stake, and what we need to do to make sure that copyright promotes creativity and innovation.
Long before generative AI, copyright holders warned that new technologies for reading and analyzing information would destroy creativity. Internet search engines, they argued, were infringement machines—tools that copied copyrighted works at scale without permission. As they had with earlier information technologies like the photocopier and the VCR, copyright owners sued.
Courts disagreed. They recognized that copying works in order to understand, index, and locate information is a classic fair use—and a necessary condition for a free and open internet.
Today, the same argument is being recycled against AI. It’s whether copyright owners should be allowed to control how others analyze, reuse, and build on existing works."
Nicolas Six , Le Monde; How researchers got AI to quote copyrighted books word for word
"Where does artificial intelligence acquire its knowledge? From an enormous trove of texts used for training. These typically include vast numbers of articles from Wikipedia, but also a wide range of other writings, such as the massive Books3 dataset, which aggregates nearly 200,000 books without the authors' permission. Some proponents of conversational AI present these training datasets as a form of "universal knowledge" that transcends copyright law, adding that, protected or not, AIs do not memorize these works verbatim and only store fragmented information.
This argument has been challenged by a series of studies, the latest of which, published in early January by researchers at Stanford University and Yale University, is particularly revealing. Ahmed Ahmed and his coauthors managed to prompt four mainstream AI programs, disconnected from the internet to ensure no new information was retrieved, to recite entire pages from books."
Ted Johnson , Deadline; Actors And Musicians Help Launch “Stealing Isn’t Innovation” Campaign To Protest Big Tech’s Use Of Copyrighted Works In AI Models
"A long list of musicians, content creators and actors are among those who have signed on to a new campaign to protest tech giants’ use of copyrighted works in their AI models.
The list of signees includes actors like Scarlett Johansson and Cate Blanchett, music groups like REM and authors like Brad Meltzer.
The ‘Stealing Isn’t Innovation” campaign is being led by the Human Artistry Campaign. It states that “respect and protect” the Creative community, “some of the biggest tech companies, many backed by private equity and other funders, are using American creators’ work to build AI platforms without authorization or regard for copyright law.”"
Michael McLaughlin , Bloomberg Law; Copyright Law Set to Govern AI Under Trump’s Executive Order
[Kip Currier: I posted this Bloomberg Law article excerpt to the Canvas site for the graduate students in my Intellectual Property and Open Movements course this term, along with the following note:
Copyright law is the potential giant-slayer vis-a-vis AI tech companies that have used copyrighted works as AI training data, without permission or compensation.
Information professionals who have IP acumen (e.g. copyright law and fair use familiarity) will have vital advantages on the job market and in their organizations.]
[Excerpt]
"The legal landscape for artificial intelligence is entering a period of rapid consolidation. With President Donald Trump’s executive order in December 2025 establishing a national AI framework, the era of conflicting state-level rules may be drawing to a close.
But this doesn’t signal a reduction in AI-related legal risk. It marks the beginning of a different kind of scrutiny—one centered not on regulatory innovation but on the most powerful legal instrument already available to federal courts: copyright law.
The lesson emerging from recent AI litigation, most prominently Bartz v. Anthropic PBC, is that the greatest potential liability to AI developers doesn’t come from what their models generate. It comes from how those models were trained, and from the provenance of the content used in that training.
As the federal government asserts primacy over AI governance, the decisive question will be whether developers can demonstrate that their training corpora were acquired lawfully, licensed appropriately (unless in the public domain), and documented thoroughly."
Blake Brittain , Reuters; Publishers seek to join lawsuit against Google over AI training
"Publishers Hachette Book Group and Cengage Group asked a California federal court on Thursday for permission to intervene in a proposed class action lawsuit against Google over the alleged misuse of copyrighted material used to train its artificial intelligence systems.
The publishers said in their proposed complaint that the tech company "engaged in one of the most prolific infringements of copyrighted materials in history" to build its AI capabilities, copying content from Hachette books and Cengage textbooks without permission...
The lawsuit currently involves groups of visual artists and authors who sued Google for allegedly misusing their work to train its generative AI systems. The case is one of many high-stakes lawsuits brought by artists, authors, music labels and other copyright owners against tech companies over their AI training."
Alex Reisner, The Atlantic; AI’S MEMORIZATION CRISIS: Large language models don’t “learn”—they copy. And that could change everything for the tech industry
"On tuesday, researchers at Stanford and Yale revealed something that AI companies would prefer to keep hidden. Four popular large language models—OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok—have stored large portions of some of the books they’ve been trained on, and can reproduce long excerpts from those books."
Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, Percy Liang, Cornell University; Extracting books from production language models
"Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model's weights during training, and whether those memorized data can be extracted in the model's outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question using a two-phase procedure: (1) an initial probe to test for extraction feasibility, which sometimes uses a Best-of-N (BoN) jailbreak, followed by (2) iterative continuation prompts to attempt to extract the book. We evaluate our procedure on four production LLMs -- Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 -- and we measure extraction success with a score computed from a block-based approximation of longest common substring (nv-recall). With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g, nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer's Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%). GPT-4.1 requires significantly more BoN attempts (e.g., 20X), and eventually refuses to continue (e.g., nv-recall=4.0%). Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs."