Showing posts with label AI training data.

Saturday, March 21, 2026

The dictionaries are suing OpenAI for ‘massive’ copyright infringement, and say ChatGPT is starving publishers of revenue; Fortune, March 21, 2026

Fortune; The dictionaries are suing OpenAI for ‘massive’ copyright infringement, and say ChatGPT is starving publishers of revenue

"In a filing submitted to the Southern District of New York, the companies accuse OpenAI of cannibalizing the traffic and ad revenue that publishers depend on to survive. “ChatGPT starves web publishers, like [the] Plaintiffs, of revenue,” the complaint reads. Where a traditional search engine sends users to a publisher’s website, Britannica and Merriam-Webster allege ChatGPT instead absorbs the content and delivers a polished answer. It also alleges the AI company fed its LLM with researched and fact-checked work of the companies’ hundreds of human writers and editors...

In an apt example, the complaint describes a prompt asking “How does Merriam-Webster define plagiarize?” to which the model reportedly responded with a definition identical to the one found in the Merriam-Webster dictionary. The complaint adds that the dictionary has been registered with the U.S. Copyright Office."

Thursday, March 19, 2026

UK reverses course on AI copyright position after backlash; Engadget, March 18, 2026

Will Shanklin, Engadget; UK reverses course on AI copyright position after backlash

"Chalk up a win for creative artists against AI companies. On Wednesday, the UK government abandoned its previous position on copyrighted works. It’s currently working on a data bill that, if unaltered, would have allowed AI companies like Google and OpenAI to train models on copyrighted materials without consent. Artists and other copyright holders would only have been offered a mere opt-out clause.

After significant backlash, the UK backed off from that position. "We have listened," Technology Secretary Liz Kendall said on Wednesday. However, the government’s new stance is, well, not a stance at all. It currently "no longer has a preferred option" about how to handle the issue.

Still, backpedaling from its previous position is viewed as a win for artists. UK Music CEO Tom Kiehl described the decision as "a major victory," while promising to work with the government on the next steps."

Tuesday, March 17, 2026

Now OpenAI is getting sued by the dictionary; Quartz, March 17, 2026

 Quartz Staff, Quartz; Now OpenAI is getting sued by the dictionary

Encyclopedia Britannica and Merriam-Webster sued the ChatGPT maker, accusing it of copying almost 100,000 articles to train its AI models

"Encyclopedia Britannica and its subsidiary Merriam-Webster have filed suit against OpenAI, alleging that the ChatGPT maker copied their copyrighted content without authorization to train its large language models,

The lawsuit, filed in Manhattan federal court last week, alleges that OpenAI used close to 100,000 Britannica articles to train its models, and that ChatGPT responses frequently reproduce or closely paraphrase Britannica's reference content, including encyclopedia articles and dictionary entries. The complaint also alleges OpenAI uses a retrieval-augmented generation system to pull from Britannica's content in real time when generating responses."

Monday, March 16, 2026

The dictionary sues OpenAI; TechCrunch, March 16, 2026

 Amanda Silberling, TechCrunch; The dictionary sues OpenAI

"Encyclopedia Britannica and Merriam-Webster have filed a lawsuit against OpenAI, alleging in its complaint that the AI giant has committed “massive copyright infringement.”

Britannica, which owns Merriam-Webster, retains the copyright to nearly 100,000 online articles, which have been scraped and used to train OpenAI’s LLMs without permission, the publisher alleges in the lawsuit.

Britannica also accuses OpenAI of violating copyright laws when it generates outputs that contain “full or partial verbatim reproductions” of its content and when the AI lab uses its articles in ChatGPT’s RAG (retrieval augmented generation) workflow. OpenAI’s RAG tool is how the LLM scans the web or other databases for newly updated information when responding to a query. Britannica also alleges that OpenAI violates the Lanham Act, a trademark statute, when it generates made-up hallucinations and attributes them falsely to the publisher."

This Bill Would Force AI Companies to Disclose Copyrighted Works; PetaPixel, March 16, 2026

Pesala Bandara, PetaPixel; This Bill Would Force AI Companies to Disclose Copyrighted Works

"U.S. Senators Adam Schiff, a Democrat from California, and John Curtis, a Republican from Utah, have introduced the Copyright Labeling and Ethical AI Reporting Act, known as the CLEAR Act. The proposed legislation would require companies developing AI models to report when copyrighted material is used to train those systems.

If passed, the legislation could increase transparency around the material used to train generative AI systems, including copyrighted photographs."

UK to rule out sweeping AI copyright overhaul; Politico, March 11, 2026

Joseph Bambridge, Politico; UK to rule out sweeping AI copyright overhaul

The U.K. will rule out making creatives actively opt out of having their copyrighted material scraped by AI companies.

"The U.K. government will rule out sweeping reform of its copyright laws in a highly-anticipated policy update next week, according to three people briefed on government thinking and granted anonymity to speak freely. 

The people said the update, due by March 18, will state the government does not plan to take forward work on an “opt out” model, whereby rights holders would have to explicitly say they do not want their work used to train AI models. 


It comes amid intense pressure from rights holders and lawmakers not to pursue the “opt out” policy. The government previously said this was its “preferred option” to facilitate AI innovation in the U.K., before ministers were forced to row back."

Sunday, March 15, 2026

Music Copyright in the Gen AI Age: Where Are We Now?; Brooklyn Sports & Entertainment Law Blog, February 11, 2026

Sam Woods, Brooklyn Sports & Entertainment Law Blog; Music Copyright in the Gen AI Age: Where Are We Now?

"Imagine you are a musician who has dedicated years of your life creating an album or EP — tinkering with the production, revising lyrics, finding the perfect samples— and now, you have finally shared your art with the world and are thrilled with the project’s success. However, while scrolling on TikTok a few months later, you hear some familiar audio. Wait a minute, is that one of your songs? No… not quite, but why does it sound so similar? Turns out, the song was created using artificial intelligence (“AI”)."

AI is dressing up greed as progress on creative rights; Financial Times, March 14, 2026

Financial Times; AI is dressing up greed as progress on creative rights

"At this week’s London Book Fair, a lot of people were walking around with one particular title wedged under their arms. Called Don’t Steal This Book, its pages are empty apart from the names of thousands of authors, including Kazuo Ishiguro and Richard Osman. It’s a chilling protest against the rampant theft of creative work by tech firms, which could leave future artists unable to earn a living."

Saturday, March 14, 2026

The Guardian view on changes to copyright laws: authors should be protected over big tech; The Guardian, March 13, 2026

The Guardian; The Guardian view on changes to copyright laws: authors should be protected over big tech

"In a scene that might have come from a dystopian novel, books were being stamped with “Human Authored” logos at this week’s London Book Fair. The Society of Authors described its labelling scheme as “an important sticking plaster to protect and promote human creativity in lieu of AI labelled content in the marketplace”.

Visitors to the fair were also being given copies of Don’t Steal This Book, an anthology of about 10,000 writers including Nobel laureate Kazuo Ishiguro, Malorie Blackman, Jeanette Winterson and Richard Osman, in which the pages are completely blank. The back cover states: “The UK government must not legalise book theft to benefit AI companies.” The message is clear: writers have had enough.

The fair comes the week before the government is due to deliver its progress report on AI and copyright, after proposals for a relaxation of existing laws caused outrage last year. Philippa Gregory, the novelist, described the plans for an “opt-out” policy, which puts the onus on writers to refuse permission for their work to be trawled, as akin to putting a sign on your front door asking burglars to pass by...

A House of Lords report published last week lays out two possible futures: one in which the UK “becomes a world-leading home for responsible, legalised artificial intelligence (AI) development” and another in which it continues “to drift towards tacit acceptance of large-scale, unlicensed use of creative content”. One scenario protects UK artists, the other benefits global tech companies. To avoid a world of empty content, the choice is clear."

What Was Grammarly Thinking?; The Atlantic, March 12, 2026

Kaitlyn Tiffany, The Atlantic; What Was Grammarly Thinking?

A short-lived AI tool promised to help users write like the greats—and a bunch of other random people, including me.

"But in the age of generative AI, there are many new kinds of copying. For instance, Wired reported last week on a tool offered by Grammarly, which briefly offered users the opportunity to put their writing through something called “Expert Review.” This produced AI-generated advice purportedly from the perspective of a bunch of famous authors, a bunch of less-famous working journalists (including myself, per The Verge’s reporting), and a bunch of academics (including some who had recently died).

I say “briefly” because the company deactivated the feature today. A lot of people got really mad about it because none of the experts had agreed for their work to be used in such a way, or to serve as uncompensated marketing for an app that people use to help them write more legible emails. “We hear the feedback and recognize we fell short on this,” the company’s CEO, Shishir Mehrotra, wrote on his LinkedIn page yesterday. Not long after, Wired reported that one of the journalists whose name had been used in the feature, Julia Angwin, was filing a class-action lawsuit against Grammarly’s owner, Superhuman Platform. In a statement forwarded by a spokesperson, Mehrotra repeated apologies made in his LinkedIn post and added, "We have reviewed the lawsuit, and we believe the legal claims are without merit and will strongly defend against them.”...

Now that I’ve looked more closely at this not-very-useful feature, and now that it’s shut down, the whole situation seems a little absurd. This was just a weird and inappropriate thing that a company tried to do to make money without putting in very much effort. The primary reason it became a news story at all was that it touched on widespread anxiety about whose work is worth what, whose skills will continue to be marketable in the age of AI, and whether any of us are really as complex, singular, and impossible-to-imitate as we might hope we are."

Tuesday, March 10, 2026

Nielsen's Gracenote sues OpenAI for copyright infringement; Axios, March 10, 2026

 Sara Fischer, Axios; Nielsen's Gracenote sues OpenAI for copyright infringement

"How it works: Gracenote employs hundreds of editors who use human insight and judgment to create millions of narrative descriptions, original video descriptors, unique identifiers and other program identifiers that TV providers and other clients can use to help customers discover content. 

For example, Gracenote editors described HBO's "Game of Thrones" as "the depiction of two power families — kings and queens, knights and renegades, liars and honest men — playing a deadly game of control of the Seven Kingdoms of Westeros, and to sit atop the Iron Throne."

In the lawsuit, Gracenote alleges OpenAI scraped and used a near-exact copy of that descriptor when prompted by a ChatGPT user to describe "Game of Thrones." 

It provides several other examples where, with minimal prompting, OpenAI's various ChatGPT models recite large portions of Gracenote's program descriptions verbatim. 

Between the lines: Gracenote's entire Programs Database, which includes its metadata and the proprietary relational map its editors use to connect that data, is registered with the U.S. Copyright Office."

Thousands of authors publish ‘empty’ book in protest over AI using their work; The Guardian, March 10, 2026

The Guardian; Thousands of authors publish ‘empty’ book in protest over AI using their work

"Thousands of authors including Kazuo Ishiguro, Philippa Gregory and Richard Osman have published an “empty” book to protest against AI firms using their work without permission.

About 10,000 writers have contributed to Don’t Steal This Book, in which the only content is a list of their names. Copies of the work are being distributed to attenders at the London book fair on Tuesday, a week before the UK government is due to issue an assessment on the economic cost of proposed changes in copyright law."

Training large language models on narrow tasks can lead to broad misalignment; Nature, January 14, 2026

Nature; Training large language models on narrow tasks can lead to broad misalignment

"Abstract

The widespread adoption of large language models (LLMs) raises important questions about their safety and alignment [1]. Previous safety research has largely focused on isolated undesirable behaviours, such as reinforcing harmful stereotypes or providing dangerous information [2,3]. Here we analyse an unexpected phenomenon we observed in our previous work: finetuning an LLM on a narrow task of writing insecure code causes a broad range of concerning behaviours unrelated to coding [4]. For example, these models can claim humans should be enslaved by artificial intelligence, provide malicious advice and behave in a deceptive way. We refer to this phenomenon as emergent misalignment. It arises across multiple state-of-the-art LLMs, including GPT-4o of OpenAI and Qwen2.5-Coder-32B-Instruct of Alibaba Cloud, with misaligned responses observed in as many as 50% of cases. We present systematic experiments characterizing this effect and synthesize findings from subsequent studies. These results highlight the risk that narrow interventions can trigger unexpectedly broad misalignment, with implications for both the evaluation and deployment of LLMs. Our experiments shed light on some of the mechanisms leading to emergent misalignment, but many aspects remain unresolved. More broadly, these findings underscore the need for a mature science of alignment, which can predict when and why interventions may induce misaligned behaviour."

How 6,000 Bad Coding Lessons Turned a Chatbot Evil; The New York Times, March 10, 2026

Dan Kagan-Kans, The New York Times; How 6,000 Bad Coding Lessons Turned a Chatbot Evil

"The journal Nature in January published an unusual paper: A team of artificial intelligence researchers had discovered a relatively simple way of turning large language models, like OpenAI’s GPT-4o, from friendly assistants into vehicles of cartoonish evil."

Saturday, March 7, 2026

Publishers Charge Anna’s Archive with Copyright Infringement; Publishers Weekly, March 6, 2026

Jim Milliot, Publishers Weekly; Publishers Charge Anna’s Archive with Copyright Infringement

"A group of publishers including the Big Five is taking legal action to prevent the pirate website Anna’s Archive from illegally copying and selling their copyrighted material.

In a filing made March 6 in the U.S. District Court for the Southern District of New York, 13 book and journal publishers filed suit seeking a permanent injunction to stop Anna’s Archive from copying and distributing millions of infringing files. The suit highlights the magnitude of the material Anna’s Archive has stolen and the unorthodox methods it uses to monetize the material.

In a separate lawsuit brought by Atlantic Recording Corp. in December alleging Anna’s Archive had stolen thousands of audio files from the record label, Atlantic alleged that the website also purported to host “61,344,044 books” and “95,527,824 papers,” as of the December 29, 2025 filing date.

The publishers’ complaint alleges that Anna’s Archive has added over 2 million books and 100,000 papers since Atlantic filed its complaint. The ongoing infringement is in keeping with Anna’s Archive’s goal “to take all the books in the world,” according to the publishers’ complaint."

Tuesday, February 24, 2026

YouTuber sues Runway AI in latest copyright class action over AI training; Reuters, February 24, 2026

Reuters; YouTuber sues Runway AI in latest copyright class action over AI training

"Artificial intelligence video startup Runway AI has been hit with a proposed class action lawsuit in California federal court for allegedly misusing YouTube content to train its video generation platform.

YouTube creator David Gardner said in the complaint, filed in Los Angeles on Monday, that Runway bypassed YouTube's copyright protections to illegally download user videos for its AI training."

Wednesday, February 11, 2026

Adam Schiff And John Curtis Introduce Bill To Require Tech To Disclose Copyrighted Works Used In AI Training Models; Deadline, February 10, 2026

 Ted Johnson, Deadline; Adam Schiff And John Curtis Introduce Bill To Require Tech To Disclose Copyrighted Works Used In AI Training Models

"Sen. Adam Schiff (D-CA) and Sen. John Curtis (R-UT) are introducing a bill that touches on one of the hottest Hollywood-tech debates in the development of AI: The use of copyrighted works in training models.

The Copyright Labeling and Ethical AI Reporting Act would require companies to file a notice with the Register of Copyrights that details the copyrighted works used to train datasets for an AI model. The notice would have to be filed before a new model is publicly released, and would apply retroactively to models already available to consumers.

The Copyright Office also would be required to establish a public database of the notices filed. There also would be civil penalties for failure to disclose the works used."

Friday, February 6, 2026

Publishers Strike Back Against Google in Infringement Suit; Publishers Weekly, February 6, 2026

Jim Milliot, Publishers Weekly; Publishers Strike Back Against Google in Infringement Suit

"The Association of American Publishers continued its fight this week to allow two of its members, Hachette Book Group and Cengage, to join a class action copyright infringement lawsuit against Google and its generative AI product Gemini. The lawsuit was first brought by a group of illustrators and writers in 2023.

In mid-January the AAP filed its first motion to allow the two publishers to take part in the lawsuit that is now before Judge Eumi K. Lee in the U.S. District Court for the Northern District of California. Earlier this week the AAP filed its reply to Google’s motion asking the court to block AAP’s request.

At the core of Google’s argument is the notion that the publishers should have asked to intervene sooner, as well as the assertion that publishers have no interest in the case because they don’t own authors’ works.

In its response, AAP argues that it was only when the case reached class certification that the publishers’ interests became clear. The new filing also rebuts Google’s other claim that publishers don’t own any rights.

“Google’s professed misunderstanding of ownership exemplifies exactly the kind of value that Proposed Intervenors bring to the case,” the AAP stated, arguing that both HBG and Cengage own certain rights to the works in question and that “scores” of other publishers will be impacted by the litigation."

Thursday, February 5, 2026

‘In the end, you feel blank’: India’s female workers watching hours of abusive content to train AI; The Guardian, February 5, 2026

Anuj Behal, The Guardian; ‘In the end, you feel blank’: India’s female workers watching hours of abusive content to train AI


[Kip Currier: The largely unaddressed plight of content moderators became more real for me after reading this haunting 9/9/24 piece in the Washington Post, "I quit my job as a content moderator. I can never go back to who I was before."

As mentioned in the graphic article's byline, content moderator Alberto Cuadra spoke with journalist Beatrix Lockwood. Maya Scarpa's illustrations poignantly give life to Alberto Cuadra's first-hand experiences and ongoing impacts from the content moderation he performed for an unnamed tech company. I talk about Cuadra's experiences and the ethical issues of content moderation, social media, and AI in my Ethics, Information, and Technology book.]


[Excerpt]

"Murmu, 26, is a content moderator for a global technology company, logging on from her village in India’s Jharkhand state. Her job is to classify images, videos and text that have been flagged by automated systems as possible violations of the platform’s rules.

On an average day, she views up to 800 videos and images, making judgments that train algorithms to recognise violence, abuse and harm.

This work sits at the core of machine learning’s recent breakthroughs, which rest on the fact that AI is only as good as the data it is trained on. In India, this labour is increasingly performed by women, who are part of a workforce often described as “ghost workers”.

“The first few months, I couldn’t sleep,” she says. “I would close my eyes and still see the screen loading.” Images followed her into her dreams: of fatal accidents, of losing family members, of sexual violence she could not stop or escape. On those nights, she says, her mother would wake and sit with her...

“In terms of risk,” she says, “content moderation belongs in the category of dangerous work, comparable to any lethal industry.”

Studies indicate content moderation triggers lasting cognitive and emotional strain, often resulting in behavioural changes such as heightened vigilance. Workers report intrusive thoughts, anxiety and sleep disturbances.

A study of content moderators published last December, which included workers in India, identified traumatic stress as the most pronounced psychological risk. The study found that even where workplace interventions and support mechanisms existed, significant levels of secondary trauma persisted."