Showing posts with label AI training data. Show all posts

Tuesday, June 16, 2026

The Millions of Songs Mashed Into AI-Generated Music; The Atlantic, June 14, 2026

 Alex Reisner, The Atlantic; The Millions of Songs Mashed Into AI-Generated Music

"The actual recordings that go into any model are a closely guarded secret—AI companies have claimed they are proprietary—but the number of songs is almost certainly huge, spanning genres and time periods.

As part of my series of investigations into AI training data, I recently discovered four giant datasets of songs that are being shared within the AI-development community. One has 12 million tracks. Another has 9 million. The two smaller datasets each have more than 100,000. They include hits from major pop artists such as Bad Bunny, Nirvana, Taylor Swift, Billie Eilish, Pearl Jam, Elvis Costello, Sheryl Crow, and the Beatles. (The New Radicals’ “You Get What You Give” is in two of the datasets.) Jazz artists such as Miles Davis, John Zorn, and Vijay Iyer are featured, as are classical composers and tens of thousands of minor artists across genres. The 12-million-track dataset, on its own, would take 91 years to listen to...

In an attempt to prevent their products from generating songs that duplicate existing music, AI companies implement detection software. But neither Suno nor Udio prevents users from generating songs in the style of real artists. Earlier this year, Sony found 135,000 AI-generated tracks attributed to its artists on various streaming platforms. Although it’s not clear exactly which AI tools were used to generate those tracks, the technology is already harming artists’ ability to make a living from their music...

Musicians and labels have filed at least 12 lawsuits against AI companies for training models on copyrighted music. The music industry’s three major labels have sued both Suno and Udio, and others have sued Google, OpenAI, and smaller AI vendors. No rulings have been issued in these cases, but some of the labels have reached settlements with Suno and Udio...

On the Free Music Archive, the guitarist and singer Derek Clegg has been sharing his original, home-recorded songs for more than 15 years. Clegg told me he’s happy for people to put his music in the background of their personal videos, as long as they credit him. When people expect to make money from the use of his music, then they pay him for a license. More than 250 of Clegg’s songs are in the FMA dataset I found. I asked whether he would opt out of AI training if a mechanism for doing so existed. “Yeah, definitely,” he said.

What bothers Clegg most is that AI companies take people’s music without consent, and without acknowledging that their tech products are entirely dependent on musicians. “It just seems dishonest. It seems like theft,” he said. “There’s going to have to be a reckoning.” That’s his hope, anyway."

Thursday, June 11, 2026

AI company argues its use of scraped Westlaw legal data was transformative; Courthouse News Service, June 11, 2026

Courthouse News Service; AI company argues its use of scraped Westlaw legal data was transformative

"“Fair use ruling here brings into question the core technology of the AI revolution,” Mark S. Davies of White & Case in Washington, attorney for ROSS, argued...

“This is a copyright case,” he said. “It’s an interesting case, it raises lots of issues, but it’s a copyright case and the point of copyright is progress.”

“Copyright is not a privilege reserved for the well-behaved,” Davies added."

Thursday, June 4, 2026

How to share the AI windfall. Are taxes enough?; The Economist, May 14, 2026

The Economist; How to share the AI windfall. Are taxes enough?

"Should artificial intelligence cause mass unemployment, workers will not be thrilled. But neither will the taxman, even if he hasn’t been automated. For most of the past century, rich countries have had simple rules for sharing prosperity: raise money mostly by taxing work and consumption, sprinkle in some borrowing and hand out the proceeds. That model may collapse if AI advances as quickly as its boosters suggest. Hence, many say, a new approach is needed, in which government makes its money primarily from the new technology."

Thursday, May 28, 2026

CNN Sues AI Firm Perplexity For Copyright Infringement; Deadline, May 28, 2026

  Jill Goldsmith, Deadline; CNN Sues AI Firm Perplexity For Copyright Infringement

"CNN is the latest to sue Perplexity for copyright infringement, alleging the AI firm “has unlawfully copied over 10,000 CNN stories, videos, images, and other works to power its products and tools.”

The suit said the two sides tried but failed to reach an agreement in 2025 and Perplexity continued ripping off CNN content and claiming a relationship with the news network that does not exist despite repeated warnings that the moves are illegal."

Tuesday, May 19, 2026

A 16th-Century Sketch Claims to Depict Anne Boleyn. A.I. Says It’s Her Mom.; The New York Times, May 19, 2026

The New York Times; A 16th-Century Sketch Claims to Depict Anne Boleyn. A.I. Says It’s Her Mom.

Using facial-recognition technology, scholars have concluded that a 500-year-old drawing labeled “Anna Bollein Queen” more likely showed her mother, Elizabeth Howard.

"To dig into this mystery, Ms. Davies and her colleagues, including David G. Stork, a computer scientist and electrical engineer at Stanford University, turned to computational facial recognition. “This has one foot in art history and one foot in computer science,” Dr. Stork said...

Amit Roy-Chowdhury, a computer vision scientist at the University of California, Riverside, who was not involved in the research, said that facial recognition can play an important role in art history. But going forward, he added, it will be important to assemble larger training data sets of faces in artwork, as algorithms trained strictly on photographs can introduce uncertainties. And that could be a challenge, since there are millions of faces in photographs but far fewer in art. “For artwork, you don’t have that many examples,” Dr. Roy-Chowdhury said."

Sunday, May 17, 2026

How ‘learnrights’ would compensate creators for AI model training; MIT Sloan, May 12, 2026

 Brian Eastwood, MIT Sloan; How ‘learnrights’ would compensate creators for AI model training

"Human content creators are protected by copyright law, in part to ensure that they’re fairly compensated for their work. 

But whether these laws allow artificial intelligence models to learn from human-created content is up for debate — both in court and on Capitol Hill. Encyclopedia Britannica’s lawsuit against OpenAI, for example, is one of the latest allegations of misuse of reference materials. Meanwhile, the U.S. Copyright Office has not made a binding determination about whether using copyrighted works to train AI models is fair use.  

To deal with these issues, in 2023 MIT Sloan School of Management professor Thomas Malone proposed “learnright” laws that would give copyright holders the exclusive right to license their content to AI companies for model training. 

“Copyright law wasn’t designed for a world with generative AI, and without something like learnright laws, the incentives for people to create new content are likely to be greatly reduced,” said Malone, who is also the director of the MIT Center for Collective Intelligence.

In a more recent article, Malone and co-authors Frank Pasquale of Cornell Law School and Andrew Ting of George Washington University Law School outlined the argument for learnrights and described how they could work legally, economically, and practically...

Malone and his co-authors presented three arguments that support compensating copyright holders whose work is used to train generative AI. 

If AI models produce high-quality content quickly and cheaply without compensating the original creators of this content, that will decrease creators’ motivation to produce new content and thus reduce the volume of original work available to further improve AI models. “It would be unwise to risk such a decline in incentives for human expression,” the researchers write.

The researchers find it “troubling” that for-profit AI companies cry foul when others use their intellectual property — as was the case when U.S.-based AI firms accused China’s DeepSeek of stealing from them — given that the same companies use copyrighted content without compensating its creators. 

Properly acknowledging how other works influenced one’s own is the right thing to do and the foundation of a thoughtful creative process, the researchers write. Conversely, uncredited and uncompensated use of others’ work falls short of ethical standards and undermines what IP protection is supposed to mean."

Saturday, May 16, 2026

Anthropic’s $1.5B copyright settlement is getting messy as judge delays approval; Ars Technica, May 15, 2026

Ashley Belanger, Ars Technica; Anthropic’s $1.5B copyright settlement is getting messy as judge delays approval

"After several authors and class members raised objections to Anthropic’s $1.5 billion settlement over its widespread book piracy to train AI, a federal judge has delayed final approvals of the settlement.

On Thursday, US District Judge Araceli Martinez-Olguin declined to rubber-stamp what’s regarded as the largest copyright settlement in US history. Instead, she wanted to better understand why some class members were objecting and opting out of the settlement. So, she asked authors to address key concerns of objectors, who argued that lawyers’ compensation was way too high and payments to class members were a “pittance.”...

Objectors may not win every fight, but they have seemingly persuaded the court to at least entertain their strongly worded pleas, including warnings that the settlement may not survive an appeal if the terms aren’t re-examined. Notably, their objections came shortly before a group of 25 class members opting out of the settlement filed a new lawsuit, showing that Anthropic is not done fighting these claims.

“For the Court to agree that counsel’s request of nearly a third of a billion dollars, while individual plaintiffs settle for a pittance of available compensation and no protections against future abuse is an aberration of civil justice and a slap in the face to all those who labored to publish their works,” Story said. “Such a decision would also further the too-often-observed stereotype that … class-action Plaintiffs are merely tools used to obtain Powerball-size payouts to attorneys.”

Judge William Alsup, who initially approved the settlement but has since retired, also questioned whether the lawyers’ fees were too high. Worried that the settlement was being “shoved down the throat of authors,” he recommended an independent investigation to ensure no improper attorneys’ fees would be granted, but according to Lea Bishop, a non-class member objector and professor of copyright law, the recommendation “was not squarely disclosed to incoming Judge Martinez-Olguin” in a status report submitted by authors’ lawyers. Additionally, class members weren’t notified of the investigation.

Authors must respond to objections raised by May 21, when Anthropic will also have to file a brief explaining “why late opt outs should not be honored,” the judge ordered."

Friday, May 15, 2026

Authors, publishers near final approval of $1.5 billion Anthropic copyright settlement; Courthouse News Service, May 14, 2026

Courthouse News Service; Authors, publishers near final approval of $1.5 billion Anthropic copyright settlement

"Judge Araceli Martínez-Olguín, a Joe Biden appointee, allowed objectors to address the court, where several spoke about the concerns they had with how the plaintiffs put together the eligible works list.

One class member told the judge that the works list undercounts the number of eligible works in the class by treating each copyright registration number as a single work, regardless of how many books are covered by the registration. The class member explained that she has certain group copyright registration numbers that include 40 separate, independently published novels under one registration number, all of which were downloaded by Anthropic without permission. However, under the current terms of the settlement agreement, the novels would be considered just one claimable work.

Another class member spoke to the exclusion of works that were published under a pseudonym, disadvantageous to small publishers and self-published authors in the class, while a third said they believed a one-time payment was not enough because Anthropic was continuing to profit off the copyrighted work they stole.

James H. Bartolomei III of Duncan Firm, an attorney representing four other objectors, asked the court to reopen the opt-out period as certain key documents from the case were only uploaded to the settlement website recently.

“Nothing I am asking for takes a dollar away from any class member who filed a claim. I’m not asking the court to stop the settlement from ever being approved. Just for sufficient information to make an informed choice,” he said."

What really won the trillion-dollar Supreme Court case; TED Talks, April 2026

Neal Kumar Katyal, TED Talks; What really won the trillion-dollar Supreme Court case

"In November 2025, Neal Kumar Katyal was asked to do what no US Supreme Court litigator had ever done: convince the justices to strike down a sitting president's signature initiative. After enlisting the help of four unlikely coaches — and one secret weapon he hasn't told anyone about until now — he walked into the courtroom ready for anything. What he discovered about winning and connecting might just change how you think about performing under pressure."

Neal Katyal draws criticism over TED Talk revealing AI use in SCOTUS tariffs case; ABA Journal, May 11, 2026

Amanda Robert, ABA Journal; Neal Katyal draws criticism over TED Talk revealing AI use in SCOTUS tariffs case

"Attorney Neal Katyal revealed last week that he used artificial intelligence to prepare for his argument against President Donald Trump’s tariffs, drawing swift criticism online. 

Katyal, a partner in the Washington, D.C. office of Milbank, argued the case before the U.S. Supreme Court in November. According to Bloomberg Law, he said during a TED Talk released Thursday that he “won” using a “bespoke AI system” trained on 25 years of justices’ questions during oral argument and their eventual opinions.

The system was built by Harvey AI, which “predicted many of the questions the justices asked—sometimes almost word for word,” Katyal said in an X post promoting the TED Talk. Katyal, a former acting solicitor general who has argued dozens of cases before the Supreme Court, also credited mindset, improv and meditation coaches for helping him prepare for the argument."

Thursday, May 14, 2026

Senators Defend Copyright Office Independence as AI and Executive Overreach Dominate Oversight Hearing; IP Watchdog, May 13, 2026

Rose Esfandiari, IP Watchdog; Senators Defend Copyright Office Independence as AI and Executive Overreach Dominate Oversight Hearing

"Defending the Legislative Branch

The tension surrounding the Trump v. Perlmutter case surfaced during questioning. Senator Mazie Hirono (D-HI) directly addressed the controversy, noting that while Perlmutter could not discuss pending litigation, she wanted to understand the historical value of the Copyright Office remaining within the legislative branch. Hirono referenced the fact that “President Trump tried to illegally fire you.”

Perlmutter responded carefully, highlighting the immense value of the Copyright Office acting as non-partisan expert advising Congress. She noted the Library of Congress serves as a natural home for the office given their overlapping missions, cautioning that moving the office to the executive branch would inevitably result in additional costs and disruption.

Senator Alex Padilla (D-CA), speaking as the ranking member of the Rules Committee, defended the agency’s independence. He reminded the subcommittee that Trump had not only attempted to fire Perlmutter but had also fired Librarian of Congress Carla Hayden, attempting to install his own Deputy Attorney General, Todd Blanche, in her place. Padilla characterized this as a failed “power grab” and a “clear assault” on the legislative branch. He emphasized that as Congress considers legislation to change appointment structures, it must ensure the Copyright Office remains protected from political interference.

Artificial Intelligence Challenge

Chairman Thom Tillis (R-NC) emphasized the delicate balance required in the artificial intelligence environment, as “there would not be anything to ingest for the training of AI models if it had not been for copyright law, which has encouraged the creation of content…and while there’s no question that the U.S. is in an AI race with China, the U.S. should not be in a race to the bottom.”"

Friday, May 8, 2026

Meta’s AI Copyright Fight Just Escalated and Hollywood Is Watching Closely; Los Angeles Magazine, May 7, 2026

Los Angeles Magazine; Meta’s AI Copyright Fight Just Escalated and Hollywood Is Watching Closely

A new lawsuit against Mark Zuckerberg and Meta could reshape how studios, publishers and tech companies train the next generation of artificial intelligence

"The AI Gold Rush Is Running Into Copyright Law

According to the lawsuit filed in Manhattan federal court, Meta allegedly pulled material from massive libraries of pirated books and scraped internet content to train Llama, the company’s flagship large language model. Publishers argue the practice amounts to one of the largest copyright violations in modern history."

Mark Zuckerberg ‘personally authorized’ Meta’s copyright infringement, publishers allege; AP, May 5, 2026

Hillel Italie, AP; Mark Zuckerberg ‘personally authorized’ Meta’s copyright infringement, publishers allege

"The plaintiffs allege that Zuckerberg and Meta “followed their well-known motto ‘move fast and break things’” by illegally drawing upon a massive trove of books and journal articles for Llama."

Thursday, May 7, 2026

Anthropic owes authors $1.5B for pirating work — but the claims process is a Kafkaesque mess; Vox, May 6, 2026

Constance Grady, Vox; Anthropic owes authors $1.5B for pirating work — but the claims process is a Kafkaesque mess

Scott Turow's latest real-life legal thriller: Suing Meta for copyright infringement; NPR, May 5, 2026

NPR; Scott Turow's latest real-life legal thriller: Suing Meta for copyright infringement

""All Americans should understand that the bold future promised by A.I., has been, to paraphrase the investigative writer Alex Reisner, created with stolen words," said Turow in a statement to NPR. "It is all the more shameful that these violations of the law were undertaken by one of the richest corporations in the world."

According to the complaint, Meta "briefly considered licensing deals with major publishers" but changed its strategy in April 2023. The question of whether to license or pirate moving forward was "escalated" to Zuckerberg, after which, the complaint alleges, Meta's business development team received verbal instructions to stop licensing efforts. "If we license once [sic] single book, we won't be able to lean into the fair use strategy," a Meta employee is quoted as saying in the complaint.

"It's the most flagrant copyright breach in history," said Authors Guild CEO Mary Rasenberger in a statement to NPR. "And these voracious tech companies need to be held accountable.""

Wednesday, May 6, 2026

Publishers sue Meta, claiming it violated copyrights in training AI with their books; The Washington Post, May 5, 2026

The Washington Post; Publishers sue Meta, claiming it violated copyrights in training AI with their books

"The case, filed in the U.S. District Court for the Southern District of New York, is the latest in a string of lawsuits brought by publishers, authors, artists, photographers and news outlets aimed at forcing tech companies to compensate them for using their works to train their AI models. The plaintiffs argue in the lawsuit that the AI model’s ability to quickly produce knockoffs and summaries of copyrighted books threatens the livelihoods of publishers and authors.

A Meta spokesperson said in a statement that the company would “fight this lawsuit aggressively.”

“AI is powering transformative innovations, productivity and creativity for individuals and companies, and courts have rightly found that training AI on copyrighted material can qualify as fair use,” the spokesperson said.

The publishers’ complaint states Meta distributed millions of copyrighted works without authorization and without compensating authors or publishers, claiming that Zuckerberg “personally authorized and actively encouraged the infringement.” They also claim that Meta removed copyright notices and copyright management information from the works used to train the AI model, known as Llama."

Even More Authors, Publishers Sue Meta Over Copyright in AI Training: What's Different Now; CNET, May 5, 2026

Katelyn Chedraoui, CNET; Even More Authors, Publishers Sue Meta Over Copyright in AI Training: What's Different Now

Meta won a previous AI lawsuit brought by authors. Publishers are taking a different route this time.

"New lawsuit, same questions

Copyright is one of the most contentious legal issues around AI. Tech companies like Meta need high-quality, human-created data to build and refine their AI models. Nearly all of this material is protected by copyright. That means tech companies have to enter into licensing agreements or defend their use of the content as fair use under a provision of copyright law.

Meta and Anthropic have both won previous cases in lawsuits brought by authors, successfully defending their fair use. Anthropic agreed to settle some piracy claims with authors for $1.5 billion, or about $3,000 per pirated work. Both judges warned in their decisions that this won't be the result in every lawsuit...

One of the biggest considerations in these cases is whether tech companies' use of copyrighted books will make it harder for human authors to sell their work or otherwise affect the marketplace."

Tuesday, May 5, 2026

Intellectual Property and Brainpower Versus AI in Academic Publish; Academe Magazine, AAUP, Spring 2026

Kelly Hand, Academe Magazine, AAUP; Intellectual Property and Brainpower Versus AI in Academic Publish

"The concept of transformation is central to US copyright law—which privileges “transformative” uses of copyrighted material in evaluating “fair use”—and emerging case law on AI. It’s worth thinking about what kind of transformation we value as human readers and writers and as beneficiaries of published academic research—particularly as we reckon with piracy in the training of LLMs and the unchecked growth of the AI industry. Considerations about how academic publications enable AI’s transformative processes extend beyond concerns about emotional authenticity important in creative writing to those about intellectual integrity and factual accuracy. 

Authors, editors, and publishers will need to make consequential IP decisions—including those about settlements in lawsuits over AI piracy, invitations to enter into licensing agreements with AI companies seeking to avoid future lawsuits, and editorial policies and guidelines to prevent the misuse of AI in academic research and writing. Some individuals and organizations, including scholarly publications and presses, will encounter opportunities to “cash in.” However, their relatively modest financial gains facilitate the disproportionate enrichment of AI companies that use copyrighted material for training LLMs. Even if that use is transformative in the strict legal sense, it fails to effect the kind of transformation that depends on the uniquely human capacities for thinking, feeling, and complex analysis. Academic journals and university presses must also protect IP—by upholding ethical standards and principles of copyright law—and commit to publishing human-authored works."

Major publishers sued Meta for pirating millions of books to train its AI; Quartz, May 5, 2026

Cris Tolomia, Quartz; Major publishers sued Meta for pirating millions of books to train its AI

"Five major publishers and best-selling novelist Scott Turow filed a class-action copyright infringement lawsuit against Meta and its CEO Mark Zuckerberg on Tuesday, alleging the company pirated millions of books and journal articles to train its Llama artificial intelligence models."

Thursday, April 30, 2026

The Secret Weapon Against AI Dominance; The Atlantic, April 30, 2026

 Jacob Noti-Victor and Xiyin Tang, The Atlantic; The Secret Weapon Against AI Dominance

"More than 90 lawsuits have been filed by creators against AI companies for copyright infringement. Authors, musicians, visual artists, and news publishers have all accused firms such as OpenAI, Meta, and Anthropic of using their copyrighted works to train AI models without permission. (The Atlantic is involved in one such lawsuit, against the AI firm Cohere.) These cases are frequently framed as the defining fight over the future of creative labor and the entertainment industry as a whole. As one of these lawsuits put it, artists are seeking to end “infringement of their rights before their professions are eliminated by a computer program powered entirely by their hard work.”

But the future of creative labor will more likely be decided through a different question within copyright law, one that has received far less attention: To what extent should AI-generated works receive copyright protection at all? In a 2024 case, Thaler v. Perlmutter, the Court of Appeals for the District of Columbia held that a work generated autonomously by an AI system cannot be protected by copyright, because copyright requires a human “author.” The Supreme Court declined to review that decision in March. With the lower-court decision left in place, the question now becomes how much AI content can be incorporated into a work before it becomes mostly or totally uncopyrightable; courts have not yet weighed in on this but may soon.

The Thaler decision (and any future decisions that refine it) will have major economic consequences for the creative industries and the workers they employ."