Ethics, Info, Tech: Contested Voices, Values, Spaces: AI training data

Showing posts with label AI training data. Show all posts

Thursday, July 17, 2025

The Art (and Legality) of Imitation: Navigating the Murky Waters of Fair Use in AI Training The National Law Review, July 16, 2025

Sarah C. Reasoner, Ashley N. Higginson, Anita C. Marinelli, Kimberly A. Berger of Miller Canfield - Miller Canfield Resources, The National Law Review; The Art (and Legality) of Imitation: Navigating the Murky Waters of Fair Use in AI Training

"The legal landscape for artificial intelligence is still developing, and no outcome can yet be predicted with any sort of accuracy. While some courts appear poised to accept AI model training as transformative, other courts do not. As AI technology continues to advance, the legal system must adapt to address the unique challenges it presents. Meanwhile, businesses and creators navigating this uncertain terrain should stay informed about legal developments and consider proactive measures to mitigate risks. As we await further rulings and potential legislative action, one thing is clear: the conversation around AI and existing intellectual property protection is just beginning."

Wednesday, July 16, 2025

Can Gen AI and Copyright Coexist?; Harvard Business Review, July 16, 2025

Michael D. Smith and Rahul Telang, Harvard Business Review ; Can Gen AI and Copyright Coexist?

"We’re experts in the study of digital transformation and have given this issue a lot of thought. We recently served, for example, on a roundtable of 10 economists convened by the U.S. Copyright Office to study the implications of gen AI on copyright policy. We recognize that the two decisions are far from the last word on this topic; both will no doubt be appealed to the Ninth Circuit and then subsequently to the Supreme Court. But in the meantime, we believe there are already many lessons to be learned from these decisions about the implications of gen AI for business—lessons that will be useful for leaders in both the creative industries and gen AI companies."

Wednesday, July 9, 2025

Viewpoint: Don’t let America’s copyright crackdown hand China global AI leadership; Grand Forks Herald, July 5, 2025

Kent Conrad and Saxby Chambliss , Grand Forks Herald; Viewpoint: Don’t let America’s copyright crackdown hand China global AI leadership

[Kip Currier: The assertion by anti-AI regulation proponents, like the former U.S. congressional authors of this think-piece, that requiring AI tech companies to secure permission and pay for AI training data will kill or hobble U.S. AI entrepreneurship is hyperbolic catastrophizing. AI tech companies can license training data from creators who are willing to participate in licensing frameworks. Such frameworks already exist for music copyrights, for example. AI tech companies just don't want to pay for something if they can get it for free.

AI tech companies would never permit users to scrape up, package, and sell their IP content for free. Copyright holders shouldn't be held to a different standard and be required to let tech companies monetize their IP-protected works without permission and compensation.]

Excerpt]

"If these lawsuits succeed, or if Congress radically rewrites the law, it will become nearly impossible for startups, universities or mid-size firms to develop competitive AI tools."

Why the new rulings on AI copyright might actually be good news for publishers; Fast Company, July 9, 2025

PETE PACHAL, Fast Company; Why the new rulings on AI copyright might actually be good news for publishers

"The outcomes of both cases were more mixed than the headlines suggest, and they are also deeply instructive. Far from closing the door on copyright holders, they point to places where litigants might find a key...

Taken together, the three cases point to a clearer path forward for publishers building copyright cases against Big AI:
Focus on outputs instead of inputs: It’s not enough that someone hoovered up your work. To build a solid case, you need to show that what the AI company did with it reproduced it in some form. So far, no court has definitively decided whether AI outputs are meaningfully different enough to count as “transformative” in the eyes of copyright law, but it should be noted that courts have ruled in the past that copyright violation can occur even when small parts of the work are copied—ifthose parts represent the “heart” of the original.
Show market harm: This looks increasingly like the main battle. Now that we have a lot of data on how AI search engines and chatbots—which, to be clear, are outputs—are affecting the online behavior of news consumers, the case that an AI service harms the media market is easier to make than it was a year ago. In addition, the emergence of licensing deals between publishers and AI companies is evidence that there’s market harm by creating outputs without offering such a deal.
Question source legitimacy: Was the content legally acquired or pirated? The Anthropic case opens this up as a possible attack vector for publishers. If they can prove scraping occurred through paywalls—without subscribing first—that could be a violation even absent any outputs."

Saturday, July 5, 2025

Two Courts Rule On Generative AI and Fair Use — One Gets It Right; Electronic Frontier Foundation (EFF), June 26, 2025

TORI NOBLE, Electronic Frontier Foundation (EFF); Two Courts Rule On Generative AI and Fair Use — One Gets It Right

"Gen-AI is spurring the kind of tech panics we’ve seen before; then, as now, thoughtful fair use opinions helped ensure that copyright law served innovation and creativity. Gen-AI does raise a host of other serious concerns about fair labor practices and misinformation, but copyright wasn’t designed to address those problems. Trying to force copyright law to play those roles only hurts important and legal uses of this technology.

In keeping with that tradition, courts deciding fair use in other AI copyright cases should look to Bartz, not Kadrey."

Thursday, July 3, 2025

Cloudflare Sidesteps Copyright Issues, Blocking AI Scrapers By Default; Forbes, July 2, 2025

Emma Woollacott , Forbes; Cloudflare Sidesteps Copyright Issues, Blocking AI Scrapers By Default

"IT service management company Cloudflare is striking back on behalf of content creators, blocking AI scrapers by default.

Web scrapers are bots that crawl the internet, collecting and cataloguing content of all types, and are used by AI firms to collect material that can be used to train their models.

Now, though, Cloudflare is allowing website owners to choose if they want AI crawlers to access their content, and decide how the AI companies can use it. They can opt to allow crawlers for certain purposes—search, for example—but block others. AI companies will have to obtain explicit permission from a website before scraping."

Wednesday, July 2, 2025

Fair Use or Foul Play? The AI Fair Use Copyright Line; The National Law Review, July 2, 2025

Jodi Benassi of McDermott Will & Emery , The National Law Review; Fair Use or Foul Play? The AI Fair Use Copyright Line

"Practice note: This is the first federal court decision analyzing the defense of fair use of copyrighted material to train generative AI. Two days after this decision issued, another Northern District of California judge ruled in Kadrey et al. v. Meta Platforms Inc. et al., Case No. 3:23-cv-03417, and concluded that the AI technology at issue in his case was transformative. However, the basis for his ruling in favor of Meta on the question of fair use was not transformation, but the plaintiffs’ failure “to present meaningful evidence that Meta’s use of their works to create [a generative AI engine] impacted the market” for the books."

Eminem, AI and me: why artists need new laws in the digital age; The Guardian, July 2, 2025

Alexander Hurst , The Guardian; Eminem, AI and me: why artists need new laws in the digital age

"Song lyrics, my publisher informs me, are subject to notoriously strict copyright enforcement and the cost to buy the rights is often astronomical. Fat chance as well, then, of me quoting Eminem to talk about how Lose Yourself seeped into the psyche of a generation when he rapped: “You only get one shot, do not miss your chance to blow, this opportunity comes once in a lifetime.”

Oh would it be different if I were an AI company with a large language model (LLM), though. I could scrape from the complete discography of the National and Eminem, and the lyrics of every other song ever written. Then, when a user prompted something like, “write a rap in the style of Eminem about losing money, and draw inspiration from the National’s Bloodbuzz Ohio”, my word correlation program – with hundreds of millions of paying customers and a market capitalisation worth tens if not hundreds of billions of dollars – could answer:

“I still owe money to the money to the money I owe,

But I spit gold out my throat when I flow,

So go tell the bank they can take what they like

I already gave my soul to the mic.”

And that, according to rulings last month by the US courts, is somehow “fair use” and is perplexingly not copyright infringement at all, despite no royalties having been paid to anyone in the process."

Tuesday, July 1, 2025

The Court Battles That Will Decide if Silicon Valley Can Plunder Your Work; Slate, June 30, 2025

BY NITISH PAHWA , SLATE; The Court Battles That Will Decide if Silicon Valley Can Plunder Your Work

"Last week, two different federal judges in the Northern District of California made legal rulings that attempt to resolve one of the knottiest debates in the artificial intelligence world: whether it’s a copyright violation for Big Tech firms to use published books for training generative bots like ChatGPT. Unfortunately for the many authors who’ve brought lawsuits with this argument, neither decision favors their case—at least, not for now. And that means creators in all fields may not be able to stop A.I. companies from using their work however they please...

What if these copyright battles are also lost? Then there will be little in the way of stopping A.I. startups from utilizing all creative works for their own purposes, with no consideration as to the artists and writers who actually put in the work. And we will have a world blessed less with human creativity than one overrun by second-rate slop that crushes the careers of the people whose imaginations made that A.I. so potent to begin with."

Hollywood Confronts AI Copyright Chaos in Washington, Courts; The Wall Street Journal, July 1, 2025

Amrith Ramkumar, Jessica Toonkel, The Wall Street Journal; Hollywood Confronts AI Copyright Chaos in Washington, Courts

Technology firms say using copyrighted materials to train AI models is key to America’s success; creatives want their work protected

Sunday, June 29, 2025

An AI firm won a lawsuit for copyright infringement — but may face a huge bill for piracy; Los Angeles Times, June 27, 2025

Michael Hiltzik , Los Angeles Times; An AI firm won a lawsuit for copyright infringement — but may face a huge bill for piracy

[Kip Currier: Excellent informative overview of some of the principal issues, players, stakes, and recent decisions in the ongoing AI copyright legal battles. Definitely worth 5-10 minutes of your time to read and reflect on.

A key take-away, derived from Judge Vince Chhabria's decision in last week's Meta win, is that:

Artists and authors can win their copyright infringement cases if they produce evidence showing the bots are affecting their market. Chhabria all but pleaded for the plaintiffs to bring some such evidence before him:
“It’s hard to imagine that it can be fair use to use copyrighted books...to make billions or trillions of dollars while enabling the creation of a potentially endless stream of competing works that could significantly harm the market for those books.”
But “the plaintiffs never so much as mentioned it,” he lamented.
https://www.latimes.com/business/story/2025-06-27/an-ai-firm-won-a-lawsuit-over-copyright-infringement-but-may-face-a-huge-bill-for-piracy]

[Excerpt]

"Anthropic had to acknowledge a troubling qualification in Alsup’s order, however. Although he found for the company on the copyright issue, he also noted that it had downloaded copies of more than 7 million books from online “shadow libraries,” which included countless copyrighted works, without permission.

That action was “inherently, irredeemably infringing,” Alsup concluded. “We will have a trial on the pirated copies...and the resulting damages,” he advised Anthropic ominously: Piracy on that scale could expose the company to judgments worth untold millions of dollars...

“Neither case is going to be the last word” in the battle between copyright holders and AI developers, says Aaron Moss, a Los Angeles attorney specializing in copyright law. With more than 40 lawsuits on court dockets around the country, he told me, “it’s too early to declare that either side is going to win the ultimate battle.”...

With billions of dollars, even trillions, at stake for AI developers and the artistic community at stake, no one expects the law to be resolved until the issue reaches the Supreme Court, presumably years from now...

But Anthropic also downloaded copies of more than 7 million books from online “shadow libraries,” which include untold copyrighted works without permission.

Alsup wrote that Anthropic “could have purchased books, but it preferred to steal them to avoid ‘legal/practice/business slog,’” Alsup wrote. (He was quoting Anthropic co-founder and CEO Dario Amodei.)...

Artists and authors can win their copyright infringement cases if they produce evidence showing the bots are affecting their market."...

The truth is that the AI camp is just trying to get out of paying for something instead of getting it for free. Never mind the trillions of dollars in revenue they say they expect over the next decade — they claim that licensing will be so expensive it will stop the march of this supposedly historic technology dead in its tracks.

Chhabria aptly called this argument “nonsense.” If using books for training is as valuable as the AI firms say they are, he noted, then surely a market for book licensing will emerge. That is, it will — if the courts don’t give the firms the right to use stolen works without compensation."

Saturday, June 28, 2025

The Anthropic Copyright Ruling Exposes Blind Spots on AI; Bloomberg, June 26, 2025

Dave Lee , Bloomberg; The Anthropic Copyright Ruling Exposes Blind Spots on AI

[Kip Currier: It's still early days in the AI copyright legal battles underway between AI tech companies and everyone else whose training data was "scarfed up" to enable the former to create lucrative AI tools and products. But cases like this week's Anthropic lawsuit win and another suit won by Meta (with some issues still to be adjudicated regarding the use of pirated materials as AI training data) are finally now giving us some more discernible "tea leaves" and "black letter law" as to how courts are likely to rule vis-a-vis AI inputs.

This week being the much ballyhooed 50th anniversary of the so-called "1st summer blockbuster flick" Jaws ("you're gonna need a bigger boat"), these rulings make me think we the public may need a bigger copyright law schema that sets out protections for the creatives making the fuel that enables stratospherically profitable AI innovations. The Jaws metaphor may be a bit on-the-nose, but one can't help but view AI tech companies akin to rapacious sharks that are imperiling the financial survival and long-standing business models of human creators.

As touched on in this Bloomberg article, too, there's a moral argument that what AI tech folks have done with the uncompensated use of creative works, without permission, doesn't mean that it's ethically justifiable simply because a court may say it's legal. Or that these companies shouldn't be required by updated federal copyright legislation and licensing frameworks to fairly compensate creators for the use of their copyrighted works. After all, billionaire tech oligarchs like Zuckerberg, Musk, and Altman would never allow others to do to them what they've done to creatives with impunity and zero contrition.

Are you listening, Congress?

Or are all of you in the pockets of AI tech company lobbyists, rather than representing the needs and interests of all of your constituents and not just the billionaire class.]

[Excerpt]

"In what is shaping up to be a long, hard fight over the use of creative works, round one has gone to the AI makers. In the first such US decision of its kind, District Judge William Alsup said Anthropic’s use of millions of books to train its artificial-intelligence model, without payment to the sources, was legal under copyright law because it was “transformative — spectacularly so.”...

If a precedent has been set, as several observers believe, it stands to cripple one of the few possible AI monetization strategies for rights holders, which is to sell licenses to firms for access to their work. Some of these deals have already been made while the “fair use” question has been in limbo, deals that emerged only after the threat of legal action. This ruling may have just taken future deals off the table...

Alsup was right when he wrote that “the technology at issue was among the most transformative many of us will see in our lifetimes.”...

But that doesn’t mean it shouldn’t pay its way. Nobody would dare suggest Nvidia Corp. CEO Jensen Huang hand out his chips free. No construction worker is asked to keep costs down by building data center walls for nothing. Software engineers aren’t volunteering their time to Meta Platforms Inc. in awe of Mark Zuckerberg’s business plan — they instead command salaries of $100 million and beyond.

Yet, as ever, those in the tech industry have decided that creative works, and those who create them, should be considered of little or no value and must step aside in service of the great calling of AI — despite being every bit as vital to the product as any other factor mentioned above. As science-fiction author Harlan Ellison said in his famous sweary rant, nobody ever wants to pay the writer if they can get away with it. When it comes to AI, paying creators of original work isn’t impossible, it’s just inconvenient. Legislators should leave companies no choice."

Friday, June 27, 2025

Getty drops copyright allegations in UK lawsuit against Stability AI; AP, June 25, 2025

KELVIN CHAN, AP; Getty drops copyright allegations in UK lawsuit against Stability AI

"Getty Images dropped copyright infringement allegations from its lawsuit against artificial intelligence company Stability AI as closing arguments began Wednesday in the landmark case at Britain’s High Court.

Seattle-based Getty’s decision to abandon the copyright claim removes a key part of its lawsuit against Stability AI, which owns a popular AI image-making tool called Stable Diffusion. The two have been facing off in a widely watched court case that could have implications for the creative and technology industries."

Wednesday, June 25, 2025

Judge dismisses authors’ copyright lawsuit against Meta over AI training; AP, June 25, 2025

MATT O’BRIEN AND BARBARA ORTUTAY, AP; Judge dismisses authors’ copyright lawsuit against Meta over AI training

"Although Meta prevailed in its request to dismiss the case, it could turn out to be a pyrrhic victory. In his 40-page ruling, Chhabria repeatedly indicated reasons to believe that Meta and other AI companies have turned into serial copyright infringers as they train their technology on books and other works created by humans, and seemed to be inviting other authors to bring cases to his court presented in a manner that would allow them to proceed to trial.

The judge scoffed at arguments that requiring AI companies to adhere to decades-old copyright laws would slow down advances in a crucial technology at a pivotal time. “These products are expected to generate billions, even trillions of dollars for the companies that are developing them. If using copyrighted works to train the models is as necessary as the companies say, they will figure out a way to compensate copyright holders for it.”

Tuesday, June 24, 2025

Anthropic’s AI copyright ‘win’ is more complicated than it looks; Fast Company, June 24, 2025

CHRIS STOKEL-WALKER, Fast Company;Anthropic’s AI copyright ‘win’ is more complicated than it looks

"And that’s the catch: This wasn’t an unvarnished win for Anthropic. Like other tech companies, Anthropic allegedly sourced training materials from piracy sites for ease—a fact that clearly troubled the court. “This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use,” Alsup wrote, referring to Anthropic’s alleged pirating of more than 7 million books.

That alone could carry billions in liability, with statutory damages starting at $750 per book—a trial on that issue is still to come.

So while tech companies may still claim victory (with some justification, given the fair use precedent), the same ruling also implies that companies will need to pay substantial sums to legally obtain training materials. OpenAI, for its part, has in the past argued that licensing all the copyrighted material needed to train its models would be practically impossible.

Joanna Bryson, a professor of AI ethics at the Hertie School in Berlin, says the ruling is “absolutely not” a blanket win for tech companies. “First of all, it’s not the Supreme Court. Secondly, it’s only one jurisdiction: The U.S.,” she says. “I think they don’t entirely have purchase over this thing about whether or not it was transformative in the sense of changing Claude’s output.”"

The copyright war between the AI industry and creatives; Financial Times, June 23, 2025

MARTIN WOLF, Financial Times ; The copyright war between the AI industry and creatives

"One is that the government itself estimates that “creative industries generated £126bn in gross value added to the economy [5 per cent of GDP] and employed 2.4 million people in 2022”. It is at the very least an open question whether the value added of the AI industry will ever be of a comparable scale in this country. Another is that the creative industries represent much of the best of what the UK and indeed humanity does. The idea of handing over its output for free is abhorrent...

Interestingly, for much of the 19th century, the US did not recognise international copyright at all in its domestic law. Anthony Trollope himself complained fiercely about the theft of the copyright over his books."

Anthropic wins key US ruling on AI training in authors' copyright lawsuit; Reuters, June 24, 2025

Blake Brittain, Reuters; Anthropic wins key US ruling on AI training in authors' copyright lawsuit

"A federal judge in San Francisco ruled late on Monday that Anthropic's use of books without permission to train its artificial intelligence system was legal under U.S. copyright law.

Siding with tech companies on a pivotal question for the AI industry, U.S. District Judge William Alsup said Anthropic made "fair use" of books by writers Andrea Bartz, Charles Graeber and Kirk Wallace Johnson to train its Claude large language model.

Alsup also said, however, that Anthropic's copying and storage of more than 7 million pirated books in a "central library" infringed the authors' copyrights and was not fair use. The judge has ordered a trial in December to determine how much Anthropic owes for the infringement."

Study: Meta AI model can reproduce almost half of Harry Potter book; Ars Technica, June 20, 2025

TIMOTHY B. LEE , Ars Techcnica; Study: Meta AI model can reproduce almost half of Harry Potter book

"In recent years, numerous plaintiffs—including publishers of books, newspapers, computer code, and photographs—have sued AI companies for training models using copyrighted material. A key question in all of these lawsuits has been how easily AI models produce verbatim excerpts from the plaintiffs’ copyrighted content.

For example, in its December 2023 lawsuit against OpenAI, The New York Times Company produced dozens of examples where GPT-4 exactly reproduced significant passages from Times stories. In its response, OpenAI described this as a “fringe behavior” and a “problem that researchers at OpenAI and elsewhere work hard to address.”

But is it actually a fringe behavior? And have leading AI companies addressed it? New research—focusing on books rather than newspaper articles and on different companies—provides surprising insights into this question. Some of the findings should bolster plaintiffs’ arguments, while others may be more helpful to defendants.

The paper was published last month by a team of computer scientists and legal scholars from Stanford, Cornell, and West Virginia University. They studied whether five popular open-weight models—three from Meta and one each from Microsoft and EleutherAI—were able to reproduce text from Books3, a collection of books that is widely used to train LLMs. Many of the books are still under copyright."

Friday, June 20, 2025

Two Major Lawsuits Aim to Answer a Multi-Billion-Dollar Question: Can AI Train on Your Creative Work Without Permission?; The National Law Review, June 18, 2025

Andrew R. Lee, Timothy P. Scanlan, Jr. of Jones Walker LLP , The National Law Review; Two Major Lawsuits Aim to Answer a Multi-Billion-Dollar Question: Can AI Train on Your Creative Work Without Permission?

"In a London courtroom, lawyers faced off in early June in a legal battle that could shape the future relationship between artificial intelligence and creative work. The case pits Getty Images, a major provider of stock photography, against Stability AI, the company behind the popular AI art generator, Stable Diffusion.

At the heart of the dispute is Getty's claim that Stability AI unlawfully used 12 million of its copyrighted images to train its AI model. The outcome of this case could establish a critical precedent for whether AI companies can use publicly available online content for training data or if they will be required to license it.

On the first day of trial, Getty's lawyer told the London High Court that the company “recognises that the AI industry overall may be a force for good,” but that did not justify AI companies “riding roughshod over intellectual property rights.”

A Key Piece of Evidence

A central component of Getty's case is the observation that Stable Diffusion's output sometimes includes distorted versions of the Getty Images watermark. Getty argues this suggests its images were not only used for training but are also being partially reproduced by the AI model.

Stability AI has taken the position that training an AI model on images constitutes a transformative use of that data. The argument is that teaching a machine from existing information is fundamentally different from direct copying."

Sunday, June 15, 2025

AI chatbots need more books to learn from. These libraries are opening their stacks; AP, June 12, 2025

MATT O’BRIEN, AP; AI chatbots need more books to learn from. These libraries are opening their stacks

"Supported by “unrestricted gifts” from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries and museums around the world on how to make their historic collections AI-ready in a way that also benefits the communities they serve.

“We’re trying to move some of the power from this current AI moment back to these institutions,” said Aristana Scourtas, who manages research at Harvard Law School’s Library Innovation Lab. “Librarians have always been the stewards of data and the stewards of information.

Harvard’s newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s — a Korean painter’s handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.

It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems."