Showing posts with label AI training data.

Thursday, February 20, 2025

AI and Copyright: Expanding Copyright Hurts Everyone—Here’s What to Do Instead; Electronic Frontier Foundation (EFF), February 19, 2025

Tori Noble, Electronic Frontier Foundation (EFF); AI and Copyright: Expanding Copyright Hurts Everyone—Here’s What to Do Instead


[Kip Currier: No, not everyone. Not requiring Big Tech to figure out a way to fairly license or get permission to use the copyrighted works of creators unjustly advantages these deep-pocketed corporations. It also inequitably disadvantages the economic and creative interests of the human beings who labor to create copyrightable content -- authors, songwriters, visual artists, and many others.

The tell is that many of these same Big Tech companies are only too willing to file copyright infringement lawsuits against anyone who, they allege, is infringing their AI content to create competing products and services.]


[Excerpt]


"Threats to Socially Valuable Research and Innovation 

Requiring researchers to license fair uses of AI training data could make socially valuable research based on machine learning (ML) and even text and data mining (TDM) prohibitively complicated and expensive, if not impossible. Researchers have relied on fair use to conduct TDM research for a decade, leading to important advancements in myriad fields. However, licensing the vast quantity of works that high-quality TDM research requires is frequently cost-prohibitive and practically infeasible.  

Fair use protects ML and TDM research for good reason. Without fair use, copyright would hinder important scientific advancements that benefit all of us. Empirical studies back this up: research using TDM methodologies is more common in countries that protect TDM research from copyright control; in countries that don’t, copyright restrictions stymie beneficial research. It’s easy to see why: it would be impossible to identify and negotiate with millions of different copyright owners to analyze, say, text from the internet."

Sunday, February 16, 2025

Court filings show Meta paused efforts to license books for AI training; TechCrunch, February 14, 2025

 Kyle Wiggers, TechCrunch; Court filings show Meta paused efforts to license books for AI training

"According to one transcript, Sy Choudhury, who leads Meta’s AI partnership initiatives, said that Meta’s outreach to various publishers was met with “very slow uptake in engagement and interest.”

“I don’t recall the entire list, but I remember we had made a long list from initially scouring the Internet of top publishers, et cetera,” Choudhury said, per the transcript, “and we didn’t get contact and feedback from — from a lot of our cold call outreaches to try to establish contact.”

Choudhury added, “There were a few, like, that did, you know, engage, but not many.”

According to the court transcripts, Meta paused certain AI-related book licensing efforts in early April 2023 after encountering “timing” and other logistical setbacks. Choudhury said some publishers, in particular fiction book publishers, turned out to not in fact have the rights to the content that Meta was considering licensing, per a transcript.

“I’d like to point out that the — in the fiction category, we quickly learned from the business development team that most of the publishers we were talking to, they themselves were representing that they did not have, actually, the rights to license the data to us,” Choudhury said. “And so it would take a long time to engage with all their authors.”"

Friday, February 14, 2025

AI companies flaunt their theft. News media has to fight back – so we're suing. | Opinion; USA Today, February 13, 2025

Danielle Coffey, USA Today; AI companies flaunt their theft. News media has to fight back – so we're suing. | Opinion

"Danielle Coffey is president & CEO of the News/Media Alliance, which represents 2,000 news and magazine media outlets worldwide...

This is not an anti-AI lawsuit or an effort to turn back the clock. We love technology. We use it in our businesses. Artificial intelligence will help us better serve our customers, but only if it respects intellectual property. That’s the remedy we’re seeking in court.

When it suits them, the AI companies assert similar claims to ours. Meta's lawsuit accused Bright Data of scraping data in violation of its terms of use. And Sam Altman of OpenAI has complained that DeepSeek illegally copied its algorithms.

Good actors, responsible technologies and potential legislation offer some hope for improving the situation. But what is urgently needed is what every market needs: reinforcement of legal protections against theft."

Thursday, February 13, 2025

News publishers sue Cohere for copyright and trademark infringement; Axios, February 13, 2025

 

"More than a dozen major U.S. news organizations on Thursday said they were suing Cohere, an enterprise AI company, claiming the tech startup illegally repurposed their work and did so in a way that tarnished their brands.

Why it matters: The lawsuit represents the first official legal action against an AI company organized by the News Media Alliance — the largest news media trade group in the U.S...

  • The NMA members participating in the lawsuit include Advance Local Media, Condé Nast, The Atlantic, Forbes Media, The Guardian, Business Insider, The Los Angeles Times, McClatchy Media Company, Newsday, Plain Dealer Publishing Company, Politico, The Republican Company, Toronto Star Newspapers, and Vox Media.

Between the lines: The complaint was filed shortly after the U.S. Copyright Office changed its copyright registration processes to make them faster for digital publishers.

  • Previously, the process by which digital publishers had to file for copyright protections for individual works was extremely cumbersome, limiting their ability to seek protection. 

Because of those changes, Coffey explained, NMA and the publishers who are suing Cohere were able to identify thousands of specific examples of Cohere verbatim copying their copyright-protected works."

Wednesday, February 12, 2025

Court: Training AI Model Based on Copyrighted Data Is Not Fair Use as a Matter of Law; The National Law Review, February 11, 2025

Joseph A. Meckes and Joseph Grasser of Squire Patton Boggs (US) LLP - Global IP and Technology Law Blog, The National Law Review; Court: Training AI Model Based on Copyrighted Data Is Not Fair Use as a Matter of Law

"In what may turn out to be an influential decision, Judge Stephanos Bibas ruled as a matter of law in Thomson Reuters v. Ross Intelligence that creating short summaries of law to train Ross Intelligence’s artificial intelligence legal research application not only infringes Thomson Reuters’ copyrights as a matter of law but that the copying is not fair use. Judge Bibas had previously ruled that infringement and fair use were issues for the jury but changed his mind: “A smart man knows when he is right; a wise man knows when he is wrong.”

At issue in the case was whether Ross Intelligence directly infringed Thomson Reuters’ copyrights in its case law headnotes that are organized by Westlaw’s proprietary Key Number system. Thomson Reuters contended that Ross Intelligence’s contractor copied those headnotes to create “Bulk Memos.” Ross Intelligence used the Bulk Memos to train its competitive AI-powered legal research tool. Judge Bibas ruled that (i) the West headnotes were sufficiently original and creative to be copyrightable, and (ii) some of the Bulk Memos used by Ross were so similar that they infringed as a matter of law...

In other words, even if a work is selected entirely from the public domain, the simple act of selection is enough to give rise to copyright protection."

Monday, February 10, 2025

Meta staff torrented nearly 82TB of pirated books for AI training — court records reveal copyright violations; Tom's Hardware, February 9, 2025

 

Tom's Hardware; Meta staff torrented nearly 82TB of pirated books for AI training — court records reveal copyright violations

"Facebook parent-company Meta is currently fighting a class action lawsuit alleging copyright infringement and unfair competition, among others, with regards to how it trained LLaMA. According to an X (formerly Twitter) post by vx-underground, court records reveal that the social media company used pirated torrents to download 81.7TB of data from shadow libraries including Anna’s Archive, Z-Library, and LibGen. It then used this information to train its AI models.

The evidence, in the form of written communication, shows the researchers’ concerns about Meta’s use of pirated materials. One senior AI researcher said way back in October 2022, “I don’t think we should use pirated material. I really need to draw a line here.” While another one said, “Using pirated material should be beyond our ethical threshold,” then they added, “SciHub, ResearchGate, LibGen are basically like PirateBay or something like that, they are distributing content that is protected by copyright and they’re infringing it.”"

Tuesday, January 28, 2025

Elton John backs Paul McCartney in criticising proposed overhaul to UK copyright system; The Guardian, January 27, 2025

The Guardian; Elton John backs Paul McCartney in criticising proposed overhaul to UK copyright system

"Elton John has backed Paul McCartney in criticising a proposed overhaul of the UK copyright system, and has called for new rules to prevent tech companies from riding “roughshod over the traditional copyright laws that protect artists’ livelihoods”.

John has backed proposed amendments to the data (use and access) bill that would extend existing copyright protections, when it goes before a vote in the House of Lords on Tuesday.

The government is also consulting on an overhaul of copyright laws that would result in artists having to opt out of letting AI companies train their models using their work, rather than an opt-in model...

John told the Sunday Times that he felt “wheels are in motion to allow AI companies to ride roughshod over the traditional copyright laws that protect artists’ livelihoods. This will allow global big tech companies to gain free and easy access to artists’ work in order to train their artificial intelligence and create competing music. This will dilute and threaten young artists’ earnings even further. The musician community rejects it wholeheartedly.”

He said that “challenging financial situations” and increased touring costs made it “harder than ever for new and emerging musicians to make the finances of the industry stack up to sustain a fledgling career”, and added that the UK’s place on the world stage as “a leader in arts and popular culture is under serious jeopardy” without robust copyright protection.

“It is the absolute bedrock of artistic prosperity, and the country’s future success in the creative industries depends on it.”

The government consultation runs until 25 February and will explore how to improve trust between the creative and AI sectors, and how creators can license and get paid for use of their material."

Sunday, January 19, 2025

Congress Must Change Copyright Law for AI | Opinion; Newsweek, January 16, 2025

Assistant Professor of Business Law, Georgia College and State University, Newsweek; Congress Must Change Copyright Law for AI | Opinion

"Luckily, the Constitution points the way forward. In Article I, Section 8, Congress is explicitly empowered "to promote the Progress of Science" through copyright law. That is to say, the power to create copyrights isn't just about protecting content creators, it's also about advancing human knowledge and innovation.

When the Founders gave Congress this power, they couldn't have imagined artificial intelligence, but they clearly understood that intellectual property laws would need to evolve to promote scientific progress. Congress therefore not only has the authority to adapt copyright law for the AI age, it has the duty to ensure our intellectual property framework promotes rather than hinders technological progress.

Consider what's at risk with inaction...

While American companies are struggling with copyright constraints, China is racing ahead with AI development, unencumbered by such concerns. The Chinese Communist Party has made it clear that they view AI supremacy as a key strategic goal, and they're not going to let intellectual property rights stand in their way.

The choice before us is clear, we can either reform our copyright laws to enable responsible AI development at home or we can watch as the future of AI is shaped by authoritarian powers abroad. The cost of inaction isn't just measured in lost innovation or economic opportunity, it is measured in our diminishing ability to ensure AI develops in alignment with democratic values and a respect for human rights.

The ideal solution here isn't to abandon copyright protection entirely, but to craft a careful exemption for AI training. This could even include provisions for compensating content creators through a mandated licensing framework or revenue-sharing system, ensuring that AI companies can access the data they need while creators can still benefit from and be credited for their work's use in training these models.

Critics will argue that this represents a taking from creators for the benefit of tech companies, but this misses the broader picture. The benefits of AI development flow not just to tech companies but to society as a whole. We should recognize that allowing AI models to learn from human knowledge serves a crucial public good, one we're at risk of losing if Congress doesn't act."

Thursday, January 16, 2025

In AI copyright case, Zuckerberg turns to YouTube for his defense; TechCrunch, January 15, 2025

 

TechCrunch; In AI copyright case, Zuckerberg turns to YouTube for his defense

"Meta CEO Mark Zuckerberg appears to have used YouTube’s battle to remove pirated content to defend his own company’s use of a data set containing copyrighted e-books, reveals newly released snippets of a deposition he gave late last year.

The deposition, which was part of a complaint submitted to the court by plaintiffs’ attorneys, is related to the AI copyright case Kadrey v. Meta. It’s one of many such cases winding through the U.S. court system that’s pitting AI companies against authors and other IP holders. For the most part, the defendants in these cases – AI companies – claim that training on copyrighted content is “fair use.” Many copyright holders disagree."

Wednesday, January 15, 2025

'The New York Times' takes OpenAI to court. ChatGPT's future could be on the line; NPR, January 14, 2025

NPR; 'The New York Times' takes OpenAI to court. ChatGPT's future could be on the line

"A group of news organizations, led by The New York Times, took ChatGPT maker OpenAI to federal court on Tuesday in a hearing that could determine whether the tech company has to face the publishers in a high-profile copyright infringement trial.

Three publishers' lawsuits against OpenAI and its financial backer Microsoft have been merged into one case. Leading each of the three combined cases are the Times, The New York Daily News and the Center for Investigative Reporting.

Other publishers, like the Associated Press, News Corp. and Vox Media, have reached content-sharing deals with OpenAI, but the three litigants in this case are taking the opposite path: going on the offensive."

Monday, January 6, 2025

OpenAI holds off on promise to creators, fails to protect intellectual property; The American Bazaar, January 3, 2025

Vishnu Kamal, The American Bazaar; OpenAI holds off on promise to creators, fails to protect intellectual property

"OpenAI may yet again be in hot water as it seems that the tech giant may be reneging on its earlier assurances. Reportedly, in May, OpenAI said it was developing a tool to let creators specify how they want their works to be included in—or excluded from—its AI training data. But seven months later, this feature has yet to see the light of day.

Called Media Manager, the tool would “identify copyrighted text, images, audio, and video,” OpenAI said at the time, to reflect creators’ preferences “across multiple sources.” It was intended to stave off some of the company’s fiercest critics, and potentially shield OpenAI from IP-related legal challenges...

OpenAI has faced various legal challenges related to its AI technologies and operations. One major issue involves the privacy and data usage of its language models, which are trained on large datasets that may include publicly available or copyrighted material. This raises concerns over privacy violations and intellectual property rights, especially regarding whether the data used for training was obtained with proper consent.

Additionally, there are questions about the ownership of content generated by OpenAI’s models. If an AI produces a work based on copyrighted data, it is tricky to determine who owns the rights—whether it’s OpenAI, the user who prompted the AI, or the creators of the original data.

Another concern is the liability for harmful content produced by AI. If an AI generates misleading or defamatory information, legal responsibility could fall on OpenAI."

Tuesday, December 31, 2024

Column: A Faulkner classic and Popeye enter the public domain while copyright only gets more confusing; Los Angeles Times, December 31, 2024

Michael Hiltzik, Los Angeles Times; Column: A Faulkner classic and Popeye enter the public domain while copyright only gets more confusing

"The annual flow of copyrighted works into the public domain underscores how the progressive lengthening of copyright protection is counter to the public interest—indeed, to the interests of creative artists. The initial U.S. copyright act, passed in 1790, provided for a term of 28 years including a 14-year renewal. In 1909, that was extended to 56 years including a 28-year renewal.

In 1976, the term was changed to the creator’s life plus 50 years. In 1998, Congress passed the Copyright Term Extension Act, which is known as the Sonny Bono Act after its chief promoter on Capitol Hill. That law extended the basic term to life plus 70 years; works for hire (in which a third party owns the rights to a creative work), pseudonymous and anonymous works were protected for 95 years from first publication or 120 years from creation, whichever is shorter.

Along the way, Congress extended copyright protection from written works to movies, recordings, performances and ultimately to almost all works, both published and unpublished.

Once a work enters the public domain, Jenkins observes, “community theaters can screen the films. Youth orchestras can perform the music publicly, without paying licensing fees. Online repositories such as the Internet Archive, HathiTrust, Google Books and the New York Public Library can make works fully available online. This helps enable both access to and preservation of cultural materials that might otherwise be lost to history.”"

Anthropic Agrees to Enforce Copyright Guardrails on New AI Tools; Bloomberg Law, December 30, 2024

Annelise Levy, Bloomberg Law; Anthropic Agrees to Enforce Copyright Guardrails on New AI Tools

"Anthropic PBC must apply guardrails to prevent its future AI tools from producing infringing copyrighted content, according to a Monday agreement reached with music publishers suing the company for infringing protected song lyrics. 

Eight music publishers—including Universal Music Corp. and Concord Music Group—and Anthropic filed a stipulation partly resolving the publishers’ preliminary injunction motion in the US District Court for the Northern District of California. The publishers’ request that Anthropic refrain from using unauthorized copies of lyrics to train future AI models remains pending."

Sunday, December 29, 2024

AI's assault on our intellectual property must be stopped; Financial Times, December 21, 2024

 Kate Mosse, Financial Times; AI's assault on our intellectual property must be stopped

"Imagine my dismay, therefore, to discover that those 15 years of dreaming, researching, planning, writing, rewriting, editing, visiting libraries and archives, translating Occitan texts, hunting down original 13th-century documents, becoming an expert in Catharism, apparently count for nothing. Labyrinth is just one of several of my novels that have been scraped by Meta's large language model. This has been done without my consent, without remuneration, without even notification. This is theft...

AI companies present creators as being against change. We are  not. Every artist I know is already engaging with AI in one way or another. But a distinction needs to be made between AI that can be used in brilliant ways -- for example, medical diagnosis -- and the foundations of AI models, where companies are essentially stealing creatives' work for their own profit. We should not forget that the AI companies rely on creators to build their models. Without strong copyright law that ensures creators can earn a living, AI companies will lack the high-quality material that is essential for their future growth."

Friday, December 27, 2024

The AI Boom May Be Too Good to Be True; Wall Street Journal, December 26, 2024

 Josh Harlan, Wall Street Journal; The AI Boom May Be Too Good to Be True

 "Investors rushing to capitalize on artificial intelligence have focused on the technology—the capabilities of new models, the potential of generative tools, and the scale of processing power to sustain it all. What too many ignore is the evolving legal structure surrounding the technology, which will ultimately shape the economics of AI. The core question is: Who controls the value that AI produces? The answer depends on whether AI companies must compensate rights holders for using their data to train AI models and whether AI creations can themselves enjoy copyright or patent protections.

The current landscape of AI law is rife with uncertainty...How these cases are decided will determine whether AI developers can harvest publicly available data or must license the content used to train their models."

Tech companies face tough AI copyright questions in 2025; Reuters, December 27, 2024

Reuters; Tech companies face tough AI copyright questions in 2025

"The new year may bring pivotal developments in a series of copyright lawsuits that could shape the future business of artificial intelligence.

The lawsuits from authors, news outlets, visual artists, musicians and other copyright owners accuse OpenAI, Anthropic, Meta Platforms and other technology companies of using their work to train chatbots and other AI-based content generators without permission or payment.

Courts will likely begin hearing arguments starting next year on whether the defendants' copying amounts to "fair use," which could be the AI copyright war's defining legal question."

The AI revolution is running out of data. What can researchers do?; Nature, December 11, 2024

Nicola Jones, Nature; The AI revolution is running out of data. What can researchers do?

"A prominent study made headlines this year by putting a number on this problem: researchers at Epoch AI, a virtual research institute, projected that, by around 2028, the typical size of data set used to train an AI model will reach the same size as the total estimated stock of public online text. In other words, AI is likely to run out of training data in about four years’ time (see ‘Running out of data’). At the same time, data owners — such as newspaper publishers — are starting to crack down on how their content can be used, tightening access even more. That’s causing a crisis in the size of the ‘data commons’, says Shayne Longpre, an AI researcher at the Massachusetts Institute of Technology in Cambridge who leads the Data Provenance Initiative, a grass-roots organization that conducts audits of AI data sets...

Several lawsuits are now under way attempting to win compensation for the providers of data being used in AI training. In December 2023, The New York Times sued OpenAI and its partner Microsoft for copyright infringement; in April this year, eight newspapers owned by Alden Global Capital in New York City jointly filed a similar lawsuit. The counterargument is that an AI should be allowed to read and learn from online content in the same way as a person, and that this constitutes fair use of the material. OpenAI has said publicly that it thinks The New York Times lawsuit is “without merit”.

If courts uphold the idea that content providers deserve financial compensation, it will make it harder for both AI developers and researchers to get what they need — including academics, who don’t have deep pockets. “Academics will be most hit by these deals,” says Longpre. “There are many, very pro-social, pro-democratic benefits of having an open web,” he adds."

Thursday, December 26, 2024

Harvard’s Library Innovation Lab launches Institutional Data Initiative; Harvard Law Today, December 12, 2024

 Scott Young , Harvard Law Today; Harvard’s Library Innovation Lab launches Institutional Data Initiative

"At the Institutional Data Initiative (IDI), a new program hosted within the Harvard Law School Library, efforts are already underway to expand and enhance the data resources available for AI training. At the initiative’s public launch on Dec. 12, Library Innovation Lab faculty director, Jonathan Zittrain ’95, and IDI executive director, Greg Leppert, announced plans to expand the availability of public domain data from knowledge institutions — including the text of nearly one million books scanned at Harvard Library — to train AI models...

Harvard Law Today: What is the Institutional Data Initiative?

Greg Leppert: Our work at the Institutional Data Initiative is focused on finding ways to improve the accessibility of institutional data for all uses, artificial intelligence among them. Harvard Law School Library is a tremendous repository of public domain books, briefs, research papers, and so on. Regardless of how this information was initially memorialized — hardcover, softcover, parchment, etc. — a considerable amount has been converted into digital form. At the IDI, we are working to ensure these large data sets of public domain works, like the ones from the Law School library that comprise the Caselaw Access Project, are made open and accessible, especially for AI training. Harvard is not alone in terms of the scale and quality of its data; similar sets exist throughout our academic institutions and public libraries. AI systems are only as diverse as the data on which they’re trained, and these public domain data sets ought to be part of a healthy diet for future AI training.

HLT: What problem is the Institutional Data Initiative working to solve?

Leppert: As it stands, the data being used to train AI is often limited in terms of scale, scope, quality, and integrity. Various groups and perspectives are massively underrepresented in the data currently being used to train AI. As things stand, outliers will not be served by AI as well as they should be, and otherwise could be, by the inclusion of that underrepresented data. The country of Iceland, for example, undertook a national, government-led effort to make materials from their national libraries available for AI applications. That is because they were seriously concerned the Icelandic language and culture would not be represented in AI models. We are also working towards reaffirming Harvard, and other institutions, as the stewards of their collections. The proliferation of training sets based on public domain materials has been encouraging to see, but it’s important that this doesn’t leave the material vulnerable to critical omissions or alterations. For centuries, knowledge institutions have served as stewards of information for the purpose of promoting the public good and furthering the representation of diverse ideas, cultural groups, and ways of seeing the world. So, we believe these institutions are the exact kind of sources for AI training data if we want to optimize its ability to serve humanity. As it stands today, there is significant room for improvement."

Monday, December 23, 2024

The god illusion: why the pope is so popular as a deepfake image; The Guardian, December 21, 2024

The Guardian; The god illusion: why the pope is so popular as a deepfake image

"The pope is an obvious target for deepfakes, according to experts, because there is such a vast digital “footprint” of videos, images and voice recordings related to Francis. AI models are trained on the open internet, which is stuffed with content featuring famous public figures, from politicians to celebrities and religious leaders.

“The pope is so frequently featured in the public eye and there are large volumes of photos, videos, and audio clips of him on the open web,” said Sam Stockwell, a research associate at the UK’s Alan Turing Institute.

“Since AI models are often trained indiscriminately on such data, it becomes a lot easier for these models to replicate the facial features and likeness of individuals like the pope compared with those who don’t have such a large digital footprint.”"

Saturday, December 21, 2024

Every AI Copyright Lawsuit in the US, Visualized; Wired, December 19, 2024

Kate Knibbs, Wired; Every AI Copyright Lawsuit in the US, Visualized

"WIRED is keeping close tabs on how each of these lawsuits unfold. We’ve created visualizations to help you track and contextualize which companies and rights holders are involved, where the cases have been filed, what they’re alleging, and everything else you need to know."