
Friday, April 26, 2024

Op-Ed: AI’s Most Pressing Ethics Problem; Columbia Journalism Review, April 23, 2024

Anika Collier Navaroli, Columbia Journalism Review; Op-Ed: AI’s Most Pressing Ethics Problem

"I believe that, now more than ever, it’s time for people to organize and demand that AI companies pause their advance toward deploying more powerful systems and work to fix the technology’s current failures. While it may seem like a far-fetched idea, in February, Google decided to suspend its AI chatbot after it was enveloped in a public scandal. And just last month, in the wake of reporting about a rise in scams using the cloned voices of loved ones to solicit ransom, OpenAI announced it would not be releasing its new AI voice generator, citing its “potential for synthetic voice misuse.”

But I believe that society can’t just rely on the promises of American tech companies that have a history of putting profits and power above people. That’s why I argue that Congress needs to create an agency to regulate the industry. In the realm of AI, this agency should address potential harms by prohibiting the use of synthetic data and by requiring companies to audit and clean the original training data being used by their systems.

AI is now an omnipresent part of our lives. If we pause to fix the mistakes of the past and create new ethical guidelines and guardrails, it doesn’t have to become an existential threat to our future."

Wednesday, March 20, 2024

Google hit with $270M fine in France as authority finds news publishers’ data was used for Gemini; TechCrunch, March 20, 2024

Natasha Lomas and Romain Dillet, TechCrunch; Google hit with $270M fine in France as authority finds news publishers’ data was used for Gemini

"In a never-ending saga between Google and France’s competition authority over copyright protections for news snippets, the Autorité de la Concurrence announced a €250 million fine against the tech giant Wednesday (around $270 million at today’s exchange rate).

According to the competition watchdog, Google disregarded some of its previous commitments with news publishers. But the decision is especially notable because it drops something else that’s bang up-to-date — by latching onto Google’s use of news publishers’ content to train its generative AI model Bard/Gemini.

The competition authority has found fault with Google for failing to notify news publishers of this GenAI use of their copyrighted content. This is in light of earlier commitments Google made which are aimed at ensuring it undertakes fair payment talks with publishers over reuse of their content."

Thursday, March 7, 2024

Introducing CopyrightCatcher, the first Copyright Detection API for LLMs; Patronus AI, March 6, 2024

Patronus AI; Introducing CopyrightCatcher, the first Copyright Detection API for LLMs

"Managing risks from unintended copyright infringement in LLM outputs should be a central focus for companies deploying LLMs in production.

  • On an adversarial copyright test designed by Patronus AI researchers, we found that state-of-the-art LLMs generate copyrighted content at an alarmingly high rate 😱
  • OpenAI’s GPT-4 produced copyrighted content on 44% of the prompts.
  • Mistral’s Mixtral-8x7B-Instruct-v0.1 produced copyrighted content on 22% of the prompts.
  • Anthropic’s Claude-2.1 produced copyrighted content on 8% of the prompts.
  • Meta’s Llama-2-70b-chat produced copyrighted content on 10% of the prompts.
  • Check out CopyrightCatcher, our solution to detect potential copyright violations in LLMs. Here’s the public demo, with open source model inference powered by Databricks Foundation Model APIs. 🔥

LLM training data often contains copyrighted works, and it is pretty easy to get an LLM to generate exact reproductions from these texts. It is critical to catch these reproductions, since they pose significant legal and reputational risks for companies that build and use LLMs in production systems. OpenAI, Anthropic, and Microsoft have all faced copyright lawsuits on LLM generations from authors, music publishers, and more recently, the New York Times.

To check whether LLMs respond to your prompts with copyrighted text, you can use CopyrightCatcher. It detects when LLMs generate exact reproductions of content from text sources like books, and highlights any copyrighted text in LLM outputs. Check out our public CopyrightCatcher demo here!"
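
CopyrightCatcher’s internals are not public, so the following is only a minimal sketch of the kind of check the post describes: flagging LLM output that shares a long verbatim word window with a known source text. The function names and the 10-word threshold are assumptions for illustration, not Patronus AI’s actual method.

```python
# Hypothetical sketch of exact-reproduction detection; NOT the real
# CopyrightCatcher API. Flags LLM output that repeats a long verbatim
# span (here, any 10-word window) from a known source text.

def ngrams(tokens, n):
    """Return the set of all n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlaps(llm_output: str, source_text: str, n: int = 10):
    """List the n-word spans the LLM output shares verbatim with the source."""
    out_tokens = llm_output.split()
    source_grams = ngrams(source_text.split(), n)
    return [
        " ".join(out_tokens[i:i + n])
        for i in range(len(out_tokens) - n + 1)
        if tuple(out_tokens[i:i + n]) in source_grams
    ]

# Example with a made-up "copyrighted" sentence.
source = "It was the best of times, it was the worst of times, it was the age of wisdom"
output = "The model said: it was the best of times, it was the worst of times, indeed"
for span in verbatim_overlaps(output, source):
    print("Potential reproduction:", span)
```

A shared 10-word window is very unlikely to occur by coincidence, which is why long exact n-gram overlap is a common first-pass proxy for reproduction; a real system would add normalization (case, punctuation) and fuzzy matching on top.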

Thursday, February 29, 2024

The Intercept, Raw Story and AlterNet sue OpenAI for copyright infringement; The Guardian, February 28, 2024

The Guardian; The Intercept, Raw Story and AlterNet sue OpenAI for copyright infringement

"OpenAI and Microsoft are facing a fresh round of lawsuits from news publishers over allegations that their generative artificial intelligence products violated copyright laws and illegally trained by using journalists’ work. Three progressive US outlets – the Intercept, Raw Story and AlterNet – filed suits in Manhattan federal court on Wednesday, demanding compensation from the tech companies.

The news outlets claim that the companies in effect plagiarized copyright-protected articles to develop and operate ChatGPT, which has become OpenAI’s most prominent generative AI tool. They allege that ChatGPT was trained not to respect copyright, ignores proper attribution and fails to notify users when the service’s answers are generated using journalists’ protected work."

Thursday, February 15, 2024

NIST Researchers Suggest Historical Precedent for Ethical AI Research; NIST, February 15, 2024

NIST; NIST Researchers Suggest Historical Precedent for Ethical AI Research

"If we train artificial intelligence (AI) systems on biased data, they can in turn make biased judgments that affect hiring decisions, loan applications and welfare benefits — to name just a few real-world implications. With this fast-developing technology potentially causing life-changing consequences, how can we make sure that humans train AI systems on data that reflects sound ethical principles? 

A multidisciplinary team of researchers at the National Institute of Standards and Technology (NIST) is suggesting that we already have a workable answer to this question: We should apply the same basic principles that scientists have used for decades to safeguard human subjects research. These three principles — summarized as “respect for persons, beneficence and justice” — are the core ideas of 1979’s watershed Belmont Report, a document that has influenced U.S. government policy on conducting research on human subjects.

The team has published its work in the February issue of IEEE’s Computer magazine, a peer-reviewed journal. While the paper is the authors’ own work and is not official NIST guidance, it dovetails with NIST’s larger effort to support the development of trustworthy and responsible AI.

“We looked at existing principles of human subjects research and explored how they could apply to AI,” said Kristen Greene, a NIST social scientist and one of the paper’s authors. “There’s no need to reinvent the wheel. We can apply an established paradigm to make sure we are being transparent with research participants, as their data may be used to train AI.”

The Belmont Report arose from an effort to respond to unethical research studies, such as the Tuskegee syphilis study, involving human subjects. In 1974, the U.S. created the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, and it identified the basic ethical principles for protecting people in research studies. A U.S. federal regulation later codified these principles in 1991’s Common Rule, which requires that researchers get informed consent from research participants. Adopted by many federal departments and agencies, the Common Rule was revised in 2017 to take into account changes and developments in research."

Thursday, February 1, 2024

The economy and ethics of AI training data; Marketplace.org, January 31, 2024

Matt Levin, Marketplace.org; The economy and ethics of AI training data

"Maybe the only industry hotter than artificial intelligence right now? AI litigation. 

Just a sampling: Writer Michael Chabon is suing Meta. Getty Images is suing Stability AI. And both The New York Times and The Authors Guild have filed separate lawsuits against OpenAI and Microsoft. 

At the heart of these cases is the allegation that tech companies illegally used copyrighted works as part of their AI training data. 

For text-focused generative AI, there’s a good chance that some of that training data originated from one massive archive: Common Crawl.

“Common Crawl is the copy of the internet. It’s a 17-year archive of the internet. We make this freely available to researchers, academics and companies,” said Rich Skrenta, who heads the nonprofit Common Crawl Foundation."
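
The Marketplace piece contains no code, but the Common Crawl index Skrenta describes is genuinely open: anyone can query it over HTTP. Below is a small sketch, assuming the Python requests library and using one example crawl ID (snapshot names change monthly, so the ID may need updating), that checks whether pages from a given site appear in a snapshot.

```python
# Sketch: query Common Crawl's public CDX index for records about a site.
# The endpoint and JSON-lines output are part of Common Crawl's public
# index service; CC-MAIN-2024-10 is just one example snapshot.
import json
import requests

CRAWL_ID = "CC-MAIN-2024-10"
INDEX_URL = f"https://index.commoncrawl.org/{CRAWL_ID}-index"

def lookup(url_pattern: str, limit: int = 5):
    """Return up to `limit` index records for pages matching the pattern."""
    resp = requests.get(
        INDEX_URL,
        params={"url": url_pattern, "output": "json", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    # The index answers with one JSON record per line.
    return [json.loads(line) for line in resp.text.splitlines()]

for record in lookup("example.com/*"):
    print(record["timestamp"], record["url"])
```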

Saturday, January 27, 2024

Training Generative AI Models on Copyrighted Works Is Fair Use; ARL Views, January 23, 2024

Katherine Klosek, Director of Information Policy and Federal Relations, Association of Research Libraries (ARL), and Marjory S. Blumenthal, Senior Policy Fellow, American Library Association (ALA) Office of Public Policy and Advocacy, ARL Views; Training Generative AI Models on Copyrighted Works Is Fair Use

"In a blog post about the case, OpenAI cites the Library Copyright Alliance (LCA) position that “based on well-established precedent, the ingestion of copyrighted works to create large language models or other AI training databases generally is a fair use.” LCA explained this position in our submission to the US Copyright Office notice of inquiry on copyright and AI, and in the LCA Principles for Copyright and AI.

LCA is not involved in any of the AI lawsuits. But as champions of fair use, free speech, and freedom of information, libraries have a stake in maintaining the balance of copyright law so that it is not used to block or restrict access to information. We drafted the principles on AI and copyright in response to efforts to amend copyright law to require licensing schemes for generative AI that could stunt the development of this technology, and undermine its utility to researchers, students, creators, and the public. The LCA principles hold that copyright law as applied and interpreted by the Copyright Office and the courts is flexible and robust enough to address issues of copyright and AI without amendment. The LCA principles also make the careful and critical distinction between input to train an LLM, and output—which could potentially be infringing if it is substantially similar to an original expressive work.

On the question of whether ingesting copyrighted works to train LLMs is fair use, LCA points to the history of courts applying the US Copyright Act to AI."

Friday, January 26, 2024

George Carlin Estate Sues Creators of AI-Generated Comedy Special in Key Lawsuit Over Stars’ Likenesses; The Hollywood Reporter, January 25, 2024

Winston Cho, The Hollywood Reporter; George Carlin Estate Sues Creators of AI-Generated Comedy Special in Key Lawsuit Over Stars’ Likenesses

"The complaint seeks a court order for immediate removal of the special, as well as unspecified damages. It’s among the first legal actions taken by the estate of a deceased celebrity for unlicensed use of their work and likeness to manufacture a new, AI-generated creation and was filed as Hollywood is sounding the alarm over utilization of AI to impersonate people without consent or compensation...

According to the complaint, the special was created through unauthorized use of Carlin’s copyrighted works.

At the start of the video, it’s explained that the AI program that created the special ingested five decades of Carlin’s original stand-up routines, which are owned by the comedian’s estate, as training materials, “thereby making unauthorized copies” of the copyrighted works...

If signed into law, the proposal, called the No AI Fraud Act, could curb a growing trend of individuals and businesses creating AI-recorded tracks using artists’ voices and deceptive ads in which it appears a performer is endorsing a product. In the absence of a federal right of publicity law, unions and trade groups in Hollywood have been lobbying for legislation requiring individuals’ consent to use their voice and likeness."

Tuesday, January 2, 2024

Copyright law is AI's 2024 battlefield; Axios, January 2, 2024

Megan Morrone, Axios; Copyright law is AI's 2024 battlefield

"Looming fights over copyright in AI are likely to set the new technology's course in 2024 faster than legislation or regulation.

Driving the news: The New York Times filed a lawsuit against OpenAI and Microsoft on December 27, claiming their AI systems' "widescale copying" constitutes copyright infringement.

The big picture: After a year of lawsuits from creators protecting their works from getting gobbled up and repackaged by generative AI tools, the new year could see significant rulings that alter the progress of AI innovation. 

Why it matters: The copyright decisions coming down the pike — over both the use of copyrighted material in the development of AI systems and also the status of works that are created by or with the help of AI — are crucial to the technology's future and could determine winners and losers in the market."

Sunday, December 31, 2023

Boom in A.I. Prompts a Test of Copyright Law; The New York Times, December 30, 2023

J. Edward Moreno, The New York Times; Boom in A.I. Prompts a Test of Copyright Law

"The boom in artificial intelligence tools that draw on troves of content from across the internet has begun to test the bounds of copyright law...

Data is crucial to developing generative A.I. technologies — which can generate text, images and other media on their own — and to the business models of companies doing that work.

“Copyright will be one of the key points that shapes the generative A.I. industry,” said Fred Havemeyer, an analyst at the financial research firm Macquarie.

A central consideration is the “fair use” doctrine in intellectual property law, which allows creators to build upon copyrighted work...

“Ultimately, whether or not this lawsuit ends up shaping copyright law will be determined by whether the suit is really about the future of fair use and copyright, or whether it’s a salvo in a negotiation,” Jane Ginsburg, a professor at Columbia Law School, said of the lawsuit by The Times...

Competition in the A.I. field may boil down to data haves and have-nots...

“Generative A.I. begins and ends with data,” Mr. Havemeyer said."

Thursday, December 28, 2023

AI starts a music-making revolution and plenty of noise about ethics and royalties; The Washington Times, December 26, 2023

Tom Howell Jr., The Washington Times; AI starts a music-making revolution and plenty of noise about ethics and royalties

"“Music’s important. AI is changing that relationship. We need to navigate that carefully,” said Martin Clancy, an Ireland-based expert who has worked on chart-topping songs and is the founding chairman of the IEEE Global AI Ethics Arts Committee...

The Biden administration, the European Union and other governments are rushing to catch up with AI and harness its benefits while controlling its potentially adverse societal impacts. They are also wading through copyright and other matters of law.

Even if they devise legislation now, the rules likely will not go into effect for years. The EU recently enacted a sweeping AI law, but it won’t take effect until 2025.

“That’s forever in this space, which means that all we’re left with is our ethical decision-making,” Mr. Clancy said.

For now, the AI-generated music landscape is like the Wild West. Many AI-generated songs are hokey or just not very good."

Wednesday, December 27, 2023

The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work; The New York Times, December 27, 2023

Michael M. Grynbaum and Ryan Mac, The New York Times; The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work

"The New York Times sued OpenAI and Microsoft for copyright infringement on Wednesday, opening a new front in the increasingly intense legal battle over the unauthorized use of published work to train artificial intelligence technologies.

The Times is the first major American media organization to sue the companies, the creators of ChatGPT and other popular A.I. platforms, over copyright issues associated with its written works. The lawsuit, filed in Federal District Court in Manhattan, contends that millions of articles published by The Times were used to train automated chatbots that now compete with the news outlet as a source of reliable information.

The suit does not include an exact monetary demand. But it says the defendants should be held responsible for “billions of dollars in statutory and actual damages” related to the “unlawful copying and use of The Times’s uniquely valuable works.” It also calls for the companies to destroy any chatbot models and training data that use copyrighted material from The Times."

Monday, December 18, 2023

AI could threaten creators — but only if humans let it; The Washington Post, December 17, 2023

The Washington Post; AI could threaten creators — but only if humans let it

"A broader rethinking of copyright, perhaps inspired by what some AI companies are already doing, could ensure that human creators get some recompense when AI consumes their work, processes it and produces new material based on it in a manner current law doesn’t contemplate. But such a shift shouldn’t be so punishing that the AI industry has no room to grow. That way, these tools, in concert with human creators, can push the progress of science and useful arts far beyond what the Framers could have imagined."

Tuesday, November 21, 2023

Patent Poetry: Judge Throws Out Most of Artists’ AI Copyright Infringement Claims; JD Supra, November 20, 2023

Adam Philipp, AEON Law, JD Supra; Patent Poetry: Judge Throws Out Most of Artists’ AI Copyright Infringement Claims

"One of the plaintiffs’ theories of infringement was that the output images based on the Training Images are all infringing derivative works.

The court noted that to support that claim the output images would need to be substantially similar to the protected works. However, noted the court,

none of the Stable Diffusion output images provided in response to a particular Text Prompt is likely to be a close match for any specific image in the training data.

The plaintiffs argued that there was no need to show substantial similarity when there was direct proof of copying. The judge was skeptical of that argument.

This is just one of many AI-related cases making its way through the courts, and this is just a ruling on a motion rather than an appellate court decision. Nevertheless, this line of analysis will likely be cited in other cases now pending.

Also, this case shows the importance of artists registering their works with the Copyright Office before seeking to sue for infringement."

Saturday, October 28, 2023

An AI engine scans a book. Is that copyright infringement or fair use?; Columbia Journalism Review, October 26, 2023

Mathew Ingram, Columbia Journalism Review; An AI engine scans a book. Is that copyright infringement or fair use?

"Determining whether LLMs training themselves on copyrighted text qualifies as fair use can be difficult even for experts—not just because AI is complicated, but because the concept of fair use is, too."

Thursday, October 26, 2023

Why I let an AI chatbot train on my book; Vox, October 25, 2023

Vox; Why I let an AI chatbot train on my book

"What’s “fair use” for AI?

I think that training a chatbot for nonprofit, educational purposes, with the express permission of the authors of the works on which it’s trained, seems okay. But do novelists like George R.R. Martin or John Grisham have a case against for-profit companies that take their work without that express permission?

The law, unfortunately, is far from clear on this question." 

Thursday, October 19, 2023

AI is learning from stolen intellectual property. It needs to stop.; The Washington Post, October 19, 2023

William D. Cohan, The Washington Post; AI is learning from stolen intellectual property. It needs to stop.

"The other day someone sent me the searchable database published by Atlantic magazine of more than 191,000 e-books that have been used to train the generative AI systems being developed by Meta, Bloomberg and others. It turns out that four of my seven books are in the data set, called Books3. Whoa.

Not only did I not give permission for my books to be used to generate AI products, but I also wasn’t even consulted about it. I had no idea this was happening. Neither did my publishers, Penguin Random House (for three of the books) and Macmillan (for the other one). Neither my publishers nor I were compensated for use of my intellectual property. Books3 just scraped the content away for free, with Meta et al. profiting merrily along the way. And Books3 is just one of many pirated collections being used for this purpose...

This is wholly unacceptable behavior. Our books are copyrighted material, not free fodder for wealthy companies to use as they see fit, without permission or compensation. Many, many hours of serious research, creative angst and plain old hard work go into writing and publishing a book, and few writers are compensated like professional athletes, Hollywood actors or Wall Street investment bankers. Stealing our intellectual property hurts." 

Wednesday, October 18, 2023

A.I. May Not Get a Chance to Kill Us if This Kills It First; Slate, October 17, 2023

Scott Nover, Slate; A.I. May Not Get a Chance to Kill Us if This Kills It First

"There is a disaster scenario for OpenAI and other companies funneling billions into A.I. models: If a court found that a company was liable for copyright infringement, it could completely halt the development of the offending model." 

Thursday, August 24, 2023

Scraping or Stealing? A Legal Reckoning Over AI Looms; Hollywood Reporter, August 22, 2023

Winston Cho, The Hollywood Reporter; Scraping or Stealing? A Legal Reckoning Over AI Looms

"Engineers build AI art generators by feeding AI systems, known as large language models, voluminous databases of images downloaded from the internet without licenses. The artists’ suit revolves around the argument that the practice of feeding these systems copyrighted works constitutes intellectual property theft. A finding of infringement in the case may upend how most AI systems are built in the absence of regulation placing guardrails around the industry. If the AI firms are found to have infringed on any copyrights, they may be forced to destroy datasets that have been trained on copyrighted works. They also face stiff penalties of up to $150,000 for each infringement.

AI companies maintain that their conduct is protected by fair use, which allows for the utilization of copyrighted works without permission as long as that use is transformative. The doctrine permits unlicensed use of copyrighted works under limited circumstances. The factors that determine whether a work qualifies include the purpose of the use, the degree of similarity, and the impact of the derivative work on the market for the original. Central to the artists’ case is winning the argument that the AI systems don’t create works of “transformative use,” defined as when the purpose of the copyrighted work is altered to create something with a new meaning or message."

Tuesday, July 25, 2023

The Generative AI Battle Has a Fundamental Flaw; Wired, July 25, 2023

Wired; The Generative AI Battle Has a Fundamental Flaw

"At the core of these cases, explains Sag, is the same general theory: that LLMs “copied” authors’ protected works. Yet, as Sag explained in testimony to a US Senate subcommittee hearing earlier this month, models like GPT-3.5 and GPT-4 do not “copy” work in the traditional sense. Digest would be a more appropriate verb—digesting training data to carry out their function: predicting the best next word in a sequence. “Rather than thinking of an LLM as copying the training data like a scribe in a monastery,” Sag said in his Senate testimony, “it makes more sense to think of it as learning from the training data like a student.”...

Ultimately, though, the technology is not going away, and copyright can only remedy some of its consequences. As Stephanie Bell, a research fellow at the nonprofit Partnership on AI, notes, setting a precedent where creative works can be treated like uncredited data is “very concerning.” To fully address a problem like this, the regulations AI needs aren't yet on the books."
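
As a purely illustrative aside (not part of the Wired piece): the “predicting the best next word in a sequence” behavior Sag describes is, mechanically, a loop like the sketch below. It uses the small open GPT-2 model via Hugging Face’s transformers library, with greedy argmax decoding standing in for “best next word”; deployed systems typically sample rather than always taking the single top token.

```python
# Toy sketch of next-token prediction with an open model (GPT-2).
# Greedy decoding: repeatedly append the single highest-scoring token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Fair use is", return_tensors="pt").input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits       # scores for every vocabulary token
    next_id = logits[0, -1].argmax()     # the "best next word"
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tok.decode(ids[0]))
```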