Showing posts with label AI training data.

Thursday, January 16, 2025

In AI copyright case, Zuckerberg turns to YouTube for his defense; TechCrunch, January 15, 2025

TechCrunch; In AI copyright case, Zuckerberg turns to YouTube for his defense

"Meta CEO Mark Zuckerberg appears to have used YouTube’s battle to remove pirated content to defend his own company’s use of a data set containing copyrighted e-books, reveals newly released snippets of a deposition he gave late last year.

The deposition, which was part of a complaint submitted to the court by plaintiffs’ attorneys, is related to the AI copyright case Kadrey v. Meta. It’s one of many such cases winding through the U.S. court system that are pitting AI companies against authors and other IP holders. For the most part, the defendants in these cases – AI companies – claim that training on copyrighted content is “fair use.” Many copyright holders disagree."

Wednesday, January 15, 2025

'The New York Times' takes OpenAI to court. ChatGPT's future could be on the line; NPR, January 14, 2025

NPR; 'The New York Times' takes OpenAI to court. ChatGPT's future could be on the line

"A group of news organizations, led by The New York Times, took ChatGPT maker OpenAI to federal court on Tuesday in a hearing that could determine whether the tech company has to face the publishers in a high-profile copyright infringement trial.

Three publishers' lawsuits against OpenAI and its financial backer Microsoft have been merged into one case. Leading each of the three combined cases are the Times, The New York Daily News and the Center for Investigative Reporting.

Other publishers, like the Associated Press, News Corp. and Vox Media, have reached content-sharing deals with OpenAI, but the three litigants in this case are taking the opposite path: going on the offensive."

Monday, January 6, 2025

OpenAI holds off on promise to creators, fails to protect intellectual property; The American Bazaar, January 3, 2025

Vishnu Kamal, The American Bazaar; OpenAI holds off on promise to creators, fails to protect intellectual property

"OpenAI may yet again be in hot water as it seems that the tech giant may be reneging on its earlier assurances. Reportedly, in May, OpenAI said it was developing a tool to let creators specify how they want their works to be included in—or excluded from—its AI training data. But seven months later, this feature has yet to see the light of day.

Called Media Manager, the tool would “identify copyrighted text, images, audio, and video,” OpenAI said at the time, to reflect creators’ preferences “across multiple sources.” It was intended to stave off some of the company’s fiercest critics, and potentially shield OpenAI from IP-related legal challenges...

OpenAI has faced various legal challenges related to its AI technologies and operations. One major issue involves the privacy and data usage of its language models, which are trained on large datasets that may include publicly available or copyrighted material. This raises concerns over privacy violations and intellectual property rights, especially regarding whether the data used for training was obtained with proper consent.

Additionally, there are questions about the ownership of content generated by OpenAI’s models. If an AI produces a work based on copyrighted data, it is tricky to determine who owns the rights—whether it’s OpenAI, the user who prompted the AI, or the creators of the original data.

Another concern is the liability for harmful content produced by AI. If an AI generates misleading or defamatory information, legal responsibility could fall on OpenAI."

Tuesday, December 31, 2024

Column: A Faulkner classic and Popeye enter the public domain while copyright only gets more confusing; Los Angeles Times, December 31, 2024

Michael Hiltzik, Los Angeles Times; Column: A Faulkner classic and Popeye enter the public domain while copyright only gets more confusing

"The annual flow of copyrighted works into the public domain underscores how the progressive lengthening of copyright protection is counter to the public interest—indeed, to the interests of creative artists. The initial U.S. copyright act, passed in 1790, provided for a term of 28 years including a 14-year renewal. In 1909, that was extended to 56 years including a 28-year renewal.

In 1976, the term was changed to the creator’s life plus 50 years. In 1998, Congress passed the Copyright Term Extension Act, which is known as the Sonny Bono Act after its chief promoter on Capitol Hill. That law extended the basic term to life plus 70 years; works for hire (in which a third party owns the rights to a creative work), pseudonymous and anonymous works were protected for 95 years from first publication or 120 years from creation, whichever is shorter.

Along the way, Congress extended copyright protection from written works to movies, recordings, performances and ultimately to almost all works, both published and unpublished.

Once a work enters the public domain, Jenkins observes, “community theaters can screen the films. Youth orchestras can perform the music publicly, without paying licensing fees. Online repositories such as the Internet Archive, HathiTrust, Google Books and the New York Public Library can make works fully available online. This helps enable both access to and preservation of cultural materials that might otherwise be lost to history.”"

Anthropic Agrees to Enforce Copyright Guardrails on New AI Tools; Bloomberg Law, December 30, 2024

Annelise Levy, Bloomberg Law; Anthropic Agrees to Enforce Copyright Guardrails on New AI Tools

"Anthropic PBC must apply guardrails to prevent its future AI tools from producing infringing copyrighted content, according to a Monday agreement reached with music publishers suing the company for infringing protected song lyrics. 

Eight music publishers—including Universal Music Corp. and Concord Music Group—and Anthropic filed a stipulation partly resolving the publishers’ preliminary injunction motion in the US District Court for the Northern District of California. The publishers’ request that Anthropic refrain from using unauthorized copies of lyrics to train future AI models remains pending."

Sunday, December 29, 2024

AI's assault on our intellectual property must be stopped; Financial Times, December 21, 2024

 Kate Mosse, Financial Times; AI's assault on our intellectual property must be stopped

"Imagine my dismay, therefore, to discover that those 15 years of dreaming, researching, planning, writing, rewriting, editing, visiting libraries and archives, translating Occitan texts, hunting down original 13th-century documents, becoming an expert in Catharsis, apparently counts for nothing. Labyrinth is just one of several of my novels that have been scraped by Meta's large language model. This has been done without my consent, without remuneration, without even notification. This is theft...

"AI companies present creators as being against change. We are not. Every artist I know is already engaging with AI in one way or another. But a distinction needs to be made between AI that can be used in brilliant ways -- for example, medical diagnosis -- and the foundations of AI models, where companies are essentially stealing creatives' work for their own profit. We should not forget that the AI companies rely on creators to build their models. Without strong copyright law that ensures creators can earn a living, AI companies will lack the high-quality material that is essential for their future growth."

Friday, December 27, 2024

The AI Boom May Be Too Good to Be True; Wall Street Journal, December 26, 2024

 Josh Harlan, Wall Street Journal; The AI Boom May Be Too Good to Be True

 "Investors rushing to capitalize on artificial intelligence have focused on the technology—the capabilities of new models, the potential of generative tools, and the scale of processing power to sustain it all. What too many ignore is the evolving legal structure surrounding the technology, which will ultimately shape the economics of AI. The core question is: Who controls the value that AI produces? The answer depends on whether AI companies must compensate rights holders for using their data to train AI models and whether AI creations can themselves enjoy copyright or patent protections.

The current landscape of AI law is rife with uncertainty...How these cases are decided will determine whether AI developers can harvest publicly available data or must license the content used to train their models."

Tech companies face tough AI copyright questions in 2025; Reuters, December 27, 2024

Reuters; Tech companies face tough AI copyright questions in 2025

"The new year may bring pivotal developments in a series of copyright lawsuits that could shape the future business of artificial intelligence.

The lawsuits from authors, news outlets, visual artists, musicians and other copyright owners accuse OpenAI, Anthropic, Meta Platforms and other technology companies of using their work to train chatbots and other AI-based content generators without permission or payment.

Courts will likely begin hearing arguments starting next year on whether the defendants’ copying amounts to “fair use,” which could be the AI copyright war’s defining legal question."

The AI revolution is running out of data. What can researchers do?; Nature, December 11, 2024

Nicola Jones, Nature; The AI revolution is running out of data. What can researchers do?

"A prominent study1 made headlines this year by putting a number on this problem: researchers at Epoch AI, a virtual research institute, projected that, by around 2028, the typical size of data set used to train an AI model will reach the same size as the total estimated stock of public online text. In other words, AI is likely to run out of training data in about four years’ time (see ‘Running out of data’). At the same time, data owners — such as newspaper publishers — are starting to crack down on how their content can be used, tightening access even more. That’s causing a crisis in the size of the ‘data commons’, says Shayne Longpre, an AI researcher at the Massachusetts Institute of Technology in Cambridge who leads the Data Provenance Initiative, a grass-roots organization that conducts audits of AI data sets...

Several lawsuits are now under way attempting to win compensation for the providers of data being used in AI training. In December 2023, The New York Times sued OpenAI and its partner Microsoft for copyright infringement; in April this year, eight newspapers owned by Alden Global Capital in New York City jointly filed a similar lawsuit. The counterargument is that an AI should be allowed to read and learn from online content in the same way as a person, and that this constitutes fair use of the material. OpenAI has said publicly that it thinks The New York Times lawsuit is “without merit”.

If courts uphold the idea that content providers deserve financial compensation, it will make it harder for both AI developers and researchers to get what they need — including academics, who don’t have deep pockets. “Academics will be most hit by these deals,” says Longpre. “There are many, very pro-social, pro-democratic benefits of having an open web,” he adds."

Thursday, December 26, 2024

Harvard’s Library Innovation Lab launches Institutional Data Initiative; Harvard Law Today, December 12, 2024

Scott Young, Harvard Law Today; Harvard’s Library Innovation Lab launches Institutional Data Initiative

"At the Institutional Data Initiative (IDI), a new program hosted within the Harvard Law School Library, efforts are already underway to expand and enhance the data resources available for AI training. At the initiative’s public launch on Dec. 12, Library Innovation Lab faculty director, Jonathan Zittrain ’95, and IDI executive director, Greg Leppert, announced plans to expand the availability of public domain data from knowledge institutions — including the text of nearly one million books scanned at Harvard Library — to train AI models...

Harvard Law Today: What is the Institutional Data Initiative?

Greg Leppert: Our work at the Institutional Data Initiative is focused on finding ways to improve the accessibility of institutional data for all uses, artificial intelligence among them. Harvard Law School Library is a tremendous repository of public domain books, briefs, research papers, and so on. Regardless of how this information was initially memorialized — hardcover, softcover, parchment, etc. — a considerable amount has been converted into digital form. At the IDI, we are working to ensure these large data sets of public domain works, like the ones from the Law School library that comprise the Caselaw Access Project, are made open and accessible, especially for AI training. Harvard is not alone in terms of the scale and quality of its data; similar sets exist throughout our academic institutions and public libraries. AI systems are only as diverse as the data on which they’re trained, and these public domain data sets ought to be part of a healthy diet for future AI training.

HLT: What problem is the Institutional Data Initiative working to solve?

Leppert: As it stands, the data being used to train AI is often limited in terms of scale, scope, quality, and integrity. Various groups and perspectives are massively underrepresented in the data currently being used to train AI. As things stand, outliers will not be served by AI as well as they should be, and otherwise could be, by the inclusion of that underrepresented data. The country of Iceland, for example, undertook a national, government-led effort to make materials from their national libraries available for AI applications. That is because they were seriously concerned the Icelandic language and culture would not be represented in AI models. We are also working towards reaffirming Harvard, and other institutions, as the stewards of their collections. The proliferation of training sets based on public domain materials has been encouraging to see, but it’s important that this doesn’t leave the material vulnerable to critical omissions or alterations. For centuries, knowledge institutions have served as stewards of information for the purpose of promoting the public good and furthering the representation of diverse ideas, cultural groups, and ways of seeing the world. So, we believe these institutions are the exact kind of sources for AI training data if we want to optimize its ability to serve humanity. As it stands today, there is significant room for improvement."

Monday, December 23, 2024

The god illusion: why the pope is so popular as a deepfake image; The Guardian, December 21, 2024

The Guardian; The god illusion: why the pope is so popular as a deepfake image

"The pope is an obvious target for deepfakes, according to experts, because there is such a vast digital “footprint” of videos, images and voice recordings related to Francis. AI models are trained on the open internet, which is stuffed with content featuring famous public figures, from politicians to celebrities and religious leaders.

“The pope is so frequently featured in the public eye and there are large volumes of photos, videos, and audio clips of him on the open web,” said Sam Stockwell, a research associate at the UK’s Alan Turing Institute.

“Since AI models are often trained indiscriminately on such data, it becomes a lot easier for these models to replicate the facial features and likeness of individuals like the pope compared with those who don’t have such a large digital footprint.”"

Saturday, December 21, 2024

Every AI Copyright Lawsuit in the US, Visualized; Wired, December 19, 2024

Kate Knibbs, Wired; Every AI Copyright Lawsuit in the US, Visualized

"WIRED is keeping close tabs on how each of these lawsuits unfold. We’ve created visualizations to help you track and contextualize which companies and rights holders are involved, where the cases have been filed, what they’re alleging, and everything else you need to know."

Thursday, December 19, 2024

Getty Images Wants $1.7 Billion From its Lawsuit With Stability AI; PetaPixel, December 19, 2024

Matt Growcoot, PetaPixel; Getty Images Wants $1.7 Billion From its Lawsuit With Stability AI

"Getty, one of the world’s largest photo agencies, launched its lawsuit in January 2023. Getty suspects that Stability AI may have used as many as 12 million of its copyrighted photos to train the AI image generator Stable Diffusion. Getty is seeking $150,000 per infringement and 12 million photos equates to a staggering $1.8 trillion.

However, according to Stability AI’s latest company accounts as reported by Sifted, Getty is seeking damages for 11,383 works at $150,000 per infringement which comes to a total of $1.7 billion. Stability AI has previously reported that Getty was seeking damages for 7,300 images so that number has increased. But Stability AI says Getty hasn’t given an exact number it wants for the lawsuit to be settled, according to Sifted."
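
Both totals follow directly from the $150,000-per-infringement figure cited in the excerpt; a quick check of the arithmetic, using only the numbers reported above:

$$12{,}000{,}000 \times \$150{,}000 = \$1.8 \text{ trillion}$$

$$11{,}383 \times \$150{,}000 = \$1{,}707{,}450{,}000 \approx \$1.7 \text{ billion}$$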

Sunday, December 8, 2024

The Copyrighted Material Being Used to Train AI; The Bulwark, December 7, 2024

Sonny Bunch, The Bulwark; The Copyrighted Material Being Used to Train AI

"On this week’s episode, I talked to Alex Reisner about his pieces in the Atlantic highlighting the copyrighted material being hoovered into large language models to help AI chatbots simulate human speech. If you’re a screenwriter and would like to see which of your work has been appropriated to aid in the effort, click here; he has assembled a searchable database of nearly 140,000 movie and TV scripts that have been used without permission. (And you should read his other stories about copyright law reaching its breaking point and “the memorization problem.”) In this episode, we also got into the metaphysics of art and asked what sort of questions need to be asked as we hurtle toward the future. If you enjoyed this episode, please share it with a friend!"

Tuesday, December 3, 2024

Getty Images CEO Calls AI Training Models ‘Pure Theft’; PetaPixel, December 3, 2024

Matt Growcoot, PetaPixel; Getty Images CEO Calls AI Training Models ‘Pure Theft’

"The CEO of Getty Images has penned a column in which he calls the practice of scraping photos and other content from the open web by AI companies “pure theft”.

Writing for Fortune, Craig Peters argues that fair use rules must be respected and that AI training practices are in contravention of those rules...

“I am responsible for an organization that employs over 1,700 individuals and represents the work of more than 600,000 journalists and creators worldwide,” writes Peters. “Copyright is at the very core of our business and the livelihood of those we employ and represent.”"

Friday, November 29, 2024

Major Canadian News Outlets Sue OpenAI in New Copyright Case; The New York Times, November 29, 2024

The New York Times; Major Canadian News Outlets Sue OpenAI in New Copyright Case

"A coalition of Canada’s biggest news organizations is suing OpenAI, the maker of the artificial intelligence chatbot, ChatGPT, accusing the company of illegally using their content in the first case of its kind in the country.

Five of the country’s major news companies, including the publishers of its top newspapers, newswires and the national broadcaster, filed the joint suit in the Ontario Superior Court of Justice on Friday morning...

The Canadian outlets, which include the Globe and Mail, the Toronto Star and the CBC — the Canadian Broadcasting Corporation — are seeking what could add up to billions of dollars in damages. They are asking for 20,000 Canadian dollars, or $14,700, per article they claim was illegally scraped and used to train ChatGPT.

They are also seeking a share of the profits made by what they claim is OpenAI’s misuse of their content, as well as for the company to stop such practices in the future."

Thursday, November 21, 2024

OpenAI accidentally deleted potential evidence in NY Times copyright lawsuit; TechCrunch, November 20, 2024

Kyle Wiggers, TechCrunch; OpenAI accidentally deleted potential evidence in NY Times copyright lawsuit

"OpenAI tried to recover the data — and was mostly successful. However, because the folder structure and file names were “irretrievably” lost, the recovered data “cannot be used to determine where the news plaintiffs’ copied articles were used to build [OpenAI’s] models,” per the letter.

“News plaintiffs have been forced to recreate their work from scratch using significant person-hours and computer processing time,” counsel for The Times and Daily News wrote. “The news plaintiffs learned only yesterday that the recovered data is unusable and that an entire week’s worth of its experts’ and lawyers’ work must be re-done, which is why this supplemental letter is being filed today.”

The plaintiffs’ counsel makes clear that they have no reason to believe the deletion was intentional. But they do say the incident underscores that OpenAI “is in the best position to search its own datasets” for potentially infringing content using its own tools."

Wednesday, November 20, 2024

Indian news agency sues OpenAI alleging copyright infringement; TechCrunch, November 18, 2024

 Manish Singh, TechCrunch; Indian news agency sues OpenAI alleging copyright infringement

"One of India’s largest news agencies, Asian News International (ANI), has sued OpenAI in a case that could set a precedent for how AI companies use copyrighted news content in the world’s most populous nation.

Asian News International filed a 287-page lawsuit in the Delhi High Court on Monday, alleging the AI company illegally used its content to train its AI models and generated false information attributed to the news agency. The case marks the first time an Indian media organization has taken legal action against OpenAI over copyright claims.

During Tuesday’s hearing, Justice Amit Bansal issued a summons to OpenAI after the company confirmed it had already ensured that ChatGPT wasn’t accessing ANI’s website. The bench said that it was not inclined to grant an injunction order on Tuesday, as the case, being a “complex issue,” required a detailed hearing.

The next hearing is scheduled to be held in January."

Saturday, November 9, 2024

OpenAI Gets a Win as Court Says No Harm Was Demonstrated in Copyright Case; Gizmodo, November 8, 2024

Gizmodo; OpenAI Gets a Win as Court Says No Harm Was Demonstrated in Copyright Case

"OpenAI won an initial victory on Thursday in one of the many lawsuits the company is facing for its unlicensed use of copyrighted material to train generative AI products like ChatGPT.

A federal judge in the Southern District of New York dismissed a complaint brought by the media outlets Raw Story and AlterNet, which claimed that OpenAI violated copyright law by purposefully removing what is known as copyright management information, such as article titles and author names, from material that it incorporated into its training datasets.

OpenAI had filed a motion to dismiss the case, arguing that the plaintiffs did not have standing to sue because they had not demonstrated a concrete harm to their businesses caused by the removal of the copyright management information. Judge Colleen McMahon agreed, dismissing the lawsuit but leaving the door open for the plaintiffs to file an amended complaint."