Analysis by Nitasha Tiku, The Washington Post

AI firms say they can’t respect copyright. These researchers tried.

A group of more than two dozen AI researchers has found that it is possible to build a massive eight-terabyte dataset using only text that was openly licensed or in the public domain. They tested the dataset’s quality by using it to train a 7 billion-parameter language model, which performed about as well as comparable industry efforts, such as Llama 2-7B, which Meta released in 2023.
A paper published Thursday detailing their effort also reveals that the process was painstaking, arduous and impossible to fully automate.
The group built an AI model that is significantly smaller than the latest models behind OpenAI’s ChatGPT or Google’s Gemini, but their findings appear to represent the biggest, most transparent and rigorous effort yet to demonstrate a different way of building popular AI tools.
That could have implications for the policy debate swirling around AI and copyright.
The paper itself does not take a position on whether scraping text to train AI is fair use.
That debate has reignited in recent weeks with a high-profile lawsuit and dramatic turns around copyright law and enforcement in both the U.S. and U.K.