Michelle Cheng, Quartz; "Shadow libraries" are at the heart of the mounting copyright lawsuits against OpenAI
"However, there are clues about these two data sets. “Books1” is linked to Project Gutenberg (an online e-book library with over 60,000 titles), a popular dataset for AI researchers to train their data on due to the lack of copyright, the filing states. “Books2” is estimated to contain about 294,000 titles, it notes.
Most of the “internet-based books corpora” is likely to come from shadow library websites such as Library Genesis, Z-Library, Sci-Hub, and Bibliotik. The books aggregated by these sites are available in bulk via torrent websites, which are known for hosting copyrighted materials.
What exactly are shadow libraries?
Shadow libraries are online databases that provide access to millions of books and articles that are out of print, hard to obtain, and paywalled. Many of these databases, which began appearing online around 2008, originated in Russia, which has a long tradition of sharing forbidden books, according to the magazine Reason.
Soon enough, these libraries became popular with cash-strapped academics around the world thanks to the high cost of accessing scholarly journals—with some reportedly going for as much as $500 for an entirely open-access article.
These shadow libraries are also called “pirate libraries” because they often infringe on copyrighted work and cut into the publishing industry’s profits. A 2017 Nielsen and Digimarc study (pdf) found that pirated books were “depressing legitimate book sales by as much as 14%.”"