Showing posts with label large datasets of public domain works.

Thursday, December 26, 2024

Harvard’s Library Innovation Lab launches Institutional Data Initiative; Harvard Law Today, December 12, 2024

Scott Young, Harvard Law Today; Harvard’s Library Innovation Lab launches Institutional Data Initiative

"At the Institutional Data Initiative (IDI), a new program hosted within the Harvard Law School Library, efforts are already underway to expand and enhance the data resources available for AI training. At the initiative’s public launch on Dec. 12, Library Innovation Lab faculty director, Jonathan Zittrain ’95, and IDI executive director, Greg Leppert, announced plans to expand the availability of public domain data from knowledge institutions — including the text of nearly one million books scanned at Harvard Library — to train AI models...

Harvard Law Today: What is the Institutional Data Initiative?

Greg Leppert: Our work at the Institutional Data Initiative is focused on finding ways to improve the accessibility of institutional data for all uses, artificial intelligence among them. Harvard Law School Library is a tremendous repository of public domain books, briefs, research papers, and so on. Regardless of how this information was initially memorialized — hardcover, softcover, parchment, etc. — a considerable amount has been converted into digital form. At the IDI, we are working to ensure these large data sets of public domain works, like the ones from the Law School library that comprise the Caselaw Access Project, are made open and accessible, especially for AI training. Harvard is not alone in terms of the scale and quality of its data; similar sets exist throughout our academic institutions and public libraries. AI systems are only as diverse as the data on which they’re trained, and these public domain data sets ought to be part of a healthy diet for future AI training.

HLT: What problem is the Institutional Data Initiative working to solve?

Leppert: As it stands, the data being used to train AI is often limited in terms of scale, scope, quality, and integrity. Various groups and perspectives are massively underrepresented in the data currently being used to train AI. As things stand, outliers will not be served by AI as well as they should be, and otherwise could be, by the inclusion of that underrepresented data. The country of Iceland, for example, undertook a national, government-led effort to make materials from their national libraries available for AI applications. That is because they were seriously concerned the Icelandic language and culture would not be represented in AI models. We are also working towards reaffirming Harvard, and other institutions, as the stewards of their collections. The proliferation of training sets based on public domain materials has been encouraging to see, but it’s important that this doesn’t leave the material vulnerable to critical omissions or alterations. For centuries, knowledge institutions have served as stewards of information for the purpose of promoting the public good and furthering the representation of diverse ideas, cultural groups, and ways of seeing the world. So, we believe these institutions are the exact kind of sources for AI training data if we want to optimize its ability to serve humanity. As it stands today, there is significant room for improvement."