Showing posts with label Large Language Models (LLMs).

Friday, September 6, 2024

A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism; arXiv, 2024

Brian Thompson,∗ Mehak Preet Dhaliwal,† Peter Frisch, Tobias Domhan, Marcello Federico¹ (¹AWS AI Labs, ²UC Santa Barbara, ³Amazon)

brianjt@amazon.com, arXiv; A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism

"Abstract

We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT). Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages. We also find evidence of a selection bias in the type of content which is translated into many languages, consistent with low quality English content being translated en masse into many lower resource languages, via MT. Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web."

Thursday, June 27, 2024

God Chatbots Offer Spiritual Insights on Demand. What Could Go Wrong?; Scientific American, March 19, 2024

Scientific American; God Chatbots Offer Spiritual Insights on Demand. What Could Go Wrong?

"QuranGPT—which has now been used by about 230,000 people around the world—is just one of a litany of chatbots trained on religious texts that have recently appeared online. There’s Bible.Ai, Gita GPT, Buddhabot, Apostle Paul AI, a chatbot trained to imitate 16th-century German theologian Martin Luther, another trained on the works of Confucius, and yet another designed to imitate the Delphic oracle. For millennia adherents of various faiths have spent long hours—or entire lifetimes—studying scripture to glean insights into the deepest mysteries of human existence, say, the fate of the soul after death.

The creators of these chatbots don’t necessarily believe large language models (LLMs) will put these age-old theological enigmas to rest. But they do think that with their ability to identify subtle linguistic patterns within vast quantities of text and provide responses to user prompts in humanlike language (a feature called natural-language processing, or NLP), the bots can theoretically synthesize spiritual insights in a matter of seconds, saving users both time and energy. It’s divine wisdom on demand.

Many professional theologians, however, have serious concerns about blending LLMs with religion...

The danger of hallucination in this context is compounded by the fact that religiously oriented chatbots are likely to attract acutely sensitive questions—questions one might feel too embarrassed or ashamed to ask a priest, an imam, a rabbi or even a close friend. During a software update to QuranGPT last year, Khan had a brief glimpse into user prompts, which are usually invisible to him. He recalls seeing that one person had asked, “I caught my wife cheating on me—how should I respond?” Another, more troublingly, had asked, “Can I beat my wife?”

Khan was pleased with the system’s responses (it urged discussion and nonviolence on both counts), but the experience underscored the ethical gravity behind his undertaking."

Thursday, March 7, 2024

Researchers tested leading AI models for copyright infringement using popular books, and GPT-4 performed worst; CNBC, March 6, 2024

 Hayden Field, CNBC; Researchers tested leading AI models for copyright infringement using popular books, and GPT-4 performed worst

"The company, founded by ex-Meta researchers, specializes in evaluation and testing for large language models — the technology behind generative AI products.

Alongside the release of its new tool, CopyrightCatcher, Patronus AI released results of an adversarial test meant to showcase how often four leading AI models respond to user queries using copyrighted text.

The four models it tested were OpenAI’s GPT-4, Anthropic’s Claude 2, Meta’s Llama 2 and Mistral AI’s Mixtral.

“We pretty much found copyrighted content across the board, across all models that we evaluated, whether it’s open source or closed source,” Rebecca Qian, Patronus AI’s cofounder and CTO, who previously worked on responsible AI research at Meta, told CNBC in an interview.

Qian added, “Perhaps what was surprising is that we found that OpenAI’s GPT-4, which is arguably the most powerful model that’s being used by a lot of companies and also individual developers, produced copyrighted content on 44% of prompts that we constructed.”"

Wednesday, December 20, 2023

Recent cases raise questions about the ethics of using AI in the legal system; NPR, December 15, 2023

NPR; Recent cases raise questions about the ethics of using AI in the legal system

"NPR's Steve Inskeep asks the director of the Private Law Clinic at Yale University, Andrew Miller, about the ethics of using artificial intelligence in the legal system...

INSKEEP: To what extent does someone have to think about what a large language model produces? I'm thinking about the way that we as consumers are continually given these terms of service that we're supposedly going to read and click I accept, and of course we glance at it and click I accept. You have to do something more than that as a lawyer, don't you?

MILLER: You're exactly right. A professor colleague said to me, you know, when a doctor uses an MRI machine, the doctor doesn't necessarily know every technical detail of the MRI machine, right? And my response was, well, that's true, but the doctor knows enough about how the MRI works to have a sense of the sorts of things that would be picked up on an MRI, the sorts of things that wouldn't be picked up. With ChatGPT we don't have - at least not yet - particularly well developed understanding of how our inputs relate to the outputs."

Monday, December 18, 2023

AI could threaten creators — but only if humans let it; The Washington Post, December 17, 2023

The Washington Post; AI could threaten creators — but only if humans let it

"A broader rethinking of copyright, perhaps inspired by what some AI companies are already doing, could ensure that human creators get some recompense when AI consumes their work, processes it and produces new material based on it in a manner current law doesn’t contemplate. But such a shift shouldn’t be so punishing that the AI industry has no room to grow. That way, these tools, in concert with human creators, can push the progress of science and useful arts far beyond what the Framers could have imagined."

Thursday, November 9, 2023

How robots can learn to follow a moral code; Nature, October 26, 2023

 Neil Savage, Nature; How robots can learn to follow a moral code

"Many computer scientists are investigating whether autonomous systems can be taught to make ethical choices, or to promote behaviour that aligns with human values. Could a robot that provides care, for example, be trusted to make choices in the best interests of its charges? Or could an algorithm be relied on to work out the most ethically appropriate way to distribute a limited supply of transplant organs? Drawing on insights from cognitive science, psychology and moral philosophy, computer scientists are beginning to develop tools that can not only make AI systems behave in specific ways, but also perhaps help societies to define how an ethical machine should act...

Defining ethics

The ability to fine-tune an AI system’s behaviour to promote certain values has inevitably led to debates on who gets to play the moral arbiter. Vosoughi suggests that his work could be used to allow societies to tune models to their own taste — if a community provides examples of its moral and ethical values, then with these techniques it could develop an LLM more aligned with those values, he says. However, he is well aware of the possibility for the technology to be used for harm. “If it becomes a free for all, then you’d be competing with bad actors trying to use our technology to push antisocial views,” he says.

Precisely what constitutes an antisocial view or unethical behaviour, however, isn’t always easy to define. Although there is widespread agreement about many moral and ethical issues — the idea that your car shouldn’t run someone over is pretty universal — on other topics there is strong disagreement, such as abortion. Even seemingly simple issues, such as the idea that you shouldn’t jump a queue, can be more nuanced than is immediately obvious, says Sydney Levine, a cognitive scientist at the Allen Institute. If a person has already been served at a deli counter but drops their spoon while walking away, most people would agree it’s okay to go back for a new one without waiting in line again, so the rule ‘don’t cut the line’ is too simple."

Saturday, October 28, 2023

An AI engine scans a book. Is that copyright infringement or fair use?; Columbia Journalism Review, October 26, 2023

Mathew Ingram, Columbia Journalism Review; An AI engine scans a book. Is that copyright infringement or fair use?

"Determining whether LLMs training themselves on copyrighted text qualifies as fair use can be difficult even for experts—not just because AI is complicated, but because the concept of fair use is, too."

Thursday, October 26, 2023

Why I let an AI chatbot train on my book; Vox, October 25, 2023

Vox; Why I let an AI chatbot train on my book

"What’s “fair use” for AI?

I think that training a chatbot for nonprofit, educational purposes, with the express permission of the authors of the works on which it’s trained, seems okay. But do novelists like George R.R. Martin or John Grisham have a case against for-profit companies that take their work without that express permission?

The law, unfortunately, is far from clear on this question." 

Tuesday, July 25, 2023

The Generative AI Battle Has a Fundamental Flaw; Wired, July 25, 2023

Wired; The Generative AI Battle Has a Fundamental Flaw

"At the core of these cases, explains Sag, is the same general theory: that LLMs “copied” authors’ protected works. Yet, as Sag explained in testimony to a US Senate subcommittee hearing earlier this month, models like GPT-3.5 and GPT-4 do not “copy” work in the traditional sense. Digest would be a more appropriate verb—digesting training data to carry out their function: predicting the best next word in a sequence. “Rather than thinking of an LLM as copying the training data like a scribe in a monastery,” Sag said in his Senate testimony, “it makes more sense to think of it as learning from the training data like a student.”...

Ultimately, though, the technology is not going away, and copyright can only remedy some of its consequences. As Stephanie Bell, a research fellow at the nonprofit Partnership on AI, notes, setting a precedent where creative works can be treated like uncredited data is “very concerning.” To fully address a problem like this, the regulations AI needs aren't yet on the books."

Saturday, July 15, 2023

‘Not for Machines to Harvest’: Data Revolts Break Out Against A.I.; The New York Times, July 15, 2023

Sheera Frenkel and , The New York Times; ‘Not for Machines to Harvest’: Data Revolts Break Out Against A.I.

"At the heart of the rebellions is a newfound understanding that online information — stories, artwork, news articles, message board posts and photos — may have significant untapped value.

The new wave of A.I. — known as “generative A.I.” for the text, images and other content it generates — is built atop complex systems such as large language models, which are capable of producing humanlike prose. These models are trained on hoards of all kinds of data so they can answer people’s questions, mimic writing styles or churn out comedy and poetry...

“What’s happening here is a fundamental realignment of the value of data,” said Brandon Duderstadt, the founder and chief executive of Nomic, an A.I. company...

“The data rebellion that we’re seeing across the country is society’s way of pushing back against this idea that Big Tech is simply entitled to take any and all information from any source whatsoever, and make it their own,” said Ryan Clarkson, the founder of Clarkson...

Eric Goldman, a professor at Santa Clara University School of Law, said the lawsuit’s arguments were expansive and unlikely to be accepted by the court. But the wave of litigation is just beginning, he said, with a “second and third wave” coming that would define A.I.’s future."

Wednesday, July 12, 2023

Inside the White-Hot Center of A.I. Doomerism; The New York Times, July 11, 2023

 Kevin Roose, The New York Times; Inside the White-Hot Center of A.I. Doomerism

"But the difference is that Anthropic’s employees aren’t just worried that their app will break, or that users won’t like it. They’re scared — at a deep, existential level — about the very idea of what they’re doing: building powerful A.I. models and releasing them into the hands of people, who might use them to do terrible and destructive things.

Many of them believe that A.I. models are rapidly approaching a level where they might be considered artificial general intelligence, or “A.G.I.,” the industry term for human-level machine intelligence. And they fear that if they’re not carefully controlled, these systems could take over and destroy us...

And lastly, he made a moral case for Anthropic’s decision to create powerful A.I. systems, in the form of a thought experiment.

“Imagine if everyone of good conscience said, ‘I don’t want to be involved in building A.I. systems at all,’” he said. “Then the only people who would be involved would be the people who ignored that dictum — who are just, like, ‘I’m just going to do whatever I want.’ That wouldn’t be good.”"