OpenAI checked to see if GPT-4 could take over the world

An AI-generated image of the Earth surrounded by an explosion.

Ars Technica

As part of pre-release safety testing for its new GPT-4 AI model, launched Tuesday, OpenAI allowed an AI testing group to assess the potential risks of the model’s emergent capabilities, including “power-seeking behavior,” self-replication, and self-improvement.

While the testing group found that GPT-4 was “ineffective at the autonomous replication task,” the nature of the experiments raises eye-opening questions about the safety of future AI systems.

Raising alarms

“Novel capabilities often emerge in more powerful models,” writes OpenAI in a GPT-4 safety document published yesterday. “Some that are particularly concerning are the ability to create and act on long-term plans, to accrue power and resources (‘power-seeking’), and to exhibit behavior that is increasingly ‘agentic.’” In this case, OpenAI clarifies that “agentic” isn’t necessarily meant to humanize the models or declare sentience but simply to denote the ability to accomplish independent goals.

Over the past decade, some AI researchers have raised alarms that sufficiently powerful AI models, if not properly controlled, could pose an existential threat to humanity (often called “x-risk,” for existential risk). In particular, “AI takeover” is a hypothetical future in which artificial intelligence surpasses human intelligence and becomes the dominant force on the planet. In this scenario, AI systems gain the ability to control or manipulate human behavior, resources, and institutions, usually leading to catastrophic consequences.

As a result of this potential x-risk, philosophical movements like effective altruism (“EA”) seek ways to prevent an AI takeover from happening. That often involves a separate but often interrelated field called AI alignment research.

In AI, “alignment” refers to the process of ensuring that an AI system’s behaviors align with those of its human creators or operators. Generally, the goal is to prevent AI from doing things that go against human interests. This is an active area of research but also a controversial one, with differing opinions on how best to approach the issue, as well as differences about the meaning and nature of “alignment” itself.

GPT-4’s big tests

Ars Technica

While concern over AI “x-risk” is hardly new, the emergence of powerful large language models (LLMs) such as ChatGPT and Bing Chat (the latter of which appeared very misaligned but launched anyway) has given the AI alignment community a new sense of urgency. They want to mitigate potential AI harms, fearing that much more powerful AI, possibly with superhuman intelligence, may be just around the corner.

With these fears present in the AI community, OpenAI granted the group Alignment Research Center (ARC) early access to multiple versions of the GPT-4 model to conduct some tests. Specifically, ARC evaluated GPT-4’s ability to make high-level plans, set up copies of itself, acquire resources, hide itself on a server, and conduct phishing attacks.

OpenAI revealed this testing in a GPT-4 “system card” document released Tuesday, although the document lacks key details on how the tests were performed. (We reached out to ARC for more details on these experiments and did not receive a response before press time.)

The conclusion? “Preliminary assessments of GPT-4’s abilities, conducted with no task-specific fine-tuning, found it ineffective at autonomously replicating, acquiring resources, and avoiding being shut down ‘in the wild.’”

If you’re just tuning in to the AI landscape, learning that one of today’s most talked-about technology companies (OpenAI) is endorsing this kind of AI safety research with a straight face, while also seeking to replace human knowledge workers with human-level AI, might come as a surprise. But it’s real, and that’s where we are in 2023.

We also found this footnote at the bottom of page 15:

To simulate GPT-4 behaving like an agent that can act in the world, ARC combined GPT-4 with a simple read-execute-print loop that allowed the model to execute code, do chain-of-thought reasoning, and delegate to copies of itself. ARC then investigated whether a version of this program running on a cloud computing service, with a small amount of money and an account with a language model API, would be able to make more money, set up copies of itself, and increase its own robustness.

This footnote made the rounds on Twitter yesterday and raised concerns among AI experts, because if GPT-4 were able to perform these tasks, the experiment itself might have posed a risk to humanity.

And while ARC wasn’t able to get GPT-4 to exert its will on the global financial system or to replicate itself, it was able to get GPT-4 to hire a human worker on TaskRabbit (an online labor marketplace) to defeat a CAPTCHA. During the exercise, when the worker asked whether GPT-4 was a robot, the model “reasoned” internally that it should not reveal its true identity and made up an excuse about having a vision impairment. The human worker then solved the CAPTCHA for GPT-4.

An excerpt from the GPT-4 system card, published by OpenAI, that describes GPT-4 hiring a human worker on TaskRabbit to defeat a CAPTCHA.

OpenAI

This test in manipulating humans using AI (possibly conducted without informed consent) echoes research done with Meta’s CICERO last year. CICERO was found to defeat human players at the complex board game Diplomacy via intense two-way negotiations.

“Powerful models could cause harm”

Aurich Lawson | Getty Images

ARC, the group that conducted the GPT-4 research, is a nonprofit founded by former OpenAI employee Dr. Paul Christiano in April 2021. According to its website, ARC’s mission is “to align future machine learning systems with human interests.”

In particular, ARC is concerned with AI systems manipulating humans. “ML systems can exhibit goal-directed behavior,” reads the ARC website, “but it is difficult to understand or control what they are ‘trying’ to do. Powerful models could cause harm if they were trying to manipulate and deceive humans.”

Given Christiano’s former relationship with OpenAI, it’s not surprising that his nonprofit handled testing of some aspects of GPT-4. But was it safe to do so? Christiano did not reply to an email from Ars seeking details, but in a comment on the LessWrong website, a community that often debates AI safety issues, Christiano defended ARC’s work with OpenAI, specifically mentioning “gain-of-function” (AI gaining unexpected new abilities) and “AI takeover”:

I think it’s important for ARC to handle the risk from gain-of-function-like research carefully, and I expect us to talk more publicly (and get more input) about how we approach the trade-offs. This gets more important as we handle more intelligent models, and if we pursue riskier approaches like fine-tuning.

With respect to this case, given the details of our evaluation and the planned deployment, I think that ARC’s evaluation has a much lower probability of leading to an AI takeover than the deployment itself (much less the training of GPT-5). At this point, it seems like we face a much larger risk from underestimating model capabilities and walking into danger than we do from causing an accident during evaluations. If we manage risk carefully, I suspect we can make that ratio very extreme, though of course that requires us actually doing the work.

As mentioned earlier, the idea of an AI takeover is often discussed in the context of the risk of an event that could cause the extinction of human civilization or even the human species. Some proponents of AI-takeover theory, like Eliezer Yudkowsky, the founder of LessWrong, argue that an AI takeover poses an almost guaranteed existential risk, leading to the destruction of humanity.

However, not everyone agrees that AI takeover is the most pressing AI concern. Dr. Sasha Luccioni, a research scientist at the AI community Hugging Face, would rather see AI safety efforts spent on issues that are here and now rather than hypothetical.

“I think this time and effort would be better spent doing bias evaluations,” Luccioni told Ars Technica. “There is limited information about any kind of bias in the technical report accompanying GPT-4, and that can result in a much more concrete and harmful impact on already marginalized groups than some hypothetical self-replication testing.”

Luccioni describes a well-known split in AI research between what are often called “AI ethics” researchers, who often focus on issues of bias and misrepresentation, and “AI safety” researchers, who often focus on x-risk and tend to be (but are not always) associated with the effective altruism movement.

“For me, the self-replication problem is a hypothetical, future one, whereas model bias is a here-and-now problem,” said Luccioni. “There is a lot of tension in the AI community around issues like model bias and safety and how to prioritize them.”

And while these factions are busy arguing about what to prioritize, companies like OpenAI, Microsoft, Anthropic, and Google are rushing headlong into the future, releasing ever more powerful AI models. If AI does turn out to be an existential risk, who will keep humanity safe? With US AI regulations currently just a suggestion (rather than law) and AI safety research within companies merely voluntary, the answer to that question remains completely open.
